What kinds of content models (as regex) do users of conventional schema languages use?

Posted on July 25, 2018 by Rick Jelliffe

I really enjoyed the IDEAS 2018 paper "Practical Study of Deterministic Regular Expressions from Large-scale XML and Schema Data" by Li, Chu, Mou, Dong, and Chen.

My interest comes from my attempts years ago to make an XSD-to-Schematron translator, sponsored by JSTOR.  It worked as far as it went, and fortunately there were numerous edge cases of XSD that we did not have to support. The code is still up on the Schematron GitHub site.  In particular, the thing I was concerned about was that, while we can validate the children of an element by making a list of tokens of the element names and validating it against a regex generated from the content model, that method doesn't allow good individual diagnostics: you just get yes/no for validity, which kinda defeats the purpose of validation.  (I am looking forward to reading Michael Kay's paper this year on how he has recently implemented XSD in XSLT.)
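(To make that concrete, here is a sketch of my own, not from the paper: in Schematron with XPath 2.0, a whole content model, say the hypothetical a ::= (b, (c|d)) used below, can be checked with one regex over the joined child names, and it gives exactly that one-bit answer.)

<rule context="a">
     <!-- Join the child element names into a token string and match the
          whole content model at once: you learn valid/invalid, nothing more -->
     <assert test="matches(string-join(*/name(), ' '), '^b (c|d)$')">
The children of "a" do not match the content model (b, (c|d)).</assert>
</rule>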

So what I wanted to do was to break up the content model into pairs, so that a ::= (b, (c|d)) would become
<rule context="a/b">
     <assert test="following-sibling::*[1][self::c or self::d]">
In an "a", a "b" should be followed by a "c" or "d"</assert>
</rule>

Now that would be fine for content models where an identifier is used only once (i.e. not a ::= (b, b|c|d)), which is called a Single Occurrence Regular Expression (SORE).  (Actually, it would also work if each separate occurrence of a name was followed by the same things, e.g. a ::= b+ or a ::= (b, c, b, c), as in the sketch below.)  And my problem was this: how many content models were SOREs, or some sub- or superclass of SORE?  If almost all were, then converting the content model to pairwise tests would work well, and a non-SORE content model would have the unattractive regex validation to fall back on; but if few were, you would just get a horrid binary validator.
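A minimal sketch (mine, not from the paper) of the pairwise rules for a ::= (b, c, b, c): note that pairwise tests cannot count occurrences, so these rules in fact accept any (b, c)+ sequence, but the diagnostics stay sensible because every "b" and every "c" is constrained the same way.

<rule context="a/b">
     <assert test="following-sibling::*[1][self::c]">
In an "a", each "b" should be followed by a "c".</assert>
</rule>
<rule context="a/c">
     <assert test="not(following-sibling::*) or following-sibling::*[1][self::b]">
In an "a", each "c" should be followed by a "b" or end the content.</assert>
</rule>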

So the Li et al. paper is interesting because it sucks up a large number of grammar-based schemas from the internet and checks their content models for their regex class.  It shows what you might expect: XSD schemas rarely use anything more than SOREs (99.6% are SOREs), RELAX NG schemas use non-SOREs a bit more (96.2%), while (surprisingly) non-SOREs are most common in DTDs (93.4%).  This suggests that in fact the pairwise implementation of XSD-to-Schematron (or XSD-to-XSLT) would be entirely satisfactory, as a heuristic.

In fact, most of these non-SOREs are just 2-OREs (each name occurring at most twice), which suggests that a small amount of context might help.  For example, a ::= (b, (c|d), e, (b|c)) would become two patterns, one with @context a/b[not(preceding-sibling::e)] and the other with @context a/b[preceding-sibling::e]
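Spelled out as Schematron (again a sketch, assuming the preceding-sibling::e discriminator above):

<rule context="a/b[not(preceding-sibling::e)]">
     <assert test="following-sibling::*[1][self::c or self::d]">
In an "a", the first "b" should be followed by a "c" or a "d".</assert>
</rule>
<rule context="a/b[preceding-sibling::e]">
     <assert test="not(following-sibling::*)">
In an "a", a "b" after the "e" should end the content.</assert>
</rule>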

The paper counts lots of other subclasses of SORE, which are interesting too. The open question is to what extent the differences are attributable to schema language capability, developer approach or insight, ease of tools, or the data domain itself (if you are serializing out a database row, of course your content model will be a SORE!)

The other point in the paper is that they make a SchemaRank index, where they work out the schemas with the largest number of instances on the WWW.  (Of course, the largest corpora of XML documents are not on the internet, such as those of LexisNexis or the other big legal companies, so it is skewed to public data.)  What are the most referenced schemas?

  • HL7 health
  • Netex public transport information
  • OMG Clinical decision support vocabulary
  • Amazon base
  • GML
  • Cafe con Leche's Bible schema got into the top 10, which suggests that SchemaRank is not particularly useful (not because the Bible or its schema are not useful, of course, but because it shows that documents that use schemas usually sit in small, specialized domains, as you would expect).