Can Schematron use grammars as assertion tests?

This article originally appeared on an O'Reilly blog on April 19, 2010.

A schema language does not need to handle every kind of constraint equally well. But in the case of Schematron, I sometimes read the comment that grammars are much easier for some kinds of constraints, and that grammars are more declarative and so help in building forms and syntax-directed editors.

However, it is possible to do grammars in Schematron. I have circled around this several times before (for example when discussing converting XML Schemas to Schematron) because the trouble with grammars is that they often don't give very satisfactory user messages. But after seeing a spate of papers claiming that Schematron was not as good as other schema languages because it could not represent grammars, I think it might be a good idea to put the code up, in case it is not so clear.

So here is a little Schematron schema that implements the regular grammar expressed by the DTD declaration

<!ELEMENT x  ( a, b, c?) >

<pattern>
  <rule context="x">
    <!-- c is optional, as in the DTD; ^ and $ make the whole string match -->
    <let name="grammar" value=" '^a b( c)?$' " />
    <let name="contents"
         value="string-join(for $e in * return local-name($e), ' ')" />
    <assert test="matches( $contents, $grammar )"
    >The contents [<value-of select="$contents"/>]
    should match grammar [<value-of select="$grammar"/>]</assert>
  </rule>
</pattern>

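The variable grammar holds a string that is a regular expression. The variable contents is a string made from the names of all the child elements of the context, separated by spaces. Validation is by simple string matching. Because matches() succeeds when the regular expression matches any substring, the expression has to be anchored with ^ and $ to constrain the whole content.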
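To make the string matching concrete, here is a sketch of the same rule wrapped in a complete schema, with the strings it would compute for a few sample instances noted in a trailing comment. The schema wrapper and queryBinding="xslt2" are assumptions on my part (matches() and the for expression need an XPath 2.0 binding), and the grammar is written with ^ and $ so that the whole content string has to match.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern>
    <rule context="x">
      <!-- anchored regular expression for the content model ( a, b, c? ) -->
      <let name="grammar" value="'^a b( c)?$'"/>
      <!-- names of the child elements, separated by single spaces -->
      <let name="contents"
           value="string-join(for $e in * return local-name($e), ' ')"/>
      <assert test="matches($contents, $grammar)"
      >The contents [<value-of select="$contents"/>]
      should match grammar [<value-of select="$grammar"/>]</assert>
    </rule>
  </pattern>
</schema>

<!-- Sample instances and the strings they produce:
     <x><a/><b/></x>           contents = 'a b'      assertion holds
     <x><a/><b/><c/></x>       contents = 'a b c'    assertion holds
     <x><a/><c/></x>           contents = 'a c'      assertion fails
     <x><a/><b/><c/><c/></x>   contents = 'a b c c'  assertion fails
-->
```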

The regular expression can be as complicated as desired: indeed, the regular expression library in XSLT provides quite a lot more sophistication for various kinds of matches (especially with wildcards) than other grammar-based schema languages do. Of course, these regular expressions are not tokenized like RELAX NG Compact or DTDs: the space is significant, so a b( c)? is different from a b c?. (If it were desired to have a syntax more like RELAX NG Compact, you could define an XSLT 2.0 function to rewrite that into this regex form. The assertion would merely become something like matches( $contents, my:rewrite( $grammar )).)
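As a rough sketch of what such a rewriting function might look like, the schema below defines a hypothetical my:rewrite() that only understands a flat, comma-separated list of names with optional ?, * or + indicators (no groups, no choices, and no escaping of dots in names). The my namespace URI is made up, and I am assuming a Schematron implementation that passes foreign xsl:function elements through to the generated stylesheet. To keep optional first items simple, this version puts a space before every name in both the contents string and the generated regex, which differs slightly from the schema above.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:my="http://example.com/my"
        queryBinding="xslt2">
  <ns prefix="my" uri="http://example.com/my"/>

  <!-- Hypothetical helper: 'a, b, c?' becomes '^( a)( b)( c)?$' -->
  <xsl:function name="my:rewrite" as="xs:string">
    <xsl:param name="grammar" as="xs:string"/>
    <xsl:variable name="items" as="xs:string*">
      <xsl:for-each select="tokenize(normalize-space($grammar), '\s*,\s*')">
        <!-- split an item such as 'c?' into the name 'c' and the indicator '?' -->
        <xsl:variable name="name" select="replace(., '[?*+]$', '')"/>
        <xsl:variable name="occurs" select="substring-after(., $name)"/>
        <xsl:sequence select="concat('( ', $name, ')', $occurs)"/>
      </xsl:for-each>
    </xsl:variable>
    <xsl:sequence select="concat('^', string-join($items, ''), '$')"/>
  </xsl:function>

  <pattern>
    <rule context="x">
      <let name="grammar" value="'a, b, c?'"/>
      <!-- every name gets a leading space, so the first item needs no special case -->
      <let name="contents"
           value="string-join(for $e in * return concat(' ', local-name($e)), '')"/>
      <assert test="matches($contents, my:rewrite($grammar))"
      >The contents [<value-of select="$contents"/>]
      should match grammar [<value-of select="$grammar"/>]</assert>
    </rule>
  </pattern>
</schema>
```

The user message can then quote the compact form of the grammar, which is likely to be more readable than the generated regular expression.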

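The other wrinkle to handle would be namespaces. The example uses local-name(), which ignores namespaces entirely. If qualified names are matched instead, not knowing which prefix has been used for an element name makes this a little more complex. A few solutions suggest themselves: wildcarding the names like (\S+:)?a (\S+:)?b( (\S+:)?c)?, or rewriting the $contents and the grammar to use James Clark's notation, where each name is written as {namespace-uri}local-name.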
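A sketch of the second option might look like the following. The namespace URI http://example.com/ns and the ex prefix are made up for illustration. Each child name is written in Clark's notation, and the grammar is assembled from an escaped copy of the namespace, since braces and dots are special characters in the regular expression.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <ns prefix="ex" uri="http://example.com/ns"/>
  <pattern>
    <rule context="ex:x">
      <!-- the namespace in Clark's notation, escaped for use in a regex -->
      <let name="n" value="'\{http://example\.com/ns\}'"/>
      <!-- same content model as before: a, b, c? -->
      <let name="grammar"
           value="concat('^', $n, 'a ', $n, 'b( ', $n, 'c)?$')"/>
      <!-- each child becomes {namespace-uri}local-name, whatever prefix it uses -->
      <let name="contents"
           value="string-join(for $e in * return
                    concat('{', namespace-uri($e), '}', local-name($e)), ' ')"/>
      <assert test="matches($contents, $grammar)"
      >The contents [<value-of select="$contents"/>]
      should match grammar [<value-of select="$grammar"/>]</assert>
    </rule>
  </pattern>
</schema>
```

The price is a longer and uglier message, since the namespace is repeated for every name in both strings.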

This kind of grammar is just a regular expression. Unlike in most XML grammars (tree-regular grammars), a particular name in the grammar is not a reference to, or a declaration of, any subgrammar for that element's contents. An assertion is only tested when its rule fires, which happens when the rule's context matches something in the document under consideration.
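In practice this means that each element whose contents need checking gets a rule of its own. Here is a minimal sketch, in which the content models for a, b, c and d are invented purely for illustration. Note that these rules fire for any a, b, c or d in the document, not only for those that appear inside x.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern>
    <!-- x contains a, b, c? -->
    <rule context="x">
      <let name="contents"
           value="string-join(for $e in * return local-name($e), ' ')"/>
      <assert test="matches($contents, '^a b( c)?$')"
      >The contents of x [<value-of select="$contents"/>] should be a, b, c?</assert>
    </rule>
    <!-- a contains one or more d (hypothetical content model) -->
    <rule context="a">
      <let name="contents"
           value="string-join(for $e in * return local-name($e), ' ')"/>
      <assert test="matches($contents, '^d( d)*$')"
      >The contents of a [<value-of select="$contents"/>] should be d+</assert>
    </rule>
    <!-- b, c and d have no child elements, so the contents string is empty -->
    <rule context="b | c | d">
      <let name="contents"
           value="string-join(for $e in * return local-name($e), ' ')"/>
      <assert test="matches($contents, '^$')"
      >The element <name/> should have no child elements
      but has [<value-of select="$contents"/>]</assert>
    </rule>
  </pattern>
</schema>
```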

So validating element contents with regular expressions in Schematron is almost trivially easy. The penalty is ungainly syntax for complex regular expressions rather than any lack of power. The contents of the $grammar variable could be used to drive forms or GUI construction systems, or for data-binding if required. And a complex regular expression may be an indication that a grammar is the wrong tool for the job, in which case the other capabilities of XPath are available.

[Update: Note that because W3C XML Schema has ancestor-based typing (see page 66) it is possible to use XPaths for both the type assignment (saying which regular expression an element's contents should conform to) and the validation (using the technique above) of an element.]
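A sketch of how that might look: the rule context does the type assignment, and the regular expression does the validation. The element names and content models here are invented for illustration. The more specific context comes first because, within a pattern, a node only fires the first rule whose context matches it.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern>
    <!-- a title directly inside a chapter has one content model... -->
    <rule context="chapter/title">
      <let name="grammar" value="'^number heading$'"/>
      <let name="contents"
           value="string-join(for $e in * return local-name($e), ' ')"/>
      <assert test="matches($contents, $grammar)"
      >The contents [<value-of select="$contents"/>]
      should match grammar [<value-of select="$grammar"/>]</assert>
    </rule>
    <!-- ...and a title anywhere else has another -->
    <rule context="title">
      <let name="grammar" value="'^heading( subheading)?$'"/>
      <let name="contents"
           value="string-join(for $e in * return local-name($e), ' ')"/>
      <assert test="matches($contents, $grammar)"
      >The contents [<value-of select="$contents"/>]
      should match grammar [<value-of select="$grammar"/>]</assert>
    </rule>
  </pattern>
</schema>
```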