Validating complex text, embedded in XML without markup, in Schematron

Posted on November 7, 2017 by Rick Jelliffe

Your XML documents have large chunks of text in some fixed notation:  XPaths or JavaScript or dates that are not limited to ISO 8601 or even, because you are so perverse, JSON fragments. How are you going to validate that XML?

You are not going to mark these up with element tags, unless you want to annotate the subsections: for processing, there will be libraries to parse and process. But that leaves us in the dark for validation. You are not going to mark it up just to get validation.

There have been many ways to cope with this over the years.  SGML provided a way to declare that some text belonged to a NOTATION, to identify it, but left it at that.  XML Schemas allows you to constrain simple data types with its lovely datatypes system: for more complex text it gives you regular expressions.  RELAX NG is the same.

However, real text notations quickly become too complex for simple regular expressions, both technically (the kinds of languages you can represent) and mechanically (how spaghetti-like the expression gets). Plus you are prevented from validating data inside mixed content.  Plus you cannot validate co-constraints where some parts of the constraint are in the outside XML and some parts are in the text inside.  Plus you get the lofty technobabble of the errors these libraries generate: not suitable for human consumption.
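
To make the co-constraint problem concrete, here is a rough Python sketch (not Schematron, and not anything from a real schema). It assumes a made-up element called items whose text is a JSON array and whose count attribute, also made up, must agree with the length of that array. The nesting inside the array is exactly what a plain regular expression cannot follow, and the messages are written for people.

```python
import json
import xml.etree.ElementTree as ET


def check_items(xml_text):
    """Return human-readable complaints about the hypothetical <items> elements."""
    problems = []
    root = ET.fromstring(xml_text)
    for el in root.iter("items"):
        raw = el.text or ""
        # Inside part: the text has to parse as the foreign notation (JSON here).
        try:
            values = json.loads(raw)
        except json.JSONDecodeError as err:
            problems.append(f"The items text is not a JSON value: {err.msg}.")
            continue
        if not isinstance(values, list):
            problems.append("The items text should be a JSON array.")
            continue
        # Co-constraint: the outside XML attribute must agree with the inside text.
        declared = int(el.get("count", "0"))
        if declared != len(values):
            problems.append(
                f"The count attribute says {declared} "
                f"but the array holds {len(values)} values."
            )
    return problems


good = '<doc><items count="3">[1, [2, 3], {"a": 4}]</items></doc>'
bad = '<doc><items count="2">[1, [2, 3], {"a": 4}]</items></doc>'
```

Calling check_items(good) gives an empty list, while check_items(bad) gives one complaint about the count not matching.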

One technique for coping with this with standard XSLT tools is to preparse the data in the foreign notation (stripping out any annotating elements) to split out the parts and somehow put them inline: then you have XML.  But this is clunky and difficult to describe, and to do even in XSLT 2.0 with its better text-parsing capabilities. So a more palatable approach is needed.
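
For the flavour of that preparsing step, here is a small sketch in Python rather than XSLT, only to keep it short and runnable. It assumes a date written as day, month name, year inside a made-up published element, with a stray b element as the annotation. The element names and the date layout are illustrative assumptions, not part of any real vocabulary.

```python
import re
import xml.etree.ElementTree as ET

# Assumed notation for the sketch: "7 November 2017" (day, month name, year).
DATE = re.compile(r"\s*(?P<day>\d{1,2})\s+(?P<month>[A-Za-z]+)\s+(?P<year>\d{4})\s*")


def preparse_date(el):
    """Replace the content of el with <day>, <month> and <year> children.

    Returns True on success, False if the text does not fit the notation
    (in which case el is left untouched).
    """
    # itertext() drops the annotating elements but keeps their text.
    text = "".join(el.itertext())
    m = DATE.fullmatch(text)
    if m is None:
        return False
    for child in list(el):
        el.remove(child)
    el.text = None
    # Put the parts back inline as markup: now it is ordinary XML.
    for name in ("day", "month", "year"):
        ET.SubElement(el, name).text = m.group(name)
    return True


doc = ET.fromstring("<post><published>7 <b>November</b> 2017</published></post>")
ok = preparse_date(doc.find("published"))
result = ET.tostring(doc, encoding="unicode")
```

After this runs, the published element holds day, month and year children that an ordinary schema can validate. Even for a notation this small, the stripping, splitting and rebuilding shows why the approach feels clunky.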

Jirka Kosek has a great talk up from XML London 2017 on integrating REx parsing into Schematron, using the <sch:extends href="... element of Schematron 2nd Edition 2017 to include pre-compiled XSLT parsers to give an all-XSLT approach, without resorting to foreign functions. I recommend the video and the technique. The same technique can be applied to parsers other than REx of course.  (Don’t look at how Jirka phrases his assertion text however: not the point! :) )

Schematron provides several features that are easily implemented as macro-inclusion facilities: phases, abstract rules, abstract patterns, includes and this new sch:extends[@href]. The difference between include and extends for a foreign file is that sch:include[@href] inserts an external XML file whole (no good if you need a sequence of elements), while sch:extends[@href] inserts the children of the root element of the external XML file.
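
The difference is easy to see if you treat both as macro expansion. The following Python sketch is a toy expander, not a real Schematron implementation: it works one level deep, and the expand function, the asserts.xml file name and the choice of sch:rule as the root of the external file are all assumptions made for illustration.

```python
import copy
import xml.etree.ElementTree as ET

SCH = "http://purl.oclc.org/dsdl/schematron"


def expand(schema, load):
    """Expand sch:include and sch:extends[@href] in place, one level deep.

    load is a callable mapping an href string to a parsed root element.
    """
    for parent in list(schema.iter()):
        for child in list(parent):
            href = child.get("href")
            if href is None:
                continue
            if child.tag == f"{{{SCH}}}include":
                # include: the external file goes in whole, root element and all.
                replacement = [copy.deepcopy(load(href))]
            elif child.tag == f"{{{SCH}}}extends":
                # extends: only the children of the external root go in.
                replacement = [copy.deepcopy(c) for c in load(href)]
            else:
                continue
            index = list(parent).index(child)
            parent.remove(child)
            for offset, new in enumerate(replacement):
                parent.insert(index + offset, new)
    return schema


# Hypothetical external file, held in memory for the sketch.
external = {
    "asserts.xml": ET.fromstring(
        f'<sch:rule xmlns:sch="{SCH}" abstract="true" id="common">'
        '<sch:assert test="@id">An id is needed.</sch:assert>'
        '<sch:assert test="title">A title is needed.</sch:assert>'
        '</sch:rule>'
    )
}


def child_names(kind):
    """Expand a one-rule schema that uses sch:include or sch:extends."""
    main = ET.fromstring(
        f'<sch:schema xmlns:sch="{SCH}"><sch:pattern>'
        f'<sch:rule context="chapter"><sch:{kind} href="asserts.xml"/></sch:rule>'
        '</sch:pattern></sch:schema>'
    )
    expand(main, external.__getitem__)
    rule = main.find(f"{{{SCH}}}pattern/{{{SCH}}}rule")
    return [c.tag.split("}")[1] for c in rule]
```

With this setup, child_names("extends") gives the two assert elements as a sequence inside the rule, while child_names("include") gives a single nested rule element, which is the "no good if you need a sequence of elements" case.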