Converting XML Schemas to Schematron: (#8) Progressive validation for complex content models

This article appeared in a blog on O'Reilly on November 2, 2007.

Now we come to the most interesting part: how do we generate Schematron schemas that implement the constraints from an XML Schema? A question that often comes up is whether Schematron is strictly more powerful than XML Schemas or just often so. Some academics have offered tentative opinions, and the conclusion I had reached was that it probably was not: for any implementation in Schematron you could probably make a content model that was so baroque and monstrous that Schematron would not capture some aspect of it. But you would have to try hard.

However, with Schematron using XPath2 or XSLT2 as its query language (rather than the default XSLT1) things are much clearer: I think there is a really simple technique available that captures all the cardinality, optionality, and sequence constraints.

The Regular Expression technique

The technique? Convert a content model into a regular expression; build a string that contains, as space-separated tokens, the name of each element found in the instance; then validate that string against the regular expression! The regular expression language used in XSLT2 allows sequence, choice, cardinality and repetition, and these are the same as in XSD. You don’t need a special finite-automaton library if you have the regular expression library.
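As a minimal sketch of the idea, assume a hypothetical book element whose content model is (title, author+, chapter*, index?). The element names, the pattern id and the variable name are all invented for illustration, and the ISO Schematron namespace with the xslt2 query binding is an assumption about the setup. Each child name is prefixed with a space so that every token has the same shape and one name cannot be mistaken for the start of another.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern id="Book_Content_Model">
    <sch:rule context="book">
      <!-- Flatten the child element names into one string, e.g. " title author chapter" -->
      <sch:let name="children"
               value="string-join(for $c in * return concat(' ', local-name($c)), '')"/>
      <!-- Hypothetical content model (title, author+, chapter*, index?) as a regular expression -->
      <sch:assert test="matches($children, '^ title( author)+( chapter)*( index)?$')">
        The children of <sch:name/> should match (title, author+, chapter*, index?)
        but were:<sch:value-of select="$children"/>.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Note that the message can do little more than echo the whole string back: the test says that something is wrong, not where.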

For an implementation of this approach, see "Can Schematron use grammars as assertion tests?"

So, if we wanted to, we could implement this in our XSD to Schematron converter and say hooray.

(The special cases of xsi:type and substitution groups are no problem: xsi:type could be handled by another pattern because the element must still be valid against the declared type, while substitution groups can be handled during the prior pre-processing and be long gone by this stage. Nillability is not something I have thought about much: it certainly can be done but I don’t know the impact on the XPaths.)

But the problem is that even though we could validate all the constraints, we would get lousy diagnostics. What would be the point? I guess it would still be useful as a fallback, as another phase for confidence building and to check if anything had fallen between the cracks of the method we will be using, but it is not so interesting to me that we have it in our plans to implement. If someone else wants to implement it in XSLT and contribute it, it might be a fun and small project!

We could break the regular expression in various interesting ways, however: we could make one version of it that made everything optional, and so implement feasible validation, as found for example in Jing for RELAX NG. (However, there might be ambiguity issues here, so it might not work each time.) Feasible validation is an approach I came up with a couple of years ago, based on the idea that it can be useful to validate only certain constraints: very often you might mark up a document and fill in the metadata at the last stage, so you don’t want validation to fail because of some problem at the start of an element’s children when you are working on subsequent elements. A validator should not dictate a workflow!
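A rough sketch of the "everything optional" variant, assuming a hypothetical book element with the content model (title, author+, chapter*, index?): the strict expression would be `^ title( author)+( chapter)*( index)?$`, and the relaxed one below turns every required particle into an optional one. All names here are invented for illustration, and the ISO Schematron namespace with the xslt2 query binding is an assumption.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern id="Book_Content_Model_Feasible">
    <sch:rule context="book">
      <!-- Child element names as one string, each prefixed by a space -->
      <sch:let name="children"
               value="string-join(for $c in * return concat(' ', local-name($c)), '')"/>
      <!-- Relaxed model: title becomes title?, author+ becomes author*; order is still enforced -->
      <sch:assert test="matches($children, '^( title)?( author)*( chapter)*( index)?$')">
        The children of <sch:name/> cannot be completed to a valid sequence
        just by adding elements; they were:<sch:value-of select="$children"/>.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

With this version a book that so far contains only chapter elements passes, while a book with index before title still fails, because the relaxation removes the requirement that elements be present but keeps their relative order.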

Validators in editors frequently implement partial validation, where they don’t complain about child elements missing at the end of a content model. It is useful if you are entering the document in element order, but not otherwise.

Now another approach with the regular expression method would be to break it into smaller expressions, for example a string of three tokens with anything allowed before or after: trigrams. Indeed, that is pretty similar to what we do later, in effect, but not using regular expressions.

A more Schematron-ish way

So what is the Schematron-ish way to approach the problem? Well, it is to concentrate on two things: first, What is the most useful way of expressing and organizing diagnostics to help the user? and second, What is the model of user interaction built into the schema? Actually, in my opinion, you cannot answer the first without answering the second, and the second dictates the first.

Rather than talk theory, I’ll show you the approach and you should be able to figure out what I mean by user interaction and so on, with these use cases.

Use Cases

The user wants to check for typos: names that are spelled incorrectly
The user wants to check for containment: that elements and attributes belong to the correct parents
The user wants to check that all required elements and attributes are present
The user wants to check that each element is in the required position

This is another example of progressive validation. It allows the user to systematically find certain kinds of mistakes, and partitions them off. Because Schematron will usually report all the errors it finds anywhere in a document, it has the advantage that it is very easy to see systematic errors, if they are presented together; grammar-based validators often just die at the first error. But assertion-based schemas using paths may generate too many diagnostics, as the same error causes multiple assertions to fail.

So Schematron has a feature called phases. Phases let you group some patterns together, give them a name, and then instruct the validator to validate only the patterns in that phase. This allows workflows, progressive validation, incremental markup, transformation checking, variant document types, and so on. Very useful.

Each of these use cases may take one or more patterns to implement; however, we will make a phase for each of them. (Actually, we have gone phase craaazy, which will be covered in a later posting.) Here is the phase declaration to validate just the typos, for example:

<sch:phase id="phase-typo">
        <sch:active pattern="Element_Name_Typo">
                Pattern for checking for typos in element names.
        </sch:active>
        <sch:active pattern="Attribute_Name_Typo">
                Pattern for checking for typos in attribute names.
        </sch:active>
        <sch:p>This phase has all the patterns for checking typos in names.</sch:p>
</sch:phase>

As you can see, we are not validating using a state machine or similar grammar system at all.

The patterns

Here are the patterns; we have factored out the guts to make the commonality between this boilerplate more obvious.

<!-- pattern 5: Element name typos Elements    -->
<xsl:comment>
                        ============================================================
                                          ELEMENT NAMES
                        ============================================================
</xsl:comment>
<sch:pattern id="Element_Name_Typo">
        <sch:title>Typos in Element names</sch:title>
        <xsl:call-template name="generate-elements-typo-checking-rule"/>
</sch:pattern>
<sch:pattern id="Element_Name_Expected">
        <sch:title>Expected in Element names</sch:title>
        <xsl:call-template name="generate-elements-expected-checking-rule"/>
</sch:pattern>
<sch:pattern id="Element_Name_Required">
        <sch:title>Required in Element names</sch:title>
        <xsl:call-template name="generate-elements-required-checking-rule"/>
</sch:pattern>

<!-- pattern 6: Attributes name typos Attributes     -->
<xsl:comment>
                        ============================================================
                                 Attributes NAMES
                        ============================================================
</xsl:comment>
<sch:pattern id="Attribute_Name_Typo">
        <sch:title>Typos in Attributes names</sch:title>
        <xsl:call-template name="generate-attributes-typo-checking-rule"/>
</sch:pattern>
<sch:pattern id="Attribute_Name_Expected">
        <sch:title>Expected in Attributes names</sch:title>
        <xsl:call-template name="generate-attributes-expected-checking-rule"/>
</sch:pattern>
<sch:pattern id="Attribute_Name_Required">
        <sch:title>Required in Attributes names</sch:title>
        <xsl:call-template name="generate-attributes-required-checking-rule"/>
</sch:pattern>


Typos

The typo patterns are very easy. Here is the one for elements.

<xsl:template name="generate-elements-typo-checking-rule">
        <!-- One rule per element declared in the XSD, sorted by name -->
        <xsl:for-each select="//xs:element[@name]">
                <xsl:sort select="@name"/>
                <sch:rule context="{@name}">
                        <sch:assert test="true()">
                        The <sch:name/> element is defined in this schema.</sch:assert>
                </sch:rule>
        </xsl:for-each>
        <!-- Catch-all: any element not matched by a rule above is undeclared -->
        <sch:rule context="*">
                <sch:report test="true()" diagnostics="typo-element">
                Only elements declared in the schema may be used.</sch:report>
        </sch:rule>
</xsl:template>

In this case, we generate a rule for each element, but with only a vacuous true() assertion test; there is still a useful assertion behind it though, in the assertion text. Elements with typos fall through and are caught by the wildcard test of the last rule.

And finally, here is a simple diagnostic element, to report the miscreant.

<sch:diagnostic id="typo-element">
        The following element was found <sch:name/>.
</sch:diagnostic>
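To make the fall-through concrete, here is a sketch of what the generated output might look like for a hypothetical XSD that declares just two elements, para and title. Both names are invented for illustration, and wrapping the result in a minimal sch:schema with the ISO Schematron namespace is an assumption about the surrounding boilerplate. Within a pattern, a node is only checked by the first rule whose context matches it, so a misspelled titel element skips the named rules and lands in the wildcard rule.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern id="Element_Name_Typo">
    <sch:title>Typos in Element names</sch:title>
    <!-- Generated from the hypothetical declarations, sorted by name -->
    <sch:rule context="para">
      <sch:assert test="true()">The <sch:name/> element is defined in this schema.</sch:assert>
    </sch:rule>
    <sch:rule context="title">
      <sch:assert test="true()">The <sch:name/> element is defined in this schema.</sch:assert>
    </sch:rule>
    <!-- Anything not claimed by a rule above, such as a misspelled titel, fires here -->
    <sch:rule context="*">
      <sch:report test="true()" diagnostics="typo-element">Only elements declared in the schema may be used.</sch:report>
    </sch:rule>
  </sch:pattern>
  <sch:diagnostics>
    <sch:diagnostic id="typo-element">The following element was found <sch:name/>.</sch:diagnostic>
  </sch:diagnostics>
</sch:schema>
```

Because the named rules come first and always succeed, the order of the rules matters: moving the wildcard rule to the top would report every element as a typo.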

We’ll continue with handling more of the use cases in another post.