Converting XML Schemas to Schematron: (#9) Friendlier schemas

This article first appeared in a blog on O'Reilly on January 30, 2008.

We can improve on the diagnostics given by the rules in the previous article in this series, Progressive Validation of Complex Content Models.

Diagnosing Similar Names

One of the most common typos is simply getting the upper-case/lower-case of a name wrong. We can generate Schematron code to check this:

<sch:rule context="*[upper-case(local-name())=upper-case('Address')]">
         <sch:report test="true()">The unexpected element "<sch:name/>" has been used,
            which is close to an element in the schema: the element "Address".</sch:report>
      </sch:rule>

And here is the XSLT for generating those Schematron rules:

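<xsl:for-each select="//xs:element[@name]">
                <xsl:sort select="@name"/>
                <!-- Strip any prefix so that only the local part of the name is compared -->
                <xsl:variable name="theLocalName" select="replace( @name, '^(.*):(.*)', '$2' )" />
                <xsl:if test="string-length( $theLocalName ) > 0">
                        <!-- Assumed context expression, reconstructed to match the generated rule shown above -->
                        <sch:rule context="*[upper-case(local-name())=upper-case('{$theLocalName}')]">
                                <sch:report test="true()" role="note"
                                >The unexpected element "<sch:name/>" has been used, which is close to an
                                element in the schema: the element "<xsl:value-of select="@name"/>"
                                <xsl:if test="contains(@name, ':')"> in the
                                {<xsl:value-of select="ancestor::xs:schema/@targetNamespace"/>} namespace</xsl:if>.</sch:report>
                        </sch:rule>
                </xsl:if>
        </xsl:for-each>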

This code actually catches two problems: an upper-/lower-case typo, and an element whose local name is defined in the current namespace but which has been used in a different namespace.

Actually, the code as it is will generate a false positive if the same element name is used in multiple namespaces. So I will give it a role attribute of “Note” (as in Note, Caution, Warning). The role attribute lets you know what function a particular assertion plays in its rule or pattern.

These generated rules go in the pattern that checks for typos, after the checks for defined names but before the wildcard catch-all entry at the end. This way, elements that have correct names and namespaces are dealt with before these rules, and any names that have other problems are dealt with by the default.

In Schematron, a schema is made from patterns; each pattern contains rules, and each rule contains assertions (assert or report elements). Every assertion in a rule is tested in the context provided by the rule (an XPath that may match nodes of interest from the document). The rules in a pattern, however, form a case statement: if a node matches one rule, it won’t be tested by any subsequent rule in the same pattern.
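The ordering described above can be sketched as a single pattern. This is a minimal illustration rather than the generator's actual output: the pattern id and the wording of the catch-all message are assumptions made for the example, and upper-case() assumes an XPath 2.0 query binding.

```xml
<sch:pattern id="typo-checks" xmlns:sch="http://purl.oclc.org/dsdl/schematron">
   <!-- 1. Defined names come first, so correctly named elements stop here -->
   <sch:rule context="Address">
      <sch:assert test="true()">The element name "<sch:name/>" is defined.</sch:assert>
   </sch:rule>

   <!-- 2. Near misses: the same local name when case is ignored -->
   <sch:rule context="*[upper-case(local-name())=upper-case('Address')]">
      <sch:report test="true()" role="note">The unexpected element "<sch:name/>" has been used,
         which is close to an element in the schema: the element "Address".</sch:report>
   </sch:rule>

   <!-- 3. Catch-all for anything not matched above (hypothetical message text) -->
   <sch:rule context="*">
      <sch:report test="true()">The element "<sch:name/>" is not defined in the schema.</sch:report>
   </sch:rule>
</sch:pattern>
```

With this ordering, an Address element is swallowed by the first rule, an address element falls through to the second rule and gets the note, and a misspelling such as Adress reaches the catch-all.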

Towards terser, more declarative schemas

It is almost axiomatic that automatically generated code is ugly and unfriendly. Look at compiler generators, for example. Of course, getting consistent code that does the same thing many times is, in many cases, why you use a code generator like Schematron in the first place rather than writing the XSLT yourself.

But it is certainly possible to make the code more friendly and more declarative. In Converting Schematron to XML Schemas I showed how to use abstract rules to provide extra declarative information so that there is enough information to convert back to a kind of W3C XML Schema. It doesn’t go so far, but the idea is that abstract rules (and abstract patterns, together with the role attribute) provide the abstraction for grouping assertions and representing types.

I won’t go into the code, which is trivial. The idea is that quite a few rules or assertions don’t have any dynamic content (sometimes that is handled by the diagnostic element; other times we don’t expect the rule ever to generate messages, see Expressing untested and untestable constraints in Schematron), and we can use abstract patterns to make things much more declarative, readable and terse.

Here is an example, for the rules that swallow element names that are defined in the current namespace:

<sch:rule id="DefinedElement" abstract="true">
         <sch:assert test="true()">The element name "<sch:name/>" is defined.</sch:assert>
      </sch:rule>

      <sch:rule context="Address">
         <sch:extends rule="DefinedElement"/>
      </sch:rule>

      <sch:rule context="AgeNextBirthday">
         <sch:extends rule="DefinedElement"/>
      </sch:rule>

And here is an example for detecting various kinds of text content:

<sch:rule abstract="true" id="NoDataContent-ns1">
         <sch:assert test="string-length(normalize-space(string-join(text(), ''))) = 0"
                     diagnostics="d1">Element "<sch:name/>" should have no text content.</sch:assert>
      </sch:rule>

      <sch:rule abstract="true" id="NoElementContent-ns1">
         <sch:assert test="count(*|processing-instruction()|comment()) = 0" diagnostics="d1"
                >Element "<sch:name/>" should be completely empty (no XML comments, PIs, or elements).</sch:assert>
      </sch:rule>

      <!-- Completely empty: neither text content nor element content -->
      <sch:rule abstract="true" id="NoContents-ns1">
         <sch:extends rule="NoDataContent-ns1"/>
         <sch:extends rule="NoElementContent-ns1"/>
         <sch:assert test="count(processing-instruction()|comment()) = 0" diagnostics="d1"
                >Element "<sch:name/>" should be completely empty (no XML comments, PIs).</sch:assert>
      </sch:rule>

      <sch:rule context="BestTime">
         <sch:extends rule="NoElementContent-ns1"/>
      </sch:rule>

      <sch:rule context="Gender">
         <sch:extends rule="NoDataContent-ns1"/>
      </sch:rule>

      <sch:rule context="Female">
         <sch:extends rule="NoContents-ns1"/>
      </sch:rule>

      <sch:rule context="Male">
         <sch:extends rule="NoContents-ns1"/>
      </sch:rule>

Much easier to read than having all those assertions expanded!
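For comparison, here is roughly what the Gender rule looks like with the abstract rule written out inline. This is only a sketch of the expanded form, and it assumes, as the listing above does, that a diagnostic with the id d1 is defined elsewhere in the schema.

```xml
<sch:rule context="Gender" xmlns:sch="http://purl.oclc.org/dsdl/schematron">
   <!-- The assertion that NoDataContent-ns1 would otherwise supply via sch:extends -->
   <sch:assert test="string-length(normalize-space(string-join(text(), ''))) = 0"
               diagnostics="d1">Element "<sch:name/>" should have no text content.</sch:assert>
</sch:rule>
```

Repeat that for every element with the same constraint and the appeal of a one-line extends is clear.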