Why Is Schematron Different?

Schematron is simple because it uses XPaths to do all of the heavy lifting. XPaths allow you to navigate through XML documents using simple queries that look like directory paths or URLs, e.g. /book/chapter[1]/title or /observation/heart-rate

Not only does Schematron use XPaths to specify the constraints, it also uses them to determine the context locations for the tests: no “grammar” like a DTD is used. Contrast this with the approach of using XPaths to test assertions from contexts specified by a grammar (the idea originates in Dave Raggett’s Assertion Grammars and has found its way back into W3C XML Schemas) on one hand, or the approach of merely having lengthy absolute XPaths (which some testing applications have adopted) on the other hand. Schematron is a much simpler technology, and yet the most powerful.

Schematron also provides many value-adding features that dramatically increase its practical value. Here are some of the main ones:

Human

Assertions

Schematron is a schema language because it starts with you making clear positive statements in your own human language about what should be found in your document and why (for example, an A should have a B because of C): you capture these assertion statements into rule elements that provide the XPath context for A and assert elements that provide the test XPath for B. So you decide what the text is that the user will be presented with: not some cryptic technical boilerplate message generated in terms of grammar theory or types or element names. For example:

<rule context="/html/head">
  <assert test=
  "string-length(meta[@name='DC.date']/@content) &gt; 10">
  Our HTML documents need a Dublin Core metadata DATE field,
  because of our corporate document retention policy.
  </assert>
</rule>

In this example, the text message has really nothing to do with the actual XPaths: it gives the message in the best terms for the user to capture the specific business or technical requirement. Note that the assertion statement says what should be found: Schematron assertions are to capture what is supposed to be, not just give error messages like “Ooops I found something bad”.

Dynamic Text

Schematron allows you to pull in any data values into your assertion statements, using the value-of and name elements. So you can construct very specific and useful messages.
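
As a minimal sketch of dynamic text (the chapter and title element names are invented for illustration), an assertion might name the context element and quote values taken from the document:

<rule context="chapter">
  <assert test="string-length(title) &lt;= 60">
  The title of a <name/> should be 60 characters or fewer,
  but "<value-of select="title"/>" has
  <value-of select="string-length(title)"/> characters.
  </assert>
</rule>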

Documentation

The title, p and emph elements allow you to document your Schematron schema and frame the assertion statements, so that they can be printed out fairly directly, as text or HTML say. (Contrast this with conventional schema languages where there is no necessary explanation of the functionality or intent of content models that can be readily printed.)

Diagnostics

Schematron allows you to further specify a set of diagnostic messages for your assertions, using the diagnostic element. So you can put into the diagnostics all sorts of hint messages and information, rather than diluting the simple clear assertion text. An assert or report element specifies the id of a diagnostic element (in its diagnostics attribute), and this element is evaluated and added to the validation report that is put out. The text is not just static: it can be dynamically constructed, including values calculated from the document itself. Diagnostics are ultimately intended for reading by humans; to provide additional information intended to feed downstream processing, consider using the new property element.
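
A sketch of how this might look for the Dublin Core example used earlier. The id d-date-found and the wording of the hint are invented for illustration:

<rule context="/html/head">
  <assert test="string-length(meta[@name='DC.date']/@content) &gt; 10"
          diagnostics="d-date-found">
  Our HTML documents need a Dublin Core metadata DATE field,
  because of our corporate document retention policy.
  </assert>
</rule>

<!-- elsewhere in the schema, after the patterns -->
<diagnostics>
  <diagnostic id="d-date-found">
  The value found for DC.date was
  "<value-of select="meta[@name='DC.date']/@content"/>".
  Check that the meta element is present and that its name is spelled correctly.
  </diagnostic>
</diagnostics>

The assertion text stays short and positive, while the hint and the actual value go into the diagnostic.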

Localization

The diagnostic elements may be assigned to different human languages, allowing Schematron outputs in different languages without needing a resource substitution framework. (Plus our Schematron Skeleton implementation provides built-in error messages in multiple languages, for developers.)
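
For instance, assuming a hypothetical chapter element and invented ids, one assertion could point to diagnostics in two languages, leaving it to the implementation or the downstream application to pick the one it wants:

<rule context="chapter">
  <assert test="title" diagnostics="d-title-en d-title-fr">
  A chapter should have a title.
  </assert>
</rule>

<diagnostics>
  <diagnostic id="d-title-en" xml:lang="en">
  The chapter numbered <value-of select="@number"/> has no title.
  </diagnostic>
  <diagnostic id="d-title-fr" xml:lang="fr">
  Le chapitre numéro <value-of select="@number"/> n'a pas de titre.
  </diagnostic>
</diagnostics>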

Role

The role attribute lets you assign some arbitrary label to each rule or assertion. The most common use of these is to assign the traditional weights to each assertion of FATAL, ERROR, WARNING, INFO. Roles could also be provided for the OBSOLETE and OBSOLESCENT distinction that may be in effect.

See

The see attribute allows you to provide a hypertext link to documentation or related material relevant to each pattern, rule or assertion. The Schematron schema can therefore be integrated into your wider information system.
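
A short sketch combining the role and see attributes. The element names, the limit of 200 and the example.com URL are placeholders, not part of any real schema:

<rule context="chapter">
  <assert test="title" role="ERROR"
          see="https://example.com/styleguide#chapter-titles">
  A chapter should have a title, because titles are used to build the table of contents.
  </assert>
  <assert test="count(para) &lt; 200" role="WARNING">
  A chapter should have fewer than 200 paragraphs, because long chapters are hard to review.
  </assert>
</rule>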

Let

For developers, long XPaths can be difficult. The let element allows developers to break up their long XPaths into shorter, named variables that can make testing and inspection easier.
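
As an illustration (the invoice markup here is hypothetical), two let elements can replace one long expression and make the assertion easier to read:

<rule context="invoice">
  <let name="lines" value="lineItems/lineItem"/>
  <let name="computedTotal" value="sum($lines/amount)"/>
  <assert test="number(total) = $computedTotal">
  The invoice total should equal the sum of its line item amounts
  (<value-of select="$computedTotal"/>).
  </assert>
</rule>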

Rich Text

In the messages of assertions and diagnostics, multiple elements are allowed to provide rich text, similar to HTML:

emph allows emphasized phrases
span allows phrases to be specially marked
dir allows explicit markup of directionality (for Middle Eastern scripts)
@lang allows the language of the text to be specified
@icon allows a link to a small graphic, for a richer non-textual experience.
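
A small sketch of these in use. The class name and the image path are invented, and the language attribute is written here as xml:lang:

<rule context="chapter">
  <assert test="title" icon="images/warning.png" xml:lang="en">
  A chapter <emph>should</emph> have a
  <span class="element-name">title</span>.
  </assert>
</rule>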

Dynamic

Phase

All other schema languages are based on or enforce the rather bizarre assumptions that times do not change, that schemas do not develop, that all errors are of equal severity, that there is no pipeline of incremental processing, and that we always want to test everything at once (and perhaps always halt after the first error.) Schematron gives first class support to variations of schemas.

The phase element allows you to name and select any group of patterns. You can specify, as a command-line parameter or in the invocation, which phase you want to validate with. Only the active patterns that belong to that phase will be run.

For example, you might have a workflow where your initial automated conversion is expected to get some metadata and basic paragraph and table structures correct, but subsequent manual operations will correct the semantic markup, and a final automated process will add some kinds of links. So you could decide to have four phases:

The first phase might only validate the metadata and basic paragraph and table structures. If there is an error, the input document is sent for special input handling.
The second phase might attempt to report on likely elements where semantic markup might be expected, to help the manual process.
The third phase might validate the semantic markup.
The fourth phase might validate the links. It would also run the patterns for the first and third phases, to give a complete validation.
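
The four phases above could be declared along these lines. The phase and pattern ids are invented for illustration, and each active element refers to the id of a pattern defined elsewhere in the schema:

<phase id="conversion">
  <active pattern="metadata"/>
  <active pattern="basic-structure"/>
</phase>
<phase id="manual-hints">
  <active pattern="semantic-candidates"/>
</phase>
<phase id="semantic">
  <active pattern="semantic-markup"/>
</phase>
<phase id="final">
  <!-- re-runs the first and third phases' patterns, then the links -->
  <active pattern="metadata"/>
  <active pattern="basic-structure"/>
  <active pattern="semantic-markup"/>
  <active pattern="links"/>
</phase>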

Another example might be where there has been a revision of the schema, and some documents in your collection use the old schema and some use the new schema. These could be modelled as two phases, and the appropriate phase run for each document.

Another example might be where you first want to validate the incoming document using a few coarse and efficient tests perhaps even allowing false negatives (in one phase) and then validate the documents that failed with more complex or slow tests (in another phase).

Schematron does not provide any mechanism for sequencing phases or the patterns in a phase. These must be coded in the script or application that runs the Schematron engine.

Parameters

Schematron supports passing of strings through from the invoking script or application. These may include dates, verbosity switches, threshold numbers, names, or anything that can be tested in the XPaths.

Dynamic Text

Schematron allows you to pull in any data values into your generated assertion statements, using the value-of and name elements. So you can construct very specific and useful messages. The value-of element evaluates an XPath, which has full access to the current document, in-scope variables, and SOA resources available using HTTP GET. The name element allows you to get the name of the current context element, which is useful where the assertion can be run in a variety of contexts but you want your validation reports to say specifically which element was involved.

Integratable

Report

The report element can be used wherever the assert element can be. It is like an upside-down assert: assert generates an output where its test fails, but report generates an output where its test succeeds. So you can use report to detect features, or where you want to report that some error has occurred.

A common error for new developers is to use the assert element but with text that is not an assertion of what is expected: they are not using Schematron to create schemas but to specify tests. In this scenario, the report element is appropriate. For example, not <assert test="not( frog )">ERROR: I found a frog!!!!</assert> but either <assert test=" not( frog )" role="error">This element should not contain a frog</assert> or <report test="frog">I found a frog</report>

The report element allows many attributes (see next) that can provide downstream or calling applications useful information.

SVRL

The ISO standard for Schematron also specifies a simple XML output format for the results of validation, the Schematron Validation Report Language (SVRL). This provides lists of the assert elements that failed and the report elements that succeeded, along with details of which rule elements were fired. Typically these results include XPaths to allow linking of results to the original document.

All other standard schema languages strand their users and require them to use custom functionality, which often does not exist. For example, W3C XML Schemas defines a “Post Schema Validation Infoset” yet this information is not available using any standard XML tools. In effect, you are limited to whatever tools and functionality the vendors have chosen to expose.

Locations

The svrl:failed-assert/@location attribute allows the SVRL report to have an XPath to the original document locating the context node.
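
As a rough illustration, an SVRL report for a document that fails the Dublin Core assertion shown earlier might look something like this. The exact syntax of the location XPath and the other details included vary between implementations, so treat the values as indicative only:

<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:active-pattern/>
  <svrl:fired-rule context="/html/head"/>
  <svrl:failed-assert
      test="string-length(meta[@name='DC.date']/@content) &gt; 10"
      location="/html[1]/head[1]">
    <svrl:text>Our HTML documents need a Dublin Core metadata DATE field,
    because of our corporate document retention policy.</svrl:text>
  </svrl:failed-assert>
</svrl:schematron-output>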

Role

See

The see attribute allows you to provide a hypertext link to documentation or related material relevant to each pattern, rule or assertion. The Schematron schema and reports can therefore be integrated into your wider information system.

Flag

The assert/@flag attribute provides a way for multiple assertions to declare that some feature has been found in a document. A flag either is raised or does not exist. Multiple assertions can raise the same flag, but the SVRL report will only mention one. Flags can simplify the processing of the validation report: instead of requiring OR logic to say “Did assertion a1, a6 or a7 fail?” you can have those assertions raise the particular flag.
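
A sketch, with an invented flag name and order markup: both assertions raise the same flag, so a downstream process only has to look for that one flag rather than for each assertion.

<rule context="order">
  <assert test="customer" flag="incomplete">
  An order should name a customer.
  </assert>
  <assert test="item" flag="incomplete">
  An order should contain at least one item.
  </assert>
</rule>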

Properties

The properties element is a container for property elements. You can specify that a failed assertion or successful report has some properties, and these properties will be calculated and added to the SVRL results (similar to how diagnostics are added). This allows you to add name/value pair properties for downstream processing.

This is a new feature with ISO Schematron 2016. Several users reported that because assert and report elements can only contain text or some simple formatting elements, they needed to create home-made little languages in the assertions to embed data for downstream automated processing. So instead of using

<assert ...>WARN|XBF|123|binglybong|There should be a gruntfuttock</assert>

you can have

<assert ... role="WARN" properties="TLA code sillyword">There should be a gruntfuttock</assert>

coupled with (at the end of the schema):

<properties>
  <property id="TLA">XBF</property>
  <property id="code">123</property>
  <property id="sillyword">binglybong</property>
</properties>

Properties can be used to provide static information, such as type annotations, but can also be dynamically generated using the value-of element, with full access by XPath to the input document, to variables and to URLs. Properties are intended for consumption by downstream automated processors; if you are constructing messages intended for immediate human consumption, consider using the diagnostics element.

Dynamic Text

Schematron allows you to pull in any data values into your generated assertion statements, diagnostics and property values, using the value-of and name elements. So you can construct very specific and useful text. The value-of element evaluates an XPath, which has full access to the current document, in-scope variables, and SOA resources available using HTTP GET. The name element allows you to get the name of the current context element.

Modelling

Patterns

The pattern element is the primary abstraction in Schematron. We are not checking types, we are checking for the presence or absence of patterns in documents. A pattern is a series of rules, where any node in the XML document will only match one rule (the first that matches); in each rule a bag of assertions will be tested. So patterns can be larger than just a single element.

For an example of a pattern, consider the “table”. A table in conventional schema languages is just an undifferentiated bag of declarations for the individual elements: there may be declarations for an element type called “table” but there is no way to group the whole pattern together. In Schematron, by contrast, we can define a pattern called table, and declare rules for the tables, sections, colspecs, rows, and entries. We can declare assertions to tie the number of allowed entries in a row to the numbers specified by the column specification elements. We can specify constraints that apply to descendants, such as that a table may not contain a table underneath it anywhere. The assertion texts can speak in terms of “columns” even though there are no columns in the data model or markup as such.
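
A much-reduced sketch of such a pattern, assuming a simplified table markup in which a table holds colspec and row children and each row holds entry children (real table models have more layers than this):

<pattern id="table">
  <title>Table</title>
  <rule context="table">
    <assert test="row">
    A table should have at least one row.
    </assert>
    <assert test="not(.//table)">
    A table should not contain another table anywhere underneath it.
    </assert>
  </rule>
  <rule context="table/row">
    <assert test="count(entry) = count(../colspec)">
    A row should have one entry for each column of the table.
    </assert>
  </rule>
</pattern>

Note that the last assertion speaks of columns although the markup only has colspec elements.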

Conventional schema languages utterly fail to model patterns in documents that involve more than parent/child relationships.

Abstract Rules

The rule element can have an attribute abstract="true" and an id attribute, instead of a context specification. Other rules can then use the extends element to pull in (mix in) those constraints. (In XML Schema terms, an abstract rule is a base type which can be further restricted.)
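
For example (the ids and element names here are invented), an abstract rule can hold a constraint shared by several kinds of element:

<pattern id="titled-things">
  <rule abstract="true" id="has-title">
    <assert test="title">A <name/> should have a title.</assert>
  </rule>
  <rule context="chapter">
    <extends rule="has-title"/>
    <assert test="para">A chapter should have at least one paragraph.</assert>
  </rule>
  <rule context="appendix">
    <extends rule="has-title"/>
  </rule>
</pattern>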

Abstract Patterns

The pattern element can have an attribute abstract="true". It is used by declaring some real pattern using the is-a attribute, and supplying parameters that will be substituted into the appropriate locations of the abstract pattern. So an abstract pattern can be customized to specify the same pattern that has different markup representations. For example, our documents may have normal tables, but also some elements with specific names that in fact have the same structures as tables: we make an abstract pattern for tables in general (“A table must have rows; rows must have entries” etc.) and then provide the particular XPaths when we declare the concrete pattern. Abstract patterns can also be used where there are localized versions of schemas, for example the same schema translated to use Chinese character markup.
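
A sketch of the table case. The calendar, week and day element names are made up to stand for a table-like structure with different markup:

<pattern abstract="true" id="table-like">
  <rule context="$table">
    <assert test="$row">A table must have rows.</assert>
  </rule>
  <rule context="$table/$row">
    <assert test="$entry">Rows must have entries.</assert>
  </rule>
</pattern>

<pattern is-a="table-like" id="html-table">
  <param name="table" value="table"/>
  <param name="row" value="tr"/>
  <param name="entry" value="td"/>
</pattern>

<pattern is-a="table-like" id="calendar-table">
  <param name="table" value="calendar"/>
  <param name="row" value="week"/>
  <param name="entry" value="day"/>
</pattern>

Each param value is substituted for the corresponding $name in the abstract pattern, so the two concrete patterns share the same assertions.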

Subject

The rule/@subject attribute allows you to specify a path to some node in the document that you want to mark as the subject of the assertions. Sometimes the context node specified by the rule element may just be some convenient element for a workable XPath, rather than the actual subject. The SVRL validation report will have the XPath to the subject in its location attribute.

Efficiency

Rule Contexts

The efficiency of Schematron largely depends on the underlying XPath or XSLT engine being used. But Schematron is designed to be much more efficient than some later XPath test languages, which only provide a single absolute XPath to test.

First, Schematron allows multiple assertions to be tested for each rule that matches a context.
Second, Schematron rules form if-then-else chains, so that a more specific rule can mask a less specific rule, avoiding unnecessary tests and elaborate negative predicates.
Third, by using a wildcard in the last rule/@context of the pattern, you can catch all the nodes that were missed by the previous rules: much more efficient than elaborating and excluding previously caught nodes.

WARNING: XSLT engines vary enormously in their efficiency or speed: in some cases by 20:1. Some engines will have a performance explosion on the descendant, following or preceding axes. So if you have real-time constraints, do not choose an XSLT engine merely because it is bundled with the platform. HINT: If you are using Schematron called from some scripts and running an XSLT engine using Java, where the time of starting up the JVM may be greater than the time of validation, you may consider adopting some technology like Nailgun.

Variables

The let element allows variables to be declared. These are scoped to their parent element and cannot be overwritten during that scope. Variables allow initial portions of long repeated paths to be evaluated only once. Variable evaluation is usually lazy, meaning that if the variable is never needed, it is not evaluated.

Query Language Binding

The /schema/@queryBinding attribute allows you to select which query language to use inside attributes such as rule/@context and assert/@test. Typically these would be “xpath”, “xslt1” or “xslt2”. Selecting XSLT2 unlocks many of the high-efficiency functions supported there, such as full regular expressions.
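
For instance, with the XSLT2 binding the XPath 2.0 matches() function becomes available. The partNumber element and its format are assumptions made for this sketch:

<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern id="part-numbers">
    <rule context="partNumber">
      <assert test="matches(., '^[A-Z]+-[0-9]+$')">
      A part number should be capital letters, a hyphen, then digits.
      </assert>
    </rule>
  </pattern>
</schema>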

Keys

If you are using an XSLT Query Language Binding, then the xsl:key element is available to assign more efficient random access using keys.
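
A sketch of a key used for reference checking, assuming hypothetical item and itemref elements:

<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        queryBinding="xslt">
  <xsl:key name="item-by-id" match="item" use="@id"/>
  <pattern id="references">
    <rule context="itemref">
      <assert test="key('item-by-id', @ref)">
      An itemref should point to an item that exists in the document.
      </assert>
    </rule>
  </pattern>
</schema>

The key lets the engine look items up by id rather than scanning the document with a path such as //item[@id = current()/@ref] for every itemref.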

Phases

The phase element lets you select a subset of patterns and only validate against them. So you only need to validate at any stage of a workflow the specific patterns that are relevant at that stage.

Implementation Features

Schematron implementations may support extra options to tailor the execution: for example, an option to terminate validation after the first failed assertion is found, or to skip checking attribute nodes to see if they match any rule context.

HINT: Schematron is great for capturing requirements and getting a working initial version of your validation up fast, even if your unusual real-time constraints then force you to reimplement some more targeted hard-coded system.

Custom Functions

Most implementations of Schematron sit on top of an XSLT1 or XSLT2 implementation. Indeed, many XSLT1 implementations support EXSLT libraries, which provide some of the functionality that XSLT2 brings. XSLT2 allows you to define your own custom functions, and most XSLT implementations also provide some custom functions, and allow you to invoke some functions from the platform: one common use is to call ODBC or JDBC to talk directly to a DBMS.

Schematron inherits and allows this: of course, if you do it, you lose platform-independence, and you may then need developers who know both Schematron/XPath and the platform language being used. But if there are real-time constraints that need to be satisfied, or particular tests that need to be performed, it may be necessary.

Schematron is a Web Technology

Schematron variables can be loaded with XML data retrieved from some server using the doc() and document() XPath functions. Any simple web service that returns an XML document in response to an HTTP GET request can be connected to (i.e. POX not SOAP). The Schematron script can construct the URL and provide whatever URL parameters are needed. For example, instead of loading a large XML lookup table, create a web service that provides access to Redis: this removes the killer loading latency for large files.
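
A sketch of such a lookup. The example.com URL, the code parameter and the shape of the returned XML are all assumptions made for illustration:

<rule context="product">
  <!-- build the lookup URL from a value in the document being validated -->
  <let name="lookup"
       value="document(concat('https://example.com/lookup?code=', @code))"/>
  <assert test="$lookup/product/@status = 'active'">
  A product should have a code that the catalogue service lists as active.
  </assert>
</rule>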

In general, it is better to use these kinds of web services rather than including Java or .NET code into the Schematron schema. Web services provide a separation of concerns and may regulate access better.

Easy

XPaths

Schematron itself is so simple in part because the heavy lifting is done by expressions in an embedded query language: typically this is XSLT1 or XSLT2, but there have been Schematrons using many other XML query languages, such as XQuery, StAX streaming, RDF queries and even CSS/JavaScript.

In fact, the best way to become a Schematron guru is to get a really good understanding of XPath!

Namespaces

XML Namespaces are a continual headache for developers: they never work exactly the way you think. And the scoping rules mean that in order to check an instance document, you may have to check every ancestor element to find which prefixes are being bound to which namespaces, or if default namespaces are in use. XML Schemas uses a different rule from XSLT to figure out which namespaces are in use for declarations, and both differ from general documents. (In fact, there is no general W3C standard for knowing how to bind “QNames in attributes or content” to namespaces.)

So Schematron cuts through this complexity: the only place that a namespace prefix used in an XPath can be declared is in the /schema/ns elements. There is no use of defaulting, no need to look up an ancestor hierarchy or track through multiple declarations.

Variables

Complex XPaths can be evaluated in sections using named variables. This helps the next reader of the schema, and it also improves efficiency, re-use and debugging.

Fallthrough and Wildcards

Rule contexts are evaluated in the order they are written. This makes editing and understanding easy.

Within each pattern element, the context selections of the rule elements act as an if-then-else chain (or “switch”). So a node in a document will match at most one rule. This can dramatically simplify the XPaths, because a rule context does not need to have predicates to avoid nodes that have been selected by a previous rule: for example, if one rule context is “person[@age <= 16 or @certify = 'NSW4C']” then the next rule does not need to be “person[@age > 16][not(@certify ='NSW4C')]”; it can just be “person”.

XPath provides the wildcard patterns “*” and “@*”. You can use these to match any node that has not matched previous rule contexts. One use for this may be to report unexpected elements or attributes, which may indicate a problem in the document (an error or evolution) or perhaps some shortcoming of your schema.
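
Putting the two ideas together in a sketch: the person contexts come from the example above, while the people and guardian elements and all of the assertions are invented for illustration.

<pattern id="people">
  <rule context="people">
    <assert test="person">A people list should contain at least one person.</assert>
  </rule>
  <rule context="person[@age &lt;= 16 or @certify = 'NSW4C']">
    <assert test="guardian">This person should have a guardian recorded.</assert>
  </rule>
  <!-- only reached by person elements the previous rule did not match -->
  <rule context="person">
    <assert test="@name">A person should have a name.</assert>
  </rule>
  <rule context="guardian">
    <assert test="@name">A guardian should have a name.</assert>
  </rule>
  <!-- catch-all: any element not matched by an earlier rule -->
  <rule context="*">
    <report test="true()">The element <name/> was not expected here.</report>
  </rule>
</pattern>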