Schematron is simple because it uses XPath to do all of the heavy lifting. XPaths allow you to navigate through XML documents using simple queries that look like directory paths or URLs: for example, /book/chapter[1]/title or /observation/heart-rate
Not only does Schematron use XPaths to specify the constraints, it also uses them to determine the context locations for the tests: no “grammar” like a DTD is used. Contrast this with the approach of using XPaths to test assertions from contexts specified by a grammar (the idea originates in Dave Raggett’s Assertion Grammars and has found its way back into W3C XML Schemas) on the one hand, or the approach of merely having lengthy absolute XPaths (which some testing applications have adopted) on the other. Schematron is a much simpler technology, and yet the most powerful.
Schematron also provides many features that dramatically increase its practical value. Here are some of the main ones:
Human
Assertions
Schematron is a schema language because it starts with you making clear positive statements in your own human language about what should be found in your document and why (for example, an A should have a B because of C). You capture these assertion statements in rule elements that provide the XPath context for A, and assert elements that provide the test XPath for B. So you decide what text the user will be presented with: not some cryptic technical boilerplate message generated in terms of grammar theory or types or element names.
For example:
<rule context="/html/head">
  <assert test="string-length(meta[@name='DC.date']/@content) > 10">
    Our HTML documents need a Dublin Core metadata DATE field,
    because of our corporate document retention policy.
  </assert>
</rule>
In this example, the text message has really nothing to do with the actual XPaths: it gives the message in the best terms for the user to capture the specific business or technical requirement. Note that the assertion statement says what should be found: Schematron assertions are to capture what is supposed to be, not just give error messages like “Ooops I found something bad”.
Dynamic Text
Schematron allows you to pull any data values into your assertion statements, using the value-of and name elements. So you can construct very specific and useful messages.
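As a minimal sketch of dynamic text, the rule below uses name and value-of inside an assertion message. The chapter, section and title element names are hypothetical and only serve as illustration.

```xml
<rule context="chapter">
  <!-- <name/> yields the name of the context element;
       <value-of/> evaluates an XPath relative to that element -->
  <assert test="count(section) >= 1">
    The <name/> element titled "<value-of select="title"/>"
    should contain at least one section.
  </assert>
</rule>
```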
Documentation
The title, p and emph elements allow you to document your Schematron schema and frame the assertion statements, so that they can be printed out fairly directly, as text or HTML, say. (Contrast this with conventional schema languages, where there is no necessary explanation of the functionality or intent of content models that can be readily printed.)
Diagnostics
Schematron allows you to further specify a set of diagnostic messages for your assertions, using the diagnostic element. So you can put all sorts of hint messages and information into the diagnostics, rather than diluting the simple, clear assertion text. An assert or report element specifies the id of a diagnostic element, and this element is evaluated and added to the validation report that is put out. The text can be dynamically constructed, including values calculated from the document itself: it is not just static text.
Diagnostics are ultimately intended for reading by humans; to provide additional information intended to feed downstream processing, consider using the new property element.
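The following sketch shows one way an assertion might point at a diagnostic through its diagnostics attribute. The invoice and line elements and their attributes are assumptions made up for the example.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="invoice">
      <!-- The assertion text stays short; the detail lives in the diagnostic -->
      <assert test="sum(line/@amount) = @total" diagnostics="d-total">
        The invoice total should equal the sum of its line amounts.
      </assert>
    </rule>
  </pattern>
  <diagnostics>
    <!-- Evaluated in the same context as the assertion that references it -->
    <diagnostic id="d-total">
      The declared total is <value-of select="@total"/> but the lines
      sum to <value-of select="sum(line/@amount)"/>.
    </diagnostic>
  </diagnostics>
</schema>
```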
Localization
The diagnostic elements may be assigned to different human languages, allowing Schematron outputs in different languages without needing a resource substitution framework. (Plus our Schematron Skeleton implementation provides built-in error messages in multiple languages, for developers.)
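A hedged sketch of localized diagnostics, assuming the language is marked with xml:lang and that the reporting tool selects the diagnostic for the language it wants. The invoice element and its currency attribute are hypothetical.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="invoice">
      <!-- One assertion, two diagnostics in different languages -->
      <assert test="@currency" diagnostics="currency-en currency-fr">
        An invoice should state its currency.
      </assert>
    </rule>
  </pattern>
  <diagnostics>
    <diagnostic id="currency-en" xml:lang="en">
      Add a currency attribute to the invoice element.
    </diagnostic>
    <diagnostic id="currency-fr" xml:lang="fr">
      Ajoutez un attribut currency à l'élément invoice.
    </diagnostic>
  </diagnostics>
</schema>
```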
Role
The role attribute lets you assign some arbitrary label to each rule or assertion. The most common use of these is to assign the traditional weights of FATAL, ERROR, WARNING and INFO to each assertion. Roles could also be provided for the OBSOLETE and OBSOLESCENT distinction that may be in effect.
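For instance, severity labels could be attached like this. The para element and both tests are illustrative assumptions, and how a role value is acted upon is left to the tool that consumes the report.

```xml
<rule context="para">
  <!-- Empty paragraphs are tolerated but worth noting -->
  <assert test="normalize-space(.) != ''" role="WARNING">
    A paragraph should contain some text.
  </assert>
  <!-- Nested paragraphs are treated as a real error -->
  <assert test="not(ancestor::para)" role="ERROR">
    A paragraph should not be nested inside another paragraph.
  </assert>
</rule>
```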
See
The see attribute allows you to provide a hypertext link to documentation or related material relevant to each pattern, rule or assertion. The Schematron schema can therefore be integrated into your wider information system.
Let
For developers, long XPaths can be difficult. The let element allows developers to break up their long XPaths into shorter, named variables that can make testing and inspection easier.
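A small sketch of how let might break a long expression into named parts. The order and line elements and the credit-limit attribute are hypothetical.

```xml
<rule context="order">
  <!-- Each variable names one step of what would otherwise be a single long XPath -->
  <let name="active-lines" value="line[@status = 'active']"/>
  <let name="order-total" value="sum($active-lines/@amount)"/>
  <assert test="$order-total &lt;= @credit-limit">
    The total of the active order lines should not exceed the credit limit.
  </assert>
</rule>
```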
Rich Text
In the message of assertions and diagnostics, multiple elements and attributes are allowed to provide rich text, similar to HTML:
- emph allows emphasized phrases
- span allows phrases to be specially marked
- dir allows explicit markup of directionality (for Middle Eastern scripts)
- @lang allows the language of the text to be specified
- @icon allows a link to a small graphic, for a richer non-textual experience.
Dynamic
Phase
All other schema languages are based on or enforce the rather bizarre assumptions that times do not change, that schemas do not develop, that all errors are of equal severity, that there is no pipeline of incremental processing, and that we always want to test everything at once (and perhaps always halt after the first error). Schematron gives first-class support to variations of schemas.
The phase element allows you to name and select any group of patterns. You can specify, in a command-line parameter or at invocation, which phase you want to validate with. Only the active patterns that belong to that phase will be run.
For example, you might have a workflow where your initial automated conversion is expected to get some metadata and basic paragraph and table structures correct, but subsequent manual operations will correct the semantic markup, and a final automated process will add some kinds of links. So you could decide to have four phases:
- The first phase might only validate the metadata and basic paragraph and table structures. If there is an error, the input document is sent for special input handling.
- The second phase might attempt to report on likely elements where semantic markup might be expected, to help the manual process.
- The third phase might validate the semantic markup.
- The fourth phase might validate the links. It would also run the patterns for the first and third phases, to give a complete validation.
Another example might be where there has been a revision of the schema, and some documents in your collection use the old schema and some use the new schema. These could be modelled as two phases, and the appropriate phase run for each document.
Another example might be where you first want to validate the incoming document using a few coarse and efficient tests perhaps even allowing false negatives (in one phase) and then validate the documents that failed with more complex or slow tests (in another phase).
Schematron does not provide any mechanism for sequencing phases or the patterns in a phase. These must be coded in the script or application that runs the Schematron engine.
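The four-phase workflow described above could be sketched roughly as follows. The phase and pattern ids, the element names and the tests are all hypothetical placeholders; the sequencing of the phases still has to happen in the calling script.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" defaultPhase="final">
  <phase id="conversion">
    <active pattern="metadata"/>
    <active pattern="structure"/>
  </phase>
  <phase id="hints">
    <active pattern="semantic-hints"/>
  </phase>
  <phase id="semantic">
    <active pattern="semantic-markup"/>
  </phase>
  <!-- The last phase re-runs the first and third phases' patterns and adds the links -->
  <phase id="final">
    <active pattern="metadata"/>
    <active pattern="structure"/>
    <active pattern="semantic-markup"/>
    <active pattern="links"/>
  </phase>

  <pattern id="metadata">
    <rule context="/doc">
      <assert test="meta/title">A document should have a title in its metadata.</assert>
    </rule>
  </pattern>
  <pattern id="structure">
    <rule context="table">
      <assert test="row">A table should contain rows.</assert>
    </rule>
  </pattern>
  <pattern id="semantic-hints">
    <rule context="para">
      <report test="contains(., 'http://')">This paragraph seems to contain a URL that may need link markup.</report>
    </rule>
  </pattern>
  <pattern id="semantic-markup">
    <rule context="term">
      <assert test="@ref">A term should reference a glossary entry.</assert>
    </rule>
  </pattern>
  <pattern id="links">
    <rule context="link">
      <assert test="@href != ''">A link should have a non-empty target.</assert>
    </rule>
  </pattern>
</schema>
```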
Parameters
Schematron supports passing strings through from the invoking script or application. These may include dates, verbosity switches, threshold numbers, names, or anything that can be tested in the XPaths.
Dynamic Text
Schematron allows you to pull any data values into your generated assertion statements, using the value-of and name elements. So you can construct very specific and useful messages. The value-of element evaluates an XPath, which has full access to the current document, in-scope variables, and SOA resources available using HTTP GET. The name element allows you to get the name of the current context element, which is useful where the assertion can be run in a variety of contexts but you want your validation reports to have specific details about which element was involved.
Integratable
Report
The report element can be used wherever the assert element can be. It is like an upside-down assert: assert generates an output where its test fails, but report generates an output where its test succeeds. So you can use report to detect features, or where you want to report that some error has occurred.
A common error for new developers is to use the assert element but with text that is not an assertion of what is expected: they are not using Schematron to create schemas but to specify tests. In this scenario, the report element is appropriate. For example, not
<assert test="not( frog )">ERROR: I found a frog!!!!</assert>
but either
<assert test="not( frog )" role="error">This element should not contain a frog</assert>
or
<report test="frog">I found a frog</report>
The report element allows many attributes (see next) that can provide downstream or calling applications with useful information.
SVRL
The ISO standard for Schematron also specifies a simple XML output format for the results of validation, the Schematron Validation Report Language (SVRL). This provides lists of the assert elements that failed and the report elements that succeeded, along with details of which rule elements were fired. Typically these results include XPaths to allow linking of results to the original document.
All other standard schema languages strand their users and require them to use custom functionality, which often does not exist. For example, W3C XML Schemas defines a “Post Schema Validation Infoset” yet this information is not available using any standard XML tools. In effect, you are limited to whatever tools and functionality the vendors have chosen to expose.
Locations
The svrl:failed-assert/@location attribute allows the SVRL report to have an XPath to the original document locating the context node.
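As an illustration, an SVRL report for a failed Dublin Core date assertion on /html/head might look roughly like the sketch below. The exact form of the location path and the set of attributes emitted vary between implementations, so treat the values as assumptions.

```xml
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:active-pattern/>
  <!-- One fired-rule entry for each rule whose context matched a node -->
  <svrl:fired-rule context="/html/head"/>
  <!-- The location attribute points back into the validated document -->
  <svrl:failed-assert
      test="string-length(meta[@name='DC.date']/@content) > 10"
      location="/html[1]/head[1]">
    <svrl:text>Our HTML documents need a Dublin Core metadata DATE field,
    because of our corporate document retention policy.</svrl:text>
  </svrl:failed-assert>
</svrl:schematron-output>
```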
Role
The role attribute lets you assign some arbitrary label to each rule or assertion. The most common use of these is to assign the traditional weights of FATAL, ERROR, WARNING and INFO to each assertion. Roles could also be provided for the OBSOLETE and OBSOLESCENT distinction that may be in effect.
See
The see attribute allows you to provide a hypertext link to documentation or related material relevant to each pattern, rule or assertion. The Schematron schema and reports can therefore be integrated into your wider information system.
Flag
The assert/@flag attribute provides a way for multiple assertions to declare that some feature has been found in a document. A flag is either raised or does not exist. Multiple assertions can raise the same flag, but the SVRL report will only mention one. Flags can simplify the processing of the validation report: instead of requiring OR-logic to say “Did assertion a1, a6 or a7 fail?” you can have those assertions raise the particular flag.
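A minimal sketch of several assertions raising one shared flag. The flag name incomplete and the document, title, author and date elements are hypothetical.

```xml
<rule context="document">
  <!-- A downstream processor only needs to check for the "incomplete" flag,
       not for each of these three assertions separately -->
  <assert test="title" flag="incomplete">A document should have a title.</assert>
  <assert test="author" flag="incomplete">A document should have an author.</assert>
  <assert test="date" flag="incomplete">A document should have a date.</assert>
</rule>
```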
Properties
The properties element is a container for property elements. You can specify that a failed assertion or successful report has some properties, and these properties will be calculated and added to the SVRL results (similar to how diagnostics are added). This allows you to add name/value pair properties for downstream processing.
This is a new feature with ISO Schematron 2016. Several users reported that because assert and report elements can only contain text or some simple formatting elements, they needed to create home-made little languages in the assertions to embed data for downstream automated processing. So instead of using
<assert ...>WARN|XBF|123|binglybong|There should be a gruntfuttock</assert>
you can have
<assert role="WARN" properties="TLA code sillyword">There should be a gruntfuttock</assert>
coupled with (at the end of the schema):
<properties>
  <property id="TLA">XBF</property>
  <property id="code">123</property>
  <property id="sillyword">binglybong</property>
</properties>
Properties can be used to provide static information, such as type annotations, but can also be dynamically generated using the value-of element, with full XPath access to the input document, variables and URLs. Properties are intended for consumption by downstream automated processors; if you are constructing messages intended for immediate human consumption, consider using the diagnostics element.
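Dynamically generated properties might be sketched like this. The measurement element, its attributes and the property ids are assumptions for illustration only.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="measurement">
      <assert test="@unit" properties="element-id measured-value">
        A measurement should state its unit.
      </assert>
    </rule>
  </pattern>
  <properties>
    <!-- Each property is evaluated in the context of the assertion that references it -->
    <property id="element-id"><value-of select="@id"/></property>
    <property id="measured-value"><value-of select="normalize-space(.)"/></property>
  </properties>
</schema>
```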
Dynamic Text
Schematron allows you to pull any data values into your generated assertion statements, diagnostics and property values, using the value-of and name elements. So you can construct very specific and useful text. The value-of element evaluates an XPath, which has full access to the current document, in-scope variables, and SOA resources available using HTTP GET. The name element allows you to get the name of the current context element.
Modelling
Patterns
The pattern element is the primary abstraction in Schematron. We are not checking types, we are checking for the presence or absence of patterns in documents. A pattern is a series of rules, where any node in the XML document will only match one rule (the first that matches); in each rule a bag of assertions will be tested. So patterns can be larger than just a single element.
For an example of a pattern, consider the “table”. A table in conventional schema languages is just an undifferentiated bag of declarations for the individual elements: there may be declarations for an element type called “table” but there is no way to group the whole pattern together. In Schematron, by contrast, we can define a pattern called table, and declare rules for the tables, sections, colspecs, rows, and entries. We can declare assertions to tie the number of allowed entries in a row to the numbers specified by the column specification elements. We can specify constraints that apply to descendants, such as that a table may not contain a table underneath it anywhere. The assertion texts can speak in terms of “columns” even though there are no columns in the data model or markup as such.
Conventional schema languages utterly fail to model patterns in documents that involve more than parent/child relationships.
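A rough sketch of such a table pattern follows. The element names table, colspec, row and entry are assumed for the example and are not tied to any particular table model.

```xml
<pattern id="table">
  <rule context="table">
    <assert test="row">A table should have at least one row.</assert>
    <!-- A constraint on descendants at any depth, not just children -->
    <assert test="not(.//table)">A table should not contain another table.</assert>
  </rule>
  <rule context="table/row">
    <!-- The message speaks of "columns" although no column element exists -->
    <assert test="count(entry) = count(../colspec)">
      A row should have one entry for each column of the table.
    </assert>
  </rule>
</pattern>
```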
Abstract Rules
The rule element can have an attribute abstract="true" and an id attribute, instead of a context specification. Other rules can then use the extends element to pull in (mix in) those constraints. (In XML Schema terms, an abstract rule is a base type which can be further restricted.)
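As a minimal sketch, an abstract rule and a rule that mixes it in could look like this. The rule id has-id and the chapter element are hypothetical.

```xml
<pattern>
  <!-- No context here: the assertions are only a reusable bundle -->
  <rule abstract="true" id="has-id">
    <assert test="@id">The <name/> element should have an id attribute.</assert>
  </rule>
  <rule context="chapter">
    <!-- Pulls in the assertions of the abstract rule, then adds its own -->
    <extends rule="has-id"/>
    <assert test="title">A chapter should have a title.</assert>
  </rule>
</pattern>
```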
Abstract Patterns
The pattern element can have an attribute abstract="true". It is used by declaring some real pattern using the is-a attribute, and supplying parameters that will be substituted into the appropriate locations of the abstract pattern. So an abstract pattern can be customized to specify the same pattern that has different markup representations. For example, our documents may have normal tables, but also some elements with specific names that in fact have the same structures as tables: we make an abstract pattern for tables in general (“A table must have rows; rows must have entries” etc.) and then provide the particular XPaths when declaring the concrete pattern. Abstract patterns can also be used where there are localized versions of schemas, for example the same schema translated to use Chinese character markup.
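The table example could be sketched with an abstract pattern and two concrete instances. The pattern ids, the parameter names and the calendar, week and day elements are assumptions made up for illustration.

```xml
<!-- $table, $row and $entry are placeholders filled in by each concrete pattern -->
<pattern abstract="true" id="table-like">
  <rule context="$table">
    <assert test="$row">A table should have rows.</assert>
  </rule>
  <rule context="$table/$row">
    <assert test="$entry">A row should have entries.</assert>
  </rule>
</pattern>

<!-- The same pattern applied to two different markup representations -->
<pattern is-a="table-like" id="html-table">
  <param name="table" value="table"/>
  <param name="row" value="tr"/>
  <param name="entry" value="td | th"/>
</pattern>
<pattern is-a="table-like" id="calendar-table">
  <param name="table" value="calendar"/>
  <param name="row" value="week"/>
  <param name="entry" value="day"/>
</pattern>
```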
Subject
The rule/@subject attribute allows you to specify a path to some node in the document that you want to mark as the subject of the assertions. Sometimes the context node specified by the rule element may just be some convenient element for a workable XPath, rather than the actual subject. The SVRL validation report will have the XPath to the subject in its location attribute.
Efficiency
Rule Contexts
The efficiency of Schematron largely depends on the underlying XPath or XSLT engine being used. But Schematron is designed to be much more efficient than some other, subsequent XPath test languages, which only provide a single absolute XPath to test.
- First, Schematron allows multiple assertions to be tested for each rule that matches a context.
- Second, Schematron rules form if-then-else chains, so that a more specific rule can mask a less specific rule, avoiding unnecessary tests and elaborate negative predicates.
- Third, by using a wildcard in the last rule/@context of the pattern, you can catch all the nodes that were missed by the previous rules: much more efficient than elaborating and excluding previously caught nodes.
WARNING: XSLT engines vary enormously in their efficiency or speed: in some cases by 20:1. Some engines will have a performance explosion on the descendant, following or preceding axes. So if you have real-time constraints, do not choose an XSLT engine merely because it is bundled with the platform.
HINT: If you are calling Schematron from scripts and running an XSLT engine using Java, where the time of starting up the JVM may be greater than the time of validation, you may consider adopting some technology like Nailgun.
Variables
The let element allows variables to be declared. These are scoped to their parent element and cannot be overwritten during that scope. Variables allow initial portions of long repeated paths to be evaluated only once. Variable evaluation is usually lazy, meaning that if the variable is never needed, it is not evaluated.
Query Language Binding
The /schema/@queryBinding attribute allows you to select which query language to use inside attributes such as rule/@context and assert/@test. Typically these would be “xpath”, “xslt1” or “xslt2”. Selecting XSLT2 unlocks many of the high-efficiency functions supported there, such as full regular expressions.
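A short sketch of a schema that selects the XSLT2 binding so that the XPath 2.0 matches() function can be used. The isbn element and the 13-digit rule are illustrative assumptions.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern>
    <rule context="isbn">
      <!-- matches() takes a regular expression and is not available in XPath 1.0 -->
      <assert test="matches(., '^[0-9]{13}$')">
        An ISBN should consist of exactly 13 digits.
      </assert>
    </rule>
  </pattern>
</schema>
```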
Keys
If you are using an XSLT Query Language Binding, then the xsl:key element is available to assign more efficient random access using keys.
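A hedged sketch of a key used for lookups, assuming an implementation that accepts xsl:key ahead of the patterns. The key name and the person and author elements are hypothetical.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        queryBinding="xslt2">
  <!-- Index every person element by its id attribute -->
  <xsl:key name="person-by-id" match="person" use="@id"/>
  <pattern>
    <rule context="author">
      <!-- key() avoids scanning the whole document for each author -->
      <assert test="key('person-by-id', @ref)">
        An author should refer to a person declared in the document.
      </assert>
    </rule>
  </pattern>
</schema>
```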
Phases
The phase element lets you select a subset of patterns and only validate against them. So at any stage of a workflow you only need to validate the specific patterns that are relevant at that stage.
Implementation Features
Schematron implementations may support extra options to tailor the execution: for example, an option to terminate validation after the first failed assertion is found, or to skip checking attribute nodes to see if they match any rule context.
HINT: Schematron is great for capturing requirements and getting a working initial version of your validation up fast, even if your unusual real-time constraints then force you to reimplement some more targeted hard-coded system.
Custom Functions
Most implementations of Schematron sit on top of an XSLT1 or XSLT2 implementation. Indeed, many XSLT1 implementations support EXSLT libraries, which provide some of the functionality that XSLT2 brings. XSLT2 allows you to define your own custom functions, and most XSLT implementations also provide some custom functions and allow you to invoke functions from the platform: one common use is to call ODBC or JDBC to talk directly to a DBMS.
Schematron inherits and allows this: of course, if you do it, you lose platform-independence, and you may then need developers who know both Schematron/XPath and the platform language being used. But if there are real-time constraints that need to be satisfied, or particular tests that need to be performed, it may be necessary.
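As a sketch only, a custom XSLT2 function embedded in a schema might look like the following. It assumes an XSLT2-based implementation that permits foreign xsl:function elements in the schema (some need an option switched on for this); the my prefix, its namespace URI and the event element are hypothetical.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:my="http://example.com/functions"
        queryBinding="xslt2">
  <!-- Declare the prefix so it can be used inside the XPaths -->
  <ns prefix="my" uri="http://example.com/functions"/>

  <!-- Hypothetical helper: true when the given year is a leap year -->
  <xsl:function name="my:is-leap-year">
    <xsl:param name="year"/>
    <xsl:sequence select="($year mod 4 = 0 and $year mod 100 != 0)
                          or $year mod 400 = 0"/>
  </xsl:function>

  <pattern>
    <rule context="event[@month = '2' and @day = '29']">
      <assert test="my:is-leap-year(@year)">
        An event dated 29 February should fall in a leap year.
      </assert>
    </rule>
  </pattern>
</schema>
```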
Schematron is a Web Technology
Schematron variables can be loaded with XML data retrieved from some server using the doc() and document() XPath functions. Any simple web service that returns an XML document in response to an HTTP GET request can be connected to (i.e. POX, not SOAP). The Schematron script can construct the URL and provide whatever URL parameters are needed. For example, instead of loading a large XML lookup table, create a web service that provides access to Redis: this removes the killer loading latency for large files.
In general, it is better to use these kinds of web services rather than including Java or .NET code into the Schematron schema. Web services provide a separation of concerns and may regulate access better.
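A minimal sketch of a lookup against a web service, assuming the XSLT2 binding so that doc() and encode-for-uri() are available. The URL, the product element and the shape of the returned XML are hypothetical.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern>
    <rule context="product">
      <!-- Build the request URL from data in the document and fetch the result by HTTP GET -->
      <let name="lookup"
           value="doc(concat('http://example.com/products?code=',
                             encode-for-uri(@code)))"/>
      <assert test="$lookup/product/@status = 'active'">
        A product should use a code that the catalogue service lists as active.
      </assert>
    </rule>
  </pattern>
</schema>
```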
Easy
XPaths
Schematron itself is so simple in part because the heavy lifting is done by expressions in an embedded query language: typically this is XSLT1 or XSLT2, but there have been Schematrons using many other XML query languages, such as XQuery, STaX streaming, RDF queries and even CSS/JavaScript.
In fact, the best way to become a Schematron guru is to get a really good understanding of XPath!
Namespaces
XML Namespaces are a continual headache for developers: they never work exactly the way you think. And the scoping rules mean that in order to check an instance document, you may have to check every ancestor element to find which prefixes are being bound to which namespaces, or if default namespaces are in use. XML Schemas uses a different rule from XSLT to figure out which namespaces are in use for declarations, and XSLT's rule in turn differs from that of general documents. (In fact, there is no general W3C standard for knowing how to bind “QNames in attributes or content” to namespaces.)
So Schematron cuts through this complexity: the only place that a namespace prefix used in an XPath can be declared is in the /schema/ns elements. There is no use of defaulting, no need to look up an ancestor hierarchy or track through multiple declarations.
Variables
Complex XPaths can be evaluated in sections using named variables. This helps the next reader of the schema, and it also improves efficiency, re-use and debugging.
Fallthrough and Wildcards
Rule contexts are evaluated in the order they are written. This makes editing and understanding easy.
Within each pattern element, the context selections of the rule elements act as an if-then-else chain (or “switch”). So a node in a document will match at most one rule. This can dramatically simplify the XPaths, because a rule context does not need to have predicates to avoid nodes that have been selected by a previous rule: for example, if one rule context is “person[@age <= 16 or @certify = 'NSW4C']” then the next rule does not need to be “person[@age > 16][not(@certify = 'NSW4C')]”: it can just be “person”.
XPath provides the wildcard patterns “*” and “@*”. You can use these to match any node that has not matched previous rule contexts. One use for this may be to report unexpected elements or attributes, which may indicate a problem in the document (an error or evolution) or perhaps some shortcoming of your schema.
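The fallthrough and wildcard behaviour could be sketched as one pattern. The people element, the guardian and licence attributes and the assertion texts are hypothetical; only the person contexts come from the example above.

```xml
<pattern>
  <rule context="people">
    <assert test="person">The people list should contain at least one person.</assert>
  </rule>
  <!-- The more specific rule comes first and masks the general person rule -->
  <rule context="person[@age &lt;= 16 or @certify = 'NSW4C']">
    <assert test="@guardian">This person should name a guardian.</assert>
  </rule>
  <!-- No negative predicates needed: only the remaining person elements reach this rule -->
  <rule context="person">
    <assert test="@licence">This person should have a licence.</assert>
  </rule>
  <!-- Catch-all: any element not matched by the rules above -->
  <rule context="*">
    <report test="true()">Unexpected element <name/> found.</report>
  </rule>
</pattern>
```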