Six kinds of validation using Schematron

Posted on December 4, 2016 by Rick Jelliffe

Document Invariants

This is the most straightforward use of Schematron: your Schematron script is a schema that states what is supposed to be found in the data, metadata, structures and links of any or every XML document.  These may be fixed, such as “All documents should have head and body sections, to hold metadata and data respectively”, or they may be co-occurrence constraints, such as “If the month is February, April, June, September or November then the day should be less than 31, according to the Julian Calendar”. (Note that these assertions specify a context, a test, and a rationale, which I think is best practice.)

This kind of Schematron is used for input or output validation, especially for firewalling: to detect incorrect data received from some second party and prevent harm.

Schemas using this will typically use Schematron assert elements.  The document is valid or invalid.
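As a minimal sketch of the two example constraints above, a schema might look like the following. The element and attribute names (document, head, body, date, @month, @day) are hypothetical, since no vocabulary is given here, and the XSLT 2 query binding is assumed so that the month list can be written as a sequence.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern id="document-invariants">
    <!-- Fixed constraint: hypothetical root element "document" -->
    <sch:rule context="/document">
      <sch:assert test="head">A document should have a head section, to hold its metadata.</sch:assert>
      <sch:assert test="body">A document should have a body section, to hold its data.</sch:assert>
    </sch:rule>
    <!-- Co-occurrence constraint: hypothetical "date" element with numeric
         month and day attributes -->
    <sch:rule context="date[@month = ('2', '4', '6', '9', '11')]">
      <sch:assert test="number(@day) &lt; 31">If the month is February, April, June, September or November then the day should be less than 31.</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Each assertion carries its context in the rule, its test in the XPath, and its rationale in the message text.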

Known Anomaly Reporting

To get good quality-in-use out of a system, it helps to implement some means of process improvement and continuous quality improvement. In this case, when some problem in the system or markup is detected and addressed by developers and data entry staff, a corresponding Schematron rule is put in place.  This allows the quality of the fix to be confirmed, and information about the problem to be captured.  Feedback is often regarded as the core of quality improvement, and it is particularly useful to allow for such feedback in the early days of a newly deployed product. An example of such an anomaly is “It is an error for a document to say that an Australian citizen has more than one current spouse.”

Schemas using this will typically use Schematron report elements.  The document is as-expected or anomalous.
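The spouse anomaly above could be sketched with a report element roughly as follows. The person and spouse elements and the @citizenship and @status attributes are assumptions made only for illustration.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern id="known-anomalies">
    <!-- Hypothetical markup: person elements carrying a citizenship code -->
    <sch:rule context="person[@citizenship = 'AU']">
      <!-- A report fires when its test is true, i.e. when the anomaly is present -->
      <sch:report test="count(spouse[@status = 'current']) &gt; 1">
        It is an error for a document to say that an Australian citizen has
        more than one current spouse.
        Found: <sch:value-of select="count(spouse[@status = 'current'])"/>.
      </sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```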

Unit Tests

A unit test takes an artificial, custom-made input document and a transformation on that document that is expected to yield specific results. For example, “Test 21: expected value of total is 36.  Found: 82”

Schemas using this will typically use Schematron report elements.  The test passes or fails.  The found results can be dynamically generated using Schematron dynamic assertions or property elements.
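A sketch of how the “Test 21” message might be produced is shown below. It assumes the schema is run over the output of the transformation, and that the output has a hypothetical root element invoice with a total child. The value-of element supplies the dynamic part of the message.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern id="test-21">
    <!-- Hypothetical output of the transformation under test -->
    <sch:rule context="/invoice">
      <!-- The report fires when the total is missing or is not 36 -->
      <sch:report test="number(total) != 36">
        Test 21: expected value of total is 36.
        Found: <sch:value-of select="total"/>
      </sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```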

There are several widely used technologies that took up Schematron's XPath-based approach to support unit testing as the primary goal, which readers may also find useful: OASIS CAM and Selenium.

Input/Output Invariant Comparison

This kind of Schematron schema is useful for Quality Control as well as QA. I have used it for large conversion projects, where the organization was changing the schema for all its data and needed to make sure no data had been dropped, moved or corrupted.  One project involved hundreds of thousands of large documents and over 1,000 assertions, some assertions with as many as 40 predicates on the XPath: Schematron proved itself capable of supporting this kind of task well.  In this type of validation, the schema tests the invariants that should exist between the inputs and the outputs. For example, “The output document should have the same number of paragraphs as the input document in the same order.”

Schemas using this will typically use Schematron assert elements.  The output corresponds or does not correspond. You need to have some system to allow the Schematron process access to both the input and output documents: the simplest way is often just to combine the two documents under a single dummy element and validate that.

Sometimes the XPath tests for an assertion may not exhaustively test the assertion, but provide a good-enough stab.  For example, the assertion above might be tested by generating a string from the input document made from the first letter of each paragraph, then comparing that with the string generated from the first letter of each paragraph in the output document: if they are different then there is not the same number of paragraphs, or the order has changed, or there is some other problem.
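The first-letter comparison might be sketched like this, assuming the two documents have been combined under a hypothetical dummy element comparison with input and output children, that paragraphs are marked up as para in the old schema and p in the new one, and that an XPath 2.0 query binding is available for string-join.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern id="input-output-invariants">
    <!-- Hypothetical wrapper element holding both documents -->
    <sch:rule context="/comparison">
      <!-- One character per paragraph, in document order -->
      <sch:let name="input-signature"
               value="string-join(for $p in input//para return substring(normalize-space($p), 1, 1), '')"/>
      <sch:let name="output-signature"
               value="string-join(for $p in output//p return substring(normalize-space($p), 1, 1), '')"/>
      <sch:assert test="$input-signature = $output-signature">
        The output document should have the same number of paragraphs as the
        input document in the same order.
        Input: <sch:value-of select="$input-signature"/>
        Output: <sch:value-of select="$output-signature"/>
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

This is deliberately approximate: empty paragraphs contribute nothing to the signature, and two different paragraphs that start with the same letter are indistinguishable.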

Round-trip Testing

Where your system already supports round-tripping of data (i.e. from XML to a database, and from the database out to XML again) then round-trip testing is useful.  But it is also useful when the conversion is highly complex and makes the Schematron XPath too complex: removing a layer so that the tests can be written only in terms of the input documents can simplify the Schematron XPath, at the cost that the return conversion itself needs to be debugged and maintained. So round-trip testing may be problematic if the forward trip is not well-specified.  For example, “The round-tripped document should have exactly the same number of SKU line items as the original document, however they may be in different order.”

Schemas using this will typically use Schematron assert elements. The returned document has significant differences or no significant differences.
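As a sketch of the SKU example, assume the original and returned documents are combined under a hypothetical roundtrip element, and that line items are line-item elements with a sku attribute. The second assertion goes slightly beyond the stated example by also checking that no SKU has gone missing.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern id="round-trip">
    <!-- Hypothetical wrapper holding the original and the returned document -->
    <sch:rule context="/roundtrip">
      <sch:assert test="count(original//line-item) = count(returned//line-item)">
        The round-tripped document should have exactly the same number of SKU
        line items as the original document.
        Original: <sch:value-of select="count(original//line-item)"/>
        Returned: <sch:value-of select="count(returned//line-item)"/>
      </sch:assert>
      <!-- Order-independent: each original SKU need only appear somewhere -->
      <sch:assert test="every $sku in original//line-item/@sku satisfies $sku = returned//line-item/@sku">
        Every SKU in the original document should appear in the round-tripped
        document, though possibly in a different order.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```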

There is a close connection between round-trip testing and “diffing”.  Diffing tools are more appropriate than Schematron for efficiency and detection; however, they do not have any ability to explain the fault, and they need customization to not report differences that may be irrelevant (order differences of semantically unordered elements, dropped obsolete elements, etc.).  In some situations it may be efficient to only run Schematron on original/returned documents that have differences detected by a diff engine.

Testing against Corpus Information

Sometimes you need to validate a document against a collection, for example to test link integrity: “The person identified in the report should be a registered voter.”  The brute-force approach of loading all documents at the same time is simply not appropriate for collections of any size.  So this type of testing is an anti-pattern rather than a pattern.

The appropriate way to do it is to maintain some database available from a URL using GET from Schematron XPaths.  The XPaths can construct the appropriate URL.  (The SOA service should provide POX (Plain Old XML) rather than SOAP: if your services provide some complex API like SOAP or some non-XML representation, you will probably need a mediator service to provide just XML.)  In some cases, it may just be that a simple file can be accessed and regularly updated, such as a database dump, to provide looser coupling between the database server and the systems running Schematron.
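A sketch of the voter lookup might look like the following. The service URL, the person element with its id attribute, and the shape of the returned XML (a voter element with a registered attribute) are all hypothetical, and whether the XPath doc() function may fetch remote URLs depends on the Schematron implementation and how it is configured.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern id="corpus-lookup">
    <!-- Hypothetical markup: a report naming a person by id -->
    <sch:rule context="report/person">
      <!-- Construct the GET URL for a hypothetical plain-XML lookup service -->
      <sch:let name="lookup-url"
               value="concat('http://registry.example/voters/', encode-for-uri(@id), '.xml')"/>
      <sch:assert test="if (doc-available($lookup-url))
                        then doc($lookup-url)/voter/@registered = 'true'
                        else false()">
        The person identified in the report should be a registered voter.
        Looked up: <sch:value-of select="$lookup-url"/>
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Pointing the same URL construction at a regularly refreshed dump file instead of a live service would give the looser coupling mentioned above.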

Compound Document Testing

I have been writing as if there is always just a single input XML document and a single output XML document. However, this is not always true: ISO Schematron can cope with compound documents (i.e. the input and output documents are ZIP archives, perhaps unzipped).