Integrating Schematron with syslog

Posted on January 8, 2017 by Rick Jelliffe

One of the cool things about Schematron is that it allows a high level of integration into systems that might want to consume the results of a validation. The result of validation is not just a basic valid/invalid distinction (as with DTDs), cryptic messages that frustrate end-users, or some abstract and complex “Post-Schema-Validation Infoset” data structure (as with XML Schemas), but a rich and fairly flat XML document: the Schematron Validation Report, written in an ISO standard language, SVRL, which you can find in the ISO International Standard for Schematron. I write “validation”, but Schematron validation can mean all sorts of pattern detection.

Syslog is an internet protocol, described in RFC 5424, for conveying event messages. It is probably older than most readers now! Integrating Schematron into a messaging framework like syslog can be really neat, because many logging and dashboard systems allow integration with syslog. In the age of cloud computing and containers, we want stateless applications, so for validation events that are not consumed by the local application, we need to push those events outside our system.

I won’t provide code here on how to create syslog messages: there are lots of APIs and code samples on the internet for your particular platform, e.g. for PowerShell you might look here or here.

Instead, let’s look at the information in an SVRL file that can be used to drive a syslog message. The scenario here is that we have some service or application that runs Schematron to validate an XML document (or a determinate bunch of them), and the validation produces an XML SVRL file. Our service (it might just be a script) then iterates through that file, looking at each top-level element (XPath /*/*) and generating a corresponding syslog call if appropriate. Usually this would be for the /*/svrl:failed-assert and /*/svrl:successful-report elements.
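As a minimal sketch of that iteration, assuming Python and its standard xml.etree.ElementTree module: the sample SVRL document, its attribute values and the function name reportable_events are made up for illustration, and a real report would come from your own Schematron run.

```python
import xml.etree.ElementTree as ET

# The SVRL namespace URI.
SVRL_NS = "http://purl.oclc.org/dsdl/svrl"

# Illustrative report only: real SVRL comes from your Schematron processor.
SAMPLE_SVRL = """<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:active-pattern/>
  <svrl:fired-rule context="date"/>
  <svrl:failed-assert test="@day &lt;= 28" role="error" location="/form/date[1]">
    <svrl:text>A date in February 2017 should have a maximum value of 28: found 30</svrl:text>
  </svrl:failed-assert>
  <svrl:successful-report test="@draft" role="info" location="/form">
    <svrl:text>Draft form received</svrl:text>
  </svrl:successful-report>
</svrl:schematron-output>"""


def reportable_events(svrl_xml):
    """Yield (kind, element) for each top-level failed-assert or successful-report."""
    root = ET.fromstring(svrl_xml)
    wanted = {"failed-assert", "successful-report"}
    prefix = "{" + SVRL_NS + "}"
    for child in root:  # the direct children of the root, i.e. XPath /*/*
        if not child.tag.startswith(prefix):
            continue  # ignore anything outside the SVRL namespace
        kind = child.tag[len(prefix):]
        if kind in wanted:
            yield kind, child


events = list(reportable_events(SAMPLE_SVRL))
for kind, element in events:
    print(kind, element.get("role"), element.get("location"))
```

Each yielded element is what the script would turn into one syslog call.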

Priority

Syslog expects a priority, a hostname, a timestamp, and a message. The message may be quite rich, and we will look at that later. The hostname and timestamp do not interest us here. The priority is made by combining a “facility” number with a “severity” number: the facility code is multiplied by 8 and the severity is added. Let’s say our facility is “local7” (an architectural decision, which might indicate that this is not something we want system administrators to see first up). Local7 is facility code 23, so it contributes 23 × 8 = 184 to the priority, and our severity is a number from 0 to 7.

Schematron allows assertions to have a “role” attribute, and usually these roles might be ERROR, WARNING, INFORMATION. Syslog defines 8 severity levels, numbered 0 to 7 from most to least severe, and if we want to, we can use these as the values for role: emerg, alert, crit, error, warning, notice, info, debug.

So in our scenario, the script uses the values of svrl:failed-assert/@role and svrl:successful-report/@role to provide the severity number in order to generate the priority for the syslog message.
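The arithmetic can be sketched as follows, in Python. The function name syslog_priority, the alias table for spellings such as INFORMATION, and the choice to fall back to notice when a role is missing or unrecognised are all assumptions made for this illustration, not anything Schematron or syslog prescribes.

```python
# Syslog severities, numbered 0 (most severe) to 7 (least severe).
SEVERITIES = {
    "emerg": 0, "alert": 1, "crit": 2, "error": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}

# Assumed aliases so that conventional Schematron roles still map somewhere.
ALIASES = {"err": "error", "warn": "warning",
           "information": "info", "informational": "info"}

LOCAL7 = 23  # facility code for local7; 23 * 8 = 184


def syslog_priority(role, facility=LOCAL7, default="notice"):
    """Return the syslog priority value for a Schematron @role value.

    Assumption: a missing or unrecognised role falls back to `default`.
    """
    key = (role or default).strip().lower()
    key = ALIASES.get(key, key)
    severity = SEVERITIES.get(key, SEVERITIES[default])
    return facility * 8 + severity


print(syslog_priority("error"))    # 184 + 3 = 187
print(syslog_priority("WARNING"))  # 184 + 4 = 188
```

With local7 fixed, the priority always lands between 184 and 191, so the severity can be recovered by a consumer as the priority modulo 8.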

Sometimes developers are nervous about complex XPaths and want to confirm that the Schematron rule has indeed been fired for each subject element of interest, even if there were no failed assertions. In that situation, they can generate a syslog entry for each /*/svrl:fired-rule with a severity of debug.

Message

So let’s get to the syslog message content. You can send any Unicode string using syslog: the message is encoded as UTF-8 and prefixed with a BOM (byte order mark), itself encoded as UTF-8 bytes (BOM = %xEF.BB.BF). This repurposes that character from what Unicode intended, using it as a signature that announces UTF-8 rather than as a byte-order indicator, but it is a good idea.
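To make the byte layout concrete, here is a small Python sketch of encoding just the message part. It is not a syslog client, and the function name encode_msg is made up for this example; codecs.BOM_UTF8 is the standard library constant for the three BOM bytes.

```python
import codecs


def encode_msg(text):
    """Encode message text as UTF-8 bytes preceded by the UTF-8 BOM."""
    return codecs.BOM_UTF8 + text.encode("utf-8")


payload = encode_msg("A date in February 2017 should have a maximum value of 28: found 30")

# The first three bytes are the BOM, EF BB BF.
print(payload[:3].hex())
```

A receiver that sees those three bytes at the start of the message knows the rest is UTF-8.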

So the simplest thing to do is to just take your assertion text and put it out as that message. This will be the text under svrl:failed-assert/svrl:text. Schematron allows your schema to generate arbitrary and custom human-readable messages, and is in a sense a report-generating (but not formatting) language. The message might say “A date in February 2017 should have a maximum value of 28: found 30”.

So that is good: in our scenario now when Schematron detects something, it sends a message to a logger running on some other system which includes our custom message, and a priority, timestamp and some vague information about the facility that generated the message. And that is where it is not good enough.  We need our log to know which application generated the error, and probably for which document.  Because our streaming system may not keep state, or save problem or interesting documents, we may also want to use information from inside the XML document to identify the document, rather than filenames etc.

One approach is to make some homemade additions to our message string: for example, our script could use pipe separators to provide more detail: “Form123Checker|instance123456|A date in February 2017 should have a maximum value of 28: found 30” or whatever.

Structured Data

But syslog provides a way to tag simple structured data, rather like element tags with [] instead of <>.  There are some predefined structured data elements for timing, originating process and sender metadata.  But you can add your own metadata.  One approach would be to simply copy over the useful information from the SVRL entry to the  syslog structured data: so the logger has access to as much information as possible.

In concrete terms, you would create a structured element called fa or sr (let’s be kind to storage) with parameters taken from the @role, @id, @location and @flag attributes of each svrl:failed-assert and svrl:successful-report. You would also create structured elements for each svrl:diagnostic-reference and svrl:property-reference element under the svrl:failed-assert and svrl:successful-report too.
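A rough Python sketch of that mapping follows. Several things in it are assumptions for illustration: the short names dr and pr for the reference elements, the extra text parameter that carries their content, and the enterprise number 32473, which is the one reserved for documentation examples (RFC 5424 wants unregistered structured data IDs written as name@enterprise-number, so you would substitute your own). Inside parameter values the characters ", \ and ] have to be backslash-escaped.

```python
import xml.etree.ElementTree as ET

SVRL = "{http://purl.oclc.org/dsdl/svrl}"
ENTERPRISE = "32473"  # documentation-only enterprise number; use your own
SHORT_NAMES = {"failed-assert": "fa", "successful-report": "sr",
               "diagnostic-reference": "dr", "property-reference": "pr"}


def escape_value(value):
    """Backslash-escape the three characters that are special in a value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("]", "\\]")


def sd_element(name, params):
    """Format one structured data element from (name, value) pairs."""
    parts = ["{}@{}".format(name, ENTERPRISE)]
    parts += ['{}="{}"'.format(k, escape_value(v)) for k, v in params]
    return "[" + " ".join(parts) + "]"


def structured_data(event):
    """Build structured data for one svrl:failed-assert or svrl:successful-report."""
    local = event.tag[len(SVRL):]
    params = [(a, event.get(a)) for a in ("role", "id", "location", "flag")
              if event.get(a) is not None]
    out = [sd_element(SHORT_NAMES[local], params)]
    for child in event:
        child_local = child.tag[len(SVRL):]
        if child_local in ("diagnostic-reference", "property-reference"):
            # Copy the un-namespaced attributes, then the whitespace-normalised text.
            child_params = [(k, v) for k, v in sorted(child.attrib.items())
                            if not k.startswith("{")]
            text = " ".join("".join(child.itertext()).split())
            if text:
                child_params.append(("text", text))
            out.append(sd_element(SHORT_NAMES[child_local], child_params))
    return "".join(out)


# Illustrative fragment only.
SAMPLE = """<svrl:failed-assert xmlns:svrl="http://purl.oclc.org/dsdl/svrl"
    id="a1" role="error" location="/form/date[1]" test="@day &lt;= 28">
  <svrl:text>A date in February 2017 should have a maximum value of 28</svrl:text>
  <svrl:diagnostic-reference diagnostic="d1">found 30</svrl:diagnostic-reference>
</svrl:failed-assert>"""

sd = structured_data(ET.fromstring(SAMPLE))
print(sd)
```

Note that the ] in the XPath location gets escaped, which is easy to forget since locations with positional predicates are common in SVRL.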

Wrinkles? Structured element names are limited to ASCII, though the parameter values are UTF-8. As I read the BNF, the structured IDs do allow the characters -, : and _, so you could use SVRL element or attribute names without drama if you wanted to. But I think terseness is a virtue when logging.

And when converting a text element into simple text, you would need to decide how to handle rich markup if the Schematron schema uses it. Schematron allows a small amount of rich markup in its messages, to support internationalization requirements and emphasis. The simplest thing to do would be just to strip out any markup in messages and retain the text. Otherwise you are back to providing some homemade tagging format in the message string, which would then need extra effort to be usable if the consuming systems allow message parsing.
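The strip-the-markup option is short in Python, since ElementTree’s itertext() walks all the text nodes under an element regardless of nested tags. In this sketch the function name plain_message and the emph element in the sample are illustrative assumptions; collapsing runs of whitespace is an extra choice of mine, made so that the message stays on one line.

```python
import xml.etree.ElementTree as ET

SVRL = "{http://purl.oclc.org/dsdl/svrl}"


def plain_message(event):
    """Return the text of the event's svrl:text child with all markup removed."""
    text_elem = event.find(SVRL + "text")
    if text_elem is None:
        return ""
    # itertext() yields every text node in document order, skipping the tags.
    raw = "".join(text_elem.itertext())
    return " ".join(raw.split())  # collapse newlines and repeated spaces


# Illustrative fragment only.
SAMPLE = """<svrl:failed-assert xmlns:svrl="http://purl.oclc.org/dsdl/svrl" role="error">
  <svrl:text>A date in <svrl:emph>February 2017</svrl:emph> should have
    a maximum value of 28: found 30</svrl:text>
</svrl:failed-assert>"""

message = plain_message(ET.fromstring(SAMPLE))
print(message)
```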