Optimizing Schematron using @saxon:memo-function

Posted on February 28, 2017 by Rick Jelliffe

Tony Graham mentioned in an email his use of Saxon’s optimization hint attribute xsl:function/@saxon:memo-function to memoize the values of some functions. He had investigated it for his open-source focheck project, which validates XSL-FO documents. I was intrigued, as I had never used this technique, and Tony kindly provided details for me and readers.

Memoizing is where the implementation stores the return value of a function between invocations, so that it is not recalculated when the function is called again with the same arguments: it is a kind of caching keyed on the argument values. This is a useful technique when you are validating XML documents that use subclassing (e.g. DITA, XSL-FO, or my highly generic documents), where an attribute value supplies the specific name for the element in place of the normal generic identifier: for example, a structured HTML document where the names of the structures you are validating are all in the @class attribute.

From Tony:

If, and it’s a big IF, your Schematron is doing a lot of running the same XPath test expressions on the same values, and you are also using either Saxon PE or Saxon EE to run your XSLT2 binding, then you may benefit from using saxon:memo-function (http://saxonica.com/documentation/index.html#!extensions/attributes/memo-function). It gets Saxon to short-circuit reevaluating the same expression over and over again and instead just return the result saved from the first time those parameter values were used.

saxon:memo-function is a Saxon extension attribute that may be used with xsl:function. With saxon:memo-function="yes", Saxon caches the result the first time the function is called with a given set of parameter values. When the function is called again with the same parameter values, Saxon returns the cached result for those values rather than reevaluating the function just to return the same result. There are some caveats in the Saxon documentation about when not to use saxon:memo-function (for example, when the function has side-effects or it accesses the current context), but it should be generally usable with the expressions in Schematron tests.
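One rough way to see whether the hint is being honoured is to put a deliberate side-effect inside the function and count how often it fires. The stylesheet below is only a sketch: the my: namespace, the my:normalize function, the template name main, and the sample values are all invented for illustration, and it assumes Saxon PE or EE is running it from the named template.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:saxon="http://saxon.sf.net/"
    xmlns:my="http://example.com/ns/my"
    exclude-result-prefixes="xs saxon my">

  <xsl:output method="text"/>

  <!-- Hypothetical function standing in for an expensive test.
       The xsl:message is a side-effect added only so that each
       real evaluation becomes visible. -->
  <xsl:function name="my:normalize" as="xs:string"
                saxon:memo-function="yes">
    <xsl:param name="value" as="xs:string"/>
    <xsl:message>evaluating my:normalize('<xsl:value-of select="$value"/>')</xsl:message>
    <xsl:sequence select="lower-case(normalize-space($value))"/>
  </xsl:function>

  <!-- Calls the function five times, but with only two distinct values. -->
  <xsl:template name="main">
    <xsl:for-each select="('12pt', 'bold', '12pt', '12pt', 'bold')">
      <xsl:value-of select="my:normalize(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

If the results are being cached, you would expect the message to appear once per distinct value instead of once per call. That is also exactly why the documentation warns against memoizing functions with side-effects: the side-effect no longer happens on every call.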

Whether, and to what extent, using saxon:memo-function can speed up your Schematron processing entirely depends on your Schematron and your documents.  As with most things to do with XSLT performance, you need to test it with realistic documents and the particular XSLT processor version that you use before you can say for sure.

An example where saxon:memo-function does help is the focheck framework (https://github.com/AntennaHouse/focheck) for validating XSL-FO files. XSL-FO properties are expressed as XML attributes, which focheck needs to check.  However, property values can be expressions in the expression language defined in the XSL 1.1 Recommendation (https://www.w3.org/TR/xsl11/#d0e5032), so focheck has to evaluate the property value expressions before working out if the result is an allowed value for the current property.  As a consequence, the focheck Schematron has hundreds of:

<let name="expression" value="ahf:parser-runner(.)"/>

where ahf:parser-runner() is:

<!-- ahf:parser-runner($input as xs:string) as element()+ -->
<!-- Runs the REx-generated parser on $input then reduces the parse
     tree to an XSL 1.1 datatype.  Uses @saxon:memo-function extension
     to memoize return values (when used with Saxon PE or Saxon EE)
     to avoid reparsing the same strings again and again when this is
     used as part of validating an entire XSL-FO document. -->
<xsl:function name="ahf:parser-runner" as="element()+"
              saxon:memo-function="yes"
              xmlns:saxon="http://saxon.sf.net/">
  <xsl:param name="input" as="xs:string" />

  ...
</xsl:function>

saxon:memo-function is used to avoid parsing the same expression over and over again just to return the same result. There are three main reasons why this is useful: the desire for a consistent appearance in the formatted result means that a lot of the property values are repeated throughout the XSL-FO document; XSL-FO documents are usually generated using XSLT, so generating the same property values for the same element type is easy and happens a lot; and, lastly, most property values in the XSL-FO XML are single tokens rather than complex expressions, so running the parser on a single token that’s been seen before adds a lot of overhead compared to just returning the result that saxon:memo-function has cached.

To test the effect of saxon:memo-function, I validated an 808 kB XSL-FO document in oXygen 18.1 using focheck that alternately had saxon:memo-function enabled and disabled.  To avoid any influence from oXygen possibly caching the XSLT stylesheet that implements the parser, oXygen was restarted each time the stylesheet was changed.  With saxon:memo-function="yes", the Schematron component of validating the document took a minimum of 2 seconds; with saxon:memo-function="no", the Schematron component took a minimum of 5 seconds.

saxon:memo-function applies only to xsl:function.  If you are doing a lot of the same tests but don’t have an xsl:function for the tests to which you could add saxon:memo-function, then you might want to add an xsl:function just so that you can add saxon:memo-function to it.  As before, however, you need to test with your own documents to determine whether or not that’s useful to you.

Consider a DITA-related Schematron pattern such as:

<sch:pattern>
  <sch:rule context="*[contains(@class, ' custom/paragraph ')]">
    <sch:extends rule="custom-paragraph"/>
  </sch:rule>
  <sch:rule context="*[contains(@class, ' topic/p ')]">
    <sch:extends rule="topic-p"/>
  </sch:rule>

  ...
</sch:pattern>

There’s potentially a lot of string matching going on to determine which rule applies to an element.  If you think that there is enough repeated testing of the same values to make it worthwhile to use saxon:memo-function, then you could change the pattern to:

<sch:pattern>
  <sch:rule context="*[my:contains(@class, ' custom/paragraph ')]">
    <sch:extends rule="custom-paragraph"/>
  </sch:rule>
  <sch:rule context="*[my:contains(@class, ' topic/p ')]">
    <sch:extends rule="topic-p"/>
  </sch:rule>

  ...
</sch:pattern>

where my:contains() is:

<xsl:function name="my:contains" as="xs:boolean"
              saxon:memo-function="yes"
              xmlns:saxon="http://saxon.sf.net/">
  <xsl:param name="class" as="xs:string" />
  <xsl:param name="specialisation" as="xs:string" />

  <xsl:sequence select="contains($class, $specialisation)" />
</xsl:function>
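For readers who have not embedded XSLT functions in a schema before, here is a minimal sketch of how the pieces might fit together in one file. It makes several assumptions: the my: namespace URI is made up, the assertion is a placeholder in place of the extended abstract rules, and the Schematron implementation must be one that passes foreign xsl:function elements through to the generated stylesheet (with the ISO skeleton implementation this typically means enabling its allow-foreign parameter).

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:xs="http://www.w3.org/2001/XMLSchema"
            xmlns:saxon="http://saxon.sf.net/"
            xmlns:my="http://example.com/ns/my"
            queryBinding="xslt2">

  <!-- Declares the (hypothetical) my: prefix for use in rule contexts and tests. -->
  <sch:ns prefix="my" uri="http://example.com/ns/my"/>

  <!-- Foreign XSLT element: only reaches the generated stylesheet if the
       Schematron implementation is told to allow foreign elements. -->
  <xsl:function name="my:contains" as="xs:boolean"
                saxon:memo-function="yes">
    <xsl:param name="class" as="xs:string"/>
    <xsl:param name="specialisation" as="xs:string"/>
    <xsl:sequence select="contains($class, $specialisation)"/>
  </xsl:function>

  <sch:pattern>
    <sch:rule context="*[my:contains(@class, ' topic/p ')]">
      <!-- Placeholder assertion, for illustration only. -->
      <sch:assert test="normalize-space(.) or *">A paragraph should not be empty.</sch:assert>
    </sch:rule>
  </sch:pattern>

</sch:schema>
```

Note that the xs:string parameter type assumes every element being tested carries a @class attribute, as it does in DITA once the DTD or schema defaults have been applied. On a processor that ignores the saxon: attribute, the function should still behave as plain contains(), just without the caching.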

Memoization is most applicable (perhaps only useful) where your function is called many times with the same sets of parameter values, so that the overhead of checking and caching is less than the processing time you save by just returning the known value on the second and subsequent times that the function is called with those values.
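That trade-off can be put as a back-of-envelope inequality. This is a simplified model, not anything taken from Saxon: the symbols are introduced here purely for illustration, with N the total number of calls, D the number of distinct sets of parameter values, c_eval the cost of evaluating the function body once, and c_lookup the per-call cost of checking (and, on a miss, updating) the cache.

```latex
\[
\underbrace{N\,c_{\text{lookup}} + D\,c_{\text{eval}}}_{\text{with memoization}}
\;<\;
\underbrace{N\,c_{\text{eval}}}_{\text{without memoization}}
\quad\Longleftrightarrow\quad
\frac{c_{\text{lookup}}}{c_{\text{eval}}} \;<\; 1 - \frac{D}{N}
\]
```

So when almost every call uses new values (D close to N), the right-hand side approaches zero and no realistic lookup cost is small enough to pay for itself. When a handful of values repeat many times, as with the focheck property values, a fairly expensive lookup can still win, provided evaluating the function is costlier than the lookup.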

Now, if Saxon uses a string comparison rather than, say, a hash to find previously used string parameters, then my DITA example would make things slower because Saxon would be checking to the end of both strings every time instead of contains() returning as soon as it found the second string inside the first.

For a document that contains a large enough number of paragraphs, the overhead added by saxon:memo-function could be outweighed by the saving in not performing as many string comparisons, but for smaller documents containing fewer paragraphs, there might be no advantage.  Whether or not saxon:memo-function can speed up your Schematron processing is something that you’ll have to determine for yourself.