Feature Grammars – extracting general features of documents

Posted on November 25, 2016 by Rick Jelliffe

[Update 2021-10-22]

What do I mean by a feature? I mean some simple pattern found in a document’s markup that can guide how the document should be individually processed. A feature in a tax return document might be that it is a late return; two features of a court judgment document might be that it comes from a particular court but does not use the styles we expect from that court.

So I have made a little schema/reporting language, Feature Grammars, as a tool for feature extraction and representation in XML. It is simpler than Schematron report elements or @flag attributes, and perhaps more efficiently implementable. It could also be used as a tool to refactor complex transformation systems by bringing some of the logic out into a model: an extra layer in the processing.

It has a few novel (are they?) features:

  • It allows grammar-like modeling of the hierarchical feature set combinations in the document, and reporting as a feature tree (rather than a feature vector) that says “we looked for this feature because we found that feature.”
  • It uses XPath for feature detection.
  • It supports feature detection in multiple document types, with alternative detectors for each feature attaching to the same grammar.
  • It follows the same architecture as Schematron, using an XSLT script to generate XSLT code, and indeed could be used to preprocess documents for Schematron or to post-process the SVRL. However, it is not intended as a schema language as such.

An open-source proof-of-concept prototype implementation is at:
https://bitbucket.org/CharmOnset/featuregrammars

For more details, see Feature Grammars (PDF)

Feedback and improvements welcome!

It grew out of some years of thinking about the remaining gaps in Schematron and XML query languages, and out of some experience with very large corpora where dozens of different data sources (indeed, scores of sources, over time) fed very different documents to be converted to a common kitchen-sink transitional DTD. Having a common schema made it look as if the work had been reduced to an N:1 problem, but this disguised the fact that the documents were clustered into, in effect, different discrete languages depending on their source.

So a Feature Grammar schema is like a grammar where both sides of the production are non-terminals, unlike DTDs where both sides of the productions contain element names. The terminals are XPath expressions. In the following example, we have a feature called document (which will be true if we find /document in our input document) and which then expects the features named in its model:

<feature name="document"
          model="AU-document | NZ-document?" note="root">
     <find in="input" match="/document"/>
</feature>

<feature name="AU-document"
          model="#EMPTY">
      <find in="input" match="/*/@country='AU'" />
</feature>
          

<feature name="NZ-document"
          model="#EMPTY">
      <find in="input" match="/*/@country='NZ'" />
</feature>

What this says is that if we find a top-level /document element in our document "input", then this matches the feature "document", which in turn enables the features named in its model, AU-document and NZ-document (defined by the two features that follow it). Then we look to see whether AU-document or NZ-document matches (or neither).
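
To make the evaluation order concrete, here is a hand-written XSLT 1.0 sketch of roughly the logic that the three features above describe. It is only an illustration of the idea: it is not the code the prototype generates, and it assumes that the stylesheet's source document plays the role of "input".

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: NOT the output of the prototype generator. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <!-- Feature "document": found when the root element is named document -->
    <xsl:if test="/document">
      <document>
        <!-- The features named in the model are tested only
             once the parent feature has been found -->
        <xsl:if test="/*/@country='AU'">
          <AU-document/>
        </xsl:if>
        <xsl:if test="/*/@country='NZ'">
          <NZ-document/>
        </xsl:if>
      </document>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Run against a source whose root is a document element with country="AU", this sketch emits a document element containing an empty AU-document element. Run against a source with any other root element, it emits nothing, because the sub-features are never tested.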

Feature plan

Just as Schematron may produce SVRL, Feature Grammars may produce a Feature Plan: this is a simple XML document that uses the names of the features found as its element names, with any features found underneath them nested inside. So it might contain:

<document><AU-document/></document>
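
For comparison, here is what the New Zealand case might look like. Both fragments are assumptions made for illustration: the input document is hypothetical, and the plan is simply inferred by analogy with the AU plan above rather than taken from the prototype's output.

```xml
<!-- Hypothetical input document (two separate fragments are shown
     in this block; together they are not one well-formed document) -->
<document country="NZ">
  <title>Example judgment</title>
</document>

<!-- Feature Plan one would expect, by analogy with the AU case -->
<document><NZ-document/></document>
```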

Markup-independent feature extraction

To extend the idea further: the same feature grammar can apply to documents with completely different element types. For example, in the following case we have a document called “input” that will have one set of XPaths, and another called “output” that will have another. So this might be a coarse unit test on a transformation: the features found in the document coming into a transformation should all still be found in the document after transformation.

<feature name="document"
model="AU-document | NZ-document" note="root">
<find in="input" match="/document"/>
<find in="output" match="/book"/>
</feature>
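
The sub-features can be given a second detector in the same way. The sketch below reuses the find syntax shown above. The XPath for the output document is hypothetical: it assumes that the output vocabulary records the country in a jurisdiction attribute on the root book element.

```xml
<!-- Hypothetical: the "jurisdiction" attribute is an assumption
     about the output vocabulary, chosen only for illustration. -->
<feature name="AU-document"
          model="#EMPTY">
      <find in="input" match="/*/@country='AU'" />
      <find in="output" match="/book/@jurisdiction='AU'" />
</feature>
```

With detectors like these on every feature, the coarse unit test amounts to generating a Feature Plan from each of the two documents and checking that the features found in the input are also found in the output.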