Is XML only half finished? The X Refactor

Posted on February 13, 2018 by Rick Jelliffe

The W3C standard for XML is now 20 years old. I sent the original of this post to the XML-DEV mailing list, suggesting a different vision for XML: reconstruct SGML’s power, but as a definite pipeline of simpler stages, without DTDs or the SGML Declaration. (This version: 2018-02-13)

Where is XML thriving? 

  • Industrial document production using semantic markup: the traditional SGML market of high complexity, multi-publishing, and long life span.
  • Desktop office formats using XML-in-ZIP: ODF and OOXML
  • Data that has mixed content
  • Data that absolutely needs schema validation as it goes through a network of processing: such as HL7

Where is XML not thriving?

  • SGML on the Web
  • Configuration files (property files/JSON/YAML/etc)
  • Minimal markup (markdown wins)
  • Where the complexity of XSD etc. does not provide a targeted enough bang per buck.

What have been the pain points for XML over the years?

  • Inefficiency.  Processing pipelines often need to reparse the document. All processing is done post-parse, and consequently XML pipelines are unnecessarily heavyweight.
  • It is not minimal. An even tinier version might be faster, easier to implement, neater.
  • Maths users need named entities.
  • Domain-specific syntaxes show that terseness should not be dismissed as being of minimal importance.
  • DTDs were simplified and made optional on the expectation that a downstream process could then do the validation: however, only Schematron has a standard format for reporting types (linked by XPath) and errors. There is no standard XML format for the XSD PSVI.
  • Namespaces are crippled because they are not versioned, and processing software does not support wildcarding or remapping. Consequently, changing the version of the namespace breaks software.
  • XML element or attribute names, PIs and comments containing non-ASCII characters may be corrupted by transcoding out of Unicode.
  • A bit too strict on syntax errors.
  • XSD tried to elaborate every possible use of parameter entities and tease them out into separate facilities.  It did not reconstruct several major ones, notably conditional sections. This has the consequence of reducing XML’s efficiency as a formatting language.
  • XInclude only partly reconstructed the functionality of general entities.
  • The XML specification sections on how to represent “<” in entity declarations gives me a headache.
  • Little domain-specific languages have not gone away: we have dates, TeX maths, URLs, JSON, CSV, and so on.
  • XSLT is becoming more complex and full-featured, with the result that there must be fewer complete implementations. Because there is nowhere else to go, it has needed to add support for JSON, streaming and database-query-influenced XPaths.

So… is there a way to address these pain points and evolve XML?  I think there is: claw back many features lost from SGML, while keeping a neat, simple pipeline that causes the least disruption to current APIs. Here is what I am thinking. XML would evolve into a notional pipeline of up to five steps: XML Macro Processor, Fully Resolved XML Processor, Notation Expander, Validation Processing, and Decorating Post-Processor.  Let’s call it “The X Refactor”.

http://schematron.com/wp-content/uploads/2018/02/X-Refactor.png

What it gives is an XML-like language with more SGML features:

  • Start with standalone, DTD-less XML.
  • Build in the standard MathML/ISO public entity sets
  • Support namespace mapping, to cope with versioning better, using ISO NVDL
  • Allow modern validation with ISO DSDL or XSD
  • Allow conditional marked sections, a superset of SGML, with m4-sized logic expressions, for versioning and customizable documents
  • Allow embedded “little languages” converted to XML infoset, a better version of SGML’s SHORT-REF
  • Allow value-defaulting and post-validation augmentation of the info set, similar to SGML LINK PROCESS
  • Allow more flexibility in tagging and error recovery, in particular end-tag implication and close-delimiter omission.
  • Allow character references in markup.
  • All meta files would use XML element and PI syntax.
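To make that last point concrete, here is a rough sketch of what one such meta file, an Entity Declaration Document, might look like. The element names and namespace are placeholders for illustration only, not a proposal:

<!-- Sketch only: an Entity Declaration Document mapping entity names
     to replacement text or to external files. -->
<entity-declarations xmlns="urn:example:x-refactor">
  <entity name="copy" text="©"/>
  <entity name="nbsp" text="&#xA0;"/>
  <entity name="legal-note" system="shared/legal-note.xml"/>
</entity-declarations>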

1) “XML Macro Processor”  A full-featured macro processor, taking the features of m4: text substitution, file insertion, conditional text.  Just before the advent of XML, Dave Peterson had proposed to the ISO committee enhancing the marked section mechanism with better conditional logic (not, and, or, etc.), so this is not a left-field idea.  (This is an enhanced, standalone version of what SGML calls the “Entity Manager”.)

    Suggestion:  Input: bytes. Output: fully resolved XML Unicode text.

  •  Read XML header and handle transcoding to Unicode.
  •  Parse for <![ and the subsequent ]> (or ]]>) on a stack, and perform macro expansion and interpretation: i.e. strip the DOCTYPE declaration, perform inclusions, do not pass on ignored sections, and convert data in CDATA sections or CDATA entities to text using numeric character references.  The values of marked-section variables (which look like PE references, i.e. %aaaaa;) do not have their definitions taken from the prolog but must be provided out of band, i.e. as an invocation configuration.  (This is a “hygienic” macro processor: because macros cannot be defined in the document, there is no risk of complicated meta-macro hacking.)
  • Expand general entity references to direct Unicode characters. Entity references (which look like general entity references, i.e. &aaaa;) are not defined in the prolog but must be provided out of band in some Entity Declaration Document. The standard ISO/MathML entity sets are predefined. (For safety, a “<”, “</”, “<?” etc. at the end of an entity, or a “>”, “/>” etc. at the start of an entity, would always be converted to the built-in named character reference: this would reduce the chances of strange hacks?)
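As a sketch of what the out-of-band invocation configuration might look like (again, the vocabulary here is hypothetical), it could supply the marked-section variables and point at an Entity Declaration Document:

<!-- Sketch only: invocation configuration for the macro processor. -->
<macro-invocation xmlns="urn:example:x-refactor">
  <variable name="old" value="true"/>
  <variable name="teen" value="false"/>
  <entity-declarations href="entities.xml"/>
</macro-invocation>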

Benefits: 

  • Allows major simplification of the XML processor.
  • Supports lightweight, customizable documents, without having to load a whole document tree.
  • Reconstructs SGML’s marked section mechanism
  • Removes the vexed issue of people who want to use XSD and named character references
  • Optionally supports “;” omissibility on entity and numeric character references, a la SGML.
  • Documents can be transcoded without corrupting non-ASCII characters in names, PIs and comments.
  • The macro processor removes the need for parameter entities, because it can be used on a schema or other XML document. And it provides a way of customizing schemas using a general mechanism.

Incompatibilities:

  • Entity and numeric character references will be recognized where they currently are not.
  • Edge cases will exist, such as where an attribute value contains <![, which will now be recognized.
  • Marked sections are not defined as being synchronized with element tags, which could allow various hacking problems. (Implementations are not required to support marked sections in attributes, or marked sections that are asynchronous with the element tagging; such markup is deprecated and unsafe.)

2) Fully Resolved XML Processor  A stripped-back XML processor without encoding handling, DOCTYPE declarations, CDATA sections, entity references or numeric character references.

    Suggestion: Input: Unicode text.  Output: XML event stream.

  •  Recognize start-tags, end-tags, comments and PIs.
  •  As error handling, may allow STAGC omission like SGML and HTML: <p<b>
  •  As error handling, may allow start- and end-tag implication, using an Error Handling Specification document, like SGML and HTML.
  • An entity reference would be an “undeclared entity” error.
  • A numeric character reference would be accepted but generate a warning.
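As a sketch only, an Error Handling Specification document might look like this (the vocabulary is invented here), mirroring the <?require?> processing instructions used in the worked example later in this post:

<!-- Sketch only: an Error Handling Specification for tag implication. -->
<error-handling xmlns="urn:example:x-refactor">
  <imply-end-tag elements="p chapter b"/>
  <imply-start-tag context="div/p/text()"/>
</error-handling>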

  Benefits:

  • The input is the ultra-minimal XML that some have been calling for.  Rather than “simplifying XML” by abandoning docheads, we refactor XML to support both docheads and people who want a minimal XML.
  • Conforming subset of current XML
  •  Compatible with SAX

 Incompatibilities:

  • Allowing minimization and tag implication may be an incompatibility, but it would be an error-handling feature that does not need to be enabled.

3) Notation Expander  Process the contents of some element and replace delimiters with tags. The processor uses a Notation Definition Specification, which uses regular expressions and reuses the same tag-implication fixup as the Error Handling Specification of the Fully Resolved XML Processor above.  The elements generated are synchronized with the containing element. Element markup inside the notation is allowed or rejected (as a kind of validation).  Specialist notation processors are also possible: namely for JSON, for QuickFixes (Schematron parse and fixup), and to reconstruct the XML SHORTREF mechanism.  Stretching it a bit, HTML5-style element hoisting might go in this stage too.

    Suggestion: Input: XML event stream.  Output: XML event stream.

  Benefits:

  • This reconstructs the idea of the SGML SHORTREF → entity reference → markup mechanism in XML, where in a given context you can define that a character like * should be shorthand for entity reference &XXX;, and that this entity could contain a start-tag <XXX> which would then be closed off by implication, or explicitly, or by some other short-reffed character.

  •  Short refs had three kinds of use cases:
    •   The first was for repetitive tabular data, such as CSV, where the newline and , or | characters could be expanded and recognized.  This use case would be supported.
    •  The second was for embedded little languages, for example for mathematical notation. However, the absence of a mechanism to declare infix shortrefs meant that this was crippled.  This use case would be supported.
    •  The third was for markdown-style markup. This is not a supported use case, as there is a thriving markdown ecosystem and community doing fine without it, and because of the issue of double delimiting.
  • Support some simple parsing tasks that otherwise might require heavyweight XSLT, but do it within a more targeted regex framework.
  •  Compatible with SAX
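To illustrate the first (tabular) use case, a Notation Definition Specification might be sketched like this; the vocabulary is invented for illustration only:

<!-- Sketch only: a CSV-like notation, where regular expressions match the
     delimiters and tag implication closes the generated elements. -->
<notation name="text/csv" xmlns="urn:example:x-refactor">
  <token match="\r?\n" start-tag="row" implies-end-of="cell row"/>
  <token match="," start-tag="cell" implies-end-of="cell"/>
  <element-markup allowed="no"/>
</notation>

A processor reading this would turn each line of the element’s content into a row element containing cell elements, closed off by implication, very much in the spirit of SHORTREF.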

4) Validation Processing

    Suggestion: Input: XML event stream.  Output: enhanced XML event stream (PSVI), or [XML input stream + XML validation report language].

This can use any subsequent DTD stage, or XSD, or any combination of the DSDL family (RELAX NG, Schematron, CRDL for character range validation, NVDL for namespace remapping, and so on).

  Benefits:

  • The technology for this part of the tool chain is available.
  • Except that there needs to be an “XML” output from validation: consequently either a type-enhanced standard SAX (for a Post Schema Validation Infoset), or a dual stream of the input plus an event stream of the validation report, linking properties and errors to the original document (i.e. ISO SVRL).

5) Decorating Post-Processor  This would perform simple transformations: streamable insertions into the event stream.  (It could also be run before validation if needed.)

    Suggestion: Input: (enhanced) XML event stream.  Output: (enhanced) XML event stream.

  Benefits:

  •  Support attribute defaulting, taking over from DTDs.  RELAX NG and Schematron per se do not alter the document stream.
  •  Reconstruct the LINK feature of SGML, which allows bulk addition of attributes (such as formatter properties), reducing the attributes that need to be marked up or put in the schema. It allows process-dependent attributes to be added on the fly.
  • Supports feature extraction and markup.  For example, a Schematron processor could be made that injects into the event stream extra attributes based on the new sch:property capability of Schematron 2015.
  • Support some simple decoration tasks that otherwise might require heavyweight XSLT.
  • Compatible with SAX
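As a sketch of the kind of meta file the post-processor could read (the vocabulary is hypothetical), a LINK-style decoration specification might look like this; it would produce the attributes shown in the final stage of the worked example below:

<!-- Sketch only: a decoration specification for bulk, process-specific attributes. -->
<decoration xmlns="urn:example:x-refactor">
  <add-attributes match="chapter">
    <attribute name="customizedFor" value="oldies"/>
    <attribute name="date" value="2018-02-12"/>
  </add-attributes>
  <default-attribute match="p" name="align" value="left"/>
</decoration>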

What would it take?

1) Split an XML processor into two parts. Dump DOCTYPE processing. Define and add marked-section logic expressions (AND | OR | etc.) to the macro processor. Implement it as a text pipe or as an InputStream. Add the error recovery. (An existing XML processor will accept Fully Resolved XML as is.)

2) Make a generic notation processor (annotated BNF + tag implication); a standard language should be adopted. Make specialist processors for math and XML QuickFix. Allow invocation either by a PI as the first child of the parent element to flag the notation, or by some config file. Implement as a text pipeline or SAX stream processor.

3) Validation technology exists. But how to sequence it is an open question (one that DSDL punted): please, not XProc. But does SAX support the PSVI?

4) A simple streaming substitution language would be trivial to define and implement as a SAX stream. It would be a processing decision to add this, but there is no harm in notating it with a PI. A standard language should be adopted.

So I don’t see this as very disruptive, at the API level.

Afterthought: 20 years ago, when we were chopping up SGML to formulate XML, the thought was that we could afford to remove much useful functionality either because (as with schemas) it could be upgraded into a different stage in the pipeline, or because (as with conditional marked sections) it was a back-end task suited to servers rather than to the wire format (SGML-on-the-Web). We left the job unfinished: the pipeline is incomplete, and the back-end uses turned out to be the main use case and have been neglected. The aim is not to reconstruct all of SGML, and certainly not to make a monolithic system with lots of feedback: we don’t need an SGML Declaration 2.0!  But I suggest that filling out the pipeline would support many use cases.

Example

An example is always useful.

<?xml version="2.0" encoding="Big5" ?>
      <?require validation="no" ?>
      <?require imply-end-tag="p chapter b" ?>
      <?require imply-start-tag="div/p/text()" ?>
<chapter>
<div>
<p>This is some <b>text</p>
Looks like HTML &alpha;</div>
<![ %old [ <p>Hello oldies</p>]]>
<![ not(%old) or %teen [ <p>Hello kids</p> ]]>
<maths>
      <?require notation="text/tex"?>
\begin{document}
\begin{align*}
\[\int _0^\infty f\mathord(x+y\mathord) dx = |z|~.\]
\end{align*}
\end{document}
</maths>

The macro processor would expand this (assuming %old was provided as true()) to

<?xml version="2.0" encoding="utf-8" ?>
      <?require validation="no" ?>
      <?require imply-end-tag="p chapter b" ?>
      <?require imply-start-tag="div/p/text()" ?>
<chapter>
<div>
<p>This is some <b>text</p>
Looks like HTML α</div>
<p>Hello oldies</p>
<maths>
      <?require notation="text/tex"?>
\begin{document}
\begin{align*}
\[\int _0^\infty f\mathord(x+y\mathord) dx = |z|~.\]
\end{align*}
\end{document}
</maths>

The Fully Resolved XML Parser would parse this as if it were:

<?xml version="2.0" encoding="utf-8" ?>
      <?require validation="no" ?>
<chapter>
<div>
<p>This is some <b>text</b></p>
<p>Looks like HTML α</p></div>
<p>Hello oldies</p>
<maths>
      <?require notation="text/tex"?>
\begin{document}
\begin{align*}
\[\int _0^\infty f\mathord(x+y\mathord) dx = |z|~.\]
\end{align*}
\end{document}
</maths>
</chapter>

(The TeX fragment above renders as the integral from 0 to ∞ of f(x+y) dx = |z|.)

The notation expander processor could expand the TeX to MathML (as an example), as if it were:

<?xml version="2.0" encoding="utf-8" ?>
<chapter>
<div>
<p>This is some <b>text</b></p>
<p>Looks like HTML α</p></div>
<p>Hello oldies</p>
<maths>
<math mode='display' xmlns='http://www.w3.org/1998/Math/MathML'>
  <mrow>
   <msubsup><mo>&#x0222B;</mo><mn>0</mn> <mi>&#x0221E;</mi></msubsup>
   <mi>f</mi><mo>(</mo><mi>x</mi><mo>+</mo><mi>y</mi><mo>)</mo> <mi>d</mi><mi>x</mi><mo>=</mo>
   <mrow><mo>|</mo><mi>z</mi><mo>|</mo></mrow>
   <mspace width='3.33333pt'/><mo>.</mo>
  </mrow>
</math>
</maths>
</chapter>

The validation processor would see that no validation is required in the processing chain, and strip the PI.

Finally the Post Processor could add some attributes, not needing to adjust the schema to get them, as if it were:

<?xml version="2.0" encoding="utf-8" ?>
<chapter customizedFor="oldies" date="2018-02-12" >
<div>
<p>This is some <b>text</b></p>
<p>Looks like HTML α</p></div>
<p>Hello oldies</p>
<maths>
<math mode='display' xmlns='http://www.w3.org/1998/Math/MathML'>
  <mrow>
   <msubsup><mo>&#x0222B;</mo><mn>0</mn> <mi>&#x0221E;</mi></msubsup>
   <mi>f</mi><mo>(</mo><mi>x</mi><mo>+</mo><mi>y</mi><mo>)</mo> <mi>d</mi><mi>x</mi><mo>=</mo>
   <mrow><mo>|</mo><mi>z</mi><mo>|</mo></mrow>
   <mspace width='3.33333pt'/><mo>.</mo>
  </mrow>
</math>
</maths>
</chapter>

While all the examples have an XML form, I would expect that, after the Fully Resolved XML Processor, the pipeline would just use an API, in particular plain SAX and a PSVI-enhanced SAX. All the stages would be expected to be streaming, not relying on a tree being held in memory.

Grammar for Macro processor

ENTITY ::=  XML_DECL   BODY
BODY ::=  ( CHARS | EREF | NCREF | HNCREF | SECTION )*
SECTION ::=  SECTION_START  BODY SECTION_END
EREF ::=  "&"  NAME ";"?
NCREF ::=  "&#"   NUMBER ";"?
HNCREF ::= "&#x"  HEX ";"?
SECTION_START ::=  "<!["  s* ( KEYWORD | EXPRESSION ) s* "["
SECTION_END ::=  "]]>"
KEYWORD ::=  "INCLUDE" | "IGNORE" | "TEMP" | "CDATA"
CHARS ::= ...
NAME ::= ...
NUMBER ::= ...
HEX ::= ...
XML_DECL ::= ...
EXPRESSION ::= ...

Grammar for Fully Resolved XML Processor

This grammar is ambiguous for elements if tag implication and delimiter omission are in effect.  The default operation is that the “?” markers are not in effect.

ENTITY ::=  XML_DECL   BODY
BODY ::=  ( CHARS | CREF  | ELEMENT | EMPTY_TAG | COMMENT | PI)*
ELEMENT ::=  START_TAG?  BODY? END_TAG?
CREF ::=  "&" ("lt" | "gt" | "apos" | "quot") ";"? 
START_TAG ::=  "<" NAME s* ( ">" ? | (ATTRIBUTE* s* ">"?)) 
EMPTY_TAG ::=  "<" NAME s* ( "/>" ? | (ATTRIBUTE* s* "/>"?)) 
END_TAG ::=  "</" NAME s* ">"?
ATTRIBUTE ::=  s+ NAME s* ("=" s* (LIT | LITA | NAME))? 
CHARS ::= ...
NAME ::= ...
LIT ::= ...
LITA ::= ...
XML_DECL ::= ...
COMMENT ::= ... 
PI ::= ...

Grammar for marked section conditionals. This is just a subset of ECMAScript.

EXPRESSION ::=     Expression
PrimaryExpression ::=     ( ( "(" Expression ")" ) | Identifier | Literal )
Literal ::=     ( <DECIMAL_LITERAL> | <HEX_INTEGER_LITERAL> | <STRING_LITERAL> | <BOOLEAN_LITERAL> | <NULL_LITERAL> | <REGULAR_EXPRESSION_LITERAL> )
Identifier ::=     <IDENTIFIER_NAME>
 
UnaryExpression ::=     ( PrimaryExpression | ( UnaryOperator UnaryExpression )+ )
UnaryOperator ::=     (  "+" | "-" | "~" | "!" )
MultiplicativeExpression ::=     UnaryExpression ( MultiplicativeOperator UnaryExpression )*
MultiplicativeOperator ::=     ( "*" | "/" | "%" )
AdditiveExpression ::=     MultiplicativeExpression ( AdditiveOperator MultiplicativeExpression )*
AdditiveOperator ::=     ( "+" | "-" )
ShiftExpression ::=     AdditiveExpression ( ShiftOperator AdditiveExpression )*
ShiftOperator ::=     ( "<<" | ">>" | ">>>" )
RelationalExpression ::=     ShiftExpression ( RelationalOperator ShiftExpression )*
RelationalOperator ::=     ( "<" | ">" | "<=" | ">=" | "instanceof" | "in" )
 
EqualityExpression ::=     RelationalExpression ( EqualityOperator RelationalExpression )*
EqualityOperator ::=     ( "==" | "!=" | "===" | "!==" )
LogicalANDExpression ::=     EqualityExpression ( LogicalANDOperator EqualityExpression )*
LogicalANDOperator ::=     "&&"
LogicalORExpression ::=     LogicalANDExpression ( LogicalOROperator LogicalANDExpression )*
LogicalOROperator ::=     "||"
ConditionalExpression ::=     LogicalORExpression ( "?" AssignmentExpression ":" AssignmentExpression )?
AssignmentExpression ::=     ( LeftHandSideExpression AssignmentOperator AssignmentExpression | ConditionalExpression )
AssignmentOperator ::=     ( "=" | "*=" | "/=" | "%=" | "+=" | "-=" | "<<=" | ">>=" | ">>>=" | "&=" | "^=" | "|=" )
Expression ::=     AssignmentExpression ( "," AssignmentExpression )*
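For illustration, a marked section using this expression subset might look like this (the variable names are made up):

<![ %edition == "print" && !%draft [
<p>This paragraph appears only in non-draft print editions.</p>
]]>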