The W3C Standard for XML is now 20 years old. I sent the original of this post to the XML-DEV mailing list, suggesting a different vision for XML: reconstruct SGML’s power as a definite pipeline of simpler stages, without DTDs or an SGML Declaration. (This version: 2018-02-13)
Where is XML thriving?
- Industrial document production using semantic markup: the traditional SGML market of high complexity, multi-publishing, and long life span.
- Desktop office formats using XML-in-ZIP: ODF and OOXML
- Data that has mixed content
- Data that absolutely needs schema validation as it goes through a network of processing: such as HL7
Where is XML not thriving?
- SGML on the Web
- Configuration files (property files/JSON/YAML/etc)
- Minimal markup (markdown wins)
- Where the complexity of XSD etc. does not provide a targeted enough bang per buck.
- Inefficiency. Processing pipelines often need to reparse the document. All processing is done post-parse, and consequently XML pipelines are unnecessarily heavyweight.
- It is not minimal. An even tinier version might be faster, easier to implement, neater.
- Maths users need named entities.
- Domain Specific syntaxes show that terseness should not be dismissed as being of minimal importance.
- DTDs were simplified and made optional on the expectation that a downstream process could then do the validation: however, only Schematron has a standard format for reporting types (linked by XPath) and errors. There is no standard XML format for the XSD PSVI.
- Namespaces are crippled because they are not versioned, and processing software does not support wildcarding or remapping. Consequently, changing the version of the namespace breaks software.
- XML element or attribute names, PIs and comment strings that contain non-ASCII characters may be corrupted by transcoding out of Unicode.
- A bit too strict on syntax errors.
- XSD tried to elaborate every possible use of parameter entities and tease them out into separate facilities. It did not reconstruct several major ones, notably conditional sections. This has the consequence of reducing XML’s efficiency as a formatting language.
- XInclude only partly reconstructed the functionality of general entities.
- The XML specification sections on how to represent “<” in entity declarations gives me a headache.
- Little domain specific languages have not gone away: we have dates, TeX maths, URLs, JSON, CSV, and so on.
- XSLT is becoming more complex and full-featured, with the result that there must be fewer complete implementations. Because there is nowhere else to go, it has needed to add support for JSON, streaming and database-query-influenced XPaths.

- Start with standalone, DTD-less XML.
- Build in the standard MathML/ISO Public Entity sets
- Support namespace mapping, to cope with versioning better, using ISO NVRL
- Allow modern validation with ISO DSDL or XSD
- Allow conditional marked sections, a superset of SGML, with m4-sized logic expressions, for versioning and customizable documents
- Allow embedded “little languages” converted to XML infoset, a better version of SGML’s SHORT-REF
- Allow value-defaulting and post-validation augmentation of the info set, similar to SGML LINK PROCESS
- Allow more flexibility in tagging and error recovery, in particular end-tag implication and close delimiter omission.
- Allow character references in markup.
- All meta files would use XML element and PI syntax.
1) “XML Macro Processor”: a full-featured macro processor, taking the features of M4: text substitution, file insertion, conditional text. Just before the advent of XML, Dave Peterson had proposed to the ISO committee enhancing the marked section mechanism with better conditional logic (not, and, or, etc.), so this is not a left-field idea. (This is an enhanced standalone version of what SGML calls the “Entity Manager”.)
- Read XML header and handle transcoding to Unicode.
- Parse for <![ and subsequent ]> (or ]]>) on a stack, and perform macro expansion and interpretation: i.e. strip the DOCTYPE declaration, perform inclusions, don’t pass on ignored sections, and convert data in CDATA sections or CDATA entities to text with numeric character references. The variables of marked sections (while looking like PE references, i.e. %aaaaa;) do not have their definitions taken directly from the prolog but must be provided out of band, i.e. as an invocation configuration. (This is a “hygienic” macro processor, because macros cannot be defined in the document and so there is no risk of complicated meta-macro-hacking.)
- Expand general entity references to direct Unicode characters. Entity references (while looking like general entity references, i.e. &aaaa;) are not defined in the prolog but must be provided out of band in some Entity Declaration Document. The standard ISO/MathML entity sets are predefined. (For safety, a “<”, “</”, “<?” etc. at the end of an entity, or a “>”, “/>” etc. at the start of an entity, would always be converted to the built-in named character reference: this would reduce the chances of strange hacks?)
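The two expansion steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function names, the `variables` and `entities` dictionaries (standing in for the out-of-band invocation configuration and the Entity Declaration Document) and the token patterns are all hypothetical. The condition is limited to a keyword or a single `%name` variable, and transcoding, DOCTYPE stripping and file inclusion are left out.

```python
import re

# Hypothetical token pattern: a section start such as "<![ %old [",
# "<![INCLUDE[" or "<![CDATA[", or the section end "]]>".
SECTION_TOKEN = re.compile(r"<!\[\s*(%?\w+);?\s*\[|\]\]>")
# Entity reference with an optional ";" (SGML-style omissibility).
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);?")
BUILT_IN = {"lt", "gt", "amp", "apos", "quot"}


def section_is_kept(key, variables):
    """Decide whether a section is passed on, using out-of-band values only."""
    if key == "INCLUDE":
        return True
    if key == "IGNORE":
        return False
    if key.startswith("%") and key[1:] in variables:
        return bool(variables[key[1:]])
    raise ValueError(f"no out-of-band value for {key!r}")


def expand_sections(text, variables):
    """Resolve marked sections with a stack; nothing is defined in the document."""
    out, stack, pos = [], [], 0
    while True:
        m = SECTION_TOKEN.search(text, pos)
        if m is None:
            break
        if all(stack):                      # only emit text inside kept sections
            out.append(text[pos:m.start()])
        pos = m.end()
        key = m.group(1)
        if key is None:                     # "]]>" closes the innermost section
            if not stack:
                raise ValueError("unbalanced ]]>")
            stack.pop()
        elif key == "CDATA":                # data becomes text with numeric references
            end = text.find("]]>", pos)
            if end < 0:
                raise ValueError("unterminated CDATA section")
            if all(stack):
                data = text[pos:end]
                out.append(data.replace("&", "&#38;")
                               .replace("<", "&#60;")
                               .replace(">", "&#62;"))
            pos = end + 3
        else:
            stack.append(section_is_kept(key, variables))
    if stack:
        raise ValueError("unterminated marked section")
    out.append(text[pos:])
    return "".join(out)


def expand_entities(text, entities):
    """Replace named references with characters from an out-of-band table."""
    def replace(m):
        name = m.group(1)
        if name in BUILT_IN:                # left for the later XML parser
            return m.group(0)
        if name not in entities:
            raise ValueError(f"undeclared entity {name!r}")
        return entities[name]
    return ENTITY_REF.sub(replace, text)


doc = ("<p>&alpha</p><![ %old [<p>Hello oldies</p>]]>"
       "<![ %teen [<p>Hello kids</p>]]>")
resolved = expand_sections(doc, {"old": True, "teen": False})
print(expand_entities(resolved, {"alpha": "\u03b1"}))
```

Because both functions work on plain text and keep at most a stack of booleans, a real implementation could run them over a stream without building a document tree.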
Benefits:
- Allows major simplification of the XML processor.
- Support lightweight customizable documents, without having to load a whole document tree.
- Reconstructs SGML’s marked section mechanism
- Removes the vexed issue of people who want to use XSD and named character references
- Optionally supports “;” omissibility on entity and numeric character references, a la SGML.
- Documents can be transcoded without corrupting non-ASCII characters in names, PIs and comments.
- The macro processor removes the need for parameter entities, because it can be used on a schema or other XML document. And it provides a way of customizing schemas using a general mechanism.
Incompatibilities:
- Entity and numeric character references will be recognized where they currently are not.
- Edge cases will exist: for example, a <![ inside an attribute value will be recognized as the start of a marked section.
- Marked sections are not defined as synchronised with element tags, which could allow various hacking problems. (Implementations are not required to support marked sections in attributes or marked sections that are asynchronous to the element tagging; such markup is deprecated and unsafe.)
- Recognize start-tags, end-tags, comments and PIs.
- As error handling, may allow STAGC omission like SGML and HTML, e.g. <p<b>
- As error handling, may allow start- and end-tag implication, using an Error Handling Specification document, like SGML and HTML.
- An entity reference would be an error (undeclared entity).
- A numeric character reference would be accepted but generate a warning.
- The input is the ultra minimal XML that some have been calling for. Rather than “simplifying XML” by abandoning docheads, we refactor XML to support both docheads and people wanting a minimal XML.
- Conforming subset of current XML
- Compatible with SAX
- Allowing minimization and tag implication may be an incompatibility, but it would be an error handling feature that does not need to be enabled.
- Short refs had three kinds of use cases:
- first was for repetitive tabular data, such as CSV, where the newline and , or | characters could be expanded and recognized. This use case would be supported.
- second was for embedded little languages, for example for mathematical notation. However, the absence of a mechanism to declare infix shortrefs meant that this was crippled. This use case would be supported.
- third was for markdown-style markup. This is not a supported use case, as there is a thriving markdown ecosystem and community doing fine without it, and because of the issue of double delimiting.
- Support some simple parsing tasks that otherwise might require heavyweight XSLT, but do it within a more targeted regex framework.
- Compatible with SAX
- Support attribute defaulting taking over from DTD. RELAX NG and Schematron per se do not alter the document stream.
- Reconstruct the LINK feature of SGML, that allows bulk addition of attributes (such as formatter properties), reducing the attributes needed to be marked up or in the schema. Allows process-dependent attributes to be added on the fly.
- Supports feature extraction and markup. For example, a Schematron processor could be made that injects into the event stream extra attributes based on the new sch:property capability of Schematron 2015.
- Support some simple decoration tasks that otherwise might require heavyweight XSLT.
- Compatible with SAX
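The attribute defaulting and bulk attribute addition described above fit naturally into a SAX filter. The following is a minimal sketch using Python's standard `xml.sax` module. The class name `AttributeAugmenter`, the helper `augment` and the `additions` table are hypothetical, and an ordinary XML 1.0 parser stands in for the Fully Resolved XML Processor.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class AttributeAugmenter(XMLFilterBase):
    """SAX filter that adds attributes to start-tag events as they stream past."""

    def __init__(self, parent, additions):
        super().__init__(parent)
        # Hypothetical table: element name -> {attribute name: value}
        self._additions = additions

    def startElement(self, name, attrs):
        merged = dict(attrs.items())
        for attr_name, value in self._additions.get(name, {}).items():
            # Defaulting only: an attribute already in the document wins.
            merged.setdefault(attr_name, value)
        super().startElement(name, AttributesImpl(merged))


def augment(xml_text, additions):
    """Run the filter over a document and return the serialized result."""
    out = io.StringIO()
    reader = AttributeAugmenter(xml.sax.make_parser(), additions)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


print(augment(
    '<chapter date="2018-02-12"><p>Hello oldies</p></chapter>',
    {"chapter": {"customizedFor": "oldies", "date": "unknown"}},
))
```

The filter touches only the start-tag events, so it never needs the whole tree in memory, and the schema does not have to know about the added attributes.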
Example
<?xml version="2.0" encoding="Big5" ?>
<?require validation="no" ?>
<?require imply-end-tag="p chapter b" ?>
<?require imply-start-tag="div/p/text()" ?>
<chapter>
<div>
<p>This is some <b>text</p>
Looks like HTML &@alpha</div>
<![ %old [ <p>Hello oldies</p>]]>
<![ not(%old)or %teen [ <p>Hello kids</p> ]]>
<maths>
<?require notation="text/tex"?>
\begin{document}
\begin{align*}
\[\int _0^\infty f\mathord(x+y\mathord) dx = |z|~.\]
\end{align*}
\end{document}
</maths>
The macro processor would expand this (assuming %old was provided as true()) to
<?xml version="2.0" encoding="utf-8" ?>
<?require validation="no" ?>
<?require imply-end-tag="p chapter b" ?>
<?require imply-start-tag="div/p/text()" ?>
<chapter>
<div>
<p>This is some <b>text</p>
Looks like HTML α</div>
<p>Hello oldies</p>
<maths>
<?require notation="text/tex"?>
\begin{document}
\begin{align*}
\[\int _0^\infty f\mathord(x+y\mathord) dx = |z|~.\]
\end{align*}
\end{document}
</maths>
The Fully Resolved XML Parser would parse this as if it were:
<?xml version="2.0" encoding="utf-8" ?>
<?require validation="no" ?>
<chapter>
<div>
<p>This is some <b>text</b></p>
<p>Looks like HTML α</p></div>
<p>Hello oldies</p>
<maths>
<?require notation="text/tex"?>
\begin{document}
\begin{align*}
\[\int _0^\infty f\mathord(x+y\mathord) dx = |z|~.\]
\end{align*}
\end{document}
</maths>
</chapter>
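The end-tag implication step in this example can be sketched with a stack of open elements. This is a rough illustration only: the function name `imply_end_tags` and the regular expression are assumptions, the set of omissible names stands in for the imply-end-tag PI, and start-tag implication, comments, PIs containing tags, and attribute values containing ">" are not handled.

```python
import re

# Simplified tag pattern; it does not match PIs ("<?") or comments ("<!--").
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)([^<>]*)>")


def imply_end_tags(text, omissible):
    """Insert the end-tags that may be omitted for the names in `omissible`."""
    out, stack, pos = [], [], 0
    for m in TAG.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        closing, name, rest = m.groups()
        if not closing:
            out.append(m.group(0))
            if not rest.rstrip().endswith("/"):   # skip empty-element tags
                stack.append(name)
            continue
        if name not in stack:
            raise ValueError(f"unexpected end-tag </{name}>")
        # Close every still-open inner element before the named one.
        while stack[-1] != name:
            inner = stack.pop()
            if inner not in omissible:
                raise ValueError(f"end-tag for <{inner}> may not be omitted")
            out.append(f"</{inner}>")
        stack.pop()
        out.append(m.group(0))
    out.append(text[pos:])
    # At the end of input, close whatever is still open.
    while stack:
        inner = stack.pop()
        if inner not in omissible:
            raise ValueError(f"end-tag for <{inner}> may not be omitted")
        out.append(f"</{inner}>")
    return "".join(out)


print(imply_end_tags(
    "<chapter><div><p>This is some <b>text</p></div>",
    {"p", "chapter", "b"},
))
```

With an empty `omissible` set the function behaves as a strict well-formedness check on nesting, which matches the idea that implication is an error handling feature that does not need to be enabled.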
The notation expander processor could expand the TeX to MathML (as an example) as if it were:
<?xml version="2.0" encoding="utf-8" ?>
<chapter>
<div>
<p>This is some <b>text</b></p>
<p>Looks like HTML α</p></div>
<p>Hello oldies</p>
<maths>
<math mode='display' xmlns='http://www.w3.org/1998/Math/MathML'>
<mrow>
<msubsup><mo>∫</mo><mn>0</mn> <mi>∞</mi></msubsup>
<mi>f</mi><mo>(</mo><mi>x</mi><mo>+</mo><mi>y</mi><mo>)</mo>
<mi>d</mi><mi>x</mi><mo>=</mo>
<mrow><mo>|</mo><mi>z</mi><mo>|</mo></mrow>
<mspace width='3.33333pt'/><mo>.</mo>
</mrow>
</math>
</maths>
</chapter>
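The TeX example covers the embedded little-language use case. The tabular use case (CSV-style data where newline and comma act as short references) could be sketched as follows. This is an assumption-laden illustration: the function name `expand_csv` and the element names `row` and `cell` are hypothetical, and a real notation expander would be driven by a notation PI and emit SAX events rather than text.

```python
import csv
import io
from xml.sax.saxutils import escape


def expand_csv(data, row="row", cell="cell"):
    """Expand CSV text into row/cell elements, one output line per record."""
    lines = []
    for record in csv.reader(io.StringIO(data.strip())):
        if not record:                      # skip blank lines
            continue
        cells = "".join(
            f"<{cell}>{escape(field.strip())}</{cell}>" for field in record
        )
        lines.append(f"<{row}>{cells}</{row}>")
    return "\n".join(lines)


print(expand_csv("name,value\nalpha,1 < 2\n"))
```

Each record is expanded independently, so the expansion can be done line by line as the text arrives.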
The validation processor would see that no validation is required in the processing chain, and strip the PI.
Finally the Post Processor could add some attributes, not needing to adjust the schema to get them, as if it were:
<?xml version="2.0" encoding="utf-8" ?>
<chapter customizedFor="oldies" date="2018-02-12" >
<div>
<p>This is some <b>text</b></p>
<p>Looks like HTML α</p></div>
<p>Hello oldies</p>
<maths>
<math mode='display' xmlns='http://www.w3.org/1998/Math/MathML'>
<mrow>
<msubsup><mo>∫</mo><mn>0</mn> <mi>∞</mi></msubsup>
<mi>f</mi><mo>(</mo><mi>x</mi><mo>+</mo><mi>y</mi><mo>)</mo>
<mi>d</mi><mi>x</mi><mo>=</mo>
<mrow><mo>|</mo><mi>z</mi><mo>|</mo></mrow>
<mspace width='3.33333pt'/><mo>.</mo>
</mrow>
</math>
</maths>
</chapter>
While all the examples have an XML form, I would expect that, after the Fully Resolved XML Processor, the pipeline would just use an API: in particular, plain SAX and a PSVI-enhanced SAX. All the stages would be expected to be streaming, not relying on a tree being held in memory.
Grammar for Macro processor
ENTITY ::= XML_DECL BODY
BODY ::= ( CHARS | EREF | NCREF | HNCREF | SECTION )*
SECTION ::= SECTION_START BODY SECTION_END
EREF ::= "&" NAME ";"?
NCREF ::= "&#" NUMBER ";"?
HNCREF ::= "&#x" HEX ";"?
SECTION_START ::= "<![" s* ( KEYWORD | EXPRESSION ) s* "["
SECTION_END ::= "]]>"
KEYWORD ::= "INCLUDE" | "IGNORE" | "TEMP" | "CDATA"
CHARS ::= ...
NAME ::= ...
NUMBER ::= ...
HEX ::= ...
XML_DECL ::= ...
EXPRESSION ::= ...
Grammar for Fully Resolved XML Processor
This grammar is ambiguous for elements if tag implication and delimiter omission are in effect. By default, the ? optionality markers are not in effect, so tags and delimiters are required.
ENTITY ::= XML_DECL BODY
BODY ::= ( CHARS | CREF | ELEMENT | EMPTY_TAG | COMMENT | PI )*
ELEMENT ::= START_TAG? BODY? END_TAG?
CREF ::= "&" ( "lt" | "gt" | "apos" | "quot" ) ";"?
START_TAG ::= "<" NAME s* ( ">"? | ( ATTRIBUTE* s* ">"? ) )
EMPTY_TAG ::= "<" NAME s* ( "/>"? | ( ATTRIBUTE* s* "/>"? ) )
ATTRIBUTE ::= s+ NAME s* ( "=" s* ( LIT | LITA | NAME ) )?
CHARS ::= ...
NAME ::= ...
LIT ::= ...
LITA ::= ...
XML_DECL ::= ...
COMMENT ::= ...
PI ::= ...
Grammar for marked section conditionals. This is just a subset of ECMAScript.
EXPRESSION ::= ( "(" Expression ")" ) | Literal
Literal ::= ( <DECIMAL_LITERAL> | <HEX_INTEGER_LITERAL> | <STRING_LITERAL> | <BOOLEAN_LITERAL> | <NULL_LITERAL> | <REGULAR_EXPRESSION_LITERAL> )
Identifier ::= <IDENTIFIER_NAME>
UnaryExpression ::= ( Literal | ( UnaryOperator UnaryExpression )+ )
UnaryOperator ::= ( "+" | "-" | "~" | "!" )
MultiplicativeExpression ::= UnaryExpression ( MultiplicativeOperator UnaryExpression )*
MultiplicativeOperator ::= ( "*" | "/" | "%" )
AdditiveExpression ::= MultiplicativeExpression ( AdditiveOperator MultiplicativeExpression )*
AdditiveOperator ::= ( "+" | "-" )
ShiftExpression ::= AdditiveExpression ( ShiftOperator AdditiveExpression )*
ShiftOperator ::= ( "<<" | ">>" | ">>>" )
RelationalExpression ::= ShiftExpression ( RelationalOperator ShiftExpression )*
RelationalOperator ::= ( "<" | ">" | "<=" | ">=" | "instanceof" | "in" )
EqualityExpression ::= RelationalExpression ( EqualityOperator RelationalExpression )*
EqualityOperator ::= ( "==" | "!=" | "===" | "!==" )
LogicalANDExpression ::= EqualityExpression ( LogicalANDOperator EqualityExpression )*
LogicalANDOperator ::= "&&"
LogicalORExpression ::= LogicalANDExpression ( LogicalOROperator LogicalANDExpression )*
LogicalOROperator ::= "||"
ConditionalExpression ::= LogicalORExpression ( "?" AssignmentExpression ":" AssignmentExpression )?
AssignmentExpression ::= ( LeftHandSideExpression AssignmentOperator AssignmentExpression | ConditionalExpression )
AssignmentOperator ::= ( "=" | "*=" | "/=" | "%=" | "+=" | "-=" | "<<=" | ">>=" | ">>>=" | "&=" | "^=" | "|=" )
Expression ::= AssignmentExpression ( "," AssignmentExpression )*
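To give a feel for how small an evaluator for marked section conditionals could be, here is a sketch of a recursive-descent evaluator for just the boolean part of the grammar: "!", "&&", "||", parentheses and the boolean literals. The `%name` variable syntax is taken from the earlier example rather than from the grammar, and the function name `evaluate` and the `variables` dictionary (the out-of-band configuration) are assumptions.

```python
import re

# Tokens of the boolean subset, plus %name variables (an assumption).
TOKEN = re.compile(r"\s*(&&|\|\||!|\(|\)|true|false|%\w+)")


def evaluate(expr, variables):
    """Evaluate a boolean marked-section condition against out-of-band values."""
    tokens, pos = [], 0
    expr = expr.rstrip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    tokens.reverse()                 # so that pop() yields the next token

    def peek():
        return tokens[-1] if tokens else None

    def parse_or():                  # LogicalORExpression
        value = parse_and()
        while peek() == "||":
            tokens.pop()
            rhs = parse_and()        # always parse the operand, then combine
            value = value or rhs
        return value

    def parse_and():                 # LogicalANDExpression
        value = parse_unary()
        while peek() == "&&":
            tokens.pop()
            rhs = parse_unary()
            value = value and rhs
        return value

    def parse_unary():               # UnaryExpression restricted to "!"
        if not tokens:
            raise ValueError("unexpected end of expression")
        tok = tokens.pop()
        if tok == "!":
            return not parse_unary()
        if tok == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing )")
            tokens.pop()
            return value
        if tok in ("true", "false"):
            return tok == "true"
        if tok.startswith("%"):
            if tok[1:] not in variables:
                raise ValueError(f"no out-of-band value for {tok}")
            return bool(variables[tok[1:]])
        raise ValueError(f"unexpected token {tok!r}")

    result = parse_or()
    if tokens:
        raise ValueError("trailing input in expression")
    return result


print(evaluate("!(%old) || %teen", {"old": True, "teen": False}))
```

The sketch uses the ECMAScript operator spellings from the grammar, not the not()/or spelling that appears in the document example.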