Technologies come and go, and technologies that don’t change die!

That was the lesson of SGML, where XML took over because the standards process actively worked to stifle change. And it was the lesson of XML, where JSON took over the ephemeral web traffic because the W3C process couldn’t allow the change in layering required (instead of simplifying typing by using delimiters, it added XSD, namespaces and so on, getting so complex that it alienated the core intended user: the Desperate Perl Hacker). And now it seems that JSON is itself being overturned, this time not because of the inflexibility of ISO or W3C but because of Douglas Crockford (with no criticism intended of him, or of ISO or W3C).

This reflection is prompted by reading a series of articles on how JSON is not good for configuration files: notably Why JSON isn’t a Good Configuration Language, which brings out that Douglas Crockford suggests preprocessing JSON in order to allow comments.  McCombs identifies two things that JSON lacks (but which XML has) as problems: no comments and no multiline strings. He also names a couple of things that are, to me, pretty subjective: first, that JSON numbers don’t allow NaN (why would you need this in a config file? It makes no sense to me), and second, that JSON supposedly has a poor signal-to-noise ratio: I don’t see how the outer {} constitute noise in any significant way; surely the tradeoff of being a JavaScript subset has far more value?  McCombs makes another comment about not liking the strict syntax, and wants more readability: but readability normally means redundancy, more sugar rather than less noise. I don’t think the case is good as presented, but it represents a legitimate POV all the same.  (Of course, this is on top of JSON’s problem that, despite the one-page specification, many implementers have made mistakes with edge cases, making it less reliable than unpleasantly strict XML.)
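Crockford’s preprocessing suggestion is easy enough to sketch: strip the comments out before handing the text to a standard JSON parser. Here is a minimal toy illustration in Python (my own sketch, not Crockford’s code; it handles only // line comments, not /* */ blocks, and tracks string state so a URL like "http://…" is left alone):

```python
import json

def strip_comments(text: str) -> str:
    """Remove // line comments from JSON-with-comments text.

    A minimal sketch: tracks whether we are inside a string literal
    so that "//" occurring inside a string value is preserved.
    """
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        c = text[i]
        if in_string:
            out.append(c)
            if escaped:
                escaped = False
            elif c == "\\":
                escaped = True
            elif c == '"':
                in_string = False
            i += 1
        elif c == '"':
            in_string = True
            out.append(c)
            i += 1
        elif c == "/" and i + 1 < len(text) and text[i + 1] == "/":
            # Skip the rest of the line; the newline itself is kept.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)

config_text = """
{
    // port the server listens on
    "port": 8080,
    "docs": "http://example.com/docs"  // trailing comment removed
}
"""

config = json.loads(strip_comments(config_text))
```

The point of the exercise is that the comment syntax lives in a preprocessor, so the JSON grammar itself never has to change.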

So it is very important that the instigators of a technology (like Goldfarb, Crockford, etc.) don’t go from being the caretakers and dictators-for-life to being the technicians maintaining the life support system of a technology in a vegetative state.  Maybe that is being too harsh: yes, a technology needs a champion to succeed, and yes, the grassroots supporters of a technology want it because it is more convenient for them, and they don’t want to be roped into being maintainers and standards champions.  But there always needs to be a next generation to take up the challenge.

But IMHO the only technologies that can be stable are the ones based on enumerations that run out of code space: ASCII is living and fixed only because we don’t have any more code points left; Unicode is living because it still has room to grow.  Otherwise, a technology that could grow but won’t must be augmented or replaced by one that will.   That’s evolution: we already see the effect of parallel evolution, where when some technological feature is superseded, some other technology is “invented” that recapitulates the feature: XML removed the SGML SHORTREF feature, and lo and behold we got Markdown, YAML, JSON and so on.

SGML is “dead” not because it is dead but because XML is dead (though the corpse of XML still has several decades more life in it).  Indeed, for the last year I have been working on a project where the customer was still using SGML:  SGML’s corpse is not completely dissolved into the earth yet; as Mark Twain put it, reports of its death are greatly exaggerated  🙂

XML is thriving for the projects that SGML was the technology of choice for. (And Schematron is still going pretty well in that niche too!)   But it missed an opportunity to grow towards JSON, in particular because the people who were calling for change focussed on complexity in the wrong area: lexical complexity. Instead of improving the XML syntax in a backwards-compatible fashion (allowing, for example, <a b=123.0 c=true d="a string"/> to pull in numbers and booleans) they wanted to get rid of PIs (and comments, in some cases): XML needed (and still needs) more sugar, not less.   (But isn’t XML supposed to be a profile of SGML?  Yes, but that does not mean it could not be a smarter profile of SGML, one which produces a different ESIS from the markup, e.g. where XML says that attribute delimiters are not significant for determining type.)
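To make the “smarter profile” idea concrete, here is a toy sketch (my own invention, not any proposed standard) of delimiter-sensitive attribute typing, where an unquoted value like b=123.0 parses as a number, c=true as a boolean, and a quoted value stays a string:

```python
import re

# One name=value pair: either a quoted string or a bare (unquoted) token.
ATTR = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|([^\s/>]+))')

def parse_attrs(tag: str) -> dict:
    """Toy delimiter-sensitive attribute typing: quoted values stay
    strings; bare values become booleans or numbers.  Not real XML,
    just an illustration of the hypothetical backwards-compatible sugar."""
    attrs = {}
    for m in ATTR.finditer(tag):
        name, quoted, bare = m.groups()
        if quoted is not None:
            attrs[name] = quoted                 # "..." stays a string
        elif bare in ("true", "false"):
            attrs[name] = (bare == "true")       # bare keyword -> boolean
        else:
            attrs[name] = float(bare) if "." in bare else int(bare)
    return attrs

attrs = parse_attrs('<a b=123.0 c=true d="a string"/>')
# attrs == {"b": 123.0, "c": True, "d": "a string"}
```

Old documents, where every attribute value is quoted, would parse exactly as before; only the new unquoted forms would pick up the richer typing, which is what makes the extension backwards compatible.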

SGML was made by people who saw value in amalgamating all the features of the then-current markup languages into one dynamic parser generator, with a lexical configuration (the SGML Declaration), a grammar specification (the DTD) and a post-processing specification (the LINK processor).  This was too complicated for simple use, so many implementers (such as the RISP processor I was involved with in 1990 in Japan) implemented a subset.  Along came HTML, then XML, and then it all complexified with so many layers.  HTML was unlayered by the HTML 5 effort (for better or worse), but XML never did or could.  So along came JSON (which is good for data transmission, but unthinkable for publishing-industry documents).

And now the wheel has turned and we have this plethora of alternatives, as well as the various markdown languages.  Tanenbaum’s comment that “the great thing about standards is that there are so many to choose from” is (apart from being insanely ignorant of the purpose and nature of voluntary industrial standards) being visited on us again as “the great thing about non-standards is that there are so many more to choose from”!

At some stage, will we see SGML 3.0?  (If we can count WebSGML, a.k.a. XML, as SGML 2.0.)  Something that can cope with the rules of HTML 5, Markdown, JSON and YAML, plus XML?  Let alone TOML and HOCON.

I guess it goes back to the central premise of SGML: that people will never agree on syntax and structures, not because of bloody-mindedness but because their legitimate requirements are particular.  So what is needed is not agreement on lexical rules or syntax or structures, but on the classes of language required, the way to declare them, and the information set that will result from parsing.   This was the approach that SGML took, and it meant that SGML was more akin to, say, ANTLR or JavaCC than casual users would expect or require.

I think this central premise holds good: but does the value proposition of SGML still hold good?  That if we have agreement on these meta-details, then you can take your documents and I can take mine, and we can run them into a common front end and then process them (as DOM trees or SAX event streams or graphs etc.) with the same conventions. And, more particularly, we can come back in 30 or 50 years and rerun the same data (this is important for weapons systems, for example).

  • I think what we see instead is that developers prefer to use multiple front end systems rather than a generic one that requires a lot of configuration.
  • And people with documents with extremely long lives (e.g. legal publishers) are prone to converting all their data to a new DTD/Schema every few decades: one of the value propositions in switching to XML from SGML was that it gave the opportunity for this change of generation.
  • Furthermore, it is not clear that there is now any advantage in having non-developers write SGML declarations, even though they may be declarative: developers are more comfortable writing code.  However, the current efflorescence of programming languages does mean that having a common way of declaring multiple languages could be more effective.
  • I think SGML (and XML, etc.) also underperforms because it is too hard to integrate into text editors: certainly too hard for a TextMate grammar.  (Topologi’s editor had a more forgiving form of XML that would cope with some kinds of SGML.)  The big news here is the Language Server Protocol, where an editor talks to a smart parser using a standard protocol:  this is a significant missing piece that (had it been available 20 years ago!) would have been a game-changer for SGML and XML.

So we have no community calling for, or requiring, a unified parser for languages of multiple markup styles.   No community demand means no standard.   But I fully expect that when the stars align with other technology, there will be some new standard that moves into this space, finding some squeeze room alongside XML, JSON, etc.