Common Markdown Specification: so when is it a good idea to throw theory out?

Posted on February 21, 2018 by Rick Jelliffe

My astonished friend Nick forwarded me a PDF-ed version of the GitHub Flavored Markdown Specification. This is a version of the Common Markdown Specification with a small number of extensions. It came to 164 pages. You heard that right, 164 pages, for a spec for a small language.

How could that be? The ISO Schematron specification is only 36 pages excluding frontmatter. And it specifies ISO SVRL as well. W3C’s PDF version of XML is 59 pages. JSON is notoriously one or two pages.

So I took a look. I would say the problem is not that it is a 164-page specification. The problem is that it is kinda a 0-page specification!

It has a five page appendix describing in English a parsing strategy, as a list of when you see this, do that. And some paragraphs here and there. But most of the “specification” is a test suite: when you see this, it should be transformed to that HTML. It has no grammar, no BNF, no productions, and so on. No formalisms. Parsing theory or tools are kept in the toolbelt.

SGML was widely criticized for having grammars but not falling into a popular class of formal language (though the problem was the limited theory of the day): XML development paid a lot of attention to being parseable top-down (by a shadowy figure called the Desperate Perl Hacker).

Why is this a problem? In one sense it is not: the world is better with these documents than without, and the authors should be commended for gifting the world with them. The authors are completely entitled to make their own decisions about what is best given the characteristics of Markdown, what would be useful for the intended readers, and what the most elegant way to capture the requirements is. I don’t have any criticism of anything relating to that. It’s all cool, bro.

But I think it is fair to judge a specification based on the criteria it sets itself. The Specification says

This document attempts to specify Markdown syntax unambiguously.

In order to specify something unambiguously, we need to know that the specification is complete. Specifications are ambiguous because they are incomplete: there is some unintended or unexplained edge case. The most usual way of ensuring that a specification is complete is to use some formalism which is known to be complete. (Or you need to show that you have exhaustively treated every atom and combination to some depth, somehow. I think the only other way to know that a specification is complete is if it is mature and widely tried and has no outstanding issues: it is effectively complete.)

This is the key difference between Schematron and grammar-based validators like DTDs or XSD: a Schematron schema may be open and only cope with particular patterns of interest; a grammar is closed (though it may have wildcards).

So without using some formal notation, all the extra tests do is provide incrementally better test cases, with no guarantee that they actually are complete. Maybe they are, maybe they are not. Who knows? And they don’t necessarily help a developer, who would need to coalesce the rules into some grammar-like form. And there are too many of the tests for a regular user to take in, I suspect.
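To make the worry concrete, here is a small Python sketch. Everything in it is hypothetical: the three example pairs are invented in the style of "this input becomes that HTML" rather than copied from the specification, and the two renderers are toy functions made up for illustration. Both pass every example, yet they disagree on an input the examples never mention.

```python
import re

# Hypothetical example pairs: (Markdown input, expected HTML output).
EXAMPLES = [
    ("*a*", "<em>a</em>"),
    ("a * b", "a * b"),
    ("*a* and *b*", "<em>a</em> and <em>b</em>"),
]


def render_a(text: str) -> str:
    # Toy rule A: an opening '*' must be followed by a non-space character;
    # the closing '*' may follow anything except another '*'.
    return re.sub(r"\*([^\s*][^*]*)\*", r"<em>\1</em>", text)


def render_b(text: str) -> str:
    # Toy rule B: as rule A, but the closing '*' must also be preceded
    # by a non-space character.
    return re.sub(r"\*([^\s*](?:[^*]*[^\s*])?)\*", r"<em>\1</em>", text)


def passes_all(render) -> bool:
    # True when the renderer reproduces every example pair exactly.
    return all(render(source) == html for source, html in EXAMPLES)


print(passes_all(render_a), passes_all(render_b))  # True True
print(render_a("*a *"))  # <em>a </em>
print(render_b("*a *"))  # *a *
```

Adding a fourth example for "*a *" would settle this particular case, but it would say nothing about the next uncovered one, which is the sense in which examples alone give no guarantee of completeness.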

But the problems with formalisms are:

non-professionals don’t understand them. (Or am I saying that if you are not fluent in the formalisms for your domain you are just a hacker? Perhaps a little too provocative if not hypocritical of me.)
technologies that have accumulated hacks and quirks over the years do not necessarily lend themselves to elegant specification with what we might hope would be the correct theoretical class. (HTML 5 and SGML being good examples of this.)

Doable?

Now actually I think it would be quite possible to define Common Markdown using a grammar. (In fact, I did this with a simple precursor of it, for the Wiki markup language, using SGML as the formalism: see 2014’s From Wiki to XML using SGML.) But the problem would be that for Common Markdown, I expect it could only be a “generative grammar”: in other words, you could use the grammar to generate a legitimate Common Markdown document, and you could take a Common Markdown document and construct the parse tree for the grammar by hand, but you could not use it directly to parse the document without knowing more rules outside the model.

This is because, as far as I can see, you would need to have an ambiguous grammar to describe Common Markdown. This is a grammar where multiple paths are possible, and you decide which one to take, perhaps after exploring each one or picking the first shortest one or the longest one or whatever. (There are tricks to make this more tractable in many cases, such as RELAX NG implementations’ typical use of derivatives. But they rely on the other paths failing to match eventually, rather than on allowing particular equally-possible paths to be used, which is what Markdown would seem to need.)
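As an illustration of what "multiple paths" means here, the following Python sketch enumerates every reading of a string under a deliberately ambiguous toy grammar. The grammar is an assumption made up for this example, not taken from the Common Markdown specification: a sequence is any number of items, and an item is either a single literal character (a literal '*' included) or an emphasis span written as '*', a non-empty sequence, '*'. The function name readings is likewise hypothetical.

```python
from functools import lru_cache


def readings(s: str) -> list[str]:
    """Return every rendering of s allowed by the toy ambiguous grammar."""

    @lru_cache(maxsize=None)
    def seq(i: int, j: int) -> tuple[str, ...]:
        # All renderings of the slice s[i:j] as a sequence of items.
        if i == j:
            return ("",)
        out = []
        # Path 1: the first item is one literal character.
        for rest in seq(i + 1, j):
            out.append(s[i] + rest)
        # Path 2: the first item is an emphasis span with non-empty content,
        # closed by some later '*' at position k.
        if s[i] == "*":
            for k in range(i + 2, j):
                if s[k] == "*":
                    for inner in seq(i + 1, k):
                        for rest in seq(k + 1, j):
                            out.append("<em>" + inner + "</em>" + rest)
        return tuple(out)

    return sorted(seq(0, len(s)))


for reading in readings("*a*b*"):
    print(reading)
# *a*b*
# *a<em>b</em>
# <em>a*b</em>
# <em>a</em>b*
```

The grammar accepts all four readings as equally legitimate. Choosing one of them is exactly the part that needs rules outside the grammar.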

The reason I think this is that Common Markdown’s second pass relies on making a stack of delimiter positions, then starting at the top of the stack, going back and forward accepting some, and rejecting (i.e. using the text value of) the others.
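A heavily simplified Python sketch of the delimiter-stack idea follows. It is not the algorithm from the specification: it handles only single '*' delimiters, treats any '*' next to whitespace or another '*' conservatively, and ignores '_', strong emphasis and the other rules the real text lays out. The function name and the can_open and can_close tests are assumptions chosen for illustration.

```python
def render_emphasis(text: str) -> str:
    out: list[str] = []      # output pieces; an opener's piece may be rewritten later
    openers: list[int] = []  # stack of indexes into out that hold an unmatched '*'

    for i, ch in enumerate(text):
        if ch != "*":
            out.append(ch)
            continue

        # Treat the ends of the string as if they were whitespace.
        prev_ch = text[i - 1] if i > 0 else " "
        next_ch = text[i + 1] if i + 1 < len(text) else " "
        can_close = not prev_ch.isspace() and prev_ch != "*"
        can_open = not next_ch.isspace() and next_ch != "*"

        if can_close and openers:
            # Accept: pair this closer with the most recent unmatched opener.
            out[openers.pop()] = "<em>"
            out.append("</em>")
        elif can_open:
            # Remember the position; it stays literal unless a closer turns up.
            openers.append(len(out))
            out.append("*")
        else:
            # Reject: the delimiter is used for its text value.
            out.append("*")

    return "".join(out)


print(render_emphasis("*a*b*"))     # <em>a</em>b*
print(render_emphasis("*a *b* c"))  # *a <em>b</em> c
print(render_emphasis("a * b"))     # a * b
```

Even at this toy scale the outcome for a given '*' depends on what was pushed before it and what arrives after it, which is the kind of procedural rule a grammar on its own does not express.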

But Useful?

But would a grammar still be useful? I think so. Even if it were just the generative, ambiguous grammar, it would help identify the transforms that are regular and those that are not: it would help see the woods for the trees. At the moment, these specifications are little but trees.

When I see a specification like this, I feel mixed emotions. Primarily I feel happy that the information is out there and in a nice regular form: organized chaos is better than disorganized chaos! But overlaid on this is a little despair or dread that the opportunity provided by SGML, which was to be able to define domain-specific markdown-like languages in ways that then allowed them all to be brought into a common technical eco-system, treated by common tools, validated by syntax-independent tools, and for the syntax to be no barrier to incorporation in with other documents, is dead. Watch me shed the tiniest of tears.

We are all more stupid for it, and we are reduced to describing syntaxes in words and examples.

But history is what it is: SGML was not ultimately up to the job because of its limitations, complexity and lack of evolution as XML took the limelight. And there is enormous value in specifying the transform in terms of HTML, which can be validated: I admit SGML and XML had a problem without actual definitions of the result of parsing (which is one reason for Schematron’s SVRL): SGML had to have the ESIS specification tacked on, and XML in due course got its XML Infoset.

SGML is what you get when you have to write a compiler compiler in FORTRAN. HTML is what happens when you hack to hardcode an SGML-like application in C. Markdown is what you get when you hack together a markup language in Perl.

I think it is telling that when I tried to find an example of this expected ambiguity, I twice got (mentally) lost looking through the multiple examples in the specification and gave up. So many details.

[UPDATE: Dale Waldt wrote an article My Love/Hate Relationship with MarkDown last week, which is what prompted Nick.]