Lightweight inline schemas above structs?

The last thing I expect the world cares about is another schema language for XML! But the XML ecosystem has had a lot of challenges with JSON. XML comes from the markup world, where your documents are made by domain experts rather than programmers, with the intention of abstracting away issues of representation and formatting; JSON is designed for programmers, with the intention of specifying exactly the storage for data items. So they are very different.

XML's core problem in this is that it does not have enough syntax. There are not enough delimiters to specify different datatypes, so you have to go to an overlaid format to specify such things: DTDs, XML Schemas, Schematron etc. Even then, these are not aimed at specifying arbitrary C-style structures, but slightly higher-level structures. Moreover, what they achieve can be fairly arbitrary: why can no XML schema language specify that some element contains unordered data, for example? That kind of question is arbitrarily regarded as semantics, and good for RDF or whatever, yet it is precisely the information that a receiving system would need when dynamically selecting the class to use to create objects.

So when we separate style from content, and then schemas from content, we get a triple layering that is ripe for becoming heavyweight. As indeed happened with XML Schemas (and indeed with RELAX NG and even Schematron, when compared against JSON).

So here is an idea for a lightweight inline schema language for higher-than-struct level databinding. Just to introduce the category.

I use PIs (processing instructions) with the following structure: <?whatis path type regex ?>

Here is an example (don't fixate on the HTML mistake please!):

 
<?xml version="1.0"?>
<?whatis xhtml                     bag "head body"                  ?>
<?whatis head                      bag "('title'|'meta'|'script')*" ?>
<?whatis a/@href                   url "*"                          ?>
<?whatis /xhtml/head/meta/dc:date  date  "*"                        ?>
<?whatis form/@method              string   "'put'|'post'"          ?>
<xhtml ...>
  <head>
  .....
</xhtml>

So the path is a simple ancestor axis. It would not require an XPath implementation: it could be matched simply by converting the names in the ancestor stack into a /-separated string and doing a string compare.
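A minimal sketch of that string compare, in Python. The function names are hypothetical, and two things are assumptions rather than anything stated above: that a path without a leading / matches as a suffix of the ancestor string, and that a path with a leading / must match the whole string from the root.

```python
def stack_to_path(stack, attribute=None):
    """Join the open element names (root first) into a /-separated string.

    If an attribute name is given, it is appended as a final @name step.
    """
    steps = list(stack)
    if attribute is not None:
        steps.append("@" + attribute)
    return "/" + "/".join(steps)


def path_matches(pattern, stack, attribute=None):
    """Compare a whatis path against the current ancestor stack.

    Assumption: a leading "/" anchors the pattern at the root; otherwise
    the pattern only has to match the trailing steps of the stack.
    """
    actual = stack_to_path(stack, attribute)
    if pattern.startswith("/"):
        return actual == pattern
    # Prefixing "/" stops "head" from matching an element called "thead".
    return actual.endswith("/" + pattern)


print(path_matches("head", ["xhtml", "head"]))
print(path_matches("a/@href", ["xhtml", "body", "p", "a"], "href"))
print(path_matches("/xhtml/head/meta/dc:date",
                   ["xhtml", "head", "meta", "dc:date"]))
```

No tree is built and no XPath engine is involved; the only state is the list of currently open element names.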

Similarly, the content model for a bag, set or other type of element content would be tested by treating it as a regular expression that runs on the space-separated concatenation of the names of the child elements. (This is the approach we can use in Schematron to test content models using regular expressions.)
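As a rough illustration, here is how such a test might look in Python. The patterns are written in Python's own regex syntax, not the quoted-name notation of the example above, and putting one space after every name (rather than between names) is an assumption made only to keep the patterns simple. The function names are hypothetical.

```python
import re


def children_string(child_names):
    """Concatenate the child element names, each followed by one space."""
    return "".join(name + " " for name in child_names)


def content_matches(pattern, child_names):
    """Test a content model by matching a regex against the child names.

    The whole string must match, so stray children are rejected.
    """
    return re.fullmatch(pattern, children_string(child_names)) is not None


# Any mix of title, meta and script children, in any order.
HEAD_MODEL = r"((title|meta|script) )*"

print(content_matches(HEAD_MODEL, ["title", "meta", "meta"]))
print(content_matches(HEAD_MODEL, ["title", "style"]))
print(content_matches(r"head body ", ["head", "body"]))
```

A pattern such as "head body " is order-sensitive under this scheme; an unordered model has to be written as a repeated alternation, as in HEAD_MODEL.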

The types would be the JSON datatypes, the remaining XML Schemas datatypes, along with, say, the element content types mixed, bag, set, array and empty. And that's it really.

Anyway, this would give the ability to bind elements to generic classes better than JSON does (not that this is JSON's purpose), at trivial parsing cost and without a complex external schema language. The path allows better selection than DTDs, while the regex allows limited validation and edit-assistance. It lets a developer specify or know that to access some element's content, you can use the platform's Bag collection class or Date class, and so on.
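To make the binding idea concrete, here is a sketch of how a receiving program might map declared type names onto platform classes, using Python's standard library. The table and the bind function are hypothetical; which class stands in for which type (Counter for a bag, for instance) is an assumption, and only a handful of the types are covered.

```python
from collections import Counter
from datetime import date
from urllib.parse import urlsplit

# Assumed mapping from simple type names to constructors taking text.
SIMPLE_TYPES = {
    "string": str,
    "date": date.fromisoformat,   # expects ISO 8601 text, e.g. 2009-03-01
    "url": urlsplit,
}

# Assumed mapping from element content types to collection classes.
CONTAINER_TYPES = {
    "bag": Counter,     # unordered, duplicates counted
    "set": frozenset,   # unordered, duplicates dropped
    "array": list,      # ordered
}


def bind(type_name, content):
    """Build a platform object for a declared type.

    content is the text for simple types, or a list of child values
    for the container types.
    """
    if type_name in SIMPLE_TYPES:
        return SIMPLE_TYPES[type_name](content)
    if type_name in CONTAINER_TYPES:
        return CONTAINER_TYPES[type_name](content)
    raise ValueError("no binding for type " + repr(type_name))


print(bind("date", "2009-03-01"))
print(bind("bag", ["meta", "title", "meta"]))
```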

The important thing is that all that is needed is the element stack, the child list and a regex parser, so it could be implemented trivially and efficiently.
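Here is one way those three pieces could fit together, sketched with Python's standard xml.sax module. Everything in it is an assumption made for illustration: the WhatisHandler class, the layout of the PI data (path, type, then a double-quoted pattern), the use of Python regex syntax inside the PIs, and the suffix matching of relative paths. Attribute paths such as a/@href are simply never matched in this sketch.

```python
import re
import xml.sax

# Assumed PI data layout: path, type, then a double-quoted pattern.
WHATIS = re.compile(r'\s*(\S+)\s+(\S+)\s+"(.*)"\s*$')


class WhatisHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.rules = []      # (path, type, pattern) taken from the PIs
        self.stack = []      # names of the currently open elements
        self.children = []   # one list of child names per open element
        self.results = []    # (element path, type, content model ok?)

    def processingInstruction(self, target, data):
        m = WHATIS.match(data) if target == "whatis" else None
        if m:
            self.rules.append(m.groups())

    def startElement(self, name, attrs):
        if self.children:
            self.children[-1].append(name)   # record as child of parent
        self.stack.append(name)
        self.children.append([])

    def endElement(self, name):
        here = "/" + "/".join(self.stack)
        kids = "".join(k + " " for k in self.children.pop())
        self.stack.pop()
        for path, type_name, pattern in self.rules:
            if path.startswith("/"):
                hit = here == path
            else:
                hit = here.endswith("/" + path)
            if hit:
                ok = re.fullmatch(pattern, kids) is not None
                self.results.append((here, type_name, ok))


DOC = b"""<?xml version="1.0"?>
<?whatis xhtml bag "head body " ?>
<?whatis head  bag "((title|meta|script) )*" ?>
<xhtml><head><title/><meta/></head><body/></xhtml>"""

handler = WhatisHandler()
xml.sax.parseString(DOC, handler)
for result in handler.results:
    print(result)
```

The handler keeps only the open-element stack and one child-name list per open element, so memory use grows with nesting depth and fan-out rather than with document size.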