One of the reasons for these, and indeed for the grouping element of “pattern”, comes from my experience writing my book The SGML & XML Cookbook 20 years ago. What I needed for that book was a way to express abstract patterns, for which I could then present different concrete implementations. But DTDs (and the other schema languages of the day, such as FrameMaker’s EDD) were not up to it. Even now, I think Schematron is the only mainstream schema language to take this as a primary focus.
So I am always interested to see what kinds of patterns other people detect in documents. Dealing with Structural Patterns in Documents (Di Iorio, Peroni, Poggi, Vitali), from the excellent Professor Vitali’s group at the University of Bologna, is stimulating in this regard. It posits eight objective classes of elements (the permutations of whether an element can have text nodes, whether it can have element nodes, and whether it can have text siblings in its parent). It calls these:
- Container, with non-exclusive subtypes
Documents conform to this architecture if each element belongs to only one of these objective classes. But if the same element can appear in both mixed content and element content, say, that is a shift, and the analysis fails to fit to that extent.
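The conformance test described above can be sketched in a few lines of Python. This is only an illustration under my own reading of the three properties, not the paper's algorithm: the function names (`has_text`, `classify`, `shifts`) are hypothetical, whitespace-only text is ignored, and "text siblings" is taken to mean that the parent has non-whitespace text nodes of its own.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def has_text(elem):
    """True if elem has a non-whitespace text node as a direct child."""
    if elem.text and elem.text.strip():
        return True
    # In ElementTree, text after a child element is stored on child.tail.
    return any(child.tail and child.tail.strip() for child in elem)


def classify(root):
    """Map each element name to the set of observed
    (has text, has element children, has text siblings) triples."""
    seen = defaultdict(set)

    def walk(elem, parent_has_text):
        own_text = has_text(elem)
        seen[elem.tag].add((own_text, len(elem) > 0, parent_has_text))
        for child in elem:
            walk(child, own_text)

    walk(root, False)  # assumption: the root has no text siblings
    return seen


def shifts(root):
    """Element names observed in more than one class (the 'shifts')."""
    return {name: t for name, t in classify(root).items() if len(t) > 1}


# Hypothetical sample: <b> appears once inside mixed content (in <p>)
# and once inside element-only content (in <item>).
SAMPLE = "<doc><p>Some <b>bold</b> text</p><list><item><b>x</b></item></list></doc>"

if __name__ == "__main__":
    for name, triples in shifts(ET.fromstring(SAMPLE)).items():
        print(name, sorted(triples))
```

In the sample, only `b` is reported, because it has text siblings in one context and none in the other. A document with an empty `shifts` result would fit the analysis completely, at least under these simplified definitions.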
The Bolognese validate their theory against a corpus: data-oriented documents usually fit this analysis completely, while the more freeform ones mostly fit but with some shifting.