Recently I have been looking at probabilistic schemas. For the sake of helping the conversation along, what untapped possibilities are there? Let's look at some ideas in increasing complexity:
Probabilistic Pairs: This approach is just a table of each possible pair of elements, and the probability of some relationship between one and the other. You might have a table for the XPath following:: axis, or for child:: or following-sibling:: and so on. You can use inductive logic on these pairs.
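To make that concrete, here is a minimal sketch in Python (assuming a hypothetical corpus under corpus/*.xml, which is my placeholder, not any particular project) that builds such a pair table for the child:: axis by counting, for each parent name, how often each child name appears beneath it and normalising to probabilities; the other axes would be tabulated the same way.

import glob
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

pair_counts = defaultdict(Counter)   # parent tag -> Counter of child tags
parent_totals = Counter()            # parent tag -> number of child slots seen

for path in glob.glob("corpus/*.xml"):            # hypothetical corpus location
    for parent in ET.parse(path).getroot().iter():
        for child in parent:
            pair_counts[parent.tag][child.tag] += 1
            parent_totals[parent.tag] += 1

# estimated P(child | parent) for the child:: axis
pair_table = {
    (parent, child): count / parent_totals[parent]
    for parent, children in pair_counts.items()
    for child, count in children.items()
}

for (parent, child), p in sorted(pair_table.items()):
    print(f"{parent}/{child}: {p:.2%}")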
Bayesian Networks: Bayesian refers to systems where the probability of some event is determined from knowledge of other prior events. (An example of a non-Bayesian probability is a coin toss. The entertaining Monty Hall problem can be cast as a Bayesian problem.) For example, see this paper which gives the probability that some element will contain text based on that element alone – i.e. probabilities for each x/text() – and probabilities that an element will contain some child based on the parent as well – i.e. probabilities for each x/y/z, if I read it correctly.
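I have not tried to reproduce the paper's method, but a rough sketch of those two kinds of conditional probability, again over a hypothetical corpus/*.xml, might look like this: the chance that each element name has text content (x/text()), and the chance of each child name given both its parent and grandparent names (x/y/z).

import glob
import xml.etree.ElementTree as ET
from collections import Counter

has_text = Counter()        # tag -> elements with non-whitespace text content
tag_totals = Counter()      # tag -> total elements seen
triples = Counter()         # (grandparent, parent, child) -> count
context_totals = Counter()  # (grandparent, parent) -> child slots seen

def walk(elem, parent_tag=None, grandparent_tag=None):
    tag_totals[elem.tag] += 1
    if elem.text and elem.text.strip():
        has_text[elem.tag] += 1
    if parent_tag is not None and grandparent_tag is not None:
        triples[(grandparent_tag, parent_tag, elem.tag)] += 1
        context_totals[(grandparent_tag, parent_tag)] += 1
    for child in elem:
        walk(child, elem.tag, parent_tag)

for path in glob.glob("corpus/*.xml"):            # hypothetical corpus location
    walk(ET.parse(path).getroot())

p_text = {x: has_text[x] / n for x, n in tag_totals.items()}
p_child = {xyz: n / context_totals[xyz[:2]] for xyz, n in triples.items()}

print(p_text)    # roughly: how likely each x is to have non-empty x/text()
print(p_child)   # roughly: how likely each x/y/z is, given its x/y context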
Probabilistic Grammars: These are schemas to which probabilities have been added for each child. To get a basic feel, consider this case where we take away any consideration of position or requirement from the grammar, just using Probabilistic Mixed Content Models:
<!ELEMENT SECTION ( #PCDATA#0% | TITLE#100% | P#100% | TABLE#10% )* >
where the percentage indicates what proportion of such parent elements would contain that child in a large document corpus or an average document. You can imagine the uses: if the percentage is very low you have identified outliers that might be better engineered out; if the percentage is 100% you might check that the schema enforces this; you might use this information to provide better editor prediction of what elements come next; and so on.
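A sketch of how those percentages might be harvested from a corpus (the corpus path and the use of SECTION are my own placeholders): for each parent element, count the share of its instances that contain at least one of each child name, or any non-whitespace #PCDATA.

import glob
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

instances = Counter()              # parent tag -> number of instances seen
contains = defaultdict(Counter)    # parent tag -> Counter of child names / "#PCDATA"

for path in glob.glob("corpus/*.xml"):            # hypothetical corpus location
    for elem in ET.parse(path).getroot().iter():
        instances[elem.tag] += 1
        seen = {child.tag for child in elem}
        if (elem.text and elem.text.strip()) or any(
            child.tail and child.tail.strip() for child in elem
        ):
            seen.add("#PCDATA")
        for name in seen:
            contains[elem.tag][name] += 1

# report SECTION the way the annotated content model above is written
parent = "SECTION"                                # placeholder element name
for name, count in contains[parent].most_common():
    print(f"{name}#{100 * count / instances[parent]:.0f}%")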
Hidden Markov Models: a hidden Markov model (HMM) is a two-layer model (think states and recognizers, called latent variables and observations) and would typically be applied to make a schema out of the linear stream of SAX events, say (i.e. the probability that the start-tag for TITLE follows the start-tag for SECTION). The characteristic Markov Property is that the probabilities of the next state depend only on the current state and the observation made, not on previous observations. So a schema using an HMM would really be interested in typical sequences: in XPath's following::*[1] and child::*[1]. It could be used to detect outliers of sequence (i.e. where a common child element appears in an unusual position), and for implying tags in sparse documents. Once you have this model, with the probabilities of observations from a state and of going from state to state in place, it can be used for all sorts of statistical inferencing about documents, especially for calculating the most likely missing markup.
An HMM schema would be most effective for schemas where there is no recursion or re-use of element names in different contexts, and no element grouping or multiple occurrence of a name in a content model: so content models like ( a, b?, c{1-10}, d*, e ) and so on.
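As a sketch of the sequence side of this, the fragment below estimates transition probabilities between successive start-tag events over a hypothetical corpus. Strictly this is the visible Markov chain you get when each state simply emits its own tag; a full HMM keeps the states hidden, but the outlier-flagging idea is the same. The corpus path, the SECTION/TABLE example and the threshold are all placeholders.

import glob
import xml.sax
from collections import Counter, defaultdict

transitions = defaultdict(Counter)   # previous start-tag -> Counter of next start-tags
totals = Counter()

class StartTagCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.prev = "#START"          # synthetic state for the start of a document

    def startElement(self, name, attrs):
        transitions[self.prev][name] += 1
        totals[self.prev] += 1
        self.prev = name

for path in glob.glob("corpus/*.xml"):            # hypothetical corpus location
    xml.sax.parse(path, StartTagCollector())

def transition_probability(prev, nxt):
    return transitions[prev][nxt] / totals[prev] if totals[prev] else 0.0

# flag an unusual sequence, e.g. a TABLE start-tag directly after SECTION's start-tag
if transition_probability("SECTION", "TABLE") < 0.05:     # threshold is arbitrary
    print("TABLE directly after SECTION is rare in this corpus")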
Context-Free Semi-Hidden Markov Models: an HMM might be classed as a kind of state machine, but if we use a stack machine instead, allowing a push and pop as one of the actions to go to the next state (i.e. to return to a previous state), then we get a Context-Free Semi-Hidden Markov Model (semi, because the Markov Property does not hold for pop operations). So here we are looking at elements proper as a tree, not as a linear list of start- and end-tags. In this case, we can get a much closer implementation of the Probabilistic Mixed Content Models above: the latent variables are the LHS (element name) and the observations are the RHS (the content model).
Many variants are possible, and in fact using the parent as the latent variable might get rid of one of the nice features of an HMM: that you can express probabilities that vary in different contexts (e.g. it rains more in Winter than in Summer). So instead of each latent variable being the simple element, you could have one latent variable for each absolute path (in which case you don't actually need a stack, but you risk combinatorial explosion). But more interesting would be to allow latent variables not based on explicit markup, for example that the first paragraph in a section has no likelihood of containing lists.
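A sketch of the simplest form of this, with the element name as the latent variable (LHS) and its observed child-name sequence as the observation (RHS): the walk below uses an explicit push/pop stack to stand in for the stack machine, and the per-absolute-path variant just mentioned would simply key the counts on the joined stack rather than the bare tag. The corpus path and the SECTION example are placeholders.

import glob
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

rhs_counts = defaultdict(Counter)    # element name (LHS) -> Counter of child-name sequences (RHS)

for path in glob.glob("corpus/*.xml"):            # hypothetical corpus location
    work = [("push", ET.parse(path).getroot())]   # explicit stack of push/pop operations
    while work:
        op, elem = work.pop()
        if op == "push":
            work.append(("pop", elem))            # come back once the content has been seen
            for child in elem:
                work.append(("push", child))
        else:                                     # "pop": record the observed content model
            rhs_counts[elem.tag][tuple(child.tag for child in elem)] += 1

# estimated P(content model | element name), e.g. for a placeholder SECTION element
total = sum(rhs_counts["SECTION"].values())
for rhs, count in rhs_counts["SECTION"].most_common(5):
    print(" , ".join(rhs) or "(no element children)", f"{count / total:.2%}")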
Neural Networks: It should be clear that if you relax the Markov property, you can start to get things that look more like neural nets.
Recursive Neural Networks: A recursive neural network (RNN) is a tree structure of networks. There is in fact a 2014 research paper on this: Self Learning Recursive Neural Networks for Structured Data Classification where they make a trie for the document corpus so that each unique absolute path in the corpus has a corresponding absolute path of neural networks, one per element.
The paper arrives at an interesting conclusion: “The results show that equipping RNN with semi-supervision mechanisms can enhance the classification accuracy of structured data”, which relates to the training problem: training is expensive and time-consuming and requires lots of examples, so if you can help with some upfront information (labelling some documents as typical of known clusters) then that provides a head start for calculations and outlier detection.
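I have not reimplemented the paper, but the structural skeleton it describes can be sketched: build a trie of the unique absolute element paths in a (hypothetical) corpus, one node per element name at each path, with a placeholder for where that node's small network would sit.

import glob
import xml.etree.ElementTree as ET

class TrieNode:
    def __init__(self, tag):
        self.tag = tag
        self.count = 0        # how many corpus elements sit at this absolute path
        self.children = {}    # child tag -> TrieNode
        self.network = None   # placeholder for the per-node network in the paper's scheme

root_node = TrieNode("/")

def insert(elem, node):
    child_node = node.children.setdefault(elem.tag, TrieNode(elem.tag))
    child_node.count += 1
    for child in elem:
        insert(child, child_node)

for path in glob.glob("corpus/*.xml"):            # hypothetical corpus location
    insert(ET.parse(path).getroot(), root_node)

def dump(node, prefix=""):
    for tag, child in sorted(node.children.items()):
        print(f"{prefix}/{tag}  ({child.count} occurrences)")
        dump(child, prefix + "/" + tag)

dump(root_node)   # one line per unique absolute path in the corpus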
Readers: If you know any other research or ideas on probabilistic schemas, please let me know and I may update this article accordingly.
Feature detection and clustering are key techniques with many potential applications to XML processing for large-scale industrial publishers with large corpuses. However, there has not been much research or public information about projects that do it. I think, in part, it is because the industry tends to have a reactive rather than pro-active approach to these things: you fix a problem as it occurs rather than trying to second-guess it upfront.