Hook

This article was formerly at http://xml.ascc.net/hook/ and is the version of February 5, 2002. The name Hook comes from a supposed hook shape of drawing this on a parse tree tracing previous-sibling then up the descendents.

A One-Element Language for Validation of XML Documents based on Partial Order

The Hook validation language is a thought experiment in minimalism in XML schema languages. The purpose of such a minimal language would be to provide useful but ultra-terse success/fail validation for basic incoming QA, especially of datagrams. It is like a checksum for a schema.

The validation it performs can be characterized as "Does this element have a feasible name, ancestry, previous-siblings and contents?", there being some tradeoff between the how fully the later criteria are tested.

Let us start with the following technical criteria:

Smaller than DTD: if it is downloaded from a server as a separate file, it should be downloadable in the first packet group, so less than 512 (the minimum MTU) -100 (for MIME header) =412 bytes.
Implementable by a streaming processor
No forward references
No pathological schemas as far as blowouts
An efficient implementation should be possible
Suitable for coarse validation of document for some significant issues
The schema should be namespace-aware
The minimal schema should only require 1 element or perhaps fit in a PI
The datatype should be expressible using XML Schemas regular expressions or simple space-separated tokens.
The schema paradigm is the (partial) ordering of elements against the information kept during stream processing

The Language

A Hook schema is an element containing a list of element names, some of which may be grouped by square brackets. This list represents a certain ordering of the names and validation consists of checking conformity to this ordering.

The DTD for the language is

<!ELEMENT hook:order ( #PCDATA)>
  <!ATTLIST  hook:order
        xmlns:hook  CDATA #FIXED "http://www.ascc.net/xml/hook"
        targetNamespace CDATA #IMPLIED
        friendly ( true | false ) "true"
        short (true | false) "false"
        top (true| false) "true"
 >

The targetNamespace attribute gives the namespace to be validated.
The friendly attribute is whether elements from other namespaces are allowed.
The short attribute is whether the all elements in the namespace have been mentioned or not; if not then unmentioned elements are allowed as if specified in a group at the end of the schema.
The top attribute specifies whether the first element in the schema must be the document element (or the local root of a branch starting this namespace).

The order element has the following grammar, where s is one or more whitespace (or string-start or string-end) and NCame is an XML name with no colons.

s ( (NCName "."? s )| 
        ( "[" s (NCname ("."|";")? s)+ "]" s ) 
         )+

The order element specifies an ordering of elements; element grouped by square brackets are in the same level or order.

Validation occurs by, for each element in the document proceding in document (streaming) order, checking that every previous-sibling element at the same level and then each ancestor element are ordered according to the list order (ignoring intermediate list items, but failing if there is no corresponding item in the schema to any element.) A name may appear more than once. (Actually, an implementation only needs to look a the first child and or next-sibling to perform validation, but explaining it this way around may make the syntax easier to understand.)

A fullstop (period) on an element indicates that the element may have no contents (no subelements and only zero-or-more whitespace characters ): this is almost the same as EMPTY or data content. A semi-colon indicates that the element is not recursive: the named element cannot immediately contain any elements from the same group. (It can still be followed by elements of the same group. ) A semi-colon in a group at the end of a schema thus indicates that simple content only is possible

Normally [ x y ] allows

<x><y/><x/><y/></x>
   <y><x/><y/><x/></y>

but [ x y; ] allows

<x><y/><x/><y/></x>

because the y does not contain x, but not

<y><x/><y/><x/></y>

because the y does contain x. And [ x y. ] allows

<x><y/><x/><y/></x>

but not

<y><x/></y>

So [ x y ] means

an x can contain any number of nested x and y before any other element
an x can be followed by any number of x and y before any other element
a y can contain any number of nested x and y before any other element
a y can be followed by any number of x and y before any other element
a y can be followed by any number of x and y before any other element

but [ x y; ] adds the constraint

a y cannot contain a y next (unless the next particle in the hook schema happens to be a y e.g. [ x y; ] x )
a y cannot contain an x next (unless the next particle in the hook schema happens to be an x, e.g. [ x y; ] x )

So ";" is used to break out of the recursion allowed in a [ ] group.

Intuitively, this is like first making a big list of every element allowed, putting them all in a choice group. This gives us a complete definition of every allowed element: it defines the namespace and catches spelling errors. Next, if there is some element(s) that can start, move them out to the front (or copy them if they can reappear. Now the schema validates the top-level elements too. Next, if there are some elements that can only appear as the last elements in a coment model ( e.g. the z in (x, y, z) or the b and c in ( a, (b | c)*) ) then move these out to a group at the end. Now we have validation for elements in simple mixed content. Continue factoring until done.

So given the following schema:

<hook:order>A B. C</hook:order>

then the following documents are valid

<A/>

  <A><B/></A>

  <A><C/></A>

  <A><B/><C/></A>

  <A><C><C/></C></A>

  <A><A/></A>

But not

<B/>

  <C><B/></C>

  <B><A/></B>

  <A><C/><B/></A>

  <A><C/><C/><B/></A>

 <A><B><B/></B></A>

It is quite possible that there are languages which exhibit orders that cannot be usefully captured. In those cases, a hook schema still can show the top element, all names in the namespace, and which elements must be empty.

Example

The following example is a hook schema for XHTML Basic

<hook:order targetNamespace="http://www.w3.org/1999/xhtml" >
  html head  [ title; meta. link. base. ]   body
  [ a br. blockquote caption; div  dl; h1; h2; h3; h4; h5; h6;  
        img. ol; p; pre; table; ul; ]  
  [ tr;  dt; dd; li; ]  td 
  [ a br. blockquote div  form img. ol; ul; li; ]  
  [ input; label; select; textarea; ]  [ option. ]
  [ abbr acronym address cite code dfn em kbd q samp span strong var object; ] 
  param 
 </hook:order>

This schema captures a lot of containment relationships OK, I think: probably it has some mistake. But it will not detect what may be a common XHTML problem, where omit-end-tag HTML elements like <body> are converted to <body />. However it will detect problems like <meta> not being converted to an empty tag and so spuriously including other head elements.

The next example is RSS.

<hook:order  targetNamespace="http://purl.org/rss/1.0/" >
  channel   title link image items item
  title link url description textinput.
 </hook:order>

A Hook schema for the well-known Purchase Order example would be:

<hook:order  targetNamespace="..." >
  PurchaseOrder  [comment; ShipTo; ]  Name Street City State Zip  
  ShipDate [ comment; Items; ] Item productName quantity price comment
 </hook:order>

This is a much more successful example! Note, every valid PO document will also be valid against this schema and that the schema validates all sequence requirements. What it won't catch is if an end-tag is in the wrong palce w.r.t what should be a sibling. So it seems that Hook may be good for validating datagrams of this kind.

Following is a schema for Schematron 1.5 (ISO Schematron has a different namespace).

<hook:order  targetNamespace="http://www.ascc.net/xml/schematron" >
  schema  ns title p phase active p pattern rule   [ assert; report; key.] 
  diagnostics  diagnostic [ name. dir; emph; value-of. ]
<hook:order>

Again, this is pretty good: there is a good amount of order to capture. The "diagnostics diagnostic" could also come before or or after rule

In all four cases above the character count is less than 400 characters, so it looks like they would be retrieve in the first packet group from a server.

Comments

Hook seems to suit languages that have large flat bottoms, languages with specific requirements early on in each content model, languages with specific elements that do not re-occur in different contexts with different priorities, languages with attributes that are not vital or will be checked by other mechanisms.

Hook would seem useful as a coarse-grained but ultra-terse validation language.

If we say that validation is to catch errors that are most likely to happen, the most likely errors are spelling errors, children in the wrong order, and required parents: Hook gets or catches most.

How much would this help an interactive editor? It would know which elements can start, but for new documents it would present to many choices: however if editing existing documents it would cull the available list pretty well, because it would know what the current level was. It would know empty elements.

It would be nice to signal order by < but too much markup would be required.

Joe English's Formal Definition of Hook

(XML-DEV mail list, 06 Feb 2001. I have edited in a correction Joe made the same day.)

Rick Jelliffe  wrote:

> And just to be (constructively) bolshie about it, I have come up with a new
> schema language "Hook" based on a new paradigm "partial ordering" which only
> uses one element.  See http://www.ascc.net/xml/hook  for the idea.

Very interesting! It looks like it meets all of the stated goals. Here's my attempt at a formal definition of hook-validity: [ Note: this is based on yesterday's draft -- Rick added some features to the language while I wasn't looking :-) Also, what follows doesn't address the 'friendly', 'short', or 'top' features; defining these formally seems straightforward. ]

In the "generalized binary tree" representation of an XML document, every node has at most two outgoing edges, the FIRST-CHILD and NEXT-SIBLING.

DEFINITION: The _hook_ of a node N is the (unique) sequence of nodes
    on the path from the root node to N in the generalized binary tree
    representation of the document.

    DEFINITION: A _hook schema_ is an (ordered) sequence of
    (unordered) sets of NCNames, specified by the following grammar:

        hook-schema     ::= items ;
        items           ::= item | item items ;
        item            ::= NCName | '[' ncnames ']' ;
        ncnames         ::= NCName | NCName ncnames ;

As a first attempt to define validity, we'll make a

TENTATIVE ASSUMPTION 1: a particular NCName may appear in at most
    one item in any hook schema.

Given a hook schema H which satisfies this assumption, we can define a function '# : NCName -> Integer' which maps an NCName to the number of the item in H in which the NCName appears, or -1 if the NCName does not appear in H. Under this assumption:

TENTATIVE DEFINITION 1: a document D is _hook-valid_ according to H
    if and only if for every hook 'N1, N2, ... NN' in D, the
    sequence of integers
        #(local-name(N1)), #(local-name(N2)), ... #(local-name(NN))
    is monotonically nondecreasing.

Or equivalently:

TENTATIVE DEFINITION 2: a document D is _hook-valid_ according to H
    if and only if for every node N in D,
        #(local-name(N)) <= #(local-name(FIRST-CHILD(N))
    and
        #(local-name(N)) <= #(local-name(NEXT-SIBLING(N))

    where we extend the definitions of local-name and # as follows:
        local-name(NIL) = "#EMPTY"
        #("#EMPTY") = +infinity
    to account for leaves and last siblings.

Under TENTATIVE ASSUMPTION 1, a hook-schema actually induces a "strict weak order" (in the C++ STL sense) on NCNames. (A strict weak order is stronger than a partial order but weaker than a total order; a relation '<<' on S is a strict weak order if there is a homomorphism from (S, <<) to (Z, <) where (Z, <) is a totally ordered set).

This is incorrect.  For h : (S, <<) -> (Z, <) to be a
homomorphism only requires that for all x,y:

        x << y ==> h(x) < h(y)

whereas for << to be a strict weak order requires the stronger condition:

        x << y <==> h(x) < h(y)

See <URL: http://www.sgi.com/Technology/STL/StrictWeakOrdering.html >
for an alternate definition.

However, TENTATIVE ASSUMPTION 1 does not hold in general since it's contrary to Rick's spec, which explicitly allows NCNames to appear more than once in a hook schema. Discarding this assumption, '#' becomes a relation rather than a function. Writing 'n # i' to mean that NCName 'n' appears in the i'th set of items in H, we have:

DEFINITION: A document D is _hook-valid_ according to H if
    and only if for all nodes X and Y in D,

        Y = FIRST-CHILD(X) or Y = NEXT-SIBLING(X) implies
        that there exists i, j such that local-name(X) # i,
        local-name(Y) # j, and i <= j.

This suggests an efficient implementation: since we only need to compare the _smallest_ i against the _largest_ j, define imin(H, N) to be the number of the first item in hook schema H in which NCName N appears, and imax(H, N) as the number of the last such item. 'imin' and 'imax' can be implemented as hash tables, and computed in a single pass directly from the hook schema. Validation can be performed SAX-style:

variable iprev : integer = -1;
    while not end-of-document():
        case next-event() of
            START-TAG(n):
                    if (imax(H,n)) < iprev then error "Invalid"
                    iprev := imin(H,n)
            END-TAG(n):
                    iprev := imin(H,n)

Regarding expressive power:

HYPOTHESIS: Every hook schema is equivalent to a tree-local/string-local grammar.

PROOF: (shouldn't be difficult...)

HYPOTHESIS: There exist tree-local/string-local languages which cannot be expressed as hook schemas.

PROOF: (the XHTML Basic 'body' element might work as a counterexample...)

By way of comparison: TREX and RELAX are equivalent to tree-regular/string-regular grammars, and XML DTDs are equivalent to tree-local/string-regular grammars. I'll have to think about the recently-introduced ';' and '.' features some more...

-Joe English

jenglish@flightlab.com

Joe English's Updated Validation Algorithm

(XML-DEV, 06 Feb 2001)

Anyway, on to postfix "." and ";":  the earlier DEFINITION
of hook validity can be rephrased as:

A document D is _hook-valid_ according to hook schema H
if the following two diagrams commute:

                     imin
                D ---------> Z
                |            |
   FIRST-CHILD  |            | <=
                |            |
                v            v
                D ---------> Z
                     imax

                     imin
                D ---------> Z
                |            |
   NEXT-SIBLING |            | <=
                |            |
                v            v
                D ---------> Z
                     imax


where (Z,<=) is the set of integers under the usual ordering,
and imin and imax are defined as described earlier.

To accomodate ";" and ".",  we can make a slight adjustment:

                   iparent
                D ---------> W
                |            |
   FIRST-CHILD  |            | <=
                |            |
                v            v
                D ---------> W
                   ioccur

                   isibling
                D ---------> W
                |            |
   NEXT-SIBLING |            | <=
                |            |
                v            v
                D ---------> W
                    ioccur

where W = the set of integers \union { +infinity }, and

ioccur(H,n)   = max { 2*i | n is in the i'th item in H }
isibling(H,n) = min { 2*i | n is in the i'th item in H }

iparent(H,n)  = +infinity, if "n." appears anywhere in H
              |  min (      { 2*i+1 | "n;" is in the i'th item in H }
                      union { 2*i   | "n"  is in the i'th item in H } ),
                 otherwise

Informally, "ioccur" (formerly "imax") specifies the largest context
in which an element may occur, "isibling" (formerly "imin") specifies
the context for siblings of an element, and "iparent" specifies
the initial context for children of an element.  "ioccur" and "isibling"
are always even numbers; iparent(n) will normally be the same as
isibling(n), but the ";" and "." operators modify this to isibling(n)+1
and +infinity, respectively.  "n." takes precedence if it appears
anywhere in the hook schema, otherwise the modifier attached
to the first occurrence of "n" takes precedence.