XML Notation Schemas

This article was written at the ASCC, Academia Sinica, Taiwan.

Rick Jelliffe
1999-06-26

This note is intended for discussion by the W3C Schema Working Group. It provides an alternative characterization of the schema problem, and with it a framework for addressing many issues not handled in the first working draft of the XML Schema specification.

Definitions

schema:

A structured group of declarations of the notations and their refinements used by document objects in a particular document type definition.

declarations:

Declarations are the formal expression of some information about a group of document objects. In particular, declarations define rules over various domains, associate external identifiers with the terminals and non-terminals of the grammars, define and name lexical tokens and higher-level structures, and provide mechanisms for analysing, designing, deploying and explaining the information.

document type definition (DTD):

Not to be confused with "XML markup declarations". A document type definition is a set of schemas, which may use different notations, together with informally expressed information.

notations:

A notation is a name given to the grammar used for a fragment of a document. Notations are layered and interwoven through a document: the various levels of character encoding are notation layers, XML is a notation layer, the additional requirements of XHTML on XML are notation layers, an ECMAScript script in an XHTML document is another notation layer, and a BIN64-encoded fragment of text inside that script is yet a further notation. It can be seen that an XML document has a tree of notations, which follows the element and entity structures.

refinements:

Any notation layer can be refined, to constrain the notation further. A notation cannot be relaxed without running the risk of breaking software which relies on the original definition of the notation. Extensible software is software written so that refinements of lower-layer notations will not cause the notation handlers of higher layers to break.

document objects:

Document objects are defined in DOM. The combination of the document objects and the schema objects makes up the information set of the implementation of a markup language.
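
As an illustration of the layering described under "notations" above, here is a minimal sketch of a small XHTML document with each layer marked in a comment. The namespace URI, the script type and the encoded string are illustrative assumptions, and BIN64 is read here as a Base64-style encoding.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- layer 1: the UTF-8 character encoding; layer 2: XML itself -->
<html xmlns="http://www.w3.org/1999/xhtml">
  <!-- layer 3: the additional requirements of XHTML on XML -->
  <head>
    <title>Notation layers</title>
    <script type="text/javascript">
      // layer 4: ECMAScript inside the script element
      var payload = "SGVsbG8=";  // layer 5: encoded text inside the script
    </script>
  </head>
  <body>
    <p>The layers follow the element structure of the document.</p>
  </body>
</html>
```

Each inner layer is only meaningful once the outer layers have been parsed, which is why the notations form a tree that follows the element and entity structures.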

Criteria

XML Schemas should be able to provide enough information to allow:

an application to know whether it can accept documents conforming to that schema (e.g., a MIME browser);
a generator to create documents conforming to that schema (e.g., a structured editor);
an application to test a document to see that it does in fact conform to the schema it says it does (e.g. a validator);
an application to restructure a document into the form required by a schema system (e.g. to convert data to RDF);
an application to use the data (e.g. a hypertext application).

As for general test cases for XML Schema:

it can describe XHTML;
it can describe RDF;
it can describe itself;
it can describe hypertext;
it can describe simple database records (I follow XML Schema: datatypes);
it can describe literature (I follow XML markup declarations).

Furthermore, XML Schemas must provide a framework for maximum extensibility, to allow vendors to compete on depth and quality of coverage.

Scenario

An XML Notation Schema

is an XML document;
may be large in comparison to instances conforming to it;
is therefore unsuitable for download to end-users;
will be transformed into XML 1.0 markup declarations for download to end-users;
will be fragmented and transformed into other formats for use by applications making sense of embedded data;
will be downloaded as-is for use by non-end-user applications.

In other words, XML Notation Schemas build on the strengths of XML markup declarations:

simplicity;
terseness;
validity provides useful pre- and post- conditions for document systems;
they already exist and are deployed.

Foundation

An XML Notation Schema is a set of multi-layered grammars. The grammars may be

partial, only applying within certain ranges,
functional, resulting in parsed or transformed data, and
overlapping, applying
schema-defined (refined) or built-in.

The grammars can apply against any kind of document object.

Notation Declaration

A grammar is known by a notation identifier. It is the fundamental declaration. I follow a revised form from XML Schema.

<!ELEMENT notation ( html:body?, ( %grammar; )*, handler*, parameter*) >
<!ATTLIST notation
  name   ID #REQUIRED
  system CDATA #IMPLIED
  public CDATA #IMPLIED
  mime   CDATA #IMPLIED >

A notation identifier may have several names, according to different naming schemes.

The following notations are built in:

all MIME media types;
all DOM document objects;
all XML Schema:Datatypes, including the lexical expressions and regular expressions;
some other syntaxes, used below.

A handler is some downloadable program which can deal with the notation. The html:body element is provided for documentation.
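
A minimal sketch of one notation known under several names, assuming the declaration above. The local name xhtml.strict is hypothetical. The public and system identifiers are those of the XHTML 1.0 Strict DTD, used here only as a familiar example.

```xml
<notation name="xhtml.strict"
          public="-//W3C//DTD XHTML 1.0 Strict//EN"
          system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          mime="text/html">
  <!-- html:body carries the human-readable documentation -->
  <html:body xmlns:html="http://www.w3.org/1999/xhtml">
    <html:p>Strict XHTML, known by one local name and three external names.</html:p>
  </html:body>
</notation>
```

No grammar, handler or parameter is given here, which the content model permits since all of them are optional.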

Rule

A rule is a boolean expression where each of the terms is a "grammar". This allows grammars to be composed into complex rules.

...TO BE COMPLETED...

Grammars

A grammar defines a notation layer. (Note: For simpler explanation I have not included archetype features here: it should be easy to see how they can be added.)

<!ENTITY % grammar " regularExpression | lexicalModel | contentModel 
                       | contextModel| BNF | anonymous | userDefined ">

<!ELEMENT anonymous EMPTY>
<!ATTLIST anonymous
   type    ( empty| any| mixed| element| rdata | pcdata | unique |
            single| pair| pairs| same | leaf| leaves| nonrecursive ) #REQUIRED
   classes NOTATION  "xml:POSIX.regex"
   tokenizer IDREF   "xml:xml.character" >

<!ELEMENT regularExpression ( #PCDATA )>
<!ATTLIST regularExpression
   classes NOTATION  "xml:POSIX.regex"
   tokenizer IDREF   "xml:xml.character" >

<!ELEMENT lexicalModel ( lexical* )>
<!ATTLIST lexicalModel 
   classes NOTATION  "xml:schema.lexical-types" 
   tokenizer IDREF   "xml:xml.character" >
<!ELEMENT lexical  ( #PCDATA )>

<!ELEMENT contentModel ( #PCDATA  )>
<!ATTLIST contentModel 
   classes NOTATION  #IMPLIED  
   tokenizer IDREF   "xml:dom.element-nodes" >

<!ELEMENT contextModel ( #PCDATA )>
<!ATTLIST contextModel
   classes NOTATION "xml:xsl.patterns"
   tokenizer IDREF "xml:dom.element-nodes" >

<!ELEMENT BNF   ( production )*>
<!ATTLIST BNF 
   terminals NOTATION "xml:rfc????.ebnf"
   tokenizer IDREF    "xml:xml.character" >
<!ELEMENT production ( #PCDATA )>
<!ATTLIST production 
  nonterminal NMTOKEN #REQUIRED >

<!ELEMENT userDefined ANY >
<!ATTLIST userDefined
  name ID #REQUIRED
  notation NOTATION #REQUIRED 
  tokenizer IDREF #IMPLIED >

Each of these forms allows variations on the same function: parsing or validating against a grammar:

anonymous checks data tokens against anonymous qualities;
regularExpression checks data tokens against a POSIX regular expression (see the XML Schema draft);
lexicalModel gives a series of alternatives against a picture (see the XML Schema draft);
contentModel checks a branch of a tree of data tokens against a content model (in this version I have not used element syntax for a content model, for reasons which will be obvious soon, but it is possible and probably desirable to follow XML Schema in this);
contextModel checks a node in a tree of data tokens against a pattern (see the XSL pattern spec);
BNF checks a row of tokens against a full suite of productions, allocating them to non-terminal classes;
the grammars are extensible through the use of userDefined grammars.

Each of these grammars represents a slightly different way of expressing statements about some data. Set operations or logical operations could also be provided.

What is notable about each of these grammars is that they are parameterized by tokenizer and notation. The parameterization of tokenization means that you may define a regular expression over characters or over elements. In fact, a contentModel is a regular expression tokenized over sibling elements and data (with a slightly different syntax)! And, in fact, a lexicalModel is a kind of BNF grammar, in that it associates terminals with non-terminals. (Tokenizing is the process of determining the data tokens to be presented to the grammar.)
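
The effect of the tokenizer parameter can be sketched with two notations, one tokenized over characters and one over element nodes. The notation names and the element names (customer, item) are hypothetical. The tokenizer and classes values are the defaults from the declarations above.

```xml
<notationSchema>
  <!-- tokenized over characters: each character is one data token -->
  <notation name="part-number.lexical">
    <regularExpression classes="xml:POSIX.regex"
                       tokenizer="xml:xml.character">[A-Z]{2}-[0-9]{4}</regularExpression>
  </notation>

  <!-- tokenized over element nodes: each child element is one data token -->
  <notation name="order.model">
    <contentModel tokenizer="xml:dom.element-nodes">( customer, item+ )</contentModel>
  </notation>
</notationSchema>
```

Both grammars are regular expressions. Only the kind of token they range over differs.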

If the tokenization were selected so that an IDREF was followed rather than the child elements, then it would be possible to make schemas of graph structures: to validate links and provide strong type checking. It would also then become possible to select the content model based on what element has linked to an element, rather than according to its type.
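
A speculative sketch of link validation by tokenization follows. The note does not name a tokenizer that follows IDREFs, so user.idref-targets is a hypothetical notation declared only for this example, as are the cite and bibentry element types.

```xml
<notationSchema>
  <!-- hypothetical tokenizer: presents the elements an IDREF points to,
       instead of the child elements -->
  <notation name="user.idref-targets"/>

  <!-- with that tokenizer, the content model constrains link targets -->
  <notation name="citation.targets">
    <contentModel tokenizer="user.idref-targets">( bibentry )</contentModel>
  </notation>

  <!-- every cite element must therefore link to exactly one bibentry -->
  <elementType name="cite" notation="citation.targets"/>
</notationSchema>
```

Under this reading, a cite element whose IDREF resolves to anything other than a bibentry would fail validation, which is the strong type checking of links described above.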

A further implication of allowing specifiable tokenization is that it becomes possible to specify, for example, that the grammar used should be some set operation on some other grammars. This has a major impact on construction, and ties in with Murata Makoto's work.

The classes attribute (which is called "terminals" in the BNF grammar) specifies the particular notation used in the grammar. It allows, for example, a particular schema maker to use different letters for the lexicalModels. The different classes allow slightly different conventions for regular expressions: POSIX, XML content models, PERL, etc.
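
As a small sketch of the classes parameter, the same constraint (a four-digit year) can be written under two regular expression conventions. The notation names are hypothetical, and user:PERL.regex is an assumed identifier for a PERL-style convention that the note mentions but does not name.

```xml
<notationSchema>
  <!-- POSIX convention, the default for regularExpression -->
  <notation name="year.posix">
    <regularExpression classes="xml:POSIX.regex">[0-9]{4}</regularExpression>
  </notation>

  <!-- PERL convention, selected through a hypothetical classes value -->
  <notation name="year.perl">
    <regularExpression classes="user:PERL.regex">\d{4}</regularExpression>
  </notation>
</notationSchema>
```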

One particular use of these would be to allow whitespace found between certain elements to be labelled as non-significant, to overcome the bad mixed-content restriction. Paul Prescod has made a suggestion about this, to allow in effect content models of (#WS, x, #WS, y, #PCDATA, z), where #WS means ignorable whitespace.
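
A sketch of such a content model, with a matching instance shown in a comment. The element names (entry, term, gloss, note) are hypothetical, and the classes value that would teach a contentModel about #WS is left unspecified, as it is in the note.

```xml
<notationSchema>
  <notation name="entry.model">
    <contentModel>( #WS, term, #WS, gloss, #PCDATA, note )</contentModel>
  </notation>
  <elementType name="entry" notation="entry.model"/>

  <!-- an instance this model is meant to accept:

       <entry>
         <term>notation</term>
         <gloss>a named grammar</gloss>see also <note>refinement</note></entry>

       The line breaks and indentation before term and gloss are #WS and
       can be ignored, while "see also " is significant #PCDATA. -->
</notationSchema>
```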

The grammars could be augmented to allow any of the W3C-specified languages (XPointers, Fragment Context Specifications, XQL, etc.) to serve as the grammar.

Associating a Notation with a Context

The next stage is to associate a notation with a context. Note that multiple notations may apply to a context, in layers or overlapping.

Simple Association

The simplest way to associate a notation with a context is to use a NOTATION attribute on an element.

The next simplest way is to associate a document object type with a notation.

<!ENTITY % simple-associations " elements | PI | comment | data | elementType | attribute">

<!ELEMENT elements (html:body?, handler*, parameter*)>
<!ATTLIST elements 
  id  ID #REQUIRED
  namespace CDATA #IMPLIED
  notation NOTATION #REQUIRED >
  <!-- applies to all elements, perhaps qualified by namespace alone -->

<!ELEMENT PI (html:body?, handler*, parameter*)>
<!ATTLIST PI
  id      ID #REQUIRED > 
  <!-- notation determined by the PI target -->

<!ELEMENT comment (html:body?, handler*, parameter*)>
<!ATTLIST comment
   id       ID #IMPLIED
   notation NOTATION #REQUIRED> 
   <!-- applies to all comments -->

<!ELEMENT data (html:body?, handler*, parameter*)>
<!ATTLIST data
   id       ID #IMPLIED
   notation NOTATION #REQUIRED> 
   <!-- applies to all data -->

<!ELEMENT elementType (html:body?, handler*, parameter*)>
<!ATTLIST elementType 
  name ID #REQUIRED
  namespace CDATA #IMPLIED
  notation NOTATION #REQUIRED >
  <!-- applies to an element type of this name, in any context.
   If the namespace is provided, it overrides the namespace prefix
   on the element type name, if present -->

<!ELEMENT attribute (html:body?, handler*, parameter*)>
<!ATTLIST attribute 
 name ID #IMPLIED
 elementType IDREF #IMPLIED
 namespace CDATA #IMPLIED
 notation NOTATION #REQUIRED >
  <!-- applies to an attribute of this name, in any context.
   If the namespace is provided, it overrides the namespace prefix
   on the attribute name, if present -->

<!ELEMENT attributeValue (html:body?, handler*, parameter*)>
<!ATTLIST attributeValue 
 name IDREF #IMPLIED
 elementType IDREF #IMPLIED
 namespace CDATA #IMPLIED
 notation NOTATION #REQUIRED >

The html:body element provides documentation. A handler is a downloadable application which can deal with a simple association: to display it, for example.
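
A minimal sketch of simple association by element type, assuming the declarations above. The memo vocabulary and the notation names are hypothetical.

```xml
<notationSchema>
  <notation name="memo.model">
    <contentModel>( to, from, date, body )</contentModel>
  </notation>
  <notation name="iso-date.lexical">
    <regularExpression>[0-9]{4}-[0-9]{2}-[0-9]{2}</regularExpression>
  </notation>

  <!-- every memo element, in any context, is checked against memo.model -->
  <elementType name="memo" notation="memo.model"/>

  <!-- every date element, in any context, is checked against the date pattern -->
  <elementType name="date" notation="iso-date.lexical"/>
</notationSchema>
```

The association is made purely by name. Nothing here depends on where in the document a memo or date element occurs.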

Extended Association

In extended associations, a context must be satisfied to associate the notation. This directly mirrors XSL, where a pattern is used to select document objects.

<!ELEMENT extendedAssociation ( html:body?, ( %grammar; ), handler*, parameter* )>
<!ATTLIST extendedAssociation
   name    ID #REQUIRED
   grammar IDREF #REQUIRED >

A context is found using the grammar mechanism itself! So, for example, the context could be an XSL pattern, or it could be a content model, or it could use XPointers to test the value of an attribute.

The html:body element provides documentation. A handler is a downloadable application which can deal with an extended association: to display it, for example.
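
A sketch of an extended association under one possible reading of the declaration above: the child grammar selects the context, and the grammar attribute names the notation that then applies. The note does not spell out this division of labour, so it is an assumption, as are the element and notation names.

```xml
<notationSchema>
  <!-- the notation to be applied: at most 40 characters -->
  <notation name="short-title.lexical">
    <regularExpression>.{1,40}</regularExpression>
  </notation>

  <!-- applies only to title elements that are children of an appendix -->
  <extendedAssociation name="appendix-titles" grammar="short-title.lexical">
    <contextModel classes="xml:xsl.patterns">appendix/title</contextModel>
  </extendedAssociation>
</notationSchema>
```

A title elsewhere in the document is untouched by this association, which is what distinguishes it from a simple elementType association.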

Handlers & Parameters

Handlers are XLinks. They locate applications (e.g. Java applets) which provide some kind of useful service for an application. Examples are viewers, tokenizers and external validators. Parameters are XLinks which point to generic information that can be loaded by applications to make use of the data.

<!ELEMENT handler EMPTY>
<!ATTLIST handler
 %xlink; >
<!ELEMENT parameter EMPTY>
<!ATTLIST parameter
 %xlink; >

One immediate use for this mechanism is to provide a solution to the problem of Private-Use-Area (PUA) characters. The PUA is an area reserved in Unicode; these kinds of characters are widely used in East Asia, though less so under Unicode. To be able to transport non-standard characters in XML, one has to attach to the document information which describes the properties of the PUA characters, which the receiving end can make use of. This is a schema issue, because it involves the notation of characters. Some proposals, notably those of Prof. C.C. Hsieh, allow Chinese characters to be formed from parts using placement operators.

Handlers could be defined to render the character in that notation. A parameter could be provided that gave the collation sequence, to be loaded into the sort routines.
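
A sketch of how such a notation might carry its handler and parameter. The %xlink; attributes are not defined in this note, so the href attribute is an assumption, as are the example.org addresses and the notation name. The element name parameter follows the content models used throughout.

```xml
<notation name="publisher.pua-characters">
  <html:body xmlns:html="http://www.w3.org/1999/xhtml">
    <html:p>Private-use characters assigned by one publisher.</html:p>
  </html:body>
  <!-- a downloadable renderer for characters in this notation -->
  <handler href="http://www.example.org/pua/glyph-renderer"/>
  <!-- collation data to be loaded into the receiver's sort routines -->
  <parameter href="http://www.example.org/pua/collation-sequence"/>
</notation>
```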

Top Level

Here is the top-level declaration.

<!ELEMENT notationSchema 
  ( extendedAssociation | %simple-associations; | notation )*>

Example (In progress)

As is traditional, this section gives the Notations Schema DTD both as XML markup declarations and using Notations Schemas, for comparison. (I haven't put in the attribute defaulting yet)

<schema>

<elementType name="notationSchema" notation="notationSchema.model" />
<notation name="notationSchema.model">
        <contentModel>( extendedAssociation | elements | PI | comment |
        elementType | data | attribute | notation )*
        </contentModel>
</notation>

<elementType name="handler" notation="handler.model" />
<notation name="handler.model">
        <contentModel>EMPTY</contentModel>
</notation>
<attribute elementType="handler" notation="" />

<elementType name="parameter" notation="parameter.model" />
<notation name="parameter.model">
        <contentModel>EMPTY</contentModel>
</notation>
<attribute elementType="parameter" notation="" />

<elementType name="extendedAssociation" notation="extendedAssociation.model" />
<notation name="extendedAssociation.model">
        <contentModel>( html:body?, ( regularExpression | lexicalModel | contentModel 
          | contextModel| BNF | userDefined), handler*, parameter* )
        </contentModel>
</notation>
<attribute elementType="extendedAssociation" notation="extendedAssociation.attribute.model" />
<notation name="extendedAssociation.attribute.model" >
        <contentModel>( name &amp; grammar )
        </contentModel>
</notation>
<attributeValue model="extendedAssociation.attribute.model" 
        name="name" notation="xml:ID" />
<attributeValue model="extendedAssociation.attribute.model" 
        name="grammar" notation="xml:IDREF" />

<elementType name="elements" notation="elements.model" />
<notation name="elements.model">
        <contentModel>(html:body?, handler*, parameter*)</contentModel>
</notation>
<attribute elementType="elements" notation="elements.attribute.model" />
<notation name="elements.attribute.model" >
        <contentModel>( id &amp; namespace? &amp; notation )
        </contentModel>
</notation>
<attributeValue model="elements.attribute.model" 
        name="id" notation="xml:ID" /> 
<attributeValue model="elements.attribute.model" 
        name="namespace" notation="xml:CDATA" /> 
<attributeValue model="elements.attribute.model" 
        name="notation" notation="xml:NOTATION" />

<elementType name="PI" notation="PI.model" />
<notation name="PI.model">
        <contentModel>(html:body?, handler*, parameter*)</contentModel>
</notation>
<attribute elementType="PI" notation="PI.attribute.model" />
<notation name="PI.attribute.model" >
        <contentModel>( id )
        </contentModel>
</notation>
<attributeValue model="PI.attribute.model" 
        name="id" notation="xml:ID" />

<elementType name="comment" notation="comment.model" />
<notation name="comment.model">
        <contentModel>(html:body?, handler*, parameter*)</contentModel>
</notation> 
<attribute elementType="comment" notation="comment.attribute.model" />
<notation name="comment.attribute.model" >
        <contentModel>( id? &amp; notation )
        </contentModel>
</notation>
<attributeValue model="comment.attribute.model" 
        name="id" notation="xml:ID" /> 
<attributeValue model="comment.attribute.model" 
        name="notation" notation="xml:NOTATION" />

<elementType name="data" notation="data.model" />
<notation name="data.model">
        <contentModel>(html:body?, handler*, parameter*)</contentModel>
</notation>
<attribute elementType="data" notation="data.attribute.model" />
<notation name="data.attribute.model" >
        <contentModel>( id? &amp; notation )
        </contentModel>
</notation>
<attributeValue model="data.attribute.model" 
        name="id" notation="xml:ID" /> 
<attributeValue model="data.attribute.model" 
        name="notation" notation="xml:NOTATION" />

<elementType name="elementType" notation="elementType.model" />
<notation name="elementType.model">
        <contentModel>(html:body?, handler*, parameter*)</contentModel>
</notation>
<attribute elementType="elementType" notation="elementType.attribute.model" />
<notation name="elementType.attribute.model" >
        <contentModel>( name &amp; notation &amp; namespace? )
        </contentModel>
</notation> 
<attributeValue model="elementType.attribute.model" 
        name="name" notation="xml:ID" /> 
<attributeValue model="elementType.attribute.model" 
        name="namespace" notation="xml:CDATA" /> 
<attributeValue model="elementType.attribute.model" 
        name="notation" notation="xml:NOTATION" />

<elementType name="attribute" notation="attribute.model" />
<notation name="attribute.model">
        <contentModel>(html:body?, handler*, parameter*)</contentModel>
</notation>
<attribute elementType="attribute" notation="attribute.attribute.model" />
<notation name="attribute.attribute.model" >
        <contentModel>( name? &amp; elementType? &amp; namespace? &amp; notation )
        </contentModel>
</notation>
<attributeValue model="attribute.attribute.model"  
        name="name" notation="xml:ID" /> 
<attributeValue model="attribute.attribute.model" 
        name="elementType" notation="xml:IDREF" /> 
<attributeValue model="attribute.attribute.model" 
        name="namespace" notation="xml:CDATA" /> 
<attributeValue model="attribute.attribute.model" 
        name="notation" notation="xml:NOTATION" />

<!-- to be done:

<!ELEMENT attributeValue (html:body?, handler*, parameter*)>
<!ATTLIST attributeValue 
 name IDREF #IMPLIED
 elementType IDREF #IMPLIED
 namespace CDATA #IMPLIED
 notation NOTATION #REQUIRED >

-->

<elementType name="regularExpression" notation="regularExpression.model" />
<notation name="regularExpression.model">
        <contentModel>(#PCDATA)</contentModel>
</notation>
<attribute elementType="regularExpression" notation="regularExpression.attribute.model" />
<notation name="regularExpression.attribute.model" >
        <contentModel>( classes? &amp; tokenizer? )
        </contentModel>
</notation>
<attributeValue model="regularExpression.attribute.model" 
        name="classes" notation="xml:NOTATION" /> 
<attributeValue model="regularExpression.attribute.model" 
        name="tokenizer" notation="xml:IDREF" />

<elementType name="lexicalModel" notation="lexicalModel.model" />
<notation name="lexicalModel.model">
        <contentModel>( lexical )*</contentModel>
</notation>
<attribute elementType="lexicalModel" notation="lexicalModel.attribute.model" />
<notation name="lexicalModel.attribute.model" >
        <contentModel>( classes? &amp; tokenizer? )
        </contentModel>
</notation>
<attributeValue model="lexicalModel.attribute.model" 
        name="classes" notation="xml:NOTATION" /> 
<attributeValue model="lexicalModel.attribute.model" 
        name="tokenizer" notation="xml:IDREF" />

<!--ATTLIST lexicalModel 
   classes NOTATION  "xml:schema.lexical-types" 
   tokenizer IDREF   "xml:xml.character" -->

<!-- to do <!ELEMENT lexical  ( #PCDATA )>
-->

<elementType name="contentModel" notation="contentModel.model" />
<notation name="contentModel.model">
        <contentModel>(#PCDATA)</contentModel>
</notation>
<attribute elementType="contentModel" notation="contentModel.attribute.model" />
<notation name="contentModel.attribute.model" >
        <contentModel>( classes? &amp; tokenizer? )
        </contentModel>
</notation>
<attributeValue model="contentModel.attribute.model" 
        name="classes" notation="xml:NOTATION" /> 
<attributeValue model="contentModel.attribute.model" 
        name="tokenizer" notation="xml:IDREF" />

<!--ATTLIST contentModel 
   classes NOTATION  #IMPLIED  
   tokenizer IDREF   "xml:dom.element-nodes" -->

<!--ATTLIST contextModel
   classes NOTATION "xml:xsl.patterns"
   tokenizer IDREF "xml:dom.element-nodes" -->

<elementType name="contextModel" notation="contextModel.model" />
<notation name="contextModel.model">
        <contentModel>(#PCDATA)</contentModel>
</notation>
<attribute elementType="contextModel" notation="contextModel.attribute.model" />
<notation name="contextModel.attribute.model" >
        <contentModel>( classes? &amp; tokenizer? )
        </contentModel>
</notation>
<attributeValue model="contextModel.attribute.model" 
        name="classes" notation="xml:NOTATION" /> 
<attributeValue model="contextModel.attribute.model" 
        name="tokenizer" notation="xml:IDREF" />

<!--ATTLIST BNF 
   terminals NOTATION "xml:rfc????.ebnf"
   tokenizer IDREF    "xml:xml.character" -->

<elementType name="BNF" notation="BNF.model" />
<notation name="BNF.model">
        <contentModel>(production)*</contentModel>
</notation>
<attribute elementType="BNF" notation="BNF.attribute.model" />
<notation name="BNF.attribute.model" >
        <contentModel>( terminals? &amp; tokenizer? )
        </contentModel>
</notation>
<attributeValue model="BNF.attribute.model" 
        name="terminals" notation="xml:NOTATION" /> 
<attributeValue model="BNF.attribute.model" 
        name="tokenizer" notation="xml:IDREF" />

<elementType name="production" notation="production.model" />
<notation name="production.model">
        <contentModel>(#PCDATA)</contentModel>
</notation>
<attribute elementType="production" notation="production.attribute.model" />
<notation name="production.attribute.model" >
        <contentModel>( nonterminal )
        </contentModel>
</notation>
<attributeValue model="production.attribute.model" 
        name="nonterminal" notation="xml:NMTOKEN" />

<elementType name="userDefined" notation="userDefined.model" />
<notation name="userDefined.model">
        <contentModel>ANY</contentModel>
</notation>
<attribute elementType="userDefined" notation="userDefined.attribute.model" />
<notation name="userDefined.attribute.model" >
        <contentModel>( name &amp; notation &amp; tokenizer? )
        </contentModel>
</notation>
<attributeValue model="userDefined.attribute.model" 
        name="name" notation="xml:ID" /> 
<attributeValue model="userDefined.attribute.model" 
        name="notation" notation="xml:NOTATION" /> 
<attributeValue model="userDefined.attribute.model" 
        name="tokenizer" notation="xml:IDREF" />

<elementType name="notation" notation="notation.model" />
<notation name="notation.model">
        <contentModel>( html:body?, 
                ( regularExpression | lexicalModel | contentModel 
                 | contextModel| BNF | userDefined  )*, handler*, parameter*)
        </contentModel>
</notation>
<attribute elementType="notation" notation="notation.attribute.model" />
<notation name="notation.attribute.model" >
        <contentModel>( name &amp; system? &amp; public? &amp; mime? )
        </contentModel>
</notation>
<attributeValue model="notation.attribute.model" 
        name="name" notation="xml:ID" /> 
<attributeValue model="notation.attribute.model" 
        name="system" notation="xml:SYSTEM" /> 
<attributeValue model="notation.attribute.model" 
        name="public" notation="xml:PUBLIC" /> 
<attributeValue model="notation.attribute.model" 
        name="mime" notation="MIME:mediaType" />

</schema>