Grammars

A RAN stream can be processed at many levels of granularity, depending on the requirement.

The most basic division is that a RAN document is made from a number of independent partitions called fragments.

RAN can be lexed at many different grains, the most important of which are listed below. In general, there are six basic levels of granularity:

  • Finite-stream tags (look for <<<< at start and end only)
  • Fragment identification (look for <<<)
  • Tag identification (look for <)
  • Context tags (look for <<)
  • Attribute tokenizing (look for spaces, =, >, " etc.)
  • Attribute indicator link-typing

Then there are several levels of “notation” parsing that may be done by the parser, or lazily on demand, or by some application layer.

  • Attribute-value data-typing  (see section Extended Lexical Datatypes)
  • Embedded CSV  (see section RAN-CSV)
  • Single-character rich text tags (quasi HTML)
  • “Note attributes” and “PI attributes”
  • Potentially, other notations

An individual partition (i.e. a section of the document that starts with the fragment-start-tag open-delimiter "<<<") may be ill-formed without this making any subsequent partition ill-formed. Each partition can be parsed separately.

In RAN, end-tags (element, scoped-element, fragment) must have a duplicate of the initial ID attribute on the corresponding open-tag: for example, a fragment <<<sec i=F1>>> … <<</sec i=F1>>> repeats i=F1 on its end-tag. Fragments and scoped-elements must have IDs.

Simple Scanning by Regex

The ability to parse to milestones using regular expressions, and then to match values at or between those milestones using the Tokenizing grammar, means that simple queries can be performed on the raw text array without allocating objects.1

Scan for tags (Regular Expression)

The simplest complete lexical scanner breaks the document into sequences of [tag-open delimiter, tag contents, tag-close delimiter, data] capture-group tuples.

stream  := ws* ( 
                 ( "<"+ "/"? "?"? ) 
                 ( [^<>]+ )                       TAG: Max 2^16 bytes 
                 ( ">"+ )?  
                 ( [^<>]* )                       DATA: Max 2^16 bytes
                )+
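
For illustration, here is a minimal sketch of this scanner in Python, assuming the document is already decoded to a string; the group names and the sample document are illustrative, not normative:

  import re

  # Sketch of the simplest complete lexical scanner.
  SCAN = re.compile(r"""
      (?P<tago>  <+ /? \??  )    # tag-open delimiter, with optional / or ?
      (?P<tag>   [^<>]+     )    # tag contents (capped at 2^16 bytes)
      (?P<tagc>  >+         )?   # tag-close delimiter (optional, for recovery)
      (?P<data>  [^<>]*     )    # following data run (also capped at 2^16)
  """, re.VERBOSE)

  sample = '<<<sec i=F1>>><p i=p1>hello</p><<</sec i=F1>>>'
  for m in SCAN.finditer(sample):
      print(m.group('tago'), m.group('tag'), repr(m.group('data')))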

Here is a more complex scanner that also tokenizes inside tags: attributes, PIs and comments. The rules here implement the intended semantic that a < or > delimiter always closes the tag, which provides error recovery.

stream  := \s* (                                        IGNORABLE WHITESPACE
                ("<!--"  [^<>] "--")   |                COMMENT
                (
                 ( "<"+ ("/" | "?" )? )                  TAGO
                 ( [^<>\s\"]+ )                          GI =< 256 bytes
                 ( \s+ |                                 IGNORABLE WHITESPACE
                   ("\"" [^<>\"\s=]+ "\"" ) |            LITERAL
                   ("--" [^<>] "--"  )  |                COMMENT        
                   ("&#\["" [^0-19A-F=]+ "\]" ) |        BINARY BYTE STRING
                   "}"? "=" "="? "{"?              |                  VALUE INDICATOR
                   "[" | "]" |                           TUPLE DELIMITER
                   "..." |                               ELLIPSIS
                   "?" |                                 EXTERNAL
                   ( [^\"=\[\]\.\?\s<>][^<>\s\"=]* ) |  TOKEN   
                   \s+                                   IGNORABLE WHITESPACE     
                 )*
                )
                ( ( ">"+ ) |                             TAGC (1 expected)
                  ( [^<>]+ )                             DATA 
                )*
              )+

A TOKEN is any run of characters containing no ASCII whitespace, <, or > (and that is not a literal, a comment or an ellipsis). It can be further lexically typed using the following rules, tested in order:

  • starts with +, - or digit, then is QUANTITY  e.g. 0 or +1 or 440_Hz/m
  • contains - not at start or end, then is ISO 8601 DATE TIME RANGE  e.g. 2024-08-27
  • contains + or ~ not at start or end, then is ANCHOR PATH  e.g. fred+ginger
  • contains “/”, then is URI  e.g. /x/y
  • contains “:”, then is PREFIXED NAME  e.g. fred:ginger
  • starts with $, then is MONEY  e.g. $26.00 or $26.00_AU_2024-01-16
  • starts with a LATIN 1 block letter, then is SIMPLE (typically an ASCII identifier, logic value etc.)
  • starts with #, then is hex NUMSTR  e.g. RGB "#01AB5F"
  • starts and ends with “%”, then is EXTERNAL
  • starts with a Unicode currency symbol (any value in the currency block), then is MONEY
  • otherwise is SIMPLE

SIMPLE tokens can be further broken down into reserved and pre-defined logic values; all others are NAMEs.
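
A minimal sketch of this ordered classification in Python (the function name is hypothetical, and “LATIN 1 block letter” is read as a Latin-1 letter so that the later # and % rules remain reachable):

  import unicodedata

  def lexical_type(tok: str) -> str:
      # Tests are applied in order; the first match wins.
      inner = tok[1:-1]                    # characters not at start or end
      if tok[0] in '+-' or tok[0].isdigit():
          return 'QUANTITY'                # e.g. 0, +1, 440_Hz/m
      if '-' in inner:
          return 'DATE TIME RANGE'         # e.g. 2024-08-27
      if '+' in inner or '~' in inner:
          return 'ANCHOR PATH'             # e.g. fred+ginger
      if '/' in tok:
          return 'URI'                     # e.g. /x/y
      if ':' in tok:
          return 'PREFIXED NAME'           # e.g. fred:ginger
      if tok[0] == '$':
          return 'MONEY'                   # e.g. $26.00
      if tok[0].isalpha() and ord(tok[0]) <= 0xFF:
          return 'SIMPLE'                  # Latin-1 letter start
      if tok[0] == '#':
          return 'NUMSTR'                  # e.g. #01AB5F
      if tok[0] == '%' and tok[-1] == '%':
          return 'EXTERNAL'
      if unicodedata.category(tok[0]) == 'Sc':
          return 'MONEY'                   # any Unicode currency symbol
      return 'SIMPLE'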

Scan for Fragment boundaries only (Regular Expression)

stream ::= .*(<<<[^/]*)*

or
stream ::= .*((<<<[^/]*<<<[/].*>>>).*)*

Fragments cannot nest, so matching start and end tags is reliable. Fragment comments, PIs, etc. are excluded.
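
A sketch of this boundary scan in Python (the sample and variable names are illustrative):

  import re

  raw = '<<<sec i=F1>>> <p>data</p> <<</sec i=F1>>> <<<sec i=F2>>> <<</sec i=F2>>>'

  # "<<<" not followed by "/" or "<" opens a fragment; "<<</" opens its end-tag.
  starts = [m.start() for m in re.finditer(r'<<<(?![/<])', raw)]
  ends   = [m.start() for m in re.finditer(r'<<</', raw)]
  # Because fragments cannot nest, starts[i] pairs with ends[i].
  print(list(zip(starts, ends)))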

Scan for Fragment Tags only (Regular Expression)

stream ::= .*(<<<[^<].+>>>[^>].*)*

Scan for Scoped Element boundaries in Fragments (Regular Expression)

fragment contents ::= .*(<<[^/].+.*)*

Scan for any Tags in Fragments (Regular Expression)

fragment contents ::= .*((<<?.+>>?).*)*

Scan for any Tags in Scoped Elements, or in Fragments with no Scoped Elements (Regular Expression)

element contents ::= .*((<.+>).*)*

Scan for Tags (Regular Expression): 2

stream ::= \s*((<<?<?.+>>?>?).*)*

Encoding Grammar:3

stream ::= UNIT*
UNIT ::= U8* | U16*4
U8 ::= BYTE > 8
U16 ::= WORD > 8
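
A sketch of the corresponding check for UTF-8 input, following the interpretation in the footnotes (code units 0-8 terminate the stream):

  def stream_end(data: bytes) -> int:
      # Every UTF-8 code unit must be > 8; NULL..BS (0-8) are terminal
      # for the stream, but not an error if all fragments are complete.
      for i, b in enumerate(data):
          if b <= 8:
              return i          # offset at which the stream terminates
      return len(data)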

Tokenizing by Grammar

Tokenizing grammar:5

stream   ::= ws* ( TAGO IN-MARKUP TAGC IN-DATA )*
TAGO     ::= “<”{1..4} ("/" | "?" | "!" "-"*  )?
TAGC     ::= ("-"+ | "?"  )? “>”{1..4} 
IN-MARKUP::= (ws | (“=” “=”? “{”?) | (“}=” “=”?) |
    LITERAL | TOKEN | ELLIPSIS | COMMENT | BINARY)*
ELLIPSIS  ::= "..."
COMMENT  ::= "--" ( ANY except "<>" )* "--"  
IN-DATA  ::= (ANY | REFERENCE)*         // so: ANY = [^<>&]
TOKEN    ::= (ANY except "=" | REFERENCE)+ // so: ANY = [^<>&=”ws]
LITERAL  ::= “"” (ANY except "=" | REFERENCE)* “"”
                                        // so: ANY = [^<>&=”ws]
BINARY   ::= "&#[" (ANY except "]<>" )+  "]" 
REFERENCE::= “&” (ANY except "=")+ “;”  // so: ANY = [^<>&=”ws;]
TOKEN    ::= BYTE{1..256} | WORD{1..128}  // length constraint on TOKEN

There are two kinds of character references, which can be used at any point in the document6, both in data and in tags, but are not then lexed as delimiters7:

  • Numeric character references (delimiters “&#x” and “;”) are the hex Unicode number.

  • Named character references (delimiters “&” and “;”) are the standard W3C entity set version.8

Full Tag Grammar

In the following, "ws" is potentially ignorable (ASCII) whitespace.9

document ::= head body tail
head     ::= ( pi | ws)*
body     ::= ( finite-stream | stream | scoped-element | element )? 
tail     ::= ( pi | ws)*

finite-stream ::= finite-stream-start-tag stream finite-stream-end-tag
stream    ::= (fragment | pi | ws*)+ 
finite-stream-start-tag ::= “<<<<” name  scoped-unique-identifier ATTRIBUTES “>>>>” (comment | ws)*
finite-stream-end-tag   ::= “<<<</” name  scoped-unique-identifier COMMENT* “>>>>”

fragment  ::= ( fragment-start-tag f-contents fragment-end-tag )
fragment-start-tag ::= “<<<” name  scoped-unique-identifier ATTRIBUTES “>>>”
fragment-end-tag   ::= “<<</” name  scoped-unique-identifier COMMENT* “>>>”
                  (comment | ws)*
f-contents ::=  ( element | scoped-element )*
fragment-comment ::= "<<<!" "-"* FREE-TEXT "-"* "->>>" 
fragment-pi ::= "<<<?" name FREE-TEXT "?>>>"   

scoped-element ::= ( scope-start-tag  contents scope-end-tag ) 
scope-start-tag ::= “<<” name scoped-unique-identifier ATTRIBUTES “>>”
      (comment | ws)*
scope-end-tag ::= “<</” name  scoped-unique-identifier  COMMENT*  “>>”
scope-fragment-comment ::= "<<!" "-"* FREE-TEXT "-"* "->>" 
scope-fragment-pi ::= "<<?" name FREE-TEXT "?>>" 

element   ::= ( start-tag contents end-tag ) | list-content
start-tag ::= “<” name  scoped-unique-identifier? ATTRIBUTES “>”
      (comment | ws)*
end-tag   ::= “</” name?  scoped-unique-identifier?  COMMENT* “>”


contents  ::= ( TEXT | element | scoped-element  )*

comment   ::= “<!” “-”* FREE-TEXT “-”* “->” 
pi        ::= “<?” name FREE-TEXT-OR-ATTRIBUTES  “?>” 


FREE-TEXT     ::= [^<>]*
FREE-TEXT-OR-ATTRIBUTES ::= ATTRIBUTES | FREE-TEXT
ATTRIBUTES::= ws?
   (key | reference | attribute | ellipsis | COMMENT |
    content-key | content-reference | content-attribute)*
scoped-unique-identifier  ::= name ws* "}"? "=" ws* value ws* // i := encouraged
key         ::= name ws* "}=" ws* value ws*
reference   ::= name ws* "={" ws* value ws*
attribute   ::= name  ws* "=" ws* value ws* 

content-key        ::= name ws* "}==" ws* value ws*
content-reference  ::= name ws* "=={" ws* value ws* 
content-attribute  ::= name  ws* "==" ws* value ws*

COMMENT    ::= "--"  FREE-TEXT "--"

name       ::= name-token (“:” name-token)? | literal
value      ::= lexically-typed-token | literal | tuple | binary-byte-string
literal    ::= “"” (TEXT except "=") “"”
tuple      ::= "[" ( ws+ | lexically-typed-token | literal )* "]"
binary-byte-string ::= "&#[" ([0-9A-F]{2})+ "]"
ellipsis   ::= "..."
name-token ::= [^\.\-=<>][^-=<>]*

See the section Extended Lexical Datatypes for the grammar of lexically-typed-tokens. These utilize various symbols, such as $, +, €, ¤, °, and so on, to select a datatype and parse the components. This reduces markup, as the RAN datatypes implement more of the base standards than the datatypes of the common markup languages and XML Schemas do.

  • All references start with the delimiter & and end with ; and "&" is never used literally. Use &amp;.
  • All tags start with the delimiter < and end with the delimiter > and "<" and ">" are never used literally. Use &lt; and &gt;.
  • All delimiter recognition takes only 1 character of lookahead (open delimiter <) or lookbehind, except for fragment tags and scoped-element tags.11
  • The matching tag delimiter pairs are </? >   <? ?>   <! ->   <: :>
  • The delimiter set is <>&;”?!/:- []#%=. _{}

Inside their delimiters, tags follow one of three lexical patterns:

  • NAMED-TAG-WITH-ATTRIBUTES such as a start-tag,
  • UNNAMED-TAG such as a comment, and
  • NAMED-TAG-WITH-FREE-CONTENT-OR-ATTRIBUTES such as a processing instruction or end-tag.
    • If the contents contain “=” then they are parsed as RAN attribute syntax (as “PI attributes” or “Note attributes”), otherwise they are free text. The expectation is that these would be parsed on demand (as an invocation parameter), and that the simple parse tree for the fragment would not otherwise include this information.
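
A minimal sketch of this “=” rule in Python (the whitespace split is a simplification of the real attribute tokenizer, and the function name is hypothetical):

  def tag_interior_kind(content: str):
      # Contents containing "=" are parsed as RAN attribute syntax
      # ("PI attributes" / "Note attributes"); otherwise kept as free text.
      if '=' in content:
          pairs = [t.split('=', 1) for t in content.split() if '=' in t]
          return ('ATTRIBUTES', dict(pairs))
      return ('FREE-TEXT', content)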

Inside literals in tags (e.g. attribute values), the “=” character cannot be used literally: use &eq;.

Comments and PIs

  • There are two kinds of comments:
    • Comment tags can only occur immediately after a start-tag, and belong to that start-tag.
    • Attribute comments can appear inside a start- or end-tag between “--” and “--” delimiters. The characters “<>=&” must use character references, and if there are two consecutive “-” characters then one of them must be a character reference.   e.g.
      <a b="c" -- this is a comment --  d="f" >...</a>
  • PIs can only appear at the very start and end of a document, and belong to the document.
    • A PI  not in the first or last position should be ignored. It may cause an error or warning.
  • Text runs that only contain whitespace and are before or after a comment or PI are not reportable as text data.
  • Thus, after the initial comments, the contents of an element/scope can only be text data or elements/scopes. This simplifies the structures developers need to cope with.
  • PIs between fragments are not an error, but are thrown away: in finite streams, this allows new fragments to be appended with following PI metadata, which in turn gets invalidated by the next fragment.

Whitespace text runs

A whitespace text run is any run of text between tags that only contains ASCII whitespace: such text is Ignorable (not part of the Infoset) or Reportable. Not all Reportable whitespace is necessarily significant. Whether whitespace is Ignorable depends only on the two immediately surrounding tags.

  • Whitespace runs that appear between two comments or start-tags (of any kind) or between two end-tags (of any kind) are Ignorable. I.e.,
    •  */node()[1][self::text()][normalize-space(.)='']  
    • */node()[last()][self::text()][normalize-space(.)='']  
  • Whitespace runs that come before or after a stream, fragment or scope start- or end-tag are Ignorable.
  • Other whitespace runs are Reportable: in particular, whitespace between an end- and a start-tag (internal whitespace) or between a start- and an end-tag (blank element).

The intent is to reduce the number of nodes in a DOM, to allow fragments to be represented as a simple list, and to make indentation of the first few levels harmless.
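
A sketch of this classification in Python, assuming the two tags surrounding a whitespace-only run are available as raw strings (the function and its grouping of comments with start-tags are interpretations, not normative):

  def whitespace_is_ignorable(prev_tag: str, next_tag: str) -> bool:
      def kind(tag: str) -> str:
          body = tag.lstrip('<')
          if body.startswith('!'):
              return 'start'        # comments group with start-tags here
          return 'end' if body.startswith('/') else 'start'
      # Rule 1: between two comments/start-tags, or between two end-tags.
      if kind(prev_tag) == kind(next_tag):
          return True
      # Rule 2: adjacent to a stream, fragment or scope tag (2+ "<").
      if prev_tag.startswith('<<') or next_tag.startswith('<<'):
          return True
      # Rule 3: end-then-start or start-then-end runs are Reportable.
      return False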

Character References and Binary Byte Strings

There are three kinds of character references:

  • Named character reference
    • e.g., &mdash;
    • All and only SGML/MathML/HTML public entities are built-in.
  • Hex Numeric character references
    • e.g., &#x00A1;
    • All non-control Unicode characters are allowed. (i.e. not C0 or C1; therefore no 0x00 NULL)
  • Binary Byte String
    • Is a kind of literal
    • e.g., &#[01ABCDEF]
    • A list of pairs of hexadecimal numbers, each specifying a byte. These can be any byte including 0x00 NULL.
    • If a Binary Byte String is found, the parser can
      • generate a Fragment- or Scope-level WF error (default) and strip the reference,
      • OR, at system option, include the reference literally (as the actual characters),
      • OR, at system option, include the decoded sequence of bytes; this may be dangerous if exploited in some attack. It is only available as a data value.

Character references are all dereferenced after all lexical operations are performed. Hence:

  • Character references may not be used for basic name characters: [a-zA-Z0-9]
  • The only way to represent the delimiters <>& in text is to use a character reference.
  • A character reference, e.g. &lt;, is never recognized as a delimiter of any kind, and never changes the recognition of markup.
  • Character references are supported in tokens (including element names), literals, data content, and processing instructions.

The replacement UTF-8 for any character reference never takes up more bytes than the reference itself. So dereferencing can be done in a byte buffer without having to grow it.
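
A sketch of such in-place dereferencing for numeric references, in Python over a bytearray (named references would add a lookup table; error handling is omitted):

  def deref_numeric_in_place(buf: bytearray) -> int:
      # Rewrites "&#xHHHH;" references to their UTF-8 bytes, in place.
      # Safe because the replacement is never longer than the reference.
      src = dst = 0
      while src < len(buf):
          if buf[src:src + 3] == b'&#x':
              end = buf.index(b';', src)      # assumes a terminated reference
              rep = chr(int(buf[src + 3:end], 16)).encode('utf-8')
              buf[dst:dst + len(rep)] = rep
              dst += len(rep)
              src = end + 1
          else:
              buf[dst] = buf[src]
              dst += 1
              src += 1
      del buf[dst:]                           # buffer shrinks, never grows
      return dst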

Constraints:

  • The name in an end-tag must match12 that of the corresponding start-tag.
  • If a fragment start- and end-tag have identifier attributes (using :=), they must match.
  • An element or attribute name (TOKEN) must not exceed 256 bytes (UTF-8) or 128 ushorts (UTF-16) before any reference substitutions are made.14
  • Each tag or text run cannot exceed length 2^16 (64K).
  • Each fragment cannot exceed length 2^32 - 2^12 (4 Gig - 4k).
    • The -4k is to ensure the fragment can start anywhere in the first page (4k) for potential alignment purposes.
  • The grammar gives the lexical detection rules for boolean, date, number and name. The value should be checked for full lexical correctness before it is first used, but there is no requirement to check full lexical correctness or value-validity if the datatype is not accessed.

  • An attribute specified as x=y is not the same as x=“y” nor “x”=y nor “x”=“y”.13

  • Element end-tags and comments use the same internal syntax as processing instructions: they start with a name then may contain any text, including references, until the tag close delimiter “>” or “>>”. This free text is treated as a comment, not a PI or post-fixed attributes. (Implementation Note: The second token in a fragment close tag may, for example, contain a content checksum, made by adding all characters outside tags and all characters in attribute literals, with references resolved and whitespace stripped; the purpose of this is to detect data transmission errors or naive changes; it is not a security measure, nor will it detect problems with whitespace corruption. A sketch of such a checksum appears after this list.)

  • For names and other tokens in attribute values, the maximum number of consecutive non-delimiter alphanumeric bytes in UTF-8 is 255.15 (So very long names need to use "_", "-" or ".")

  • APIs exposing the data should expose name tokens in an NFC-normalized form.16
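
As promised above, a sketch of the kind of content checksum an implementer might place in a fragment close tag (the function and its 32-bit truncation are assumptions, not part of RAN):

  def content_checksum(text_runs, attribute_literals) -> int:
      # Sum the code points of all characters outside tags plus all
      # characters in attribute literals, with references already resolved
      # by the caller and whitespace stripped here. Not a security measure.
      total = 0
      for run in list(text_runs) + list(attribute_literals):
          total += sum(ord(c) for c in run if not c.isspace())
      return total & 0xFFFFFFFF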

Possible Parser Interface (To be revised)

This Russian Doll design allows streamlined access to individual items. For example, fast iterators or cursors such as the following, which can take no argument or take a name and/or ID:

  • ran::getNextStartFragmentInDocument() -> Fragment
  • ran::getNextStartContextInFragment() -> Context
  • ran::getNextStartElementInContext() -> StartElement
  • ran::getNextAttributeInStartTag() -> Attribute
  • ran::getNextIdInStartTag() -> ID

Or chainable:   /*/*[@i="drugs"]/*[@i="pain"]/drug[@i="aspirin"][@dangerous="NO"]/@dosage

d: RanDocument = new RanDocument(...);
q: Quantity?  =
d.getNextFragmentStart("", "drugs")              // iterate through fragment start-tags
 .getNextContextStart("", "pain", LAZY)          // iterate through context start-tags
 .getNextElementStart("drug", "aspirin", FORGET) // iterate through element start-tags
 .getElementNode(LAZY | DESC)                    // get node, or parse if not parsed
 .whereAttribute("dangerous", Logic("NO"))       // condition
 .getNextAttribute("dosage", "*")                // iterate through attributes
 .getTypedValue(FORGET);                         // get quantity

where the string arguments are regexes that match an id (@i or first-position :=), and the flags are different parse actions:

  • FORGET - do not update info or objects on intermediate tags and data, until the one is found
  • LAZY - do not parse attributes except id, nor child elements, until the one is found
  • (none) - parse all elements or attributes until the one is found
  • TYPE - parse all elements or attributes and do datatyping until the one is found
  • FULL - pre-parse all element attributes and do datatyping

and different scopes:

  • (none) - just this tag or node
  • DESC - do same on descendants
  • SBL - do same on siblings
  • BRANCH - do same on entire branch that contains this
  • SCOPE - do same on entire scope
  • FRAGMENT - do same on entire fragment
  • ALL - do same on stream
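
A sketch of how these flags might combine in an API, using a Python Flag enum (the names follow the lists above; the combination mechanism itself is an assumption):

  from enum import Flag, auto

  class Parse(Flag):
      FORGET   = auto()   # parse actions
      LAZY     = auto()
      TYPE     = auto()
      FULL     = auto()
      DESC     = auto()   # scopes
      SBL      = auto()
      BRANCH   = auto()
      SCOPE    = auto()
      FRAGMENT = auto()
      ALL      = auto()

  opts = Parse.LAZY | Parse.DESC   # e.g. lazy parsing applied to descendants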

Generic Master Parsing Function (To be revised)

A master generic parser function that combines parsing and simple query might be:

  • parseRanDocument(
        from: Node,
        direction: enum { forwards | backwards },
        fragment_ids: []ID,
        ids: []ID,
        produce_nodes: bitset { fragments, scoped-elements, elements, next-sibling-context, all-sibling-context, all_elements, attributes, children, typed-attributes, note-attributes, processing-instructions, processing-instruction-attributes, comments, links },
        extended: bitset { table, CSV, rich-text, ... },
        validate_nodes: bitset { fragments, scoped-elements, elements, next-sibling-context, all-sibling-context, attributes, typed-attributes, note-attributes, processing-instructions, processing-instruction-attributes, comments, links },
        validation: bitset { fatal, error, warning, info, hint, verbose }
    )
        => Result {
               (validityReport, fragmentNode(), Node(), extended data)*,
               status? }

This function would

(a) Scan the document (backwards or forwards from some position in the document) for the fragments with the given fragment IDs

(b) Parse the fragment to the extent needed to produce the particular kinds of nodes specified in “produce_nodes”.

  • E.g. produce_nodes of FRAGMENTS|SCOPED-ELEMENTS|ELEMENTS|NEXT_SIBLING_CONTEXT would, if looking for an element with ID “x123”, produce a tree of nodes containing the fragment, any elements or scoped-elements between the fragment and the identified element, the identified element, and the immediate sibling elements of these, but no other child nodes. It would not produce attribute values. If this result were copied by value, it would lose the ability to have the other parts lazily evaluated. Otherwise, asking for an attribute value would cause parsing of the values.

(c) The document is also validated, using the same options. 

  • So the information validated can be different (less or more) than the information produced: for example, from “validate the whole document but do not produce a parse tree” to “produce a complete parse tree but do not validate it”.

The intent is that the minimum-necessary subtree can be extracted at parse time, with the maximum appropriate laziness, for a particular task.

Footnotes

1This makes it straightforward for developers to exploit optimizations available on modern CPU and disks: SIMD, Vectorization, parallel threads, SPMD, predictable prefetch, cache sizes, locality from in-place referencing.

2This grammar can be used to locate fragments and element tags.

3Implementation Note: So the implementation is free to use NULL, SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS (code points 0-8) internally. Their presence in the stream is terminal for the stream but not an error if all fragments and tags are complete. The other 19 non-whitespace controls are deprecated but unchecked, for efficiency.
An implementation may also check for the C1 and the other non-whitespace C0 control characters.

4If the raw data is UTF-8 then the UNIT is a byte, and a character is a sequence of 8-bit units. If the raw data is UTF-16, the UNITs are 16-bit units (in the case of a PUA or non-BMP Unicode repertoire character, the character may take more than one 16-bit unit). The intent is that an implementation supports one or the other or, preferably, both.

5This grammar can be used to locate fragment and element tags, tokenize start-tags and handle references.

6The use of references in names and ids is allowed but deprecated: it could break raw scanning.

7Implementation impact: This has the effect that the document can be navigated entirely using delimiters. There are no nested entities.

8Implementation Note: The size of the character reference replacement text is never more than the character reference itself. Character reference resolution of a string can be performed in-place, without needing a new buffer.

9These grammar productions should be interpreted as mutually exclusive: an identifier required by a prior production is excluded from a subsequent production: e.g. a NAMED-TAG-WITH-FREE-CONTENT has an implicit exclusion of “>”. Reference substitution occurs in tokens and typed-values before sub-parsing, and in TEXT.

11Implementation Note: because fragments cannot nest, the next << after a fragment-start’s << must be the opening delimiter of the corresponding fragment-close. So for linear lexing, the delimiters effectively only need 1 character of lookahead from the initial <, with the “/” being a confirming check on the even-numbered fragment tags.

12An implementer may use any form of matching (raw bytes, character-reference-substituted bytes, NFC-normalized, etc.) as suits the implementation and deployment goals of the system.

13However, if converting it to some other form without datatyped values or quoted element names, an implementer may decide that they are equivalent.

14Implementation Note: Therefore the whole of any raw token can be read into a 128 byte SIMD vector, before and after character reference substitution, and before and after any Unicode Normalization.

15Implementation Note: Therefore a raw small-token converted from UTF-8 to UTF-16 can always be contained in two 128-byte SIMD vectors. For example, a name with only Latin alphabetic characters is restricted to 128 Unicode characters; a name with only Han ideographs is restricted to 256/3 = 85 characters. This may also allow convenient SIMD filtering to exclude, e.g., ASCII-only or Han-only tokens from attempts to normalize them.

16Implementation Note: Normal text content and the content of quoted attributes should not be normalized by APIs. If the data source is private or trusted to produce normalized names in tokens, an implementation may bypass normalization.

17Implementation Note: This functionality must be implemented and exposed in some API. Given a reference to F1:X123, the value can be found in unparsed raw text by first scanning for the fragment (a linear scan of the text for “<<<[^/]”) until the fragment with @id of “F1” is found, then scanning start-tags up to the next “>” for the first attribute value of “X123”. This is a rough-and-ready text operation that can be performed on the raw text, to simulate ID/IDREF or keyed links, and it relies on the reference attribute value being unique among all attribute values in the fragment (not only ID values: it has nothing to do with ID types).
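
A rough sketch of that raw-text lookup in Python (regexes and names are illustrative; quoting and reference resolution are ignored):

  import re

  def find_in_fragment(text: str, frag_id: str, value: str) -> int:
      # Linear scan for fragment start-tags, then for the value in start-tags.
      for frag in re.finditer(r'(?<!<)<<<[^/<][^<>]*>>>', text):
          if frag_id in frag.group(0):                # crude @id test
              body_start = frag.end()
              body_end = text.find('<<</', body_start)
              body = text[body_start:body_end]
              for tag in re.finditer(r'<[^<>/][^<>]*>', body):
                  if value in tag.group(0):           # first matching value
                      return body_start + tag.start() # offset of the tag
      return -1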