Grammars

A RAN stream can be processed at many levels of granularity, depending on the requirement.

The most basic division is that a RAN document is made from a number of independent partitions called fragments.

RAN can be lexed at many different grains, the most important of which are listed below. In general, there are six basic levels of granularity:

  • Finite-stream tags (look for <<<< at start and end only)
  • Fragment identification  (look for <<<)
  • Tag identification  (look for <)
  • Context tags (look for <<)
  • Attribute tokenizing  (look for spaces, =, >, " etc.)
  • Attribute indicator link-typing

Then there are several levels of “notation” parsing that may be done by the parser, or lazily on demand, or by some application layer.

  • Attribute-value data-typing  (see section Extended Lexical Datatypes)
  • Embedded CSV  (see section RAN-CSV)
  • Single-character rich text tags (quasi HTML)
  • Potentially, other notations 

An individual partition (i.e. a section of the document that starts with the fragment-start-tag open-delimiter "<<<") may be ill-formed without this making any subsequent partition ill-formed. Each partition can be parsed separately.

In RAN, end-tags (element, scoped-element, fragment) must have a duplicate of the initial ID attribute on the corresponding open-tag. Fragments and scoped-elements must have IDs.

Simple Scanning by Regex

The ability to parse to milestones using regular expressions, and then to match values at or between those milestones using the Tokenizing grammar, means that simple queries can be performed on the raw text array without allocating objects.1

Scan for tags (Regular Expression)

The simplest complete lexical scanner breaks the document into sequences of [tag-open delimiter, tag contents, tag-close delimiter, data] capture-group tuples.

stream  := ws* ( 
                 ( "<"+ "/"? "?"? ) 
                 ( [^<>]* )                       TAG: Max 2^16 bytes 
                 ( ">"+ )?  
                 ( [^<>]* )                       DATA: Max 2^16 bytes
                )+
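
For illustration, here is a minimal sketch of this scanner in Python (the binding is an assumption of this sketch, and the 2^16 length caps are not enforced here):

import re

# Capture groups: 1 = tag-open delimiter, 2 = tag contents,
# 3 = tag-close delimiter (may be absent on a malformed tag), 4 = data.
SIMPLE_SCAN = re.compile(r'(<+/?\??)([^<>]*)(>+)?([^<>]*)')

def scan(text):
    for m in SIMPLE_SCAN.finditer(text):
        yield m.groups()

# Hypothetical input; prints ('<<<', 'doc id==d1', '>>>', ' '),
# ('<', 'p', '>', 'hi') and ('</', 'p', '>', '').
for tago, tag, tagc, data in scan('<<<doc id==d1>>> <p>hi</p>'):
    print(tago, tag, tagc, data)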

Here is a more complex scanner that also tokenizes inside tags: attributes, PIs and comments. The rules here implement the intended semantic that a < or > delimiter always closes the tag, and so provide error recovery.

stream  := \s* (                                        IGNORABLE WHITESPACE
                ("<--" [^<>]* "-->")   |                COMMENT
                (
                 ( "<"+ ("/" | "?" |"_" )? )             TAGO
                 (
                  ( [^<>\s\"]+ )                          GI: max 256 bytes
                  (\s+                                    IGNORABLE WHITESPACE
                   ("\"" [^<>\"\s=]+ "\"" ) |             LITERAL
                   ("&#\[" [0-9A-F]+ "\]" ) |             BINARY BYTE STRING
                   ("="+ "}"? ) |                       VALUE INDICATOR
                   "[" | "]" |                           TUPLE DELIMITER
                   "..." |                               ELLIPSIS
                   "?" |                                 EXTERNAL
                   ( [^\"=\[\]\.\?\s<>][^<>\s\"=]* ) |   TOKEN   
                   \s+                                   IGNORABLE WHITESPACE
                  )*
                 )?
                 ( ("?"? ">"+ ) |                             TAGC (1 expected)
                  ( [^<>]+ )                             DATA 
                 )*
                ) 
              )+
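
A simplified Python sketch of the outer shape of this scanner follows; it captures the raw tag interior in one group, to which the in-tag rules above (LITERAL, BINARY BYTE STRING, etc.) can then be applied separately:

import re

TOKENIZE = re.compile(r'''
    (<--[^<>]*-->)            # COMMENT (must precede the TAGO branch)
  | (<+(?:/|\?|_)?)           # TAGO
    ([^<>]*)                  # raw tag interior (attributes not sub-tokenized here)
    (\??>+)?                  # TAGC; if absent, the next < recovers the error
  | ([^<>]+)                  # DATA
''', re.VERBOSE)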

A TOKEN is any run of characters containing no ASCII whitespace, <, or >. It can further be lexically typed using the following rules, tested in order (a sketch follows below):

  • starts with +- or digit then is QUANTITY e.g. 0 or +1 or 440_Hz/m
  • contains - not at start or end then is ISO 8601 DATE TIME RANGE e.g. 2024-08-27
  • contains + or ~ not at start or end then is ANCHOR PATH e.g. fred+ginger
  • contains “/” then is URI e.g. /x/y
  • contains “:” then is PREFIXED NAME e.g. fred:ginger
  • starts with $ then is MONEY e.g. $26.00 or $26.00_AU_2024-01-16
  • starts with LATIN 1 block character then is SIMPLE (typically an ASCII identifier, logic value etc.)
  • starts with # then is hex NUMSTR e.g. RGB "#01AB5F"
  • starts and ends with “%” then is EXTERNAL
  • starts with Unicode currency (any value in currency block) then is MONEY
  • otherwise is SIMPLE

SIMPLE tokens can be further broken down into reserved and pre-defined logic values; the rest are NAMEs.
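Read strictly top-to-bottom these tests conflict with their own examples (2024-08-27 would type as QUANTITY, and $26.00_AU_2024-01-16 as a date), so the following Python sketch hoists the $ test and excludes internal “-” from the quantity test; the authoritative rules are the Extended Lexical Datatypes grammar. The Latin-1 test is read here as a Latin-1 letter (so the later # and % rules stay reachable), and the currency block is approximated by Unicode category Sc:

import unicodedata

def lexical_type(tok: str) -> str:
    inner = tok[1:-1]                       # "not at start or end"
    if tok[0] == '$':
        return 'MONEY'                      # e.g. $26.00 or $26.00_AU_2024-01-16
    if (tok[0] in '+-' or tok[0].isdigit()) and '-' not in inner:
        return 'QUANTITY'                   # e.g. 0 or +1 or 440_Hz/m
    if '-' in inner:
        return 'DATE-TIME-RANGE'            # e.g. 2024-08-27
    if '+' in inner or '~' in inner:
        return 'ANCHOR-PATH'                # e.g. fred+ginger
    if '/' in tok:
        return 'URI'                        # e.g. /x/y
    if ':' in tok:
        return 'PREFIXED-NAME'              # e.g. fred:ginger
    if tok[0].isalpha() and ord(tok[0]) < 0x100:
        return 'SIMPLE'                     # ASCII identifier, logic value etc.
    if tok[0] == '#':
        return 'HEX-NUMSTR'                 # e.g. #01AB5F
    if len(tok) > 1 and tok[0] == tok[-1] == '%':
        return 'EXTERNAL'
    if unicodedata.category(tok[0]) == 'Sc':
        return 'MONEY'                      # other currency signs, e.g. €26.00
    return 'SIMPLE'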

Scan for Fragment boundaries only (Regular Expression)

stream ::= .*(<<<[^/]*)*

or
stream ::= .*((<<<[^/]*<<<[/].*>>>).*)*

Fragments cannot nest, so matching start and end tags is reliable. Fragment comments, PIs, etc. are excluded.
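A Python sketch of this scan (finite-stream <<<< tags are not handled); the lazy matches are safe because a fragment's contents cannot contain another “<<<”:

import re

# group 1: start-tag interior, 2: fragment contents, 3: end-tag interior
FRAGMENT = re.compile(r'<<<(?![/<])(.*?)>>>(.*?)<<</(.*?)>>>', re.DOTALL)

def fragments(raw: str):
    for m in FRAGMENT.finditer(raw):
        yield m.groups()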

Scan for Fragment Tags only (Regular Expression)

stream ::= .*(<<<[^<].+>>>[^>].*)*

Scan for Scoped Element boundaries in Fragments (Regular Expression)

fragment contents ::= .*((<<[^/].+>>).*)*

Scan for any Tags in Fragments (Regular Expression)

fragment contents ::= .*((<<?.+>>?).*)*

Scan for any Tags in Scoped Elements or Fragments with no Scoped Elements (Regular Expression)

element contents ::= .*((<.+>).*)*

Scan for Tags (Regular Expression): 2

stream ::= \s*((<<?<?.+>>?>?).*)*

Encoding Grammar:3

stream ::= UNIT*
UNIT ::= U8* | U16*4
U8  ::= BYTE > 8                        i.e. any 8-bit unit with value above 0x08
U16 ::= WORD > 8                        i.e. any 16-bit unit with value above 0x0008
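
For the UTF-8 case, a minimal Python sketch of the unit rule: units 0x00–0x08 are not part of any stream and simply terminate it (see the Implementation Note in the footnote):

def stream_end(data: bytes) -> int:
    # Offset at which the stream terminates; len(data) if it runs to the end.
    # Bytes 0x00-0x08 are terminal for the stream, not an error.
    for i, b in enumerate(data):
        if b <= 8:
            return i
    return len(data)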

Complete Tokenizing

Tokenizing grammar: (regular expression)5

stream   ::= ws* ( ((TAGO IN-MARKUP TAGC) | LIST-OPEN | LIST-CLOSE) IN-DATA )*
LIST-OPEN ::= "<_>"
LIST-CLOSE ::= "</_>"
TAGO     ::= “<”{1..4} ( "?" | "/" )?
TAGC     ::= ( "?" )? “>”{1..4}
IN-MARKUP::= ( ws | (“="+ "}”? ) |
             LITERAL | TOKEN | ELLIPSIS | BINARY | TAG-COMMENT )*
TAG-COMMENT ::= "--"  (ANY except <>)* "--"
ELLIPSIS  ::= "..."
IN-DATA  ::= (ANY | REFERENCE)*         // so: ANY = [^<>&]
TOKEN    ::= (ANY except "=" | REFERENCE)+  // so: ANY = [^<>&=”ws]
LITERAL  ::= “"” (ANY except "=" | REFERENCE)* “"”
                                        // so: ANY = [^<>&=”ws]
BINARY   ::= "&#[" (ANY except "]<>" )+  "]"
REFERENCE::= “&” (ANY except "=")+ “;”  // so: ANY = [^<>&=”ws;]
TOKEN    ::= BYTE{1..256} | WORD{1..128}    // length constraint on TOKEN

There are two kinds of character references which can be used at any point in the document6, both in data and in tags, but are not then lexed as delimiters7:

  • Numeric character references (delimiters “&#x” and “;”) are the hex Unicode number.

  • Named character references (delimiters “&” and “;”) are the standard W3C entity set version.8

Full Tag Grammar

In the following, "ws" is any (ASCII) whitespace.9 

document  ::= head? ( body )
head      ::= ip+
body      ::= finite-stream | fragment* | scoped-element | element
finite-stream ::= f-stream-start-tag comment stream f-stream-end-tag
stream        ::= fragment+ 
f-stream-start-tag ::= “<<<<” name  primary-identifier? attributes “>>>>”
f-stream-end-tag   ::= “<<<</” name  primary-identifier? tag-comment* “>>>>” ip*

fragment  ::= fragment-start-tag comment* f-contents fragment-end-tag
fragment-start-tag ::= “<<<” name  primary-identifier? attributes “>>>”
fragment-end-tag   ::= “<<</” name  primary-identifier? tag-comment “>>>” ip*
f-contents ::= (scoped-element)*  element*  

scoped-element ::= ( scope-start-tag comment contents scope-end-tag )  
scope-start-tag ::= “<<” name primary-identifier? attributes “>>” 
scope-end-tag ::= “<</” name  primary-identifier? tag-comment* “>>” 

element   ::= ( start-tag contents end-tag ) | list-content
start-tag ::= “<” name  primary-identifier? attributes “>”
end-tag   ::= “</” (name  primary-identifier?)? tag-comment “>”

contents  ::=  comment* ( tags | TEXT | element | scoped-element )*

comment   ::= “<--" [^<>]* “-->”
list-content ::= “<_>“ (element | ws)* “</_>” 
ip       ::= “<?” name attributes "-"+ “?>”
primary-identifier  ::= name ws* "==" ws* value ws* 
attributes  ::= (uuid | primary | key | reference | attribute | tag-comment)* ws*
uuid        ::= name ws* "====" ws* value ws*
primary     ::= name ws* "===" ws* value ws*
key         ::= name ws* "==" ws* value ws*
reference   ::= name ws* "=}" ws* value ws*
attribute   ::= name  ws* "=" ws* value ws*  
tag-comment ::= "--" [^<>]* "--" ws*
 

name       ::= name-token (“:” name-token)? | literal
value      ::= lexically-typed-token | literal | tuple | binary-byte-string
literal    ::= “"” (TEXT except "=") “"”
tuple      ::= "[" ( ws+ | lexically-typed-token | literal )* "]"
binary-byte-string ::= "&#[" ([0-9A-F]{2})+ "]"
ellipsis   ::= "..."
name-token ::= [^\.\-=<>][^-:=<>]*

See the Extended Lexical Datatypes section of the grammar for lexically-typed-tokens. These utilize various symbols, such as $, +, €, ¤, °, and so on, to select a datatype and parse the components. This reduces markup, as the RAN datatypes implement more of the base standards than the common markup languages and XML Schema datatypes do.

  • All references start with the delimiter & and end with ; and "&" is never used literally. Use &amp;.
  • All tags start with the delimiter < and end with the delimiter > and "<" and ">" are never used literally. Use &lt; and &gt;.
    • All delimiter recognition takes only 1 character of lookahead (open delimiter <) or lookbehind, except for fragment tags and scoped-element tags.11
    • The matching tag delimiter pairs are </? >   <? ?> <! ->  <: :>
    • The delimiter set is <>&;”?!/:- []#%=. _}

Inside their delimiters, tags are lexed according to one of two lexical patterns:

  • NAMED-TAG-WITH-ATTRIBUTES for start-tags, end-tags and IP
  • FREE-TEXT for comments

There are three forms for elements (illustrated after this list):

  • normal elements: must have a name token or string e.g. <a>xx</a> or <"a">xxx</"a">.
  • list elements: must have “_” for the name, e.g. “<_>” and "</_>".
    • Non-whitespace text directly under a list element is an error.
  • empty end-tag: must have a name token or string for the start-tag but not the end: e.g. <a>xxx</>. These may not have sub-elements, and are provided for terseness in large records.
    • An empty end-tag coming after an end-tag or comment or ip is an error.
    • The XML self-closing tag e.g. <a/> is not available: the closest equivalent is <a></>.
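
For illustration, a hypothetical record using all three forms (the names and values are invented):

<_>
 <drug i=="aspirin" dosage=300>analgesic</drug>
 <drug i=="ibuprofen" dosage=200>analgesic</>
</_>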

Whitespace text runs

A Whitespace text run is any run of text between tags that contains only ASCII whitespace: such text is either Ignorable (not part of the Infoset) or Reportable. Not all Reportable whitespace is necessarily significant. Whether whitespace is Ignorable depends only on the two immediately surrounding tags.

  • Whitespace runs that appear between two start-tags (of any kind) or between two end-tags (of any kind) are Ignorable. I.e.,
    •  */node()[1][self::text()][normalize-space(.)='']  
    • */node()[last()][self::text()][normalize-space(.)='']  
  • Whitespace runs that come before or after a stream, fragment or scope start- or end-tag are Ignorable.
  • Other whitespace runs are Reportable: in particular, whitespace between an end- and a start-tag (internal whitespace) or between a start- and an end-tag (blank element).

The intent is to reduce the number of nodes in a DOM, to allow fragments to be represented as a simple list, and to make indentation of the first few levels harmless.
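
A minimal sketch of the rule as a predicate, where a tag kind is 'start', 'end', or 'outer' (the latter folding together stream, fragment and scope tags of either kind):

def is_ignorable(before: str, after: str) -> bool:
    if 'outer' in (before, after):
        return True      # adjacent to a stream, fragment or scope tag
    if before == after:
        return True      # between two start-tags or between two end-tags
    return False         # end->start and start->end runs are Reportable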

Character References and Binary Byte Strings

There are three kinds of character references:

  • Named character reference
    • e.g., &mdash;
    • All  and only SGML/MathML/HTML public entities are built-in.
  • Hex Numeric character references
    • e.g., &#x00A1;
    • All non-control Unicode characters are allowed (i.e. not C0, C1; therefore no 0x00 NULL).
  • Binary Byte String
    • Is a kind of literal
      • e.g., &#[01ABCDEF]
    • A list of pairs of hexadecimal numbers, each specifying a byte. These can be any byte including 0x00 NULL.
    • If a Binary Byte String  is found, the parser can
      • generate a Fragment- or Scope- level WF error (default) and strip the reference 
      • OR, at system option, include the reference literally (as the actual characters of the reference),
      • OR, at system option, include the decoded sequence of characters; this may be dangerous if used in some attack. It is only available as a data value.

Character references are all dereferenced after all lexical operations are performed. Hence:

  • The only way to represent the delimiters <>& in text is to use a character reference.
  • A character reference e.g. &lt;  is never  recognized as a delimiter of any kind, and never changes the recognition of markup.
  • Character references are supported in tokens (including element names), literals, data content, and implementation parameters.

The replacement UTF-8 for any character reference takes up  fewer bytes than the reference. So de-referencing can be done in a byte buffer without having to grow it.
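
A Python sketch of that in-place strategy, handling numeric references only (named references would additionally need the entity table); a read pointer and a write pointer compact the buffer as they go:

import re

NUMREF = re.compile(rb'&#x([0-9A-Fa-f]+);')

def deref_inplace(buf: bytearray) -> int:
    # Returns the new logical length; the caller may truncate with del buf[n:].
    out = i = 0
    while i < len(buf):
        m = NUMREF.match(buf, i)
        if m:
            rep = chr(int(m.group(1), 16)).encode('utf-8')
            buf[out:out + len(rep)] = rep   # never longer than the reference itself
            out += len(rep)
            i = m.end()
        else:
            buf[out] = buf[i]
            out += 1
            i += 1
    return out

b = bytearray(b'ab&#x41;cd')
n = deref_inplace(b)
assert bytes(b[:n]) == b'abAcd'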

Constraints:

  • The name in an end-tag must match12 that of the corresponding start-tag.
  • If a fragment start- and end-tag have identifier attributes (using :=), they must match.
  • An element or attribute name (TOKEN) must not exceed 256 bytes (UTF-8) or 128 ushorts (UTF-16) before any reference substitutions are made. 
  • Each tag or text run cannot exceed length 2^16  (64K)
  • Each fragment cannot exceed length 2^32  - 2^12 (4 Gig -4k)
    • The -4k is to ensure the document can start anywhere in the first page (4k) for potential alignment purposes.
  • The grammar gives the lexical detection rules for boolean, date, number and name. The value should be checked for full lexical correctness before it is first used, but there is no requirement to check full lexical correctness or value-validity if the datatype is not accessed.

  • An attribute specified as x=y is not the same as x="y" nor "x"=y nor "x"="y".13

  • A TOKEN according to the Tokenizing Grammar (an element or attribute name, or an attribute value without double quotes) has an absolute length restriction of 256 bytes as raw text, before character reference substitution14, and a token according to the Full Grammar has a maximum length of 256 Unicode characters after substitution.15

  • APIs exposing the data should expose name tokens in an NFC normalized form.16
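
A sketch of the hard limits as checks over raw UTF-8 (the constant and function names are invented for illustration):

TOKEN_MAX    = 256              # bytes of raw UTF-8 (128 ushorts in UTF-16)
TAG_MAX      = 2**16            # any single tag or text run
FRAGMENT_MAX = 2**32 - 2**12    # 4G - 4K, leaving first-page headroom

def check_limits(token: bytes, tag: bytes, fragment: bytes) -> None:
    assert len(token) <= TOKEN_MAX, 'name/token over 256 bytes'
    assert len(tag) <= TAG_MAX, 'tag or text run over 64K'
    assert len(fragment) <= FRAGMENT_MAX, 'fragment over 4G-4K'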

Possible Parser Interface (TBD)

(Probably remove.)

This Russian Doll design allows streamlined access to individual items. For example, fast iterators or cursors such as the following, which can take no argument, or a name and/or an ID:

  • ran::getNextStartFragmentInDocument()  -> Fragment
  • ran::getNextStartContextInFragment() -> Context
  • ran:getNextStartElementInContext()->StartElement
  • ran:getNextAttributeInStartTag() -> Attribute
  • ran:getNextIdInStartTag() -> ID
  • ran:getNextLinkInStartTag() -> Link 

Or chainable:   /*/*[@i="drugs"]/*[@i="pain"]/drug[@i="aspirin"][@dangerous="NO"]/@dosage

d: RanDocument = new RanDocument(...);
q: Quantity?  =  
d.getNextFragmentStart("", "drugs")             iterate through fragments stag
 .getNextContextStart("", "pain", LAZY)         iterate through contexts stag
 .getNextElementStart("drug", "aspirin", FORGET)    iterate through element stag
 .getElementNode(LAZY | DESC)                   get node or parse if not parsed
 .whereAttribute("dangerous:", Logic("NO"))     condition
 .getNextAttribute("dosage","*")                iterate through attributes
 .getTypedValue(FORGET);                        get quantity

where the string is a regex that matches an id (@i or first-position :=) and the flags are different parse actions:

  • FORGET - do not update info or objects on intermediate tags and data, until the one is found
  • LAZY - do not parse attributes except id, nor child elements, until the one is found
  • (none) - parse all elements or attributes until the one is found
  • TYPE - parse all elements or attributes and do datatyping until the one is found
  • FULL - pre-parse all element attributes and do datatyping

and different scope flags (both flag sets are sketched after the following list):

  • (none) - just this tag or node
  • DESC - do same on descendents
  • SBL - do same on siblings
  • BRANCH - do same on entire branch that contains this
  • SCOPE - do same on entire scope
  • FRAGMENT - do same on entire fragment
  • ALL - do same on stream
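
These two flag sets might be modelled as bit-flags; a hypothetical Python sketch (RAN itself specifies no binding):

from enum import Flag, auto

class Action(Flag):
    NONE = 0            # parse all elements or attributes until the one is found
    FORGET = auto()     # do not update info on intermediate tags and data
    LAZY = auto()       # parse only ids until the one is found
    TYPE = auto()       # also do datatyping
    FULL = auto()       # pre-parse all element attributes with datatyping

class Scope(Flag):
    NODE = 0            # just this tag or node
    DESC = auto()       # descendants
    SBL = auto()        # siblings
    BRANCH = auto()     # the entire branch containing this
    SCOPE = auto()      # the entire scope
    FRAGMENT = auto()   # the entire fragment
    ALL = auto()        # the entire stream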

Generic Master Parsing Function (TBD)

(probably remove)

A master generic parser function that combines parsing and simple query might be:

  • parseRanDocument(
        from: Node,
        direction: enum { forwards | backwards },
        fragment_ids: []ID,
        ids: []ID,
        produce_nodes: bitset { fragments, scoped-elements, elements, next-sibling-context, all-sibling-context, all_elements, attributes, children, typed-attributes, instruction-parameters, instruction-parameter-attributes, comments },
        extended: bitset { table, CSV, rich-text, ... },
        validate_nodes: bitset { fragments, scoped-elements, elements, next-sibling-context, all-sibling-context, attributes, typed-attributes, note-attributes, instruction-parameters, instruction-parameter-attributes, comments },
        validation: bitset { fatal, error, warning, info, hint, verbose }
    ) => Result {
           (validityReport, fragmentNode(), Node(), extended data)*,
           status? }

This function would

(a) Scan the document (backwards or forwards from some position in the document) for the fragments with the given fragment IDs

(b) Parse the fragment to the extent needed to produce the kinds of nodes specified in “produce_nodes”

  • E.g. produce_nodes of FRAGMENTS|SCOPED-ELEMENTS|ELEMENTS|NEXT_SIBLING_CONTEXT would, if looking for an element with ID “x123”, produce a tree of nodes with the fragment, any elements or scoped-elements between the fragment and the identified element, the identified element itself, and the immediate sibling elements of these, but no other child nodes. It would not produce attribute values. If this result were copied by value, it would lose the ability to have the other parts lazily evaluated. Otherwise, asking for an attribute value would cause parsing of the values.

(c) The document is also validated, using the same options. 

  • So the information validated can be different (less or more) than the information produced: for example, from “validate the whole document but do not produce a parse tree” to “produce a complete parse tree but do not validate it”.

The intent is that the minimum-necessary subtree can be extracted at parse time, with the maximum appropriate laziness, for a particular task.

Footnotes

1This makes it straightforward for developers to exploit optimizations available on modern CPUs and disks: SIMD, vectorization, parallel threads, SPMD, predictable prefetch, cache sizes, locality from in-place referencing.

2This grammar can be used to locate fragments and element tags.

3Implementation Note: So the implementation is free to use NULL, SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS or their code points. Their presence in the stream is terminal for the stream but not an error if all fragments and tags are complete. The other 19 non-whitespace controls are deprecated but unchecked, for efficiency.
An implementation may also check for the C1 and the other non-whitespace C0 control characters.

4If the raw data is UTF-8 then the UNIT is a byte, and a character is a sequence of one or more 8-bit units. If the raw data is UTF-16, the UNITs are 16-bit units (in the case of a PUA or non-BMP Unicode repertoire character, the character may take more than one 16-bit unit). The intent is that an implementation supports one or the other or, preferably, both.

5This grammar can be used to locate fragment and element tags, tokenize start-tags and handle references.

6The use of references in names and ids is allowed but deprecated: it could break raw scanning.

7Implementation impact: This has the effect that the document can be navigated entirely using delimiters. There are no nested entities.

8Implementation Note: The size of the character reference replacement text is never more than the character reference itself. Character reference resolution of a string can be performed in-place, without needing a new buffer.

9These grammar productions should be interpreted as mutually exclusive: an identifier required by a prior production is excluded from a subsequent production: e.g. a NAMED-TAG-WITH-FREE-CONTENT has an implicit exclusion of “>”. Reference substitution occurs in tokens and type-values before sub-parsing and TEXT.

11Implementation Note: because fragments cannot nest, the next <<< after a fragment-start’s <<< must be the opening delimiter of the corresponding fragment-close. So for linear lexing, the delimiters effectively only need 1 character of lookahead from the initial <, with the “/” being a confirming check on every second fragment tag.

12An implementer may use any form of matching (raw bytes, character-reference-substituted bytes, NFC-normalized, etc.) as suits the implementation and deployment goals of the system.

13However, if converting it to some other form without datatyped values or quoted element names, an implementer may decide that they are equivalent.

14Implementation Note: Therefore the whole of any raw token can be read into a 128-byte SIMD vector, before and after character reference substitution, and before and after any Unicode Normalization.

15Implementation Note: Therefore a raw token converted from UTF-8 to UTF-16 can always be contained in two 128-byte SIMD vectors. For example, a name with only Latin alphabetic characters is restricted to 128 Unicode characters; a name with only Han ideographs is restricted to 256/3 = 85 characters. This may also allow convenient SIMD filtering to exclude, e.g., ASCII-only or Han-only tokens from attempts to normalize them.
Inside literals in tags (e.g. attribute values), the “=” character cannot be used literally: use &eq;.

16Implementation Note: Normal text content and the content of quoted attributes should not be normalized by APIs. If the data source is private or trusted to produce normalized names in tokens, an implementation may bypass normalization.

17Implementation Note: This functionality must be implemented and exposed in some API. Given a reference to F1:X123 the value can be found in unparsed raw text by first scanning for the fragment (a linear scan of the text for “<<<[^/]”) until the fragment with @id of “F1” is found, then scanning start-tags up to the next “>” for the first attribute value of “X123”. This is a rough-and-ready text operation that can be performed on the raw text, to simulate ID/IDREF or keyed links, and relies on the reference attribute value being unique among all attribute values in the fragment (not only ID values: it has nothing to do with ID types).