Grammars

A RAN stream can be processed at many levels of granularity, depending on the requirement.

The most basic division is that a RAN document is made from a number of independent partitions called fragments.

RAN can be lexed at many different grains, the most important of which are listed below. In general, there are six basic levels of granularity:

  • Finite-stream tags (look for <<<< at start and end only)
  • Fragment identification (look for <<<)
  • Tag identification (look for <)
  • Context tags (look for <<)
  • Attribute tokenizing (look for spaces, =, >, " etc.)
  • Attribute indicator link-typing

Then there are several levels of “notation” parsing that may be done by the parser, or lazily on demand, or by some application layer.

  • Attribute-value data-typing  (see section Extended Lexical Datatypes)
  • Embedded CSV  (see section RAN-CSV)
  • Single-character rich text tags (quasi HTML)
  • “Note attributes” and “PI attributes”
  • Potentially, other notations

An individual partition (i.e. a section of the document that starts with the fragment-start-tag open-delimiter "<<<") may be ill-formed without this making any subsequent partition ill-formed. Each partition can be parsed separately.

In RAN, end-tags (element, scoped-element, fragment) must have a duplicate of the initial ID attribute on the corresponding open-tag: for example, a fragment <<<sec i=F1>>> … <<</sec i=F1>>> repeats i=F1 on its end-tag. Fragments and scoped-elements must have IDs.

Simple Scanning by Regex

The ability to parse to milestones using regular expressions, and then to match values at or between those milestones using the Tokenizing grammar, means that simple queries can be performed on the raw text array without allocating objects.1

Scan for tags (Regular Expression)

The simplest complete lexical scanner breaks the document into sequences of [tag-open delimiter, tag contents, tag-close delimiter, data] capture-group tuples.

stream  := ws* ( 
                 ( "<"+ "/"? "?"? ) 
                 ( [^<>]+ )                       TAG: Max 2^16 bytes 
                 ( ">"+ )?  
                 ( [^<>]* )                       DATA: Max 2^16 bytes
                )+
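
For illustration, here is a minimal sketch of this scanner in Python, assuming the document is already decoded to a string; the group names and the sample document are illustrative, not normative:

  import re

  # Sketch of the simplest complete lexical scanner.
  SCAN = re.compile(r"""
      (?P<tago>  <+ /? \??  )    # tag-open delimiter, with optional / or ?
      (?P<tag>   [^<>]+     )    # tag contents (capped at 2^16 bytes)
      (?P<tagc>  >+         )?   # tag-close delimiter (optional, for recovery)
      (?P<data>  [^<>]*     )    # following data run (also capped at 2^16)
  """, re.VERBOSE)

  sample = '<<<sec i=F1>>><p i=p1>hello</p><<</sec i=F1>>>'
  for m in SCAN.finditer(sample):
      print(m.group('tago'), m.group('tag'), repr(m.group('data')))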

Here is a more complex scanner that also tokenizes inside tags: attributes, PIs and comments. The rules here implement the intended semantic that a < or > delimiter always closes the tag, which provides error recovery.

stream  := \s* (                                        IGNORABLE WHITESPACE
                ("<!--"  [^<>] "--")   |                COMMENT
                (
                 ( "<"+ ("/" | "?" )? )                  TAGO
                 ( [^<>\s\"]+ )                          GI =< 256 bytes
                 ( \s+ |                                 IGNORABLE WHITESPACE
                   ("\"" [^<>\"\s=]+ "\"" ) |            LITERAL
                   ("--" [^<>] "--"  )  |                COMMENT        
                   ("&#\["" [^0-19A-F=]+ "\]" ) |        BINARY BYTE STRING
                   "}"? "=" "="? "{"?              |                  VALUE INDICATOR
                   "[" | "]" |                           TUPLE DELIMITER
                   "..." |                               ELLIPSIS
                   "?" |                                 EXTERNAL
                   ( [^\"=\[\]\.\?\s<>][^<>\s\"=]* ) |  TOKEN   
                   \s+                                   IGNORABLE WHITESPACE     
                 )*
                )
                ( ( ">"+ ) |                             TAGC (1 expected)
                  ( [^<>]+ )                             DATA 
                )*
              )+

A TOKEN is any run of characters containing no ASCII whitespace, <, or > (and that is not a literal, a comment or an ellipsis). It can be further lexically typed using the following rules, tested in order:

  • starts with +, - or digit, then is QUANTITY  e.g. 0 or +1 or 440_Hz/m
  • contains - not at start or end, then is ISO 8601 DATE TIME RANGE  e.g. 2024-08-27
  • contains + or ~ not at start or end, then is ANCHOR PATH  e.g. fred+ginger
  • contains “/”, then is URI  e.g. /x/y
  • contains “:”, then is PREFIXED NAME  e.g. fred:ginger
  • starts with $, then is MONEY  e.g. $26.00 or $26.00_AU_2024-01-16
  • starts with a LATIN 1 block letter, then is SIMPLE (typically an ASCII identifier, logic value etc.)
  • starts with #, then is hex NUMSTR  e.g. RGB "#01AB5F"
  • starts and ends with “%”, then is EXTERNAL
  • starts with a Unicode currency symbol (any value in the currency block), then is MONEY
  • otherwise is SIMPLE

SIMPLE tokens can be further broken down into reserved and pre-defined logic values; all others are NAMEs.
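
A minimal sketch of this ordered classification in Python (the function name is hypothetical, and “LATIN 1 block letter” is read as a Latin-1 letter so that the later # and % rules remain reachable):

  import unicodedata

  def lexical_type(tok: str) -> str:
      # Tests are applied in order; the first match wins.
      inner = tok[1:-1]                    # characters not at start or end
      if tok[0] in '+-' or tok[0].isdigit():
          return 'QUANTITY'                # e.g. 0, +1, 440_Hz/m
      if '-' in inner:
          return 'DATE TIME RANGE'         # e.g. 2024-08-27
      if '+' in inner or '~' in inner:
          return 'ANCHOR PATH'             # e.g. fred+ginger
      if '/' in tok:
          return 'URI'                     # e.g. /x/y
      if ':' in tok:
          return 'PREFIXED NAME'           # e.g. fred:ginger
      if tok[0] == '$':
          return 'MONEY'                   # e.g. $26.00
      if tok[0].isalpha() and ord(tok[0]) <= 0xFF:
          return 'SIMPLE'                  # Latin-1 letter start
      if tok[0] == '#':
          return 'NUMSTR'                  # e.g. #01AB5F
      if tok[0] == '%' and tok[-1] == '%':
          return 'EXTERNAL'
      if unicodedata.category(tok[0]) == 'Sc':
          return 'MONEY'                   # any Unicode currency symbol
      return 'SIMPLE'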

Scan for Fragment boundaries only (Regular Expression)

stream ::= .*(<<<[^/]*)*

or
stream ::= .*((<<<[^/]*<<<[/].*>>>).*)*

Fragments cannot nest, so matching start and end tags is reliable. Fragment comments, PIs, etc. are excluded.
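
A sketch of this boundary scan in Python (the sample and variable names are illustrative):

  import re

  raw = '<<<sec i=F1>>> <p>data</p> <<</sec i=F1>>> <<<sec i=F2>>> <<</sec i=F2>>>'

  # "<<<" not followed by "/" or "<" opens a fragment; "<<</" opens its end-tag.
  starts = [m.start() for m in re.finditer(r'<<<(?![/<])', raw)]
  ends   = [m.start() for m in re.finditer(r'<<</', raw)]
  # Because fragments cannot nest, starts[i] pairs with ends[i].
  print(list(zip(starts, ends)))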

Scan for Fragment Tags only (Regular Expression)

stream ::= .*(<<<[^<].+>>>[^>].*)*

Scan for Scoped Element boundaries in Fragments (Regular Expression)

fragment contents ::= .*(<<[^/].+.*)*

Scan for any Tags in Fragments (Regular Expression)

fragment contents ::= .*((<<?.+>>?).*)*

Scan for any Tags in Scoped Elements, or in Fragments with no Scoped Elements (Regular Expression)

element contents ::= .*((<.+>).*)*

Scan for Tags (Regular Expression): 2

stream ::= \s*((<<?<?.+>>?>?).*)*

Encoding Grammar:3

stream ::= UNIT*
UNIT ::= U8* | U16*4
U8 ::= BYTE > 8
U16 ::= WORD > 8
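
A sketch of the corresponding check for UTF-8 input, following the interpretation in the footnotes (code units 0-8 terminate the stream):

  def stream_end(data: bytes) -> int:
      # Every UTF-8 code unit must be > 8; NULL..BS (0-8) are terminal
      # for the stream, but not an error if all fragments are complete.
      for i, b in enumerate(data):
          if b <= 8:
              return i          # offset at which the stream terminates
      return len(data)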

Tokenizing by Grammar

Tokenizing grammar:5

stream   ::= ws* ( TAGO IN-MARKUP TAGC IN-DATA )*
TAGO     ::= “<”{1..4} ("/" | "?" | "!" "-"*  )?
TAGC     ::= ("-"+ | "?"  )? “>”{1..4} 
IN-MARKUP::= (ws | (“=” “=”? “{”?) | (“}=” “=”?) |
    LITERAL | TOKEN | ELLIPSIS | COMMENT | BINARY)*
ELLIPSIS  ::= "..."
COMMENT  ::= "--" ( ANY except "<>" )* "--"  
IN-DATA  ::= (ANY | REFERENCE)*         // so: ANY = [^<>&]
TOKEN    ::= (ANY except "=" | REFERENCE)+ // so: ANY = [^<>&=”ws]
LITERAL  ::= “"” (ANY except "=" | REFERENCE)* “"”
                                        // so: ANY = [^<>&=”ws]
BINARY   ::= "&#[" (ANY except "]<>" )+  "]" 
REFERENCE::= “&” (ANY except "=")+ “;”  // so: ANY = [^<>&=”ws;]
TOKEN    ::= BYTE{1..256} | WORD{1..128}  // length constraint on TOKEN

There are two kinds of character references, which can be used at any point in the document6, both in data and in tags, but are not then lexed as delimiters7:

  • Numeric character references (delimiters “&#x” and “;”) are the hex Unicode number.

  • Named character references (delimiters “&” and “;”) are the standard W3C entity set version.8

Full Tag Grammar

In the following, "ws" is potentially ignorable (ASCII) whitespace.9

document ::= head body tail
head     ::= ( pi | ws)*
body     ::= ( finite-stream | stream | scoped-element | element )? 
tail     ::= ( pi | ws)*

finite-stream ::= finite-stream-start-tag stream finite-stream-end-tag
stream    ::= (fragment | pi | ws*)+ 
finite-stream-start-tag ::= “<<<<” name  scoped-unique-identifier ATTRIBUTES “>>>>” (comment | ws)*
finite-stream-end-tag   ::= “<<<</” name  scoped-unique-identifier COMMENT* “>>>>”

fragment  ::= ( fragment-start-tag f-contents fragment-end-tag )
fragment-start-tag ::= “<<<” name  scoped-unique-identifier ATTRIBUTES “>>>”
fragment-end-tag   ::= “<<</” name  scoped-unique-identifier COMMENT* “>>>”
                  (comment | ws)*
f-contents ::=  ( element | scoped-element )*
fragment-comment ::= "<<<!" "-"* FREE-TEXT "-"* "->>>" 
fragment-pi ::= "<<<?" name FREE-TEXT "?>>>"   

scoped-element ::= ( scope-start-tag  contents scope-end-tag ) 
scope-start-tag ::= “<<” name scoped-unique-identifier ATTRIBUTES “>>”
      (comment | ws)*
scope-end-tag ::= “<</” name  scoped-unique-identifier  COMMENT*  “>>”
scope-fragment-comment ::= "<<!" "-"* FREE-TEXT "-"* "->>" 
scope-fragment-pi ::= "<<?" name FREE-TEXT "?>>" 

element   ::= ( start-tag contents end-tag ) | list-content
start-tag ::= “<” name  scoped-unique-identifier? ATTRIBUTES “>”
      (comment | ws)*
end-tag   ::= “</” name?  scoped-unique-identifier?  COMMENT* “>”


contents  ::= ( TEXT | element | scoped-element  )*

comment   ::= “<!” “-”* FREE-TEXT “-”* “->” 
pi        ::= “<?” name FREE-TEXT-OR-ATTRIBUTES  “?>” 


FREE-TEXT     ::= [^<>]*
FREE-TEXT-OR-ATTRIBUTES ::= ATTRIBUTES | FREE-TEXT
ATTRIBUTES::= ws?
   (key | reference | attribute | ellipsis | COMMENT |
    content-key | content-reference | content-attribute)*
scoped-unique-identifier  ::= name ws* "}"? "=" ws* value ws* // i := encouraged
key         ::= name ws* "}=" ws* value ws*
reference   ::= name ws* "={" ws* value ws*
attribute   ::= name  ws* "=" ws* value ws* 

content-key        ::= name ws* "}==" ws* value ws*
content-reference  ::= name ws* "=={" ws* value ws* 
content-attribute  ::= name  ws* "==" ws* value ws*

COMMENT    ::= "--"  FREE-TEXT "--"

name       ::= name-token (“:” name-token)? | literal
value      ::= lexically-typed-token | literal | tuple | binary-byte-string
literal    ::= “"” (TEXT except "=") “"”
tuple      ::= "[" ( ws+ | lexically-typed-token | literal )* "]"
binary-byte-string ::= "&#[" ([0-9A-F]{2})+ "]"
ellipsis   ::= "..."
name-token ::= [^\.\-=<>][^-=<>]*

See the section Extended Lexical Datatypes for the grammar of lexically-typed-tokens. These utilize various symbols, such as $, +, €, ¤, °, and so on, to select a datatype and parse the components. This reduces markup, as the RAN datatypes implement more of the base standards than the datatypes of the common markup languages and XML Schemas do.

  • All references start with the delimiter & and end with ; and "&" is never used literally. Use &amp;.
  • All tags start with the delimiter < and end with the delimiter > and "<" and ">" are never used literally. Use &lt; and &gt;.
  • All delimiter recognition takes only 1 character of lookahead (open delimiter <) or lookbehind, except for fragment tags and scoped-element tags.11
  • The matching tag delimiter pairs are </? >   <? ?>   <! ->   <: :>
  • The delimiter set is <>&;”?!/:- []#%=. _{}

Inside their delimiters, tags follow one of three lexical patterns:

  • NAMED-TAG-WITH-ATTRIBUTES such as a start-tag,
  • UNNAMED-TAG such as a comment, and
  • NAMED-TAG-WITH-FREE-CONTENT-OR-ATTRIBUTES such as a processing instruction or end-tag.
    • If the contents contain “=” then they are parsed as RAN attribute syntax (as “PI attributes” or “Note attributes”), otherwise they are free text. The expectation is that these would be parsed on demand (as an invocation parameter), and that the simple parse tree for the fragment would not otherwise include this information.
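
A minimal sketch of this “=” rule in Python (the whitespace split is a simplification of the real attribute tokenizer, and the function name is hypothetical):

  def tag_interior_kind(content: str):
      # Contents containing "=" are parsed as RAN attribute syntax
      # ("PI attributes" / "Note attributes"); otherwise kept as free text.
      if '=' in content:
          pairs = [t.split('=', 1) for t in content.split() if '=' in t]
          return ('ATTRIBUTES', dict(pairs))
      return ('FREE-TEXT', content)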

Inside literals in tags (e.g. attribute values), the “=” character cannot be used literally: use &eq;.

Comments and PIs

  • There are two kinds of comments:
    • Comment tags can only occur immediately after a start-tag, and belong to that start-tag.
    • Attribute comments can appear inside a start- or end-tag between “--” and “--” delimiters. The characters “<>=&” must use character references, and if there are two consecutive “-” characters then one of them must be a character reference.   e.g.
      <a b="c" -- this is a comment --  d="f" >...</a>
  • PIs can only appear at the very start and end of a document, and belong to the document.
    • A PI  not in the first or last position should be ignored. It may cause an error or warning.
  • Text runs that only contain whitespace and are before or after a comment or PI are not reportable as text data.
  • Thus, after the initial comments, the contents of an element/scope can only be text data or elements/scopes. This simplifies the structures developers need to cope with.
  • PIs between fragments are not an error, but are thrown away: in finite streams, this allows new fragments to be appended with following PI metadata, which in turn gets invalidated by the next fragment.

Whitespace text runs

A whitespace text run is any run of text between tags that only contains ASCII whitespace: such text is Ignorable (not part of the Infoset) or Reportable. Not all Reportable whitespace is necessarily significant. Whether whitespace is Ignorable depends only on the two immediately surrounding tags.

  • Whitespace runs that appear between two comments or start-tags (of any kind) or between two end-tags (of any kind) are Ignorable. I.e.,
    •  */node()[1][self::text()][normalize-space(.)='']  
    • */node()[last()][self::text()][normalize-space(.)='']  
  • Whitespace runs that come before or after a stream, fragment or scope start- or end-tag are Ignorable.
  • Other whitespace runs are Reportable: in particular, whitespace between an end- and a start-tag (internal whitespace) or between a start- and an end-tag (blank element).

The intent is to reduce the number of nodes in a DOM, to allow fragments to be represented as a simple list, and to make indentation of the first few levels harmless.
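
A sketch of this classification in Python, assuming the two tags surrounding a whitespace-only run are available as raw strings (the function and its grouping of comments with start-tags are interpretations, not normative):

  def whitespace_is_ignorable(prev_tag: str, next_tag: str) -> bool:
      def kind(tag: str) -> str:
          body = tag.lstrip('<')
          if body.startswith('!'):
              return 'start'        # comments group with start-tags here
          return 'end' if body.startswith('/') else 'start'
      # Rule 1: between two comments/start-tags, or between two end-tags.
      if kind(prev_tag) == kind(next_tag):
          return True
      # Rule 2: adjacent to a stream, fragment or scope tag (2+ "<").
      if prev_tag.startswith('<<') or next_tag.startswith('<<'):
          return True
      # Rule 3: end-then-start or start-then-end runs are Reportable.
      return False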

Character References and Binary Byte Strings

There are three kinds of character references:

  • Named character reference
    • e.g., &mdash;
    • All and only SGML/MathML/HTML public entities are built-in.
  • Hex Numeric character references
    • e.g., &#x00A1;
    • All non-control Unicode characters are allowed. (i.e. not C0 or C1; therefore no 0x00 NULL)
  • Binary Byte String
    • Is a kind of literal
    • e.g., &#[01ABCDEF]
    • A list of pairs of hexadecimal numbers, each specifying a byte. These can be any byte including 0x00 NULL.
    • If a Binary Byte String is found, the parser can
      • generate a Fragment- or Scope-level WF error (default) and strip the reference,
      • OR, at system option, include the reference literally (as the actual characters),
      • OR, at system option, include the decoded sequence of bytes; this may be dangerous if exploited in some attack. It is only available as a data value.

Character references are all dereferenced after all lexical operations are performed. Hence:

  • Character references may not be used for basic name characters: [a-zA-Z0-9]
  • The only way to represent the delimiters <>& in text is to use a character reference.
  • A character reference, e.g. &lt;, is never recognized as a delimiter of any kind, and never changes the recognition of markup.
  • Character references are supported in tokens (including element names), literals, data content, and processing instructions.

The replacement UTF-8 for any character reference never takes up more bytes than the reference itself. So dereferencing can be done in a byte buffer without having to grow it.
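
A sketch of such in-place dereferencing for numeric references, in Python over a bytearray (named references would add a lookup table; error handling is omitted):

  def deref_numeric_in_place(buf: bytearray) -> int:
      # Rewrites "&#xHHHH;" references to their UTF-8 bytes, in place.
      # Safe because the replacement is never longer than the reference.
      src = dst = 0
      while src < len(buf):
          if buf[src:src + 3] == b'&#x':
              end = buf.index(b';', src)      # assumes a terminated reference
              rep = chr(int(buf[src + 3:end], 16)).encode('utf-8')
              buf[dst:dst + len(rep)] = rep
              dst += len(rep)
              src = end + 1
          else:
              buf[dst] = buf[src]
              dst += 1
              src += 1
      del buf[dst:]                           # buffer shrinks, never grows
      return dst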

Constraints:

  • The name in an end-tag must match12 that of the corresponding start-tag.
  • If a fragment start- and end-tag have identifier attributes (using :=), they must match.
  • An element or attribute name (TOKEN) must not exceed 256 bytes (UTF-8) or 128 ushorts (UTF-16) before any reference substitutions are made.14
  • Each tag or text run cannot exceed length 2^16 (64K).
  • Each fragment cannot exceed length 2^32 - 2^12 (4 Gig - 4k).
    • The -4k is to ensure the fragment can start anywhere in the first page (4k) for potential alignment purposes.
  • The grammar gives the lexical detection rules for boolean, date, number and name. The value should be checked for full lexical correctness before it is first used, but there is no requirement to check full lexical correctness or value-validity if the datatype is not accessed.

  • An attribute specified as x=y is not the same as x=“y” nor “x”=y nor “x”=“y”.13

  • Element end-tags and comments use the same internal syntax as processing instructions: they start with a name then may contain any text, including references, until the tag close delimiter “>” or “>>”. This free text is treated as a comment, not a PI or post-fixed attributes. (Implementation Note: The second token in a fragment close tag may, for example, contain a content checksum, made by adding all characters outside tags and all characters in attribute literals, with references resolved and whitespace stripped; the purpose of this is to detect data transmission errors or naive changes; it is not a security measure, nor will it detect problems with whitespace corruption. A sketch of such a checksum appears after this list.)

  • For names and other tokens in attribute values, the maximum number of consecutive non-delimiter alphanumeric bytes in UTF-8 is 255.15 (So very long names need to use "_", "-" or ".")

  • APIs exposing the data should expose name tokens in an NFC-normalized form.16
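
As promised above, a sketch of the kind of content checksum an implementer might place in a fragment close tag (the function and its 32-bit truncation are assumptions, not part of RAN):

  def content_checksum(text_runs, attribute_literals) -> int:
      # Sum the code points of all characters outside tags plus all
      # characters in attribute literals, with references already resolved
      # by the caller and whitespace stripped here. Not a security measure.
      total = 0
      for run in list(text_runs) + list(attribute_literals):
          total += sum(ord(c) for c in run if not c.isspace())
      return total & 0xFFFFFFFF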

Possible Parser Interface (To be revised)

This Russian Doll design allows streamlined access to individual items. For example, fast iterators or cursors such as the following, which can take no argument or take a name and/or ID:

  • ran::getNextStartFragmentInDocument() -> Fragment
  • ran::getNextStartContextInFragment() -> Context
  • ran::getNextStartElementInContext() -> StartElement
  • ran::getNextAttributeInStartTag() -> Attribute
  • ran::getNextIdInStartTag() -> ID

Or chainable:   /*/*[@i="drugs"]/*[@i="pain"]/drug[@i="aspirin"][@dangerous="NO"]/@dosage

d: RanDocument = new RanDocument(...);
q: Quantity?  =
d.getNextFragmentStart("", "drugs")              // iterate through fragment start-tags
 .getNextContextStart("", "pain", LAZY)          // iterate through context start-tags
 .getNextElementStart("drug", "aspirin", FORGET) // iterate through element start-tags
 .getElementNode(LAZY | DESC)                    // get node, or parse if not parsed
 .whereAttribute("dangerous", Logic("NO"))       // condition
 .getNextAttribute("dosage", "*")                // iterate through attributes
 .getTypedValue(FORGET);                         // get quantity

where the string arguments are regexes that match an id (@i or first-position :=), and the flags are different parse actions:

  • FORGET - do not update info or objects on intermediate tags and data, until the one is found
  • LAZY - do not parse attributes except id, nor child elements, until the one is found
  • (none) - parse all elements or attributes until the one is found
  • TYPE - parse all elements or attributes and do datatyping until the one is found
  • FULL - pre-parse all element attributes and do datatyping

and different scopes:

  • (none) - just this tag or node
  • DESC - do same on descendants
  • SBL - do same on siblings
  • BRANCH - do same on entire branch that contains this
  • SCOPE - do same on entire scope
  • FRAGMENT - do same on entire fragment
  • ALL - do same on stream
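
A sketch of how these flags might combine in an API, using a Python Flag enum (the names follow the lists above; the combination mechanism itself is an assumption):

  from enum import Flag, auto

  class Parse(Flag):
      FORGET   = auto()   # parse actions
      LAZY     = auto()
      TYPE     = auto()
      FULL     = auto()
      DESC     = auto()   # scopes
      SBL      = auto()
      BRANCH   = auto()
      SCOPE    = auto()
      FRAGMENT = auto()
      ALL      = auto()

  opts = Parse.LAZY | Parse.DESC   # e.g. lazy parsing applied to descendants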

Generic Master Parsing Function (To be revised)

A master generic parser function that combines parsing and simple query might be:

  • parseRanDocument(
        from: Node,
        direction: enum { forwards | backwards },
        fragment_ids: []ID,
        ids: []ID,
        produce_nodes: bitset { fragments, scoped-elements, elements, next-sibling-context, all-sibling-context, all_elements, attributes, children, typed-attributes, note-attributes, processing-instructions, processing-instruction-attributes, comments, links },
        extended: bitset { table, CSV, rich-text, ... },
        validate_nodes: bitset { fragments, scoped-elements, elements, next-sibling-context, all-sibling-context, attributes, typed-attributes, note-attributes, processing-instructions, processing-instruction-attributes, comments, links },
        validation: bitset { fatal, error, warning, info, hint, verbose }
    )
        => Result {
               (validityReport, fragmentNode(), Node(), extended data)*,
               status? }

This function would

(a) Scan the document (backwards or forwards from some position in the document) for the fragments with the given fragment IDs

(b) Parse the fragment to the extent needed to produce the particular kinds of nodes specified in “produce_nodes”.

  • E.g. produce_nodes of FRAGMENTS|SCOPED-ELEMENTS|ELEMENTS|NEXT_SIBLING_CONTEXT would, if looking for an element with ID “x123”, produce a tree of nodes containing the fragment, any elements or scoped-elements between the fragment and the identified element, the identified element, and the immediate sibling elements of these, but no other child nodes. It would not produce attribute values. If this result were copied by value, it would lose the ability to have the other parts lazily evaluated. Otherwise, asking for an attribute value would cause parsing of the values.

(c) The document is also validated, using the same options. 

  • So the information validated can be different (less or more) than the information produced: for example, from “validate the whole document but do not produce a parse tree” to “produce a complete parse tree but do not validate it”.

The intent is that the minimum-necessary subtree can be extracted at parse time, with the maximum appropriate laziness, for a particular task.

Footnotes

1This makes it straightforward for developers to exploit optimizations available on modern CPU and disks: SIMD, Vectorization, parallel threads, SPMD, predictable prefetch, cache sizes, locality from in-place referencing.

2This grammar can be used to locate fragments and element tags.

3Implementation Note: So the implementation is free to use NULL, SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS (code points 0-8) internally. Their presence in the stream is terminal for the stream but not an error if all fragments and tags are complete. The other 19 non-whitespace controls are deprecated but unchecked, for efficiency.
An implementation may also check for the C1 and the other non-whitespace C0 control characters.

4If the raw data is UTF-8 then the UNIT is a byte, and a character is a sequence of 8-bit units. If the raw data is UTF-16, the UNITs are 16-bit units (in the case of a PUA or non-BMP Unicode repertoire character, the character may take more than one 16-bit unit). The intent is that an implementation supports one or the other or, preferably, both.

5This grammar can be used to locate fragment and element tags, tokenize start-tags and handle references.

6The use of references in names and ids is allowed but deprecated: it could break raw scanning.

7Implementation impact: This has the effect that the document can be navigated entirely using delimiters. There are no nested entities.

8Implementation Note: The size of the character reference replacement text is never more than the character reference itself. Character reference resolution of a string can be performed in-place, without needing a new buffer.

9These grammar productions should be interpreted as mutually exclusive: an identifier required by a prior production is excluded from a subsequent production: e.g. a NAMED-TAG-WITH-FREE-CONTENT has an implicit exclusion of “>”. Reference substitution occurs in tokens and typed-values before sub-parsing, and in TEXT.

11Implementation Note: because fragments cannot nest, the next << after a fragment-start’s << must be the opening delimiter of the corresponding fragment-close. So for linear lexing, the delimiters effectively only need 1 character of lookahead from the initial <, with the “/” being a confirming check on the even-numbered fragment tags.

12An implementer may use any form of matching (raw bytes, character-reference-substituted bytes, NFC-normalized, etc.) as suits the implementation and deployment goals of the system.

13However, if converting it to some other form without datatyped values or quoted element names, an implementer may decide that they are equivalent.

14Implementation Note: Therefore the whole of any raw token can be read into a 128 byte SIMD vector, before and after character reference substitution, and before and after any Unicode Normalization.

15Implementation Note: Therefore a raw small-token converted from UTF-8 to UTF-16 can always be contained in two 128-byte SIMD vectors. For example, a name with only Latin alphabetic characters is restricted to 128 Unicode characters; a name with only Han ideographs is restricted to 256/3 = 85 characters. This may also allow convenient SIMD filtering to exclude, e.g., ASCII-only or Han-only tokens from attempts to normalize them.

16Implementation Note: Normal text content and the content of quoted attributes should not be normalized by APIs. If the data source is private or trusted to produce normalized names in tokens, an implementation may bypass normalization.

17Implementation Note: This functionality must be implemented and exposed in some API. Given a reference to F1:X123, the value can be found in unparsed raw text by first scanning for the fragment (a linear scan of the text for “<<<[^/]”) until the fragment with @id of “F1” is found, then scanning start-tags up to the next “>” for the first attribute value of “X123”. This is a rough-and-ready text operation that can be performed on the raw text, to simulate ID/IDREF or keyed links, and it relies on the reference attribute value being unique among all attribute values in the fragment (not only ID values: it has nothing to do with ID types).
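
A rough sketch of that raw-text lookup in Python (regexes and names are illustrative; quoting and reference resolution are ignored):

  import re

  def find_in_fragment(text: str, frag_id: str, value: str) -> int:
      # Linear scan for fragment start-tags, then for the value in start-tags.
      for frag in re.finditer(r'(?<!<)<<<[^/<][^<>]*>>>', text):
          if frag_id in frag.group(0):                # crude @id test
              body_start = frag.end()
              body_end = text.find('<<</', body_start)
              body = text[body_start:body_end]
              for tag in re.finditer(r'<[^<>/][^<>]*>', body):
                  if value in tag.group(0):           # first matching value
                      return body_start + tag.start() # offset of the tag
      return -1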