Grammars

A RAN stream can be processed at many levels of granularity, depending on the requirement.

The most basic division is that a RAN document is made from a number of independent partitions called fragments.

RAN can be lexed at many different grains, the most important of which is below. In general, there are five basic levels of granularity:

  • Fragment identification  (look for <<<)
  • Tag identification  (look for <)
  • Context tags (look for <<)
  • Attribute tokenizing  (look for spaces, =, >, " etc.)
  • Attribute indicator link-typing

Then there are several  levels of “notation” parsing that may be done by the parser, or lazily on demand, or  or by some application layer.

  • Attribute-value data-typing  (see section Extended Lexical Datatypes)
  • Embedded CSV  (see section RAN-CSV)
  • Single-character rich text tags (quasi HTML)
  • “Note attributes", "PI Attributes" 
  • Potentially, other notations 

An individual partition (i.e. a section of the document that starts with the fragment-start-tag open-delimiter "<<<]") may be ill-formed without this making any subsequent partition ill-formed. Each partition can be parsed separately.

In RAN, end-tags (element, scoped-element, fragment) must have a duplicate of the initial ID attribute on the corresponding open-tag. Fragments and scoped-elements must have IDs. 

This Russian Doll designs allows streamlined access to individual items.  For example, fast iterators or cursors such as, which can have no argument or take a name and/or ID:

  • ran::getNextStartFragmentInDocument()  -> Fragment
  • ran::getNextStartContextInFragment() -> Context
  • ran:getNextStartElementInContext()->StartElement
  • ran:getNextAttributeInStartTag() -> Attribute
  • ran:getNextIdInStartTag() -> ID
  • ran:getNextLinkInStartTag() -> Link 

Generic Master Parsing Function

A master generic parser function that combines parsing and simple query might be:

  • parseRanDocument(
         from: Node,
        direction: enum { forwards | backwards },
        fragment_ids: []ID,
       ids:[]ID,
       produce_nodes: bitset { fragments, scoped-elements, elements, next-sibling-context, all-sibling-context, all_elements, attributes, children, typed-attributes, note-attributes, processing-instructions, processing-instruction-attributes, comments, links } ,
    extended: bitset { table, CSV, rich-text, ... },
       validate_nodes: bitset { fragments, scoped-elements, elements, next-sibling-context, all-sibling-context, attributes, typed-attributes, note-attributes, processing-instructions, processing-instruction-attributes, comments, links } ,
      validation: bitset { fatal, error, warning, info, hint, verbose }
    )
                  => Result {
                                  (validityReport,   fragmentNode(), Node(), extended data )*,
                                 status?}  

This function would

(a) Scan the document (backwards or forwards from some position in the document) for the fragments with the given fragment IDs

(b) Parse the fragment to the extent needed to produce the particular nodes specified in “produce_nodes”, producing the kinds of nodes specified by produce_nodes

  • E.g. produce_nodes of FRAGMENTS|SCOPED-ELEMENTS|ELEMENTS|NEXT_SIBLING_CONTEXT would, if looking for element with ID of “x123” produce a tree of nodes with the fragment, and any element or scoped-elements between the fragment and the identified element,  the identified element, the immediate sibling elements of these,  but no other child nodes. It would not produce attribute values.  If this result was copied by value, it would lose the ability to have the other parts lazily evaluation. Otherwise, asking for an attribute value would cause parsing of the values. 

(c) The document is also validated, using the same options. 

  • So the information validated can be different (less or more) than the information produced. For example, from  “validate the whole document but do not produce a parse tree” to "produce a complete parse tree but do not validate it."

The intent is that minimum-necessary substree can be extracted at parse time, with the maximum appropriate laziness, for a particular task. 

Simple Scanning

The ability to parse to milestones using regular expressions, then at or between those milestones match values using the Tokenizing grammar means that easy queries can be performed on the raw text array without allocating objects.1

Scan for Fragment boundaries  (Regular Expression)

stream ::= .*(<<<[^/!?:].*)*

or
stream ::= .*((<<<[^/!?:].*<<<[/].*>>>).*)*

Fragments cannot nest, so matching start and end tags is reliable. Fragment comments, PIs, etc are excluded.

Scan for Fragment Tags (Regular Expression)

stream ::= .*(<<<.+>>>.*)*

Scan for Scoped Element boundaries in Fragments (Regular Expression)

fragment contents ::= .*(<<[^/].+.*)*

Scan for any Tags in Fragments (Regular Expression)

fragment contents ::= .*((<<?.+>>?).*)*

Scan for any  Tags in Scoped Elements or Fragments with no Scoped Elements (Regular Expression)

element contents ::= .*((<.+>).*)*

Scan for Tags (Regular Expression): 2

stream ::= \s*((<[<[<]?]?.+>[>[>]?]?).*)*

Encoding Grammar:3

stream ::= UNIT*
UNIT ::= U8* | U16*
4
U8 ::= BYTE > 8
U16 ::= WORD > 8

Tokenizing

Tokenizing grammar:5

stream   ::= ws* ( (TAGO IN-MARKUP TAGC) | LIST-OPEN | LIST-CLOSE) IN-DATA 
LIST-OPEN ::= "<["
LIST-CLOSE ::= "]>"
TAGO     ::= “<”{1..3} ( "?" | "!" | "-"*  )?
TAGC     ::= ("-"+ | "?"  )? “>”{1..3} 
IN-MARKUP::= (ws |  (“=" "="? ":”?) | (":=" "="?) |
    LITERAL | TOKEN | "[" | "]" | ELLIPSIS)*
ELLIPSIS  ::= "..."
IN-DATA  ::= (ANY | REFERENCE)          // so: ANY = [^<>&]
TOKEN    ::= (ANY except "="| REFERENCE) // so: ANY = [^<>&=”ws]
LITERAL  ::= “"” (ANY except "=" | REFERENCE)* “"”
                                        // so: ANY = [^<>&=”ws]
REFERENCE::= “&” (ANY except "=")+ “;”  // so: ANY = [^<>&=”ws;]
TOKEN    ::= BYTE{1..128} | WORD{1..64}

There are two kinds of character references which can be used at any point in the document6, both in data and in tags, but are not then lexed as delimiters7:

  • Numeric character references (delimiters “&#x” and “;”) are the hex Unicode number.

  • Named character references (delimiters “&” and “;”) are the standard W3C entity set version.8

Full Tag Grammar

In the following, "ws" is any (ASCII) whitespace.9 TODO: simplify 

stream    ::= head? partition
head      ::= preamble fluff?
partition::= (element-tags fluff?)? | (fragment fluff?)*
preamble  ::= ws* link? fluff?
fluff     ::= (comment | pi | ws)+
link      ::= “<:” NAMED-TAG-WITH-ATTRIBUTES “:>”
fragment  ::= ( fragment-start-tag f-contents fragment-end-tag )
fragment-start-tag ::= “<<<” name  scoped-unique-identifier LINKS-AND-ATTRIBUTES “>>>”
fragment-end-tag   ::= “<<</” name  scoped-unique-identifier  “>>>”
f-contents ::= (fragment-comment | fragment-pi | fluff)*, ( element | scoped-element | fluff )*
fragment-comment ::= "<<<!" "-"* UNNAMED-TAG "-"* "->>>" 
fragment-pi ::= "<<<?" name FREE-CONTENT "?>>>"   
scoped-element ::= ( scope-start-tag 
      (scoped-comment | scoped-pi | fluff )* contents scope-end-tag )  
scope-start-tag ::= “<<” name scoped-unique-identifier LINKS-AND-ATTRIBUTES “>>” 
scope-end-tag ::= “<</” name  scoped-unique-identifier   “>>”
scope-fragment-comment ::= "<<!" "-"* UNNAMED-TAG "-"* "->>" 
scope-fragment-pi ::= "<<?" name FREE-CONTENT "?>>" 

element   ::= ( start-tag contents end-tag ) | list-content
start-tag ::= “<” name  scoped-unique-identifier? LINKS-AND-ATTRIBUTES “>”
end-tag   ::= “</” name?  scoped-unique-identifier? “>”


contents  ::= ( comment | pi | tags | TEXT | element | scoped-element  )*

comment   ::= “<!” “-”* UNNAMED_TAG “-”* “->”
list-content ::= “<[“ (element | ws) “]>”
pi        ::= “<?” name FREE-CONTENT-OR-ATTRIBUTES  “?>” 


UNNAMED-TAG    ::= TEXT 
FREE-CONTENT   ::= (ws+ TEXT?)?
FREE-CONTENT-OR-ATTRIBUTES :== LINKS-AND-ATTRIBUTES | FREE-CONTENT
LINKS-AND-ATTRIBUTES::= ws? 
   (key | reference |  attribute | ellipsis |
   (content-key | content-reference  | content-attribute)
scoped-unique-identifier  ::= name ws* ":"? "=" ws* value ws* // := encouraged
key         ::= name ws* ":=" ws* value ws*
reference   ::= name ws* "=:" ws* value ws*
attribute   ::= name  ws* "=" ws* value ws* 

content-key        ::= name ws* ":==" ws* value ws*
content-reference  ::= name ws* "==:" ws* value ws* 
content-attribute  ::= name  ws* "==" ws* value ws*

name       ::= name-token(“:” name-token)? | literal
value      ::= lexically-typed-token | literal | tuple
literal    ::= “"” (TEXT except "=") “"”
tuple      ::=  "[" ( ws+ | lexically-typed-token | literal )* "]"
elipsis    ::= "..."
name-token ::= [^\.\-=][^-:=]*

See Extended Lexical Datatypes of the grammar for lexically-typed-tokens. These utilize various symbols, such as $, +, €, ¤, °, and so on, to select a datatype and parse the components.  This reduces markup, as the RAN datatypes implement more of the base standards than the common markup languages and XML Schemas datatypes.

  • All references start with delimiter & and end with ;  and "&" never is used literally. Use &ap;
  • All tags start with the delimiter < and end with the delimiter > and "<" and ">" are never used literally. Use &lt; and &gt;.
    • All delimiter recognition only takes 1 character lookahead (open delimiter <) or look behind, except for fragment tags and scoped -element tags.11
    • The matching tag delimiters pairs are </? >   <? ?> <! -> <: :>    (
    • The delimiter pair <[ ]> are a lexically independent start-tag and end-tag, not open and close delimiters. The delimiter set is <>&;”?!/:- []#=. 

Inside their delimiters, tags follow one of three lexical patterns:

  • NAMED-TAG-WITH-ATTRIBUTES such as a start-tag,
  • UNNAMED-TAG such as a comment, and
  • NAMED-TAG-WITH-FREE-CONTENT-OR-ATTRIBUTES such as a processing instruction or end-tag,
    • If the contents contains “=” then they are parsed as RAN attribute syntax (as “PI attributes” or "Note attributes", otherwise it is free text. The expectation is that these would be parsed on demand (as an invocation parameter), and that the simple parse tree for the fragment would not (as an invocation parameter) include this information.

Inside literals in tags (e.g. attribute values), the “=” character cannot be used literally: use &eq;.

Constraints:

  • The name in an end-tag must match12 that of the corresponding start-tag.
  • If a  fragment start- and end-tag have identifier attributes (using :=) they must match
  • An element or attribute name (TOKEN) must not exceed 128 bytes (UTF-8) or 64 ushorts (UTF-16) before any reference substitutions are made.  (Consequently, they cannot contain more than 64 characters after substitution, even if UTF-8 is used.)
  • The grammar gives the lexical detection rules for boolean, date, number and name. The value should be checked for full lexical correctness before it is first used, but there is no requirement to check full lexical correctness or value-validity if the datatype is no accessed.

  • An attribute specified as x=y is not the same as x=“y“ nor “x“=y nor “x“=“y“.13

  • Element end-tags and comments use the same internal syntax as processing instructions: they start with a name then may contain any text, including references, until the tag close delimiter “>” or “>>”. This free text is treated as a comment, not a PI or post-fixed attributes. (Implemention Note: The second token in a fragment close tag may, for example, contain a content checksum, made by adding all characters outside tags and all characters in attribute literals, with references resolved and whitespace stripped; the purpose of this is to detect data transmission errors or naive changes; it is not a security measure, not will it detect problems with whitespace corruption. )

  • A TOKEN according to the Tokenizing Grammar (an element or attribute name or attribute value without double quotes) has an absolute length restriction of 64 bytes as raw text, before character reference substitution14 and a token acccording to the Full Grammar has a maximum length of 64 Unicode characters after substitution.15

  • APIs exposing the data should expose name tokens in an NFC normalized form.16

Footnotes

1This makes it straightforward for developers to exploit optimizations available on modern CPU and disks: SIMD, Vectorization, parallel threads, SPMD, predictable prefetch, cache sizes, locality from in-place referencing.

2This grammar can be used to locate fragments and element tags.

3Implementation Note: So the implementation is free to use NULL, SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS or their code points. Their presence in the stream is terminal for the stream but not an error if all fragments and tags are complete. The other 19 non-whitespace controls are deprecated but unchecked, for efficiency.
An implementation may also check for the C1 and the other non-whitespace C0 control characters.

4If the raw data is UTF-8 then the UNIT is byte, a character is a sequence of all 8-bit units. If the raw data is UTF-16, the UNIT are all 16-bit units (in the case of a PUA or non-BMP Unicode repertiore character, the character may take more than one 16-bit units.) The intent is that an implementation supports one or the other or, preferably, both.

5This grammar can be used to locate fragment and element tags, tokenize start-tags and handle references.

6The use of references in names and ids is allowed but deprecated: it could break raw scanning.

7Implementation impact: This has the effect that the document can be navigated entirely using delimiters. There are no nested entities.

8Implementation Note: The size of the character reference replacement text is never more than the character reference itself. Character reference resolution of a string can be performed in-place, without needing a new buffer.

9These grammar productions should be interpreted as mutually excluding: an identifier required by a prior production is excluded from a subsequent production: e.g. a NAMED-TAG-WITH-FREE-CONTENT has an implicit exclusion of “>”. Reference substitution occurs in tokens and type-values before sub-parsing and TEXT.

11Implementation Note: because fragments cannot nest, the next << after a fragment-start’s << must be the opening delimiter of the corresponding fragment-close. So for linear lexing, the delimiters effectively only need 1 character lookahead from the initial <, with the “/” being a confirming check on the even fragment tags.

12An implementer may use any form of matching (raw bytes, character-refence-substituted bytes, NFC-normalized, etc.) as suits the implementation and deployment goals of the system.

13However, if converting it to some other form without datatyped values or quoted element names, an implementer may decide that they are equivalent.

14Implementation Note: Therefore the whole of any raw token can be read into a 128 byte SIMD vector, before and after character reference substitution, and before and after any Unicode Normalization.

15Implementation Note: Therefore a raw token converted from UTF-8 to UTF-16 can be read into a 128 byte SIMD vector. For example, a name with only Latin alphabetic characters is restricted to 64 Unicode characters; a name with only Han ideographs is restricted to 128/3=42 characters. This may also allow convenient SIMD filtering to exclude, e.g. ASCII only or Han-only tokens from attempts to normalize them.

16Implementation Note: Normal text content and the content quoted attributes should not be normalized by APIs. If the data source is private or trusted to produce normalized names in tokens, an implementation may bypass normalization.

17Implementation Note: This functionality must be implemented and exposed in some API. Given a reference to F1:X123 the value can be found in unparsed raw text by by first scanning for the fragment (a linear scanning of the text for “<<<[^/] until the fragment with @id of “F1” is found, then scanning start-tags until the next “>” for the first attribute value of “X123”. This is a rough-and-ready text operation that can be performed on the raw text, to simulate ID/IDREF or keyed links, and relies on the reference attribute value to be unique among all attribute values in the fragment (not only ID values: it has nothing to do with ID types).