A RAN stream can be processed at many levels of granularity, depending on the requirement.
The most basic division is that a RAN document is made from a number of independent partitions called fragments.
RAN can be lexed at many different grains, the most important of which are listed below. In general, there are six basic levels of granularity:
- Finite-stream tags (look for <<<< at start and end only)
- Fragment identification (look for <<<)
- Tag identification (look for <)
- Context tags (look for <<)
- Attribute tokenizing (look for spaces, =, >, " etc.)
- Attribute indicator link-typing
Then there are several levels of “notation” parsing that may be done by the parser, lazily on demand, or by some application layer.
- Attribute-value data-typing (see section Extended Lexical Datatypes)
- Embedded CSV (see section RAN-CSV)
- Single-character rich text tags (quasi HTML)
- Potentially, other notations
An individual partition (i.e. a section of the document that starts with the fragment-start-tag open-delimiter "<<<") may be ill-formed without this making any subsequent partition ill-formed. Each partition can be parsed separately.
In RAN, end-tags (element, scoped-element, fragment) must have a duplicate of the initial ID attribute on the corresponding open-tag. Fragments and scoped-elements must have IDs.
An attribute specified as x=y is not the same as x="y" nor "x"=y nor "x"="y".13
Simple Scanning by Regex
The ability to parse to milestones using regular expressions, and then to match values at or between those milestones using the Tokenizing grammar, means that simple queries can be performed on the raw text array without allocating objects.1
Scan for tags (Regular Expression)
The simplest complete lexical scanner breaks the document into sequences of [tag-open delimiter, tag contents, tag-close delimiter, data ] capture-group tuples.
stream := ws* (
    ( "<"+ "/"? "?"? )    ; TAGO
    ( [^<>]* )            ; TAG: max 2^16 bytes
    ( ">"+ )?             ; TAGC
    ( [^<>]* )            ; DATA: max 2^16 bytes
)+
[0x00-0x20]               ; END
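A minimal sketch of this scanner in Python (an assumption of this note: the stream has already been decoded to a str; the function name is illustrative):

import re

# Each match is one (tag-open delimiter, tag contents, tag-close
# delimiter, data) capture-group tuple, as described above.
SIMPLE_SCANNER = re.compile(r'(<+/?\??)([^<>]*)(>+)?([^<>]*)')

def scan_tags(text):
    """Yield (tago, tag, tagc, data) tuples for each tag in the stream."""
    for m in SIMPLE_SCANNER.finditer(text):
        yield m.group(1), m.group(2), m.group(3) or '', m.group(4)

# Example: list(scan_tags('<<<doc id==F1>>><a x=1>data</a><<</doc id==F1>>>'))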
Here is a more complex scanner that also tokenizes inside tags: attributes, PIs and comments, but not inside tuples. The rules here implement the intended semantic that a < or > delimiter always closes the tag, which provides error recovery.
stream := $blank* (                               ; IGNORABLE WHITESPACE
    ("<--" [^<>]* "-->") |                        ; COMMENT
    (
        ( "<"+ ("/" | "?" | "_")? )               ; TAGO
        (
            ( [^<>\s\"]{1-255} )                  ; GI <= 256 bytes
            ( $blank+ |                           ; IGNORABLE WHITESPACE
              "..." |                             ; ELLIPSIS
              "?" |                               ; EXTERNAL
              $LITERAL |
              $TUPLE |
              $RATING |                           ; RATING
              ( [^\"=\[\]\.\?\s<>][^<>\s\"=]* )   ; TOKEN
            )*
        )?
        ( ("?"? ">"+ ) |                          ; TAGC (1 expected)
          ( [^<>]+ )                              ; DATA
        )*
    )
)+
RATING ::= ( \p{Emoji} | \p{Flag} | \p{Arrow} | \p{SpacingMark} )+ ; > ASCII
LITERAL ::= ASCII-LITERAL | EXTENDED-LITERAL
ASCII-LITERAL ::= "\"" [^<>\"]{0-65534} "\""
| "\'" [^<>\']{0-65534} "\'"
| "%" [^<>%]{0-65534} "%" ; defined as EXTERNAL
| "*" [^<>*]{0-65534} "*"
| "#" [^<>#]{0-65534} "#"
| "@" [^<>@]{0-65534} "@"
| "^" [^<>^]{0-65534} "^"
| "!" [^<>!]{0-65534} "!"
EXTENDED-LITERAL ::= $lit-open ( [^<>{\p}] | {\p}(![\s<>]) ){0-65534} $lit-close
TUPLE ::= ASCII-TUPLE | EXTENDED-TUPLE
ASCII-TUPLE ::=
| "[" [^<>\]]{0-65534} "]"
| "(" [^<>\)]{0-65534} ")"
| "{" [^<>}]{0-65534} "}"
EXTENDED-TUPLE ::= $tuple-open ( [^<>{\p}] | {\p}(![\s<>]) ){0-65534} $tuple-close
blank ::= [0x00-0x20]+
lit-open ::= {\p} ; non-ASCII non-paired punctuation
lit-close ::= {\p} ; same as previous lit-open
tuple-open ::= {\p} ; non-ASCII paired punctuation
tuple-close ::= {\p} ; pair of previous
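As an illustration, the in-tag tokenizing can be sketched in Python as follows, covering the ASCII literal and tuple forms, the ellipsis, the attribute indicators (as used in the Inner tokenizing grammar later), and TOKEN; the extended (non-ASCII) delimiters and RATING characters are omitted for brevity:

import re

# In-tag token alternation, mirroring first-match order:
# literals, tuples, ellipsis, indicators, then TOKEN.
IN_TAG = re.compile(r'''
    "[^<>"]{0,65534}"               |   # "..." literal
    '[^<>']{0,65534}'               |   # '...' literal
    \#[^<>#]{0,65534}\#             |   # #...# literal
    \[[^<>\]]{0,65534}\]            |   # [...] tuple
    \([^<>)]{0,65534}\)             |   # (...) tuple
    \{[^<>}]{0,65534}\}             |   # {...} tuple
    \.\.\.                          |   # ellipsis
    ={1,4}\}?                       |   # indicator: = == === ==== =}
    [^"=\[\](){}.?\s<>][^<>\s"=]*       # TOKEN
''', re.VERBOSE)

def tag_tokens(tag_contents):
    """Tokenize the text between tag-open and tag-close delimiters."""
    return IN_TAG.findall(tag_contents)

# tag_tokens('drug id=="aspirin" dose=[300 mg]')
#   -> ['drug', 'id', '==', '"aspirin"', 'dose', '=', '[300 mg]']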
A TOKEN is any run of characters containing no ASCII whitespace, <, or >. It can be further lexically typed using the following rules, tested in order:
- starts with + or - or a digit, then is QUANTITY, e.g. 0 or +1 or 440_Hz/m
- contains - (not at start or end), then is ISO 8601 DATE TIME RANGE, e.g. 2024-08-27
- contains + or ~ (not at start or end), then is ANCHOR PATH, e.g. fred+ginger
- contains "/", then is URI, e.g. /x/y
- contains ":", then is PREFIXED NAME, e.g. fred:ginger
- starts with $, then is MONEY, e.g. $26.00 or $26.00_AU_2024-01-16
- starts with a LATIN 1 block character, then is SIMPLE (typically an ASCII identifier, logic value etc.)
- starts and ends with "%", then is EXTERNAL
- starts with a Unicode currency character (any value in the currency block), then is MONEY
- otherwise is SIMPLE
SIMPLE tokens can be further broken down into reserved and pre-defined logic values; all others are NAMEs.
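A sketch of these rules in Python. Two orderings here are assumptions forced by overlaps in the list above: the date test runs before the quantity test (a date also starts with a digit), and the %...% EXTERNAL test runs before the Latin-1 test (% is itself a Latin-1 character):

import unicodedata

def lexical_type(tok):
    """Classify a TOKEN per the rules above (names as in this section)."""
    inner = tok[1:-1]                          # "not at start or end"
    if len(tok) > 2 and tok[0] == '%' and tok[-1] == '%':
        return 'EXTERNAL'                      # assumed to win over Latin-1
    if '-' in inner:
        return 'DATE-TIME-RANGE'               # e.g. 2024-08-27 (assumed to
                                               # win over QUANTITY)
    if tok[0] in '+-' or tok[0].isdigit():
        return 'QUANTITY'                      # e.g. 0, +1, 440_Hz/m
    if '+' in inner or '~' in inner:
        return 'ANCHOR-PATH'                   # e.g. fred+ginger
    if '/' in tok:
        return 'URI'                           # e.g. /x/y
    if ':' in tok:
        return 'PREFIXED-NAME'                 # e.g. fred:ginger
    if tok[0] == '$':
        return 'MONEY'                         # e.g. $26.00
    if ord(tok[0]) < 0x100:
        return 'SIMPLE'                        # Latin-1 start
    if unicodedata.category(tok[0]) == 'Sc':
        return 'MONEY'                         # other currency symbols
    return 'SIMPLE'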
Scan for Fragment boundaries only (Regular Expression)
stream ::= .*(<<<[^/]*)*
or
stream ::= .*((<<<[^/]*<<<[/].*>>>).*)*
Fragments cannot nest, so matching start and end tags is reliable. Fragment comments, PIs, etc. are excluded.
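For example, a sketch of fragment extraction in Python (a tempered regex rather than the one above, but the same idea: a fragment runs from a “<<<” that is not “<<</” to the next “<<</ ... >>>”):

import re

# Fragments cannot nest, so the first '<<</...>>>' after a '<<<'
# start-tag is its end-tag.
FRAGMENT = re.compile(r'<<<(?!/)(?:(?!<<<).)*<<</[^<>]*>>>', re.DOTALL)

def fragments(text):
    """Yield the raw text of each complete fragment in the stream."""
    for m in FRAGMENT.finditer(text):
        yield m.group(0)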
Scan for Fragment Tags only (Regular Expression)
stream ::= .*(<<<[^<].+>>>[^>].*)*
Scan for Scoped Element boundaries in Fragments (Regular Expression)
fragment contents ::= .*(<<[^/].+.*)*
Scan for any Tags in Fragments (Regular Expression)
fragment contents ::= .*((<<?.+>>?).*)*
Scan for any Tags in Scoped Elements or Fragments with no Scoped Elements (Regular Expression)
element contents ::= .*((<.+>).*)*
Scan for Tags (Regular Expression)2
stream ::= \s*((<<?<?.+>>?>?).*)*
Encoding Grammar:3
stream ::= UNIT*
UNIT ::= U8* | U16* 4
U8 ::= BYTE > 8
U16 ::= WORD > 8
Complete Tokenizing
Outer tokenizing (regular expression)
DOCUMENT ::= ws? ( MARKUP | DATA )+
MARKUP ::= ( TAG | COMMENT | IP )*
TAG ::= TAGO IN-MARKUP TAGC
TAGO ::= STREAM-TAGO | FRAGMENT-TAGO | SCOPED-TAGO | GENERAL-TAGO
TAGC ::= STREAM-TAGC | FRAGMENT-TAGC | SCOPED-TAGC | GENERAL-TAGC
STREAM-TAGO ::= "<<<<" "/"?
FRAGMENT-TAGO ::= "<<<" "/"?
SCOPED-TAGO ::= "<<" "/"?
GENERAL-TAGO ::= "<" "/"?
STREAM-TAGC ::= ">>>>"
FRAGMENT-TAGC ::= ">>>"
SCOPED-TAGC ::= ">>"
GENERAL-TAGC ::= ">"
IP ::= "<?" IN-MARKUP "?>"
COMMENT ::= "<--" ANY* "-->"
IN-MARKUP ::= [^<>]+
DATA ::= [^<>]+
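A sketch of this outer tokenizer in Python; comments and IPs are tried before tags, as in the grammar, and the tag class is recovered from the length of the “<” run:

import re

OUTER = re.compile(
    r'(<--.*?-->)'                     # COMMENT
    r'|(<\?[^<>]*\?>)'                 # IP
    r'|(<{1,4}/?)([^<>]*)(>{1,4})'     # TAG: TAGO, IN-MARKUP, TAGC
    r'|([^<>]+)',                      # DATA
    re.DOTALL)

KIND = {1: 'GENERAL', 2: 'SCOPED', 3: 'FRAGMENT', 4: 'STREAM'}

def outer_tokens(text):
    """Yield ('COMMENT'|'IP'|'STAG'|'ETAG'|'DATA', ...) tuples."""
    for m in OUTER.finditer(text):
        comment, ip, tago, body, tagc, data = m.groups()
        if comment is not None:
            yield ('COMMENT', comment)
        elif ip is not None:
            yield ('IP', ip)
        elif tago is not None:
            kind = KIND[len(tago.rstrip('/'))]
            yield ('ETAG' if tago.endswith('/') else 'STAG', kind, body)
        else:
            yield ('DATA', data)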
Inner tokenizing (regular expression)
IN-MARKUP ::= TAG-NAME (blank+ (TAG-TOKEN blank*)* )?
TAG-NAME ::= LITERAL | LEXICAL-TOKEN
TAG-TOKEN ::= INDICATOR | ELLIPSIS | LITERAL | LEXICAL-TOKEN | TOKEN | RATING
INDICATOR ::= "="+ "}"?
ELLIPSIS ::= "..."
RATING ::= ( \p{Emoji} | \p{Flag} | \p{Arrow} | \p{SpacingMark} )+ ; > ASCII
LITERAL ::= ASCII-LITERAL | EXTENDED-LITERAL
ASCII-LITERAL ::= "\"" [^<>\"]{0-65534} "\""
| "\'" [^<>\']{0-65534} "\'"
| "%" [^<>%]{0-65534} "%" ; defined as EXTERNAL
| "*" [^<>*]{0-65534} "*"
| "#" [^<>#]{0-65534} "#"
| "@" [^<>@]{0-65534} "@"
| "^" [^<>^]{0-65534} "^"
| "!" [^<>!]{0-65534} "!"
EXTENDED-LITERAL ::= $lit-open ( [^<>{\p}] | {\p}(![\s<>]) ){0-65534} $lit-close
TUPLE ::= ASCII-TUPLE | EXTENDED-TUPLE
ASCII-TUPLE ::=
| "[" [^<>\]]{0-65534} "]"
| "(" [^<>\)]{0-65534} ")"
| "{" [^<>}]{0-65534} "}"
EXTENDED-TUPLE ::= $tuple-open ( [^<>{\p}] | {\p}(![\s<>]) ){0-65534} $tuple-close
blank ::= [0x00-0x20]+
lit-open ::= {\p} ; non-ASCII non-paired punctuation
lit-close ::= {\p} ; same as previous lit-open
tuple-open ::= {\p} ; non-ASCII paired punctuation
tuple-close ::= {\p} ; pair of previous
Notes: These are first-match, highest-match productions: so <a b= [ "a dog]" ] will generate an error, because the tuple detection looks for the first ] and the error will be that the literal did not end with punctuation. Highest-match means that, in general, if we have x ::= non-terminal terminal, then we can look for the terminal to limit the non-terminal, then match the terminal: this is perhaps like non-greedy evaluation. First-match means that if there is x ::= ( a | b ), then if a matches we do not need to look for b.
So some of the productions may add or omit redundant checks: for example, LEXICAL-TOKEN disallows > but this should have been matched already by the highest-match rule.
For Extended-Literals, {\p} means any non-ASCII Unicode BMP character of category punctuation, but excluding combining characters and separators, characters with “COMMA”, “FULL STOP” or “SEMICOLON” in their name, and star-like characters and arrows. If the character has a pair, then the pair is what should be used in closing. (This is too complex to be shown in the Regex, but may be implemented more easily in a lookup table. However, for tokenizing, this pairing is not necessarily checked fully: it is permissible to implement using the regex above, which is a more fragile test: the literal ends at the first non-ASCII punctuation character that is followed by a space or <>, if that is what the tooling supports. There may be a better way to capture the real constraint in some regex language.)
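A sketch of such a lookup in Python; the category tests approximate the constraint above, and the pairing heuristic for paired punctuation is an assumption (most Unicode open/close pairs are adjacent code points):

import unicodedata

def lit_close_for(open_ch):
    """Return the close character required for an extended literal or
    tuple opened with open_ch, or None if it may not open one."""
    cp = ord(open_ch)
    if cp < 0x80 or cp > 0xFFFF:              # non-ASCII BMP only
        return None
    cat = unicodedata.category(open_ch)
    if not cat.startswith('P'):               # punctuation only: this also
        return None                           # excludes combining characters,
                                              # separators, stars and arrows
    if cat in ('Pe', 'Pf'):                   # a closer cannot open
        return None
    name = unicodedata.name(open_ch, '')
    if any(w in name for w in ('COMMA', 'FULL STOP', 'SEMICOLON')):
        return None
    if cat == 'Ps':                           # paired opener: the closer is
        return chr(cp + 1)                    # usually the next code point,
                                              # e.g. U+300C -> U+300D
    if cat == 'Pi':                           # initial quote: map LEFT to
        try:                                  # RIGHT via the character name
            return unicodedata.lookup(name.replace('LEFT', 'RIGHT'))
        except KeyError:
            return None
    return open_ch                            # unpaired: closes with itself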
Tokenizing grammar (regular expression) to support chunking
DOCUMENT ::= CHUNK+
CHUNK ::= ( <START> blank* META* )
| (STREAM-STAG META* )
| (STREAM-ETAG META* )
| (FRAGMENT-STAG META* CONTENT* FRAGMENT-ETAG META* )
| CONTENT*
| <END>
CONTENT ::= TAG | DATA-RUN
META ::= ( COMMENT-RUN | IP-RUN )
STREAM-STAG ::= STREAM-STAGO MARKUP-RUN STREAM-TAGC
STREAM-ETAG ::= STREAM-ETAGO MARKUP-RUN STREAM-TAGC
FRAGMENT-STAG ::= FRAGMENT-STAGO MARKUP-RUN FRAGMENT-TAGC
FRAGMENT-ETAG ::= FRAGMENT-ETAGO MARKUP-RUN FRAGMENT-TAGC
TAG ::= SCOPED-STAGO | GENERAL-STAGO | SCOPED-ETAGO | GENERAL-ETAGO
| STREAM-TAGC | FRAGMENT-TAGC | SCOPED-TAGC | GENERAL-TAGC
| COMMENT-RUN | IP-RUN
IP-RUN ::= IP-TAGO MARKUP-RUN IP-TAGC
COMMENT-RUN ::= COMMENT-TAGO text COMMENT-TAGC
DATA-RUN ::= text+
MARKUP-RUN ::= text+
blank ::= [\x00-\x20]+
text ::= [^<>]+
STREAM-STAGO ::= "<<<<"
FRAGMENT-STAGO ::= "<<<"
SCOPED-STAGO ::= "<<"
GENERAL-STAGO ::= "<"
STREAM-ETAGO ::= "<<<</"
FRAGMENT-ETAGO ::= "<<</"
SCOPED-ETAGO ::= "<</"
GENERAL-ETAGO ::= "</"
STREAM-TAGC ::= ">>>>" blank?
FRAGMENT-TAGC ::= ">>>" blank?
SCOPED-TAGC ::= ">>" blank?
GENERAL-TAGC ::= ">"
IP-TAGO ::= "<?"
IP-TAGC ::= "?>" blank?
COMMENT-TAGO ::= "<--"
COMMENT-TAGC ::= "-->" blank?
Tokenizing grammar (context-sensitive)5
Includes basic lexing within tags.
DOCUMENT ::= ws? ( MARKUP | DATA )+
MARKUP ::= ( TAG | COMMENT | IP )*
TAG ::= TAGO IN-MARKUP TAGC
TAGO ::= STREAM-TAGO | FRAGMENT-TAGO | SCOPED-TAGO | GENERAL-TAGO
TAGC ::= STREAM-TAGC | FRAGMENT-TAGC | SCOPED-TAGC | GENERAL-TAGC
STREAM-TAGO ::= "<<<<" "/"?
FRAGMENT-TAGO ::= "<<<" "/"?
SCOPED-TAGO ::= "<<" "/"?
GENERAL-TAGO ::= "<" "/"?
STREAM-TAGC ::= ">>>>" blank*
FRAGMENT-TAGC ::= ">>>" blank*
SCOPED-TAGC ::= ">>" blank*
GENERAL-TAGC ::= ">"
IP-TAGO ::= "<?"
IP-TAGC ::= "?>" blank?
COMMENT-TAGO ::= "<--"
COMMENT-TAGC ::= "-->" blank?
IP ::= IP-TAGO IN-MARKUP IP-TAGC
COMMENT ::= COMMENT-TAGO ANY* COMMENT-TAGC
IN-MARKUP ::= TAG-NAME (ws+ (TAG-TOKEN ws*)* )?
TAG-NAME ::= LITERAL | LEXICAL-TOKEN
TAG-TOKEN ::= INDICATOR | ELLIPSIS | LITERAL | LEXICAL-TOKEN | TOKEN | RATING
INDICATOR ::= "="+ "}"?
ELLIPSIS ::= "..."
RATING ::= ( \p{Emoji} | \p{Flag} | \p{Arrow} | \p{SpacingMark} )+ ; > ASCII
LITERAL ::= ASCII-LITERAL | EXTENDED-LITERAL
ASCII-LITERAL ::= "\"" [^<>\"]{0-65534} "\""
| "\'" [^<>\']{0-65534} "\'"
| "%" [^<>%]{0-65534} "%" ; defined as EXTERNAL
| "*" [^<>*]{0-65534} "*"
| "#" [^<>#]{0-65534} "#"
| "@" [^<>@]{0-65534} "@"
| "^" [^<>^]{0-65534} "^"
| "!" [^<>!]{0-65534} "!"
EXTENDED-LITERAL ::= $lit-open ( [^<>{\p}] | {\p}(![\s<>]) ){0-65534} $lit-close
TUPLE ::= ASCII-TUPLE | EXTENDED-TUPLE
ASCII-TUPLE ::=
| "[" [^<>\]]{0-65534} "]"
| "(" [^<>\)]{0-65534} ")"
| "{" [^<>}]{0-65534} "}"
EXTENDED-TUPLE ::= $tuple-open ( [^<>{\p}] | {\p}(![\s<>]) ){0-65534} $tuple-close
blank ::= [0x00-0x20]+
lit-open ::= {\p} ; non-ASCII non-paired punctuation
lit-close ::= {\p} ; same as previous lit-open
tuple-open ::= {\p} ; non-ASCII paired punctuation
tuple-close ::= {\p} ; pair of previous
There are two kinds of character references, which can be used at any point in the document6, both in data and in tags, but are not then lexed as delimiters7:
- Numeric character references (delimiters “&#x” and “;”) give the character by its hex Unicode number.
- Named character references (delimiters “&” and “;”) use the standard W3C entity set names.8
Full Tag Grammar
(not corrected yet: needs extended tuple and literal delimiters, blanks) In the following,
"ws" is any (ASCII) whitespace.9
document ::= head? ( body? ) end?
head ::= meta+
body ::= finite-stream | fragment* | scoped-element | general-element
end ::= [0x00-0x20]
finite-stream ::= f-stream-start-tag meta* stream-contents f-stream-end-tag meta*
stream-contents ::= fragment+ | scoped-element | general-element
f-stream-start-tag ::= "<<<<" name-id attribute+ ellipsis? ">>>>"
f-stream-end-tag ::= "<<<</" name-id attribute ">>>>"
fragment ::= fragment-start-tag meta* f-contents fragment-end-tag meta*
fragment-start-tag ::= "<<<" name-id attribute+ ellipsis? ">>>"
fragment-end-tag ::= "<<</" name-id attribute ">>>"
f-contents ::= (scoped-element | general-element)*
scoped-element ::= scope-start-tag meta* contents scope-end-tag meta*
scope-start-tag ::= "<<" name-id-ref attribute+ ellipsis? ">>"
scope-end-tag ::= "<</" name-id-ref ">>" meta*
general-element ::= start-tag contents end-tag
start-tag ::= "<" name-id-ref attribute* ellipsis? ">"
end-tag ::= "</" name-id-ref ">"
contents ::= ( tags | TEXT | general-element | scoped-element )*
meta ::= comment | ip
comment ::= "<--" [^<>]* "-->"
ip ::= "<?" name attribute* "-"+ "?>"
primary-identifier ::= uuid | primary | key
name-id ::= primary-identifier | name
name-id-ref ::= primary-identifier | reference
attribute ::= uuid | primary | key | reference | plain-attribute
uuid ::= name blank? "====" blank? value blank? rating blank?
primary ::= name blank? "===" blank? value blank? rating blank?
key ::= name blank? "==" blank? value blank? rating blank?
reference ::= name blank? "=}" blank? value blank? rating blank?
plain-attribute ::= name blank? "=" blank? value blank? rating blank?
name ::= symbol | literal
value ::= lexical-datatype | token | literal | tuple
rating ::= ( \p{Emoji} | \p{Flag} | \p{Arrow} | \p{SpacingMark} )+
literal ::= "\"" any* "\"" blank?
tuple ::= ( "[" ( blank | token | literal ) "]" blank? )+
ellipsis ::= "..." blank?
symbol ::= [^\.=<>][^\-:=<>]* ; no lexical delimiters
lexical-token ::= [^\.<>][^\-:=<>]*
blank ::= [0x00-0x20]+
See the section Extended Lexical Datatypes for the grammar of lexically-typed tokens. These utilize various symbols, such as $, +, €, ¤, °, and so on, to select a datatype and parse the components. This reduces markup, as the RAN datatypes implement more of the base standards than the common markup languages and XML Schema datatypes do.
- All references start with the delimiter & and end with ; and "&" is never used literally. Use &amp;.
- All tags start with the delimiter < and end with the delimiter > and "<" and ">" are never used literally. Use &lt; and &gt;.
- All delimiter recognition only takes 1 character lookahead (open delimiter <) or lookbehind, except for fragment tags and scoped-element tags.11
- The matching tag delimiter pairs are < >, </ >, <? ?>, <-- -->.
- The delimiter set is <>&;"?!/:- []#%=. _}
Inside their delimiters, tags are lexed according to one of two lexical patterns:
NAMED-TAG-WITH-ATTRIBUTES for start-tags, end-tags and IPs
FREE-TEXT for comments
There are three forms for elements:
- normal elements: must have name token or string e.g. <a>xx</a> or <"a">xxx</"a">.
- list elements: must have “_” for the name, e.g. “<_>” and “</_>”.
- Non-whitespace text directly under a list element is an error.
- empty end-tag: must have name token or string for start-tag but not end: e.g. <a>xxx</>. These may not have sub-elements, and are provided for terseness in large records.
- An empty end tag coming after an end-tag or comment or ip is an error.
- The XML self-closing tag e.g. <a/> is not available: the closest is to use this: <a></>
Whitespace text runs
(To be checked) A Whitespace text run is any run of text between tags that only contains ASCII whitespace: such text is Ignorable (not part of the Infoset) or Reportable. Not all Reportable whitespace is necessarily significant. Whether whitespace is Ignorable depends only on the two immediately surrounding tags.
- Whitespace runs that appear between two start-tags (of any kind) or between two end-tags (of any kind) are Ignorable. I.e.,
- */node()[1][self::text()][normalize-space(.)='']
- */node()[last()][self::text()][normalize-space(.)='']
- Whitespace runs that come before or after a stream, fragment or scope start- or end-tag are Ignorable.
- Other whitespace runs are Reportable: in particular, whitespace between an end- and a start-tag (internal whitespace) or between a start- and an end-tag (blank element).
The intent is to reduce the number of nodes in a DOM, to allow fragments to be represented as a simple list, and to make indentation of the first few levels harmless.
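A sketch of this decision as a predicate (the tag-kind names are illustrative; runs at the very start or end of a fragment count as adjacent to the fragment tag):

def whitespace_is_ignorable(before, after):
    """Decide Ignorable vs Reportable for a whitespace-only text run.

    before/after name the surrounding tags: 'start' or 'end' for general
    elements, and 'scope-start', 'scope-end', 'fragment-start',
    'fragment-end', 'stream-start', 'stream-end' for the larger tags.
    """
    big = {'scope-start', 'scope-end', 'fragment-start', 'fragment-end',
           'stream-start', 'stream-end'}
    if before in big or after in big:
        return True                 # adjacent to a stream/fragment/scope tag
    if before.endswith('start') and after.endswith('start'):
        return True                 # between two start-tags
    if before.endswith('end') and after.endswith('end'):
        return True                 # between two end-tags
    return False                    # internal whitespace or blank element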
Character References and Binary Data
There are three kinds of character references:
- Named character reference
- e.g., &mdash;
- All and only SGML/MathML/HTML public entities are built-in.
- Hex Numeric character references
- e.g., &#xA1;
- All non-control Unicode characters are allowed (i.e. not C0, C1; therefore no 0x00 NULL).
- Binary String
- A tuple can contain a sequence of hexadecimal numbers, such as [ 0x0123456A 0x0123456b ]. There is no restriction on these numbers to prevent 0 etc.
- A RAN DOM implementation can provide utility functions on an attribute such that, if all the tokens in a tuple are hex numbers that are a power of 2 in length, they can be converted to an array of binary data: uint8, uint16, uint32, uint64, etc. The function returns an array of the appropriate size; for example, toBinaryBytes(), toBinaryShort(), toBinaryInt(), toBinaryLong(), etc. (see the sketch after this list).
- These can only be used as attribute values, and so may not be
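A sketch of the uint8 case in Python (to_binary_bytes stands in for the illustrative toBinaryBytes() above; the wider widths follow the same pattern):

def to_binary_bytes(tokens):
    """Convert tuple tokens like ['0x0123456A', '0x0123456b'] to bytes.

    Each token must be a hex number whose digit count is a power of two.
    """
    out = bytearray()
    for t in tokens:
        digits = t[2:] if t[:2].lower() == '0x' else t
        n = len(digits)
        if n < 2 or (n & (n - 1)) != 0:
            raise ValueError('hex token length not a power of 2: ' + t)
        out += bytes.fromhex(digits)
    return bytes(out)

# to_binary_bytes(['0x0123456A', '0x0123456b'])
#   -> b'\x01\x23\x45\x6a\x01\x23\x45\x6b'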
Character references are all dereferenced after all lexical operations are performed. Hence:
- The only way to represent the delimiters <>& in text is to use a character reference.
- A character reference e.g. &lt; is never recognized as a delimiter of any kind, and never changes the recognition of markup.
- Character references are supported in tokens (including element names), literals, data content, and implementation parameters.
The replacement UTF-8 for any character reference takes up fewer bytes than the reference. So de-referencing can be done in a byte buffer without having to grow it.
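So a sketch of in-place expansion over a byte buffer, in Python (the standard html module's unescape() knows the W3C named entities):

import html
import re

REF = re.compile(rb'&(#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9.]*);')

def expand_refs_in_place(buf):
    """Expand character references in buf (a bytearray); return new length.

    Relies on the guarantee above: the UTF-8 replacement is never longer
    than the reference, so the write cursor never passes the read cursor.
    """
    w = r = 0
    while r < len(buf):
        m = REF.match(buf, r)
        if m:
            body = m.group(1)
            if body.startswith(b'#x'):
                rep = chr(int(body[2:], 16)).encode('utf-8')
            else:
                rep = html.unescape(m.group(0).decode('ascii')).encode('utf-8')
            buf[w:w + len(rep)] = rep
            w += len(rep)
            r = m.end()
        else:
            buf[w] = buf[r]
            w += 1
            r += 1
    return w

# buf = bytearray(b'a &lt; b &#x2014; c')
# n = expand_refs_in_place(buf); del buf[n:]   # buf is now the expanded text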
Constraints:
Implementations can rely on the following (for a well-formed document):
- A character entity reference never expands to a direct character that uses more bytes
than the reference.
- Implication: Reference expansion can be done in place
- An element or attribute name (Lexical TOKEN) never exceeds 256 bytes (UTF-8) (and therefore 256 characters), before any reference substitutions are made.
- Implication: 8-bit indexes can be used into tokens
- A Lexical TOKEN according to the Tokenizing Grammar (an element or attribute name, or an attribute value without double quotes) has an absolute length restriction of 256 bytes as raw text, before character reference substitution14, and a token according to the Full Grammar has a maximum length of 256 Unicode characters after substitution.15
- Each literal or tuple or tag or text run never exceeds length 2^16, i.e. 64K
- Implication: 16-bit indexes can be used into tags and tokens and tuples and text runs
- Each fragment or scope or element or tag or data run cannot exceed length 2^31 bytes, i.e., 2 Gig
- Implication: a 31-bit uint index can be used
- The edge case is a document consisting entirely of a single comment.
- A RAN parser can reject documents that are over 2 Gig if it cannot support that size.
Tree constraints
- The name in an end-tag must match12 that of the corresponding start-tag.
- The identifier of a stream, fragment or scope start-tag will match that in the end-tag.
Parsing simplification
- There are no elaborate checks for which characters can appear in names or strings, only what is needed for parsing.
- The grammar gives the lexical detection rules for boolean, date, number and name. The value should be checked for full lexical correctness before it is first used, but there is no requirement to check full lexical correctness or value-validity if the datatype is not accessed.
- The characters <, > and & are always, and in every context, treated as delimiters.
- Whitespace is any character from 0x00 to 0x20.
Possible Parser Interface (TBD)
(Probably remove.)
This Russian-doll design allows streamlined access to individual items: for example, fast iterators or cursors such as the following, which can take no argument or a name and/or ID:
- ran::getNextStartFragmentInDocument() -> Fragment
- ran::getNextStartContextInFragment() -> Context
- ran::getNextStartElementInContext() -> StartElement
- ran::getNextAttributeInStartTag() -> Attribute
- ran::getNextIdInStartTag() -> ID
- ran::getNextLinkInStartTag() -> Link
Or chainable: /*/*[@i="drugs"]/*[@i="pain"]/drug[@i="aspirin"][@dangerous="NO"]/@dosage
d: RanDocument = new RanDocument(...);
q: Quantity? =
d.getNextFragmentStart("", "drugs") // iterate through fragment start-tags
.getNextContextStart("", "pain", LAZY) // iterate through context start-tags
.getNextElementStart("drug", "aspirin", FORGET) // iterate through element start-tags
.getElementNode(LAZY | DESC) // get node, or parse if not yet parsed
.whereAttribute("dangerous:", Logic("NO")) // condition
.getNextAttribute("dosage", "*") // iterate through attributes
.getTypedValue(FORGET); // get quantity
where the string is a regex that matches an ID (@i or first-position :=), and the flags select different parse actions:
- FORGET - do not update info or objects on intermediate tags and data, until the one is found
- LAZY - do not parse attributes except id, nor child elements, until the one is found
- (none) - parse all elements or attributes until the one is found
- TYPE - parse all elements or attributes and do datatyping until the one is found
- FULL - pre-parse all element attributes and do datatyping
and different scopes:
- (none) - just this tag or node
- DESC - do same on descendents
- SBL - do same on siblings
- BRANCH - do same on entire branch that contains this
- SCOPE - do same on entire scope
- FRAGMENT - do same on entire fragment
- ALL - do same on stream
Generic Master Parsing Function (TBD)
(probably remove)
A master generic parser function that combines parsing and simple query might be:
- parseRanDocument(
from: Node,
direction: enum { forwards | backwards },
fragment_ids: []ID,
ids:[]ID,
produce_nodes: bitset { fragments, scoped-elements, elements, next-sibling-context, all-sibling-context, all_elements, attributes, children, typed-attributes, instruction-parameters, instruction-parameter-attributes, comments } ,
extended: bitset { table, CSV, rich-text, ... },
validate_nodes: bitset { fragments, scoped-elements, elements, next-sibling-context, all-sibling-context, attributes, typed-attributes, note-attributes, instruction-parameters, instruction-parameter-attributes, comments } ,
validation: bitset { fatal, error, warning, info, hint, verbose }
)
=> Result {
(validityReport, fragmentNode(), Node(), extended data )*,
status?}
This function would
(a) Scan the document (backwards or forwards from some position in the document) for the fragments with the given fragment IDs
(b) Parse the fragment to the extent needed to produce the particular nodes specified in “produce_nodes”, producing the kinds of nodes specified by produce_nodes
- E.g. produce_nodes of FRAGMENTS|SCOPED-ELEMENTS|ELEMENTS|NEXT_SIBLING_CONTEXT would, if looking for an element with ID of “x123”, produce a tree of nodes with the fragment, any elements or scoped-elements between the fragment and the identified element, the identified element, and the immediate sibling elements of these, but no other child nodes. It would not produce attribute values. If this result was copied by value, it would lose the ability to have the other parts lazily evaluated. Otherwise, asking for an attribute value would cause parsing of the values.
(c) The document is also validated, using the same options.
- So the information validated can be different (less or more) than the information produced. For example, from “validate the whole document but do not produce a parse tree” to "produce a complete parse tree but do not validate it."
The intent is that the minimum necessary subtree can be extracted at parse time, with the maximum appropriate laziness, for a particular task.
Rating
An attribute (name-indicator-value triple) may be followed by one or more Rating characters: these annotate the attribute with arbitrary information. The characters used are graphical characters that cannot be used in name tokens: any character in the Unicode block U+1F000 to U+1FFFF can be used, as can the blocks below.
- 3-byte UTF-8
- Miscellaneous Symbols
- U+2600..U+26FF
- Dingbats (WingDings) characters
- U+2700..U+27BF
- No semantics defined. However, probably
- U+2714 check mark and
- U+2716 cross might be used to indicate some validity problem with an attribute
- (?) U+3299 circled ideograph secret
- 4-byte UTF-8
- The lexer will recognize any UTF-8 character that starts with 0xF0 0x9F as a rating character. Of particular interest are
- Emojis (Unicode 1F600 to 1F6FF)
- No semantics defined. Presumably it is some kind of classification or some editorial instruction.
- National Flags. (REGIONAL INDICATOR SYMBOLS)
- The flags may be used to indicate country of origin, particularly that the text of the value is in a language, probably the main national language of that country by implication
- (TODO) other spacing marks, arrows
The characters that can be used are roughly:
- Unicode property Symbol
- category currency symbol Sc and other symbol So
- e.g. emoji, flag, checkmark. currency, but not mathematical
- and not ASCII
- and not Unicode property Punctuation
- i.e. not extended literal or extended tuple delimiter
- not letter or digit (not name character)
So the characters that should be in name tokens are, positively:
- Unicode property letter or digit
- other ASCII characters not used in (the first character of) a delimiter: -+._*#/ |\:
Or tokenization can be done by exclusion:
- not blank
- not delimiter <>=&
- not literal or tuple delimiter
- not non-ASCII punctuation or symbols etc
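A sketch of the exclusion test in Python (the Unicode category set used is an approximation of the description above):

import unicodedata

def is_rating_char(ch):
    """True if ch may follow an attribute as a Rating character."""
    cp = ord(ch)
    if cp <= 0x20 or ch in '<>=&':        # blank or core delimiter
        return False
    if cp < 0x80:                         # ASCII is never a rating character
        return False
    cat = unicodedata.category(ch)
    if cat[0] in 'LN':                    # letters and digits are name
        return False                      # characters
    if cat[0] == 'P':                     # non-ASCII punctuation is reserved
        return False                      # for extended literals and tuples
    return cat[0] == 'S' or cat == 'Mc'   # symbols and spacing marks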
Here are some examples of the use (these are examples only: RAN does not specify any semantics here, and the system developer is free to adopt them for their own use):
Footnotes
1This makes it straightforward for developers to exploit optimizations available on modern CPU and disks: SIMD, Vectorization, parallel threads, SPMD, predictable prefetch, cache sizes, locality from in-place referencing.
2This grammar can be used to locate fragments and element tags.
3Implementation Note: So the implementation is free to use NULL, SOH, STX, ETX, EOT,
ENQ, ACK, BEL, BS or their code points. Their presence in the stream is terminal for
the stream but not an error if all fragments and tags are complete. The other 19 non-whitespace
controls are deprecated but unchecked, for efficiency.
An implementation may also check for the C1 and the other non-whitespace C0 control
characters.
4If the raw data is UTF-8, then the UNIT is a byte and a character is a sequence of one or more 8-bit units. If the raw data is UTF-16, the UNITs are 16-bit units (in the case of a PUA or non-BMP Unicode repertoire character, the character may take more than one 16-bit unit). The intent is that an implementation supports one or the other or, preferably, both.
5This grammar can be used to locate fragment and element tags, tokenize start-tags and handle references.
6The use of references in names and ids is allowed but deprecated: it could break raw scanning.
7Implementation impact: This has the effect that the document can be navigated entirely using delimiters. There are no nested entities.
8Implementation Note: The size of the character reference replacement text is never more than the character reference itself. Character reference resolution of a string can be performed in-place, without needing a new buffer.
9These grammar productions should be interpreted as mutually excluding: an identifier required by a prior production is excluded from a subsequent production: e.g. a NAMED-TAG-WITH-FREE-CONTENT has an implicit exclusion of “>”. Reference substitution occurs in tokens and type-values before sub-parsing and TEXT.
10Inside literals in tags (e.g. attribute values), the “=” character cannot be used literally: use &eq;.
11Implementation Note: Because fragments cannot nest, the next <<< after a fragment-start-tag's <<< must be the opening delimiter of the corresponding fragment-end-tag. So for linear lexing, the delimiters effectively only need 1 character lookahead from the initial <, with the “/” being a confirming check on the fragment end-tags.
12An implementer may use any form of matching (raw bytes, character-reference-substituted bytes, NFC-normalized, etc.) as suits the implementation and deployment goals of the system.
13However, if converting it to some other form without datatyped values or quoted element names, an implementer may decide that they are equivalent.
14Implementation Note: Therefore the whole of any raw token can be read into a 128 byte SIMD vector, before and after character reference substitution, and before and after any Unicode Normalization.
15Implementation Note: Therefore a raw token converted from UTF-8 to UTF-16 can always be contained in two 128-byte SIMD vectors. For example, a name with only Latin alphabetic characters is restricted to 128 Unicode characters; a name with only Han ideographs is restricted to 256/3 ≈ 85 characters. This may also allow convenient SIMD filtering to exclude, e.g., ASCII-only or Han-only tokens from attempts to normalize them.
16Implementation Note: Normal text content and the content of quoted attributes should not be normalized by APIs. If the data source is private or trusted to produce normalized names in tokens, an implementation may bypass normalization.
17Implementation Note: This functionality must be implemented and exposed in some API. Given a reference to F1:X123, the value can be found in unparsed raw text by first scanning for the fragment (a linear scan of the text for “<<<[^/]”) until the fragment with @id of “F1” is found, then scanning start-tags up to the next “>” for the first attribute value of “X123”. This is a rough-and-ready text operation that can be performed on the raw text, to simulate ID/IDREF or keyed links, and relies on the reference attribute value being unique among all attribute values in the fragment (not only ID values: it has nothing to do with ID types).