<<Rapid Access Notation>>                    

Rick Jelliffe (C) 2021-2022

Rapid Access Notation (RAN) is a possible document format designed to allow fast access to elements in fragments on the raw text1. The design objective is to be lighter-weight than XML and schema, but lexically richer than XML alone. Some key ideas are richer datatypes, more expressive delimiters, and better relation/graph support.

A RAN document is a sequence of independent fragments, with lexically-determined datatyped values and relational links.  A RAN document is a series of trees of nodes, overlaid with multi-stage addressed links: it is not a tree.

RAN is designed to support streaming, parallel and lazy parsing. It has an eco-system of supporting technologies, including a validation method (Apatak) and a CSV-embedding convention.

Here is a very simple RAN document: it has two fragments (specified using "<<<" and ">>>") with anchors (specified with "#=#") and an internal link (specified with "=#").

<<<my-document id#=#abc123 date=2022-02-22 "A 53 Code" = "1257B">
           <p>Hello </p>
</my-document id#=#abc123 >>>

<<<my-document id#=#abc124 date=2022-02-22 "A 53 Code" = "1257A">
           <p  
                belongs_with=#abc123 
          >World</p>
</my-document id#=#abc124 >>>
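The anchors and the link in this example can be picked out with plain text scanning. The following is a minimal Python sketch, not a conforming RAN parser: the regular expressions are assumptions inferred from this one example (a fragment runs from "<<<" to the next ">>>", an anchor looks like name#=#value, a link looks like name=#value), and the function name index_fragments is hypothetical. The sample text is modelled on the example above.

```python
import re

SAMPLE = '''<<<my-document id#=#abc123 date=2022-02-22 "A 53 Code" = "1257B">
           <p>Hello </p>
</my-document id#=#abc123 >>>

<<<my-document id#=#abc124 date=2022-02-22 "A 53 Code" = "1257A">
           <p
                belongs_with=#abc123
          >World</p>
</my-document id#=#abc124 >>>
'''

# Assumption: a fragment runs from "<<<" to the next ">>>".
FRAGMENT = re.compile(r"<<<.*?>>>", re.DOTALL)
# Assumption: anchors are written name#=#value, links name=#value.
ANCHOR = re.compile(r"([\w-]+)#=#([\w-]+)")
LINK = re.compile(r"([\w-]+)=#([\w-]+)")


def index_fragments(text):
    """Return (anchors, links).

    anchors maps each anchor value to the index of the fragment declaring it;
    links is a list of (fragment index, attribute name, target value).
    """
    anchors, links = {}, []
    for i, match in enumerate(FRAGMENT.finditer(text)):
        body = match.group(0)
        for _name, value in ANCHOR.findall(body):
            # The anchor is repeated on the end-tag; keep the first sighting.
            anchors.setdefault(value, i)
        for name, target in LINK.findall(body):
            links.append((i, name, target))
    return anchors, links


anchors, links = index_fragments(SAMPLE)
for source, name, target in links:
    print(f"fragment {source}: {name} -> fragment {anchors[target]}")
```

The link in the second fragment resolves to the first fragment through the anchor index, without building a tree for either fragment.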

RAN has been influenced by XML, SGML, XBRL, CSV, SQLite, XPath, DSSSL and NLJSON. An important design goal has been to go beyond these notations, not simply be another syntax for them.

Grammars

Relationship to XML

Examples

Example Scenarios

Appendix: RAN Pragma PI

Appendix: RAN-CSV Database, Transactions and Logging

Appendix: RAN-DOM (and XDM, RAN Infoset)

Appendix: BRAN - Binary Raw Access Notation (Exploratory: Not recommended)

Changelog

Guide to Tags

A RAN stream is a sequence of fragments; fragments contain elements; elements can contain other elements and data content.

RAN is designed to allow particular items to be located and parsed without requiring all parts of the document to be loaded and parsed: it allows very lazy, very targeted and highly parallelized loading and parsing of data.
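As a rough illustration of targeted access, the Python sketch below locates a single fragment by its anchor using only substring searches, leaving every other fragment unparsed. It assumes the fragment delimiters "<<<" and ">>>" and the anchor delimiter "#=#" described in the introduction; the function name find_fragment and the sample text are hypothetical.

```python
def find_fragment(text, anchor_value):
    """Return the raw text of the fragment carrying the given anchor, or None.

    Only substring searches are used; no other fragment is lexed or parsed.
    """
    needle = "#=#" + anchor_value
    pos = text.find(needle)
    while pos != -1:
        after = text[pos + len(needle):pos + len(needle) + 1]
        # Reject partial matches such as "abc12" inside "abc123".
        if not (after.isalnum() or after in ("_", "-")):
            start = text.rfind("<<<", 0, pos)   # nearest fragment open before the anchor
            end = text.find(">>>", pos)         # nearest fragment close after the anchor
            if start != -1 and end != -1:
                return text[start:end + 3]
        pos = text.find(needle, pos + 1)
    return None


DOC = (
    "<<<doc id#=#abc123>\n<p>Hello </p>\n</doc id#=#abc123 >>>\n"
    "<<<doc id#=#abc124>\n<p>World</p>\n</doc id#=#abc124 >>>\n"
)
print(find_fragment(DOC, "abc124"))
```

A real implementation would work on bytes or memory-mapped input rather than a Python string, but the shape of the lookup is the same.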

Fragment and element start-tags (and PIs and links) may have attributes. RAN provides a much richer set of datatypes than other schema languages: quantities rather than just numbers, date-intervals rather than just dates, currencies rather than just numbers, multi-value logic rather than just boolean, anchor-paths and relative-paths, among others.
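To make the idea of lexically-determined datatypes concrete, here is a small Python sketch that classifies a token by its spelling alone. The patterns are guesses based only on example values given elsewhere in this document (10_kg, $10_AUD, £GBP, 2022-02-22, "true"/"false"/"error"); they are not the RAN grammar, and the names RULES and lexical_type are hypothetical.

```python
import re

# Hypothetical patterns inferred from example values only; order matters,
# since the first rule that matches the whole token wins.
RULES = [
    ("logic", re.compile(r"true|false|error")),
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("money", re.compile(r"[$£]\d+(?:\.\d+)?_[A-Z]{3}")),
    ("currency", re.compile(r"[$£][A-Z]{3}")),
    ("quantity", re.compile(r"-?\d+(?:\.\d+)?_[A-Za-z]+")),
    ("number", re.compile(r"-?\d+(?:\.\d+)?")),
]


def lexical_type(token):
    """Return the name of the first rule matching the whole token, else 'string'."""
    for name, pattern in RULES:
        if pattern.fullmatch(token):
            return name
    return "string"


for token in ["10_kg", "$10_AUD", "£GBP", "2022-02-22", "error", "42", "1257B"]:
    print(token, "->", lexical_type(token))
```

The point of the sketch is that the type is decided by the lexer from the value itself, with no schema lookup.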

Graph structures (links) within a stream are supported with specific delimiters for attributes to declare anchors and targets: =, =#, #= and #=#. These delimiters interact with the anchor-path datatype.  A RAN document is not a tree, but can be a sequence of trees with many-to-many links within and between them.

As well, each of those delimiters has a "content attribute" variant with a double equals sign: ==, ==#, #==, #==#. When a single = is used, the attribute applies to the element; when the double == is used, the attribute applies to the content of the element.
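A lexer has to tell these eight delimiters apart, and the order of the tests matters because the shorter delimiters are substrings of the longer ones. The Python sketch below tries the longest alternatives first and reports whether the attribute applies to the element or to its content. It is an illustration only: the function name split_attribute is hypothetical, the attribute name in the last usage line is invented, and names or values that themselves contain "=" or "#" are not handled.

```python
import re

# Longest alternatives first, so that "#==#" is not read as "#=" plus "=#".
DELIMITER = re.compile(r"#==#|#==|==#|#=#|==|#=|=#|=")


def split_attribute(text):
    """Split 'name DELIM value' into (name, delimiter, value, applies_to)."""
    match = DELIMITER.search(text)
    if match is None:
        raise ValueError("no attribute delimiter found")
    delimiter = match.group(0)
    # A double equals sign marks a content attribute.
    applies_to = "content" if "==" in delimiter else "element"
    name = text[:match.start()].strip()
    value = text[match.end():].strip()
    return name, delimiter, value, applies_to


print(split_attribute("id#=#abc123"))
print(split_attribute("belongs_with=#abc123"))
print(split_attribute('"A 53 Code" = "1257B"'))
print(split_attribute("unit==kg"))  # invented attribute, for illustration
```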

Character references are allowed almost everywhere.

Other tags

RAN does not have the same well-formedness regime as XML. A general parser only needs to report errors if that particular erroneous structure is parsed.

Technical Overview

RAN is no harder for humans to read and write than XML.

RAN is easier for machines to parse and generate, taking advantage of modern CPU capabilities:

  • eco-system
  • parallel processing
    • the document can be efficiently scanned into partitions (each starting with "<<" and ending with ">>"), and each partition lexed, parsed and processed by separate threads;
    • partitions (fragments and elements) can be lexed efficiently using SIMD techniques: the key delimiter characters <, > and & are always and only ever delimiters; character-by-character inspection is minimized;
    • most XML/HTML/SGML-style 'modal' lexing/parsing has been removed, allowing pipelined lexing/parsing and lazy parse-on-demand implementations;
    • threads can start lexing/parsing from any arbitrary point and establish lexical context quickly; a validation technique (Apatak) has been developed to allow validation to start without context at any point.
  • lexical typing of rich values without schemas or "attributes of attributes"
    • dates allow wildcards, intervals, uncertainty, implementing more of ISO 8601
    • logic allows 3- and 4-value logic, as well as Boolean (such as "true", "false", "error")
    • numbers can be quantities, with SI metric units (such as 10_kg)
    • money can be specified, both amounts (such as $10_AUD) and currencies (such as £GBP)
    • anchor-paths, enabling fast navigation between elements and one-to-many relations
    • URIs
  • efficient access
    • partitions can be aligned/padded to block boundaries, greatly reducing the number of characters that must be examined to locate them; fragments with ordered anchors can be found using binary chop techniques rather than linear searches;
    • appending is allowed;
    • UTF-8 conversion can be performed lazily, on-demand;
    • the standard public entity sets (only) are built-in, allowing lazy resolution or resolution as part of the lexer rather than parser, and not requiring loading from some external file.
    • anonymous elements allow representation of arrays.
    • fragment identification may be more efficient because of fragment-end-tag anchors.
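The partition-then-process idea in the list above can be sketched in a few lines of Python. This is a structural illustration under stated assumptions, not a performance claim: fragments are assumed to run from "<<<" to the next ">>>", the process function is a stand-in for real lexing and parsing, and all names are hypothetical. In CPython, threads will not speed up CPU-bound work because of the global interpreter lock, so a real implementation would use native threads or processes.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumption: every fragment runs from "<<<" to the next ">>>".
FRAGMENT = re.compile(r"<<<.*?>>>", re.DOTALL)


def partition(text):
    """Cut the stream into fragment strings with one cheap scan."""
    return [match.group(0) for match in FRAGMENT.finditer(text)]


def process(fragment):
    """Stand-in for real lexing/parsing: element name and fragment length."""
    name = re.match(r"<<<\s*([\w-]+)", fragment).group(1)
    return name, len(fragment)


def process_all(text, workers=4):
    """Hand each partition to a worker; results come back in document order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, partition(text)))


STREAM = (
    "<<<a id#=#x1><p>one</p></a id#=#x1 >>>\n"
    "<<<b id#=#x2><p>two</p></b id#=#x2 >>>\n"
)
results = process_all(STREAM)
print(results)
```

Because each partition is self-contained, no worker needs state from any other, which is what makes the split safe.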

It has a different feature-set than JSON, NDJSON, XML, EXI, CSV and REDIS, but learns lessons from all of them.2

Status

RAN does not currently have an implementation: it is a thought experiment to give ideas on what a future XML-ish markup language needs in order to compete with, or complement, the technologies, CPUs and applications of the 2020s. Or at least to plausibly play in those spaces without embarrassment. These include in-memory databases such as REDIS, and big data applications using JSON-fragmented-by-newlines.

Background and Milieu

Environment

  • WWW is ubiquitous
  • Unicode is ubiquitous
  • CPUs/GPUs allowing parallelism are ubiquitous
  • Data transfer of structured data is ubiquitous
  • Giant data sets ("big data") are ubiquitous

Lessons

Each technology responds to the technical thinking of the time, supports it, amplifies it, entrenches it, then falls victim to its success as the sour-spots generate a next generation.

  • SGML successfully applied 1975-1985 software engineering concepts to industrial publishing:
    • standardized markup (fixed delimiter roles),
    • generalized markup ("semantics not presentation", the database concept of separating data and application),
    • languages (define allowed documents using grammars: DTDs),
    • "little languages/toolkit/compiler-compiler" (no standard schemas for applications: provide means for customization by gurus for users)
    • model (documents as trees, attributes used to link within document set)
    • macros (entities)
    • incompatible platforms (adapt for each use)
    • pipeline processing
    • QC: inspection and validation (DTDs)
    • QA: verification and firewalling (DTD validity required on parsing prevents progress of bad documents)
  • HTML applied 1985-1995 software engineering concepts
    • Personal computing: personal data, fault tolerant, ad hoc, non-specialist, simplistic, evolving
    • Client-server architecture
    • URLs unify access to 3rd party and remote resources
    • Navigation as client-side decision
    • Incremental hypertext; hypertext as jumping outside to something interesting (rather than Nelson style transclusion and annotation)
  • XML back fitted these 1990 to 2000 concepts
    • Remove guru aspects of SGML
    • Support Unicode
    • Assume WWW (downplay entities, etc)
    • It should be simple to do something simple (Perl idea; allow ad hoc documents without schemas)
    • QC and QA as application decision not enforced
    • Standard APIs (DOM, SAX)
    • Support dynamic databinding and types (XML Schemas)
  • JSON applied concepts of 1995 to 2000s:
    • Ubiquity of JavaScript
    • Ubiquity of UTF-8
    • Assumption that client "document" is a program
    • Computer-generated content (data, metadata and media interchange, not authored document interchange)
    • Automated data-binding without schemas
  • JSON's underspecification made it a prime candidate for extensions, to suit particular uses
    • Big data requiring fast identification of fragments: the line-oriented JSON plus newlines formats, such as nljson
    • YAML adopted

A common repeating feature is the trope that syntax is so unimportant, so everyone should use my syntax!

The lesson I draw from these is that no format is eternal.

 

A RAN stream is UTF-8 or UTF-16 text3. It is an unbounded sequence of one or more fragments, each of which contains a single XML-style element. The end of the stream is not signalled in the RAN text itself: the current fragment must have completed, and end-of-stream is handled by the data-providing layer.
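The stream model can be illustrated with a Python generator that accepts text in arbitrary chunks and yields each fragment as soon as it is complete. This is a sketch under assumptions: fragments are taken to run from "<<<" to the next ">>>", chunks are already-decoded strings, and end-of-stream is simply the exhaustion of the chunk iterator, standing in for the data-providing layer. The name iter_fragments is hypothetical.

```python
def iter_fragments(chunks):
    """Yield complete fragments from an iterable of text chunks.

    Delimiters may be split across chunk boundaries. Raises ValueError if
    the chunks run out while a fragment is still open.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            start = buffer.find("<<<")
            if start == -1:
                # Keep a short tail in case "<<<" straddles two chunks.
                buffer = buffer[-2:]
                break
            end = buffer.find(">>>", start + 3)
            if end == -1:
                # Fragment still open: drop anything before it and wait for more.
                buffer = buffer[start:]
                break
            yield buffer[start:end + 3]
            buffer = buffer[end + 3:]
    if "<<<" in buffer:
        raise ValueError("stream ended inside a fragment")


chunks = ["<<<a id#=#x1>one</a id", "#=#x1 >", ">><<", "<b>two</b >>>"]
for fragment in iter_fragments(chunks):
    print(fragment)
```

Each fragment can be handed to a parser, or discarded, before the next chunk arrives, so memory use is bounded by the largest fragment rather than by the stream.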

The non-overloaded, non-modal syntax, reduced character encoding choices, and less strict checking mean that many more operations can be performed as efficient text operations; the rules of full XML are too complex to allow this. As well, a fragment-providing notation provides a half-way point between streaming processing and tree processing.

The possibilities of the easy-to-scan syntax are exploited in RAN Pragma PIs, which allow a generating application to signal to the receiving parser that it has used various alignment, padding, packing and ordering conventions to reduce the number of comparisons needed to locate all fragments (Aligned Fragments) or particular fragments (Ordered Fragments).

Positioning

What is the gap that RAN is designed to fill (without prejudice to its general usefulness)? Consider the scenario of a receiving process in a document-pipeline or web service (i.e., a scenario where the document needs to be parsed anew.) What technologies might we currently prefer for each?:


There is a gap where documents are large enough and the queries on them are easy enough (e.g., //shipTo or //*[@country='US']) that the work of parsing and binding to objects (trees or streams) outweighs the work necessary to locate the information.4

RAN shares similar concerns (though not details) with the simdjson project and the various fragmented JSON formats: JSON Lines, ndjson, etc.

Footnotes

1 By raw text is meant that the stream can be lexed, parsed and transduced (i.e. data-type-annotated) in-place, without allocating extra buffer space.

2 Some features of RAN cannot be simply round-tripped through XML or JSON.

  • Plain Old XML does not have datatypes. XML Schemas and JSON do not have modern ISO 8601 dates.
  • RAN does not have all XML Schemas and JSON data types. 

3 The RAN parser performs no Unicode Normalization; it is the responsibility of the generator, which should use NFC.

4 Stream processing does have a partial answer to this: terminate the stream processing after the information is found. Tree processing has another partial answer to this: parse the tree lazily, so that unnecessary work is minimized. RAN allows these, but more effectively and, from the developer’s POV, tractably.