Raw Access Notation (RAN) is a streamable document format designed to allow fast access to elements in fragments on the raw text [1]. It is designed to be lighter-weight than XML by being lexically richer.
RAN does not currently have an implementation; it is a thought experiment to suggest what a future XML-ish markup language needs in order to compete with, or complement, the technologies, CPUs and applications of the 2020s, or at least to plausibly play in those spaces without embarrassment. These include in-memory databases such as Redis, and big data applications using JSON fragmented by newlines.
It is easy for humans to read and write:
It is easy for machines to parse and generate taking advantage of modern CPU capabilities:
- document format - RAN
- validation mechanism - Apatak
- fast access to fragments - RAN Pragma PI
- embedded fielded data with coarse CRUD and coarse queries - RAN-CSV
- document object model - RAN-DOM
- parallel processing
  - the document can be efficiently scanned into partitions (each starting with "<<"), and each partition lexed, parsed and processed by separate threads;
  - partitions (fragments and elements) can be lexed efficiently using SIMD techniques: the key delimiter characters <, > and & are always and only ever delimiters, so character-by-character inspection is minimized;
  - most XML/HTML/SGML-style 'modal' lexing/parsing has been removed, allowing pipelined lexing/parsing and lazy parse-on-demand implementations;
  - threads can start lexing/parsing from any arbitrary point and establish lexical context quickly; a validation technique (Apatak) has been developed to allow validation to start without context at any point.
- efficient access
  - attribute values not using string literals can be parsed directly into common simple datatypes (booleans, numbers, date-time-range, paths, names) without needing a schema;
  - partitions can be aligned/padded to block boundaries, greatly reducing the number of characters that must be examined to locate them; fragments with ordered ids can be found using binary chop techniques rather than linear searches;
  - appending is allowed;
  - UTF-8 conversion can be performed lazily, on demand;
  - the standard public entity sets (only) are built in, allowing lazy resolution or resolution as part of the lexer rather than the parser, and not requiring loading from some external file;
  - anonymous elements allow representation of arrays.
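The partitioning idea in the list above can be sketched in a few lines. This is only an illustration, since RAN has no implementation: the function names are hypothetical, and it assumes that a fragment start tag is "<<" not followed by "/" (so that "<</name>>" is the fragment end tag). Note too that CPython threads will not give real CPU parallelism; the thread pool only shows the shape of the approach.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumption: a fragment start tag is "<<" NOT followed by "/";
# "<</name>>" is taken to be the matching fragment end tag.
_FRAGMENT_NAME = re.compile(rb"<<([^\s<>/]+)")


def partition_offsets(data: bytes) -> list[int]:
    """Return the byte offset of every fragment start tag in the raw stream."""
    offsets = []
    pos = data.find(b"<<")
    while pos != -1:
        if data[pos + 2:pos + 3] != b"/":  # skip "<</" end tags
            offsets.append(pos)
        pos = data.find(b"<<", pos + 2)
    return offsets


def split_partitions(data: bytes) -> list[bytes]:
    """Cut the stream into partitions, each beginning at a fragment start."""
    starts = partition_offsets(data)
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


def fragment_name(partition: bytes) -> bytes:
    """Stand-in for real lexing/parsing: read only the fragment's name."""
    m = _FRAGMENT_NAME.match(partition)
    if m is None:
        raise ValueError("partition does not start with a fragment tag")
    return m.group(1)


def process_partitions(data: bytes, worker=fragment_name, max_workers: int = 4):
    """Hand each partition to a worker; results keep document order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, split_partitions(data)))
```

Because each partition is found by a plain substring search, the split needs no knowledge of what lies inside a fragment; any real lexer would replace `fragment_name` as the worker.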
It has a different feature set from JSON, NDJSON, XML, EXI, CSV and Redis, but learns lessons from all of them. [2]
- RAN adds
  - Fragments, which are top-level elements with identifiers
    - they use << and >> instead of < and > for their delimiters;
    - they can be speedily located in the RAN stream by skipping between each <<;
    - they can be padded to block boundaries for skipped access.
  - Links, which let you declare properties on name prefixes
    - they use <: and :> delimiters;
    - they replace the xmlns namespace declaration mechanism of XML;
    - they allow linking to schemas and so on.
  - Anonymous elements to represent arrays.
  - The standard public entities only are allowed, and are built in.
  - Attribute values can be simply typed as numbers, booleans, dates, etc. without needing a schema.
  - Multiple top-level elements
- RAN removes
  - any direct use of <, > and & except as delimiters
  - all markup declarations (DOCTYPE, ENTITY, NOTATION, ELEMENT, ATTLIST)
  - marked sections (CDATA)
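The schema-free typing of attribute values listed above could work roughly as follows. This is a minimal sketch under stated assumptions, not RAN's actual lexical rules: the spellings `true`/`false`, the number patterns, and the plain `YYYY-MM-DD` date form are guesses made for illustration, and `classify` is a hypothetical name. Only a few of the datatypes the text mentions are covered.

```python
import re
from datetime import date

# Assumed lexical forms; the real RAN rules are not given in the text.
_INTEGER = re.compile(r"[+-]?\d+", re.ASCII)
_DECIMAL = re.compile(r"[+-]?\d+\.\d+", re.ASCII)
_DATE = re.compile(r"\d{4}-\d{2}-\d{2}", re.ASCII)


def classify(token: str):
    """Map one raw attribute value token to a (kind, value) pair."""
    # A quoted token is a string literal and is never reinterpreted.
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return ("string", token[1:-1])
    if token in ("true", "false"):
        return ("boolean", token == "true")
    if _INTEGER.fullmatch(token):
        return ("number", int(token))
    if _DECIMAL.fullmatch(token):
        return ("number", float(token))
    if _DATE.fullmatch(token):
        try:
            return ("date", date.fromisoformat(token))
        except ValueError:
            pass  # looks like a date but is not a real one; fall through
    # Anything else that is unquoted is treated as a name.
    return ("name", token)
```

Applied to the attribute values in a start tag such as `id=abc123 date=2022-02-22`, this would yield a name and a date respectively, while `"1257B"` stays a string because it is quoted.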
A RAN stream is UTF-8 or UTF-16 text [3]. It is an unbounded sequence of one or more fragments, each of which contains a single XML-style element. The end of the stream is not signalled in the RAN text: the current fragment must have completed, but end-of-stream itself is handled by the data-providing layer.
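To make the stream model concrete, here is a small sketch of a reader that accepts text in arbitrary chunks and hands back each fragment as soon as it is complete. It assumes the chunks are already decoded to text and that a fragment ends at the ">>" closing a "<</name>>" end tag; the class and method names are hypothetical. End-of-stream is deliberately not the reader's concern: the caller simply stops feeding it.

```python
class FragmentReader:
    """Accumulate chunks of RAN text and emit completed fragments."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, chunk: str) -> list[str]:
        """Add a chunk; return every fragment that is now complete."""
        self._buf += chunk
        done = []
        while True:
            # Assumption: "<</" only ever begins a fragment end tag,
            # since "<" is always and only a delimiter.
            close = self._buf.find("<</")
            if close == -1:
                break
            end = self._buf.find(">>", close)
            if end == -1:
                break  # end tag not fully received yet
            done.append(self._buf[:end + 2].strip())
            self._buf = self._buf[end + 2:]
        return done

    def pending(self) -> str:
        """Text received so far that does not yet form a whole fragment."""
        return self._buf.strip()
```

When the data-providing layer reports end-of-stream, a non-empty `pending()` would indicate that the last fragment was cut off.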
The non-overloaded, non-modal syntax, reduced character encoding choices, and less strict checking mean that many more operations can be performed as efficient text operations, for which the rules of full XML were too complex. As well, a fragment-providing notation provides a half-way point between streaming processing and tree processing.
The possibilities of the easy-to-scan syntax are exploited in RAN Pragma PIs, which allow a generating application to signal to the receiving parser that it has used various alignment, padding, packing and ordering conventions to reduce the number of comparisons needed to locate all fragments (Aligned Fragments) or particular fragments (Ordered Fragments).
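As a rough illustration of how alignment and ordering could combine, the sketch below assumes a very simple convention that is not taken from the RAN text: every fragment is padded with spaces to exactly one fixed-size block, its `id` is the first attribute, and ids ascend in byte order. The block size, the function names and this layout are all hypothetical, and the Pragma PI that would announce such a layout is not shown because its syntax is not given here.

```python
import re

BLOCK = 64  # hypothetical block size, in bytes

# Assumption: the id is the first attribute of the fragment start tag.
_ID = re.compile(rb"<<\S+\s+id=([^\s>]+)")


def pad_fragment(fragment: bytes, block: int = BLOCK) -> bytes:
    """Pad one fragment with spaces so that it fills exactly one block."""
    if len(fragment) > block:
        raise ValueError("fragment does not fit in one block")
    return fragment.ljust(block, b" ")


def find_fragment(data: bytes, wanted: bytes, block: int = BLOCK):
    """Binary chop over block-aligned fragments whose ids ascend."""
    lo, hi = 0, len(data) // block
    while lo < hi:
        mid = (lo + hi) // 2
        chunk = data[mid * block:(mid + 1) * block]
        m = _ID.match(chunk)  # only the start of the block is inspected
        if m is None:
            raise ValueError("block does not start with a fragment id")
        key = m.group(1)
        if key == wanted:
            return chunk.rstrip()
        if key < wanted:
            lo = mid + 1
        else:
            hi = mid
    return None
```

With this layout a lookup touches about log2(n) block starts instead of scanning every "<<" in the stream, which is the kind of saving the ordering convention is meant to buy.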
Here is a very simple RAN document:
    <<my-document id=abc123 date=2022-02-22 "A 53 Code" = "1257B">>
      <p>Hello </p>
    <</my-document>>
    <<my-document id=abc124 date=2022-02-22 "A 53 Code" = "1257A">>
      <p>World</p>
    <</my-document>>
What is the gap that RAN is designed to fill (without prejudice to its general usefulness)? Consider the scenario of a receiving process in a document pipeline or web service (i.e., a scenario where the document needs to be parsed anew). What technologies might we currently prefer for each?
XML/XSLT 3 (stream)
XML/XSLT 2 (tree)
There is a gap where documents are large enough and the queries on them are easy enough (e.g., //shipTo or //*[@country='US']) that the work of parsing and binding to objects (trees or streams) outweighs the work necessary to locate the information. [4]
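A query as easy as //*[@country='US'] could, in principle, be answered by scanning the raw text without building any tree. The sketch below is only an approximation under assumptions: it takes attributes to be written `name=value` or `name="value"`, treats a quoted and an unquoted value as equal for matching, and relies on < and > never appearing except as delimiters. The function names are hypothetical.

```python
import re

# A start tag runs from "<" (or "<<" for a fragment) to the next ">".
# End tags ("</", "<</") and link delimiters ("<:") are skipped.
_START_TAG = re.compile(r"<<?(?![/:?!])([^\s<>/]+)([^<>]*)>")
_TOKEN = re.compile(r'"[^"]*"|=|[^\s="]+')


def _unquote(token: str) -> str:
    return token[1:-1] if token.startswith('"') else token


def parse_attributes(attr_text: str) -> dict[str, str]:
    """Read name = value pairs from the inside of one start tag."""
    tokens = _TOKEN.findall(attr_text)
    attrs = {}
    i = 0
    while i + 2 < len(tokens):
        if tokens[i + 1] == "=":
            attrs[_unquote(tokens[i])] = _unquote(tokens[i + 2])
            i += 3
        else:
            i += 1
    return attrs


def elements_with_attribute(text: str, name: str, value: str):
    """Return (offset, element name) for each start tag with name=value."""
    hits = []
    for m in _START_TAG.finditer(text):
        if parse_attributes(m.group(2)).get(name) == value:
            hits.append((m.start(), m.group(1)))
    return hits
```

Each hit carries the offset of its start tag, so a caller could then lex just that element on demand rather than the whole document.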
[1] By raw text is meant that the stream can be lexed, parsed and transduced (i.e., data-type-annotated) in place, without allocating extra buffer space.
[2] Some features of RAN cannot be simply round-tripped through XML or JSON.
- Plain Old XML does not have datatypes. XML Schemas and JSON do not have modern ISO 8601 dates.
- RAN does not have all the XML Schemas and JSON data types.
[3] The RAN parser performs no Unicode Normalization; that is the responsibility of the generator, which should use NFC.
[4] Stream processing does have a partial answer to this: terminate the stream processing after the information is found. Tree processing has another partial answer: parse the tree lazily, so that unnecessary work is minimized. RAN allows these, but more effectively and, from the developer's POV, more tractably.