Rapid Access Notation 乱

Here is a simple example of a RAN document, with a typical infoset on the right

So RAN will accept simple XML documents with just elements, attributes and the XML header. However, it will report slightly different information to the XML infoset. It does not know what the XML header means.

Here is a more complex example, of a stream (identified with a UUID) of one fragment (identified by a stream-unique ID), containing a single element (identified with a non-unique key):

This example differs from XML in several ways: stream and fragment tags (with more than one < and >); UUID (4 =) and key (3 or 2 =) attributes; datatyping for dates, tokens, strings, etc. (using definite lexical patterns); built-in named character references (for the standard characters); attribute values in arrays; and namespace implication to HTML (for one- or two-letter names). These extra delimiters allow the type benefits of schemas without their overhead, and open the door to many optimisations that are possible with modern processors but unsafe for XML or JSON.

Goals

The general goals of RAN:

Value: RAN should be a fresh markup/data language that maximizes bang and minimizes buck
Use-cases:

RAN should support both documents and data better than XML
RAN should support new use-cases not well supported by XML or JSON etc.

These include scientific, technical and financial data, large text corpuses, and data-binding
It should address some limitations that puzzled users of XML

Technical: Feature and syntax choices should not shoot RAN in the foot as far as optimizations are concerned.

Rapid Access Notation (RAN) is a possible document format, currently under design, to allow fast and efficient semi-random access to elements in fragments on the raw text with lazy, deferred or multi-threaded parsing¹.

It is designed to be lighter-weight than XML by being MUCH lexically richer (rather than use schemas), with much richer datatypes and tuples, more expressive delimiters, better relation/graph support, support for unlimited streams and built-in internationalization, accessibility and hypertext support.
It is designed to allow richer infosets or object bindings without requiring any schema loading or processing.
It has features to efficiently support fragments of interest in very large files, coarse-grained customization of transmitted documents, and different table layouts.
It is designed to take advantage of features of modern SIMD and GPU processors.

Potential Name Change in 2025: As RAN is stabilizing, it is useful to think about a buzzier name; ideas include:

ONEML:

Overt (no hidden rules or lookups or schemas for typing)
Neat (better rules about what attributes do, qnames etc) or Neighbourly (divided into parts)
Explicit (no overloading or context for parsing)
Markup Language

SOLEXML:

Streamable (infinite stream of fragments) or Standalone (no schema or entity declarations necessary)
Obvious (clear rules for what every tag or token means: no overloading, datatypes defined) or Overt (no context-tree parsing or lookup needed for lexing)
Lazy (don’t need to parse things not needed)
Explicit (no schema necessary for typing etc)
Markup Language

Overview

The basic structure of a RAN document is a tree of specialist structural elements, using various variants of the familiar element/attribute tag, for:

a finite stream (using <<<< and >>>> delimiters) of

fragments (using <<< and >>> delimiters) containing

scopes/dynamics (using << and >> delimiters) containing

branches of elements (using < and > delimiters)

a special anonymous element <_> and </_> allows lists, tuples, etc.

A RAN file can be a single branch (like an XML document), or a single scope (like an XML document), or a series of fragments, or a single finite stream of fragments with an explicit open and close.
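As a rough illustration of the levels listed above, the structural level of a start-tag can be read off from the length of its run of "<" characters. The following is a minimal Python sketch under that assumption; the function name tag_level and the level labels are hypothetical and not part of RAN.

```python
# Hypothetical helper: the labels follow the list above, the names are illustrative.
LEVELS = {1: "element", 2: "scope", 3: "fragment", 4: "stream"}


def tag_level(text: str, pos: int) -> str:
    """Return the structural level of the run of '<' characters starting at pos."""
    run = 0
    while pos + run < len(text) and text[pos + run] == "<":
        run += 1
    if run not in LEVELS:
        raise ValueError(f"expected 1 to 4 '<' characters at offset {pos}, found {run}")
    return LEVELS[run]
```

For instance, tag_level("<<<frag>>>", 0) reports a fragment and tag_level("<book>", 0) reports an element, without looking at anything beyond the delimiter run.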

Element and attribute names and attribute values may be literals (with double quotes) or tokens. The token types have simple lexical rules to distinguish them without full parsing.

The token types are influenced by XSD datatypes but take them further: for example, for dates they support more of the features of ISO8601 including date ranges, date wildcarding and uncertainty.

Linking metadata, such as direction, is assisted syntactically, by providing a range of delimiters between attribute names and values.

One way of understanding many RAN design decisions is that a lexer or parser can start at any point in the document and only needs to find the first preceding (or following) < or > delimiter in order to know how to parse the document from that point. To allow this, “<” and “>” are always delimiter characters no matter where they appear in the document (you cannot “comment out” tags), namespaces may only be declared at the top level of the document, and there are no CDATA sections. This allows parallel or reverse parsing.
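The resynchronisation idea can be sketched in Python as follows. This is only an illustration of the principle, assuming (as stated above) that "<" and ">" never occur as ordinary characters; the function name resync and the state labels are hypothetical.

```python
from typing import Tuple


def resync(text: str, offset: int) -> Tuple[int, str]:
    """Scan backwards from offset to the nearest preceding '<' or '>'.

    Returns (index, state). The state is 'in_tag' when the nearest preceding
    delimiter is '<' (the offset lies inside a tag) and 'in_text' when it is
    '>' or when no delimiter precedes the offset (index is then -1).
    """
    for i in range(min(offset, len(text)) - 1, -1, -1):
        if text[i] == "<":
            return i, "in_tag"
        if text[i] == ">":
            return i, "in_text"
    return -1, "in_text"
```

A worker thread handed an arbitrary byte range could call such a function once to decide whether it is starting inside a tag or inside text. For multi-character delimiters such as <<<, the caller would keep scanning back over the run to measure its length.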

RAN “simplifies” XML by removing features that prevent fast parsing using modern processors (parallelism, SIMD, GPUs) but enhances it by taking more advantage of delimiters and lexical typing.

Small Examples

Here is a very simple RAN document, like XML:

<book id===eg2 alt="an example">
    <!-- A comment -->
    <p>Hello world <b>!</b></p>
</book>

The first unusual thing compared to XML is that the first attribute on book uses "===". This delimiter indicates that the attribute is a primary identifier for that element within its scope (similar to an XML ID): an implementation may index this element for faster reference or retrieval. (There is also a corresponding delimiter “=}” for IDREF.) RAN documents do not need a schema to indicate this functionality.
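To illustrate how an implementation might build such an index without a schema, here is a minimal Python sketch. It assumes an identifier attribute has the shape name===token with the token ending at whitespace or ">"; that shape, the pattern ID_ATTR and the function index_ids are assumptions for illustration, not RAN's defined grammar.

```python
import re

# Assumed shape: name===token. Exactly three '=' are required, so longer
# runs such as "====" are not matched.
ID_ATTR = re.compile(r'([^\s<>=]+)===(?!=)([^\s<>"]+)')


def index_ids(text: str) -> dict:
    """Map each identifier token to the offset of its attribute in the raw text."""
    return {m.group(2): m.start() for m in ID_ATTR.finditer(text)}
```

Applied to the book example above, the index would hold a single entry for eg2, pointing at the offset of its id attribute.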

In RAN, the text between tags is text; however, the values of attributes can be typed data. RAN provides an extremely rich set of complex datatypes, and these are reliably determined using RAN’s datatype rules. All the named entities for Unicode characters are built in, and can be used anywhere (and they won’t be recognized as delimiters or whitespace).

The second unusual thing in the example above is that the attribute value does not use double quote delimiters, which means it is a token that will be lexically typed: in this case it is a name token. The lexical typing rules allow far richer types than other data-transfer or schema languages: in particular, they allow information vital to interpreting a number to be kept with the number. Here are some examples of lexical typing of rich values:

<lexically-typed-attributes>
   <quantities
     frequency=32_Hz temperature=32°F reading=-5.123e4 transmit=0xBEEF
     ></quantities>
   <dates
     birthday=2020-06-06TZ
     era=1995-05-?01/2024-10-?11 ></dates>
   <currency   
     polish-amount=¤100.10zł_PLN  old-uk-amount=£4.3s.8d euro-amount=€100
     ></currency>
   <tuples
     cords=[ 134.5°  126° 18km 2024-12-01T10:20:10 ]
     amount-at-time= [$100 2024-01-01T12:00 ]
     ></tuples>  
</lexically-typed-attributes>

In the example above, you can see:

Quantities: RAN extends the idea of numbers to allow units, degree signs (temperature, rotation), scientific notation (e-form), hexadecimal
Dates: RAN implements most of ISO8601 including timezones, ranges ("/"), and uncertainty ("?")
Currency: allows prefixes of any currency symbol like $ or €, and names and locales. ¤ can be used when there is no symbol.
Tuples: an attribute value can be made from multiple values using the tuple delimiters.

These allow much better datatyping for scientific, historical and financial information. No schema is needed.
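The idea of typing a token purely from its lexical shape can be sketched in Python as below. The regular expressions are deliberately loose illustrations and are assumptions, not RAN's actual datatype rules; date ranges, uncertainty markers and tuples are not handled. The names PATTERNS and classify are hypothetical.

```python
import re

# Illustrative patterns only; order matters (hex is tried before quantity).
PATTERNS = [
    ("hex", re.compile(r"0x[0-9A-Fa-f]+")),
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}(T[0-9:]*Z?)?")),
    # A currency symbol followed by a digit, then anything (suffixes, locales).
    ("currency", re.compile(r"[$€£¤]\d.*")),
    # Optional sign, digits, fraction, exponent, then a unit or a degree sign.
    ("quantity", re.compile(r"[+-]?\d+(\.\d+)?(e[+-]?\d+)?(_?[^\W\d_]+|°[A-Za-z]?)?")),
    ("name", re.compile(r"[^\W\d]\w*")),
]


def classify(token: str) -> str:
    """Return the label of the first pattern that matches the whole token."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "unknown"
```

With these patterns, tokens from the example such as 32_Hz, 32°F and -5.123e4 are classed as quantities, 0xBEEF as hex, 2020-06-06TZ as a date and €100 as currency, all without consulting a schema.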

Fragments: A RAN document can be an unbounded stream of fragments:

<<<"I am a fragment"  id}=f1  info=※some info※ >>>
   .... 
<<</"I am a fragment"  id}=f1>>>
...

In the above example, there is a single fragment: it uses <<< and >>> delimiters. It has an ID attribute. It also has an attribute with a literal delimited by a non-ASCII delimiter. Unlike XML, you can see that tag names can be string literals (e.g. "I am a fragment"), not just name tokens. Similarly, you can see that the end-tag has a matching ID attribute, to help with random-access processing. Fragments also allow parsers to skip past sections that will not be needed, for efficiency in retrieving data from large documents. (To indicate that a stream is finite, you can wrap the fragments in <<<< and >>>> tags.)
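The skipping behaviour described above can be sketched as a scan that locates fragment start-tags in the raw text without parsing their contents. This Python sketch assumes a fragment start-tag is a run of exactly three "<" not followed by "/"; the names FRAGMENT_START and fragment_offsets are hypothetical.

```python
import re

# Exactly three '<': not part of a longer run, and not an end-tag ("<<</").
FRAGMENT_START = re.compile(r"(?<!<)<<<(?![</])")


def fragment_offsets(text: str) -> list:
    """Return the offset of every fragment start-tag in the raw text."""
    return [m.start() for m in FRAGMENT_START.finditer(text)]
```

A reader interested in only one fragment could use these offsets to jump straight to the relevant start-tag and parse from there, leaving the other fragments untouched.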

1 For information on the kinds of parsing and application that motivate the design of RAN, see the papers “Mison: A Fast JSON Parser for Data Analytics” (Li et al., 2017) and “Parsing Gigabytes of JSON per Second” (Langdale and Lemire, 2019).
