Rapid Access Notation 

Rick Jelliffe (C) 2021-2024

The general goals of RAN:

  • Value: RAN should be a fresh markup/data language that maximizes bang and minimizes buck  
  • Use-cases:
    • RAN should support both documents and data better than XML
    • RAN should support new use-cases not well supported by XML or JSON etc.
      • These  include scientific, technical and financial data, large text corpuses, and data-binding
      • It should address some limitations that puzzled users of XML
  • Technical: Feature and syntax choices should not shoot RAN in the foot when it comes to optimization.

Rapid Access Notation (RAN) is a possible document format, currently under design, to allow fast and efficient semi-random access to elements in fragments on the raw text, with lazy, deferred or multi-threaded parsing.[1]

  • It is designed to be lighter-weight than XML by being MUCH richer lexically (rather than relying on schemas): much richer datatypes and tuples, more expressive delimiters, better relation/graph support, support for unlimited streams, and built-in internationalization, accessibility and hypertext support.
  • It is designed to allow richer infosets or object bindings without requiring any schema loading or processing. 
  • It has features to efficiently support access to fragments of interest in very large files.
  • It is designed to take advantage of features of modern SIMD and GPU processors.

Overview

The basic structure of a RAN document is a tree of specialist structural elements, using various variants of the  familiar element/attribute  tag, for:

  • a finite stream (using <<<< and >>>> delimiters) of
    • fragments (using <<< and >>> delimiters) containing
      • scopes (using << and >> delimiters) containing 
        • branches of elements  (using < and > delimiters)
          • a special anonymous element <_> and </_> allows lists, tuples, etc. 
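
The nesting levels above can be told apart just by counting delimiter characters. The following is a minimal sketch of that idea in Python, not a RAN parser: the names LEVELS, OPEN_RUN and classify_tags are hypothetical, and it assumes that an end-tag puts "/" immediately after the run of "<" characters at every level (the fragment example in this document does so). It also treats comments like any other single-"<" tag.

```python
import re

# Hypothetical names for the four nesting levels, keyed by the number of '<'.
LEVELS = {1: "element", 2: "scope", 3: "fragment", 4: "stream"}

# A run of 1 to 4 '<' characters, optionally followed by '/' for an end-tag.
OPEN_RUN = re.compile(r"(<{1,4})(/?)")

def classify_tags(text):
    """Yield (offset, level, is_end_tag) for each run of opening delimiters."""
    for m in OPEN_RUN.finditer(text):
        yield m.start(), LEVELS[len(m.group(1))], m.group(2) == "/"

if __name__ == "__main__":
    sample = '<<<f>>><<s>><p>hi</p><</s>><<</f>>>'
    for offset, level, is_end in classify_tags(sample):
        print(offset, level, "end" if is_end else "start")
```

No tag content is examined here: the level is known from the delimiter run alone, which is what lets a scanner skip whole levels it is not interested in.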

A RAN file can be a single branch (like an XML document), or a single scope (like an XML document), or a series of fragments, or a single finite stream of fragments with an explicit open and close.

Element names, attribute names and attribute values may be literals (with double quotes) or tokens. The token types have simple lexical rules to distinguish them without full parsing.

  • The token types are influenced by XSD datatypes but take them further: for example, dates support more of the features of ISO8601, including date ranges, date wildcarding and uncertainty.

Linking metadata, such as direction, is assisted syntactically, by providing a range of delimiters between attribute names and values.

One way of understanding many RAN design decisions is that a lexer or parser can start at any point in the document and only needs to find the first preceding (or following) < or > delimiter in order to know how to parse the document from that point. To allow this, “<” and “>” are always delimiter characters no matter where they appear in the document (you cannot “comment out” tags), namespaces may only be declared at the top level of the document, and there are no CDATA sections. This allows parallel or reverse parsing.
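
To illustrate the start-anywhere property, here is a small Python sketch under the assumption stated above, that "<" and ">" never occur except as delimiters. The function names nearest_delimiters and position_kind are hypothetical. Given an arbitrary offset, it looks backwards for the nearest delimiter and decides whether the offset lies inside a tag or inside text, without reading anything earlier in the document.

```python
def nearest_delimiters(text, offset):
    """Return the positions of the last '<' and last '>' before offset (-1 if none)."""
    return text.rfind("<", 0, offset), text.rfind(">", 0, offset)

def position_kind(text, offset):
    """Classify an arbitrary offset as 'in-tag' or 'in-text'."""
    last_open, last_close = nearest_delimiters(text, offset)
    # Because '<' and '>' are always delimiters, whichever came last decides.
    return "in-tag" if last_open > last_close else "in-text"

if __name__ == "__main__":
    doc = '<book id==eg2 alt="an example"><p>Hello world</p></book>'
    print(position_kind(doc, doc.index("alt")))    # offset inside the start-tag
    print(position_kind(doc, doc.index("Hello")))  # offset inside text content
```

A worker thread handed an arbitrary slice of a large file could use a check of this kind to resynchronize before it starts lexing; with XML, a CDATA section or comment could make the same check give the wrong answer.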

RAN “simplifies” XML by removing features that prevent fast parsing using modern processors (parallelism, SIMD, GPUs) but enhances it by taking more advantage of delimiters and lexical typing.

Small Examples

Here is a very simple RAN document,  like XML:

<book id==eg2 alt="an example">
    <!-- A comment -->
    <p>Hello world <b>!</b></p>
</book>

The first unusual thing compared to XML is that the first attribute on book uses "==". This delimiter indicates that the attribute is a primary identifier for that element within its scope (similar to an XML ID): an implementation may index this element for faster reference or retrieval. (There is also a corresponding delimiter “=}” for IDREF.) RAN documents do not need a schema to indicate this functionality.
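
As a sketch of the kind of indexing an implementation might do, the following Python builds an identifier index from the "==" delimiter alone, with no schema. The regular expressions are deliberate simplifications and the names START_TAG, ID_ATTR and build_id_index are hypothetical: they handle only unquoted element names and unquoted identifier tokens, and they ignore scoping.

```python
import re

# Simplified start-tag: '<', an unquoted name, then everything up to '>'.
# End-tags ('</') and comments ('<!') are excluded by the first character class.
START_TAG = re.compile(r"<([^\s<>/!]+)([^<>]*)>")

# Simplified ID attribute: name, the '==' delimiter, then an unquoted token.
ID_ATTR = re.compile(r"(\S+?)==([^\s<>]+)")

def build_id_index(text):
    """Map each '==' identifier value to (element name, offset of its start-tag)."""
    index = {}
    for tag in START_TAG.finditer(text):
        attr = ID_ATTR.search(tag.group(2))
        if attr:
            index[attr.group(2)] = (tag.group(1), tag.start())
    return index

if __name__ == "__main__":
    doc = ('<book id==eg2 alt="an example">\n'
           '    <!-- A comment -->\n'
           '    <p>Hello world <b>!</b></p>\n'
           '</book>')
    print(build_id_index(doc))  # {'eg2': ('book', 0)}
```

The point of the sketch is that the identifier is recognizable from the delimiter itself, so the index can be built in a single lexical pass.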

In RAN, the text between tags is text; however, the values of attributes can be typed data. RAN provides an extremely rich set of complex datatypes, and these are reliably determined using RAN’s datatype rules. All the named entities for Unicode characters are built in and can be used anywhere (and they won’t be recognized as delimiters or whitespace).

The second unusual thing in the example above is that the value of the id attribute does not use double quote delimiters, which means it is a token that will be lexically typed: in this case it is a name token. The lexical typing rules allow far richer types than other data-transfer or schema languages: in particular, they allow the information vital to interpreting a number (such as its unit) to be kept with the number. Here are some examples of lexical typing of rich values:

<lexically-typed-attributes>
   <quantities
     frequency=32_Hz temperature=32°F reading=-5.123e4 transmit=0xBEEF
     ></quantities>
   <dates
     birthday=2020-06-06TZ
     era=1995-05-?01/2024-10-?11 ></dates>
   <currency   
     polish-amount=¤100.10zł_PLN  old-uk-amount=£4.3s.8d euro-amount=€100
     ></currency>
   <tuples
     cords=[ 134.5°  126° 18km 2024-12-01T10:20:10 ]
     amount-at-time= [$100 2024-01-01T12:00 ]
     ></tuples>  
</lexically-typed-attributes>

In the example above, you can see:

  • Quantities: RAN extends the idea of numbers to allow units, degree signs (temperature, rotation), scientific notation (e-form), hexadecimal
  • Dates: RAN implements most of ISO8601 including timezones, ranges ("/"), and uncertainty ("?")
  • Currency: allows prefixes of any currency symbol like $ or €, and names and locales. ¤ can be used when there is no symbol.
  • Tuples: an attribute value can be made from multiple values using the tuple delimiters. 

These allow much better datatyping for scientific, historical and financial information. No schema is needed. 
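
The following Python sketch shows how a token might be typed from its spelling alone. It is only an illustration: RAN's actual datatype rules are not given in this overview, so the patterns, the category names and the function classify_token are all assumptions, chosen so that the example values above fall into plausible categories.

```python
import re
import unicodedata

HEX = re.compile(r"0x[0-9A-Fa-f]+")
DATE_START = re.compile(r"\d{4}-\d{2}")
# A decimal number with optional exponent, followed by an optional unit suffix.
NUMBER = re.compile(r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)(.*)")

def classify_token(token):
    """Guess a lexical type for an unquoted token (assumed rules, not RAN's own)."""
    if HEX.fullmatch(token):
        return "hexadecimal"
    if DATE_START.match(token):
        return "date"
    # Unicode general category Sc covers currency symbols, including the generic one.
    if unicodedata.category(token[0]) == "Sc":
        return "currency"
    number = NUMBER.fullmatch(token)
    if number:
        return "quantity" if number.group(2) else "number"
    return "name"

if __name__ == "__main__":
    for tok in ["32_Hz", "32°F", "-5.123e4", "0xBEEF", "2020-06-06TZ", "€100", "eg2"]:
        print(tok, classify_token(tok))
```

Each test looks only at the first few characters of the token or at a single regular expression, which is the property the lexical typing rules are meant to have: the type is decided without a schema and without full parsing.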

Fragments: A RAN document can be an unbounded stream of fragments:

<<<"I am a fragment"  id}=f1>>>
   .... 
<<</"I am a fragment"  id}=f1>>>
...

In the above example, there is a single fragment: it uses <<< and >>> delimiters. It has an ID attribute. Unlike in XML, tag names can be string literals (e.g. "I am a fragment"), not just name tokens. Similarly, you can see that the end-tag has a matching ID attribute, to help with random-access processing. Fragments also allow parsers to skip past sections that will not be needed, for efficiency in retrieving data from large documents. (To indicate that a stream is finite, you can wrap the fragments in <<<< and >>>> tags.)
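
Skipping can be sketched as follows in Python. This is a simplified illustration rather than a conforming reader: the names FRAG_START, FRAG_END and fragment_spans are hypothetical, fragments are assumed not to nest, and the sketch does not check that the end-tag repeats the start-tag's name and ID. It locates each fragment by its delimiters and reports its extent without parsing anything inside it.

```python
import re

# Exactly three '<' (not part of a '<<<<' stream delimiter), not an end-tag.
FRAG_START = re.compile(r"(?<!<)<<<(?![</])([^<>]*)>>>")
# Exactly three '<' followed by '/', as in the fragment end-tag shown above.
FRAG_END = re.compile(r"(?<!<)<<</([^<>]*)>>>")

def fragment_spans(text):
    """Yield (start-tag content, start offset, end offset) for each fragment."""
    pos = 0
    while True:
        start = FRAG_START.search(text, pos)
        if start is None:
            return
        end = FRAG_END.search(text, start.end())
        if end is None:
            return  # no end-tag yet: more of the stream may still arrive
        yield start.group(1).strip(), start.start(), end.end()
        pos = end.end()  # jump past the whole fragment without reading its body

if __name__ == "__main__":
    doc = ('<<<"I am a fragment"  id}=f1>>>\n   .... \n<<</"I am a fragment"  id}=f1>>>\n'
           '<<<second id}=f2>>><p>x</p><<</second id}=f2>>>')
    for header, begin, finish in fragment_spans(doc):
        print(header, begin, finish)
```

A reader looking for one fragment could compare each reported start-tag content against the ID it wants and parse only the matching span.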

[1] For information on the kinds of parsing and application that motivate the design of RAN, see the papers “Mison: A Fast JSON Parser for Data Analytics” (Li et al., 2017) and “Parsing Gigabytes of JSON per Second” (Langdale and Lemire, 2019).

Details