Extension: RAN Pragma IP

(This section needs to be re-worked in consideration of the most recent adjustments to IPs.)

RAN Pragma IP is an adjunct to RAN that provides information for optimized processing. A RAN Pragma IP would typically be consumed by the parser and not passed back to the application, because it describes the particular bytes coming in rather than the information content of the document.

Pragma IP Syntax

The general syntax of the RAN header is an IP with simple "IP attributes":

"<?RAN" target* attributes* "?>"

A RAN Pragma PI may appear anywhere in the document. Only pragmas at the start and end of a document are active; others should be ignored (and preferably replaced by whitespace of the same number of bytes). There may be multiple RAN pragmas at the start and end, so that each can cover a separate topic.

Because it is often useful to have a "magic number" signature at the start of a file, it is best practice to have a RAN pragma IP at the very start, akin to XML's header. This can be as simple as <?RAN?>
Because RAN only allows two encodings (UTF-8 with a BOM, or UTF-16 with no mistaken BOM), there is no need for an equivalent of the XML header's "encoding=" attribute.
The PI attribute name "version" is reserved. It may allow a major-minor version number, if that becomes necessary.
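
As an illustration of the syntax above, here is a minimal Python sketch of reading the attributes of a pragma. The grammar it accepts is an assumption inferred from the examples in this section (the target "RAN", then whitespace-separated name=value pairs with unquoted values); the function name parse_pragma is hypothetical and any additional targets are not handled.

```python
import re

# Assumed grammar, inferred from the examples in this section:
# "<?RAN", optional whitespace-separated name=value pairs, then "?>".
PRAGMA_RE = re.compile(r"<\?RAN(?:\s+(.*?))?\s*\?>", re.DOTALL)
ATTR_RE = re.compile(r"([A-Za-z_][\w-]*)=(\S+)")


def parse_pragma(text):
    """Return the attributes of the first RAN pragma in text, or None."""
    match = PRAGMA_RE.search(text)
    if match is None:
        return None          # no pragma present
    body = match.group(1) or ""
    # Values are kept as strings; the caller decides how to interpret them.
    return dict(ATTR_RE.findall(body))
```

A bare <?RAN?> signature yields an empty attribute set, which is distinct from the None returned when no pragma is found.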

Aligned Fragments

These two IP attributes, @align and @scan, allow a scanner looking for fragment start tags to skip faster to fragment starts. The document is generated with particular padding, and the IP is written out (either at the beginning or the end) to give the appropriate parameters. The scanner on the receiving system can then use the parameters to optimize the scanning.

A typical use might be:

<?RAN align=12 scan=8 ?>

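which means that fragments are padded to start on 4K (2^12 byte) boundaries, with each fragment-start delimiter falling within the first 256 (2^8) bytes after a boundary. A scanner looking for fragment-starts may:

Start scanning the document to find the first "<<" fragment-start open-delimiter.
After this point, skip to the next 4K boundary and look in the first 256 bytes for a "<<[^/]" delimiter, repeating at each subsequent boundary to the end of the document.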
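
The two steps can be sketched in Python as follows. This is only an illustration under stated assumptions: @align and @scan are taken to be base-2 exponents (12 giving 4K, 8 giving 256, as in the example), the function name aligned_fragment_starts is hypothetical, and the delimiter patterns are the ones quoted above.

```python
import re

# Pattern quoted in the text: "<<" not followed by "/".
FRAGMENT_START = re.compile(rb"<<[^/]")


def aligned_fragment_starts(data, align=12, scan=8):
    """Yield byte offsets of fragment starts in an aligned document."""
    boundary = 1 << align    # 2**12 = 4096-byte alignment
    window = 1 << scan       # 2**8 = 256-byte probe after each boundary
    # Step 1: ordinary scan for the first "<<" open-delimiter.
    first = data.find(b"<<")
    if first < 0:
        return
    yield first
    # Step 2: jump from boundary to boundary, probing only the window.
    pos = (first // boundary + 1) * boundary
    while pos < len(data):
        match = FRAGMENT_START.search(data, pos, pos + window)
        if match is not None:
            yield match.start()
        pos += boundary
```

A boundary whose window contains no delimiter is simply skipped, which covers boundaries that fall inside a fragment longer than the alignment unit, provided no "<<" occurs in fragment content within the probed window.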

Example:

If we have a document with 1000 fragments of exactly 10k each, then unaligned RAN would be 10M;

an unaligned scanner that checked bytes makes 10M comparisons; O(n)
an unaligned scanner that used SIMD instructions to read 256 bytes makes about 41,000 comparisons (of those 256 byte chunks). O(n/256)

Generating an aligned file using the 12 and 8 values above results in a file of 12M;

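an aligned scanner probes one 256 byte chunk at each 4K boundary; the 12M file has 3072 such boundaries, so it makes about 3072 comparisons, roughly 13 times fewer than the SIMD scan of the unaligned file. For a given number of bytes the ratio is 2^(align - scan), here 16. Of course, there are other cache and prefetch effects in play.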
A non-aligned scanner reading an aligned file would of course be slightly slower because of the extra padding, but the top-level whitespace is not part of the information set of the document, so there is no difference in content.
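
The arithmetic of this example can be checked with a few lines of Python. This is a sketch only: the function and key names are hypothetical, sizes are taken as binary megabytes, and the one-off scan for the first fragment is ignored. One probe per 4K boundary gives 2560 probes counted over 10M of content and 3072 counted over the padded 12M file.

```python
def comparison_counts(content_bytes, padded_bytes, align=12, scan=8):
    """Count comparisons for the scanning strategies in the example."""
    window = 1 << scan       # bytes examined per chunk comparison
    boundary = 1 << align    # distance between aligned probes
    return {
        "bytewise": content_bytes,                       # one per byte
        "simd_unaligned": content_bytes // window,       # one per chunk
        "aligned_over_content": content_bytes // boundary,
        "aligned_over_padded": padded_bytes // boundary,
    }


MIB = 1024 * 1024
counts = comparison_counts(10 * MIB, 12 * MIB)
```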

Location:

A RAN Pragma IP with @align and @scan should only appear at the very start of the file.

Fragment Range and Count

These IPs give the range of Fragment Identifiers in the document and the count of fragments. This allows an application seeking a fragment in multiple RAN documents to exclude the current RAN document if the fragment identifier is out-of-range, or to curtail scans faster.

A typical use might be:

<?RAN low=x1234 high=a12345 fragments=4 ?>

These attributes at the start of the document override any in a pragma IP at the end, so use one or the other but not both. As more fragments are appended, a new pragma IP can be placed at the end.

A receiving system can use @low and @high to decide whether the RAN document could contain the fragment sought. There is no implication that the @low fragment is first in the document, nor that the @high fragment is last.

A receiving system scanning for fragment-start tags in a RAN file can stop after it has found @fragments occurrences. This can be useful where the last fragment sought is quite large.
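
Both uses can be sketched in Python. The function names are hypothetical, and because this section does not say how fragment identifiers are ordered, the ordering is an assumption that the caller must supply as sort_key.

```python
def may_contain(wanted, attrs, sort_key):
    """Return False only if @low/@high rule the document out."""
    low, high = attrs.get("low"), attrs.get("high")
    if low is not None and sort_key(wanted) < sort_key(low):
        return False
    if high is not None and sort_key(wanted) > sort_key(high):
        return False
    return True          # in range, or no range given


def scan_with_count(starts, attrs):
    """Yield fragment-start offsets, stopping after @fragments of them."""
    limit = int(attrs["fragments"]) if "fragments" in attrs else None
    for seen, offset in enumerate(starts, start=1):
        yield offset
        if limit is not None and seen >= limit:
            return       # do not pull any further offsets from the scanner
```

Here starts can be any lazy iterator of offsets, so stopping early means the remainder of the document is never read.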

Ordered Fragments

This PI attribute specifies that the RAN document was generated with ordered fragments.

A typical use might be:

<?RAN ordered=yes ?>

When a scanner is looking for a particular fragment (or list of fragments) by fragment key, it can stop as soon as it finds a fragment with a larger key value than the one it is looking for. This halves the number of comparisons needed, on average, though the scan remains O(n).
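
A minimal Python sketch of the early stop, assuming the keys arrive in document order and compare with the same ordering the generator used (the function name is hypothetical):

```python
def find_in_ordered(keys_in_document_order, wanted):
    """Return the position of wanted, or None, stopping at the first larger key."""
    for index, key in enumerate(keys_in_document_order):
        if key == wanted:
            return index
        if key > wanted:
            return None      # ordered: no later fragment can match
    return None              # reached the end without finding it
```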

Ordering also permits a binary chop approach: start at the half-way point of the document, scan for the fragment-start open-delimiter "<<<[^/?!]", read its keys (scan for each ==, stopping if other use of = is found), and compare them with the key sought to exclude the half of the range that cannot contain it; recurse on the remaining half. O(log n)
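
The binary chop might look like the following Python sketch. It is an illustration under assumptions, not a definitive implementation: the fragment-start pattern and the read_key function are supplied by the caller (both hypothetical parameters), keys are assumed to compare in the generator's order, and the pattern is assumed not to match inside fragment content.

```python
import re


def find_fragment(data, wanted, read_key, start_re):
    """Return the offset of the fragment whose key is wanted, or None."""
    lo, hi = 0, len(data)    # the wanted start, if any, lies in [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        # First fragment start at or after the half-way point.
        match = start_re.search(data, mid)
        if match is None or match.start() >= hi:
            hi = mid         # nothing starts in [mid, hi)
            continue
        key = read_key(data, match.start())
        if key == wanted:
            return match.start()
        if key < wanted:
            lo = match.end()  # wanted fragment starts later
        else:
            hi = mid          # wanted fragment starts before mid
    return None
```

The number of keys read is logarithmic in the document size, although each step still scans forward from the mid-point to the next fragment start; combining this with Aligned Fragments would shorten that forward scan.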

Ordered Fragments can be used in conjunction with Aligned Fragments.

For Ordered Fragments, if Fragment Range PI attributes are present then the @low PI attribute has the value of the first fragment and the @high PI attribute has the value of the last fragment in the RAN document.