The ULTRA Data and Text Notation

2025-08-12

ULTRA documents represent rich compound data items organized as tables or trees (elements).

In comparison with existing data languages (XML, JSON, CSV, property files such as TOML, etc.) it provides the following benefits:

  • better representation of the cohesion and coupling of complex data items allows safer, denser and better modeled representations, ensuring that e.g. units and dimensions are not separated from their scalar values;
  • lexical recognition of numerous datatypes allows better expressivity without needing to resort to schemas or implications or custom subsequent processing layers on the basic parsed document;
  • uniquely supports non-intrusive annotation of type and validation information on data items;
  • supports various efficient implementation strategies:  modern CPU instructions such as SIMD, lazy lexing, no spurious whitespace nodes, infinite streams of fragments, column-oriented array storage, sparse tables, hashed keys or identifiers;
  • clearly indicates the intended use of each syntactical feature;
  • full separation of the lexical “I have found an X here” from the schematic “you should find an X here”, so that schemas do not need to provide information about datatypes, identifiers, keys, or grouping constraints on “attributes of attributes”, thus also removing a lower level of elements or reducing the number of attributes.

It provides straightforward import and round-tripping of popular languages such as CSV, property files and JSON, the direct import of XHTML and Canonical XML, and the ability to transport embedded notations such as MarkDown with direct labelling.

Item Pairs

The basic building block is the item: in XML terms an attribute (or group of attributes), in JSON terms a field (or perhaps an object). Tables and trees (elements) are constructed using data items grouped by delimiters into tuples (between [ and ]) or tags (between < and >).

The datatypes that are recognized are:

  • numerical
    • decimal, hexadecimal, scientific e-notation, complex numbers
      • SI quantities (e.g., 26km) and machine storage types (e.g., float32) can be specified
    • numeric symbols such as #NaN (not a number), -#∞ (negative infinity)
    • ISO 8601 (RFC-3339 plus wildcards) dates and times
    • currencies such as €26.00
    • temperature, time and geographic degree numbers
    • fixed field numeric strings such as #00FFAA__RGB
  • literals
    • "" literals with character references
    • '' literals with C-style / delimiters
    • `` literals with no character references
  • symbols
    • recognized symbols such as names and logical symbols
    • unrecognized symbols, allowing private lexical extensions but errors otherwise.

Items can be arranged separated by spaces or by commas (in which case the items have positional indexes, and null values are allowed by consecutive commas) and allow comments. Example: Here is a comma-separated list of simple items with values, including a null item which has a comment:

0x01,0x02,3, ## comment: this item is a null ##, 10.0, "some text" , -0.5
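As an illustration, here is a minimal Python sketch (not a conforming ULTRA lexer; the helper name split_items is hypothetical) of how such a comma-separated item list could be split, with comments stripped and empty slots reported as nulls:

```python
import re

# Hypothetical helper, illustrative only: split a comma-separated
# ULTRA item list, strip ## ... ## comments, and report the empty
# slots left by consecutive commas as None (null items).
COMMENT = re.compile(r"##.*?##")

def split_items(line: str):
    parts = [COMMENT.sub("", p).strip() for p in line.split(",")]
    return [p or None for p in parts]

print(split_items('0x01,0x02,3, ## a null ##, 10.0, "some text" , -0.5'))
# → ['0x01', '0x02', '3', None, '10.0', '"some text"', '-0.5']
```

A real lexer would of course have to honour commas inside string literals; the naive split here is only for illustration.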

These can all be annotated with lists of metadata, be given names or positions, be labelled by referential-integrity function, and have comments and additional rating metadata such as locale, style hints or validation status. Furthermore, items can be put in a sequence separated by “/” to represent ranges, file paths, fractions, complex units, or other highly-cohesive values. Example: here is an item that uses some of these:

period = 2025-11-31/2025-12-31❌range

In this example, the name of the item is a simple symbol. The indicator “=” is the common one, without special significance. The value is a range of two ISO 8601 dates separated by the sequence delimiter "/". However, some process has determined that there is an error, so it has marked the value with a rating, in this case the Unicode Cross Mark and the symbol "range": ratings allow processing annotations to be added distinct from the information set of the document. The ULTRA parser identifies all these components: the name, the indicator, the ISO 8601 dates in a sequence, and the rating.

Items allow data values that are highly cohesive (i.e. only make sense with each other) to be coupled lexically, in a way that is not possible in other data representation languages. For example, XBRL requires that monetary values not only have the scalar value, but also specify the currency, the date and perhaps the locale and scale. Example:

"Surfing&nbsp;Fund" = $26.5__AUD__2025-06-30__000000

In this example, the name is a string literal including a non-breaking space using a built-in character reference. In the value, the leading $ sign indicates a monetary value in dollars, and the series of annotations delimited by __ indicate the particular currency (Australian dollars), the date, and a scaling value. (The table format provides a facility to move these names and the annotations to a column header, so that only 26.5 is needed in the entry.)
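The __-delimited annotation structure is easy to separate mechanically. A minimal Python sketch (the function name split_annotations is an assumption, not part of ULTRA):

```python
def split_annotations(token: str):
    """Split an ULTRA value token into its base value and its
    __-delimited metadata annotations. Illustrative sketch only:
    a real lexer would also classify the base token lexically."""
    base, *meta = token.split("__")
    return base, meta

print(split_annotations("$2__AUD__2025-12-31"))
# → ('$2', ['AUD', '2025-12-31'])
```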

Example: here is a string literal without numeric character recognition but with a metadata annotation to indicate that the string literal contains MarkDown text:

text = `#some text
It was a dark and stormy night.`__markdown

Items not only provide a way to couple highly cohesive data items, allowing denser and clearer representation of the data with better inline typing and annotation of names and values, but also simplify schemas and data structures at the lowest levels.

Grammar

Document

# ULTRA
ultra       ::=  elements | table | xml-compat
elements    ::= "<?ultra.*?>"? ws* ( stream | frag+ | tree | markup )
table       ::= ((header names? decls?) | (names decls?) | decls   )    row*
xml-compat  ::= (("<?xml.*?>" ws* "<!DOCTYPE.*>"?) | "<!DOCTYPE.*>") ws* markup

Introductory Examples

A simple table. Rows have a string, symbol, numeric value with metadata, and a symbol.

[[[["An example table"]]]]  
["Rafa", dog, 8__months, M]
["Lotus", dog, 12__years, M]
["Sweetie", cat, 10__years, F]

Complex sparse table with title in [[[[, column names in [[, positional rows allowing skipped rows, null fields, and a named field:

[[[["An example sparse table"]]]]
[[offset, bias, reliability, latitude, longitude, price]]
1=[0x0000,,,,,]
2=[0x0000,,,26°N,,]
4=[0x0000,,,,,price=€65]

Simple Elements (XHTML compatibility):

<!DOCTYPE html 
  PUBLIC "-//W3C//DTD XHTML 1.1//EN"
         "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html  xmlns="http://www.w3.org/1999/xhtml">
    <body><p id="p1">Hello&nbsp;World</p></body>
</html>

Complex Elements, showing some different kinds of elements.

<<<<document=="I am a stream of fragments">>>>
<<<frag==fragment1>>>
  <<div>>
     <p id==="p1">Hello&nbsp;World</>
  <</div>>
<<</frag==fragment1>>>
<<<</document=="I am a stream of fragments">>>>

Tables

Tables are used as the fundamental grouping for row/column-based data, such as CSV, spreadsheets and scientific or financial data. Named and positional datatyping is available, as are null values and sparse tables.

# TABLES
row         ::= tuple 
tuple       ::= (name indicator)? "["  fields?  "]" 
                metadata* rating* sep?
decls       ::= (name indicator)? "[["  fields? "]]"   
                metadata* rating* sep?
names       ::= (name indicator)? "[[[" fields? "]]]" 
                 metadata* rating* sep?
header      ::= (name indicator)? "[[[[" fields? "]]]]" 
                metadata* rating* sep?

fields      ::= sep? ( wsv+ | csv+ )? 
wsv         ::= value-item sep? 
csv         ::= (value-item sep?)? comma sep?
comma       ::= ","
sep         ::= (ws (ws | comment)* (comma (ws | comment)*)?) 
                | (comma (ws | comment)*)

The direct characters [ and ] always and only mean the table or tuple delimiters, and cannot be used in any other context.  The fields content can be lazily evaluated.
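To make the row grammar concrete, here is a rough Python sketch of reading one positional sparse-table row (the regex and the parse_row name are assumptions for illustration; commas inside literals are not handled):

```python
import re

# Match an optional row name/position before "=", then the bracketed
# field list, e.g.  2=[0x0000,,,26°N,,]
ROW = re.compile(r"^(?:(?P<name>[^=\[\]]+)=)?\[(?P<fields>.*)\]$")

def parse_row(line: str):
    m = ROW.match(line.strip())
    if not m:
        raise ValueError("not a tuple row")
    # Empty slots between commas are the null fields of a sparse table.
    fields = [f.strip() or None for f in m.group("fields").split(",")]
    return m.group("name"), fields

print(parse_row("2=[0x0000,,,26°N,,]"))
# → ('2', ['0x0000', None, None, '26°N', None, None])
```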

Elements

Elements are used as the fundamental grouping for tree structures (property lists, JSON-style structures, XML-style documents.)  The four different types of elements (markup, tree, fragment, stream) have different processing implications regarding whitespace and stream processing.

A Canonicalized XML document or XHTML document (with no comments, PIs or CDATA sections) can be read by an ULTRA parser.  All ISO/W3C  named characters are available as built-in named character references.

# ELEMENTS
# TODO: JSON arrays have dropped out?

markup      ::= "<" name-item attributes? ("/>" | (">" (data | tree | markup)* endelement)) ws*
endelement  ::= "</" name-item? ">"
tree        ::= "<<" name-item attributes? ("/>>" | (">>" ((tree | markup)* | table) endtree)) ws*
endtree     ::= "<</" name-item? ">>"
frag        ::= "<<<" name-item attributes? ("/>>>" | (">>>" ((tree | markup)* | table) endfrag)) ws*
endfrag     ::= "<<</" name-item? ">>>"
stream      ::= "<<<<" name-item attributes? ("/>>>>" | (">>>>" frag* endstream)) ws*
endstream   ::= "<<<</" name-item? ">>>>"
data        ::= [^<>]*              # replaceable with XML character references
attributes  ::=  (ws+ ( attribute-pair | tuple) )+

The direct characters < and > always and only mean the element tag delimiters, and cannot be used in any other context.

Hint: The attributes may be lazily evaluated. After handling the name-item, scan ahead for the next < or >: the characters found match the attributes.
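This hint can be sketched in a few lines of Python (attribute_span is a hypothetical helper; the scan is safe because ULTRA literals cannot contain the delimiter characters):

```python
def attribute_span(doc: str, start: int):
    """Find the span of raw, unparsed attribute text that follows an
    element name: everything up to the next '<' or '>' delimiter.
    Safe because ULTRA literals cannot contain the characters <>[].
    The span can then be lexed lazily, only if the attributes are
    actually needed. Illustrative sketch."""
    for i in range(start, len(doc)):
        if doc[i] in "<>":
            return doc[start:i], i
    raise ValueError("unterminated tag")

raw, end = attribute_span('<p id="p1" class="x">Hello</p>', 2)
print(raw)   # → ' id="p1" class="x"'
```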

Items

An Item is a much-enhanced version of XML attributes, and is used both to represent table fields and markup attributes: both the name and the value allow sequences of tokens of the rich lexical datatypes. The tokens can be annotated with type metadata. The indicator can declare the reference type of the pair. Names and values can be annotated with processing results or hints called ratings.

# ITEMS
name-item   ::= name ( indicator value)? 
value-item  ::= (name indicator)? value
attribute-item ::= (name indicator)? attribute-value

name        ::= sequence rating*  
value       ::= ( tuple | sequence ) rating*
attribute-value ::= sequence  rating*
sequence    ::= step ("/" step)*    
step        ::= token metadata*
metadata    ::=  "__" token 
rating      ::= ( "?" | "*" | "!" | Non-ASCII && \p(So) ) token?     <-- TODO "?" conflicts with 8601 wildcard? 
indicator   ::= sep? "="{1,5} "}"?

Examples

  • Sequences
    • time=start/middle/finish

    • speed=100km/1h

    • period=2025-02-01/2025-03-01

    • wage=$2/day

  • Metadata
    • progress=12.567__float64 
      • The numeric decimal value 12.567 has an annotation which would indicate the storage type: that 8 bytes are required.
    • amount=$2__AUD__2025-12-31
  • Rating
    • gender=Z!unexpected
      • The value is a symbol Z. It is annotated with a ! rating delimiter (indicating something noteworthy) and the symbol unexpected: this might be the result of some kind of validation, where Z is not an allowed value.
    • speaker=Jane👩
      • The value is a symbol Jane. It is annotated with a rating character that is the emoji for woman: this might be some hint for a speech renderer for example.

Tokens & Whitespace

Extensibility: Tokens that do not conform to the lexical datatyping are reported, and can be trapped and processed by custom systems; otherwise they are an error.

# TOKENS   
token       ::= symbol | literal | numeric | undefined
literal     ::= rcdata | ddata | cdata
rcdata      ::= \" [^\"<>\[\]]* \"            # has & character references
ddata       ::= \' [^\'<>\[\]]* \'            # has c-style delimiters  /n /r /t
cdata       ::= \` [^\`<>\[\]]* \`            # no delimiters or references
symbol      ::= [a-zA-Z\0xFF-0x2FFFF][a-zA-Z0-9\-\:\.\0xFF-0x2FFFF]*
undefined   ::= [^<>\[\]\s]+                  # not a symbol: an error unless privately handled e.g. @xxx
numeric     ::= [\+\-\#]? \p{Sc}? [0-9] [^\s<>\[\]/\p{So}\_]*   # See below
 
# NON-SIGNIFICANT WHITESPACE
comment     ::= (";" (\s|";"+) [^;\n<>]* (";"+|\n)) | ("#" (\s|"#"+) [^#\n<>]* ("#"+|\n))
ws          ::= \s                            # space, tab, newline: ASCII only
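A simplified, ASCII-only token classifier following these productions might look like this in Python (the regexes approximate the grammar; the full character ranges and numeric subtypes are omitted):

```python
import re

def classify(tok: str) -> str:
    """Crude classifier for the TOKENS productions: quoted literals,
    symbols starting with a letter, numerics starting with a digit or
    a sign/#, and anything else reported as undefined (an error unless
    privately handled). ASCII-only approximation of the grammar."""
    if re.fullmatch(r"\"[^\"<>\[\]]*\"|'[^'<>\[\]]*'|`[^`<>\[\]]*`", tok):
        return "literal"
    if re.fullmatch(r"[A-Za-z][A-Za-z0-9\-:.]*", tok):
        return "symbol"
    if re.fullmatch(r"[+\-#]?[0-9][^\s<>\[\]/_]*|[+\-#][A-Za-z]+", tok):
        return "numeric"
    return "undefined"

for t in ['"text"', "dog", "0x01", "#NaN", "@xxx"]:
    print(t, classify(t))
```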

Lexical Datatypes: Numerics

Lexical datatyping for Numeric tokens can be done lazily.

# NUMERIC LEXICAL TYPES 
numeric     ::= currency | date | based-number | quantity | exponential |  
                imaginary | decimal | symbolic-number |digitseq 

currency    ::= [\+\-]?  \p{Sc} [0-9][a-zA-Z0-9\-\:\.]*
date        ::= [0-9][a-zA-Z0-9\-\:\.\?\~]*     # ISO 8601 date has an internal -. Ranges are done as sequences
digitseq    ::= "#" number 
based-number::= "0" [a-z] [a-zA-Z0-9]*       # 0x = hex  0o = octal 0b = binary 
quantity    ::= degree | si | miscquant
degree      ::= [A-Z]? decimal "°" (( decimal "′" (decimal "″" )?) | [A-Z] )?    # 123° or 123°F or N123°25′32″
si          ::= decimal [adf-hj-zA-Z][a-zA-Z]* # exclude scientific notation and imaginary
miscquant   ::= decimal [0xFF-0x2FFFF]+        # regional or specialist units 
exponential ::= [0-9]+ ("." [0-9]+)? ( "(" [0-9]+ ")" )? "E" "-"? [0-9]+
imaginary   ::=                                # TODO: complex-number production not yet defined
decimal     ::= [0-9][a-zA-Z0-9\.]*         # decimal numeric token is ASCII only 
symbolic-number ::= [\+\-\#] symbol
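The lazy subtype dispatch can be illustrated with simplified patterns (a sketch only: the real productions use full Unicode categories such as \p{Sc}, while these regexes hard-code a few currency signs and ASCII ranges):

```python
import re

def numeric_kind(tok: str) -> str:
    """Rough, ordered dispatch over the numeric subtypes. The order
    matters: e.g. the date test must run before the decimal test."""
    if re.fullmatch(r"[+\-]?[€$£¥][0-9][0-9A-Za-z\-:.]*", tok):
        return "currency"
    if re.fullmatch(r"0[a-z][0-9A-Za-z]*", tok):
        return "based-number"
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}(-[0-9]{2})?[T:0-9.~?Z\-]*", tok):
        return "date"
    if re.fullmatch(r"[+\-#][A-Za-z∞]+", tok):
        return "symbolic-number"
    if re.fullmatch(r"[0-9][0-9.]*", tok):
        return "decimal"
    return "other"

for t in ["€26.00", "0x1F", "2025-12-31", "#NaN", "12.567"]:
    print(t, numeric_kind(t))
```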

Examples

Purchase order example

This is a version of the common XML purchase order example.

<?ultra?>
<<PurchaseOrder==="99503"   ## A version in ULTRA of the well-known XML example
  OrderDate=1999-10-20>>
  <<Address=="Shipping"
    Name="Ellen Adams"
    Street="123 Maple Street"
    City="Mill Valley"
    State=}CA__USA
    Zip/Postcode=}"10999"__USA
    Country=}USA  
  />>
  <<Address=="Billing"
    Name="Tai Yee"
    Street="8 Oak Avenue"
    City="Old Town"
    State=}PA__USA
    Zip/Postcode=}"95819"__USA
    Country=}USA
  />>
  <DeliveryNotes🚚🐕>Please leave packages in shed by driveway.</DeliveryNotes>
  <<Items>>
    <<Item =} "872-AA"✓
      Quantity=1 
      Price=$148.95__USD >>
      <ProductName>Lawnmower</ProductName> 
      <Comment⏳>Confirm this is electric</Comment>
    <</Item>>
    <<Item =} "926-AA"❌DISCONTINUED
       Quantity=2 
       Price=$148.95__USD
       ShipDate❌IMPOSSIBLE=1999-05-21 >>
      <ProductName>Baby Monitor</ProductName> 
    <</Item>>
  <</Items>>
<</PurchaseOrder==="99503">>

The changes compared to the XML version are:

  • The document has a "magic number" of "<?ultra?>"
  • Tree elements (using <<) are used rather than < in most places, so that there are no spurious whitespace nodes generated.
  • Typed values are all given in attributes. Element data content is only used for text strings that are in English language (i.e., product names, delivery notes, and comments.)  The Address elements can thereby use the empty element form "/>>".
  • The identifiers for elements are put as value directly on the element name.
    • The PurchaseOrder element uses indicator === to signify that this is a unique identifier (i.e. like XML ID); this ID is repeated on the end-tag.
    • The Address elements use the indicator == to indicate that these are keys: non-unique values that nevertheless can be  indexed for faster retrieval.
    • The Item elements use the =} indicator to say that the value is a reference to some other identifier, not represented in the document.
    • The @State, @Zip/Postcode and @Country attributes use the =} indicator, which signifies that the symbol values belong to some external vocabulary. (An implementation could then use this information e.g. to generate an error if a validator is unable to resolve the list.)
      • The State attribute also has a metadata  value __USA so that the value can be resolved atomically, without looking at other attributes (i.e. the @Country attribute.) In this example, the @Country attribute is still given, though it is redundant.
      • The former Zip element now has a name using a sequence separated by "/":  @Zip/Postcode.  Sequences are allowed in names, and generally will refer to alternatives.   Again, the string  value has a metadata annotation of __USA, so that the value explicitly has everything needed to validate it without looking up other attributes.
      • The attribute values do not say that e.g. @State should be validated against a table of states: they merely say that 1) the value is in an external vocabulary, and 2) the value has metadata of __USA. An incremental on-demand validation microservice could be set up to handle a request "Validate an Address/@State attribute with metadata USA and value CA"
  • Attribute values that have a type are not given in literals, to allow lexical datatyping. So PurchaseOrder/@OrderDate 1999-10-20 is recognized as a date (because it starts with a digit and has an internal "-"). Instead of a USPrice element, we have a Price attribute, where the value is known to be a currency (by the leading $) and the particular national dollar is given by metadata (the trailing __USD). In this example, the state and country are represented using symbols, not literal strings.

Some additions compared to the original:

  • The top-level element has an internal comment:  this starts with “##” and continues to a matching “##” on the same line, or the end of line (as in this case.)
  • Ratings annotations provide non-disruptive ways to annotate the document with information that subsequent process or systems may find useful:
    • The second item has some rating annotation:
      • a rating annotation on the part number value:  ❌DISCONTINUED:  this would indicate that some validation process has found  a problem. 
        • The first Item also has a rating annotation on the part number: in this case a check mark ✓  to indicate e.g. that the part is a current product.
      • a rating annotation on the @ShipDate name: ❌IMPOSSIBLE: this would indicate that the shipping date  cannot be met (presumably because the product is discontinued.)
    • The DeliveryNotes element name also has two rating annotations:
      • a truck emoji  🚚 which might indicate e.g. that some AI has classified the delivery note as instructions for the attention of the courier,
      • a dog emoji 🐕 indicating that the address is known to have a problematic dog.
    • The Comment element name has an hourglass emoji indicating (e.g. to some workflow system) that there is some time-based issue.
    • Ratings use their own kind of markup and so do not disrupt or intrude on the information in the purchase order, in the way that e.g. adding an extra attribute would.  Furthermore, the rating is attached to the information that has the problem: the name or the value.  Ratings allow the status of some information to be marked up, so the same document can be progressed through a workflow and annotated without requiring a schema change for the elements and attributes.  A rating is distinguished from metadata in that metadata provides the information needed to make sense of the main value, while a rating has accidental information.

Bible Example

Another common example from XML is simple markup of a Bible.

<?ultra?>
<<<<bible===tyndale1537  lang=}en  info="Public Domain" >>>>
   <description>English Tyndale 1537</>

<<<book===genesis section==OT>>>
   <title>Genesis</title>
   <<chapter===genesis.1>><number>1</>
      <verse===genesis.1.1><number>1</>
      In the beginning God created heaven and earth.</>
      <verse===genesis.1.2><number>2</> 
      ...
   <</chapter===genesis.1>>
<<</book===genesis>>>

...

<<<</bible===tyndale1537>>>>

In the example,

  • The outermost layer uses a stream element, with the 4 < delimiters. This indicates that the contents will be a series of fragments that can be processed independently, terminated by the stream close tag.
    • The ID indicator === says that the symbol tyndale1537 is the identifier for this element. The ID is repeated in the end-tag.
    • The lang attribute uses the simple reference indicator =} to show that the symbol en is a value in some external vocabulary.
  • The next layer uses the fragment element, with the 3 < delimiters. The ID indicator === says that the symbol genesis is the identifier for this element. The key indicator == says that the element can be classified or grouped using the symbol OT.  The ID is repeated in the end-tag.
    • Note: An ULTRA parser can efficiently skip through fragment text to locate a desired fragment, by looking for the “>>>” string using branchless SIMD instructions. (E.g. read 16 bytes of data into a register, treat it as 8 × ushort, match for ushort “>>” and store the result vector, shift left 1 byte, match for “>>” again, then AND the result with the stored result. If the result is not 0, then a run of “>>>” or “>>>>” has been found.)
  • The title has text, so it uses a normal markup element.
  • Each chapter uses a tree element, with a symbol as the unique identifier.
  • Verses and numbers contain text, so they use markup elements. In this case, they use short end-tags as well.
  • The lexer only needs to look at the immediately following and preceding tags to know whether whitespace is significant, in the major cases. This promotes efficiency.
    • All whitespace which comes immediately before or after a stream, fragment or tree tag is not reportable or significant: the lexer skips over it.
    • Whitespace that comes immediately after a markup start-tag or before a markup end-tag is significant and reportable: it is part of the character data of the document.
    • The remaining case of whitespace, i.e., coming between a markup end-tag and a markup start-tag (such as between the two verses), is reportable by a lexer, but the parser must determine whether it is significant or not (by looking at the context stack): if the parent of the whitespace is not a markup element, it is not significant.
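A scalar analogue of the SIMD fragment scan described above, as a Python sketch (next_fragment is a hypothetical helper; a production implementation would use vector instructions as the note explains):

```python
def next_fragment(doc: bytes, pos: int) -> int:
    """Skip forward to the next run of three '>' characters, so a
    parser can jump between fragments without lexing the content in
    between. A '>>>>' stream delimiter also matches, just as in the
    SIMD version. Returns -1 when no delimiter remains."""
    i = doc.find(b">>>", pos)
    if i < 0:
        return -1
    return i + 3   # resume lexing just after the delimiter

doc = b"<<<frag==f1>>> lots of text <<</frag==f1>>>"
print(next_fragment(doc, 0))   # → 14
```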

Comparison with XML

Rationale

The notion of simplicity adopted for ULTRA’s design is that it does not simplify matters to remove things from a language which then require out-of-band communication or a schema to get back (let alone a DTD plus an XSD). (In a sense this may mirror the program of the WHATWG group in gaining overall simplicity by consolidating numerous layers into a single specification.)

Similarly, it does not simplify implementability by having any top-level lexical modality: it must be possible to navigate without fully parsing and to skip unwanted sections, or to start at any random location and determine the current lexical state by only going back or forward to the nearest tag.

Further, it does not simplify things by representing metadata that is needed to understand typed values (e.g. units, dimension)  as if it were merely another attribute with no connection (as far as the lexer or schema is concerned) to the value. Instead, ULTRA encourages such metadata to be visually tied to the data value using the __ annotation, and for ranges and paths to be joined together using the sequence delimiter /.   Furthermore, the advent of emojis, flags, currencies and other non-identifier characters such as check marks and cross marks  in Unicode supports new kinds of visual markup and annotation (ULTRA ratings.) 

So the bang-per-buck is maximised by providing much more bang for only a little more buck than naive minimalism would produce, building on the much richer syntax for items (attributes) shared by both tables and trees. The table and tree languages only take around 35 grammar productions combined, and the lexical datatyping only takes 25 or so more regular-expression productions.

Furthermore, simplicity of implementation in the age of generative AI LLMs means having clear specifications of grammar  productions or regular expressions and their intent, with simple layering, rather than the reduction of features.

Nevertheless we can start by removing various features:

Simplification:

  • External form is notionally UTF-8 only.
  • Data content is for text only and has no typing possible. Typed data must use attributes.
  • No namespace mechanism. A namespace declaration in the compatibility mode is merely another attribute.
  • No DOCTYPE or Schema required (or currently defined.)  The Information Set of a parsed ULTRA document is more like the XSD PVI, allowing various type annotations.
  • No CDATA sections.
  • No comments.
    • Comments are instead allowed inside tags, delimited with e.g. ## or a newline
  • No PIs.
    • Rating annotations are allowed on item names and values, in particular using symbols, flags and emojis, allowing distinct annotation of items (attributes) with parameters or results of processes so as not to intrude on the information set.
  • The attribute name and indicator are not required for attributes (items); values can simply be separated by whitespace.
  • The end-tag for a markup element may have the element name removed (</>) to reduce file size, at user discretion.
  • Thus a markup element only has untyped data content (with possible built-in character references) or sub-elements (markup elements or tree elements).
    • No whitespace immediately before or after a  tree, stream or fragment tag is reportable (significant) whitespace.
  • Well-formedness is scoped to the fragment level only: a lexical error in one fragment does not impact the well-formedness of prior or subsequent fragments.
    • As well, items with tokens that start with a non-defined character (such as “@”) are available for private syntactical extensions (within the general constraints of tokens: no <>[] or whitespace or character reference recognition.) By default, an unrecognized token is a well-formedness error.

Consolidating DTD/Schema features:

Then we build in some on-ramps for better compatibility, so as not to require schemas or DTDs:

  • All ISO/W3C/MathML character references are built in, resolving to single Unicode characters only. No entity mechanism.
  • A Canonical XML document and an XHTML document can be parsed directly and correctly by an ULTRA processor, under the compatibility mode which allows but ignores a simple DOCTYPE declaration.
  • Lexical processing allows the following (and more) without the need for a DTD or schema:
    • ID, IDREF, IDREFS, KEY, KEYREF - set by the indicator, such as === (ID) or ===} (IDREF), == (KEY) and ==} (KEYREF)
    • NAME, NUMBER
    • Internal Notations

Lexical enhancement:

Then we extend what delimiters can represent, to remove particular problematic points of XML (such as spurious whitespace generation and the lack of streaming):

  • Each basic type of delimiter has a richer set with particular significance:
    • As well as the expected markup element <x>  there are
      • the tree element <<x>> which has no text nodes
      • the fragment element <<<x>>> which can appear in an infinite sequence, and be scanned for quickly using e.g. SIMD instructions
      • the top-level stream element <<<<x>>>> which provides a determinate finish for a stream of fragments.
    • As well as the expected general attribute indicator =  there are
      • == the value is a key (a non-unique identifier for the element)
        • ==} the value is a reference to one or more keys
      • === the value is a unique ID, unique within the fragment.
        • ===} the value is a reference to one or more unique IDs
      • ===== the value is a UUID (a Universally Unique Identifier)
        • =====} the value is a reference to a UUID
      • and =} the value is a reference to some value without any constraint of uniqueness or scope or format.
      • For example, the standard way to show a unique identifier on an element start-tag is <<book==="b1">>... where the value attaches to the name of the element.
    • As well as the string literals using the quote delimiter ", there are
      • A string  literal delimited by back quotes `: these do not have character references recognized.  The characters <>[] are always and only recognized as delimiters, and must always be represented using character references of some kind.
      • The string literal delimited by apostrophe ': this means that C-style delimiters such as \n and \0x00 are recognized instead of XML-style character references.
  • As well as other datatypes, the [] delimiters allow tuples as attribute names or values. A tuple has a sequence of items, separated by commas or whitespace. A tuple can have a name, metadata annotations and rating annotations. A tuple cannot contain a tuple. For example <dog understands=["sit", "down", "drop it", "ignore me"] />
    • The tuple delimiters are also used to represent tables, which can appear in fragments instead of tree elements or markup elements.  These have no equivalent in XML and allow null values, positional parameters, annotation elision (column typing), and sparse tables.
  • To support fragments, robustness and navigation better, the end-tags may (markup elements and tree elements) or must (stream and fragment elements) have the identifier value in the end-tag, e.g.  <<<section==="fragment1">>><p>hello world</><<</section==="fragment1">>>