Extended Lexical Datatypes

RAN attribute names or values can be marked up with double quotes. These are treated as strings, with no parsing of their contents. (Unlike XML, the apostrophe ' cannot be used as the delimiter for a literal.)

However, if they are just tokens with no double quotes, then their lexical signature is used to assign them to a datatype, to which they should conform. These types are far more extensive than, e.g., XML Schema Datatypes or JSON datatypes.

UUID and primary key attributes are treated as (a tuple of) string literals, even if tokens are provided. Lexical typing of tokens only applies to key and general IDs.

A start- or end-tag can also include ellipsis “...” which indicates that the markup is known to be incomplete: this does not affect the lexical typing of tokens, it merely means a validator should not report errors relating to absent or incomplete items.

One advance of XML over SGML was that a DTD was not needed to set how to interpret delimiters. However, XML Schemas brought back heavyweight schemas, because they were needed to know how and in which contexts to parse values according to their datatype. RAN provides lexical rules so that it is not necessary to have a schema in order to have highly-typed attribute values. (Typed data is not validated as such, it is lexically recognized.)

Type Hierarchy

The type hierarchy is:

Literal (string) - starts and ends with e.g. double quotes
               - or starts and ends with any Unicode punctuation character
               in the BMP, excluding combining or separator characters, and
               excluding delimiters <>&=/ etc. If the character has an opposite-facing twin, that must be used to close.
           So ※hello※   or «hello» or »hello«
Ellipsis - "..." used to signal some kind of incompleteness or elision
Lexical Token

Hex number string (RGB) starts "#" <-- removed
Money - starts “$” | Unicode currency

currency
amount

default

with/without date/date-interval

standard

with/without date/date-interval

regional

with/without date/date-interval

External - starts and ends with "%"
URI contains "://"
Relative-path starts "./"
Quantity - starts “+” | “-” | ([0-9]* | infinity) any*

integer

base 10

with/without standard/regional units

hexadecimal

with/without standard/regional units

decimal

decimal notation

with/without standard/regional units

e notation

with/without standard/regional units

infinity
fraction ??

Anchor-path contains “:” no "/" ???
Modern Date (ISO 8601) starts with digit and contains two "-" not consecutive

point date

simple date
date-time

with/without wildcard
with/without qualification

range (two point) contains "/"

Symbol

Logic

basic: “true” | “false” |"yes" | "no" | “y” | “n”
extended: “true” | “false” |"yes" | "no" | “y” | “n” | "both" | "neither" |"unknown" | "possible" | "impossible" | "necessary" | "non-necessary" | "unknowable" | "missing" | "impossible" | "error" | "dontcare"
implementation/session-dependent registered alias

Name (catch-all, no ASCII whitespace or brackets)

including “null”, "NaN", "qNaN", “sNaN” (IEEE 754)

Regular expressions

Quantity : ^[\+\-]?(\u221E|[0-9]) <- any* following
ISO8601 Date : ^.+-.+-.+$
Money: ^\p{Sc} <- any* following
URI : ^([a-zA-Z][a-zA-Z0-9\.+\-]*:(//|.+@)|urn:|ftp:) <- any* following

Token Classification

Tokens are substrings inside tags that are separated by whitespace and RAN delimiters (<> = " []). A token (i.e. not a literal string, ellipsis or underscore) is a lexical-token that can be lexically typed if, after substituting any character references:

it starts with any non-alphabetic ASCII character or a currency character or ∞
it contains any non-alpha/digit ASCII character other than _ in any other position

Otherwise the token is a symbol. A symbol therefore cannot start with a non-alphabetic ASCII character, such as a digit. It cannot start with . or have any occurrence of -, unlike XML. The underscore character must be used, or a string literal, like JSON. Otherwise, any non-ASCII character is allowed. A RAN parser may provide a warning where a name uses a character not allocated by Unicode, or not following PUA conventions, or is a combining character by itself, or does not follow XML name rules: such breaches do not stop a document from being well-formed.
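The two conditions above can be sketched as a small predicate. This is only an illustrative sketch in Python, not part of RAN: the function name `is_lexical_token` is hypothetical, it assumes that either condition alone is sufficient, and it assumes character references have already been substituted.

```python
import unicodedata

INFINITY = "\u221e"  # the ∞ character


def is_lexical_token(token: str) -> bool:
    """Return True if the token would be lexically typed, False if it is a symbol.

    Assumption: either of the two conditions is enough on its own.
    """
    if not token:
        raise ValueError("empty token")
    first = token[0]
    # Condition 1: starts with a non-alphabetic ASCII character,
    # a currency character (Unicode category Sc), or infinity.
    if first.isascii() and not first.isalpha():
        return True
    if unicodedata.category(first) == "Sc" or first == INFINITY:
        return True
    # Condition 2: any later ASCII character that is not a letter,
    # digit or underscore.
    return any(
        ch.isascii() and not ch.isalnum() and ch != "_" for ch in token[1:]
    )
```

Under these assumptions `99_kg` and `my-name` are lexical tokens, while `my_name` and `true` fall through to being symbols.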

LEXICAL-TOKEN  ::= PREFIX | INFIX | SYMBOL

PREFIX        ::= MONEY | RELATIVE-PATH | EXTERNAL |  QUANTITY

MONEY         ::= ("$" | {>ASCII} && \p{Sc})  any+
RELATIVE-PATH ::= "."  any+                       
EXTERNAL      ::= "%"  any+ "%" 
QUANTITY      ::= ("+" | "-")?  ( infinity | digit any* )
infinity      ::= ∞ | ( 0xE2 0x88 0x9E ) | "&infin;"

INFIX         ::= ISO8601 | ANCHOR-PATH | URI 

ANCHOR-PATH   ::= any+ ( "+" | "~" ) any+
URI           ::=  any* "/"  any+ 
ISO8601       ::= [0-9X\?%\~] [^\-]* "-" [^\-]+  "-" any+

SYMBOL          ::= any+

These productions are in order, with SYMBOL as the fall-through case. The SYMBOL production covers names and logic identifiers.

Note: This means that a SYMBOL token is slightly different from XML's names, and cannot start with "." or have any occurrence of "-". Element, attribute, link and IP names should not be the same as a logic keyword.

Lexical typing of Tokens

Any attribute value that is not specified inside double quotes is parsed (lexed) using lexical typing: the presence of some non-letter character in the token determines its type.

token         ::= lexical-token | symbol
lexical-token ::= quantity | money | iso-8601
    | uri  | relative-path | anchor-path | external
symbol        ::= logic |  name

Logic

Familiar: The traditional Boolean tokens of "true" and "false" are recognized as logic values. Declarative XSLT-style “yes” and “no” are aliases. (Localization: an implementation can register other strings as aliases of these values.) Case-insensitive.

Extended: RAN recognizes various tokens useful for the major 3- and 4-value logics. No system of logic is implied: the system only provides token recognition.

For 3-value logic:

for SQL-style 3-value logic, where there is some comparison with NULL, the token "unknown" may be appropriate;
for Kleene-style 3-value logic, where no information is available, the token "unknown" may be appropriate;
for Jan Łukasiewicz-style 3-value-logic, where the value has yet to be determined (i.e. for a future event) the value "possible" may be appropriate

For 4-value logic:

for Belnap-style 4-value logic,

"both" means conflicting reports have been made, or that some parts are true, some are false.
"neither" means no information is available, or that some answer other than Boolean true and False are

for SAE J1939 CAN logic:

"error"
"unknowable" may be used for the CAN Not Installed, indicating that no e.g. Boolean answer was possible because the component to determine it was not installed (such as a sensor.)

for Stoic modal logic, loosely:

"possible" - can become "true"
"impossible" - cannot become "true"
"necessary" - cannot become "false"
"non-necessary" - can become "false"

For Larson logic (?)

unknown
unknowable

Some ad hoc additions:

"dontcare" is also available (from IEEE 1164) appropriate

a result was not calculated for pragmatic or role reasons,
a Boolean value was not

"missing" as logic equivalent to NULL.

logic         ::= boolean | choice | indeterminate | modes | unavailable | significance
boolean       ::= “true” | “false” 
choice        ::= "yes" | "no" |  “y” | “n”
indeterminate ::= "both" | "neither" | "unknown"
modes         ::= "possible" | "impossible" | "necessary" | "non-necessary"
unavailable   ::= "unknowable" | "missing"
significance  ::= "impossible" | "error"  | "dontcare"
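Since RAN only recognizes these tokens and implies no system of logic, recognition can be pictured as a plain table lookup. The following is a rough sketch in Python; the names `LOGIC_CATEGORIES` and `logic_categories` are hypothetical, and it assumes that case-insensitivity applies to every logic token, not only to the familiar Boolean ones.

```python
# Category names and members are copied from the productions above.
LOGIC_CATEGORIES = {
    "boolean": {"true", "false"},
    "choice": {"yes", "no", "y", "n"},
    "indeterminate": {"both", "neither", "unknown"},
    "modes": {"possible", "impossible", "necessary", "non-necessary"},
    "unavailable": {"unknowable", "missing"},
    "significance": {"impossible", "error", "dontcare"},
}


def logic_categories(token: str) -> list[str]:
    """Return every category that lists the token (empty if it is not a logic token)."""
    key = token.casefold()  # assumption: all logic tokens are case-insensitive
    return [name for name, words in LOGIC_CATEGORIES.items() if key in words]
```

Note that "impossible" appears in two productions, so a lookup for it reports both `modes` and `significance`; an application would need its own rule for choosing between them.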

Example:

Here is an example of ad hoc use of the logic datatypes.

<<<"Vote Report"        id:=vr>>>
   <vote value=true     name:="Pete">Pete agreed</vote>
   <vote value=false    name:="Computer">Computer says "no"</vote>
   <vote value=missing  name:="hoffa"
              >The vote by Hoffa is registered but not found</vote>
   <vote value=error    name:="Santa"
              >Santa is not real, so the vote is recorded as an error</vote>
   <vote value=impossible name:="Napoleon"
              >The late French Emperor's vote is recorded as not possible.</vote>
   <vote value=possible name:="Julian"
              >Julian's vote is possible, but not yet received</vote>
   <vote value=both     name:="Vicky Pollard"
              >Vicky voted "Yeah but no but yeah"</vote>
   <vote value=dontcare name:="Lauren Cooper">Lauren ain't bovered</vote>
<<</"Vote Report">>>

Quantity

Familiar: The traditional numbers such as 5 and +6.2 and -099.999 are recognized as numbers. RAN supports hexadecimal numbers, with 0x prefix, such as 0xBEEF.

Extended: RAN supports metric, such as 99_kg for 99 kilograms. The underscore followed by a Système international (d'unités) metric prefix followed by any name is recognized. As well, conventions exist for regional units. European-style decimal commas are allowed.

Dimensioned: Partly following XBRL, a quantity may also have a date, date-time or date-interval attached.

quantity      ::= ( number unit? )  
number        ::= ( “+” | “-”)? ( decimal | hexadecimal )
decimal       ::=  DIGIT+ ((“.” | ",") DIGIT+)? ("e" ( “+” | “-”)? DIGIT+)?
hexadecimal   ::= "0x" [0-9a-fA-F]+  (“.” [0-9a-fA-F]+ )?
unit          ::=  ("_"  si_unit) | degree | regional_unit  ("_" DATE_TIME_INTERVAL)?
si_unit       ::= metric_prefix? name (("/" | "·") name)* 
degree        ::= ( "°" | "&deg;" )  [A-Z]?
regional_unit ::= name "_" name
metric_prefix ::= "Y"|"Z"|"E"|"P"|"T"|"G"|"M"|"k"|"h"
                |"d"|"c"|"m"|"μ"|"n"|"p"|"f"|"a"|"z"| "y"

In general, one would expect quantities to be marked up as plain numbers, with the unit elided and given elsewhere, e.g. in a column header. Such column headers can represent the units using a "0" introducer: e.g. <head-col number=1 unit=0_Hz >

Degree "°" is angular degree (0-360) or rotational degree, in decimal. (Would minutes and seconds be useful instead?)

Hint: An implementation may use this as a picture, to allow fewer characters in markup: e.g. the header column can have <head-col number=1 unit=00000.00_Hz > and a data value of "440" would be padded to "00440.00", e.g. for comparison purposes. This is not required by RAN.
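The padding described in the hint could look roughly like the following Python sketch. The function name `pad_to_picture` is hypothetical, the picture is assumed to be the numeric part only (with the `_Hz` unit already stripped), and values are assumed to be unsigned decimals that fit the picture.

```python
def pad_to_picture(value: str, picture: str) -> str:
    """Pad a decimal string with zeros so it has the same shape as the picture."""
    int_picture, _, frac_picture = picture.partition(".")
    int_part, _, frac_part = value.partition(".")
    # Left-pad the integer part, right-pad the fraction part.
    padded = int_part.zfill(len(int_picture))
    if frac_picture:
        padded += "." + frac_part.ljust(len(frac_picture), "0")
    return padded
```

With the picture `00000.00`, the value `440` becomes `00440.00`, so padded values can be compared as plain strings.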

Note: Capitalization is significant. The prefix "da" for x 10 is not available, as it uses two letters.

Examples:

Simple numbers: 100, +1, -1, 1.000, -001.0
Exponent: 5.123e4
Hex numbers: 0xBEEF (uppercase)
Quantity with simple unit: 26_kg
Quantity with simple unit (derived): 18_kΩ
Quantity with simple compound unit: +83_kg·m/s
Non-SI quantity with regional qualifier: 24_pints_US
Negative infinity: -∞
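As an illustration of how a lexer might divide such tokens, here is a rough Python sketch that splits a quantity into sign, number and unit. It is not a full implementation of the grammar: the names `QUANTITY` and `split_quantity` are hypothetical, only underscore-introduced units are handled (degree units such as `°` are left out), and the unit text is returned unparsed.

```python
import re

# Simplified pattern: optional sign, then hexadecimal, decimal (with "." or ","
# and optional e-notation) or infinity, then an optional "_unit" tail.
QUANTITY = re.compile(
    r"""
    ^(?P<sign>[+-])?
    (?P<number>
        0x[0-9a-fA-F]+(?:\.[0-9a-fA-F]+)?
      | [0-9]+(?:[.,][0-9]+)?(?:e[+-]?[0-9]+)?
      | \u221e
    )
    (?:_(?P<unit>.+))?$
    """,
    re.VERBOSE,
)


def split_quantity(token: str):
    """Return (sign, number, unit) for a quantity token, or None if it does not match."""
    match = QUANTITY.match(token)
    if match is None:
        return None
    return (match.group("sign") or "", match.group("number"), match.group("unit"))
```

For example, `split_quantity("24_pints_US")` yields an empty sign, the number `24` and the unit text `pints_US`; splitting the regional qualifier from the unit name is left to a later step.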

"·" has a higher precedence than "/", as conventional. Compound units using "/" must start with "-" or "+" to prevent lexical clashes with paths. Complex primitive units, such as s² are not supported directly, nor any bracketing or digit: the units are not intended to provide mathematical markup. For compound quantities, make up your own unit, e.g.

An extended example is this:

<consumption of:=["beer" "alcohol"] 
             amount =[12.5_pints_UK  %1500-X ]
>12.5 Elizabethan pints</consumption>

In this case, we are defining that this current element has anchors of "beer" and "alcohol". It has an amount that is 12.5 of something called "pints" of somewhere called "UK" at a year of approximately 1500 (and any month or day).

Common Units of Metrology

Unit names are application-dependent and not built in to RAN, but the conventional SI quantities of g, m, s are preferred. The 7 base units: metre (m), kilogram (kg), second (s), ampere (A), kelvin (K), mole (mol), candela (cd).

The standard 22 derived units: radian (rad), steradian (sr), hertz (Hz), newton (N), pascal (Pa), joule (J), watt (W), coulomb (C), volt (V), farad (F), ohm (Ω Unicode U+2126), siemens (S), weber (Wb), tesla (T), henry (H), lumen (lm), lux (lx), becquerel (Bq), gray (Gy), sievert (Sv), katal (kat), degree Celsius (^oC using U+00B0).

Additionally:

Common derived metrical units: liter (l), metric ton (ton or tonne)
Temperature units can be represented using "^o" ( using U+00B0)

Celsius "^oC", Kelvin "^oK", Fahrenheit "^oF"

Regional measures can be used, but must have the

For time, consider using the more powerful date-time-interval datatype.

Money

Familiar: $10.00 or €10,00

Extended: Money can be an amount or a currency. Both start with a Unicode currency character; an amount continues with a [0-9]+ number. Compound currencies are supported, as are the ISO 4217 currency codes.

Some examples of amounts:

$10 - 10 dollars (country not specified)
$10.00_USD - 10 U.S. dollars (ISO4217 3-alpha currency code)
$10.0_US - 10 US dollars (ISO3166 2-alpha country code)
₵10_GHS - 10 Ghanaian cedi

money       ::= sign (amount | currency)
sign        ::= ( $   |   ¤   |   €   |   \p{Sc} )
amount      ::= digit+ letter* (("." | ",") digit+ letter)*  ("_" code)?
currency    ::= (commonname ("_" code )?) |  code
code        ::= iso4217  |  iso3166
iso4217     ::= letter{3}
iso3166     ::= letter{2}
commonname  ::= letter+ ("." letter+)*

A currency that does not have a unique symbol must use ¤ to indicate a currency is being specified, and the country code. The choice of whether to specify the currency (using the common name, ISO 3166 country code or ISO4217 currency code, or some mix) is up to the user. The RAN lexer/parser will merely divide the money into its parts.
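Dividing an amount into its parts might be sketched as follows in Python. This is an assumption-laden illustration, not the RAN lexer: the name `split_amount` is hypothetical, only amounts (not bare currencies) are handled, letters after each number are treated as optional so that plain `$10.00` is accepted, and no check is made that the code is a real ISO 4217 or ISO 3166 code.

```python
import re
import unicodedata

# One or more "digits + optional letters" groups joined by "." or ",",
# then an optional "_" and a 2- or 3-letter code.
# [^\W\d_] is used as a rough stand-in for "letter".
_AMOUNT = re.compile(
    r"^(?P<body>[0-9]+[^\W\d_]*(?:[.,][0-9]+[^\W\d_]*)*)"
    r"(?:_(?P<code>[A-Za-z]{2,3}))?$"
)
_PART = re.compile(r"([0-9]+)([^\W\d_]*)")


def split_amount(token: str):
    """Return the sign, (digits, letters) parts and code of a money amount, or None."""
    # The first character must be a Unicode currency symbol (category Sc).
    if not token or unicodedata.category(token[0]) != "Sc":
        return None
    match = _AMOUNT.match(token[1:])
    if match is None:
        return None
    return {
        "sign": token[0],
        "parts": _PART.findall(match.group("body")),
        "code": match.group("code"),
    }
```

A compound amount such as `£4.3s.8d` comes back as three parts, with `s` and `d` attached to the second and third numbers; interpreting those letters is left to the application.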

Further examples of amounts

¤100zł - 100 of the currency to be called "zł" i.e. "100 zł"
¤ 1000_IDR - 1000 Indonesian rupiah i.e. "1000 IDR"
¤1000Rp - 1000 of currency to be called "Rp" i.e. "1000 Rp"
¤1000Rp_IDR - 1000 Indonesian rupiah to be called "Rp" i.e. "1000 Rp"
¤10bucks_USD - 10 U.S. dollars (ISO4217 3-alpha currency code) with no currency sign specified, and to be called "bucks" i.e. "10 bucks"

Examples of compound currencies:

£4.3s.8d - 4 pounds, 3 shillings, and 8 pence (using "." instead of space)
¤100.10zł_PLN - 100.10 Polish zloties, i.e. "100.10 zł"
¤0zł.10gr_PLN - 10 Polish grosz , i.e. "0zł, 10 gr"

Examples of currencies:

¤Rp - The currency specified as "Rp"
¤ PLN - Polish złoty, the unambiguous currency
¤ zł_PLN - Polish złoty, the currency specified as "zł"
$dollars.cents_AUD - Australian currency, called dollars and cents

Example of using attribute tuples to group an amount, and a point in time (for exchange rate):

income =[ ¤ 1000000_IDR 2024-01-01 ]

Date Time Interval: ISO8601

Familiar: The simple ISO 8601 year-month-day date will be recognized, such as 2021-12-20. All dates must use the “-” delimited form ("Extended Format") in order to be recognized as dates (not URIs or numbers) and, at implementer discretion, be lexically checked and type-converted.

The usual time and timezone indications can be used:

Date: 2021-10-07T

Due to lexical typing,

Extended: more of the capabilities of ISO 8601 are supported: intervals, uncertainty, wildcards.

The standard for dates, ISO8601:2019 has several enhancements¹⁹ whose lexical form is also allowed²⁰. These include

Specifying intervals, etc, such as 1999/2021 meaning 1999 to 2021. An interval may be open-ended such as 2000/.. or unknown ending such as 2046/
Wildcarding dates, etc. Such as 202X-XX-01 meaning the first day of any month in the decade of the 2020s.
Indications that a date etc. is approximate, uncertain or both. Such as 1000-01-01~ meaning approximately the first day of that millennium, with some uncertainty. Or a birthday of %1818-?01-?15 signifies that the year is approximate but the month and day are uncertain.
Putting these together, we can specify a time range as X-XT12:01/X-XT13:01 meaning times from 12:01 to 13:01 on any (or every) day. (The X-X is needed to satisfy the detection requirement of having at least one “-”.)

As well the “open end-time interval” and “unknown end-time interval” of level 2 are allowed.²¹ As well the group and individual qualifications of Level 3 are allowed, to represent uncertainty, unspecified and approximate.²² Unless the standard precludes, the following patterns are possible²³:

modern-date ::= date-time-interval | range
range ::= date-time-interval “/” date-time-interval

date-time-interval ::= date ( “T” time (“Z” | ((“+” | “-”)? shift))?)?
date ::= [year][“-”][month]([“-”][day])? Month or day precision.²⁴

year ::= qual? [\dX]+ qual?
month ::= qual? [\dX]+ qual?
day ::= qual? [\dX]+ qual?

time ::= qual? [\dX]+ qual? ( “:” qual? [\dX]+ qual? )+
shift ::= qual? [\dX]+ qual? ( “:” qual? [\dX]+ qual? )?
qual ::= “?” | “%” | “~”
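The productions above translate fairly directly into a regular expression. The Python sketch below is one possible reading, offered as an illustration only: the names `MODERN_DATE` and `is_modern_date` are hypothetical, and the open-ended and unknown-ending intervals (such as a trailing "/..") are not covered because the productions do not spell them out.

```python
import re

QUAL = r"[?%~]?"                       # optional qualifier
NUM = rf"{QUAL}[0-9X]+{QUAL}"          # digits or X wildcards, optionally qualified
DATE = rf"{NUM}-{NUM}(?:-{NUM})?"      # year-month or year-month-day
TIME = rf"{NUM}(?::{NUM})+"            # at least hours:minutes
SHIFT = rf"{NUM}(?::{NUM})?"           # timezone shift
DTI = rf"{DATE}(?:T{TIME}(?:Z|[+-]?{SHIFT})?)?"

# A modern date is one date-time-interval, or a range of two joined by "/".
MODERN_DATE = re.compile(rf"^{DTI}(?:/{DTI})?$")


def is_modern_date(token: str) -> bool:
    """Return True if the token fits the modern-date productions as read here."""
    return MODERN_DATE.match(token) is not None
```

On this reading `202X-XX-01`, `%1818-?01-?15` and `X-XT12:01/X-XT13:01` are all accepted, while a bare year such as `2021` is not, since it contains no "-".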

Hex Number String

NUMSTR = "#" [0-9a-fA-F]+

A number string is intended for tuples of binary data in hexadecimal, in particular RGB and RGB24. Each length will be interpreted differently:

#AB = [ 0x0A 0x0B ]
#ABC = [ 0x0A 0x0B 0x0C]
#ABCD = [ 0xAB 0xCD ]
#ABCDEF = [ 0xAB 0xCD 0xEF] e.g RGB

#ABCDEFA = [ 0xAB 0xCD 0xEF 0xA0 ]

#ABCDEFABC = [ 0xABC 0xDEF 0xABC]

#ABCDEFABCD = [ 0xABC 0xDEF 0xABC 0xD00]
#ABCDEFABCDE = [ 0xABC 0xDEF 0xABC 0xDE0]

#ABCDEFABCDEF = [ 0xABCD 0xEFAB 0xCDEF]
#ABCDEFABCDEFABCDEF = [ 0xABCDEF 0xABCDEF 0xABCDEF]
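The length-dependent interpretation can be sketched as a lookup from digit count to chunk width, with zero padding on the right. This Python sketch is an assumption drawn only from the rows listed: the names `CHUNK_BY_LENGTH` and `split_numstr` are hypothetical, lengths that are not listed are rejected rather than guessed, and the last two rows are taken to have 12 and 18 digits, matching the values shown for them.

```python
import re

# Digit count -> digits per value, inferred from the listed examples only.
CHUNK_BY_LENGTH = {2: 1, 3: 1, 4: 2, 6: 2, 7: 2, 9: 3, 10: 3, 11: 3, 12: 4, 18: 6}


def split_numstr(token: str) -> list[int]:
    """Split a "#" hex number string into a list of integer values."""
    match = re.fullmatch(r"#([0-9a-fA-F]+)", token)
    if match is None:
        raise ValueError(f"not a hex number string: {token!r}")
    digits = match.group(1)
    chunk = CHUNK_BY_LENGTH.get(len(digits))
    if chunk is None:
        raise ValueError(f"no interpretation listed for {len(digits)} digits")
    # Pad on the right with zeros up to a whole number of chunks.
    padded = digits + "0" * (-len(digits) % chunk)
    return [int(padded[i:i + chunk], 16) for i in range(0, len(padded), chunk)]
```

For instance, `#ABCDEFA` has seven digits, so it is padded to eight and read as four two-digit values, the last being 0xA0.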

Locators

Anchor Path

An anchor path is like an XPath, except that you specify the fragment key plus any anchors of elements. For example:

"f1+e22+g44" means "the (first) element with anchor of 'g44' contained in the element with anchor 'e22' contained in the fragment with fragment key of 'f1'".
"*~person~city" means "all elements with anchor of 'city' in all elements with anchor of 'person' in any fragment". This is a 1:many link, with wildcarding.
"*~" is a wildcard meaning all fragments.

anchor-path   ::= "*~" |      
          ((number | name | "*" )
           (("+" | "~") (number | name ))*
           ("+" | "~") (number | name | "*")?)
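One way to picture an anchor path is as a fragment key followed by (separator, anchor) steps. The Python sketch below only splits the token; the name `parse_anchor_path` is hypothetical, and it does not check that each step is a valid number or name. The examples above suggest that "+" selects the first matching element and "~" selects all of them, but the sketch simply records which separator was used.

```python
import re


def parse_anchor_path(path: str):
    """Split an anchor path into [(None, fragment), (separator, anchor), ...]."""
    # Splitting with a capture group keeps the separators in the result.
    parts = re.split(r"([+~])", path)
    if len(parts) < 3 or parts[0] == "":
        raise ValueError(f"not an anchor path: {path!r}")
    steps = [(None, parts[0])]
    steps.extend(zip(parts[1::2], parts[2::2]))
    # Only the final step may be empty, as in the "*~" wildcard.
    if any(name == "" for _, name in steps[:-1]):
        raise ValueError(f"empty step in anchor path: {path!r}")
    return steps
```

For example, `parse_anchor_path("*~person~city")` gives the fragment wildcard `*` followed by two `~` steps, `person` and `city`.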

Relative Path

A relative path is a directory path starting from a current location provided by the application. For example:

../x/y

 relative-path ::= “.” "."? ((“/”) ( ".." |TEXT+))*

To specify an absolute file path, use e.g. a URI starting with "file:"

URI

A URI follows the W3C conventions.

uri    ::= LETTER{1..16} “/” TEXT+

External

This is the name of e.g. a system or environment variable or function, such as the filename, host or current date. The parser does not de-reference this automatically: access to each name must be specifically enabled. The value is a text string that has no direct “<” or “>” characters that could confuse some parser implementation. E.g., %HOST_NAME%
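The requirement that each name be specifically enabled can be pictured as an allowlist check before any lookup. The following Python sketch is an assumption about how an application might do this, not a RAN API: the name `resolve_external` and its parameters are hypothetical, and it resolves names only from environment variables (or a supplied mapping), not from functions.

```python
import os


def resolve_external(token: str, enabled: set, environ=None) -> str:
    """Resolve a %NAME% token, but only if NAME has been explicitly enabled."""
    if len(token) < 3 or not (token.startswith("%") and token.endswith("%")):
        raise ValueError(f"not an external token: {token!r}")
    name = token[1:-1]
    # Nothing is de-referenced unless the application opted in to this name.
    if name not in enabled:
        raise PermissionError(f"external name not enabled: {name}")
    source = os.environ if environ is None else environ
    return source[name]  # raises KeyError if the name has no value
```

A caller would pass the set of names it is willing to expose, for example `resolve_external("%HOST_NAME%", {"HOST_NAME"})`; any other name is refused even if it exists in the environment.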

Name

Any other token is treated as a name.

name ::= name-token (“:” name-token+ )?
name-token ::= [^\.\-][^\-:]*

A name should follow the constraints of XML names. For RAN, syntax checking should confirm that any character below U+00B0 is an alphabetic or numeric character or one of ":", ".", "-" or "_": i.e. no control characters or whitespace or other punctuation characters.

Tuples

RAN attributes may use tuples. Each token in the tuple is typed lexically as above. E.g. the first tuple below has a longitude in decimal degrees, latitude in decimal degrees, elevation in km, and a time. The second has degrees on the map, and also temperature degrees in Celsius.

<space-position cords=[ 134.5°  126° 18km 2024-12-01T10:20:10 ]>...
<temperature map=[ 134.5°  126° ] min-ave-max=[ 20°C 22°C  28°C] > ...

Footnotes

17 Implementation Note: This functionality must be implemented and exposed in some API. Given a reference to F1:X123, the value can be found in unparsed raw text by first scanning for the fragment (a linear scan of the text for “<<[^/]” until the fragment with @id of “F1” is found), then scanning start-tags until the next “>>” for the first attribute value of “X123”. This is a rough-and-ready text operation that can be performed on the raw text, to simulate ID/IDREF or keyed links, and relies on the target attribute value being unique among all attribute values in the fragment (not only ID values: it has nothing to do with ID types).

19 See http://www.loc.gov/standards/datetime/

20 Implementation Note: The implementer may provide parse-time lexical checking according to the rules above, or some subset, or may defer it, or may make it a user option, or may only check for the “-” character. The implementer decides which features the transducer reports (i.e. what is exposed in an API.) If some other form of ISO 8601 date etc. is required, it must be put in a quoted attribute literal and catered for as a string.

21 Implementation Note: An example of an open end-time interval is “1985-04-13/..”. An example of an unknown end-time interval is “1985/”. (Open or unknown start-times are not supported.)

22 Implementation Note: A number in the date-time-interval may have an X instead of any expected digit, which is a lexical wildcard.

23 Implementation Note: “%” before a year, month, etc., indicates that it is approximate. “?” indicates it is uncertain. “~” indicates it is both uncertain and approximate. “%” or “?” or “~” applies this to everything to the left.

24 A year by itself will be treated as a number, not a year. The main point of the datatypes is that they allow lexical checking: the lexical checking of a year is trivial. The secondary point is to reduce explicit type-conversion in clients.