Apatak - streaming parallel validation of arbitrary SAX-like streams

Apatak1 is an approach to allowing parallel validation of arbitrary segments of parsed marked-up documents, using fuzzy, pair-wise checking of consecutive tokens. Apatak is characterized by extreme simplicity of implementation and has features to make it suitable for high performance, multi-threaded validation, and validation of arbitrary portions of documents.2

It may be regarded as a lossy one-way compression of a schema into small fixed-size tables or arrays sized to fit into the L1 data caches of modern computers. 

It has been designed to complement the Raw Access Notation (RAN) approach; however, it could also be used on an XML SAX stream. Like RAN, it is a thought experiment.

This is an implementation technique, not a method of expressing a schema. It is particularly suited to XML/RAN documents validated against relational schemas or DTDs (and XSD or RELAX NG, perhaps converted to the closest DTD).

This document gives two versions of this approach and a design specific enough that it could be implemented:

  • the Combined Table approach is very efficient, regardless of name length
  • the XOR approach is better for medium-length and longer names, but uses more instructions.

In fact, both approaches could be used together as they test slightly different things.

The scenario might be one where a large RAN (Raw Access Notation) document has been divided into arbitrary segments, each segment is lexed to recognize the tags and tokens, and each of the tags is presented as a SAX-like stream for some kind of validation. The lexer provides complete tokens (e.g. a start-tag with all its attributes, an end-tag, etc.). The system developer has made tradeoffs about the level of false positives they will accept.

Because the start- and end-positions are arbitrary, the segments are likely not tag-balanced. Because the processes are parallel, there is no information about context before the current tag: in particular, no knowledge of the parent element or previous siblings. Therefore conventional implementations using grammars (DTDs, XML Schemas, RELAX NG) or XPath-based approaches (Schematron) cannot be used.

Coverage

  • An Apatak validator can express the same constraints as simple DTD content models and slightly more datatyping than DTDs. It can start or halt at any segment in the SAX-like stream.

  • Apatak uses a standard table design and would be trivial to implement3. This design is intended to allow the table-creation system to select small or large tables to reduce false positives, and the design may be friendly for CPU caching, indexing and branch-prediction.

    • The Apatak tables will conflate pairs that can appear in multiple contexts but have hash clashes.4

    • However, tables can be composed into chains, allowing smaller individual tables, rather than sparse large tables. This can be used, e.g., for more precise recognition of Han Ideographs for CJK documents.

  • The Apatak tables can be loaded by converting grammars to them, or by learning from a corpus of existing documents regarded as complete.

    • In the former case, there are a variety of cases where this tag-pair system captures the same constraints as an XML DTD content model: e.g. where any element only appears once in a single content model, even if that content model is used for multiple elements.

    • In the latter case, a problematic result indicates that a hitherto-unseen tag pair has been detected (though there may be false positives from hash clashes.)

    • Composition can be used to reduce false positives.

  • The system is not an automaton or grammar as such: there are no states5 or transitions or stacks. The result of checking one pair is not used to check any following pair.

  • The approach will become fairly useless when there are hundreds of similar elements that can be used in the same context, or where there is use of any kind of ANY.  (A different hash method would have to be used where it is internal parts of names that are significant: i.e. if a Schema had elements for YearlyOverheadAmount, YearlyWagesAmount, YearlyIncomeAmount and so on.)

Approach

The approach is feasible-tag-pair checking. We create a series of tables that act as an allowed-transition table where, for example, the rows and columns are element names and the entry value is a boolean.
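As a toy illustration of such an allowed-transition table (the element names, hash function, and table size here are invented for the sketch, not part of the design):

```python
# Minimal sketch of a feasible-tag-pair table. Rows and columns are derived
# from element names by a lossy hash; the entry is a boolean: is this pair
# of adjacent tags feasible?

TABLE_BITS = 5                      # 2^5 rows and columns (assumption)
MASK = (1 << TABLE_BITS) - 1

def name_hash(name: str) -> int:
    """Toy hash: last character of the name, masked to the table size."""
    return ord(name[-1]) & MASK

# table[row][col] -> feasibility; built from a schema or a sample corpus
table = [[False] * (1 << TABLE_BITS) for _ in range(1 << TABLE_BITS)]

def allow_pair(first: str, second: str) -> None:
    table[name_hash(first)][name_hash(second)] = True

def feasible(first: str, second: str) -> bool:
    return table[name_hash(first)][name_hash(second)]

allow_pair("chapter", "title")       # schema-derived or learned pair
print(feasible("chapter", "title"))  # True
print(feasible("title", "chapter"))  # False (no clash maps to the same cell)
```

Because the hash is lossy, distinct names can share a cell; a feasible result is therefore only "plausible", while a problematic result is definite.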

Combined Table Approach

We check the following:

  • For consecutive element start or end tags, create a hash from their names; use this hash to look up a nibble in a TagCheck table; use the nibble with a simple decider function to check if that pair is feasible (1) or problematic (0). The hash is not necessarily unique. The nibble is four bits: one bit for each case of the pair of tags being start- or end-tags; the decider function selects the appropriate bit.

  • For each attribute of a start-tag, create a hash from the tag and attribute names; use this hash to look up a nibble in an AttributeCheck table; use the nibble with a simple decider function to check if that pair is feasible (1) or problematic (0). The hash is not necessarily unique. The nibble is four bits: one bit giving the feasibility, and three bits allowing simple lexical typing (say, 1 bit allowing empty values, 1 bit for string or token, and 1 bit for tokens or strings starting with a number or -); the decider function checks the attribute value using the given criteria.

  • Optionally, for each attribute-value pair, create a hash from the attribute name and value; use this hash to look up a nibble (or more) in a ValueCheck table; use the nibble with a simple decider function to check if that pair is feasible (1) or problematic (0). The hash is not necessarily unique. The nibble (or more) allows further type checks, in conjunction with the result of the AttributeCheck.
    For RAN, only attribute values that are name tokens will be checked; attribute values that are double-quoted literals or which start with numbers will not be checked.
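The TagCheck step above might be sketched as follows. The hash, the dict-backed storage, and the bit layout are assumptions for illustration; the concrete design uses packed fixed-size tables.

```python
# Sketch of the TagCheck step. Each nibble holds one feasibility bit per
# start/end combination of the two adjacent tags:
#   bit 0: start-start, bit 1: start-end, bit 2: end-start, bit 3: end-end.

MASK = 31  # 5-bit per-name hashes (assumption)

def pair_index(name1: str, name2: str) -> int:
    """Toy pair hash: two 5-bit last-character hashes, concatenated."""
    return ((ord(name1[-1]) & MASK) << 5) | (ord(name2[-1]) & MASK)

tag_check = {}  # index -> nibble; a real table would be a flat packed array

def learn(name1, start1, name2, start2):
    bit = (0 if start1 else 2) | (0 if start2 else 1)
    idx = pair_index(name1, name2)
    tag_check[idx] = tag_check.get(idx, 0) | (1 << bit)

def decider(name1, start1, name2, start2) -> bool:
    """Feasible (True) or problematic (False); clashes give false positives."""
    bit = (0 if start1 else 2) | (0 if start2 else 1)
    nibble = tag_check.get(pair_index(name1, name2), 0)
    return bool(nibble >> bit & 1)

learn("p", True, "em", True)            # <p><em ... is feasible
print(decider("p", True, "em", True))   # True
print(decider("p", False, "em", True))  # False: </p><em was never learned
```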

We then allow composition of these. For example, if the TagCheck table result is feasible, we can then chain to another TagCheck table with different hashes to weed out more false positives. If a TagCheck result is problematic, the result is passed to some error handler for diagnosis and reporting.

As well as composition, Bloom filtering is allowed. A sequence of two sparse tables can be ORed to make a less sparse table. If the tables do not overlap, then the table size can be doubled (i.e. to the combined size of the original tables)6 or even further increased: the performance trade-off is optimal table size (whether an access pattern fits in with the CPU's caching) against extra instructions (whether the sequence of operations fits in with the CPU's instruction pipeline and branch prediction.)
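Chaining and Bloom-style merging might be sketched as follows, with toy hashes and 1-bit entries (all names and sizes here are assumptions):

```python
# Chaining: a pair must look feasible in every table of the chain, each table
# using a different hash, so a clash in one table is weeded out by the next.
# Bloom-style merging: two sparse tables are ORed into one table, and a pair
# is feasible only if both of its hash positions are set in the merged table.

SIZE = 256

def h1(a, b):
    return (ord(a[-1]) ^ ord(b[-1])) % SIZE   # symmetric: clashes on reversed pairs

def h2(a, b):
    return (ord(a[0]) * 31 + ord(b[0])) % SIZE

t1, t2 = [0] * SIZE, [0] * SIZE
t1[h1("chapter", "title")] = 1                # learn <chapter><title>
t2[h2("chapter", "title")] = 1

def feasible_chained(a, b):
    return bool(t1[h1(a, b)] and t2[h2(a, b)])

merged = [x | y for x, y in zip(t1, t2)]      # ORed, Bloom-filter style

def feasible_bloom(a, b):
    return bool(merged[h1(a, b)] and merged[h2(a, b)])

print(feasible_chained("chapter", "title"))   # True
print(feasible_chained("title", "chapter"))   # False: h2 catches the h1 clash
```

Note that h1 alone would pass the reversed pair (XOR is symmetric); the second table in the chain rejects it, illustrating how composition reduces false positives.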

Pair Tables

The “nibble” of each table can be characterized (initialized, accessed) by the following tuple of single-digit numbers:

[[row-bits, row-shift, char], [col-bits, col-shift, char], [field-select-bits, field-select-shift, char], bitset-bits7]

The access function for these provides (in effect)
get-bitset(u16 hash1, u16 hash2, u16 hash3, access-tuple) → bitset

For example if we have table [[5,0,-1],[5,0,-1],[2,0, 1], 4] then the get-bitset() function is called with

  • hash1 as the last character of the second name (-1),

  • hash2 as the last character of the first name (-1),

  • hash3 as the first character of the first name (1).

The get-bitset() function then generates the hash and lookup, based on the tuple:

  • Shifts hash1 right by 0 bits, masks the lowest 5 bits.

  • Shifts hash2 right by 0 bits, masks the lowest 5 bits.

  • Shifts hash3 right by 0 bits, masks the lowest 2 bits.

  • Concatenates these to form a single index, and multiplies it by the width of the bitset to get the bit address.

  • Uses this bit address to find the appropriate field's container (byte or words).

  • Shifts and masks off the appropriate number of bits for the nibble.
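The steps above might be sketched as follows, assuming the packed table is a byte array and the caller has already selected the characters (per the "char" fields of the tuple) that feed hash1..hash3:

```python
# A sketch of get-bitset() under stated assumptions; not the concrete layout.

def get_bitset(hash1, hash2, hash3, tup, table):
    (row_bits, row_shift, _), (col_bits, col_shift, _), \
        (sel_bits, sel_shift, _), bitset_bits = tup
    row = (hash1 >> row_shift) & ((1 << row_bits) - 1)
    col = (hash2 >> col_shift) & ((1 << col_bits) - 1)
    sel = (hash3 >> sel_shift) & ((1 << sel_bits) - 1)
    index = ((row << col_bits | col) << sel_bits) | sel   # concatenated index
    bit_addr = index * bitset_bits                        # scale by field width
    byte, offset = divmod(bit_addr, 8)
    # With 4-bit fields an entry never straddles a byte, so one read suffices.
    return (table[byte] >> offset) & ((1 << bitset_bits) - 1)

# Default tuple from the Design section: 2^(5+5+2) four-bit entries = 2k bytes.
tup = ([5, 0, -1], [5, 0, -1], [2, 0, 1], 4)
table = bytearray(2048)

# Plant nibble 0b1010 at (row=18, col=5, sel=3) and read it back:
index = ((18 << 5 | 5) << 2) | 3
table[index // 2] |= 0b1010 << (4 * (index % 2))
print(get_bitset(18, 5, 3, tup, table))   # 10
```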

Validating different kinds of pairs involves passing the appropriate UTF-16 characters into the get-bitset() function, then interpreting the resulting nibble (as a bitset, uchar, etc.)

Design

With this approach, and this table structure, here is a concrete design.

  • The tables are stored in an Apatak file, which is just a ZIP archive.

  • The default tuple is [[5,0,-1],[5,0,-1],[2,0,1], 4]8, i.e. 2^12 four-bit entries: 2^14 bits, 2^11 bytes = 2k each.

  • A control file allows declaration of other sizes for each file as needed, and shared tables for Bloom filtering, and specific diagnostic terminology for problematic pairs of each table9.

  • For the TagCheck table, the hash functions that provide the arguments for get-bitset() are: tag2-name[last()], tag1-name[last()], tag1-name[1], and the decider function treats the 4 bits of the returned nibble as belonging to the start-start, start-end, end-start, and end-end combinations.
    For example, the default TagCheck table would be a 2k file where each word is addressed by the hash1 and hash2 indexes.

  • For the AttributeCheck table, the parameter-selection function follows the same pattern: tag1-name[last()], attribute-name[last()], tag1-name[1], and the decider function treats the 4 bits of the returned nibble as a bitset of [allowed, may-be-empty, token|string, starts-with-number]: it returns plausible|problematic using the allowed bit, but it may pass validation of the others on to the ValueCheck table if present, or it may validate with them itself.

  • For the ValueCheck table, the hash function is attribute-name[1], attribute-name[last()], value[1], and the decider function takes the ValueCheck bitset (nibble) and the AttributeCheck bitset and performs some kind of checking. This allows 32 different datatypes, plus empty, plus list, plus the RAN distinction between tokens and strings10, to be represented; I expect the datatypes would be related to XSD etc. datatypes.

  • The tables are identified by their file names: e.g. the default TagCheck table would be TagCheck.tbl. The file TagCheck-1.tbl is a chained table used if the whole segment so far is plausible. The file TagCheck-0.tbl is a chained table used if any part of the segment is problematic. Similarly for TagCheck-1-1.tbl, TagCheck-1-0.tbl, TagCheck-0-1.tbl and so on, in a branching fashion.

When creating and loading the tables, they can be sized appropriately to reduce hash-clashes to some acceptable level.

The system of using shifts and masks is designed in particular to support mixed ASCII and non-ASCII names. The first and last characters are very often successful hash sources for XML names. However, other character positions in the name or value may be selected, either from the start or the end (negative number.)

XOR Approach

This approach inverts the previous one. It uses the XOR of corresponding characters in neighbouring tags. Six arrays of 256 bytes are used.

The characteristics of this method include:

  • Medium-length names are validated better than shorter names.
    • Checking two consecutive tags each with only a single ASCII character as their names applies only 3 boolean checks: may each character be used in the first position of any valid name, and may the XOR of the two characters appear in consecutive start or end tags (start-start, start-end, etc.)?
    • Checking two consecutive tags each with 8 or more characters as their names applies 24 boolean checks.
  • Validating medium-length names requires many more look-ups than the method above; however, the arrays used should certainly fit into L1 caches.
  • Han Ideographic names are well catered for.

In the following, attribute value validation is not given.

Pair Arrays and Design

There are six pair arrays, each containing HASH_SIZE (256) bytes. Each byte is a bitset: four bits correspond to properties of the first four non-null bytes (or ushorts, for UTF-16) of the name or name-pair being validated, and the last four correspond to the last four non-null bytes. If the name has four or fewer bytes, the overlap will not be represented in the end-bytes.

  • Name Check Array: indexed by an 8bit byte
  • Start-Start Array (i.e. parent, first-child), Start-End Array (i.e. start and end tag of same element), End-Start Array, End-End Array, Element-Attribute: indexed by the XOR of two bytes after shifting the second byte left by one position (to maximize ASCII tag name coverage.)

So, when a new tag is found, the validator first checks its name against the Name Check Array:

  • For each of the first 4 non-null* bytes, it uses the byte value as an index to the Name Check Array. It then checks that the corresponding bit is set.
    • What does this check? Roughly: is the first character allowed in the first position of any name in the schema? Is the second character allowed in the second position of any name in the schema? Ditto the third and fourth.
  • For each of the last non-overlapping 4 non-null* bytes, it uses the byte value as an index to the Name Check Array. It then checks that the corresponding bit is set.
    • What does this check? Roughly: is the last character allowed in the last position of any name in the schema? Is the second-last character allowed in the second-last position of any name in the schema? Ditto the third- and fourth-last.
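The Name Check Array lookup might be sketched as follows; the representation (ASCII names as bytes) is an assumption, and the starred non-null byte filtering described later is omitted here:

```python
# Sketch of the Name Check Array: one bitset byte per possible byte value.
# Bits 0-3: allowed in the first four positions of some schema name.
# Bits 4-7: allowed in the last (4), second-last (5), ... positions.

name_check = bytearray(256)

def positions(name: bytes):
    head = name[:4]
    tail = name[max(4, len(name) - 4):]   # skip bytes the head already covers
    for i, b in enumerate(head):
        yield b, i
    for i, b in enumerate(reversed(tail)):
        yield b, 4 + i                    # bit 4 = last, bit 5 = second-last...

def learn_name(name: bytes) -> None:
    for b, bit in positions(name):
        name_check[b] |= 1 << bit

def plausible_name(name: bytes) -> bool:
    return all(name_check[b] >> bit & 1 for b, bit in positions(name))

learn_name(b"chapter")
print(plausible_name(b"chapter"))   # True
print(plausible_name(b"xhapter"))   # False: 'x' never seen in first position
```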

Unless it is the first tag found, the validator next checks the names in the current and previous tag, using the Array that corresponds to whether the tags were start-tags or end-tags.

  • For each of the first four non-null* bytes in each name, shift-left the second byte by 1, and XOR them. Then use that byte as an index into the appropriate array. It then checks that the corresponding bit is set.
    • What does this check? Roughly: is the XOR pattern of the first characters of the name-pair allowed in the schema? Is the XOR pattern of the second characters of the name-pair allowed in the schema? Ditto the third and fourth.
  • For each of the last four non-null* bytes in each name, shift-left the second byte by 1, and XOR them. Then use that byte as an index into the appropriate array. It then checks that the corresponding bit is set.
    • What does this check? Roughly: is the XOR pattern of the last characters of the name-pair allowed in the schema? Is the XOR pattern of the second-last characters of the name-pair allowed in the schema? Ditto the third- and fourth-last.
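One pair-array check might be sketched as follows (assumptions: ASCII names as bytes, and only the first-four-bytes half shown; the last-four half is symmetric):

```python
# Sketch of one XOR pair array (here Start-Start): byte-indexed bitsets,
# one bit per character position 1-4 of the name-pair.

start_start = bytearray(256)

def xor_index(b1: int, b2: int) -> int:
    # XOR after shifting the second byte left by one position,
    # to maximize ASCII tag-name coverage.
    return (b1 ^ ((b2 << 1) & 0xFF)) & 0xFF

def learn_pair(name1: bytes, name2: bytes) -> None:
    for i, (b1, b2) in enumerate(zip(name1[:4], name2[:4])):
        start_start[xor_index(b1, b2)] |= 1 << i

def plausible_pair(name1: bytes, name2: bytes) -> bool:
    return all(start_start[xor_index(b1, b2)] >> i & 1
               for i, (b1, b2) in enumerate(zip(name1[:4], name2[:4])))

learn_pair(b"chapter", b"title")             # e.g. <chapter><title> in corpus
print(plausible_pair(b"chapter", b"title"))  # True
print(plausible_pair(b"title", b"chapter"))  # False: XOR patterns differ
```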

2-,  3- and 4- Byte Characters

For UTF-16, if a name has 10 ASCII characters, taking 20 bytes (0c0c0c0c0c0c0c0c0c0c, where c is an ASCII byte and 0 a null byte), we exclude from comparison any 0 byte, and so the first four and last four c bytes are checked: that is, 8 characters.

For UTF-16, if the name is a sequence of six Han Ideographs taking 12 bytes (HLHLHLHLHLHL, high-low), then we check the first four bytes and the last four bytes: i.e. the first two and last two characters.

(*) For UTF-8, there is an extra wrinkle, appropriate for Han Ideographs. The higher bytes of a three- or four-byte UTF-8 sequence do not contain much information compared to the lower bytes.

Consequently, for UTF-8, as well as null bytes, we exclude from comparison any bytes that have the top two bits set. So if our name is a sequence of six Han ideographic characters, as 18 bytes in UTF-8 (HMLHMLHMLHMLHMLHML, high-mid-low), we validate the first and last pairs of mid and low bytes.
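The two filtering rules above might be sketched as follows (an assumption-level illustration; UTF-8 bytes with the top two bits set are the lead bytes of multi-byte sequences, while continuation bytes, 10xxxxxx, are kept):

```python
# Extract the "significant" bytes of a name before indexing the arrays.

def significant_utf16(name: str) -> bytes:
    # UTF-16: drop null bytes (the zero halves of ASCII code units).
    return bytes(b for b in name.encode("utf-16-le") if b != 0)

def significant_utf8(name: str) -> bytes:
    # UTF-8: drop nulls and any byte with the top two bits set
    # (the high bytes of multi-byte sequences, which carry little information).
    return bytes(b for b in name.encode("utf-8")
                 if b != 0 and (b & 0xC0) != 0xC0)

print(significant_utf16("chapterOne"))   # b'chapterOne': null bytes removed
print(len(significant_utf8("日本")))      # 4: mid and low bytes of 2 ideographs
```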

Footnotes

1A PAir-wise TAg Knower

2Namely, that a document can be arbitrarily divided into parallel validation threads, the tables have good locality, the method is not branchy, SIMD methods can be used to tokenize, no objects need to be created, and it could be folded into a parser without difficulty.

3I estimate the entire code could be in one or two screens of code.

4Especially for attribute value checking: where the same attribute name is used for different types, the check may be degraded to some lowest common denominator.

5Or, vacuously, only 3: working, finished-plausible, finished-implausible.

6Where the hash functions in two same-sized tables differ only by a single shift of 1, combining them into a double-sized table is always equal to or better than the sequence of smaller separate tables.

7By default, a nibble: 4-bits seems enough for Apatak. However, this is just a parameter, and may change.

8I suspect 3 might be better: the table doubles, but it gives us 3 bits of matching of the first letter of the first name.

9For example, that the problem is non-ASCII text, or that the issue is capitalization.

10The system would presumably be this: a scalar datatype must be represented by a RAN name, a string or list must be represented by a RAN literal (i.e., double-quoted string). A scalar would be validatable in a literal by declaring it as a list of tokens; however the RAN lexer does not look for tokens in attribute values, it is a different layer.

Version:

  • 2021-09-21 Add XOR Approach
  • 2021-08-09f Add Bloom filter approach. Mention XOR possibility.
  • 2021-08-09e Add intro summary. Only validate attribute values that are name tokens.
  • 2021-08-07d Rename get-nibble() to get-bitset()