The esXML Data Format

Author: Stephen D. Williams, sdw@lig.net, swilliams@hpti.com

Date: 2006-01-15/2006-01-22, version 0.9

Version: W3C Efficient XML Interchange working group format submission

Notes

Introduction

Efficiency Structured XML (esXML) retains XML 1.0+ features while proposing a new structure in an open, proposed standard format. This structure is portable, optionally supports useful new semantics, and is efficient to the point of a processing-oriented mode that avoids overhead in parsing, consuming, traversing, modifying, and producing esXML and a compactness oriented mode. XML and esXML can be cross-converted with no loss of data and it is envisioned that XML libraries would support both. New optional semantics include highly efficient pointers, copy-on-write layering of arbitrary changes to a base document/object, and direct representation of binary content such as images or scalars. The proposed API for manipulation of an esXML object is called esDOM. esDOM is a simplified API to the esXML document/object, similar in concept to the C++ Standard Templates Library or other collection interfaces. esDOM is a library that manages extensible intermediation between an application and business object data. Implementing a type of zero-copy processing while supporting 4GL-like intermediation semantics in 3GL languages, esXML/esDOM is designed to radically improve common application development and accelerate processing.

esXML itself consists of two layers: the Storage Layer and the Representation Layer plus a MetaStructure Instance pseudo-layer.  The Storage Layer provides a virtual memory like service with certain useful features such as low level deltas and stable pointer maintenance.  The Representation Layer contains the linear bytes that represent structured data by structure, tokens, content, and metadata.  While the Representation Layer can be considered seperately, the full range of efficiency and features requires integration with the Storage Layer.  esXML instances can encode metadata that can be referenced to allow more compact encoding.   This metadata can be present in any esXML instance but it can also be referred to from a separate instance to provide a range of "externalized" encoding.

A fully self-describing data instance includes explicit structure, identities, content, and types.  XML provides explicit structure, identities, and content and, along with XML Schema or additional attributes or tags, a level of data typing.  Most existing data formats have some degree of "externalization" of one or more of these aspects of an instance.  Often the externalization is embodied in code or metadata but with most formats this externalization has a fixed degree.  In many of the most compact, and therefore most externalized, formats, there is no explicit structure or naming and content is expressed in minimal ways that only specific decoding code can understand.

esXML provides mechanisms that allow a broad range of externalization, from fully self-describing instances with self-contained subtrees to fully externalized compacted bit-wise representations of strings of raw data. With the ability to support automatic and dynamic encoding optimization, fully meta-data driven encoding and decoding, multiple simultaneous schemas and versions, and native data types, esXML provides the property of Generality.

Library Modes

In-Place Modification Mode (IPMM)

esXML is designed to be able to be manipulated in place for appending, replacement, inserts, and deletes for any data with minimal amortized overhead using both the Storage and Representation layers. This more complex library mode, which also supports efficient low level deltas and cheap stable pointers, is the "In-Place Modification" mode (IPMM).

Serialization Mode (SM)

A more traditional "Serialization" mode (SM) is also supported which converts events or a DOM to an esXML instance. The performance difference between these modes may be large for various scenarios. These modes represent distinct library API methods. When reading an esXML instance, random access through the use of indexes and rapid traversal based on structure are always available.

The MSI meta-data can be included in an instance and because every Element node may have a different encoding, hybrid format instances are easily possible. This can be important when a self-contained subtree is an instance of data that is exported and imported for transmission or other purposes or when multiple objects are combined into a larger object.

Processing Optimized Mode (POM)

The esXML Representation Layer token description, described above, is the default byte-oriented mode using the Storage Layer that is intended to be very fast to process but also reasonably compact. Except when specifically requested by an application, no analysis-based redundancy removal is performed (i.e. gzip/bzip2/etc. compression or search for optimal templates), automated optimization features are mostly or completely disabled, and when the in-place modification mode of the Storage Layer is used, holes created during processing are preserved until a compacting threshold is reached.

Space Optimized Mode (SOM)

Space Optimized Mode trades some processing overhead for better compactness. The base level of SOM has the same structure as the byte-oriented POM except that tokens and arguments are represented by entropy encoded (Huffman or Range) references to a special additional table, the EntropyTable, which can be included in the instance, externalized, or both. This means that not only can tokens be represented, but any sequence of bits can be referenced by a token. In particular, this could include a token, token+sint, or token+sint+value (full or partial). The resulting expansion is interpreted as in POM. The individual token assignments can be done simply on an as encountered basis on the fly or with statistical knowledge of the whole instance or a set of example instances. For instances with a small number of resulting tokens, even an on the fly method will produce great savings. Note that an MSI, described below, can indicate POM or SOM and other optional representation options for sections of data. The EntropyTable, like all tables, can be built incrementally as an encoder encounters and determines new entries.

When SOM is combined with the "data stream" mode of templates with a template that includes type table references, the result is similar to the encoding efficiency of ITU- T X.691 Packed Encoding Rules (PER).

Deltas - LLD, HLD, and Templates

The concept of a Delta is to represent in an efficient way the changes between an old instance and a new instance. A related idea is that of a "diff" (difference) which implies computationally intensive searching to create a delta. Use of deltas does not imply the work of diffing, although a particular implementation may use a computational method.

esXML supports two ways to represent Deltas. These are called "Low-Level Deltas" (LLD) and "High-Level Deltas" (HLD). An LLD is implemented at the Storage Layer by recording what bytes have changed between two instances, preferrably in in-place modification mode (IPMM) where it would be very cheap. The structure of parent and child storage layer instances is such that the effective byte sequence after addition of one or more deltas can be computed on the fly and in place without building an effective instance. This extends the zero-copy and low memory usage features of a storage layer enabled instance. An LLD could also be created computationally. An HLD is implemented at the Representation layer by template definition and reference with optional replacement of any component of the template. Template definition involves a template definition tag that encompasses any esXML subtree. Template usage uses a template reference tag along with optional insert, delete, replacement, or append based on serial numbering of each distinct field in the subtree. An HLD could be used both for on the fly optimization of streaming instances and as an MSI compactness strategy.

Note that both LLD and HLD can be used to transmit a stream of updates to a receiver. LLD allows cheap virtual construction of the result, while with HLD some type of bookkeeping update will likely be required. In both cases, the "parent" for LLD and the "template" (aka "parent") for HLD would be the "old" document specified by unique ID (see GUID/LUID).

Normally a reference to a template results in exactly the contents of the template plus any updates given in the template reference token. One important but subtle template mode is a situation where the basic structure is constant and useful data is streamed into place as updates. If this is implemented as an HLD, an update needs to indicate that the template is the parent plus any updates that have been received and applied. The canonical example for this scenario is cable/satellite program information where the overall data structure is fixed and sent periodically while program updates are non-conflicting child elements at some position. These child elements are also sent redundantly and must be expired out. The esXML template mechanism can be used to implement this except for automatic expiration (deletion), although periodic transmission of deletions for recently expired nodes would suffice.

Templates can be used in a way that is similar to memoizing or van Jacobson header compression, both well known powerful methods for reducing redundancy. The two main modes of a template are "use with listed replacements" and "data stream" replacement of all content fields. The latter can rely on type table references used in the template to avoid or minimize tags, lengths, lossyness, range, and cardinality. For example, when used in SOM, this implies bit-packed limited-range integers.

Compactness Analysis

Compactness involves efficiently representing structure, content, typing, and metadata both for self-describing instances and for those with various levels of externalization. In many cases compactness must be balanced with processing efficiency and other properties while in some cases compactness is a prime consideration.

esXML supports fast structure traversal, random access capabilities, explicit nested sizes, and other features that avoid parsing overhead. At the same time, there are compactness features of tokenization based on tables, types, runs of binary types such as floats or integers, and both low-level deltas (LLD) and high-level deltas (HLD). Additionally, there is both a byte oriented and a bit oriented encoding form of equivalent tokens. Last, any sequence of tokens, especially including table definitions, can be enclosed in a CompressionToken which can compress using a variety of methods. Note that this supports both "solid compression" where multiple items are compressed together or individual compression of tokens for local optimization.

The major frequency-based compression methods include dictionary (LZ77/LZ78, LZW, Deflate), entropy encoding (Huffman, Arithmetic, Range encoding), block sorting (bzip2), and lossy encoding. Dictionary encoding is supported by token tables. Dynamic dictionary encoding and a type of block sorting is supported by allowing multiple tables to be present and for tables to be restarted for any element. Entropy encoding is supported by the CompressionToken over any sequence of tokens, including tables. Huffman and Range encoding are supported at the raw token level for space optimized mode (SOM). Lossy encoding is supported by options in MSI which determine what information is decimated at encoding time.

These encoding options support extensive tuning of encoding strategies for particular application environments with widely usable features. By separating complex schema languages or other specification methods from the definition of the format and encoding options, the complexity is greatly decreased while increasing flexibility to support many different viewpoints with a small set of features.

Validation

esXML requires implementations to detect and properly indicate errors for any structural irregularities of an instance. This may however be performed on a lazy basis if an implementation is so configured and any corruption is not encountered during requested operations. For instance, a random access traversal that does not access corrupted bytes is not required to produce an error. Implementations must support a thorough validation method when requested by an application.

Structural integrity, sane tokens, and legal data values must all be validated and no data corruption may produce a buffer overrun or other compromise of the application environment. Schema and application validation are performed as layered operations and not as part of esXML beyond MSI processing.

The esXML Storage Layer has optional features to detect and optionally attempt to correct corruption at the chunk level through the use of redundancy check hash codes and forward error correction (FEC).

Conformance class

All esXML libraries must be able to read and interpret all structural features, encoding methods, and tokens. Space or resource confined implementations with restricted ranges of inputs can omit native type conversions and compression types that are not needed. All structural tokens must be interpreted and able to be skipped harmlessly along with extension tokens.

esXML libraries need to be able to encode and decode data regardless of the encoding methods needed which may be complex through the use of MetaStructure Instances (MSI). esXML allows any type of mechanism to be used to specify or compute the desired structure of encoded data which is represented by metadata in an MSI. esXML libraries use zero or more MSIs to encode and decode data. The classes of mechanisms used to create an MSI are not in the esXML basic conformance class but would be considered profiles or layered standards.

Layers

Storage Layer

The Storage Layer which implements the "elastic memory" structure and enables low level deltas and stable pointers, is a chunked sparse memory data structure. The storage layer presents a model of a linear array of bytes to the Representation Layer. This array of bytes can be modified by appending, overwrite, insert, and delete. Because chunks of memory are used to represent arbitrarily large data, data movement cost is minimized. Additionally, deletions cause holes within a block until further insertions or a condensation pass squeeze them out. Use of holes is only done for the in-place modification library mode. Memory space holes allow repeated updates of the same or nearby areas with very little cost.

The use of chunked memory may allow better buffer management in libraries and applications. The chunk size can be tuned as needed by applications.

Forward error correction can be supported within a block using Hamming-like codes and over a series of chunks by redundant exclusive-or'd redundant chunks.

One benefit of a Storage Layer enabled esXML instance is the ability to randomly access and directly modify an instance within the chunked buffer structure. This supports far better locality of reference at the CPU/cache level than traditional processing methods where each component is represented by an object, each data value or name by a string object referred to by those objects, and far too much object creation, linking, copying, and cleanup. This greatly improved locality of reference tends to make processing overhead less expensive and improves overall performance.

The structure of each chunk contains:

Representation Layer

The representation layer structure works with byte boundaries and lengths at the self-describing level. While content values may be encoded in a variety of ways, the structure is byte based outside of fully externalized elements.

Options

There are a number of options that can be defined for an instance or subtree. These options are encoded in the Option tag.

Options Mnemonic
Description
Encoding
e
POM or SOM encoding for remainder
XMLEncoding
X
XML Encoding used for entire document as text XML
Verbatim
v
Are whitespace and quoting preserved
Fragment
f
Is this instance a fragment or whole XML document.
Canonical
c
Canonical encoding is needed
Extensions
E
Definition of extension tokens as they differ from the default, allowing skip and indication of data loss.
SelfContained
S
Is this element self-contained?
MSI
M
MSI metadata or an MSI reference
Private
P
Is this node Private? This means that the tokens in this node to no contribute to a parent table in any way although they may refer to those tables, a kind of one-way, read-only self-containedness. This is needed so that a decoder can know that it is able to skip a subtree without parsing tokens.

New options and extension tokens must be defined by metadata that defines how to skip them for backward compatibility and whether the tokens are data, metadata, index, etc. For instance, any token that is structured differently than "token sint data" must have metadata defined in the Option section of all instances using those extension tokens. It should always be harmless (except with possible impact to performance) to ignore a token that provides indexing. Libraries would indicate data loss however for extension tokens that contain data.

Scalable Integers

All structural sizes are encoded with a "scalable integer" (a "sint") which can represent any integer value, although particular implementations may have a practical limit at 64 bits. Each byte contributes 7 bits of value, most significant byte first. The 8th bit is a "more bytes" flag. Generally there are only as many bytes as are needed, but in some cases the value may be padded to allow for changes without changing the sint size. This is particularly true of structural sints with the in-place modification mode of the Storage Layer.

GUIDs and LUIDs, Singletons and URIs

In certain cases, an instance needs to refer to an external object of some kind. This must be done unambiguously for a given context and it may be helpful to either support automation or rely on partially human readable values. An example where this is needed is for the parent document of a low level delta. In some cases the unique ID must be globally unique while in others a locally unique value between communicating partners is sufficient. For these alternate scenarios there are well known solutions. esXML allows any of these to be used when necessary. A globally unique ID (GUID) is normally created using the algorithm that was originally part of DCE and is now used extensibly by Microsoft and others. A locally unique ID can be anything, typically a serial number of some type managed locally. A standard GUID is a type of singleton, something that encodes enough temporal, location, and random information to nearly guarantee that no other system will generate the same GUID. A different type of GUID is based on a URI which, based on a unique domain name and path, is guaranteed not to interfere with another authority. Generating a serial number is needed for a URI approach.

Tokens

Tokens are defined in a way that optimizes the most common token usage yet allows for sufficient types of tokens and extensions. In particular, the common tokens of Element, Attribute, ValueBinary, ValueUTF8, ValueUTF16 can be combined with a small size indicator in a single byte. This means that tokens+values that are less than 14 bytes can be denoted in a single byte while content blocks of any size can be represented using the same token. In the following table, 'L' designates a short length indicator which is 0-13 for direct size, 14 for a sint definition table reference following the token, 15 for sint and value following the token.

The Continuation token allows additional data to be "added" to the end of any just ended token. This supports streaming, appending, and bounded buffer use while still avoiding excessive work searching for end tags. Because tokens are length-indicated and nested, just after the last data in the last nested element, all currently active elements will have finished at the same spot. The simplist example is that data can be added to the end of the content of a just finished element. The Continuation token has a "pop" level argument that allows data to be added to the end of any in a stack of just finished elements.

Byte
Mnemonic
Label Comment
1L
E
Element with nibble/sint-name, sint element size
2L
e
Element-tableRef with nibble/sint-tableRef for name, sint element size
3L
A
Attribute with nibble/sint-name followed by value token
4L
a
Attribute-tableRef with nibble/sint-tableRef for name, followed by value token
5L
I
Int - size - count MSB first ("network byte order", large endian)
Low nibblet: size: 0=8, 1=1, 2=2, 3=4,
high nibblet: count: 0=sint follows, 1=1, 2=2, 3=3
6L
U
unsigned Int - size - count MSB first ("network byte order", large endian)
Low nibblet: size: 0=8, 1=1, 2=2, 3=4,
high nibblet: count: 0=sint follows, 1=1, 2=2, 3=3
7L
u
unsigned Int - size - count LSB first (small endian)
Low nibblet: size: 0=8, 1=1, 2=2, 3=4,
high nibblet: count: 0=sint follows, 1=1, 2=2, 3=3
8L
8
ValueUTF8

nibble/sint is the count of bytes

9L
i
Int - size - count LSB first (small endian)
Low nibblet: size: 0=8, 1=1, 2=2, 3=4,
high nibblet: count: 0=sint follows, 1=1, 2=2, 3=3
AL
_
space Whitespace that's not part of a value. Values for: 1,4,8 spaces, 1,2,4 newlines, 1,2,4 tabs, newline 1,4 spaces, newline 1,2 tabs, sint spaces, sint newline, sint tabs
BL
S
Element - start/end micro-mode

high nibblet: 0=start/name, 1=start/tableRef, 2=end
low nibblet: nibblet/sint name/tableRef

CL
t
CustomType 0=sint value integer, 1=sint value negative integer, 2=sint date ms. from epoch,
3-14: type reference, 15=sint type reference, followed by sint size unless using metadata which indicates fixed size
DL
f
Float - size - count MSB first ("network byte order", large endian)
Low nibblet: size: 0=BCD (sint follows), 1=IEEEFloat, 2=IEEEDouble, 3=Number (sint follows),
high nibblet: count: 0=sint follows, 1=1, 2=2, 3=3
EL
v
ValueTableRef

Value table reference
nibble/sint table reference for "content" text, binary, etc.

FL
F
Float - size - count LSB first (small endian)
Low nibblet: size: 0=BCD (sint follows), 1=IEEEFloat, 2=IEEEDouble, 3=Number (sint follows),
high nibblet: count: 0=sint follows, 1=1, 2=2, 3=3
00
0
false False/0/No value, multiples are done with binary or int
01
1
true True/1/Yes value, multiples are done with binary or int
02
T
Table Followed by table type sint
03
H
Header (esXML header) Contains cookie("esXR" or "esXML"), version (may indicate max version of features used), options.
Can occur before any element to change options or verison and also support self-contained subtrees.
A header token is required whenever encoding is changed between POM and SOM *
04
U
use template sint template, sint insertions size, insertion data (sint, sint size, data)
This provides a high-level delta and structure+data reuse capability. The insertion data consists of indication of append, replace, insert, or delete, the token number (depth first), and optional character / byte offset.
05
P
ProcessingInstruction XML processing instruction *
06
R
EntityReference sint data reference
07
2
ValueUTF32 sint is the count of 32bit words
08
Z
CompressedValue (deflate/bzip2/etc.) sint type, sint size, data that encapsulates tokens and data
The contents of a CompressedValue can always be interpreted as if the CompressedValue enveloping was not present.
09
V
VirtualPointerRef sint indicating which pointer
0A
6
ValueUTF16 sint is the count of 16 bit words
0B
O
Options sint, option tokens
0C
C
Continuation

Allows additional data to be added to a previous token for streaming purposes. First sint is number of active tokens to pop, second sint is size of this chunk.
Note that some processing patterns will require that separate instances are generated and tied together via unique IDs as continuations of the same logical instance.

0D
d
Define template sint ID sint size
This defines a template to be used in this or other instances.
0E
E
Extended type Annotation-like extended type information
0F
X
Extended token Extended token given by next byte (* == may move to extended token in future versions)

Extended Tokens

Byte
Mnemonic
Label
Comment
01
8
ISO-8859-1
Alternate encoding
02
c
Comment
XML comment
03
B
ValueBinary
Translated to b64 on round-trip
04
F
External file reference
sint, filename
05
D
CDATA  
   
   

 

Binary Type Tokens

Even though there are tokens defined for binary types, including native types, these are only used when the application or an MSI calls for them. An application that requires only text-based XML values will store all values as text by default. The most commonly used binary type will be the opaque ValueBinary which equates to b64 encoding for XML roundtripping. In the simplest case of native binary usage, an application can store (insert, append, or set) content as a binary type (setInt32BE(..., 5), or setInt32(..., 5) on a big endian machine).

Date type handling

[Preliminary - goal is to unify date/time handling in simplest way.]

Information encoded is: byte: timezone - 5 bits, resolution - usec,msec,sec,day - 2 bits, negative of epoch - 1 bit, then sint for date value.

The most universally useful format for a date/time value is seconds, milleseconds, or microseconds from an epoch. The epoch base can be an issue for fully generalized date handling, solved either by an early epoch or other handling such as negative date/times.

For situations where month, day, year, and day of the week, and time without date need to be encoded, they are usually handled by different encoding. Since most milleseconds from an epoch values are large integers, little is lost by representing partial date/times as special initial ranges at the begining of the integer range and shifting actual date/time values. The scheme proposed here is: 0-6 day of the week, 7-18 month of the year, 19-49 = day of the month, and so on for times (3600 seconds in a day) and years (4000+ useful range).

Roundtripping

All structure, content, and metadata can be roundtripped with text XML. There are several levels of roundtripping that are supported by esXML libraries to allow stripping of typing and/or metadata and that affect the lossiness of details such as spacing and quoting.

Input Fidelity
Output Fidelity
Spacing
Spacing
Quoting
Quoting
esXML structural metadata
esXML Structural metadata
Indexes
Indexes
Typing
Typing

Tables

Tables are tokens and structure data that provide a compact dictionary or index of some sort. There are a variety of layouts for these tables to optimize size and lookup speed for the table entries. The two major categories are unordered tables such as a hash and ordered tables such as a tree or bisection list. In some circumstances there can be multiple tables of the same type and in most circumstances tables can be localized to any element of an XML instance. The latter feature provides not only localized optimization of table "compression" but also enables self-contained subtrees. A key example of the benefit of multiple tables would be having as many ValueTables as are needed to block sort content in a way optimal for compression. The difference between shallow, deep, and sparse index types is how much data is indexed. A shallow index indexes only immediate children. A deep index indexes the hiearchy of all descendents. A sparse index only includes certain items, possibly those that are known to be needed by applications while remaining compact by avoiding needless entries. (Note: Index types currently may include: list, one or more hash types, binary tree, btree, dictionary compressed Newtonian list. Additionally, the index may be element/attribute related or involve full text indexing of content.)

It is easliest to think of tables as prebuilt, fixed tables. esXML however allows tables to be built and maintained by a series of tokens throughout an instance. This allows not only extending a parent or externalized table, but also allows just-in-time definition of names, types, values, templates, or other table entries which can be reused after definition. The table types in the table below have two alternate forms: append to parent and reset.

Table Type Mnemonic Content
TagNameTable
e
sint size, sint type, data
NamespaceTable
n
sint size, sint type, data
EntityTable
y
sint size, sint type, data
ValueTable
v
sint size, sint type, data
Index - shallow
i
Followed by sint type
Index - deep
I
Followed by sint type
Index - sparse
s

Followed by sint type

TypeTable
t
sint size, sint type, data
TemplateTable
T
sint size, sint type, data
EntropyTable
E
sint size, sint type, data
Used to record token/name/length sequences for SOM encoded data

Table Compression

Just like any sequence of tokens in esXML, a table, or tables, can be enclosed in a CompressedValue token. While this will impact processing efficiency and memory usage, it may enhance compactness significantly.

Continuation Tokens

To understand how a Continuation token works, consider a set of nested elements where other token types are the terminals (attributes, "content" (i.e. text), comments, etc.).  If each element records nested length, then you can skip that element and all of it's children by looking for the next token at n+nestedLength bytes.  Each child element has the same property, as do terminal elements such as text nodes.  If you consider the end of the top-most element, there may be any number of elements and at most one terminal element that have just ended.  An immediately following Continuation token has an argument that indicates how many "active tokens to pop" and then a size bounded chunk of data.  Every token (zero or more elements and zero or one terminals) that ended on the last byte of the prior data is still "active" because they could legally have more data.  If the last content was a UTF8 text node or a binary (a la b64) node, for example, then popping 0 means the following text / binary data (respectively) is to be appended.  Popping 1 ends the previous content/value (or even an explicit name) and begins another.  In either case, after the initial Continuation contained token, additional tokens can be included as necessary.

This allows any tokenized structure to be split into any number of components and streamed coherently yet still maintains the ability to quickly traverse chunks of data and limits buffer usage if needed.
It may be pointed out that the receiver doesn't have a choice as to the use of continuation tokens on convenient boundaries, however:

esXML Structure

The esXML representation structure consists of a header token followed by sequences of tables, comments, processing instructions, file references, entity references, elements, attributes, and "content" (like body text, but optionally including typed binary or scalar data). Sequences of tokens can be defined as templates and used with updates. Stable pointer references can be given. Sequences of tokens can be enclosed in a CompressionNode. The effective XML structure must remain valid XML 1.x unless a subtree (or the entire object) is marked as a fragment. The primary requirements that this implies are enforcement of a single element root and placement of content.

Example showing nesting:

H cookie version options                
T Type sint data ...              
C sint data                  
P sint data                  
E sint name sint                
  I sint type sint data              
  A sint name valueToken              
  valueToken                    
 

E...

                   
e sint tokenRef sint                  
  ...                    

Namespace handling

Currently, namespaces are handled just as they are in XML with namespace declaration attributes and qnames. The element name is the qname interpreted according to the active namespaces.

MetaStructure Instances

A MetaStructure Instance (MSI) is a self-contained esXML instance that provides metadata about how to map an infoset to and from an externalized esXML instance. Externalization is the representation of structure, types, metadata, and data templates separate from a data instance. It is also possible for MSI information to be combined with data in an instance which works the same way without needing an external reference. esXML allows any type of mechanism to be used to specify or compute the desired structure of encoded data which is represented by metadata in an MSI. esXML libraries use zero or more MSIs to encode and decode data.

Examples of these mechanisms would include an XML Schema or ASN.1 specification and compiler or even a reflection-based Java serializer. Automated mechanisms include analysis of example instances that could produce an MSI that would provide efficient encoding.

A fully self-describing data instance includes explicit structure, identities, content, and types.  XML provides explicit structure, identities, and content and, along with XML Schema or additional attributes or tags, a level of data typing.  Most existing data formats have some degree of "externalization" of one or more of these aspects of an instance.  Often the externalization is embodied in code or metadata but with most formats this externalization has a fixed degree.  In many of the most compact, and therefore most externalized, formats, there is no explicit structure or naming and content is expressed in minimal ways that only specific decoding code can understand.

esXML provides mechanisms that allow a broad range of externalization, from fully self-describing instances with self-contained subtrees to fully externalized compacted bit-wise representations of strings of raw data.  With the ability to support automatic and dynamic encoding optimization, fully meta-data driven encoding and decoding, multiple simultaneous schemas and versions, and native data types, esXML provides the property of Generality.

Note that esXML and MSIs do not serve the same purpose as an Encoding Control Notation ([1]) in being able to specify legacy data formats or otherwise indicate any possible representation. MSIs indicate use of a restricted set of broadly applicable mechanisms that are intended to support most generally applicable best practices in data representation. The goal of the MSI mechanism in esXML is to provide a very simplified yet powerful set of mechanisms that can be implemented in a reasonable manner. Complexity such as an interface definition language (XML Schema et al, ASN.1, etc.) or specific language mapping and bindings is defined as external to the esXML specification since none of these topics will have universal utility. XML Schema or Relax NG mapping will be the canonical example IDL profile.

Levels of Externalization

Level     Description

Self-describing /
Self-contained

    Byte aligned token-length-value (TLV) nested structure.
External tables     External element / attribute names, namespaces, entities, and content tables are referred to by the instance.
Structural templates     An external representation of a subtree is used so that instances can use low-level or high-level deltas or otherwise refer to portions as shorthand.
Range and lossy     Values are stored in compact space based on their restricted range or lossiness specified.
Fully externalized     Completely dependent on MSI for definition of data, bit-packed range limited binary values and huffman encoded values, along with other MSI options.

MSI Options

Feature Mnemonic Description
POM/SOM
P
Is POM (byte-aligned tokens) or SOM (entropy encoded, bit-packed) mode used? Where are self-contained subtrees required?
Recorded in a normal Option token.
Tables
T
Table instances to be used for Element / Attribute / Namespace names, values, entities, files, etc.
Templates
D
both HLD (sequence of tokens, often a subtree, that can include names, data, and metadata) and rearrangement (a special form of template with forward and backward mapping between an original and encoded form based on name and/or positional matching).
One use of rearrangement is to group fields of the same type or statistically similar values together for better encoding performance.
Types
t
Information that a schema or IDL provides.
Types, ranges, desired lossyness, representation, cardinality, fixed fields, validation metadata: stored in a type table
Matched based on name and position. Encoded with CustomType token for both POM and SOM.
Compression
C
Compression token usage: indicates by name or position or token type which tokens are compressed and how.
SchemaExtension
S
Schema extension handling - flexible, fail on unknown, use Any meta-type
Structure
s
Streaming, depth/breadth, resliency, memoizing rules, and other structural influences.
Top level could be retransmitted as a keyframe with subtrees as template insertions.
Indexes
I
Indicates where indexes occur and using what format options.
CustomCodec
c
Signifies a layered custom codec. Data is handled opaquely but handed to the codec for interpretation. Examples include managing lossyness, specialty indexing (GIS, multidimensional proximity), and custom compression engines.

Metrics Report Instance

The esXML format supports selection of encoding methods and options that can provide optimizations for a range of data instances and usage patterns. One benefit of this combination of design choices is the ability for an implementation, either via an MSI producing tool or within application-level esXML libraries, to optimize data format options, possibly dynamically. In many situations, the MSI producing tool may not have access to actual instances or processing patterns from all points of a processing lifecycle. To provide the ability to gather this information from multiple sources for use in making encoding decisions, the Metrics Report Instance is a report of encoding, decoding, or access pattern performance. These reports are in a format that is consumable by any advanced implementation that automatically optimizes encoding choices.

esXML Evaluation

This table evaluates the esXML format with respect to the XML Binary Characterization Properties identified by the XML Binary Characterization Working Group.

Property esXML esXML - no Storage Layer Notes
MUST support
Directly Readable and Writable Yes Yes Multiple modes, all directly readable / writable.
Transport Independence Yes Yes Storage Layer adds built-in framing for extra capability.
Compactness Yes Yes Multiple strategies and methods for compact encoding and compression.
Human Language Neutral Yes Yes Supports standard character set encodings.
Platform Neutrality Yes Yes Structure is neutral, optional native type instances are "reader makes right" with defined set.
Integratable into XML Stack Yes Yes Can be accessed using SAX, DOM, or esDOM at least.
Royalty Free Yes Yes Format is free, open source example implementations.
Fragmentable Yes Yes Instances can be fragments; Storage Layer supports full low level deltas.
Streamable Yes Yes Depth first, tables up front, high and low level deltas, and especially Continuations.
Roundtrip Support Yes Yes Supports tunable levels of fidelity. Represents all content in XML.
Generality Yes Yes 20 or near 20, more finalization and testing pending.
Schema Extensions and Deviations Yes Yes Self describing mode is always available.
Format Version Identifier Yes Yes Standard.
Content Type Management Yes Yes Standard method suggested.
Self Contained Yes Yes Self-contained, self-describing mode with optional self-contained subtrees.
MUST NOT Prevent
Processing Efficiency Does Not Prevent Does Not Prevent Supports incremental access / creation, random access, avoids copies and conversions.
Small Footprint Does Not Prevent Does Not Prevent Believed to be true, tested soon.
Widespread Adoption Does Not Prevent Does Not Prevent Believed to become true.
Space Efficiency Does Not Prevent Does Not Prevent True by design inspection, test data soon.
Implementation Cost Does Not Prevent Does Not Prevent Verified by implementation
Forward Compatibility Does Not Prevent Does Not Prevent No obstacles.
MAY support
Accelerated Sequential Access Yes Yes Nested TLV, indexes.
Deltas Yes Yes Both low-level deltas and high-level deltas using templates.
Efficient Update No Yes The Storage Layer allows efficient in-place updates which are possible but not optimal without.
Embedding Support Yes Yes Direct support for binary blobs and native types.
Encryptable Yes Yes Direct encryption of self-contained subtrees and native and XML canonicalization.
Explicit Typing Yes Yes Typed data supported optionally.
Extension Points Yes Yes New tokens can be defined.
Human Readable and Editable No No Structured, binary data not editable without software, except when round-tripped.
Localized Changes Yes Yes Degree depends on mode and options.
No Arbitrary Limits Yes Yes All structure sizes are variably sized.
Random Access Yes Yes The Storage Layer adds stable pointer maintenance.
Robustness No Yes The Storage Layer adds the ability to detect and correct errors on a block basis.
Schema Instance Change Resilience Yes Yes Instances can be self-describing and schemas are represented as dynamic metadata.  
Signable Yes Yes Direct encryption of self-contained subtrees and native and XML canonicalization.
Specialized Codecs Yes Yes Typed data allows for specialized codecs as plugins or layers.
Support for Error Correction No Yes The Storage Layer adds the ability to detect and correct errors on a block basis.
Single Conformance Class Yes Yes All libraries should be able to read all versions.

Under consideration

References

(References listed here are not complete in this version.)

[1] http://asn1.elibel.tm.fr/en/biblio/article-for-ComNet.htm

[2] http://www.w3.org/XML/Binary/

[3] http://esxml.org

[4] Differential Serialization for Optimized SOAP Performance, http://grid.cs.binghamton.edu/projects/publications/bSOAP1-HPDC13/bSOAP1-HPDC13.pdf