Friday, February 20, 2015

WatSON Size and Type Specification

To understand where I am going with the WatSON specification, it is useful to have some background on the MP4 file specification. Atomic Parsley provides a mostly easy-to-digest background on MP4 atoms.

The first 8 bytes of every atom give you enough context to either skip over the atom or dive deeper into it. I wanted to create a specification with the same kind of flexibility: one where the fundamental component of the file format is simple but easily extensible. I am also trying to keep the size of the format down, so I wanted a model that lets me represent types like "true" and "false" in a single byte.
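To make the "skip or dive" idea concrete, here is a minimal sketch of walking the top-level atoms of an MP4 buffer using only the 8-byte headers (a 4-byte big-endian size that includes the header, then a 4-byte type). The function names are made up for illustration, and the sketch deliberately stops at the special size values rather than handling them.

```python
import struct

def read_atom_header(buf, offset=0):
    """Return (size, atom_type) from the 8-byte atom header at offset."""
    # 4-byte big-endian size (whole atom, header included), then 4-byte type.
    size, atom_type = struct.unpack_from(">I4s", buf, offset)
    return size, atom_type

def top_level_atoms(buf):
    """List (type, offset, size) for each top-level atom without reading payloads."""
    atoms = []
    offset = 0
    while offset + 8 <= len(buf):
        size, atom_type = read_atom_header(buf, offset)
        if size < 8:
            # Sizes 0 and 1 are the special cases (runs to end of file,
            # 64-bit size follows); this sketch stops instead of handling them.
            break
        atoms.append((atom_type, offset, size))
        offset += size  # skip the whole atom using only its header
    return atoms
```

A parser that wants to dive into an atom would recurse on the bytes between `offset + 8` and `offset + size` instead of jumping past them.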

Note that the names used from here on are just placeholders. At this point I have been more interested in the format concept than in naming:

<Ingredient> ::= <Type-Marker> [<Size> <data>*]

Every ingredient starts with a Type-Marker. A Type-Marker is a single byte with two components. The 6 lowest bits determine the data-type, which is similar to the atom name in MP4 files: it tells the parser what to expect inside the data section.

The highest 2 bits of the Type-Marker represent the size-type, which describes how large the Size value will be. MP4 has no similar concept: its sizes are always 4 bytes long, and special size values are used to communicate non-standard sizes.

Bits b7 b6: Size Type
Bits b5 b4 b3 b2 b1 b0: Data Type
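The bit layout above can be sketched as a pair of helpers that split a Type-Marker byte into its two fields and put it back together. The function names are placeholders of my own choosing, not part of the draft.

```python
def split_type_marker(marker):
    """Split one Type-Marker byte into (size_type, data_type)."""
    size_type = (marker >> 6) & 0b11   # highest 2 bits (b7, b6)
    data_type = marker & 0b111111      # lowest 6 bits (b5..b0)
    return size_type, data_type

def make_type_marker(size_type, data_type):
    """Combine a 2-bit size type and a 6-bit data type into one byte."""
    if not 0 <= size_type <= 0b11:
        raise ValueError("size_type must fit in 2 bits")
    if not 0 <= data_type <= 0b111111:
        raise ValueError("data_type must fit in 6 bits")
    return (size_type << 6) | data_type
```

For example, `split_type_marker(ord("s"))` yields size type `0b01` and data type `0b110011`, since the letter `s` is the byte `0x73`.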

The size type is a way to reduce the overhead of smaller ingredients. Smaller types, like numbers, use 8-bit sizes, while larger types like long-strings and big-containers use a 64-bit size. Here is an example of how a string ingredient could use the different size types.

<empty-string> ::= '\x33' ; st bits == 00, dt bits == 110011
<short-string> ::= 's' <8-bit-size> <data>* ; st == 01, dt == 110011
<med-string> ::= '\xB3' <16-bit-size> <data>* ; st == 10, dt == 110011
<long-string> ::= '\xF3' <64-bit-size> <data>* ; st == 11, dt == 110011

The overhead for these four forms is 1, 2, 3, and 9 bytes respectively. An empty string is represented by a single byte, with no size data following. The other string forms have a required size component of varying length. String data in WatSON will not be null terminated.
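As a rough sketch of how the four string forms might be written and read back, the code below picks the smallest size type that fits the payload. Several details are my assumptions rather than anything the draft has settled: the size field is big-endian (as in MP4), it counts payload bytes only, and text is stored as UTF-8. The function names are hypothetical.

```python
STRING_TYPE = 0b110011                               # data-type bits from the grammar
SIZE_WIDTHS = {0b00: 0, 0b01: 1, 0b10: 2, 0b11: 8}   # size-type -> size bytes

def encode_string(text):
    """Encode text as a string ingredient using the smallest size type."""
    data = text.encode("utf-8")        # assumption: UTF-8 payload
    n = len(data)
    if n == 0:
        return bytes([STRING_TYPE])    # empty string: marker only, no size
    if n <= 0xFF:
        size_type = 0b01
    elif n <= 0xFFFF:
        size_type = 0b10
    else:
        size_type = 0b11
    marker = (size_type << 6) | STRING_TYPE
    # assumption: big-endian size field counting payload bytes only
    return bytes([marker]) + n.to_bytes(SIZE_WIDTHS[size_type], "big") + data

def decode_ingredient(buf, offset=0):
    """Return (data_type, payload, next_offset) for the ingredient at offset."""
    marker = buf[offset]
    size_type, data_type = marker >> 6, marker & 0b111111
    start = offset + 1 + SIZE_WIDTHS[size_type]
    size = int.from_bytes(buf[offset + 1:start], "big")  # empty slice -> 0
    return data_type, buf[start:start + size], start + size
```

Because the size is known before the payload is touched, a reader that does not care about a given data type can jump straight to `next_offset`, which is the same skip-or-dive property the MP4 atom header provides.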

For the most common cases, this uses less space than storing a string in BSON format, which has a fixed 6-byte overhead (1 for type, 4 for size, 1 for null terminator). For strings longer than 65,535 bytes (the 16-bit limit), WatSON has a larger 9-byte overhead, but it can also store strings significantly larger than 4 gigabytes.
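The comparison above comes down to simple arithmetic, sketched here. The BSON figure is the 6 bytes counted in the text and leaves out the element name; the function name is a placeholder.

```python
BSON_STRING_OVERHEAD = 6  # type byte + 4-byte size + null terminator, as counted above

def watson_string_overhead(payload_len):
    """Bytes of framing around a WatSON string payload of the given length."""
    if payload_len == 0:
        return 1            # marker only
    if payload_len <= 0xFF:
        return 1 + 1        # marker + 8-bit size
    if payload_len <= 0xFFFF:
        return 1 + 2        # marker + 16-bit size
    return 1 + 8            # marker + 64-bit size

for n in (0, 12, 255, 256, 65_535, 65_536):
    overhead = watson_string_overhead(n)
    saved = BSON_STRING_OVERHEAD - overhead
    print(f"{n:>6} bytes: WatSON {overhead}, BSON {BSON_STRING_OVERHEAD}, saved {saved}")
```

Only the last row comes out negative: past the 16-bit limit, WatSON pays 3 bytes more than BSON per string, which is negligible next to a payload of that size.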

I haven't decided how I want to flag compression requirements on 64-bit sizes. I was thinking of maybe having the 64-bit size be signed (negative being compressed), or maybe reserving the highest bits for special flags like encryption and compression. Another idea I am toying around with is a "compressed container", such that Ingredients themselves aren't compressed, but they exist in a container that is compressed.
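To compare the first two ideas side by side, here is a purely illustrative sketch of each. Nothing here is decided: the big-endian byte order, the choice of which bits carry which flag, and all of the names are assumptions made up for the example.

```python
import struct

# Option A: signed 64-bit size, where a negative value means "compressed".
def pack_signed_size(size, compressed):
    return struct.pack(">q", -size if compressed else size)

def unpack_signed_size(raw):
    (value,) = struct.unpack(">q", raw)
    return abs(value), value < 0

# Option B: reserve the two highest bits of an unsigned 64-bit size for flags.
FLAG_COMPRESSED = 1 << 63      # hypothetical bit assignment
FLAG_ENCRYPTED = 1 << 62       # hypothetical bit assignment
SIZE_MASK = FLAG_ENCRYPTED - 1  # remaining 62 bits hold the size

def pack_flagged_size(size, compressed=False, encrypted=False):
    if not 0 <= size <= SIZE_MASK:
        raise ValueError("size does not fit in 62 bits")
    value = size
    if compressed:
        value |= FLAG_COMPRESSED
    if encrypted:
        value |= FLAG_ENCRYPTED
    return struct.pack(">Q", value)

def unpack_flagged_size(raw):
    (value,) = struct.unpack(">Q", raw)
    return value & SIZE_MASK, bool(value & FLAG_COMPRESSED), bool(value & FLAG_ENCRYPTED)
```

One difference shows up immediately: the signed variant can only carry a single flag and cannot mark a zero-length payload as compressed, while the flag-bit variant gives up some size range in exchange for room to grow.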

All of these are draft ideas at this point, and I am looking for feedback.