Thursday, February 26, 2015

WatSON Data Types.

In my last post, I mentioned what I was calling the type marker:

<Type-marker> ::= <size-type> <data-type>

b7 b6 b5 b4 b3 b2 b1 b0

That post was dedicated to the highest 2 bits of the Type-Marker; the two bits that represent the size-type. This post is dedicated to the lower 6 bits: the data type. An MP4 atom uses 4 bytes to represent the data type (atom name). The size makes sense given the large, dynamic ecosystem that the specification is trying to support. Interestingly, the convention is not to describe the atom names as 4 byte integers. They are almost always referred to by their ascii representation. For example, the "Movie" atom is "\x6D\x6F\x6F\x76", which, if treated as a character string is "moov" (pronounced "Moo-V").

I like the idea of numeric identifiers having useful printed representations, so I copied that concept into the type markers for WatSON. For example, the single byte types of null, false, and true will be represented as the following:

<empty-false-type> ::= '0' ;; st == 00, dt == 110000
<empty-true-type> ::= '1' ;; st == 00, dt == 110001
<empty-null-type> ::= '?' ;; st == 00, dt == 111111

This manages to combined the size type and the data type into single character type marker that matches the convention for the type. This only holds true in the most common representation. Someone could create a short false type with an 8-bit length, similar to the following.

<short-false-type> ::= 'p' ;; st == 01, dt == 110000

The character 'p' doesn't represent false for me, but I see no reason to create rules preventing that ingredient. Using 2 bytes to store false is a waste of space. but it should not break parsing.

Going with this expectation about common sizing, I selected the following values for the lower six bits:

<simple-data-type> ::= 0x30 ;; False type
  | 0x31 ;; True type
  | 0x3F ;; Null type
  | 0x24 ;; Float type
  | 0x29 ;; 32-bit signed integer type
  | 0x2C ;; 64-bit signed integer type
  | 0x35 ;; 64-bit unsigned integer type
  | 0x22 ;; Bit-flags type
  | 0x33 ;; String type
  | 0x08 ;; Header type
  | 0x0C ;; Library type
  | 0x03 ;; Container type
  | 0x1A ;; Compressed container.
  | 0x0D ;; Map type
  | 0x02 ;; User defined binary type

I break them into two categories: empty and short. I’ll start by repeating the empty ingredient types from above:

<false-ingredient> ::= '0'
<true-ingredient> ::= '1'
<null-ingredient> ::= '?'

Then I have the simple short ingredients:

<double-ingredient> ::= ‘f’ ‘\x0A' <8-bytes-data>
<32-bit-int-ingredient> ::= ‘i’ ‘\x06' <4-bytes-data>
<64-bit-int-ingredient> ::= ‘l’ ‘\x0A’ <8-bytes-data>
<64-bit-uint-ingredient> ::= ‘u’ ‘\x0A’ <8-bytes-data>
<bit-flags-ingredient> ::= ‘b’ <8-bit-size> <data>*
<string-ingredient> ::= ’s’ <8-bit-size> <data>*

Last, I have the composite ingredients. I went with short sizes on these, mostly because of the lack of diversity above 127. My hand crafted WatSON documents are all less than 256 bytes, so the short sizing may be biased or flawed.

<header-ingredient> ::= ‘H’ <8-bit-size> <data>*
<library-ingredient> ::= ‘L’ <8-bit-size> <data>*
<container-ingredient> ::= ‘C’ <8-bit-size> <data>*
<compressed-ingredient> ::= ‘Z’ <8-bit-size> <data>*
<map-ingredient> ::= ‘M’ <8-bit-size> <data>*
<bytes-ingredient> ::= ‘B’ <8-bit-size> <data>*

I hope no one ever has to hand craft a file or see the characters, but I like that they make sense in smaller documents. For larger documents, you are probably going to need a tool or program to keep track of structure, so the letters on the composite ingredients are less important..

I think the next post will be focused on the composite ingredients. Specifically the containers.