Thursday, February 26, 2015

WatSON Data Types.

In my last post, I mentioned what I was calling the type marker:

<Type-marker> ::= <size-type> <data-type>

b7 b6 b5 b4 b3 b2 b1 b0

That post was dedicated to the highest 2 bits of the Type-Marker; the two bits that represent the size-type. This post is dedicated to the lower 6 bits: the data type. An MP4 atom uses 4 bytes to represent the data type (atom name). The size makes sense given the large, dynamic ecosystem that the specification is trying to support. Interestingly, the convention is not to describe the atom names as 4 byte integers. They are almost always referred to by their ascii representation. For example, the "Movie" atom is "\x6D\x6F\x6F\x76", which, if treated as a character string is "moov" (pronounced "Moo-V").

I like the idea of numeric identifiers having useful printed representations, so I copied that concept into the type markers for WatSON. For example, the single byte types of null, false, and true will be represented as the following:

<empty-false-type> ::= '0' ;; st == 00, dt == 110000
<empty-true-type> ::= '1' ;; st == 00, dt == 110001
<empty-null-type> ::= '?' ;; st == 00, dt == 111111

This manages to combined the size type and the data type into single character type marker that matches the convention for the type. This only holds true in the most common representation. Someone could create a short false type with an 8-bit length, similar to the following.

<short-false-type> ::= 'p' ;; st == 01, dt == 110000

The character 'p' doesn't represent false for me, but I see no reason to create rules preventing that ingredient. Using 2 bytes to store false is a waste of space. but it should not break parsing.

Going with this expectation about common sizing, I selected the following values for the lower six bits:

<simple-data-type> ::= 0x30 ;; False type
  | 0x31 ;; True type
  | 0x3F ;; Null type
  | 0x24 ;; Float type
  | 0x29 ;; 32-bit signed integer type
  | 0x2C ;; 64-bit signed integer type
  | 0x35 ;; 64-bit unsigned integer type
  | 0x22 ;; Bit-flags type
  | 0x33 ;; String type
  | 0x08 ;; Header type
  | 0x0C ;; Library type
  | 0x03 ;; Container type
  | 0x1A ;; Compressed container.
  | 0x0D ;; Map type
  | 0x02 ;; User defined binary type

I break them into two categories: empty and short. I’ll start by repeating the empty ingredient types from above:

<false-ingredient> ::= '0'
<true-ingredient> ::= '1'
<null-ingredient> ::= '?'

Then I have the simple short ingredients:

<double-ingredient> ::= ‘f’ ‘\x0A' <8-bytes-data>
<32-bit-int-ingredient> ::= ‘i’ ‘\x06' <4-bytes-data>
<64-bit-int-ingredient> ::= ‘l’ ‘\x0A’ <8-bytes-data>
<64-bit-uint-ingredient> ::= ‘u’ ‘\x0A’ <8-bytes-data>
<bit-flags-ingredient> ::= ‘b’ <8-bit-size> <data>*
<string-ingredient> ::= ’s’ <8-bit-size> <data>*

Last, I have the composite ingredients. I went with short sizes on these, mostly because of the lack of diversity above 127. My hand crafted WatSON documents are all less than 256 bytes, so the short sizing may be biased or flawed.

<header-ingredient> ::= ‘H’ <8-bit-size> <data>*
<library-ingredient> ::= ‘L’ <8-bit-size> <data>*
<container-ingredient> ::= ‘C’ <8-bit-size> <data>*
<compressed-ingredient> ::= ‘Z’ <8-bit-size> <data>*
<map-ingredient> ::= ‘M’ <8-bit-size> <data>*
<bytes-ingredient> ::= ‘B’ <8-bit-size> <data>*

I hope no one ever has to hand craft a file or see the characters, but I like that they make sense in smaller documents. For larger documents, you are probably going to need a tool or program to keep track of structure, so the letters on the composite ingredients are less important..

I think the next post will be focused on the composite ingredients. Specifically the containers.

Friday, February 20, 2015

WatSON Size and Type Specification

In trying to understand where I am going with the WatSON specification, it is useful to have some background on the MP4 file specification. Atomic Parsley provides a mostly easy to digest background on MP4 atoms.

In the first 8 bytes of every atom, you have enough context to either skip over the atom, or dive deeper into the atom. I wanted to create a specification that allowed the same type of flexibility. A specification where the fundamental component of the file format is simple, but easily extensible. I am trying to keep the size of the format down as well, so I wanted to come up with a model that lets me represent types like "true" and "false" in a single byte.

Note, the names used after here are just place holders. I have been more interested in the format concept than naming at this point:

<Ingredient> ::= <Type-Marker> [<Size> <data>*]

Every ingredient starts with a Type-Marker. Type Markers are a single byte with two components. The 6 lowest bits determine the data-type. This would be similar to the atom name in MP4 files. It basically tells the parser what to expect inside the data section.

The highest 2 bits of the Type-Marker represent the size-type. The size type describes how large the Size value will be. MP4 doesn't have a similar concept. Sizes are always 4 bytes long, and special sizes are used to communicate non-standard sizes.


The size type is basically a way to help reduce the overhead of smaller ingredients. Smaller types, like numbers, use 8-bit sizes, while larger types like long-strings and big-containers use a 64-size. Here is an example how a string ingredient could use the different size types.

<empty-string> ::= '\x33' ; st bits == 00, dt bits == 110011
<short-string> ::= 's' <8-bit-size> <data>* ; st == 01, dt == 110011
<med-string> ::= '\xB3' <16-bit-size> <data>* ; st == 10, dt == 110011
<long-string> ::= '\xF3' <64-bit-size> <data>* ; st == 11, dt == 110011

The overhead for storing different types is 1 byte, 2 bytes, 3 bytes and 9 bytes. An empty string is represented by a single byte, with no size data following. The other string types have a required size component of various lengths. String data in WatSON will not be null terminated.

For the most common cases, this uses less space than storing a string in bson format, which has a fixed 6 byte overhead (1 for type, 4 for size, 1 for null-terminator). For strings longer than 65k, it has a larger 9 byte overhead, but can also store strings significantly larger than 4 gigabytes.

I haven't decided how I want to flag compression requirements on 64-bit sizes. I was thinking of maybe having the 64-bit size be signed (negative being compressed), or maybe reserving the highest bits for special flags like encryption and compression. Another idea I am toying around with is a "compressed container", such that Ingredients themselves aren't compressed, but they exist in a container that is compressed.

All of this is draft ideas at this point, but I am looking for some feedback.

Thursday, February 19, 2015

Semper Fidelis

Semper Fidelis is a latin phrase that roughly translates to “always faithful”. It is also the Marine Corps motto. When I was 19 and going through bootcamp, I would always ask “faithful to what?” I always received “Corps and country” as the response. Like all mottos that survive for 100 years, it is necessarily vague and open for a lot of interpretation. 

Over the years, I have decided that “faithful” applies to a core set of ideals. Consistency, transparency, respect, and mentoring are the foundation of my “always faithful”. People classify leaders as irresponsible if they can easily identify a scenario where you have betrayed any of those four ideals. Arguably, this is the human nature behind the “shame-ification” of the internet: one thoughtless joke on twitter classifies a person as worthless in every aspect of life. Consistency is that important in our new “social” society.

Shakespeare wrote “to thine own self be true,” and this is the core of being faithful. I start by identifying what I believe and how I want to be perceived. I must understand the consequences I am selecting and own them in good times and bad. I put those decisions on display so that others can see and interpret them, especially the ownership and repercussions. Transparency in decisions and owning the consequences are aspects I want in leaders, and I must demonstrate the behaviors I want.

Being purely true to “thine own self” is problematic. Focusing internally creates an island of isolation. Islands don’t work well on teams, and they make for horrible leaders. Overcoming the island problem requires respecting that others are also trying to find and be true to themselves. Just as I have different experience at 35 than I did at 19, others are learning through their own experience. I nurture my respect for others through mentoring and trying to learn from their experience. I freely share what I have learned from my experiences. Helping others find their own definition of “always faithful”.

I don’t think faithfulness to organizations, projects, or people exists anymore. I view the “Corps and country” answer from boot camp to be naive. Other people, no matter what their role or their previous success are still learning. They all suffer from the same cognitive biases that plague all people and leaders. Ideals are less prone to human basis, and as demonstrated with a phrase like “Semper Fidelis”, they can evolve as the situation changes.

Find what you want to be remembered as, and always be faithful to those ideals.

Thursday, February 5, 2015

Something to replace BSON.

I have been working on my own implementation of the BSON Spec for about 3 years now. I have come to the conclusion that the BSON design seems haphazard and organic. The parsing is harder and more intensive than it should be. Scanning it is poorly designed. The specific types require too much knowledge to handle properly. They even continue making small changes to the specification, but not bumping the version number.

In order to solve my problems with it, I have decided to create my own specification and call it WatSON. Well, technically, I am calling it “왓슨”, but I don’t know how to type that on my English (US) keyboard.

Here are some of my design goals.

  • Keys are defined at the start of the document to eliminate repetitive keys.
  • Arrays don’t have keys at all.
  • Types and Objects are connected, rather than being separated by the key name.
  • Containers and strings have 64 bit size markers. Other types have 1 byte size markers. The format must have a simple rule for skipping elements that don’t match a known type.
  • Document format is little-endian.
  • String formats support Snappy compression.
  • Format supports header information to influence how the document is read.

A lot of my influence in the design is coming from dealing with atoms in the MP4 file format. 

For more detail on how things are coming together: