Tuesday, March 3, 2015

WatSON Composite Ingredients

Simple ingredients represent a single value, like a number or a string. My last post covered the basic details for simple ingredients. Composite ingredients are designed to contain other ingredients or multiple values. Some are used to change how things are written to the file, like the Compressed Ingredient. Others are designed to provide structure to the file like the Container and Map Ingredients.

For simplicity, I am listing only the 8-bit sizes for the Ingredients, but the structure is valid for any size type.

Byte Reduction Ingredients


The library and compressed ingredients are designed to reduce the number of bytes required by data stored in WatSON format.

Library ingredients contain strings that are used elsewhere in the document. This primarily applies to keys for the map ingredient, but also applies to bytes ingredients. Libraries use a zero based index (the first element is index zero), but the string at index zero is always an empty string. The empty string is because the index zero is reserved.. See the map and byte ingredients for more details.

Library scope is going to be mentioned in the description for byte and map ingredients. Right now I am defining that as the nearest library in a parent container, although scope should get its own dedicated post in the future.

<library-ingredient> ::= ‘L’ <8-bit-size> <empty-string-ingredient> <string-ingredient>*

Compressed ingredients contain a single ingredient that has been compressed. I toyed around with the idea of making them a container as well, but I felt the single ingredient child would make implementations more straightforward at the cost of a few extra bytes.

I am thinking of using Snappy for the compression method. I haven't decided how flexible that will be in the future.

<compressed-ingredient> ::= ‘Z’ <8-bit-size> <data>*

Structure Ingredients


Container and map ingredients are designed for nesting and providing structure a WatSON document.

A container contains other ingredients. It is used for nesting and grouping ingredients. Think of it as a vector or list. Nothing inside a WatSON document references positions in a container. Order will probably be important. especially with regard to library and header ingredients. 

<container-ingredient> ::= ‘C’ <8-bit-size> <ingredient>*

A map ingredient is a key value structure. The keys are 32 bit unsigned integers. Positive keys reference the in scope library. A key of zero is reserved for an empty string key. I think I will be using that key as optional metadata, but I haven’t thought through what and how that metadata will be used. That will probably get a dedicated post in the future.

<map-ingredient> ::= ‘M’ <8-bit-size> <map-data>*
<map-data> ::= <uint32> <ingredient>

Extension Ingredients


Binary and header ingredients change how a parser should interpret WatSON data that follows.

Binary ingredients store opaque binary data. A positive marshal hint is a reference into the in scope library. A marshal hint of zero is reserved for an undefined marshal hint. I am thinking of the marshal hint as a place to store the run-time type. WatSON doesn’t specify anything about the data.

<bytes-ingredient> ::= ‘B’ <8-bit-size> <marshall-hint> <data>*

Headers are string based maps that contain information about the file contents. They are optional ingredients, and used to document requirements for parsing the rest of the file. An example would be the character encoding for strings, although I am leaning towards utf-8 being mandatory. They can also be used to document schema information or metadata like the program that generated the file, etc.

<header-ingredient> ::= ‘H’ <size> <header-data>*
<header-data> ::= <c-string> <ingredient>

I like how the format is coming together. I have my incomplete reference implementation foundation, as rough as it is, checked in on github.

Thursday, February 26, 2015

WatSON Data Types.

In my last post, I mentioned what I was calling the type marker:

<Type-marker> ::= <size-type> <data-type>

Size
Type
Data
Type
b7 b6 b5 b4 b3 b2 b1 b0

That post was dedicated to the highest 2 bits of the Type-Marker; the two bits that represent the size-type. This post is dedicated to the lower 6 bits: the data type. An MP4 atom uses 4 bytes to represent the data type (atom name). The size makes sense given the large, dynamic ecosystem that the specification is trying to support. Interestingly, the convention is not to describe the atom names as 4 byte integers. They are almost always referred to by their ascii representation. For example, the "Movie" atom is "\x6D\x6F\x6F\x76", which, if treated as a character string is "moov" (pronounced "Moo-V").

I like the idea of numeric identifiers having useful printed representations, so I copied that concept into the type markers for WatSON. For example, the single byte types of null, false, and true will be represented as the following:

<empty-false-type> ::= '0' ;; st == 00, dt == 110000
<empty-true-type> ::= '1' ;; st == 00, dt == 110001
<empty-null-type> ::= '?' ;; st == 00, dt == 111111

This manages to combined the size type and the data type into single character type marker that matches the convention for the type. This only holds true in the most common representation. Someone could create a short false type with an 8-bit length, similar to the following.

<short-false-type> ::= 'p' ;; st == 01, dt == 110000

The character 'p' doesn't represent false for me, but I see no reason to create rules preventing that ingredient. Using 2 bytes to store false is a waste of space. but it should not break parsing.

Going with this expectation about common sizing, I selected the following values for the lower six bits:

<simple-data-type> ::= 0x30 ;; False type
  | 0x31 ;; True type
  | 0x3F ;; Null type
  | 0x24 ;; Float type
  | 0x29 ;; 32-bit signed integer type
  | 0x2C ;; 64-bit signed integer type
  | 0x35 ;; 64-bit unsigned integer type
  | 0x22 ;; Bit-flags type
  | 0x33 ;; String type
  | 0x08 ;; Header type
  | 0x0C ;; Library type
  | 0x03 ;; Container type
  | 0x1A ;; Compressed container.
  | 0x0D ;; Map type
  | 0x02 ;; User defined binary type

I break them into two categories: empty and short. I’ll start by repeating the empty ingredient types from above:

<false-ingredient> ::= '0'
<true-ingredient> ::= '1'
<null-ingredient> ::= '?'

Then I have the simple short ingredients:

<double-ingredient> ::= ‘f’ ‘\x0A' <8-bytes-data>
<32-bit-int-ingredient> ::= ‘i’ ‘\x06' <4-bytes-data>
<64-bit-int-ingredient> ::= ‘l’ ‘\x0A’ <8-bytes-data>
<64-bit-uint-ingredient> ::= ‘u’ ‘\x0A’ <8-bytes-data>
<bit-flags-ingredient> ::= ‘b’ <8-bit-size> <data>*
<string-ingredient> ::= ’s’ <8-bit-size> <data>*

Last, I have the composite ingredients. I went with short sizes on these, mostly because of the lack of diversity above 127. My hand crafted WatSON documents are all less than 256 bytes, so the short sizing may be biased or flawed.

<header-ingredient> ::= ‘H’ <8-bit-size> <data>*
<library-ingredient> ::= ‘L’ <8-bit-size> <data>*
<container-ingredient> ::= ‘C’ <8-bit-size> <data>*
<compressed-ingredient> ::= ‘Z’ <8-bit-size> <data>*
<map-ingredient> ::= ‘M’ <8-bit-size> <data>*
<bytes-ingredient> ::= ‘B’ <8-bit-size> <data>*

I hope no one ever has to hand craft a file or see the characters, but I like that they make sense in smaller documents. For larger documents, you are probably going to need a tool or program to keep track of structure, so the letters on the composite ingredients are less important..

I think the next post will be focused on the composite ingredients. Specifically the containers.

Friday, February 20, 2015

WatSON Size and Type Specification

In trying to understand where I am going with the WatSON specification, it is useful to have some background on the MP4 file specification. Atomic Parsley provides a mostly easy to digest background on MP4 atoms.

In the first 8 bytes of every atom, you have enough context to either skip over the atom, or dive deeper into the atom. I wanted to create a specification that allowed the same type of flexibility. A specification where the fundamental component of the file format is simple, but easily extensible. I am trying to keep the size of the format down as well, so I wanted to come up with a model that lets me represent types like "true" and "false" in a single byte.

Note, the names used after here are just place holders. I have been more interested in the format concept than naming at this point:

<Ingredient> ::= <Type-Marker> [<Size> <data>*]

Every ingredient starts with a Type-Marker. Type Markers are a single byte with two components. The 6 lowest bits determine the data-type. This would be similar to the atom name in MP4 files. It basically tells the parser what to expect inside the data section.

The highest 2 bits of the Type-Marker represent the size-type. The size type describes how large the Size value will be. MP4 doesn't have a similar concept. Sizes are always 4 bytes long, and special sizes are used to communicate non-standard sizes.

Size
Type
Data
Type
b7b6b5b4b3b2b1b0

The size type is basically a way to help reduce the overhead of smaller ingredients. Smaller types, like numbers, use 8-bit sizes, while larger types like long-strings and big-containers use a 64-size. Here is an example how a string ingredient could use the different size types.

<empty-string> ::= '\x33' ; st bits == 00, dt bits == 110011
<short-string> ::= 's' <8-bit-size> <data>* ; st == 01, dt == 110011
<med-string> ::= '\xB3' <16-bit-size> <data>* ; st == 10, dt == 110011
<long-string> ::= '\xF3' <64-bit-size> <data>* ; st == 11, dt == 110011

The overhead for storing different types is 1 byte, 2 bytes, 3 bytes and 9 bytes. An empty string is represented by a single byte, with no size data following. The other string types have a required size component of various lengths. String data in WatSON will not be null terminated.

For the most common cases, this uses less space than storing a string in bson format, which has a fixed 6 byte overhead (1 for type, 4 for size, 1 for null-terminator). For strings longer than 65k, it has a larger 9 byte overhead, but can also store strings significantly larger than 4 gigabytes.

I haven't decided how I want to flag compression requirements on 64-bit sizes. I was thinking of maybe having the 64-bit size be signed (negative being compressed), or maybe reserving the highest bits for special flags like encryption and compression. Another idea I am toying around with is a "compressed container", such that Ingredients themselves aren't compressed, but they exist in a container that is compressed.

All of this is draft ideas at this point, but I am looking for some feedback.

Thursday, February 19, 2015

Semper Fidelis

Semper Fidelis is a latin phrase that roughly translates to “always faithful”. It is also the Marine Corps motto. When I was 19 and going through bootcamp, I would always ask “faithful to what?” I always received “Corps and country” as the response. Like all mottos that survive for 100 years, it is necessarily vague and open for a lot of interpretation. 

Over the years, I have decided that “faithful” applies to a core set of ideals. Consistency, transparency, respect, and mentoring are the foundation of my “always faithful”. People classify leaders as irresponsible if they can easily identify a scenario where you have betrayed any of those four ideals. Arguably, this is the human nature behind the “shame-ification” of the internet: one thoughtless joke on twitter classifies a person as worthless in every aspect of life. Consistency is that important in our new “social” society.

Shakespeare wrote “to thine own self be true,” and this is the core of being faithful. I start by identifying what I believe and how I want to be perceived. I must understand the consequences I am selecting and own them in good times and bad. I put those decisions on display so that others can see and interpret them, especially the ownership and repercussions. Transparency in decisions and owning the consequences are aspects I want in leaders, and I must demonstrate the behaviors I want.

Being purely true to “thine own self” is problematic. Focusing internally creates an island of isolation. Islands don’t work well on teams, and they make for horrible leaders. Overcoming the island problem requires respecting that others are also trying to find and be true to themselves. Just as I have different experience at 35 than I did at 19, others are learning through their own experience. I nurture my respect for others through mentoring and trying to learn from their experience. I freely share what I have learned from my experiences. Helping others find their own definition of “always faithful”.

I don’t think faithfulness to organizations, projects, or people exists anymore. I view the “Corps and country” answer from boot camp to be naive. Other people, no matter what their role or their previous success are still learning. They all suffer from the same cognitive biases that plague all people and leaders. Ideals are less prone to human basis, and as demonstrated with a phrase like “Semper Fidelis”, they can evolve as the situation changes.

Find what you want to be remembered as, and always be faithful to those ideals.

Friday, February 6, 2015

Something to replace BSON.

I have been working on my own implementation of the BSON Spec for about 3 years now. I have come to the conclusion that the BSON design seems haphazard and organic. The parsing is harder and more intensive than it should be. Scanning it is poorly designed. The specific types require too much knowledge to handle properly. They even continue making small changes to the specification, but not bumping the version number.

In order to solve my problems with it, I have decided to create my own specification and call it WatSON. Well, technically, I am calling it “왓슨”, but I don’t know how to type that on my English (US) keyboard.

Here are some of my design goals.

  • Keys are defined at the start of the document to eliminate repetitive keys.
  • Arrays don’t have keys at all.
  • Types and Objects are connected, rather than being separated by the key name.
  • Containers and strings have 64 bit size markers. Other types have 1 byte size markers. The format must have a simple rule for skipping elements that don’t match a known type.
  • Document format is little-endian.
  • String formats support Snappy compression.
  • Format supports header information to influence how the document is read.

A lot of my influence in the design is coming from dealing with atoms in the MP4 file format. 

For more detail on how things are coming together:

Friday, July 18, 2014

Connection Pools.

While thinking through where I want to take Logjammin', given it has been neglected for a while, I was contemplating the different types of connection pools that a server has, and basically broke it down in to three different pools:
  1. Connections to the clients.
  2. Connections to peers.
  3. Connections to other (dependent) services.
Now, I can easily argue the first set isn't a "pool" in the traditional sense, but it is a collection of connections that must be managed in some fashion. The second one is normally used for replication or synchronization. And the last one is the one we normally make connection pools for -- typically for database connections.

Take a database server for example. Knowing your clients will be intentionally keeping long running connections, it makes sense to pre-allocate resources for connections or keep the connections open and idle. The only real difference on the two sides is who initiated the connection. Keep-alive, long polling, and server sent events, the client as a connection pool starts to make a lot more sense.

That leaves the peers. I like to think of peers as essentially client AND server initiated connection pools. They require solutions on both sides, as some peer will connect to them, and they, in term, will connect to some other peer.

Right now, I am looking at the logjammin' code, neglected for a year, and questioning why I treat all three pools distinctly different? The Connections to clients is handled through the normal listen. The connection to the peers is handled in a different thread. And the dependent connections aren't implemented at all.

I have another trip to Dulles this weekend, and I want to play around with creating a pool of connections concept to replace the "Server" concept I have built right now. Something like each Logjammin' instance is a collection of 1 or more connection pools which deal with resource management, and initiating a configurable series of stages or events.

Update 10/10/2014: It took a while, but finally pushed it.

Saturday, September 22, 2012

logjammin update

I moved logjammin over to github from bitbucket. Had more to do with the usual social networking stuff -- people I wanted to work with are on github, don't really use bitbucket).

In addition to that, I have been doing some major surgery on the project to change out some of the dependencies. Specifically, I have switched out CryptoPP for the Nettle cryptography library, and I am eventually going to switch out OpenSSL for gnutls.

On the nettle front, I had a hard time finding good examples for the GCM mode encryption. I had to piece things together form the information in the Nettle documentation, source, and playing with results. You can see the resulting encrypt and decrypt methods in the lj::Document class.

The most confusing part, at least for me, was that in GCM mode, everything uses the encrypt function of the underlying cipher. If you don't pay attention and accidentally pass in a decryption function, it will not work as expected. I also, stupidly, thought I had to chunk my inputs into the block size. I didn't realize until the end that I could just pass the entire message in and get the expected result. You only need to chunk up the message if you need to call encrypt / decrypt multiple times.