
Sunday, March 18, 2012

V8 and Lua

I haven't posted in a long time. I have been investing a lot of time into the NoSQL database I have been working on. The biggest change there lately has been integrating V8 and Lua into the code base. One of the things I really liked about the Lua integration was something called Lunar, an incredibly easy way to simplify Lua integration in C++. I don't remember all of the changes I have made to my version over the years: things like metamethod support, etc. If you are trying to do some Lua integration, it is a great place to start.
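If you haven't seen Lunar before, the basic pattern looks roughly like the sketch below. This assumes the stock lunar.h header from the lua-users wiki on the include path and Lua 5.1; the Counter class and its methods are made-up examples, not anything from Logjammin'.

    // Sketch of the classic Lunar binding pattern (lua-users wiki lunar.h).
    extern "C" {
    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>
    }
    #include "lunar.h"

    class Counter {
        int value_;
    public:
        static const char className[];
        static Lunar<Counter>::RegType methods[];

        // Lunar invokes this constructor when Lua code calls Counter().
        Counter(lua_State* L) : value_(0) {}

        // Bound methods take the Lua state and return the number of results.
        int add(lua_State* L) {
            value_ += static_cast<int>(luaL_checkinteger(L, 1));
            return 0;
        }
        int get(lua_State* L) {
            lua_pushinteger(L, value_);
            return 1;
        }
    };

    const char Counter::className[] = "Counter";
    Lunar<Counter>::RegType Counter::methods[] = {
        {"add", &Counter::add},
        {"get", &Counter::get},
        {0, 0}
    };

    int main() {
        lua_State* L = luaL_newstate();
        luaL_openlibs(L);
        Lunar<Counter>::Register(L);  // exposes Counter to Lua scripts
        luaL_dostring(L, "local c = Counter() c:add(5) print(c:get())");
        lua_close(L);
        return 0;
    }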

But in order to get some friends to help me work on Logjammin', I had to start incorporating V8 (silly JS developers). I have to say that I found certain parts of the V8 documentation and examples really hard to follow. After spending an entire afternoon just trying to get a print function into logjamd, I realized that it would be pretty straightforward to create a V8 version of Lunar. It is actually slightly easier, because the template objects that normally get used can be cached. And because I am currently reading Quicksilver by Neal Stephenson, I decided to call the helper file "Jesuit". To be honest, I was thinking of Janissaries, since that is how book two ends. But that was too long of a word.
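For reference, here is roughly what that afternoon-eating print function boils down to, written against the V8 API as it looked around 2012. Newer V8 releases have changed most of these signatures, so treat this as a period sketch rather than current usage:

    #include <iostream>
    #include <v8.h>

    // Native callback exposed to scripts as print(...).
    v8::Handle<v8::Value> Print(const v8::Arguments& args) {
        v8::HandleScope scope;
        for (int i = 0; i < args.Length(); ++i) {
            if (i > 0) std::cout << ' ';
            v8::String::Utf8Value text(args[i]);  // UTF-8 view of the argument
            std::cout << (*text ? *text : "(string conversion failed)");
        }
        std::cout << std::endl;
        return v8::Undefined();
    }

    int main() {
        v8::HandleScope scope;

        // Install print on the global template before creating the context.
        v8::Handle<v8::ObjectTemplate> global = v8::ObjectTemplate::New();
        global->Set(v8::String::New("print"), v8::FunctionTemplate::New(Print));

        v8::Persistent<v8::Context> context = v8::Context::New(NULL, global);
        v8::Context::Scope context_scope(context);

        v8::Handle<v8::Script> script =
            v8::Script::Compile(v8::String::New("print('hello from v8')"));
        script->Run();

        context.Dispose();
        return 0;
    }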

Anyway. The Lua version is here:

And the v8 version is here:

Friday, May 28, 2010

Concurrency approaches

When you start developing a service or system, it is much easier to make it work as a single thread in a single process. Single threading allows rapid development of a concept at the expense of concurrency. A single-threaded design just makes things so much more predictable.

But eventually a service needs the ability to handle more than a single user at a time. Typically, the process either forks itself or utilizes some threading library (like pthreads). Multithreading an application introduces new complexities: IPC, mutexes, semaphores, race conditions, and atomicity all become troublesome concepts. For some applications, multithreading is the only way to go. Applications we use every day would be unusable as single-threaded programs. Complex AJAX websites would be horrible -- the A in AJAX stands for Asynchronous, after all.
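To give a taste of that bookkeeping, here is a minimal pthreads sketch of two threads bumping a shared counter. The counter and loop count are arbitrary; remove the lock and you have a textbook race condition:

    #include <pthread.h>
    #include <cstdio>

    static long counter = 0;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    void* worker(void*) {
        for (int i = 0; i < 100000; ++i) {
            pthread_mutex_lock(&counter_lock);   // serialize access to counter
            ++counter;
            pthread_mutex_unlock(&counter_lock);
        }
        return NULL;
    }

    int main() {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        std::printf("counter = %ld\n", counter);  // always 200000 with the lock
        return 0;
    }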

Most things accomplished with threading can also be accomplished with multiple processes. IPC can be done through a resource other than memory. Creating a socket connection between two processes is a very simple way to implement IPC, and it avoids some of the complications inherent in semaphores. So why don't developers leverage this approach more often? It is typically viewed as wasteful of resources, and it usually performs more slowly than two threads communicating inside the same process. For these reasons I have avoided multiple processes unless the functionality is intended for physically different hosts.
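For the curious, here is a minimal sketch of that style of IPC using socketpair(2) and fork(); the ping/ack exchange is just an illustration:

    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fds[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
            std::perror("socketpair");
            return 1;
        }
        pid_t pid = fork();
        if (pid == 0) {                       // child: read a request, reply
            close(fds[0]);
            char buf[64];
            ssize_t n = read(fds[1], buf, sizeof(buf) - 1);
            buf[n > 0 ? n : 0] = '\0';
            std::printf("child got: %s\n", buf);
            write(fds[1], "ack", 3);
            close(fds[1]);
            return 0;
        }
        close(fds[1]);                        // parent: send, await the reply
        write(fds[0], "ping", 4);
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
        buf[n > 0 ? n : 0] = '\0';
        std::printf("parent got: %s\n", buf);
        close(fds[0]);
        waitpid(pid, NULL, 0);
        return 0;
    }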

Then Google Chrome came along and I had to partially rethink the beliefs above. Chrome is not a single process running multiple threads; it is a collection of multiple processes, each running multiple threads. Google decided more upfront memory consumption was an acceptable trade for better crash isolation. Check out the Chrome comic book for a crash course on Chrome's architecture in comic-book format.

So now I am trying to write a database service. Like all modern database servers, it needs to support replication, and in particular replication between hosts. In development, I normally run two processes on the same host and pretend those processes are running on different machines. It was at this point the lightbulb lit up and I started thinking about Chrome's architecture. What if a database server were a collection of single-threaded processes running on a single host? The database equivalent of the prefork Apache MPM. The model is not new, and it isn't even unique, but it isn't common in the DB world. kdb+ is one of the few databases I have heard of that tries this approach.
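To make the prefork analogy concrete, here is a minimal sketch of the pattern: a parent binds one listening socket, then forks a handful of single-threaded workers that each run their own accept loop. The port, worker count, and canned reply are arbitrary stand-ins for real request handling:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstring>

    static void serve(int listen_fd) {
        for (;;) {                             // single-threaded accept loop
            int client = accept(listen_fd, NULL, NULL);
            if (client < 0) continue;
            const char* reply = "handled by worker\n";
            write(client, reply, std::strlen(reply));
            close(client);
        }
    }

    int main() {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int on = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        sockaddr_in addr;
        std::memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = htons(9000);           // arbitrary example port
        bind(fd, (sockaddr*)&addr, sizeof(addr));
        listen(fd, 128);

        const int kWorkers = 4;                // arbitrary worker count
        for (int i = 0; i < kWorkers; ++i) {
            if (fork() == 0) {                 // each child inherits the socket
                serve(fd);
                _exit(0);
            }
        }
        for (int i = 0; i < kWorkers; ++i) wait(NULL);
        return 0;
    }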

Unlike the prefork model, I am thinking of leaving them completely unrelated processes: essentially using the replication functionality as my IPC model and some other technology to do the load balancing. The simplicity of the approach looks appealing at first glance, and the theoretical redundancy and fault tolerance in the model is intriguing.

Monday, June 29, 2009

Tokyo Cabinet Tuning Part 1 - Bucket Array Size

I have been playing around with Tokyo Cabinet for a few weeks now, and I wanted to share some of the tuning hints I have found.

I was loading a database with just shy of two billion records, and write speed became unacceptably slow after about the 500 million mark. In order to improve performance, I began experimenting with the different tuning options available through the tcbdbtune function. The first option I experimented with was the number of members in the bucket array.

Putting records into the B+ tree database will be much slower than you expect unless you increase the number of elements in the bucket array. Some of my runs took over 30 minutes to load 100 million records. I performed over 200 tests with different bucket sizes, leaf/non-leaf member values, and record counts. In the end I found the bucket array should be between one-tenth and six-tenths of the expected data-set size; anything smaller or larger results in longer loads. The leaf/non-leaf values had very little impact on the performance of linear record writing.
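In code, that tuning is a single call before opening the database. Here is a minimal sketch against the stock Tokyo Cabinet C API; the file name and record count are example values, and the 40% ratio is just one point inside the range above:

    #include <cstdio>
    #include <cstdint>
    #include <tcbdb.h>

    int main() {
        const int64_t expected_records = 100000000LL;   // 100 million
        TCBDB* bdb = tcbdbnew();

        // tcbdbtune(bdb, lmemb, nmemb, bnum, apow, fpow, opts):
        // -1 keeps the library default; bnum is the bucket array size.
        // tcbdbtune must be called before tcbdbopen.
        tcbdbtune(bdb, -1, -1, expected_records * 4 / 10, -1, -1, 0);

        if (!tcbdbopen(bdb, "casket.tcb", BDBOWRITER | BDBOCREAT)) {
            std::fprintf(stderr, "open error: %s\n",
                         tcbdberrmsg(tcbdbecode(bdb)));
            tcbdbdel(bdb);
            return 1;
        }
        // ... load records with tcbdbput() ...
        tcbdbclose(bdb);
        tcbdbdel(bdb);
        return 0;
    }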

I am still collecting data on the performance of different leaf/non-leaf settings for random writes, and I will post about those findings in part 2.