Sunday, July 13, 2008

A Scalable Red Black Tree Based Structure with Hashing

I've been trying to figure out the best way to write scalable shared data structures in Erlang. In my last post, I talked about writing one based on regular binary search trees. That implementation would create a number of processes, and divide a pre-determined key range onto those processes.

That implementation suffered from a few problems; first of all, the keys had to be integers, the range of those integers had to be known. It performed well and scaled well as long as the keys were well distributed; but if they weren't well distributed, and in the real world they probably aren't, then things would get ugly.

Fortunately the solution to all these problems turned out to be quite simple. In that last post, autrepack commented that I should use a hash (I'm glad I have a blog), and that's what I did.

No need to know the range of the keys, I just create as many processes as needed and whenever a request comes along, I hash its key using erlang:phash2 and based on that I decide which bucket (process) needs to handle it.

This automatically solved three problems; I no longer am limited to integers as key values, I don't need to predetermine the range of my keys, and load should be balanced (since I'm assuming that phash2 is a good hash function). Three birds in one stone - merci autrepack!

Also, to balance things a bit more, instead of using normal binary search trees, the data structure is now based on red black trees. The cool thing about them is that key keep the tree kinda balanced and offer worst-case scenario performance guarantees.

The result of my work can be found at:-

I haven't tested this yet, but I imagine that under the tests I already have it would perform slightly worse than before - that's because my previous tests used uniformly generated random numbers, which pretty much guarantees that everything is well distributed - therefore the overhead of using a red black tree isn't utilized. However, I believe than under real world load this should perform better

The cool thing is, this really was easy to write compared to implementing a similar solution in other languages.

Thursday, July 10, 2008

A Scalable Binary Search Tree Based Structure in Erlang

In my previous post I talked about writing a scalable binary search tree, and the difficulties I was having.

I believe I finally managed to create the structure I wanted to. The main problem that I was having is that I was still thinking in C (or sequential languages), so what I wanted to do was build a single binary search tree (bst) that scales well. What I realized though, is that rather than doing that, I should create N bsts where N is proportional to the number of processors available in the system.

So what I'm doing now is spawn a number of threads (servers), each thread is responsible for a bst that covers a range of keys, and whenever an operation is performed, the client sends it to the particular server that's responsible for the tree in that range. If this doesn't make sense, then maybe my code will make it clearer:- (code needs better error handling, general tidying up)

The main problem with this implementation though is that if the distribution of the keys isn't uniform within the expected range, it won't perform well. But I guess that is a problem with non-balanced bsts in general.

That said, this solution turned out to be simpler than other solutions I've attempted, performs better in the single-threaded case, and scales pretty well.

Here's a link to a graph that shows how well this scales on an 8 core machine:-

Compare with my previous attempt:-

Can't really ask for better than that.

What I need to do now is figure out a way to balance the load somehow when it's not uniform.

Monday, July 07, 2008

Problems Writing a Scalable Shared Data Structure in Erlang

[This is adapted from a post I sent to erlang-questions.]

A while back I posted a question asking how to write a scalable shared data structure in Erlang; it can be anything really but I decided to go with a binary search tree (bst) since it's a simple data structure that should scale well with load (as long as it's balanced). I talked about that attempt in my previous post.

Anyway, I finished writing the code, which I'm assuming (read: hoping) is correct. :) What I found though is that it's not scaling at all. By that I mean that as the number of threads/processes accessing the bst increases, throughput doesn't improve by as much.

I've implemented the nodes as processes, and all operations on the tree are relayed to the node that needs to do something about it. The code for this implementation can be found here:-

I did my tests on a Sun Fire V880 with 8 900-MHz UltraSPARC-III processors and 16 GBs of main memory; running Solaris 10 and Erlang (BEAM) 5.6.3.

My tests consisted of randomly creating and populating a tree that has a key range of 0-1000; and then performing at random (uniform distribution) either an insert(), delete() or a contains() done 100,000 times. I would vary the number of threads from 1 to 8, and the load (100,000 operations) would be divided by the number of processors available. I also performed the tests where the only operation performed is a contains() operation (designated by 1:0:0), or 8 contains() operations, 1 insert() and 1 delete() (8:1:1), or an equal mix of all three (1:1:1).

In a perfect world, 2 threads would do the job in half the time 1 thread would, but of course nothing is perfect. So as a baseline I performed another test whereby each thread would just call a function (repeated for a number of times) that does nothing but return true, to see how well that scales (in the graph: Best Scalability). This used the same functions to distribute the load, except that randOperation() would only return true rather than do anything useful.

Anyway, you can find the graph (where the results are normalized to the wall time for one processor) at:- (the lower the lines goes, better the scalability)

and the raw data (showing the wall time in ms) at:-

As you can see, my data structure isn't scaling well at all regardless of the kind of workload it has. I would expect it to scale well since the tree should be kind of balanced. In C I would know how to write an implementation where at least contains() scales well; writing something where tree modification scales is a bit more difficult but should be doable. However, I am new to Erlang and I can't really reason about all the issues well enough to pull off a similar implementation.

In a nutshell my question is; what am I doing wrong? Is there a better way to have a place that stores shared data in a scalable manner?