Sunday, July 13, 2008

A Scalable Red Black Tree Based Structure with Hashing

I've been trying to figure out the best way to write scalable shared data structures in Erlang. In my last post, I talked about writing one based on regular binary search trees. That implementation would create a number of processes and divide a pre-determined key range among them.

That implementation suffered from a few problems: first of all, the keys had to be integers, and the range of those integers had to be known in advance. It performed and scaled well as long as the keys were well distributed; but if they weren't (and in the real world they probably aren't), things would get ugly.

Fortunately the solution to all these problems turned out to be quite simple. In that last post, autrepack commented that I should use a hash (I'm glad I have a blog), and that's what I did.

There's no need to know the range of the keys: I just create as many processes as needed, and whenever a request comes along, I hash its key using erlang:phash2 and, based on the result, decide which bucket (process) should handle it.

This automatically solved three problems: I'm no longer limited to integers as key values, I don't need to predetermine the range of my keys, and the load should be balanced (assuming that phash2 is a good hash function). Three birds with one stone - merci autrepack!
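The bucket-selection step really is a one-liner; here's a minimal, self-contained sketch of the idea (the module and function names are mine, not the ones in the actual code):

```erlang
-module(hash_router).
-export([bucket_for/2, test/0]).

%% erlang:phash2(Term, Range) hashes any Erlang term to an integer
%% in 0..Range-1, so the result can index directly into a tuple or
%% list of bucket (process) pids.
bucket_for(Key, NumBuckets) ->
    erlang:phash2(Key, NumBuckets).

test() ->
    N = 8,
    %% Any term works as a key: integers, atoms, strings, tuples.
    Buckets = [bucket_for(K, N) || K <- [42, foo, "bar", {a, 1}]],
    true = lists:all(fun(B) -> B >= 0 andalso B < N end, Buckets),
    %% Hashing is deterministic: the same key maps to the same bucket.
    true = bucket_for(42, N) =:= bucket_for(42, N),
    ok.
```

Since the bucket index depends only on the key, every client routes a given key to the same process without any coordination.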

Also, to balance things a bit more, instead of using plain binary search trees, the data structure is now based on red-black trees. The cool thing about them is that they keep the tree roughly balanced and offer worst-case performance guarantees.

The result of my work can be found at:-

I haven't tested this yet, but I imagine that under the tests I already have it would perform slightly worse than before. That's because my previous tests used uniformly generated random numbers, which pretty much guarantees that everything is well distributed, so the balancing that a red-black tree provides buys nothing there. However, I believe that under real-world load this should perform better.

The cool thing is, this really was easy to write compared to implementing a similar solution in other languages.

Thursday, July 10, 2008

A Scalable Binary Search Tree Based Structure in Erlang

In my previous post I talked about writing a scalable binary search tree, and the difficulties I was having.

I believe I finally managed to create the structure I wanted. The main problem I was having is that I was still thinking in C (or in sequential languages generally): what I wanted to do was build a single binary search tree (bst) that scales well. What I realized, though, is that rather than doing that, I should create N bsts, where N is proportional to the number of processors available in the system.

So what I'm doing now is spawning a number of processes (servers), each responsible for a bst that covers a range of keys; whenever an operation is performed, the client sends it to the particular server responsible for the tree covering that range. If this doesn't make sense, then maybe my code will make it clearer:- (code needs better error handling and general tidying up)
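The routing logic itself is just integer arithmetic; here's a rough sketch of the range-partitioning idea (hypothetical module, assuming integer keys in a known range 0..MaxKey):

```erlang
-module(range_router).
-export([server_for/3, test/0]).

%% Each of NumServers servers owns a contiguous slice of the key
%% range 0..MaxKey; return the index of the server that owns Key.
server_for(Key, MaxKey, NumServers) ->
    %% Ceiling of (MaxKey+1)/NumServers, so all keys are covered.
    SliceSize = (MaxKey + NumServers) div NumServers,
    Key div SliceSize.

test() ->
    %% With keys 0..999 and 4 servers the slices are 0-249, 250-499,
    %% 500-749 and 750-999.
    0 = server_for(0, 999, 4),
    0 = server_for(249, 999, 4),
    1 = server_for(250, 999, 4),
    3 = server_for(999, 999, 4),
    ok.
```

This is exactly the piece that breaks down when keys aren't uniform over the range, which is what the hashing approach in the later post fixes.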

The main problem with this implementation though is that if the distribution of the keys isn't uniform within the expected range, it won't perform well. But I guess that is a problem with non-balanced bsts in general.

That said, this solution turned out to be simpler than other solutions I've attempted, performs better in the single-threaded case, and scales pretty well.

Here's a link to a graph that shows how well this scales on an 8 core machine:-

Compare with my previous attempt:-

Can't really ask for better than that.

What I need to do now is figure out a way to balance the load somehow when it's not uniform.

Monday, July 07, 2008

Problems Writing a Scalable Shared Data Structure in Erlang

[This is adapted from a post I sent to erlang-questions.]

A while back I posted a question asking how to write a scalable shared data structure in Erlang; it can be anything really but I decided to go with a binary search tree (bst) since it's a simple data structure that should scale well with load (as long as it's balanced). I talked about that attempt in my previous post.

Anyway, I finished writing the code, which I'm assuming (read: hoping) is correct. :) What I found though is that it's not scaling at all. By that I mean that as the number of threads/processes accessing the bst increases, throughput doesn't improve proportionally.

I've implemented the nodes as processes, and all operations on the tree are relayed to the node that needs to do something about it. The code for this implementation can be found here:-

I did my tests on a Sun Fire V880 with 8 900-MHz UltraSPARC-III processors and 16 GBs of main memory; running Solaris 10 and Erlang (BEAM) 5.6.3.

My tests consisted of randomly creating and populating a tree with a key range of 0-1000, and then performing 100,000 operations, each chosen at random (uniform distribution) from insert(), delete() and contains(). I varied the number of threads from 1 to 8, and the load (100,000 operations) was divided evenly among the threads. I also ran variants where the only operation performed is contains() (designated 1:0:0), where there are 8 contains() operations for every insert() and delete() (8:1:1), and where there is an equal mix of all three (1:1:1).
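The harness boils down to splitting the operations among worker processes and timing the whole run; a stripped-down sketch (not the actual benchmark code, with the tree operation stubbed out) looks like this:

```erlang
-module(bench).
-export([run/2, test/0]).

%% Divide TotalOps operations evenly among NumWorkers processes,
%% wait for all of them to finish, and return the wall time in
%% microseconds as measured by timer:tc/1.
run(NumWorkers, TotalOps) ->
    OpsEach = TotalOps div NumWorkers,
    Parent = self(),
    {Micros, ok} =
        timer:tc(fun() ->
            Pids = [spawn(fun() ->
                        [randOperation() || _ <- lists:seq(1, OpsEach)],
                        Parent ! {done, self()}
                    end) || _ <- lists:seq(1, NumWorkers)],
            %% Wait for every worker to report completion.
            [receive {done, P} -> ok end || P <- Pids],
            ok
        end),
    Micros.

%% Stand-in for picking and running a random insert/delete/contains;
%% as written this measures only the "Best Scalability" baseline case.
randOperation() -> true.

test() ->
    Micros = run(4, 1000),
    true = is_integer(Micros) andalso Micros >= 0,
    ok.
```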

In a perfect world, 2 threads would do the job in half the time 1 thread would, but of course nothing is perfect. So as a baseline I performed another test whereby each thread would repeatedly call a function that does nothing but return true, to see how well that scales (in the graph: Best Scalability). This used the same functions to distribute the load, except that randOperation() would only return true rather than do anything useful.

Anyway, you can find the graph (with the results normalized to the wall time for one processor) at:- (the lower the line goes, the better the scalability)

and the raw data (showing the wall time in ms) at:-

As you can see, my data structure isn't scaling well at all regardless of the kind of workload it has. I would expect it to scale well since the tree should be kind of balanced. In C I would know how to write an implementation where at least contains() scales well; writing something where tree modification scales is a bit more difficult but should be doable. However, I am new to Erlang and I can't really reason about all the issues well enough to pull off a similar implementation.

In a nutshell my question is; what am I doing wrong? Is there a better way to have a place that stores shared data in a scalable manner?

Monday, June 30, 2008

Concurrent and Shared Binary Search Tree in Erlang

So, wanting to work a bit more with Erlang, and to try to get a better understanding of how to write parallel programs in it, I decided to write a concurrent Binary Search Tree (a basic one, not balanced) and see how that goes. I thought a binary search tree would be a good candidate since operations normally affect only a small subset of the tree, so quite a bit of concurrency should be possible. Of course, the more balanced the tree, the better the expected concurrency.

Since I'm new to Erlang, I started by trying to parallelize a simple Erlang implementation, similar to the one in Concurrent Programming in Erlang. This implementation is essentially a tuple, and adding or removing nodes consists of modifying the tuple.

What I did was create a process that holds the current state of the BST tuple and waits for messages that tell it what to do: it can insert, delete or look up a node in the tree. Of course, this basic implementation processes such requests sequentially.

I think I managed to successfully parallelize the lookup operation, since I spawn a new process to search through the tree, but I couldn't figure out a way to parallelize insert or delete. Spawning a new process to modify the tree in the same manner wouldn't work, since I need all the modifications to be consolidated somewhere. But at least this implementation should be able to have multiple lookups running in parallel with at most one insert/delete operation.

Here's a link to the code I've mentioned above.
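For the curious, the shape of that server is roughly the following (a self-contained sketch with my own names, pared down to insert and lookup):

```erlang
-module(bst_server).
-export([start/0, insert/2, lookup/2, test/0]).

%% A single process owns the tree (a nested tuple) and serialises
%% modifications; each lookup is handed to a freshly spawned process,
%% so multiple readers can run in parallel over the same snapshot.
start() -> spawn(fun() -> loop(nil) end).

loop(Tree) ->
    receive
        {insert, Key} ->
            loop(ins(Key, Tree));
        {lookup, Key, From} ->
            %% The tuple is immutable, so the reader can safely work
            %% on the current snapshot while the server moves on.
            spawn(fun() -> From ! {found, Key, contains(Key, Tree)} end),
            loop(Tree)
    end.

ins(Key, nil) -> {Key, nil, nil};
ins(Key, {Key, _, _} = Node) -> Node;
ins(Key, {K, L, R}) when Key < K -> {K, ins(Key, L), R};
ins(Key, {K, L, R}) -> {K, L, ins(Key, R)}.

contains(_, nil) -> false;
contains(Key, {Key, _, _}) -> true;
contains(Key, {K, L, _}) when Key < K -> contains(Key, L);
contains(Key, {_, _, R}) -> contains(Key, R).

insert(Pid, Key) -> Pid ! {insert, Key}, ok.

lookup(Pid, Key) ->
    Pid ! {lookup, Key, self()},
    receive {found, Key, Result} -> Result end.

test() ->
    Pid = start(),
    ok = insert(Pid, 5),
    ok = insert(Pid, 3),
    true = lookup(Pid, 5),
    true = lookup(Pid, 3),
    false = lookup(Pid, 7),
    ok.
```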

* * *

So I asked the Erlang questions mailing list what the best way to go about solving the problem would be, and the answer I got is that I need to think of each node as a process in itself. While this isn't how you'd tackle such a problem in C, in Erlang processes are quite cheap.

So each node is a spawned instance of the following function:-

% Maintains the state of the node
nodeState(Node) ->
    receive
        {contains, Key, Requester} ->
            contains(Key, Node, Requester),
            nodeState(Node);

        {insert, Key} ->
            nodeState(insert(Key, Node));

        {delete, Key} ->
            nodeState(delete(Key, Node));

        {node, Requester} ->
            Requester ! Node,
            nodeState(Node);

        {dump, Requester} ->
            dump(Node, Requester),
            nodeState(Node)
    end.
Whenever a node gets a request, it either processes it itself if it has the correct key, or it passes it on down the tree.
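To make the routing concrete, here's a compressed, self-contained variant of the process-per-node idea (my own names, pared down to insert and contains; the complete code handles the full interface):

```erlang
-module(node_tree).
-export([new/1, insert/2, contains/2, test/0]).

%% Each node is a process holding {Key, LeftPid | nil, RightPid | nil}.
new(Key) -> spawn(fun() -> node_loop({Key, nil, nil}) end).

node_loop({Key, Left, Right} = Node) ->
    receive
        {contains, Key, From} ->            %% this node holds the key
            From ! {result, true},
            node_loop(Node);
        {contains, K, From} when K < Key -> %% pass it down the tree
            route(Left, {contains, K, From}),
            node_loop(Node);
        {contains, K, From} ->
            route(Right, {contains, K, From}),
            node_loop(Node);
        {insert, Key} ->                    %% duplicate key: ignore
            node_loop(Node);
        {insert, K} when K < Key, Left =:= nil ->
            node_loop({Key, new(K), Right});
        {insert, K} when K < Key ->
            Left ! {insert, K},
            node_loop(Node);
        {insert, K} when Right =:= nil ->
            node_loop({Key, Left, new(K)});
        {insert, K} ->
            Right ! {insert, K},
            node_loop(Node)
    end.

%% Forward to a child, or answer false when the search falls off the tree.
route(nil, {contains, _K, From}) -> From ! {result, false};
route(Pid, Msg) -> Pid ! Msg.

insert(Root, Key) -> Root ! {insert, Key}, ok.

contains(Root, Key) ->
    Root ! {contains, Key, self()},
    receive {result, R} -> R end.

test() ->
    Root = new(50),
    [insert(Root, K) || K <- [30, 70, 20, 60]],
    true = contains(Root, 50),
    true = contains(Root, 20),
    true = contains(Root, 60),
    false = contains(Root, 99),
    ok.
```

Note that inserts are fire-and-forget here; message ordering between any two processes guarantees that a contains sent after an insert (by the same client, down the same path) sees the inserted key.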

The complete code is here.

Implementing contains and insert was pretty straightforward, since data flows in only one direction (down the tree). delete was a bit more challenging: when deleting a node that has children, you need to find a good replacement for it and delete the replacement instead (i.e. replace it with the maximal node in its left subtree and then delete that node). This means that messages can be exchanged in the opposite direction, which creates some potential for deadlock.

That said, I have to admit that it was significantly easier writing this kind of algorithm in Erlang than it would have been in C; that is, of course, assuming that my code is correct.

A note about the interface (the rpcs for insert, delete, contains): no acknowledgement is sent after an insert or a delete; contains, of course, waits until a response is received. I hope that this implementation is correct in the sense that if something is inserted and contains is then called, the response should be positive (unless another process deleted it in the meantime), and vice versa.

What's still remaining is a sanity test to see whether my implementation is correct (Dijkstra: testing can be used to show the presence of bugs, but never to show their absence). But that's all I can do for the time being.

I also need to figure out a way to write a benchmark to test performance. This is going to be tricky, since processes do not acknowledge that operations have been completed. I could add acknowledgements, but that would introduce more overhead...

So, what I like about this implementation is that it was relatively easy to reason about; what I don't like is the overhead of a process per node. I recall reading that an Erlang process uses up something in the neighbourhood of 200 bytes; so if the data stored per process is small, this overhead is substantial.

Next step; write a concurrent Red-Black tree!

Monday, June 16, 2008

Full Red Black Tree Implementation in Erlang

In my previous post I talked about Red Black trees and how to implement them in Erlang. I only dealt with inserting into a Red Black tree. I finally got around to finalizing this implementation by adding the code for deleting from a Red Black tree.

There were some challenges; the ones that stand out the most are:-
  • The algorithm for deleting is not an easy one, and I actually had to understand it this time (as opposed to just rewriting pseudocode into C, as when I did the C implementation). This link, which Robert Virding pointed out to me, was really helpful in this respect.
  • The fact that variables are immutable in Erlang, and the whole functional programming way of thinking is still not coming naturally to me. I guess I need to practice more.
  • Erlang doesn't allow functions as guards, even if those functions have no side effects. You'd think it would be trivial for the compiler to check for that, or for the VM to throw an exception at runtime if it encounters a guard with side effects; but apparently it's simply not allowed, for historical reasons or something. Fortunately, with help from the erlang questions list, I was able to get around this limitation by using macros.
With all of that said, and despite the learning curve involved, I found it significantly easier to write what seems to be a correct implementation of a Red Black Tree in Erlang than it was in C. The code is significantly more concise and, I believe, easier to read and understand.

For those who are interested, here's a link to my full Erlang Red Black Tree implementation. As I mentioned before, I'm still new to Erlang, so I would greatly appreciate any comments you might have on my code, whether related to style or to the actual algorithm. That said, I know a couple of places where my code could be optimised, but I was aiming for clarity. If you have any suggestions that could improve both, I'd like to hear them.

Now off to do some work...

Monday, June 09, 2008

Erlang and Red-black Trees

Update: Full RBTree implementation in Erlang.

Recently I've been trying to pick up Erlang, a message-based concurrent functional programming language. The main difficulty I'm having is wrapping my head around the whole functional thing and dealing with single assignment (immutable data). I thought what better way to try and get it than to implement my favourite data structure - the venerable Red-black tree!

The RB Tree is a self-balancing binary search tree, and all the colouring and rebalancing business can be a bit tricky. To give you an idea, my C implementation is roughly 1000 lines of code (including comments/spaces).

To be honest, it hasn't been easy. Fortunately, I stumbled upon this entry which mentioned a functional implementation in Haskell. With my good friend Nick's help in understanding the Haskell bit, I managed to whip up an Erlang implementation in less than an hour and less than 50 lines! One caveat - so far it only inserts, but I'll be working on it more.

To understand this implementation, make sure to refer to the paper I mentioned above first, since the algorithm is described there.

So, here's the code:-

-export([insert/3, test/0]).

% Node description:-
% {Key, Value, Color, Left, Right}

balance({Kz, Vz, b, {Ky, Vy, r, {Kx, Vx, r, A, B}, C}, D}) ->
    {Ky, Vy, r, {Kx, Vx, b, A, B}, {Kz, Vz, b, C, D}};

balance({Kz, Vz, b, {Kx, Vx, r, A, {Ky, Vy, r, B, C}}, D}) ->
    {Ky, Vy, r, {Kx, Vx, b, A, B}, {Kz, Vz, b, C, D}};

balance({Kx, Vx, b, A, {Kz, Vz, r, {Ky, Vy, r, B, C}, D}}) ->
    {Ky, Vy, r, {Kx, Vx, b, A, B}, {Kz, Vz, b, C, D}};

balance({Kx, Vx, b, A, {Ky, Vy, r, B, {Kz, Vz, r, C, D}}}) ->
    {Ky, Vy, r, {Kx, Vx, b, A, B}, {Kz, Vz, b, C, D}};

% No rebalancing needed
balance({Key, Value, Color, Left, Right}) ->
    {Key, Value, Color, Left, Right}.

% Inserting into an empty tree
ins(Key, Value, {}) ->
    {Key, Value, r, {}, {}};

% Collision with an existing key: leave it as it is
ins(Key, _, {Key, Value, Color, Left, Right}) ->
    {Key, Value, Color, Left, Right};

ins(Key, Value, {Key2, Value2, Color, Left, Right}) when Key < Key2 ->
    balance({Key2, Value2, Color, ins(Key, Value, Left), Right});

ins(Key, Value, {Key2, Value2, Color, Left, Right}) when Key > Key2 ->
    balance({Key2, Value2, Color, Left, ins(Key, Value, Right)}).

% Ensures that the root is black
makeBlack({Key, Value, _, Left, Right}) ->
    {Key, Value, b, Left, Right}.

insert(Key, Value, Tree) ->
    makeBlack(ins(Key, Value, Tree)).

Friday, May 09, 2008

Graduation Day

Finally, after over three years of hard work, I actually have a diploma to prove that it was all for something! Yay!

To be honest, it felt kind of anticlimactic, but since I missed my bachelor's graduation I figured I might as well attend this one. The weather wasn't great, and the ceremony was kinda boring, but it was fun nonetheless, especially with my lovely fiancée Anna being my official paparazzi and taking photos all the time! :)

Make sure to check out some of the photos she took:-

Monday, May 05, 2008

Model Checking with SPIN and Promela

As I mentioned in my previous post, one of the most valuable tools I recently discovered for debugging multithreaded applications is model checking; more specifically the Promela language and the SPIN model checker.

Promela isn't a programming language but a modelling language that you use to describe the algorithm you'd like to model. It's quite a simple language, without many of the features that we're used to in a proper programming language, such as functions, for loops, or even proper scoping of variables. At first those limitations bothered me, but what I soon learned is that those limitations are what make Promela feasible.

You see, when you write a Promela model of your algorithm, you should focus only on the high-level details of the algorithm and forget about any nasty implementation details. One of the things you should specify is correctness assertions, or the algorithm's invariants, which if violated would indicate that there's a bug in the algorithm.

After the model is ready, spin converts it into a C program, which you compile and run. Running this program constructs the state space of the system and tries to find a state that violates some invariant or assertion. Since the basic type of search is exhaustive, covering literally every possible state in the system, if it doesn't find an error then your algorithm is correct... assuming, of course, that the model and the assumptions within it are correct as well.

The thing I like the most about it is that when it finds an error, spin is capable of producing the shortest possible path to that error! In my case, while debugging my STM, this path is usually around 100 steps. That's it! So debugging a complex multithreaded algorithm is reduced to just reading these 100 or so steps and figuring out what went wrong! As easy as debugging a single-threaded application!

That said, model checking is not a silver bullet. First of all, you need to learn a new language and a new tool. Learning how to use SPIN and Promela is actually quite easy, but it's still one more thing to do before actually debugging. I also found the resources available online a bit lacking. Fortunately, I found the book The SPIN Model Checker at the library, and I think it's an excellent resource.

The other thing to keep in mind is that even if the model has no errors in it, that doesn't mean the algorithm is correct: the model itself could be wrong, and the invariants or assertions specified might not be correct either. Moreover, the more complicated the algorithm, the bigger the state space, and the more time and memory are required to verify the model. This applies to increasing the number of processes in the system as well. In my case, I was only able to do exhaustive searches for three processes. Adding one more process made it completely infeasible - an exhaustive search for 4 processes in my algorithm requires an estimated 4 terabytes of memory.

You could go back to theory and prove that if the model is correct for 3 processes then it's also correct for N, but until you do that you cannot be sure that it is.

Moreover, Promela only models the algorithm. Implementation issues are hidden. So memory ordering, cache coherence and performance all aren't reflected in the Promela model. You could try to account for some of these in the model, but that would just increase the state space significantly and it's still possible that you forgot something.

I did find that to be a plus though, the fact that the implementation details aren't there means that the algorithm is clearer when written in Promela than in C for example, which makes it easier to share the core algorithm with others. I guess that's why Ananian used Promela code in his PhD thesis.

Wednesday, April 30, 2008

Debugging Multithreaded Programs

A lot of my work involves writing the code for NZSTM, a software transactional memory scheme. This is in essence a multithreaded program, and anyone who's had to write one knows that it's not a walk in the park. The problem is that it involves reasoning about multiple things running in parallel, and all the possible interactions between them -- and we as human beings really have a hard time keeping track of concurrent events...

Moreover, debugging tools that work for single-threaded applications often don't work as well in a multithreaded environment. The debugger itself introduces some degree of synchronization to the program, and often when you use it to trace a multithreaded program, the bug just disappears (a heisenbug). Even debugging by printf can mess with the whole dynamics of the system, thereby hiding the bug... So what to do?

I find that careful consideration of the algorithm before coding it, and then making sure to code it as clearly as possible (worry about performance later) and incrementally, is one of the most important factors. Make sure to unit-test all the different components of the algorithm if possible, and whenever you change something, run a regression test to make sure it's still working fine. In a way this solution sounds a lot like saying that to lose weight all you have to do is eat less and exercise more -- easier said than done.

That said, some bugs just won't go away! What can you do at that point?

First off, make sure you use assertions. At almost every point where you can make a meaningful assertion about the state of the system, apply it. This is really helpful in narrowing down where the problem is before it manifests itself in other ways (i.e. crashing or producing incorrect results). Plus, assertions are also a good way of self-documenting the code: a person reading the code knows what the state of the system should (or shouldn't) be at that point.

The second thing I use is trace buffers: basically you preallocate a chunk of memory and store certain messages in it rather than printing them. It's sort of like debugging by printf, except that the messages are stored in the buffer rather than printed to the screen, and you can dump the buffer after the program finishes or when it crashes (by catching the signal). This method can still alter the dynamics of your code, but it's less invasive than debugging by printf, since printfs are quite expensive while writing to a trace buffer is significantly cheaper. A nice article that talks more about trace buffers can be found in Dr. Dobb's Journal.

The last element in my debugging toolbox is model checking, in my case using SPIN and Promela. I've heard about model checking a while back and I've used it in the context of Computer Architecture (Verilog). It never occurred to me that it could help me until I read C. S. Ananian's work, where he also uses model checking.

Now that I've used Promela, I've found it to be the best tool for debugging multithreaded programs. I'll write a separate entry on Promela later. But for the time being suffice it to say that I'd recommend this tool for anyone writing even mildly complicated parallel programs.

Keep in mind though that all of these tools just help out after the fact. The most important step really is to think through the algorithm carefully and try and think through all the corner cases.

"Program testing can be used to show the presence of bugs, but never to show their absence!" -Dijkstra

Sunday, April 27, 2008

Picasa 2.7 for Linux and More Photos!

I just got around to installing Picasa 2.7 for Linux. It's still a bit rough around the edges but it's getting there. One feature that finally made it to the Linux version is the ability to upload photos! Yay! This means it's now way easier to upload photos to my Picasa album....

Therefore, I went all crazy and uploaded photos of a bunch of stuff Anna and I have been doing since we got back to New Zealand this year. Check them out and let me know what you think...

Friday, April 25, 2008

Scandinavia and Shanghai Photos

So, I finally got around to uploading some photos of my trip to Scandinavia, where I went to meet up with Anna (my fiancée) and her family. I also visited good old Stefan in Stockholm and met up with Ninna there. Unfortunately it was quite a quick trip. I also managed to go to Gothenburg and give a talk, then got some sort of food poisoning or stomach flu! But I recovered quickly.

Then a quick trip to Copenhagen on the way to Shanghai, where we met up with Phoebe. Finally, back to Auckland where the PhD work was waiting...

Anyway, click on the album links below to check out the photos!

Thursday, April 24, 2008

Geocaching for Linux

I'm into geocaching. It combines my love of the wild outdoors with my geeky side! I now actually have an excuse to muck around with a GPS device! Anyhow, I'm also a Linux user. The problem is, the site's "send to garmin" feature unsurprisingly doesn't work on Linux. So I wrote a script in Python that takes cache waypoint numbers (e.g. GCVWMR) as parameters and sends the info to your garmin, along with a truncated hint. I learned Python while writing this script, so apologies for any weirdnesses therein. :)

This humble script is under the GPL3. But it would be nice if you could email me with any comments you have regarding how it works. I would also appreciate comments about my actual code (style for example).

Important: make sure to read the site's terms of use and don't use the script in any way that might violate those terms.

I've only tested this script using my Garmin eTrex Venture with a USB cable. You must have gpsbabel installed since my script uses it to actually send the data. Also you need to edit the script and enter your username and password since that's the only way to get the coordinates for the caches.

Here's the script:

Speaking of which, my favourite caches are Hidden in the Numbers and Battlefields. Check them out if you're in Auckland.

Edit to add: I haven't done any geocaching for a really long time now, and this script probably doesn't work. Leaving this here for historical reasons.

About Me

I am a hardworking Ph.D. student in Computer Science at the University of Auckland, New Zealand, doing research on Computer Architecture and Transactional Memory under Jim Goodman.