Sunday, July 13, 2008

A Scalable Red Black Tree Based Structure with Hashing

I've been trying to figure out the best way to write scalable shared data structures in Erlang. In my last post, I talked about writing one based on regular binary search trees. That implementation would create a number of processes and divide a pre-determined key range among those processes.

That implementation suffered from a few problems: first of all, the keys had to be integers, and the range of those integers had to be known in advance. It performed and scaled well as long as the keys were well distributed; but if they weren't, and in the real world they probably aren't, things would get ugly.

Fortunately the solution to all these problems turned out to be quite simple. In that last post, autrepack commented that I should use a hash (I'm glad I have a blog), and that's what I did.

There's no need to know the range of the keys: I just create as many processes as needed, and whenever a request comes along I hash its key using erlang:phash2 and, based on that, decide which bucket (process) should handle it.
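
Roughly, the dispatch amounts to something like this (a minimal sketch, not the actual code in hrb.erl; it assumes the bucket processes are kept in a tuple of pids):-

%% Pick the bucket process that owns a key.
bucket_for(Key, Buckets) ->
    %% erlang:phash2(Term, Range) returns a value in 0..Range-1
    Index = erlang:phash2(Key, tuple_size(Buckets)) + 1,
    element(Index, Buckets).

%% Relay an operation on Key to whichever bucket process owns it.
send_to_bucket(Op, Key, Buckets) ->
    bucket_for(Key, Buckets) ! {Op, Key, self()}.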

This automatically solved three problems: I am no longer limited to integers as key values, I don't need to predetermine the range of my keys, and the load should be balanced (assuming that phash2 is a good hash function). Three birds with one stone - merci autrepack!

Also, to balance things a bit more, the data structure is now based on red-black trees instead of plain binary search trees. The cool thing about them is that they keep the tree reasonably balanced and offer worst-case performance guarantees.

The result of my work can be found at:-
http://www.cs.auckland.ac.nz/~fuad/hrb.erl

I haven't tested this yet, but I imagine that under the tests I already have it would perform slightly worse than before. That's because my previous tests used uniformly generated random keys, which pretty much guarantees that everything is well distributed, so the balancing that a red-black tree pays for never gets a chance to help. However, I believe that under real-world load this should perform better.

The cool thing is, this really was easy to write compared to implementing a similar solution in other languages.

Thursday, July 10, 2008

A Scalable Binary Search Tree Based Structure in Erlang

In my previous post I talked about writing a scalable binary search tree, and the difficulties I was having.

I believe I finally managed to create the structure I wanted. The main problem I was having is that I was still thinking in C (or sequential languages in general), so I wanted to build a single binary search tree (bst) that scales well. What I realized, though, is that rather than doing that, I should create N bsts, where N is proportional to the number of processors available in the system.

So what I'm doing now is spawning a number of threads (servers): each thread is responsible for a bst that covers a range of keys, and whenever an operation is performed, the client sends it to the particular server responsible for the tree covering that range. If this doesn't make sense, then maybe my code will make it clearer:-

http://www.cs.auckland.ac.nz/~fuad/dbst.erl (code needs better error handling, general tidying up)
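
The gist of the client-side dispatch is something like the following (a sketch only, not the code in dbst.erl; it assumes a fixed key range 0..MaxKey and a tuple of server pids):-

%% Split the key range 0..MaxKey evenly across the servers and pick the
%% server for a key with plain integer arithmetic.
server_for(Key, Servers, MaxKey) ->
    PerServer = (MaxKey + tuple_size(Servers)) div tuple_size(Servers),
    element(Key div PerServer + 1, Servers).

%% The client then relays the operation to that server's tree.
send_op(Op, Key, Servers, MaxKey) ->
    server_for(Key, Servers, MaxKey) ! {Op, Key, self()}.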


The main problem with this implementation, though, is that if the distribution of the keys isn't uniform within the expected range, it won't perform well. But I guess that is a problem with unbalanced bsts in general.

That said, this solution turned out to be simpler than other solutions I've attempted, performs better in the single-threaded case, and scales pretty well.

Here's a link to a graph that shows how well this scales on an 8 core machine:-


Compare with my previous attempt:-


Can't really ask for better than that.

What I need to do now is figure out a way to balance the load somehow when it's not uniform.

Monday, July 07, 2008

Problems Writing a Scalable Shared Data Structure in Erlang

[This is adapted from a post I sent to erlang-questions.]

A while back I posted a question asking how to write a scalable shared data structure in Erlang; it can be anything really but I decided to go with a binary search tree (bst) since it's a simple data structure that should scale well with load (as long as it's balanced). I talked about that attempt in my previous post.

Anyway, I finished writing the code, which I'm assuming (read: hoping) is correct. :) What I found, though, is that it's not scaling at all. By that I mean that as the number of threads/processes accessing the bst increases, throughput doesn't improve anywhere near proportionally.

I've implemented the nodes as processes, and all operations on the tree are relayed down to the node that needs to act on them. The code for this implementation can be found here:-
http://www.cs.auckland.ac.nz/~fuad/cbst.erl

I did my tests on a Sun Fire V880 with 8 900-MHz UltraSPARC-III processors and 16 GBs of main memory; running Solaris 10 and Erlang (BEAM) 5.6.3.

My tests consisted of randomly creating and populating a tree with a key range of 0-1000, and then performing 100,000 operations, each chosen at random (uniform distribution) from insert(), delete() and contains(). I would vary the number of threads from 1 to 8, and the load (the 100,000 operations) would be divided evenly among the threads. I ran three operation mixes: contains() only (designated 1:0:0), 8 contains() to 1 insert() and 1 delete() (8:1:1), and an equal mix of all three (1:1:1).
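
For concreteness, one thread's share of the load under, say, the 8:1:1 mix might look like the following sketch (the function names and the rand module usage are illustrative; this is not taken from cbst.erl):-

%% Run OpsLeft operations, choosing each one at random in an 8:1:1 ratio.
run(_Tree, 0) ->
    done;
run(Tree, OpsLeft) ->
    Key = rand:uniform(1001) - 1,          % keys in 0..1000
    case rand:uniform(10) of
        R when R =< 8 -> contains(Tree, Key);
        9             -> insert(Tree, Key);
        10            -> delete(Tree, Key)
    end,
    run(Tree, OpsLeft - 1).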

In a perfect world, 2 threads would do the job in half the time 1 thread would, but of course nothing is perfect. So as a baseline I performed another test in which each thread repeatedly calls a function that does nothing but return true, to see how well that scales (labelled Best Scalability in the graph). This used the same functions to distribute the load, except that randOperation() would only return true rather than do anything useful.

Anyway, you can find the graph (where the results are normalized to the wall time for one processor) at:-
http://www.cs.auckland.ac.nz/~fuad/bst-scale.png (the lower the line goes, the better the scalability)

and the raw data (showing the wall time in ms) at:-
http://www.cs.auckland.ac.nz/~fuad/bst-scale.csv


As you can see, my data structure isn't scaling well at all, regardless of the workload. I would expect it to scale well, since the tree should be roughly balanced. In C I would know how to write an implementation where at least contains() scales well; writing something where tree modification scales is a bit more difficult, but should be doable. However, I am new to Erlang and I can't really reason about all the issues well enough to pull off a similar implementation.

In a nutshell, my question is: what am I doing wrong? Is there a better way to store shared data in a scalable manner?

Monday, June 30, 2008

Concurrent and Shared Binary Search Tree in Erlang

So, wanting to work a bit more with Erlang, and to try and get a better understanding of how to write parallel programs in Erlang, I decided to write a concurrent Binary Search Tree (a basic one, not balanced) and see how that goes. I thought a binary search tree would be a good candidate since operations normally affect only a small subset of the tree, so quite a bit of concurrency should be doable. Of course, the more balanced the tree, the better the expected concurrency.

Since I'm new to Erlang, I started by trying to parallelize a simple Erlang implementation, similar to the one in Concurrent Programming in Erlang. This implementation is essentially a tuple, and adding or removing nodes consists of modifying the tuple.
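
That style of tree looks roughly like this (a sketch of the representation, not the exact code I started from): a tree is either the atom nil or a nested tuple {Key, Value, Smaller, Bigger}.

%% Insert by rebuilding the path from the root down to the new leaf.
tree_insert(Key, Value, nil) ->
    {Key, Value, nil, nil};
tree_insert(Key, Value, {K, V, Smaller, Bigger}) when Key < K ->
    {K, V, tree_insert(Key, Value, Smaller), Bigger};
tree_insert(Key, Value, {K, V, Smaller, Bigger}) when Key > K ->
    {K, V, Smaller, tree_insert(Key, Value, Bigger)};
tree_insert(Key, Value, {Key, _OldValue, Smaller, Bigger}) ->
    {Key, Value, Smaller, Bigger}.   % same key: replace the value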

What I did was create a process that holds the current state of the BST tuple and waits for messages that tell it what to do: it can either insert, delete or look up a node in the tree. Of course, this basic implementation processes such requests sequentially.

I think I managed to successfully parallelize lookups, since I spawn a new process to search through the tree, but I couldn't figure out a way to parallelize insert or delete. Spawning a new process to modify the tree in the same manner wouldn't work, since I need all the modifications to be consolidated somewhere. But at least this implementation should be able to run multiple lookups in parallel with at most one insert/delete operation.
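
Put together, the server loop I'm describing looks something like this sketch (tree_insert as above; tree_delete and tree_lookup assumed analogous; the message names are illustrative, not necessarily those in the linked code):-

%% The owning process serialises modifications, but farms each lookup out
%% to a short-lived worker so several lookups can run at once.
loop(Tree) ->
    receive
        {insert, Key, Value} ->
            loop(tree_insert(Key, Value, Tree));
        {delete, Key} ->
            loop(tree_delete(Key, Tree));
        {lookup, Key, From} ->
            Snapshot = Tree,   % immutable data: the worker can read this version safely
            spawn(fun() -> From ! {lookup_result, Key, tree_lookup(Key, Snapshot)} end),
            loop(Tree)
    end.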

Here's a link to the code I've mentioned above.

* * *

So I asked the Erlang questions mailing list what the best way to go about solving the problem would be, and the answer I got is that I need to think of each node as a process in itself. While this isn't how you'd tackle such a problem in C, in Erlang processes are quite cheap.

So each node is a spawned instance of the following function:-

% Maintains the state of the node
nodeState(Node) ->
    receive
        {contains, Key, Requester} ->
            contains(Key, Node, Requester),
            nodeState(Node);

        {insert, Key} ->
            nodeState(insert(Key, Node));

        {delete, Key} ->
            nodeState(delete(Key, Node));

        {node, Requester} ->
            Requester ! Node,
            nodeState(Node);

        {dump, Requester} ->
            dump(Node, Requester),
            nodeState(Node)
    end.


Whenever a node gets a request, it either processes it itself if it has the correct key, or it passes it on down the tree.
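
As a sketch of that forwarding, contains might look something like the following (assuming a node state shaped like {Key, Left, Right}, where Left and Right are child pids or the atom nil; this isn't necessarily the exact state used in my code):-

%% Either answer the requester directly or forward the request to a child.
contains(Key, {Key, _Left, _Right}, Requester) ->
    Requester ! true;
contains(Key, {NodeKey, Left, _Right}, Requester) when Key < NodeKey, Left =/= nil ->
    Left ! {contains, Key, Requester};
contains(Key, {NodeKey, _Left, Right}, Requester) when Key > NodeKey, Right =/= nil ->
    Right ! {contains, Key, Requester};
contains(_Key, _Node, Requester) ->
    Requester ! false.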

The complete code is here.

Implementing contains and insert was pretty straightforward, since data flows in only one direction (down the tree). delete was a bit more challenging: when deleting a node that has children, you need to find a good replacement for it and delete the replacement instead (i.e. replace it with the maximal node in its left subtree and then delete that node). This means that messages can flow in the opposite direction, which creates some potential for deadlock.
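
For example, finding the replacement (the maximum of the left subtree) can itself be phrased as a request that travels down the tree and a reply that comes back up, along these lines (same assumed node shape as in the earlier sketch; not the exact code):-

%% A node asked for its subtree's maximum either answers (no right child)
%% or forwards the question to its right child.
find_max({Key, _Left, nil}, Requester) ->
    Requester ! {max, Key};
find_max({_Key, _Left, Right}, Requester) ->
    Right ! {find_max, Requester}.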

That said, I have to admit that it was significantly easier to write this kind of algorithm in Erlang than it would have been in C; that is, of course, assuming that my code is correct.

A note about the interface (the rpcs for insert, delete, contains): no acknowledgement is sent after an insert or a delete, while contains of course waits until a response is received. I hope that this implementation is correct in the sense that if something is inserted and contains is then called, the response should be positive unless another process deleted what was inserted, and vice versa.
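
The client side of that interface is essentially the following (a sketch matching the message shapes in nodeState above; Root is the pid of the root node's process, and the function names are assumed):-

%% insert and delete are fire-and-forget; contains blocks for the reply.
insert(Root, Key) ->
    Root ! {insert, Key},
    ok.

delete(Root, Key) ->
    Root ! {delete, Key},
    ok.

contains(Root, Key) ->
    Root ! {contains, Key, self()},
    receive
        Result -> Result
    end.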

Now what still remains is a sanity test to see whether my implementation is correct (Dijkstra: testing can be used to show the presence of bugs, but never to show their absence). But that's all I can do for the time being.

I also need to figure out how to write a benchmark to test performance. This is going to be tricky, since processes do not acknowledge that operations have been completed. I could add acknowledgements, but that would add more overhead....

So, what I like about this implementation is that it was relatively easy to reason about; what I don't like is the overhead of a process per node. I recall reading that an Erlang process uses up something in the neighbourhood of 200 bytes; so if the data stored per process is small, this overhead is substantial.

Next step; write a concurrent Red-Black tree!

Monday, June 16, 2008

Full Red Black Tree Implementation in Erlang

In my previous post I talked about Red Black trees and how to implement them in Erlang. I only dealt with inserting into a Red Black tree. I finally got around to finalizing this implementation by adding the code for deleting from a Red Black tree.

There were some challenges; the ones that stand out the most are:-
  • The algorithm for deleting is not an easy one, and I actually had to understand it this time (as opposed to just rewriting pseudocode into C, as when I did the C implementation). This link, which Robert Virding pointed out to me, was really helpful in this respect.
  • The fact that variables are immutable in Erlang, and the whole functional programming way of thinking is still not coming naturally to me. I guess I need to practice more.
  • Erlang doesn't allow calling your own functions in guards, even if those functions have no side effects. You'd think it would be trivial for the compiler to check for that, or for the VM to throw an exception at runtime if it encounters a guard with side effects; but apparently it's simply not allowed, for historical reasons or something. Fortunately, with help from the erlang-questions list, I was able to get around this limitation by using macros (there's a small sketch right after this list).
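
Here's a small illustration of the macro trick (the module and macro names are made up for illustration; these aren't the exact macros in my code). A guard can't call a helper function like isRed/1, but a macro can expand into an expression that is legal in a guard:-

-module(guard_macro).
-export([color/1]).

%% Nodes look like {Key, Value, Color, Left, Right}; element/2 is allowed
%% in guards, so the macro expands into a guard-legal test.
-define(IS_RED(N), (element(3, N) =:= r)).

color(Node) when ?IS_RED(Node) -> red;
color(_Node)                   -> black.
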
With all of that said, and despite the learning curve involved, I found it significantly easier to write what seems to be a correct implementation of a Red Black Tree in Erlang than it was in C. The code is significantly more concise and, I believe, easier to read and understand.

For those who are interested, here's a link to my full Erlang Red Black Tree implementation. As I mentioned before, I'm still new to Erlang, so I would greatly appreciate any comments you might have on my code, whether related to style or to the actual algorithm. I know a couple of places where my code could be optimised, but I was aiming for clarity; if you have any suggestions that would improve both, I'd like to know about them.

Now off to do some work...

Monday, June 09, 2008

Erlang and Red-black Trees

Update: Full RBTree implementation in Erlang.

Recently I've been trying to pick up Erlang, a message-based concurrent functional programming language. The main difficulty I'm having is wrapping my head around the whole functional thing and dealing with single assignment (immutable data). I thought what better way to try and get it than to implement my favourite data structure - the venerable Red-black tree!

The RB Tree is a self-balancing binary search tree, and all the colouring and rebalancing business can be a bit tricky. To give you an idea, my C implementation is roughly 1000 lines of code (including comments and whitespace).

To be honest, it hasn't been easy. Fortunately, I stumbled upon this entry which mentioned a functional implementation in Haskell. With my good friend Nick's help in understanding the Haskell bit, I managed to whip up an Erlang implementation in less than an hour and less than 50 lines! One caveat - so far it only inserts, but I'll be working on it more.

To understand this implementation, make sure to refer to the paper I mentioned above first, since the algorithm is described there.

So, here's the code:-

-module(rbtree).
-export([insert/3, test/0]).

% Node description:-
% {Key, Value, Color, Left, Right}

balance({Kz, Vz, b, {Ky, Vy, r, {Kx, Vx, r, A, B}, C}, D}) ->
    {Ky, Vy, r, {Kx, Vx, b, A, B}, {Kz, Vz, b, C, D}};

balance({Kz, Vz, b, {Kx, Vx, r, A, {Ky, Vy, r, B, C}}, D}) ->
    {Ky, Vy, r, {Kx, Vx, b, A, B}, {Kz, Vz, b, C, D}};

balance({Kx, Vx, b, A, {Kz, Vz, r, {Ky, Vy, r, B, C}, D}}) ->
    {Ky, Vy, r, {Kx, Vx, b, A, B}, {Kz, Vz, b, C, D}};

balance({Kx, Vx, b, A, {Ky, Vy, r, B, {Kz, Vz, r, C, D}}}) ->
    {Ky, Vy, r, {Kx, Vx, b, A, B}, {Kz, Vz, b, C, D}};

% No rebalancing needed
balance({Key, Value, Color, Left, Right}) ->
    {Key, Value, Color, Left, Right}.

% Inserting into an empty tree
ins(Key, Value, {}) ->
    {Key, Value, r, {}, {}};

% Collision with an existing key : leave it as it is
ins(Key, _, {Key, Value, Color, Left, Right}) ->
    {Key, Value, Color, Left, Right};

ins(Key, Value, {Key2, Value2, Color, Left, Right}) when Key < Key2 ->
    balance({Key2, Value2, Color, ins(Key, Value, Left), Right});

ins(Key, Value, {Key2, Value2, Color, Left, Right}) when Key > Key2 ->
    balance({Key2, Value2, Color, Left, ins(Key, Value, Right)}).

% Ensures that the root is black
makeBlack({Key, Value, _, Left, Right}) ->
    {Key, Value, b, Left, Right}.

insert(Key, Value, Tree) ->
    makeBlack(ins(Key, Value, Tree)).