A lot of my work involves writing the code for NZSTM, a software transactional memory scheme. This is in essence a multithreaded program, and anyone who's had to write a multithreaded program knows that it's not a walk in the park. The problem is that it involves reasoning about multiple things running in parallel, and all the possible interactions between them -- whereas us as human being really have a hard time keeping track of concurrent events...
Moreover, debugging tools that are used with single-threaded applications often don't work as well in a multithreaded environment. The debugger itself introduces some sense of synchronization to the program, and often when used on a multithreaded program to try and trace it the bug just disappears (heisenbug). Even debugging by printf could mess with the whole dynamics of the system therefore hiding the bug... So what to do?
I find that careful consideration of the algorithm before coding it, then making sure to code it as clearly as possible (worry about performance later), and incrementally is one of the most important factors. Make sure to unit-test all different components of the algorithm if possible, and whenever you change something perform a regression test to make sure that it's still working fine. In a way this solutions sounds a lot like saying that to loose weight all you have to do is eat less and exercise more -- easier said than done.
That said, some bugs just won't go away! What can you do at that point?
First off, make sure that you use your assertions. Almost at every point where you can make a meaningful assertion about the state of the system do apply it. This is really helpful in narrowing down where the problem is before it actually manifests itself in other ways (i.e. crashing or producing incorrect results). Plus, adding assertions is also good as a method of self-documenting the code; a person who reads the code knows what the state of the system should (or shouldn't) be at that point.
Second thing I use is trace buffers; basically you preallocate a part of memory and store certain messages rather than printing them. It's sort of like debugging by printf except that the messages are stored in the buffer rather than printed to the screen, and you can dump the buffer after the program finishes or if it crashes (by catching the signal). Using that method still could alter the dynamics of your code, but it's less invasive than debugging by printf since printfs are quite expensive, while using a trace buffer is significantly cheaper. A nice article that talks more about trace buffers can be found in Dr. Dobbs Journal.
The last element in my debugging toolbox is model checking, in my case using SPIN and Promela. I've heard about model checking a while back and I've used it in the context of Computer Architecture (Verilog). It never occurred to me that it could help me until I read C. S. Ananian's work, where he also uses model checking.
Now that I've used Promela, I've found it to be the best tool for debugging multithreaded programs. I'll write a separate entry on Promela later. But for the time being suffice it to say that I'd recommend this tool for anyone writing even mildly complicated parallel programs.
Keep in mind though that all of these tools just help out after the fact. The most important step really is to think through the algorithm carefully and try and think through all the corner cases.
"Program testing can be used to show the presence of bugs, but never to show their absence!" -Dijkstra