I just spent several (as in > 8) hours between last night and tonight diagnosing what appeared to be a threading issue in a fairly large and complex code base, and it just reminded me just how hard diagnosing race conditions in multi-threading code can be.

While the code has some unit tests (not as many as I'd like; working on that), the problem only became apparent in a larger, very specific scenario involving long execution times and a big load (in fact running the whole scenario uses 100% CPU time for more than 10 minutes). The entire test scenario actually moves some 40.000 messages or so.

Directly diagnosing these things in the debugger is usually impossible. Either the debugger gets in the way (as it affects how multi-threaded code runs) or you can only see the results of the problem (i.e. catch an unexpected exception as it is thrown, which may be too late to see what the problem actually was). Unfortunately this may mean resorting to good old Debug.WriteLine().

One thing that made this problem a lot harder to diagnose than I expected was that trying putting a simple Debug.WriteLine() statement is some significant points of the code base, cause the problem to simply go away. Not cool. This mean I couldn't output too much data and instead had to infer a few things from what little trace information I could get.

I eventually was able to get to the source of the problem (at least I think I did). Turned out to be, indeed, a nasty race condition: A buffer was compacted and reused while some upper layers of the stack were still holding on to "views" of sections of the buffer without actually having consumed them yet (think something like sliced ByteBuffers in Java). Right now, I have the local scenario running with no errors, though I still have some cleanup to do after messing around a lot to get to the bottom of the problem :)

Technorati tags: ,

Tomas Restrepo

Software developer located in Colombia.