Monday, August 9, 2010

Don't Be Stupid


Time for a bit of self-flagellation, I think. Nothing like public humiliation to remind you not to do bad things. Feel free to skip this if you're not into programming. The takeaway message is that I'm a bumbling dolt - not that most of you aren't aware of it already.

Here's the thing: I'm creating some simulations in Python, a common programming language. Well, the simulation itself runs as highly optimized parallel C++ code (we use Nest; it's good, take a look), but we use Python to create the model, set up and run the simulation, and collect and save the resulting data.

We run quite a lot of simulations, so I want to collect the data from each simulation run into its own directory. So, towards the end of the simulation we create a new directory:

os.makedirs(catalog)

Now, the problem is that if the directory already exists this call will fail: Python will complain and stop. We need to check that the directory doesn't already exist first:

if not os.path.exists(catalog):
    os.makedirs(catalog)

Great! Problem solved. Job well done, beers for everyone, pat yourself on the back. Except...

These simulations are pretty heavy. They take a long time to run. To make things go faster we use many CPUs in parallel. On my desk I have an 8-CPU computer which helps a bit, and I have access to 256 CPUs in a cluster in Tokyo if I need it. Eventually, of course, this project (where this model is just one part) aims to use the new supercomputer in Kobe when it comes online next year, and that one has more CPUs than you can shake a very long stick at.
 
This means that the Python code above isn't run just once per simulation, but once for each CPU we use (some of you probably see where this is going already). Normally this is not a problem: the first process to get there will create the new directory, and the other processes will find the directory already in place and skip the creation.

But here's the problem: there's a tiny window of time between checking whether the directory exists and creating it if it doesn't. What happens if two processes hit this code at the same time? Both find that the directory does not exist, one of them creates it, then the other tries to create it too - it wasn't there a moment ago, after all - and the whole simulation fails. That's what happens.

This is called a "race condition", because you have two (or more) processes 'racing' each other, and you get different results depending on which one happens to get there first.
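
If you want to see the race in action, here's a minimal sketch that triggers it on purpose (the worker function, the directory name and the artificial delay are all mine, invented for the demonstration - none of this is our actual simulation code). A handful of processes run the same check-then-create code against the same directory, and a short sleep widens the window so the collision happens almost every time:

import os
import time
import shutil
from multiprocessing import Process

catalog = "results"

def worker():
    if not os.path.exists(catalog):
        # The window is open: another process may create the
        # directory right about now...
        time.sleep(0.01)
        os.makedirs(catalog)  # ...and then this line blows up

if __name__ == "__main__":
    shutil.rmtree(catalog, ignore_errors=True)  # start from a clean slate
    procs = [Process(target=worker) for _ in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

Run it and all but one of the workers should die with the very same OSError that bit me.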

The chance of this actually happening - that two processes manage to get the timing so exactly wrong - seems really slim, of course. It may indeed be a rare and unlucky event if you run a simulation only once. But if you're running lots of simulations, and using lots of CPUs, then you're bound to get hit by this bug sooner or later.

And I did, this weekend. I had started a long series of simulations to run over the weekend. As I came to work this morning I found that the series had failed about halfway through because of this bug. Bad programmer. Bad, bad programmer. No coffee for you.

So, instead of almost three days' worth of simulation data to analyze, I now have to spend another day generating the missing stuff - a day I can ill afford with deadlines looming like thunderclouds over my head.

What should I have done? What should you do? Something like this:

import errno
import os

try:
    os.makedirs(catalog)
except OSError as e:
    # The directory already existing is fine; anything else is a real error
    if e.errno == errno.EEXIST:
        pass
    else:
        raise

This is Python's way of dealing with exceptions - run-time errors such as failing to create a directory. We don't check beforehand whether the directory exists, but simply try to create it. If we fail, we don't just stop. Instead we catch the error (that's the "except" bit). If the error is that the directory already exists we just ignore it and continue the simulation (the "pass" thing). If it was some other kind of error we send it on for the system to take care of ("raise", as in raising a flag to alert that something is wrong).
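
If you create directories like this in more than one place, it's worth wrapping the pattern up once and for all. A small helper along these lines would do it (the name makedirs_exist_ok is my own invention, not a standard function):

import errno
import os

def makedirs_exist_ok(path):
    # Create path; treat "already exists" as success, re-raise anything else
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise

Then every script just calls makedirs_exist_ok(catalog) and the race is taken care of in one place.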

I did not do this. Which made me lose a day's worth of data analysis. And makes me a blockhead. Don't be a blockhead like me. Take care of race conditions when you program. Do proper error checking and recovery. Think of the future. Think of the children.

4 comments:

  1. Quite useful to see some real-life usage of exceptions. I'm learning Python now but didn't get to read much about or use exceptions yet.
    I will probably not run into parallel processing problems anytime soon, though: Quantum optics simulations are pretty light on computing power.

  2. It's probably a good idea to get into the habit of doing this sort of thing right from the beginning. I know about this sort of thing and it still got me. The reason I tripped up here was really that I first wrote this data-saving code for a very early, small version of the model over a year ago, and I've just kept using it even as the model itself has changed and grown.

    Of course, you have to temper the urge to Do It Right with the feeling that It's Good Enough; if we had to implement full error recovery into every single one-shot or throwaway script then we'd never get anything done.

  3. Janne, don't beat yourself up about this. While I do understand the implications schedule-wise, we've all (all of us who've written parallel code, that is) run into race conditions once or twice. Parallel computing is not as natural as "normal" computing is (assuming there is such a thing as "normal" computing, that is).

    First comment from me as I'm (relatively) new here. Found your blog through TOP some time ago already, but I finally started reading it only recently. Glad I did start though, as I really enjoy both your posts and photographs. Thanks for sharing and keep 'em coming!

    Cheers,
    Thomas

  4. Catching up on my reader feeds…

    It's funny, I've been reading your blog for a while (interesting stories, interesting photography), and yet I've never realized that you sometimes deal with the same things I do :)

    A similar experience I had was that (on Linux) some things in /proc are not synchronised across all CPUs, so if you run just one thread all might be fine, but once you scale to N threads and rely on /proc being consistent you'll get occasional random errors whose cause you can't really reproduce… I would assume this to be even more true on big computers than on normal servers.

    regards,
    iustin

