I ran into a fantastic bug at work this past week. We had a component which would crash once every couple of weeks. Memory usage wasn't growing over time, and CPU utilization wasn't increasing either. There wasn't any pattern to the specific files which were being processed at the time of the crash. A break came when we realized that it consistently crashed after processing approximately thirty-two thousand files.
Further, tracing through our logs we were pretty certain that the line causing the crash was this one:
And we were also finding this error message in our logs:
I'm not sure exactly how I would have approached this a year ago, but I had recently read Joe Armstrong's interview in Coders at Work, where he mentions how he frequently writes simple scripts to check performance, toy with algorithms and so on. That seemed like exactly the right approach for figuring out this problem.
First I was curious if the issue was in global:register_name/2, so I wanted to try registering a single process more than 32,000 times.
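A minimal sketch of that kind of test (not the exact script) might look like the following. The module name reg_same and the {reg_same, I} naming scheme are made up for illustration; only global:register_name/2 comes from the text above.

```erlang
-module(reg_same).
-export([run/1]).

%% Hypothetical test: register ONE long-lived process under N different
%% global names. Returns how many registrations answered 'yes'.
run(N) ->
    %% A single process that waits until it is told to stop.
    Pid = spawn(fun() -> receive stop -> ok end end),
    Results = [global:register_name({reg_same, I}, Pid)
               || I <- lists:seq(1, N)],
    %% Let the process exit; global drops the names of dead processes.
    Pid ! stop,
    length([R || R <- Results, R =:= yes]).
```

From the Erlang shell this could be called as reg_same:run(40000). Only one extra process is ever alive, so the process count stays flat no matter how large N is.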
This did generate a number of errors (global doesn't like a single pid being registered with multiple names), but it didn't ever generate the "Too many processes" error, so it seemed that the issue could not be reproduced by registering the same process more than 32,000 times.
The next step was to try registering 32,000 different processes.
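Again as a sketch rather than the exact script, a test along these lines spawns a fresh process for every name and never lets any of them exit. The module name reg_many and the naming scheme are hypothetical.

```erlang
-module(reg_many).
-export([run/1]).

%% Hypothetical test: spawn N processes that block forever and register
%% each one under its own global name. Every spawned process stays alive,
%% so the process count grows by one per iteration.
run(N) ->
    lists:foreach(
      fun(I) ->
              Pid = spawn(fun() -> receive stop -> ok end end),
              global:register_name({reg_many, I}, Pid)
      end,
      lists:seq(1, N)).
```

Calling reg_many:run(40000) is the interesting case. Whether it actually hits the limit depends on the emulator's configured maximum, which is much higher by default on newer Erlang/OTP releases than the figure discussed here.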
The second attempt produced the desired failure, failing with the "Too many processes" message after running for a bit. Finally, I needed to clarify whether the issue was caused by creating too many processes or by registering too many different processes.
To do that, I added a timeout to the spawned functions so that they only lived for one second instead of lasting forever.
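A sketch of that variation, with the same caveats as before (reg_short and the naming scheme are hypothetical), only changes the body of the spawned function so it gives up after 1000 ms:

```erlang
-module(reg_short).
-export([run/1]).

%% Hypothetical test: like spawning and registering N processes, except
%% each process exits on its own after one second. Many names still get
%% registered overall, but few processes are alive at any one moment.
run(N) ->
    lists:foreach(
      fun(I) ->
              Pid = spawn(fun() ->
                                  receive stop -> ok
                                  after 1000 -> ok
                                  end
                          end),
              global:register_name({reg_short, I}, Pid)
      end,
      lists:seq(1, N)).
```

This assumes registration is slow enough that earlier processes have exited before the count of live ones gets anywhere near the limit.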
This script runs all the way to 100,000 instead of crashing at 32,000. It turned out the exact number is 32,768, and it can be overridden by passing +P at the command line, as described in the erl documentation.
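To see what limit a running node actually has, and how close it is to it, something like the following small helper could be used. The module name proc_limit is made up; erlang:system_info/1 with process_limit and process_count is a standard call.

```erlang
-module(proc_limit).
-export([report/0]).

%% Print and return the number of live processes and the configured
%% maximum for this emulator instance.
report() ->
    Limit = erlang:system_info(process_limit),
    Count = erlang:system_info(process_count),
    io:format("processes: ~p of ~p~n", [Count, Limit]),
    {Count, Limit}.
```

Starting the emulator with, for example, `erl +P 100000` and then calling proc_limit:report() should show a raised limit. The emulator may pick a value larger than the one requested, so the reported number is not necessarily exactly what was passed.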
Of course, increasing the number of allowable processes wasn't quite the right solution. The real solution was to figure out why we had 32,768 processes hanging around and not getting cleaned up. It turns out that, in a partially implemented section of the code, a process was being spawned to wait for a message which it never received.
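The shape of that kind of leak, and one common way to bound it, can be sketched as follows. This is an illustration only, not the actual code from our component; the module name waiter and the {reply, _} message are hypothetical.

```erlang
-module(waiter).
-export([start_leaky/0, start_bounded/1]).

%% Leaky version: if no {reply, _} message ever arrives, this process
%% blocks in receive forever and is never cleaned up.
start_leaky() ->
    spawn(fun() -> receive {reply, _} -> ok end end).

%% Bounded version: the 'after' clause makes the process give up and
%% exit once TimeoutMs milliseconds pass without a reply.
start_bounded(TimeoutMs) ->
    spawn(fun() ->
                  receive {reply, _} -> ok
                  after TimeoutMs -> timed_out
                  end
          end).
```

A timeout is only one option. Linking or monitoring the waiting process so that it dies along with whatever was supposed to message it is another.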
Going forward I think I'll be writing many more small test scripts as the latest addition to my debugging toolkit.