I’ve been around Linux and UNIX for quite some time and one thing that has always piqued my interests is debugging broken software. Bryan Cantrill made some excellent points on why postmortem debugging is needed at DOCKERCON and the following video is a must watch:
His points on restarting a broken container w/o root causing the source of the failure is SPOT ON! I also love his mad cow analogy. I’ve had the same mind set since I started managing infrastructure and I find the whole root cause process exciting and fun. Who doesn’t love looking at backtraces, registers and memory on the stack and heap?!!?! Most admins I’ve met like debugging but they dread seeing the following:
$ ./badproc
Segmentation fault (core dumped) ./badproc
I on the other hand start to drool when a piece of software I manage (but didn’t write) encounters a fatal condition that leads to its demise. If I can’t locate a bug report with a fix I’ll grab a cup of coffee, ensure debugging symbols are present and fire up gdb to root cause the failure. My first experience root causing a segmentation violation was with snort. This was an extremely valuable learning experience and at the time the Internet had limited resources explaining stack layouts, memory organization and how gdb can be used to locate problems. Now that conferences and individuals are posting high quality material to Youtube two clicks will get you access to amazing gdb resources like this (all three videos are definitely worth watching):
We also have access to step-by-step software debugging guides like the one Brendan Gregg posted to his blog last year. This coming weekend I will be immersing myself in another epic debugging session and I can’t wait to see what I find (and learn). We all need to learn to embrace the unhappy signals that take down our applications. You learn a TON by doing so and make the opensource world better at the same time.