Reliably rebooting Ubuntu using watchdogs
Rebooting Ubuntu is hard. I don’t really know why, but in my twelve years as an Ubuntu user, I’ve encountered countless “stuck at reboot” scenarios. Somehow, typing reboot always comes with that extra special feeling of uncertainty and the thrill of danger — Will it come back? Where will it get stuck this time? If it’s your home computer or your laptop, that’s fine, because you can always manually hard reset. If it’s a remote computer to which you have IPMI access, it’s a little bit annoying, but not tragic. But if you’re attempting to reboot tens of thousands of devices across the globe, that level of uncertainty is nothing short of terrifying.
I know I’m being unfair, because more often than not, rebooting Ubuntu actually completes successfully. However, my incredibly unscientific estimate of how often things get stuck forever on shutdown or reboot is this: 1-3%. That’s how often I believe reboots hang. That’s shockingly high, right? Well, I pulled that out of my hat, but that estimate is based on many hundred thousands of reboots I’ve witnessed in our fleet of backup devices. That number is not too terrible when you deal with a handful of machines that you rarely ever reboot. It is, however, incredibly terrible if you reboot tens of thousands of devices running Ubuntu every two weeks as part of an upgrade process (I wrote about our image based upgrade mechanism in another post).
This post describes the short story of how we managed to make Ubuntu machines reliably reboot.