My name is Philipp C. Heckel and I write about nerdy things.
This site moved here from!


  • Oct 08 / 2020

Reliably rebooting Ubuntu using watchdogs

Rebooting Ubuntu is hard. I don’t really know why, but in my twelve years as an Ubuntu user, I’ve encountered countless “stuck at reboot” scenarios. Somehow, typing reboot always comes with that extra special feeling of uncertainty and the thrill of danger — Will it come back? Where will it get stuck this time? If it’s your home computer or your laptop, that’s fine, because you can always manually hard reset. If it’s a remote computer to which you have IPMI access, it’s a little bit annoying, but not tragic. But if you’re attempting to reboot tens of thousands of devices across the globe, that level of uncertainty is nothing short of terrifying.

I know I’m being unfair, because more often than not, rebooting Ubuntu actually completes successfully. However, my incredibly unscientific estimate of how often things get stuck forever on shutdown or reboot is this: 1-3%. That’s how often I believe reboots hang. That’s shockingly high, right? Well, I pulled that out of my hat, but that estimate is based on many hundred thousands of reboots I’ve witnessed in our fleet of backup devices. That number is not too terrible when you deal with a handful of machines that you rarely ever reboot. It is, however, incredibly terrible if you reboot tens of thousands of devices running Ubuntu every two weeks as part of an upgrade process (I wrote about our image based upgrade mechanism in another post).

This post describes the short story of how we managed to make Ubuntu machines reliably reboot.

I cross-posted this post on the Datto Engineering Blog. Feel free to head over there and check out the other cool things our engineers do.

Continue Reading

  • Sep 18 / 2019

Image based upgrades: Upgrading software and OS of 80k servers every two weeks

Anyone that’s ever managed a few dozen or hundreds of physical servers knows how hard it can become to keep all of them up-to-date with security updates, or in general to keep them in sync with their configuration and state. Sysadmins typically solve this problem with Puppet, or Salt or by putting applications in a container. While those are great options if you control your environment, they are less applicable when you think about Datto‘s BCDR appliance (or really any appliance/server that doesn’t reside in your infrastructure). On top of that, replacing the kernel, major distribution upgrades or any larger upgrades that require a reboot are not covered by these solutions.

Being faced with this problem for Datto’s fleet of BCDR devices, we started exploring alternative options and came up with something that has worked reliably for almost two years for a fleet of now over 80,000 devices. In this blog post, I’d like to talk about how we solved this problem using images, loop devices and lots of Grub-magic. If you’d like to know more, keep reading.

I originally published this post on the Datto Engineering Blog. Feel free to head over there and check out the other cool things our engineers do.

Continue Reading

  • Jul 22 / 2019

Deduplicating NTFS file systems (fsdup)

At Datto, we store hundreds of thousands of block-level backups for our customers. Since our customer base is mostly Windows focused, most of these backups are copies of NTFS file systems. As of today, we’re not performing any data deduplication on these backups, which is pretty crazy considering that how well you’d think a Windows OS will probably dedup.

So I started on a journey to attempt to dedup NTFS. This blog post briefly describes my journey and thoughts, but also introduces a tool called fsdup I developed as part of a 3 week proof-of-concept. Please note that while the tool works, it’s highly experimental and should not be used in production!

Continue Reading

  • Aug 05 / 2018
  • 32
Linux, Security

Using Let’s Encrypt for internal servers

Let’s Encrypt is a revolutionary new certificate authority that provides free certificates in a completely automated process. These certificates are issued via the ACME protocol. Over the last 2 years or so, the Internet has widely adopted Let’s Encrypt — over 50% of the web’s SSL/TLS certificates are now issued by Let’s Encrypt.

But while there are many tools to automatically renew certificates for publicly available webservers (certbot, simp_le, I wrote about how to do that 3 years back), it’s hard to find any useful information about how to issue certificates for internal non Internet facing servers and/or devices with Let’s Encrypt.

This blog posts describes how to issue Let’s Encrypt certificates for internal servers. At Datto, we issued a certificate for each of our 65,000 90,000+ BCDR appliances using this exact mechanism.

Continue Reading

  • Mar 18 / 2018
Linux, Virtualization

USB disk causes blinking cursor at boot; how to “fix” the MBR bootstrap code

Have you ever rebooted your computer only to see a black screen with a blinking cursor? If you have a USB drive attached, chances are the blinking cursor is caused by invalid bootstrap code in the Master Boot Record (MBR) on that drive which has caused the normal boot execution to stop without returning control to the BIOS. If you have physical access to the machine, simply remove the USB drive and/or change the boot order to pick the OS disk first.

If you have no physical access, things are a bit more tricky: This exact thing happened to me at work the other day. Unfortunately, it didn’t happen to my computer, but to a few dozen of our customer backup appliances during their scheduled upgrade/reboot. Now, while dozens out of over 60k isn’t that much, our customers rely on these devices, so it’s not acceptable to have them not boot properly.

In this short post, I’ll demonstrate how to reproduce the blinking cursor problem, and how to “fix” the MBR to ensure the computer still boots, regardless of the boot order.

Continue Reading

  • May 28 / 2017
  • 9

Creating a BIOS/GPT and UEFI/GPT Grub-bootable Linux system

Good old Master Boot Record (MBR) unfortunately cannot address anything beyond 2TB, so partitioning large disks and making them bootable is impossible using MBR. The GUID Partition Table (GPT) solves this problem: It supports disks up to 16EB. However, installing grub does not work without a special BIOS boot partition. If you also want to support booting the same system via UEFI, another partition, the EFI System Partition (ESP), is necessary.

This should post shows you how to partition a disk with GPT and make a bootable Linux system via BIOS/Legacy and UEFI.

Continue Reading

  • Jan 08 / 2017
  • 29
Administration, Linux

How-To: Using ZFS Encryption at Rest in OpenZFS (ZFS on Linux, ZFS on FreeBSD, …)

An upcoming feature of OpenZFS (and ZFS on Linux, ZFS on FreeBSD, …) is At-Rest Encryption, a feature that allows you to securely encrypt your ZFS file systems and volumes without having to provide an extra layer of devmappers and such. To give you a brief overview of what the feature can do, I thought I’d write a short post about it.

The current ZFS encryption implementation is not (yet) merged into the upstream repository (as of January 2017). There is a pretty big pull request which is still being reviewed, but because the feature is so incredibly cool (and because my colleague Tom Caputi at Datto developed it), I thought a sneak preview is absolutely necessary.

Continue Reading

  • Jan 01 / 2017
Administration, Linux

zfsu: ZFS utils for offsite backup, retention and maintaining a slow mirror

My laptop runs ZFS as its root file system (see this blog post) — meaning that I can snapshot my root file system and I can send it to another machine as a backup very easily. Unfortunately, while ZFS provides the raw functionality, there is no great tool to manage offsite backups and retention. To ease this pain, I wrote/forked and packaged a few helper scripts which I called zfsu, a collection of ZFS utilities.

It consists of the following tools: zfsu tx (aka zfstx) maintains a mirror of a ZFS pool over the network. zfsu ret (aka zfsret) is a simple script to apply local retention (destroy snapshots) of a file system and its snapshots. zfsu res (aka zfsres) is a script to resilver a slow mirror, e.g. a HDD disk if mirrored with a SSD.

Continue Reading

  • Dec 31 / 2016
  • 19

How-To: Move your existing Linux install to ZFS on Root

Ever since I joined Datto two years ago, ZFS has been part of my work every day. And every day, I am amazed how great it is. So naturally, I wanted to move my existing Linux Mint 18 installation to boot off of ZFS. Why, you may wonder? Well that’s easy. Because now I can snapshot my root file system, I can roll back if I need to, and I can restore individual files in a heartbeat.

It took a bit of fiddling in the beginning, but once you know how it works, it’s a piece of cake. This short post shows you how to move your existing Linux installation to ZFS on root (preferably Ubuntu 16.04+ based, may work for others).

Continue Reading