Correct Backups Require Filesystem Snapshots

Craig Younkins
4 min read · May 16, 2022


What I imagine data races look like. Photo by Drew Perry, CC BY-NC-ND 2.0. No changes.

No matter what backup program you run, using a point-in-time filesystem snapshot is required to achieve correct backups of a running system. A handful of backup programs handle these automatically, but most require you to do it yourself.

If you run a backup without snapshot support while browsing the web, there’s a good chance that the backup’s copy of your browser profile will be corrupt. The browser constantly downloads resources and writes metadata to a database that references them. The backup program has to scan for changes, then read and back up each file, usually handling only one or two at a time. Suppose the scan finds that the database has changed, and before the backup program reads it, the browser downloads a new file and records it in the database. The backed-up database now references a file that the scan never saw, so the file is not in the backup. The result is an inconsistency between the database and the other files within the backup.

cat.jpg is never backed up, but metadata.db contains a reference to it
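
The sequence can be reproduced deterministically with a small simulation. This is only a sketch: the "browser" and the "backup program" are a few shell commands run in a fixed order, and the directory layout and the one-name-per-line metadata.db format are made up for illustration.

```bash
#!/usr/bin/env bash
# Deterministic simulation of the scan-then-copy race described above.
# All paths and file formats here are hypothetical.
set -o errexit -o nounset

work=$(mktemp -d)
src="$work/profile"
dst="$work/backup"
mkdir -p "$src" "$dst"

# Initial state: one cached resource and a metadata file that references it.
echo "dog bytes" > "$src/dog.jpg"
echo "dog.jpg" > "$src/metadata.db"

# Step 1: the backup program scans and records the files it will copy.
scanned=("$src"/*)

# Step 2: before anything is copied, the "browser" downloads a new resource
# and records it in the metadata file.
echo "cat bytes" > "$src/cat.jpg"
echo "cat.jpg" >> "$src/metadata.db"

# Step 3: the backup copies only what the scan found, reading each file's
# current contents.
for f in "${scanned[@]}"; do
  cp "$f" "$dst/"
done

# Step 4: list names referenced by the backed-up metadata that have no file.
missing=""
while read -r name; do
  [ -e "$dst/$name" ] || missing="$missing $name"
done < "$dst/metadata.db"

echo "Referenced but not backed up:$missing"
```

Every individual copy succeeds, yet the last line reports cat.jpg as referenced but missing from the backup.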

An entirely different issue can occur when copying even a single file. Firefox and Chrome each store cookies in their own SQLite database. Copying just this one database while it is being written can result in a corrupt copy, and is strongly advised against.

Copying a file that is being written to can result in a copy that never matched the source at any one point in time
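
A torn copy like this can also be simulated in a fixed order. In the sketch below, an 8-byte file stands in for a database with two 4-byte "pages"; the file name and the page size are invented for the example, and real copy tools read in much larger chunks.

```bash
#!/usr/bin/env bash
# Simulate a copy that interleaves with a write to the same file.
# File names and the tiny "page" size are hypothetical.
set -o errexit -o nounset

work=$(mktemp -d)
src="$work/cookies.db"
copy="$work/cookies.copy"

# Version 1 of the file: two 4-byte pages.
printf 'AAAAAAAA' > "$src"
cp "$src" "$work/v1"

# The copier reads the first page...
head -c 4 "$src" > "$copy"

# ...the application rewrites both pages (version 2)...
printf 'BBBBBBBB' > "$src"
cp "$src" "$work/v2"

# ...and the copier reads the second page.
tail -c 4 "$src" >> "$copy"

# Compare the copy against every version the source ever had.
if cmp -s "$copy" "$work/v1" || cmp -s "$copy" "$work/v2"; then
  result="consistent"
else
  result="torn"
fi
echo "copy is $result: $(cat "$copy")"
```

The copy ends up as AAAABBBB, which matches neither version of the source.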

This issue isn’t limited to web browsers. Applications that use databases internally are more likely to be affected because their writes are scattered throughout the file, and that includes messaging apps like Signal and Discord, document managers like DEVONthink, and even terminal applications like iTerm2. But as we’ve seen, the issue is present even with individual files. The details depend on how each program writes its files, but if you’re very unlucky, even a simple text document could be backed up incorrectly if it’s saved during the backup.

In all cases, you end up with a backup that never matched all bytes on disk at any one time. Each read operation was correct, but because there were many read operations over time, the result as a whole is incorrect.

Ideally, our backup would look like the system had lost power in an instant. It’s still possible for there to be half-complete writes, but at that point it is up to the applications to handle such edge cases correctly. Whereas before the applications couldn’t detect or fix the corruption, now they have all the information to do so.

Filesystem snapshots provide exactly what we need — a consistent view, block for block, of all files at a single point in time even while the system continues running. Most often these snapshots are accomplished using copy-on-write filesystems, where changes are written as new blocks and then linked in with an atomic operation. In copy-on-write filesystems, point-in-time snapshots are easily accomplished by keeping the old blocks around instead of recycling them.

“ZFS: The Last Word In File Systems” Probably copyright Sun Microsystems

Now we can cover the practical details of how best to back up a running system. On Linux I strongly recommend ZFS due to its long history of reliability, but you could also use btrfs or LVM. Few tools that I’ve seen automatically handle snapshots on Linux, so you need to wrap the invocation with code to manage them:

#!/usr/bin/env bash

# Exit on error. Append "|| true" if you expect an error.
set -o errexit

# Destroy old snapshot, if any
zfs destroy tank@snap || true

# Create snapshot
zfs snapshot tank@snap

# Back up the read-only snapshot via its mount point
restic backup /tank/.zfs/snapshot/snap/

# Destroy the snapshot
zfs destroy tank@snap
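
One gap in a wrapper like this is that, with errexit set, a failed restic run exits before the final zfs destroy, so the snapshot lingers until the next run. A possible variant, sketched here rather than tested against a real pool, registers the cleanup with a trap. The run helper, the DRY_RUN switch, and the backup_with_snapshot function name are assumptions added for illustration, and the snapshot path assumes the dataset is mounted at /<dataset>, as tank is above.

```bash
#!/usr/bin/env bash

# Run a command, or only print it when DRY_RUN=1 (a hypothetical switch for
# trying the wrapper on a machine without ZFS or restic).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: backup_with_snapshot <pool-or-dataset> <snapshot-name>
# The body runs in a subshell (note the parentheses) so that errexit and the
# EXIT trap stay local to this function.
backup_with_snapshot() (
  set -o errexit -o nounset
  dataset=$1
  name=$2
  snap="${dataset}@${name}"

  # Remove a stale snapshot left by an interrupted run, if any.
  run zfs destroy "$snap" 2>/dev/null || true

  run zfs snapshot "$snap"
  # From here on, destroy the snapshot however the subshell exits.
  trap 'run zfs destroy "$snap"' EXIT

  # Assumes the dataset is mounted at /<dataset>.
  run restic backup "/${dataset}/.zfs/snapshot/${name}/"
)
```

Calling `DRY_RUN=1 backup_with_snapshot tank snap` prints the four commands in order without running them; dropping DRY_RUN runs them for real.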

On macOS, APFS supports snapshots. The functionality is tightly integrated with Time Machine, so while it’s hard to make assertions about something that’s closed source and has poor technical documentation, hopefully Time Machine backups don’t have this problem. If you use any command line tools you’ll need to manage your own snapshots through tmutil.

#!/usr/bin/env bash

# Exit on error. Append "|| true" if you expect an error.
set -o errexit

VOLUME_PATH="/System/Volumes/Data"
SNAP_DATE=$(tmutil snapshot "$VOLUME_PATH" | tail -n1 | sed s/'Created local snapshot with date: '//)
echo "Created snapshot $SNAP_DATE"

LOCAL_MOUNTPOINT="/tmp/snapshotbackup"
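# Create the mount point if it does not exist yet
mkdir -p "$LOCAL_MOUNTPOINT"
# Unmount anything left over from a previous run; ignore "not mounted" errors
umount "$LOCAL_MOUNTPOINT" 2>/dev/null || true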

mount_apfs -o rdonly -s "com.apple.TimeMachine.${SNAP_DATE}.local" "$VOLUME_PATH" "$LOCAL_MOUNTPOINT"

restic backup "${LOCAL_MOUNTPOINT}/Users/craig"

umount "$LOCAL_MOUNTPOINT"
echo "Deleting snapshot $SNAP_DATE"
tmutil deletelocalsnapshots "$SNAP_DATE"

On Windows, NTFS doesn’t have native support for snapshots. Instead, Microsoft has the Volume Shadow Copy Service (VSS), which is a sort of complex dance between an intermediate write buffer, the software doing the writes, and the backup software. While most commercial backup software supports VSS, it’s unclear to me how safe the system actually is and whether common user applications like browsers are protected.

Take a look at your backups and consider whether any writes could occur during the backup. In most cases it’s plausible or even likely for writes to happen, so be on the safe side and use a snapshot!

Notes

Thank you to David Orr for reviewing this post.

Diagrams made with Excalidraw.

Discussion on Hacker News

Craig Younkins

Hacker, entrepreneur, and quantified self nerd. cyounkins at gmail.