Archiving And Backup

Core idea

Unix solved the backup problem by separating concerns. Archiving (packing many files into one stream while preserving permissions, ownership, and directory structure) is one job, and tar does it. Compression (squeezing redundancy out of a stream of bytes) is a separate job, and gzip, bzip2, and xz do it. The two compose: tar -cf - dir | gzip > out.tar.gz, where -f - sends the archive to standard output explicitly instead of relying on the build's default device. Synchronization (keeping two trees in step with minimum data transfer) is a third job, and rsync does it, transferring only the delta on each run. Each tool is small. Combined, they cover the entire backup surface, from a one-off tarball on a USB stick to an incremental remote mirror over SSH.
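A minimal sketch of that composition, assuming tar, gzip, and diff are installed. The directory and file names are invented for the demonstration. It packs a small tree, compresses it as a separate pipeline stage, then reverses both steps and checks the result.

```shell
# Throwaway workspace; every name below is illustrative.
work=$(mktemp -d)
mkdir -p "$work/dir/sub"
echo "hello" > "$work/dir/a.txt"
echo "world" > "$work/dir/sub/b.txt"

# Stage 1 archives to stdout (-f -); stage 2 compresses the stream.
tar -cf - -C "$work" dir | gzip > "$work/out.tar.gz"

# Reverse the pipeline: decompress, then unpack into a fresh directory.
mkdir "$work/restore"
gzip -dc "$work/out.tar.gz" | tar -xf - -C "$work/restore"

# The restored tree should match the original.
diff -r "$work/dir" "$work/restore/dir" && echo "round trip OK"
```

Because the stages are independent, swapping gzip for xz or bzip2 changes only the middle of the pipeline.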

Shotts's argument: Archiving and compression are different things and Unix kept them apart on purpose. zip smashes them together — convenient for Windows interop, but the Unix way is more flexible, more streamable, and more composable.

Why it matters

Backups are the only thing standing between you and disaster

Disks fail. Filesystems corrupt. People type rm -rf into the wrong terminal. Ransomware encrypts every accessible file. None of these are hypothetical — every long-time computer user has lost data to at least one. A backup is not optional infrastructure; it is the only realistic answer to data loss, and the difference between an inconvenience and a catastrophe is whether you had one.

Composition is the Unix philosophy in miniature

tar -cf - dir | gzip | ssh remote 'cat > dir.tar.gz' packs a directory, compresses it, and ships it to a remote host in one pipeline, without ever writing the archive to local disk. That single line uses four tools (tar, gzip, ssh, and cat on the far side), each doing one job, chained by stdin/stdout. Learning this idiom in the context of backup teaches you a pattern that recurs everywhere in shell scripting.

rsync changed what's possible

Before rsync, "synchronizing" two trees over a network usually meant copying everything every time. rsync first skips files whose size and modification time already match the destination. For files that did change, it checksums the contents in blocks, compares them with the destination's copy, and sends only the blocks that differ. A nightly backup of a 50 GB project may transmit a few megabytes. That changes the economics: incremental backups become cheap enough to run hourly.
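A small local experiment makes the "only what changed" behaviour visible. This is a sketch that assumes rsync is installed; the paths are temporary and made up. Note that for purely local copies rsync sends changed files whole by default, so what this shows is the file-level skipping. The block-level delta matters most over a network.

```shell
# Skip politely if rsync is not available on this machine.
command -v rsync >/dev/null 2>&1 || { echo "rsync not installed; skipping" >&2; exit 0; }

work=$(mktemp -d)
mkdir -p "$work/src" "$work/dest"
for i in 1 2 3 4 5; do echo "file $i" > "$work/src/f$i.txt"; done

# -i itemizes each change; lines starting with ">f" are files being sent.
# First run: everything is new.
first=$(rsync -ai "$work/src/" "$work/dest/" | grep -c '^>f')

# Second run: nothing changed, so nothing is sent.
second=$(rsync -ai "$work/src/" "$work/dest/" | grep -c '^>f')

# Change one file and run again: only that file is sent.
echo "more" >> "$work/src/f3.txt"
third=$(rsync -ai "$work/src/" "$work/dest/" | grep -c '^>f')

echo "files sent per run: $first $second $third"
```

The three counts should come out as 5, 0, and 1: the full set, then nothing, then just the edited file.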

The 3-2-1 rule is durable wisdom

Keep at least three copies of your data, on at least two different media types, with at least one copy stored off-site. Each clause defends against a different failure: three copies for "one is corrupted," two media types for "all SSDs of this batch failed," off-site for "the building burned down or got encrypted by malware." The rule predates the cloud and still applies — it tells you how many copies, not where.

Key takeaways

Mental model

Three jobs, three tools, one pipeline

A backup pipeline has three logical stages. Select which files to include. Pack them into a single transport-friendly unit. Compress if bandwidth or storage is tight. Each stage maps to a tool, and the tools chain by pipe.

Full versus incremental

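A full backup snapshots everything every time. It is simple and self-contained (restoring just means extracting the most recent archive) but expensive in time and space. An incremental backup snapshots only what changed since the last backup. It is smaller and faster, but restore requires the full plus every incremental since. tar --listed-incremental is the classic implementation. rsync --link-dest reaches a similar economy by a different route: each run produces a complete-looking snapshot directory in which unchanged files are hard links to the previous snapshot, so only changed files consume new space and any single snapshot restores on its own.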

A practical middle ground: full weekly, incremental nightly. You restore by extracting the most recent full, then layering the incrementals on top in order. The "Tower of Hanoi" rotation schedule, used by professional backup systems, generalizes this — but for personal use, weekly full + nightly incremental is usually plenty.
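The full-plus-incremental cycle can be sketched with GNU tar's --listed-incremental option. This assumes GNU tar specifically (BSD tar does not offer the option), and the file names, including state.snar, are placeholders chosen for the example.

```shell
# --listed-incremental is a GNU tar feature; skip elsewhere.
tar --version 2>/dev/null | grep -q 'GNU tar' || { echo "GNU tar required; skipping" >&2; exit 0; }

work=$(mktemp -d)
mkdir "$work/data"
echo "v1" > "$work/data/old.txt"

# Level 0 (full): also creates the snapshot file recording what was seen.
tar -cf "$work/full.tar" -C "$work" --listed-incremental="$work/state.snar" data

# Pause so the next file's timestamp is clearly later than the snapshot.
sleep 1
echo "v1" > "$work/data/new.txt"

# Level 1 (incremental): only files changed since the snapshot are stored.
tar -cf "$work/inc1.tar" -C "$work" --listed-incremental="$work/state.snar" data
tar -tf "$work/inc1.tar"

# Restore: extract the full archive first, then each incremental in order.
mkdir "$work/restore"
tar -xf "$work/full.tar" -C "$work/restore" --listed-incremental=/dev/null
tar -xf "$work/inc1.tar" -C "$work/restore" --listed-incremental=/dev/null
```

The listing of inc1.tar should show the directory and new.txt but not old.txt, which is why a restore needs the full archive underneath it.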

The trailing-slash trap

rsync src dest and rsync src/ dest do different things, and the mistake bites everyone at least once. Without the trailing slash, you copy the directory src into dest, producing dest/src/.... With the trailing slash, you copy the contents of src into dest, producing dest/.... Pick one mental model and stick with it; my preferred rule is "always include the slash and always treat dest as the target directory."
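The difference is easy to see side by side. The sketch below assumes rsync is installed and uses throwaway directories whose names are made up for the comparison.

```shell
# Skip politely if rsync is not available on this machine.
command -v rsync >/dev/null 2>&1 || { echo "rsync not installed; skipping" >&2; exit 0; }

work=$(mktemp -d)
mkdir -p "$work/src" "$work/no_slash" "$work/with_slash"
echo "data" > "$work/src/file.txt"

# No trailing slash: the directory itself lands inside the destination.
rsync -a "$work/src" "$work/no_slash/"

# Trailing slash: only the directory's contents land in the destination.
rsync -a "$work/src/" "$work/with_slash/"

# Show where file.txt ended up in each case.
find "$work/no_slash" "$work/with_slash" -type f
```

The first destination gains a src/ level above file.txt; the second holds file.txt directly.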

Practical application

For one-off archives (sharing a project folder, attaching to an email): tar -czf project.tar.gz project/. The czf is "create, gzip, file." The result is portable to any Unix system.

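For full system snapshots before a risky upgrade: sudo tar --acls --xattrs -cpzf /backup/snapshot-$(date +%F).tar.gz /etc /home /root /usr/local. The --acls and --xattrs options (GNU tar) preserve access control lists and extended attributes. tar records permissions in the archive on creation regardless; -p matters at extraction time, where it restores them exactly instead of applying the umask. sudo is needed so tar can read files your own account cannot, which also means the archive in /backup ends up owned by root; chown it afterwards if you want to manage it as yourself.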
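A date-stamped snapshot is worth checking the moment it is written. The helper below is a hypothetical sketch, not a standard tool: the name snapshot and its argument order are invented here, and it leaves out sudo, --acls, and --xattrs so that it runs unprivileged on a scratch directory.

```shell
# snapshot DEST_DIR TAR_ARGS... : hypothetical helper for illustration.
# Writes DEST_DIR/snapshot-YYYY-MM-DD.tar.gz and prints its path.
snapshot() {
    dest_dir=$1; shift
    out="$dest_dir/snapshot-$(date +%F).tar.gz"
    # -c create, -z gzip, -f output file; remaining arguments go to tar.
    tar -czf "$out" "$@" || return 1
    # Confirm the compressed stream is intact before trusting it.
    gzip -t "$out" || return 1
    echo "$out"
}

# Demonstration on a scratch tree standing in for /etc.
work=$(mktemp -d)
mkdir -p "$work/etc" "$work/backup"
echo "setting=1" > "$work/etc/app.conf"

archive=$(snapshot "$work/backup" -C "$work" etc)
echo "wrote $archive"
tar -tzf "$archive"
```

Passing -C through to tar keeps member names relative, so the archive can be extracted anywhere without overwriting live system paths.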

For ongoing personal backups to an external drive: rsync -avh --delete ~/ /mnt/backup/home/. Add --exclude='.cache/' and --exclude='node_modules/' to skip rebuildable junk. Make the command an alias in ~/.bashrc so you can run it on muscle memory before unplugging the laptop for travel.

For off-site backups to a remote server: rsync -avhz --delete --rsh=ssh ~/important/ deploy@remote:/backup/laptop/. The -z compresses on the wire, a big win over a slow residential uplink. Modern rsync already uses ssh as its remote shell, so --rsh=ssh is explicit rather than required. Schedule with cron or a systemd timer.

For very large archives, pick the compressor by what you are short of: use xz (tar -cJf out.tar.xz dir) for the best compression at the cost of time, or pigz (parallel gzip) for fast compression on multi-core machines. Default gzip is a reasonable middle ground.
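Measuring on your own data beats guessing at the trade-off. The sketch below compresses one sample archive with whichever of gzip, bzip2, xz, and pigz are installed and prints the resulting sizes. The sample data and output names are invented for the demonstration, and the repetitive text compresses far better than typical real files.

```shell
work=$(mktemp -d)
mkdir "$work/dir"

# Repetitive sample data so every compressor has something to squeeze.
i=0
while [ "$i" -lt 2000 ]; do
    echo "line $i: the quick brown fox jumps over the lazy dog"
    i=$((i + 1))
done > "$work/dir/sample.txt"

tar -cf "$work/plain.tar" -C "$work" dir
printf '%-6s %s bytes\n' "none" "$(wc -c < "$work/plain.tar")"

# Try each compressor that is installed; silently skip the rest.
# The output suffix is just the tool name, not the conventional extension.
for tool in gzip bzip2 xz pigz; do
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool" -c "$work/plain.tar" > "$work/out.tar.$tool"
        printf '%-6s %s bytes\n' "$tool" "$(wc -c < "$work/out.tar.$tool")"
    fi
done
```

Prefixing each compression command with time adds the speed half of the comparison.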

Example

You're a freelance developer working from a laptop. You want a backup strategy that satisfies 3-2-1 without requiring much daily effort. Here is one workable design.

Copy 1: the laptop itself. Original data lives in ~/projects, ~/Documents, and ~/.config. This is the source rather than a backup, but it counts as the first of the three copies.

Copy 2: local snapshot. An external USB SSD attached when at your desk. A short script wraps rsync -avh --delete --exclude='.cache/' --exclude='node_modules/' ~/projects ~/Documents ~/.config /mnt/backup/laptop/, and a shell alias calls it. You run it before stepping away; it takes seconds after the first time because only deltas transfer. Once a day a cron job runs the same script if the drive is mounted (cron cannot see interactive shell aliases, which is why the command lives in a script). This is media type 1 (external SSD).
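An unattended run needs a guard for the case where the drive is absent. Otherwise rsync happily fills the empty mount point on the internal disk. One way to sketch that guard is a marker file that exists only on the backup drive. The function name run_backup and the marker name .backup-target are assumptions made for this example, and rsync is assumed to be installed.

```shell
# Skip politely if rsync is not available on this machine.
command -v rsync >/dev/null 2>&1 || { echo "rsync not installed; skipping" >&2; exit 0; }

# run_backup DEST SRC... : hypothetical helper. It refuses to run unless
# DEST holds a marker file, a stand-in for "the drive is really mounted".
run_backup() {
    dest=$1; shift
    if [ ! -f "$dest/.backup-target" ]; then
        echo "no marker in $dest; is the drive mounted?" >&2
        return 1
    fi
    rsync -a --delete --exclude='.cache/' --exclude='node_modules/' "$@" "$dest/"
}

# Demonstration with scratch directories standing in for ~ and the drive.
work=$(mktemp -d)
mkdir -p "$work/home/projects/app/node_modules/lib" "$work/drive"
echo "code" > "$work/home/projects/app/main.js"
echo "junk" > "$work/home/projects/app/node_modules/lib/index.js"

# Without the marker the helper declines to copy anything.
run_backup "$work/drive" "$work/home/projects" || echo "skipped, as intended"

# With the marker in place the backup proceeds, minus excluded directories.
touch "$work/drive/.backup-target"
run_backup "$work/drive" "$work/home/projects"
find "$work/drive" -type f
```

Because the sources are given without trailing slashes, --delete only prunes inside each copied directory, so the marker at the top of the drive survives.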

Copy 3: nightly off-site to a $5 VPS. A systemd timer runs rsync -avhz --delete --bwlimit=2000 --rsh=ssh ~/projects/ deploy@vps:/backup/laptop/projects/ at 02:00. The trailing slash on ~/projects/ matters here: without it the files would land in /backup/laptop/projects/projects/. SSH key auth means no prompts; --bwlimit=2000 caps the transfer at roughly 2 MB/s so you don't notice it. This is media type 2 (remote, provider-managed storage) and off-site.

Copy 4: weekly full tarball to cold storage. Every Sunday, a script creates /mnt/backup/full-$(date +%F).tar.gz containing all of ~. Because that file sits on the same external SSD as Copy 2, it only becomes an independent copy once a month, when you copy the most recent tarball to a second external drive that lives at a relative's house. This is your nuclear-option restore: readable on any Unix system, no rsync required.

Now count the clauses: three copies (laptop, local SSD, VPS), two media types (local SSD, remote storage), one off-site (VPS, plus the monthly cold drive at a relative's). 3-2-1 satisfied, and the only manual steps are plugging in the external drive and the monthly drive swap. Each individual command is short and easy to test.

To validate quarterly: pick a directory at random, extract it from the most recent full tarball into /tmp/restore-test/, and diff -r against the live copy. Expect differences only in files you have changed since the tarball was made. The first time you find a discrepancy you cannot explain, fix the script; better now than the night after the laptop dies.
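The restore drill can be rehearsed end to end on scratch data. The sketch below assumes tar and diff are installed and uses a made-up miniature home directory in place of the real one; only the shape of the commands carries over.

```shell
work=$(mktemp -d)
mkdir -p "$work/home/projects/app" "$work/home/Documents"
echo "print('hi')" > "$work/home/projects/app/main.py"
echo "notes" > "$work/home/Documents/todo.txt"

# Stand-in for the weekly full tarball.
tar -czf "$work/full.tar.gz" -C "$work" home

# Restore a single directory from the archive into a scratch location.
mkdir "$work/restore-test"
tar -xzf "$work/full.tar.gz" -C "$work/restore-test" home/projects

# Compare the restored copy with the live one; no output means they match.
if diff -r "$work/home/projects" "$work/restore-test/home/projects"; then
    echo "restore test passed"
else
    echo "restore test FAILED" >&2
fi
```

Naming a member path after the archive extracts just that subtree, which keeps the quarterly check fast even when the tarball is large.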
