Troubleshooting

Core idea

Bugs in shell scripts come in two flavours: syntactic (the parser refuses the script) and logical (the script runs but produces the wrong result). Each demands a different toolkit. Syntactic errors are caught by reading the parser's confused error messages backward from where the shell got lost. Logical errors are caught by instrumenting the script — set -x, echo markers, ShellCheck, deliberate test cases — and by writing the script defensively in the first place so the bug categories that can occur are sharply narrowed.

Author's framing: A general rule of good programming is that if a program accepts input, it must be able to deal with anything it receives. Most logic errors result from a program encountering data the programmer did not anticipate.

Syntactic bugs point past the real error

When the shell reports an error on line 30, the cause is often on line 18. A missing quote drifts the parser forward looking for a closing one; a missing fi or ; re-interprets later tokens as arguments. The technique is to read backward from the reported line until you find the first thing that looks structurally suspicious — and to use a syntax-highlighting editor that visually distinguishes strings from code so unbalanced quotes glow on the screen.
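
As a rough illustration of this drift, the sketch below writes a throwaway script whose quote on line 2 never closes, then checks it with bash -n (parse only, do not execute). The file contents and variable names are invented for the example. The parser pairs the stray quote with later ones, so the complaint typically lands near the end of the file rather than on line 2.

```shell
#!/bin/bash
# Write a deliberately broken script: the quote opened on line 2 never closes.
broken=$(mktemp)
cat > "$broken" <<'EOF'
#!/bin/bash
echo "starting deploy
if [ -d /tmp ]; then
  echo "tmp exists"
fi
echo "done"
EOF

# bash -n parses the script without running it; diagnostics go to stderr.
if syntax_report=$(bash -n "$broken" 2>&1); then
  syntax_status=0
else
  syntax_status=$?
fi

# Compare the line number in the report with the real culprit (line 2).
echo "$syntax_report"
rm -f "$broken"
```

Reading backward from the reported line to the first unbalanced quote is the manual version of what the highlighting editor shows at a glance.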

Logical bugs come from unanticipated input

The book's canonical horror story: cd "$dir_name" && rm *. If dir_name is unset or empty, the command becomes cd "", which bash typically treats as a successful no-op, so the && guard passes and rm deletes the contents of the current directory. The bug is not the rm; the bug is the assumption that dir_name is non-empty. Defensive code never trusts a variable without checking it first. The general patterns:

  • check the value exists ([[ -n "$dir_name" ]]),
  • check it names a real thing ([[ -d "$dir_name" ]]),
  • check the success of any operation whose failure changes meaning (cd "$dir_name" || exit 1).
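
One way the three checks might be combined is sketched below. The function name clean_dir is hypothetical and not from the book; the subshell around cd is a design choice of this sketch so the caller's working directory is never changed.

```shell
#!/bin/bash
# clean_dir DIR: delete the files in DIR, but only after all three checks pass.
# (clean_dir is a hypothetical helper name used for illustration.)
clean_dir() {
  local dir_name=${1:-}

  # 1. The value exists.
  if [[ -z "$dir_name" ]]; then
    echo "clean_dir: no directory given" >&2
    return 1
  fi

  # 2. It names a real directory.
  if [[ ! -d "$dir_name" ]]; then
    echo "clean_dir: not a directory: $dir_name" >&2
    return 1
  fi

  # 3. The cd must succeed before rm runs; the subshell keeps the
  #    caller's working directory untouched either way.
  (
    cd "$dir_name" || exit 1
    rm -f ./*
  )
}
```

Called as clean_dir "$dir_name", an empty or wrong value now produces a message and a non-zero status instead of a deletion in the wrong place.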

set -e is a tempting trap, not a fix

set -e is the shell's "auto-abort on any failure" switch. It looks like a defensive habit but is full of arcane exceptions: commands tested by if, while, or until, every command but the last in a && or || list, the non-final commands in a pipeline, command substitutions outside POSIX mode (which do not inherit -e unless inherit_errexit is set), and several others all suppress the abort. BashFAQ/105 recommends against relying on it. Real error handling (explicit || guards, if checks, helpful messages) is the only thing that scales.
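
A few of these exceptions can be observed directly. The sketch below runs each case in its own bash -c so that an abort in one cannot hide the others; the variable names are made up for the demonstration.

```shell
#!/bin/bash
# Case 1: a failing command used as an if condition does not abort.
case_if=$(bash -c 'set -e; if false; then :; fi; echo survived')

# Case 2: a failing non-final command in a pipeline does not abort
# (pipefail is off by default).
case_pipe=$(bash -c 'set -e; false | true; echo survived')

# Case 3: a function called as an if condition runs with -e ignored,
# so the failure inside it is silently stepped over.
case_func=$(bash -c 'set -e; f() { false; echo "kept going inside f"; }; if f; then echo survived; fi')

# Control: a plain failing command does abort, so nothing is printed.
case_plain=$(bash -c 'set -e; false; echo survived') || true

printf 'if condition:   %s\n' "$case_if"
printf 'pipeline:       %s\n' "$case_pipe"
printf 'function in if: %s\n' "$case_func"
printf 'plain failure:  %s\n' "${case_plain:-<aborted>}"
```

The third case is the one that most often surprises people: wrapping a call in if changes how every command inside the function is treated.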

ShellCheck catches what humans miss

ShellCheck is a static analyser, run as shellcheck script.sh, that flags hundreds of common bash mistakes: unquoted variables, missing -r on read, useless cats, mismatched bracket forms, dangerous globs. It runs quickly, has integrations for most popular editors, and finds bugs that a careful human reviewer would miss on a second pass. Treat ShellCheck as the linter step in your pipeline: every script committed should pass clean (or carry explicit # shellcheck disable= lines documenting why a rule is suppressed).
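
To make the categories concrete, here is a small before-and-after sketch. The "before" loop is kept as a comment, and the warnings listed are the ones ShellCheck would be expected to raise for it; count_nonblank_lines is a hypothetical function written for this example.

```shell
#!/bin/bash
# Before (shown as a comment so this file itself stays lint-clean):
#   cat $file | while read line; do ...; done
# Expected ShellCheck findings for that line include:
#   SC2086  $file is unquoted, so it is subject to word splitting and globbing
#   SC2002  useless cat; redirect the file into the loop instead
#   SC2162  read without -r mangles backslashes

# After: quoted expansion, input redirection instead of cat, read -r.
count_nonblank_lines() {
  local file=$1 line count=0
  while IFS= read -r line; do
    if [[ -n "$line" ]]; then
      count=$((count + 1))
    fi
  done < "$file"
  echo "$count"
}
```

The rewritten loop also runs in the current shell rather than in a pipeline subshell, which is why count is still visible after the loop ends.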

Tracing — set -x and friends

When a logical bug resists inspection, make the script show its own execution. set -x makes bash print every command after expansion, prefixed with the value of PS4 (default: +). Common patterns:

  • enable for the whole script: #!/bin/bash -x or set -x at the top,
  • enable for a region: set -x; ...; set +x,
  • improve the trace prefix: PS4='+ ${BASH_SOURCE##*/}:${LINENO}: ' to include file and line number.

The combination of set -x and a per-variable echo "DEBUG: var=$var" >&2 is usually enough to pin down a logical bug quickly.
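
A minimal sketch of region tracing with a customised prefix follows. The function compute_total and its arguments are invented for the example; the trace itself goes to stderr, so the script's normal output is unaffected.

```shell
#!/bin/bash
# Prefix every traced line with the source file name and line number.
PS4='+ ${BASH_SOURCE##*/}:${LINENO}: '

# compute_total N...: add up the arguments (hypothetical example function).
compute_total() {
  local total=0 n
  for n in "$@"; do
    total=$((total + n))
  done
  echo "$total"
}

# Trace only the suspect region, then switch tracing off again.
set -x
result=$(compute_total 1 2 3)
set +x

echo "result=$result"
```

Because PS4 is expanded for each traced command, single quotes around the assignment matter: they keep ${LINENO} from being evaluated once at assignment time.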

Why it matters

A script that handled today's data flawlessly will eat someone's home directory tomorrow because the input is now an empty string, a filename with a leading hyphen, or a directory that doesn't exist. Defensive coding is not paranoia; it is the recognition that future input is unknown. The cost of one defensive check is a line of code; the cost of one missing check can be irrecoverable data loss.

Bugs found early are cheap

The "release early, release often" principle exists because bugs become far more expensive to fix the further they propagate. A bug caught the same minute the function was written is a one-line edit. The same bug caught after deployment is an incident, a post-mortem, a rollback, and an apology. Staged development (every change runnable, every commit tested) keeps bugs at the cheap end.

Defensive code documents intent

[[ -d "$dir_name" ]] || { echo "missing dir: $dir_name" >&2; exit 1; } is both a runtime check and a comment to future readers: "this script requires $dir_name to be an existing directory". A reader who removes that line is choosing to break the contract, not blindly editing. The check is its own documentation.

Empty variables are the largest single source of disasters

Almost every catastrophic shell bug, including the one above, has the form "a variable was empty and the script kept going". set -u (abort on unset variables) is one of the few "set" options worth using, but it does not catch a variable that is set to the empty string. The universal habit of [[ -n "$var" ]] checks before destructive operations is therefore still what separates production-grade scripts from one-off hacks.
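
The difference between unset and empty can be shown with three short probes, sketched below. Each probe runs in a subshell so that an abort ends only the probe. The ${var:?message} expansion in the third probe is not mentioned in the text above; it is included as one standard parameter expansion that rejects both unset and empty values.

```shell
#!/bin/bash
# Probe 1: set -u aborts when the variable is unset.
if ( set -u; unset dir_name; : "$dir_name" ) 2>/dev/null; then
  unset_result="continued"
else
  unset_result="aborted"
fi

# Probe 2: an empty string counts as set, so set -u lets it through.
if ( set -u; dir_name=""; : "$dir_name" ) 2>/dev/null; then
  empty_result="continued"
else
  empty_result="aborted"
fi

# Probe 3: ${var:?message} aborts for unset and for empty values.
if ( dir_name=""; : "${dir_name:?dir_name must not be empty}" ) 2>/dev/null; then
  guarded_result="continued"
else
  guarded_result="aborted"
fi

echo "unset variable with set -u:  $unset_result"
echo "empty variable with set -u:  $empty_result"
echo "empty variable with :? form: $guarded_result"
```

The middle probe is the gap the explicit [[ -n "$var" ]] check exists to close.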

Key takeaways

Practical application

  1. Reproduce on demand before you change anything. A bug you cannot reproduce is a bug you cannot fix — you can only guess. The first investment is finding the minimal input that triggers it.

  2. Minimise the input and the script. Strip the script down to the smallest version that still fails. Strip the input down to the shortest string that still fails. The bug usually reveals itself in the act of minimising.

  3. Run ShellCheck early. Many "logical bugs" are actually well-known anti-patterns ShellCheck has a rule for. Get the linter clean before reaching for the debugger.

  4. Turn on set -x for the region in question. Don't trace the whole script if you don't need to; set -x immediately before the suspect block and set +x after gives a focused trace.

  5. Customise PS4 for serious tracing. PS4='+ ${BASH_SOURCE##*/}:${LINENO}:${FUNCNAME[0]:-main}: ' makes every traced line tell you exactly where it came from.

  6. Add echo "DEBUG: var=$var" >&2 markers when set -x isn't enough — particularly for variables that change across iterations of a loop.

  7. Bisect by commenting out. When you have no hypothesis, comment out half the script. If the bug disappears, the cause is in the half you commented out; if it doesn't, the cause is in the half you kept. Repeat.

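  8. Once fixed, add a guarded test case. A one-line check in a tests/ directory, such as ./your_script.sh < tests/fixture.txt || echo "FAIL: regression" (the fixture path is illustrative; it holds the minimal input that triggered the bug), beats the same bug coming back next quarter.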
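
For the DEBUG markers in step 6, one possible shape is a small helper that stays silent unless asked. Both debug and sum_sizes are hypothetical names for this sketch, and the DEBUG environment variable is an assumed convention rather than anything bash defines.

```shell
#!/bin/bash
# debug MSG...: print a marker on stderr, only when DEBUG is non-empty.
debug() {
  if [[ -n "${DEBUG:-}" ]]; then
    echo "DEBUG: $*" >&2
  fi
}

# sum_sizes N...: add up the arguments, reporting the running total
# on every iteration when debugging is switched on.
sum_sizes() {
  local total=0 size
  for size in "$@"; do
    total=$((total + size))
    debug "size=$size total=$total"
  done
  echo "$total"
}
```

Running DEBUG=1 sum_sizes 10 20 30 prints one marker per iteration on stderr, while a plain sum_sizes 10 20 30 prints only the total, so the markers can stay in the script after the bug is fixed.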

Example

A deploy script silently corrupted a production volume one Friday afternoon. The offending block was:

target_dir=$(get_target_for "$ENV")
cd "$target_dir"
rm -rf old/*

get_target_for was supposed to print the directory for the named environment. On Friday it printed nothing — $ENV had been renamed to $ENVIRONMENT upstream and the variable was now unset. target_dir became the empty string. cd "" is a no-op that leaves you in $PWD — which happened to be the deploy host's $HOME. rm -rf old/* then deleted everything under ~/old/ on the deploy host, which happened to be a backup snapshot directory.

The bug took an hour to identify. Walking the loop:

  1. Reproduce. Run unset ENV; ./deploy.sh staging in a sandbox — the bug appears immediately.
  2. Minimise. The bug needs four lines: the variable, the function call, the cd, the rm.
  3. Instrument. bash -x deploy.sh shows + cd '' followed by the rm -rf command, its old/* glob already expanded against the wrong directory.
  4. Fix. Three changes: guard the empty case, guard cd's failure, use ./ on the glob:

target_dir=$(get_target_for "$ENV")
[[ -n "$target_dir" && -d "$target_dir" ]] || {
  echo "deploy: get_target_for produced no valid directory for ENV='$ENV'" >&2
  exit 1
}
cd "$target_dir" || exit 1
rm -rf ./old/*

The fix is a handful of lines of defensive code that prevent an entire category of failures. The test that goes in the repo immediately afterwards, ENV= bash -c './deploy.sh; [[ $? -ne 0 ]]', verifies that an empty ENV is now a hard error, not a silent disaster. If anyone removes the guard in a future refactor, the test will tell them.
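
A slightly fuller version of that test could also check that nothing outside the target was touched. The sketch below builds a throwaway sandbox with a stand-in deploy.sh; the real script and get_target_for are not shown in the text, so the stub here (which only knows "staging") and the helper run_deploy are assumptions made for illustration.

```shell
#!/bin/bash
# Sandbox layout: one directory that must survive, one valid deploy target.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/old" "$sandbox/staging/old"
touch "$sandbox/old/backup.snapshot" "$sandbox/staging/old/stale.txt"

# Stand-in deploy.sh containing the guarded block from the fix.
cat > "$sandbox/deploy.sh" <<'EOF'
#!/bin/bash
get_target_for() {   # hypothetical stub: prints nothing for unknown names
  case "$1" in
    staging) echo "staging" ;;
  esac
}
target_dir=$(get_target_for "$ENV")
[[ -n "$target_dir" && -d "$target_dir" ]] || {
  echo "deploy: get_target_for produced no valid directory for ENV='$ENV'" >&2
  exit 1
}
cd "$target_dir" || exit 1
rm -rf ./old/*
EOF

# run_deploy ENV_VALUE: run the stand-in script from inside the sandbox.
run_deploy() {
  ( cd "$sandbox" && ENV=$1 bash ./deploy.sh ) 2>/dev/null
}

if run_deploy ""; then
  echo "FAIL: empty ENV was accepted"
else
  echo "ok: empty ENV rejected"
fi
if [[ -e "$sandbox/old/backup.snapshot" ]]; then
  echo "ok: files outside the target untouched"
else
  echo "FAIL: files outside the target were deleted"
fi
```

Checking the surviving file as well as the exit status guards against a refactor that keeps the error message but moves it after the rm.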
