Tuesday, March 24, 2026

Designing a Test Runner for AI Agents

Why piped shell commands create zombie processes in agentic workflows

AI agents don't use your tools the way you do. They can't see progress dots streaming by. They don't notice when a command is still running. They don't feel the machine getting slow. I found seven stuck ruby bin/rails test processes on my Mac. Running for hours. Holding database connections. Burning CPU. The AI agent that spawned them had moved on long ago, oblivious. It cost me 10-15 hours across sessions before I traced the root cause.

This is what agentic coding actually looks like when you move past demos and into production work. The AI writes good code. The problem is everything around it — the shell, the test runner, the output buffering, the process lifecycle. Here's what broke, and what I built to fix it.

Three Problems Disguised as One

What looked like "tests are slow" was actually three independent issues compounding.

1. The tool auto-backgrounds commands. Claude Code's Bash tool has an internal timeout heuristic. When a command approaches the threshold (~2 minutes by default), the tool silently moves it to a background task. The AI doesn't get the output. It doesn't know the command is still running. It sees nothing, and moves on — sometimes spawning another test run.

2. Pipe buffering triggers it. bin/rails test | grep "runs," — a perfectly reasonable command for a human — means stdout is a pipe, not a TTY, so the test process switches to block-buffered output and holds it until the buffer fills or the run ends. The tool sees no output flowing and decides the command is long-running. Backgrounded. A human at a terminal would see progress dots streaming by. The AI sees silence.

3. Parallel forking creates orphans. Rails runs tests across 16 forked processes coordinated via DRb. When the parent is killed (by the tool's timeout, by me, by a cleanup attempt), the 16 children become orphans. They keep running. They keep holding connections. One stuck run is invisible. Seven brings the machine to its knees.

The Dead Ends

set -e in shell scripts. Our git hook ran the full test suite and kept deadlocking. I thought set -e was propagating into forked workers — a child dying on a benign non-zero exit. Removed it. Still deadlocked. The real issue: Rails DRb parallel forking doesn't work inside git hook subprocesses at all. The hook's restricted environment (modified file descriptors, GIT_DIR set) prevents DRb from binding its coordination socket.

Forcing sequential execution. Added PARALLEL_WORKERS=1 to force single-process mode in hooks. Worked — but tests took 3 minutes instead of 23 seconds, and sequential mode surfaced 10 test isolation bugs that parallel mode had been hiding (each parallel worker gets its own database; sequential shares one). Spent hours chasing failures that weren't real bugs.

Writing integration tests for the fix. I had a timezone display bug where Time.zone wasn't being set in certain controllers. Wrote an integration test that made HTTP requests through the full Rails stack. The test spun at 100% CPU forever — the auto-auth flow created a redirect loop in the test environment. The actual fix was three lines of code. The test that caught the regression was a unit test that took 0.8 seconds.

What I Actually Built

A wrapper script called bin/test. About 90 lines of bash.

# Step 1: Kill any test processes older than 5 minutes
# Step 2: Run tests, capture full output to /tmp/fabwise_test_output.txt
# Step 3: Print only the summary line (or failures if any)
# Step 4: Hard timeout at 5 minutes
# Step 5: EXIT trap to clean up forked workers if interrupted
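Those five steps can be sketched roughly like this. The paths and timings come from the post; the pgrep/ps patterns, the minitest summary regex, and the TEST_CMD override hook are my assumptions, and GNU `timeout` is assumed available (via coreutils on macOS):

```shell
#!/usr/bin/env bash
# bin/test — sketch of the wrapper described above, not the exact script.
set -u

LOG="${LOG:-/tmp/fabwise_test_output.txt}"
TEST_CMD="${TEST_CMD:-bin/rails test}"   # hypothetical override hook
MAX_SECS=300                             # 5-minute ceiling; full suite takes ~23s

# Step 1: kill any `rails test` process older than 5 minutes
for pid in $(pgrep -f "rails test" 2>/dev/null); do
  age=$(ps -o etimes= -p "$pid" 2>/dev/null | tr -d ' ')   # elapsed seconds
  if [ -n "$age" ] && [ "$age" -gt "$MAX_SECS" ]; then
    kill -9 "$pid"
  fi
done

# Step 5 (registered up front): reap forked workers if we're interrupted
trap 'pkill -P $$ 2>/dev/null || true' EXIT

# Steps 2 and 4: run under a hard timeout, capture everything to the log
if timeout "$MAX_SECS" $TEST_CMD >"$LOG" 2>&1; then
  # Step 3, success: print only the minitest summary line
  echo "✓ $(grep -E '[0-9]+ runs, [0-9]+ assertions' "$LOG" | tail -1)"
else
  # Step 3, failure: summary plus the failure details
  echo "✗ Tests failed (full output: $LOG)"
  grep -E '[0-9]+ runs, [0-9]+ assertions' "$LOG" | tail -1
  grep -A 2 -E '^(Failure|Error):' "$LOG" || true
fi
```

The trap is registered before the run starts so that a Ctrl-C or tool-level kill still reaps the forked workers instead of orphaning them.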

The process cleanup is straightforward — any rails test older than 5 minutes is definitively stuck (the full suite takes 23 seconds). Kill it before starting a new run.

The interesting part is the output management. A 7,500-test suite produces thousands of lines — progress dots, seed data, deprecation warnings, parallel worker logs. If that all flows to stdout, two things happen:

  1. The AI agent reads every line, burning context window tokens on noise
  2. The volume of output can trigger the auto-backgrounding heuristic

The fix: capture everything to a temp file, print only the result.

$ bin/test
✓ 7599 runs, 21379 assertions, 0 failures, 0 errors, 3 skips

One line. On failure, it prints the summary plus the relevant error messages:

$ bin/test
✗ Tests failed (full output: /tmp/fabwise_test_output.txt)

7599 runs, 21379 assertions, 1 failure, 0 errors, 3 skips

Failure:
RobotsControllerTest#test_returns_disallow_for_staging_domain
Expected "..." to include "Disallow: /".

Just enough to debug. The full log is in the temp file if you need it. Overwritten on every run — no accumulation.

This turned out to matter as much as the process cleanup. An AI agent's context window is finite and expensive. Every line of test dots is a line that can't be used for reasoning about the actual code change. The wrapper doesn't just manage processes — it manages what the AI pays attention to.

The Undocumented Configuration

The Bash tool's auto-backgrounding is driven by a 2-minute default timeout. It's configurable, but you won't find this in any official documentation.

In ~/.claude/settings.json:

{
  "env": {
    "BASH_DEFAULT_TIMEOUT_MS": "600000",
    "BASH_MAX_TIMEOUT_MS": "600000"
  }
}

This raises the default to 10 minutes, which is also the hard ceiling — BASH_MAX_TIMEOUT_MS cannot be raised beyond it. Requires a full Claude Code restart, and works inconsistently across environments.

Separately, the run_in_background flag has a worse bug: it creates infinite system-reminder loops that accumulate tokens at 11.6x the normal rate. Don't use it.

The wrapper script remains essential even with the timeout configured. Defense in depth.

What I Changed About How I Write Tests

The zombie process problem forced me to rethink test design for an environment where the operator can't see the terminal.

Unit tests over integration tests for behavior verification. An integration test that boots the full Rails stack can hang if any part of the auth flow misbehaves. A unit test that simulates the same before_action directly runs in 0.8 seconds and can't hang. The AI doesn't need the HTTP stack to verify that Time.zone gets set — it needs a fast, deterministic assertion.

Never pipe test output. | grep is natural for humans. For AI agents, the pipe buffers output, triggers backgrounding, and the agent never sees the result. Run the command, capture the output, parse it after.
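The capture-then-parse pattern looks like this (the temp-file path is illustrative):

```shell
# Anti-pattern for agents: the pipe block-buffers output until the run ends,
# so the tool sees silence and backgrounds the command:
#   bin/rails test | grep "runs,"

# Agent-safe: let the command run to completion, capture everything,
# then parse the file. `|| true` so a failing suite doesn't abort here.
bin/rails test > /tmp/test_output.txt 2>&1 || true
grep "runs," /tmp/test_output.txt | tail -1
```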

Document the rules where the AI reads them. Project instructions (CLAUDE.md), persistent memory, and the wrapper script itself all say "use bin/test, never bin/rails test." Subagents get PARALLEL_WORKERS=1 because they run in restricted subprocesses where parallel forking deadlocks. These aren't suggestions — they're load-bearing infrastructure.
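A sketch of the kind of CLAUDE.md entry this implies (the exact wording in the project is the author's; this is illustrative):

```markdown
## Testing
- ALWAYS run tests with `bin/test`. NEVER run `bin/rails test` directly.
- In git hooks and subagent shells, set `PARALLEL_WORKERS=1` —
  DRb parallel forking deadlocks in restricted subprocesses.
```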

The Pattern

Every one of these problems followed the same shape:

  1. The AI does something reasonable
  2. The tool responds in a way nobody anticipated
  3. The failure is silent — no crash, no error, just nothing happening
  4. Time is lost waiting for a human to notice

The fix is never "make the AI smarter." It's always "make the environment resilient to how the AI actually behaves." Wrapper scripts with cleanup logic. Output capture that protects the context window. Test designs that can't produce infinite loops. Configuration knobs buried in GitHub issues.

None of this is glamorous. But if you're using AI agents for real development work — not demos, not toy projects — this is where the hours go.


References: Bash timeout ceiling (GitHub #25881) · Timeout configuration (GitHub #5615) · Background process token bug (GitHub #11716) · Shell session detection (GitHub #3505)
