How to Stop AI Coding Agents Burning Money
The practical approach to setting budgets, enforcing limits, and understanding where your token costs actually go when running Claude Code.

A developer I know ran an AI coding agent overnight on what seemed like a well-scoped refactoring task. By morning, the job had made one substantive change — and spent £340 getting there. The agent had looped: tried the refactor, ran the tests, saw a type error, tried to fix it, introduced a different error, fixed that, re-introduced the original issue, and kept going for six hours. Nobody was watching. Nothing stopped it.
This isn't an unusual story. AI coding agent cost overruns are a real and growing problem — not because the tools are broken, but because the default configuration for most of them is effectively "go until you're done." Without explicit budget controls, that's an open cheque.
Here's a practical framework for putting guardrails in place before you see the invoice.
Why Agents Cost More Than You Expect
The naive model is: agent starts, agent does the task, agent stops, you're billed proportionally to the task. The actual cost curve is different.
Loops are expensive. An agent that iterates 4 times to fix a failing test costs at least 4× a single successful pass. An agent that iterates 40 times costs at least 40×. Loops happen when the goal is ambiguous, when the codebase has accumulated state the agent doesn't understand, or when the failure mode is genuinely difficult.
Context accumulation adds up. The longer a session runs, the larger the context the model processes on each turn. Cost per token is fixed, but tokens per request grows as the conversation history expands. A 4-hour session may cost disproportionately more than four 1-hour sessions doing the same total work.
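The arithmetic behind this is easy to sketch. Assuming roughly 2,000 new tokens of history per turn, and input billed on the full history each request (both numbers illustrative, not real pricing), the totals diverge quickly:

```shell
# Back-of-envelope context growth. Each turn adds PER_TURN tokens of
# history, and every request re-sends the whole history, so total input
# tokens are the sum over turns of a steadily growing context.
context_tokens() {           # context_tokens TURNS PER_TURN
  echo $(( $2 * $1 * ($1 + 1) / 2 ))
}

echo "one 80-turn session:   $(context_tokens 80 2000) input tokens"
echo "four 20-turn sessions: $(( 4 * $(context_tokens 20 2000) )) input tokens"
```

Under those assumptions the single long session bills nearly four times the input tokens of four short sessions doing the same number of turns.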
Silence is invisible spending. If your agent is stalled waiting for something — a network call that timed out, an interactive prompt it can't answer, a build process that hung — the session may still be "open" and accruing charges depending on how the API billing is structured.
Scope creep compounds costs. You asked the agent to fix a bug. It read the surrounding code, noticed a quality issue, decided to address it, noticed a related dependency was outdated, and so on. Each decision is locally reasonable. Together they've turned a 15-minute job into a 2-hour session.
The Three Levers for AI Coding Agent Cost Control
1. Bound the Goal Scope Tightly
Most cost overruns start with an under-specified goal. "Improve the codebase" is an invitation to work indefinitely. "Add input validation to the five functions in src/api/handlers.ts and ensure the existing tests still pass" has a clear boundary.
Patterns that keep costs predictable:
- Name specific files or directories — don't say "the API layer", say "src/api/handlers.ts and src/api/middleware.ts"
- Include acceptance criteria — "job is done when `npm test` exits 0 with no new test skips"
- Set a "don't touch" boundary — "leave the database layer and migration files alone"
- Limit iteration count explicitly — "if you haven't resolved the issue in 3 attempts, stop and summarise what you tried"
That last pattern is underused. Telling Claude Code to stop after 3 failed attempts and write a diagnostic summary turns a potential £50 loop into a £3 report — which is actually more useful, since you can read it and decide on the right fix rather than inheriting a partial solution.
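Put together, a bounded goal might look like this. The file paths, the test command, and the exact prompt wording are illustrative, not prescriptive:

```shell
# A tightly bounded goal for a headless run (all specifics illustrative).
GOAL='Add input validation to the exported functions in src/api/handlers.ts.
Done when "npm test" exits 0 with no new skipped tests.
Leave the database layer and migration files alone.
If the same test still fails after 3 attempts, stop and summarise
what you tried instead of continuing.'

# Guarded so the sketch is safe to run where the CLI is not installed:
if command -v claude >/dev/null 2>&1; then
  claude -p "$GOAL"
fi
```

Every sentence of that goal maps to one of the four patterns above: named files, acceptance criteria, a don't-touch boundary, and an explicit iteration limit.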
2. Use Silence Detection
The most reliable cost control is detecting when an agent has stopped making progress and halting the run before it accrues further charges. This is different from detecting failure — a failed run often exits cleanly. Silence is when the process is running, output has stopped, and the meter is still ticking.
For cron-based Claude Code setups, a hard timeout catches this:
```shell
# Kill the process after 2 hours regardless
timeout 7200 claude -p "your goal" --project /path/to/project
```

This is crude but effective for simple cases. It kills the job at 2 hours even if Claude Code is making good progress — which is a tradeoff.
OpenHelm's approach is more granular: if no output is produced for 10 minutes, the run is flagged and stopped. This catches genuine hangs and the subtler case where Claude Code is producing output but cycling on the same error. You can see from the log that iteration 12 has the same type error as iteration 1, which is a different kind of information than "job timed out."
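A minimal version of output-based silence detection can be sketched in plain shell: run the command, capture its output to a log, and kill it once the log has stopped growing for a set number of seconds. The `claude` invocation in the usage note is illustrative.

```shell
# watch_for_silence CMD IDLE_LIMIT LOG
# Runs CMD with output captured to LOG, polling once a second; kills the
# process if LOG has not grown for IDLE_LIMIT seconds.
watch_for_silence() {
  cmd=$1; idle_limit=$2; log=$3
  sh -c "$cmd" >"$log" 2>&1 &
  pid=$!
  idle=0; last_size=-1
  while kill -0 "$pid" 2>/dev/null; do
    sleep 1
    size=$(wc -c <"$log")
    if [ "$size" -eq "$last_size" ]; then
      idle=$((idle + 1))
      if [ "$idle" -ge "$idle_limit" ]; then
        echo "no output for ${idle}s - stopping run" >&2
        kill "$pid" 2>/dev/null
        return 124           # mirror timeout(1)'s exit code
      fi
    else
      idle=0; last_size=$size
    fi
  done
  wait "$pid"
}

# Usage: watch_for_silence "claude -p 'your goal'" 600 run.log
```

Unlike a hard timeout, this never interrupts a run that is still producing output — it only fires when the log genuinely goes quiet.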
3. Track Actual Costs Per Job
You can't control what you can't measure. If you're running Claude Code overnight and checking results in the morning without tracking what each job cost, you're flying blind.
OpenHelm surfaces cost estimates per run in the run history. Over time, you develop a calibrated sense of what your jobs actually cost — which ones are reliably efficient, which ones have a habit of looping, and which goals tend to expand in scope.
That calibration is the most valuable cost control you have. It informs how tightly you write future goals and which jobs you're willing to run unattended vs which ones you want to watch.
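If you are running the CLI directly rather than through OpenHelm, a rough per-run ledger is a few lines of shell. This sketch assumes the CLI can emit a JSON result containing a `total_cost_usd` field — Claude Code's headless `--output-format json` does at the time of writing, but verify on your version; the field name and flags are the assumptions here.

```shell
# log_cost LEDGER LABEL   (reads a JSON result on stdin)
# Appends a timestamp, job label, and extracted cost to a CSV ledger.
log_cost() {
  ledger=$1; label=$2
  # Crude field extraction; use jq if it is available.
  cost=$(sed -n 's/.*"total_cost_usd"[^0-9]*\([0-9.]*\).*/\1/p')
  printf '%s,%s,%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$label" "${cost:-unknown}" >>"$ledger"
}

# Usage:
# claude -p "fix the failing test" --output-format json | log_cost costs.csv fix-test
```

A week of rows in that CSV is enough to see which job labels are cheap and stable and which ones have fat tails.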
Where Token Costs Actually Come From
If you want to bring costs down, it helps to know which part of the session is spending them.
Long context sessions. The context window fills up over time. Each request includes the full conversation history, which grows linearly, so the cumulative input tokens billed across a session grow quadratically with the number of turns. An agent working for 4 hours might be doing half the useful work per turn that it was doing in the first 30 minutes — the rest is processing accumulated context.
Repeated failures on the same error. If the same test is failing on every iteration, the agent is re-reading the error, re-reading the code, generating a new (but wrong) response, and billing you for each cycle. The token cost of 10 failed attempts to fix a type error can comfortably exceed the cost of a developer looking at it for 10 minutes.
Reading large codebases without guidance. If you point Claude Code at a 200k-line repository and ask a vague question, it will explore extensively before doing anything. Narrow the working directory, list the files it should care about, or tell it where to start.
The Monitoring Habit That Pays Off
The cheapest cost control is checking run logs the next morning — not because you're looking for problems, but because consistent review builds calibration.
Keep a rough sense of:
- Which goals run reliably vs which ones loop
- What the actual cost was vs your rough estimate going in
- Which categories of work (refactors, tests, documentation) are cost-efficient vs variable
Within a few weeks, you'll have a clear picture of which job types are worth running unattended and which ones need tighter prompting, shorter time windows, or human review before they start.
A Pre-Flight Checklist Before Any Unattended Run
Before scheduling any overnight Claude Code job, run through this:
- [ ] Goal has explicit acceptance criteria — "tests pass", "file exists", "PR opened"
- [ ] Goal names specific files or directories, not general areas
- [ ] There's a maximum run time or explicit iteration limit in the prompt
- [ ] Silence detection is active (timeout flag, or OpenHelm's built-in detection)
- [ ] You've rough-estimated the expected cost and it's in a range you're comfortable with
- [ ] You'll check the run log in the morning before assuming success
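Several of those checklist items can be enforced mechanically with a small wrapper: a hard timeout, a captured log, and a one-line summary you can scan the next morning. The `claude` invocation in the usage note is illustrative, and `timeout(1)` here is the GNU coreutils version.

```shell
# run_job LABEL CMD MAX_SECS LOG
# Runs CMD under a hard timeout, captures output to LOG, and appends a
# one-line summary (timestamp, label, exit code, wall time) to runs.log.
run_job() {
  label=$1; cmd=$2; max_secs=$3; log=$4
  start=$(date +%s)
  timeout "$max_secs" sh -c "$cmd" >"$log" 2>&1
  rc=$?                      # 124 means the hard timeout fired
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) job=$label rc=$rc secs=$(($(date +%s) - start))" >>runs.log
  return "$rc"
}

# Usage:
# run_job nightly-refactor "claude -p 'your goal'" 7200 nightly.log
```

The morning check then starts with one `tail runs.log` rather than opening every log in full.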
The Honest Bottom Line
Running AI coding agents overnight is genuinely useful. The productivity gain — tasks completed while you sleep, backlogs cleared without context-switching — is real and measurable. But it works the way a contractor works: you get value when the job is well-defined and you check the output. You lose money when the scope is vague, nobody's watching, and the meter keeps running.
The developers who get consistent value from AI coding agents have built the same discipline into their workflow: write tight goals, set limits, track costs, review results. That discipline doesn't require exotic tooling — it requires the same good engineering judgement you bring to everything else.