
Stop Babysitting AI Agents: Agentic Coding Workflow Automation

Why most agentic coding setups still require too much supervision — and the workflow design principles that change that.

OpenHelm Team · Product · 7 min read

There was an Ask HN thread a while back titled something like "How do you cope with the broken rhythm of agentic coding?" The top comments converged on the same frustration: you start a Claude Code session, ask it to handle something, and then spend the next 20 minutes hovering over the terminal — glancing at the output, wondering if it's about to do something wrong, itching to step in.

That's not agentic coding. That's interactive coding with extra steps.

The promise of agentic workflows is automation: you define the goal, the agent does the work, and you review the result. What most setups deliver instead is a system that requires near-constant supervision — which is precisely the behaviour it was supposed to replace.

This post is about why that gap exists and what to do about it.

Why You're Still Babysitting

The supervision reflex isn't irrational. Most agentic coding setups do require it. Three failure modes keep pulling developers back to the terminal:

Scope drift. Claude Code finds a related issue while working on the primary goal and decides to address it. Each decision is locally reasonable. Together they turn a 15-minute task into a 90-minute session that's touched files you didn't expect it to touch. You hover because you've been burned by this before.

Silent stalling. The process is running. The output has stopped. You can't tell if it's thinking, stuck on a prompt it can't answer, or blocked on a slow network call. The meter is ticking. This is the failure mode that creates the most anxiety, because it's invisible.

Ambiguous completion. The job "finished," but did it actually do what you wanted? Without a clear verification step built into the workflow, you have to read through the output and manually assess. That assessment can take as long as the task itself, which defeats most of the productivity gain.

Each of these forces you back to the terminal. Together they make unsupervised execution feel risky. But they're all design problems, not fundamental limitations — and they're all addressable.

Designing for Autonomy

The underlying principle is that autonomous execution requires you to answer, up front, every question the agent might need to ask during the run.

That sounds like more work. In practice it's faster than reactive supervision — you spend two minutes specifying carefully before the job starts rather than 20 minutes watching over it while it runs.

Define a verifiable finish line

The most important thing you can add to any agentic workflow is a specific, checkable acceptance criterion.

Bad: "Improve the test coverage for the auth module."

Better: "Write tests for the auth module until jest --coverage --testPathPattern=auth shows ≥80% coverage. Stop once that threshold is met."

The second version gives Claude Code a way to verify its own work. It can run the coverage command, check the number, and decide whether it's done — without asking you. Without that, "done" is ambiguous, and ambiguity creates supervision.
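To make "verify its own work" concrete, here is a minimal sketch of the kind of check that closes the loop. The summary shape loosely mirrors jest's coverage-summary.json output, and the helper name is ours, not a real API:

```python
# Sketch: a concrete "am I done?" check. In a real run the numbers would
# come from jest's coverage-summary.json; this summary dict is illustrative.

def coverage_met(summary: dict, threshold: float = 80.0) -> bool:
    """True once line coverage meets the agreed threshold."""
    return summary["total"]["lines"]["pct"] >= threshold

sample = {"total": {"lines": {"pct": 83.1}}}
print(coverage_met(sample))        # against the 80% bar
print(coverage_met(sample, 90.0))  # against a stricter bar
```

The point is not the ten lines of Python; it is that "done" becomes a boolean the agent can compute rather than a judgment you have to make.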

Bound the scope explicitly

Scope drift happens when Claude Code has latitude it doesn't need. Giving it explicit negative constraints — what to leave alone — is as important as specifying what to change.

"Refactor the session handling module. Don't change any external interfaces or touch the database layer." This isn't about distrust; it's about reducing the surface area the agent needs to reason about. A narrower scope is a more reliable scope.

Build failure handling into the goal

This one is underused and undervalued. Adding a failure instruction to a goal turns expensive loops into useful reports.

"If you can't resolve this in three attempts, stop and write a DIAGNOSIS.md file summarising what you tried, what the error was, and what you'd need to proceed."

That instruction costs you nothing when the job succeeds. When it fails, instead of a four-hour loop that consumed tokens and produced nothing, you get a three-paragraph document you can read in two minutes. The diagnosis often makes the fix obvious.
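The same contract can also live outside the prompt, in whatever harness launches the run. A hypothetical sketch (run_with_diagnosis and the report layout are our assumptions, not OpenHelm behaviour) of capping attempts and turning total failure into a readable report:

```python
import pathlib
import tempfile

def run_with_diagnosis(task, max_attempts=3, out_dir="."):
    """Try `task` up to max_attempts; on total failure, write DIAGNOSIS.md."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            errors.append(f"Attempt {attempt}: {exc}")
    report = pathlib.Path(out_dir) / "DIAGNOSIS.md"
    report.write_text(
        "# Diagnosis\n\n## What was tried\n"
        + "\n".join(f"- {e}" for e in errors)
        + "\n\n## What would be needed to proceed\n- (filled in by the agent)\n"
    )
    return None

# Demo: a task that always fails produces a short report, not an endless loop.
workdir = tempfile.mkdtemp()
def flaky():
    raise RuntimeError("staging database unreachable")
run_with_diagnosis(flaky, out_dir=workdir)
print((pathlib.Path(workdir) / "DIAGNOSIS.md").read_text().splitlines()[0])
```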

Specify what Claude Code can't know

Information that isn't in the codebase belongs in the prompt. This includes:

  • Performance budgets: "This endpoint must respond in under 100ms"
  • Data sensitivity: "This service handles PII — don't add any logging of request bodies"
  • External dependencies: "The staging database is read-only; don't attempt writes"
  • Review preferences: "Open a PR with a description summarising what changed; don't push directly to main"

Anything you'd tell a new team member before letting them loose on a task is information the job prompt should include.
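One low-tech way to make that checklist habitual is to assemble the goal from parts rather than freehand it each time. A minimal sketch; build_goal and the prompt layout are illustrative, not a prescribed format:

```python
def build_goal(task: str, constraints: list[str], acceptance: str) -> str:
    """Fold out-of-band knowledge and a finish line into one job prompt."""
    lines = [task, "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines += ["", f"Done when: {acceptance}"]
    return "\n".join(lines)

print(build_goal(
    "Refactor the session handling module.",
    ["Don't change any external interfaces or touch the database layer.",
     "This service handles PII; don't add any logging of request bodies."],
    "all existing tests pass and no public signatures changed",
))
```

A template like this also makes omissions visible: an empty constraints list is a prompt to ask yourself what a new team member would need to know.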

The Execution Environment

Good goal writing handles the design problems. The execution environment handles the operational ones.

Silence detection is the most important infrastructure feature for autonomous execution. Without it, a stalled process runs indefinitely — consuming tokens, producing nothing. A 10-minute silence threshold is a sensible default: if Claude Code hasn't produced any output in 10 minutes, something is likely wrong, and the run should be flagged and stopped.
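As a rough sketch of the mechanism, assuming a command that streams output to stdout (the threshold is shortened here so the demo finishes quickly):

```python
import subprocess
import sys
import threading
import time

def run_with_silence_limit(cmd, silence_s=600.0, poll_s=0.1):
    """Run cmd; kill it if it produces no output for silence_s seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    last_output = time.monotonic()

    def watch_stdout():
        nonlocal last_output
        for _ in proc.stdout:            # each line proves the run is alive
            last_output = time.monotonic()

    threading.Thread(target=watch_stdout, daemon=True).start()
    while proc.poll() is None:
        if time.monotonic() - last_output > silence_s:
            proc.kill()                  # stalled: stop the meter
            proc.wait()
            return "killed_silent"
        time.sleep(poll_s)
    return "finished"

# Demo with a tiny threshold: a child that goes quiet gets reaped.
stalled = [sys.executable, "-c",
           "print('working', flush=True); import time; time.sleep(60)"]
print(run_with_silence_limit(stalled, silence_s=0.5))
```

Note this watches for silence, not total runtime: a run that keeps producing output is allowed to continue, which is exactly the distinction a plain hard timeout can't make.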

Structured logging means you don't need to watch the run happen. Every run produces a complete output transcript with a clear status. You review results rather than monitor execution.

Pre-flight checks catch broken environments before Claude Code launches into them. Does the project directory exist? Is the Claude Code binary accessible? Is the job still valid? Failing fast on these saves both time and API cost.
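Pre-flight checks can be as simple as a few filesystem lookups. A sketch, assuming the agent binary is expected on PATH under a name like claude (the helper name and check list are ours):

```python
import os
import shutil

def preflight(project_dir: str, binary: str = "claude") -> list[str]:
    """Cheap checks before a launch; returns blockers, empty when clear."""
    problems = []
    if not os.path.isdir(project_dir):
        problems.append(f"project directory missing: {project_dir}")
    if shutil.which(binary) is None:
        problems.append(f"binary not on PATH: {binary}")
    return problems

# Failing fast here costs milliseconds; failing mid-run costs tokens.
for problem in preflight("/no/such/project", binary="no-such-binary"):
    print(problem)
```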

Self-correction means failed runs don't just stop — they queue a corrective retry with the failure output included as context. Claude Code's second attempt works with information about what went wrong on the first. This handles a meaningful percentage of transient failures automatically.
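Self-correction is mostly prompt plumbing: feed the tail of the failed run back in as context for the retry. A sketch; corrective_goal and its wording are illustrative, not a real API:

```python
def corrective_goal(original_goal: str, failure_output: str) -> str:
    """Build a retry prompt carrying the first attempt's failure forward."""
    return (
        original_goal
        + "\n\nA previous attempt failed. Its output ended with:\n"
        + failure_output[-2000:]   # tail only, to bound prompt size
        + "\nDiagnose that failure before retrying the original goal."
    )

print(corrective_goal(
    "Run the test suite and fix any failures.",
    "FAIL src/auth/session.test.ts - timeout exceeded (5000 ms)",
))
```

Truncating to the tail is a deliberate trade-off: the last lines of a failed run usually contain the error, and an unbounded transcript would crowd out the goal itself.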

OpenHelm provides all of these in a macOS desktop app designed specifically for running Claude Code jobs unattended. But the principles apply regardless of what scheduler you're using — a hard timeout in a cron script approximates silence detection, and output redirection gives you structured logging. The infrastructure doesn't need to be sophisticated; it needs to be present.

Making the Shift

The developers who have actually escaped the babysitting loop describe a consistent pattern: they stopped thinking of Claude Code as an interactive tool and started treating it as a scheduled collaborator.

That means:

  • Before a run: Write a complete goal with acceptance criteria, scope boundaries, and failure handling
  • During a run: Don't watch it. Set an alert for completion. Do something else.
  • After a run: Review the output, not the execution. Trust the structured log over your memory of the terminal

The hovering isn't just a waste of time — it actively prevents the cognitive shift that makes agentic automation valuable. The goal isn't to watch better; it's to design runs you can trust well enough to walk away from.

Most of the work is in the goal. Not in the prompt engineering sense of obsessing over wording — but in the sense of doing the five minutes of thinking up front that replaces the 20 minutes of supervision during. That's the automation.
