
Building With Claude Code: What We Learned

Six months of integrating Claude Code as a core execution engine — what works, what doesn't, and where we're headed.

OpenHelm Team · Engineering
9 min read

OpenHelm is, at its core, a desktop app that manages Claude Code. We schedule it, feed it prompts, stream its output, and surface its results. Building that layer has taught us a lot about what makes agentic coding work well — and what makes it fail.

This post is a fairly candid account of what we've learned over six months of integrating Claude Code as a first-class execution engine.

What Claude Code Is Good At

After extensive testing, we've found Claude Code excels at:

Incremental, well-scoped tasks. "Add input validation to these three functions." "Update this API client to handle rate limiting." "Write tests for the utility module." These tasks have clear success criteria and bounded scope. Claude Code's pass rate on these is very high.

Reading and understanding existing code. Before it can modify something, Claude Code reads it. It's genuinely good at this — understanding patterns, inferring intent, recognising conventions. This matters because most real-world tasks require understanding context, not just executing instructions.

Iterating on failures. When a test fails, Claude Code reads the error, looks at the relevant code, and tries a fix. It usually succeeds within 2-3 iterations. This is the capability that makes goal-based automation practical — the system can handle things that don't work the first time.

Documentation and prose. Generating docstrings, README sections, PR descriptions, commit messages. The output is usually high-quality and saves significant time.

What Claude Code Struggles With

Very large codebases without guidance. If you give Claude Code a goal and a 500k-line codebase with no context about where to start, it can spend a lot of time exploring. Narrow the scope in the job prompt.

Architectural decisions. "Refactor this to be more scalable" is not a good goal. What does "more scalable" mean? In which direction? Claude Code needs concrete, verifiable success criteria. Architectural vision is still a human job.

Tasks with external state. Goals that require interacting with third-party services (Stripe, Salesforce, your company's internal tools) are harder. Claude Code can write the code, but testing it requires real credentials and real API calls. Scope goals so they don't require external service interaction unless you've set up a sandbox.

Long uninterrupted sessions. We've found that jobs taking more than about 2 hours tend to drift — context accumulates and the quality of later work can drop. Our recommendation: break large goals into smaller, focused jobs with clear handoff points.

How We Structure Goals for Best Results

The goal quality tips in our other posts are based on hard-won experience. Here's what we've added to our internal guidance:

Specify the testing protocol. Don't just say "make the tests pass." Say "run npm test -- --testPathPattern=auth and make sure all tests in that suite pass." Explicit commands prevent ambiguity.

Tell it what to leave alone. "Refactor the payment module. Don't change the API interface." Negative constraints are as important as positive ones.

Give it context that isn't in the code. "This service handles PII — don't log sensitive fields." "The performance budget for this endpoint is 100ms." Information that you'd give a new team member is information the job prompt should include.
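To make the three tips concrete, here is a hypothetical job prompt that applies all of them at once. The module names and test command are invented for illustration; it is held in a TypeScript template string only so it can sit alongside the other code in this post.

```typescript
// Illustrative job prompt combining an explicit testing protocol,
// a negative constraint, and context that isn't in the code.
// All specifics (module, suite name, API) are made up for this example.
const jobPrompt = `
Refactor the payment module to remove the duplicated retry logic.

Testing protocol: run \`npm test -- --testPathPattern=payments\` and make
sure every test in that suite passes.

Do not change: the public API of the payment client, or any exported types.

Context: this service handles PII; never log card numbers or email addresses.
`;
```

The structure matters more than the wording: a verifiable success command, an explicit "leave this alone" clause, and the kind of background you would give a new team member.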

The Execution Layer

Running Claude Code headlessly — with real-time log streaming, clean process management, and failure detection — turned out to be more involved than we expected.

Claude Code runs in a streaming JSON output mode. OpenHelm reads this stream line by line, displays it in the run log view, and detects when the process has finished. We also detect when Claude Code appears to be waiting for interactive input (a common failure mode in headless execution) by monitoring for silence — if no output appears for 10 minutes, something is likely stuck.
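A minimal sketch of that reading loop, in TypeScript for Node. This is not OpenHelm's actual code: it spawns a command, parses each stdout line as a JSON event, and resets an inactivity timer on every line so a long silence can be flagged as a likely-stuck run.

```typescript
import { spawn } from "node:child_process";
import * as readline from "node:readline";

// Sketch: consume a child process's newline-delimited JSON stdout,
// and call onStuck if no output arrives for idleMs milliseconds.
// Resolves with the child's exit code once the process closes.
function watchStream(
  cmd: string,
  args: string[],
  idleMs: number,
  onEvent: (event: unknown) => void,
  onStuck: () => void,
): Promise<number | null> {
  return new Promise((resolve) => {
    const child = spawn(cmd, args, { stdio: ["ignore", "pipe", "inherit"] });
    let timer = setTimeout(onStuck, idleMs);
    const rl = readline.createInterface({ input: child.stdout! });
    rl.on("line", (line) => {
      clearTimeout(timer);                 // any output resets the silence clock
      timer = setTimeout(onStuck, idleMs);
      try {
        onEvent(JSON.parse(line));         // each line is one JSON event
      } catch {
        // tolerate non-JSON noise rather than failing the whole run
      }
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      resolve(code);
    });
  });
}
```

In a real integration the onStuck handler would also decide whether to kill the child and surface the stall in the UI; the sketch only shows the detection side.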

Exit codes are the simplest reliability signal. Exit 0 means success; anything else means failure, and the failure is fed into the self-correction loop if enabled.
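As a sketch, that check reduces to a tiny classifier. The names here are illustrative, not OpenHelm's API: exit 0 is success, and any other code (including a null code from a signal kill) is routed into retry only when self-correction is enabled.

```typescript
type RunOutcome = "success" | "retry" | "failed";

// Sketch: map a child process exit code to a run outcome.
// A null code means the process was killed by a signal, which
// we treat the same as a nonzero exit.
function classifyExit(
  code: number | null,
  selfCorrectEnabled: boolean,
): RunOutcome {
  if (code === 0) return "success";
  return selfCorrectEnabled ? "retry" : "failed";
}
```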

The Developer Experience Lesson

The most important lesson from six months of building with Claude Code: the quality of the job prompt matters as much as the quality of the execution engine.

OpenHelm can't fix a vague prompt. If you ask for "improvements," you'll get improvements — but they might not be the ones you wanted. The developers who get consistent, high-quality results are the ones who've learned to write good prompts: specific, scoped, with clear acceptance criteria.

That's a skill, and it takes practice. We're working on better tooling to help — including a correction note system that lets you accumulate guidance for specific recurring failures, and an AI planning layer that structures your high-level goals into well-scoped job prompts automatically.

What's Next

The roadmap for Claude Code integration includes:

  • Multi-session parallelism (run multiple goals simultaneously on different projects)
  • Better cost transparency (per-run token tracking in the UI)
  • MCP server integration for goals that need to query external tools
  • Team sharing features in the upcoming Cloud tier

If you're building on OpenHelm and have feedback on the execution quality, we want to hear it. The GitHub Discussions board is the right place.
