Case Study: How Ramp Automated 83% of Expense Categorization
Deep dive into Ramp's AI agent implementation: architecture, challenges, results, and lessons learned from automating expense categorization at scale.

TL;DR
- Ramp automated 83% of expense categorization using a multi-agent system, reducing finance team workload by 12 hours/week.
- Implementation timeline: 12 weeks from kickoff to production (4 weeks design, 6 weeks build, 2 weeks testing).
- Results: 96% categorization accuracy, £127K wasteful SaaS spend flagged annually, monthly close time reduced from 4-6 hours to 10 minutes.
- Architecture: Three-agent parallel execution (categorizer, department assigner, anomaly detector) with human oversight for edge cases.
- Key lesson: Started with historical data (2+ years of labeled transactions) for training; accuracy improved from 76% (zero-shot) to 96% (fine-tuned).
Ramp processes millions of corporate card transactions monthly. Before agent automation, their finance team spent 15-20 hours weekly manually categorizing expenses: clicking dropdown menus, cross-referencing merchant names with departments, and flagging unusual charges.
In Q4 2024, they deployed an AI agent system that handles 83% of this work autonomously. Here's how they did it, what went wrong, and what they learned.
The Problem
Manual expense categorization bottleneck:
- 45,000 transactions/month across 800 companies
- Average 2 minutes per transaction for complex cases (international charges, new vendors, ambiguous merchants)
- Finance team: 3 people spending 50% time on categorization
- Monthly close delayed 2-3 days waiting for categorization completion
- Errors: 8-12% miscategorization rate (audit findings)
Cost of status quo:
- 60 hours/month × £45/hour = £2,700/month labor cost
- Delayed close = delayed financial reporting
- Miscategorization = wrong budgets, tax issues
"Agent orchestration is where the real value lives. Individual AI capabilities matter less than how well you coordinate them into coherent workflows." - James Park, Founder of AI Infrastructure Labs
Solution Architecture
Ramp built a three-agent parallel execution system:
Agent 1: Expense Categorizer
- Task: Assign accounting category (software, ads, travel, meals, office, contractor, other)
- Input: Merchant name, amount, description, date
- Model: Fine-tuned GPT-4 on 50,000 labeled historical transactions
- Output: Category + confidence score
Agent 2: Department Assigner
- Task: Attribute expense to department (engineering, sales, marketing, ops)
- Input: Transaction + employee data (title, department, manager)
- Model: GPT-4 Turbo with few-shot examples
- Output: Department + reasoning
Agent 3: Anomaly Detector
- Task: Flag unusual patterns (duplicates, amount >2x median, new vendors, international charges)
- Input: Transaction + 12-month spending history
- Model: GPT-4 Turbo + rule-based checks
- Output: Anomaly flags with explanation
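The rule-based half of the anomaly detector can be sketched in a few lines. This is a minimal illustration, not Ramp's code: the `Transaction` shape, the per-merchant median, and the use of currency to spot international charges are all assumptions. The £500 new-vendor threshold is the one described under Challenge 3.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass(frozen=True)
class Transaction:
    # Hypothetical record shape; the article does not specify one.
    merchant: str
    amount: float
    currency: str
    day: date


def rule_based_flags(
    txn: Transaction,
    history: list[Transaction],
    home_currency: str = "GBP",
    new_vendor_min_amount: float = 500.0,
) -> list[str]:
    """Return human-readable anomaly flags for one transaction."""
    flags: list[str] = []
    same_merchant = [h for h in history if h.merchant == txn.merchant]

    # Duplicate: same merchant, amount, and day already present in the history.
    if any(h.amount == txn.amount and h.day == txn.day for h in same_merchant):
        flags.append("possible duplicate")

    if same_merchant:
        # Assumption: the "2x median" rule is applied per merchant.
        typical = median(h.amount for h in same_merchant)
        if txn.amount > 2 * typical:
            flags.append(f"amount exceeds 2x median ({typical:.2f})")
    elif txn.amount > new_vendor_min_amount:
        # New vendors are flagged only above the threshold.
        flags.append("new vendor above threshold")

    # Assumption: a non-home currency marks an international charge.
    if txn.currency != home_currency:
        flags.append("international charge")
    return flags
```

In the hybrid setup the article describes, flags like these would sit alongside the LLM's explanation rather than replace it.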
Orchestrator:
- Runs 3 agents in parallel (reduces latency from 6s to 2s)
- Aggregates results
- If any agent confidence <85% OR anomaly flagged → escalate to human
- Otherwise: Auto-categorize and update QuickBooks
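Here is a rough sketch of what that orchestrator logic might look like using Python's `asyncio.gather`. The three agent functions are hypothetical stand-ins that return canned results (a real version would call the models), and the 10,000 amount limit in the anomaly stub is a placeholder. Only the 85% threshold and the escalation rule come from the description above.

```python
import asyncio
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.85  # escalation cut-off quoted in the article


@dataclass
class AgentResult:
    value: str
    confidence: float
    flags: list[str] = field(default_factory=list)


# Hypothetical stand-ins for the three agents; real versions would call a model.
async def categorize(txn: dict) -> AgentResult:
    await asyncio.sleep(0.01)  # simulated model latency
    return AgentResult("software", 0.97)


async def assign_department(txn: dict) -> AgentResult:
    await asyncio.sleep(0.01)
    return AgentResult("engineering", 0.91)


async def detect_anomalies(txn: dict) -> AgentResult:
    await asyncio.sleep(0.01)
    # Placeholder rule for illustration only.
    flags = ["amount over limit"] if txn["amount"] > 10_000 else []
    return AgentResult("anomaly-check", 1.0, flags)


async def orchestrate(txn: dict) -> dict:
    # The agents are independent, so they run concurrently and total
    # latency is roughly that of the slowest agent, not the sum.
    category, department, anomalies = await asyncio.gather(
        categorize(txn), assign_department(txn), detect_anomalies(txn)
    )
    low_confidence = any(
        r.confidence < CONFIDENCE_THRESHOLD
        for r in (category, department, anomalies)
    )
    escalate = low_confidence or bool(anomalies.flags)
    return {
        "category": category.value,
        "department": department.value,
        "flags": anomalies.flags,
        "action": "escalate_to_human" if escalate else "auto_categorize",
    }


if __name__ == "__main__":
    print(asyncio.run(orchestrate({"merchant": "GitHub", "amount": 84.0})))
```

The write to the accounting system is deliberately left out; in this sketch it would hang off the `auto_categorize` action.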
Implementation Timeline
Week 1-4: Data Preparation & Design
- Exported 2 years of transaction history (120,000 transactions)
- Manually labeled 10,000 for validation set
- Designed agent architecture (initial plan: sequential, changed to parallel for speed)
- Selected fine-tuning vs RAG (chose fine-tuning for stable category taxonomy)
Week 5-10: Build & Training
- Fine-tuned GPT-4 on categorization (50K examples, £1,800 training cost)
- Built orchestrator logic with parallel execution
- Integrated with Ramp API and QuickBooks API
- Implemented human approval queue for low-confidence cases
Week 11-12: Testing & Iteration
- Shadow mode: Agent categorized but didn't write to QuickBooks (finance team reviewed)
- Measured accuracy: 84% initially
- Iterated on prompts and added edge case handling: 96% accuracy
- Load testing: 1,000 transactions/hour with <2s latency
Week 13: Production Rollout
- Deployed to 10 pilot customers (3,000 transactions)
- Monitored for errors, accuracy held at 96%
- Full rollout to all customers
Total: 12 weeks, £45,000 engineering cost + £2,500 training
Results (After 6 Months)
Automation Rate:
- 83% of transactions auto-categorized (37,350/month)
- 17% escalated to humans (7,650/month) - complex cases, low confidence, or anomalies
Accuracy:
- 96% categorization accuracy (vs 92% human baseline from audits)
- 99.2% department assignment accuracy
Time Savings:
- Finance team: 15 hrs/week → 3 hrs/week (reviewing escalations only)
- 12 hours/week saved = 624 hours/year
- Annual value: £28,080 (at £45/hour)
Additional Value:
- £127K wasteful spend flagged: Unused SaaS seats, duplicate tools, forgotten subscriptions
- Monthly close time: 4-6 hours → 10 minutes (automated report generation)
- Error rate: 8% → 4% (fewer miscategorizations)
ROI:
- Build cost: £47,500
- Annual ongoing cost: £14,400 (API costs, maintenance)
- Annual value: £28,080 (time saved) + £127K (waste eliminated) = £155,080
- Payback: 3.7 months
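The ROI arithmetic is easy to reproduce. The sketch below uses only the figures quoted above. It also computes a payback net of ongoing costs, which the article does not state, and it assumes (as the article's total does) that all flagged waste is actually recovered.

```python
BUILD_COST = 47_500            # £, engineering plus training
ANNUAL_ONGOING_COST = 14_400   # £, API costs and maintenance
HOURS_SAVED_PER_WEEK = 12
HOURLY_RATE = 45               # £
WASTE_FLAGGED = 127_000        # £ per year

annual_time_value = HOURS_SAVED_PER_WEEK * 52 * HOURLY_RATE
annual_value = annual_time_value + WASTE_FLAGGED

# Payback as quoted: build cost divided by gross monthly value.
gross_payback_months = BUILD_COST / (annual_value / 12)

# Not in the article: the same calculation after subtracting ongoing costs.
net_payback_months = BUILD_COST / ((annual_value - ANNUAL_ONGOING_COST) / 12)

print(f"Annual time value: £{annual_time_value:,}")         # £28,080
print(f"Annual value:      £{annual_value:,}")              # £155,080
print(f"Gross payback:     {gross_payback_months:.1f} months")  # 3.7
print(f"Net payback:       {net_payback_months:.1f} months")    # 4.1
```

So the 3.7-month figure is a gross payback. Once the £14,400 of annual running costs is netted off, payback is closer to 4.1 months.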
Technical Insights
Why fine-tuning over RAG:
Initially considered RAG (retrieve similar past transactions, include in prompt). Chose fine-tuning because:
- Category taxonomy stable (8 categories, rarely change)
- 50K labeled examples available (strong training signal)
- Wanted low latency (<2s) - RAG adds vector search overhead
- Fine-tuned model achieved 96% vs 89% for RAG
Parallel vs sequential execution:
Originally designed sequential (categorize → assign department → detect anomalies). Changed to parallel because:
- Agents don't depend on each other's outputs
- Parallel reduced latency: 6s → 2s
- Slight implementation complexity but worth the speed gain
Human-in-the-loop design:
Tier 1 (autonomous): Confidence ≥85%, no anomalies (83% of transactions)
Tier 2 (notify): Confidence 70-85% (12% of transactions) - auto-categorize but notify finance team
Tier 3 (approve): Confidence <70% OR anomaly flagged (5% of transactions) - requires human review before categorizing
This tiered approach built trust: the finance team saw that the agent wasn't blindly categorizing everything.
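The three tiers map naturally onto a small routing function. This is a sketch under assumptions: the `Tier` enum and `route` function are illustrative names, and a confidence of exactly 70% is treated as belonging to the notify tier, a boundary the article leaves unspecified.

```python
from enum import Enum


class Tier(Enum):
    AUTONOMOUS = "auto-categorize"
    NOTIFY = "auto-categorize and notify finance"
    APPROVE = "hold for human approval"


def route(confidence: float, anomaly_flagged: bool) -> Tier:
    """Pick a handling tier from model confidence (0.0-1.0) and anomaly status."""
    # Tier 3: any anomaly, or confidence below 70%, needs human approval first.
    if anomaly_flagged or confidence < 0.70:
        return Tier.APPROVE
    # Tier 2: 70% up to (but not including) 85% is applied, with a notification.
    if confidence < 0.85:
        return Tier.NOTIFY
    # Tier 1: 85% and above with no anomalies is fully autonomous.
    return Tier.AUTONOMOUS
```

For example, `route(0.78, False)` returns `Tier.NOTIFY`, while `route(0.99, True)` returns `Tier.APPROVE` because an anomaly overrides high confidence.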
Challenges & Solutions
Challenge 1: International merchant names
Problem: Agent struggled with non-English merchant names (e.g., "株式会社ABC" instead of "ABC Corporation")
Solution: Added a translation step (detect language, translate to English, then categorize)
Result: Accuracy on international transactions improved from 68% to 91%
Challenge 2: Ambiguous merchants
Problem: "Amazon" could be AWS (software), Amazon Business (office supplies), or Amazon Marketplace (various)
Solution: Added category hints to prompt based on amount patterns:
- <£50 typically office supplies
- £50-500 could be office or software
- >£500 likely AWS
Also checked employee department (engineers → likely AWS, ops → likely supplies)
Result: Amazon categorization accuracy: 73% → 94%
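As an illustration of the hint logic, the function below turns the amount bands and department cues into a sentence that could be appended to the categorization prompt. The function name, the wording of the hints, and the treatment of exactly £500 as part of the middle band are assumptions; the article gives only the thresholds.

```python
def amazon_category_hint(amount: float, department: str) -> str:
    """Build a prompt hint for an ambiguous Amazon charge (amount in £)."""
    if amount < 50:
        amount_hint = "typically office supplies"
    elif amount <= 500:
        amount_hint = "could be office supplies or software"
    else:
        amount_hint = "likely AWS (software)"

    # Department cues from the article; other departments add no hint.
    department_hints = {
        "engineering": "engineers usually mean AWS",
        "ops": "ops usually means supplies",
    }

    hint = f"Amount £{amount:,.2f}: {amount_hint}."
    dept_hint = department_hints.get(department.lower())
    if dept_hint:
        hint += f" Department {department}: {dept_hint}."
    return hint
```

The hint only nudges the model. The final category still comes from the categorizer, so a large office-supplies order is not forced into the AWS bucket.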
Challenge 3: New vendor false positives
Problem: Anomaly detector flagged every new vendor as suspicious
Solution: Changed logic: Flag only if new vendor AND amount >£500
Result: False positive rate: 42% → 8%
Challenge 4: Finance team resistance
Problem: The team was initially skeptical ("AI will make mistakes, I'll have to fix them anyway")
Solution:
- Ran shadow mode for 2 weeks, which showed the agent's 96% accuracy matched human performance
- Positioned as "handles boring stuff, you focus on complex cases"
- No headcount reduction (redeployed to financial analysis)
Result: Full buy-in after shadow mode demonstration
Key Lessons Learned
1. Historical data is gold
Access to 2+ years of labeled transactions enabled fine-tuning. Companies without historical data should start with RAG or zero-shot and gradually build a labeled dataset.
2. Start with high-confidence only
Week 1 of production: Only auto-categorized transactions with ≥95% confidence (40% of volume). Gradually lowered threshold to 85% as team gained trust.
3. Anomaly detection requires domain rules
Pure LLM anomaly detection had 38% false positive rate. Hybrid approach (LLM + rule-based checks) reduced to 8%.
4. Parallel execution worth the complexity
The 3x speedup (6s → 2s) made the user experience dramatically better. Implementation took an extra 2 weeks but paid off.
5. Monthly accuracy reviews essential
Ramp reviews 100 random transactions monthly to ensure accuracy hasn't degraded. Found minor drift after 3 months (96% → 93%), retrained model, back to 96%.
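A monthly review like this needs very little tooling. The sketch below draws a random sample, scores the agent against reviewer labels, and signals when accuracy has slipped. The function names and the 2-point tolerance are assumptions; the article gives only the sample size of 100 and the 96% baseline.

```python
import random
from typing import Optional


def sample_for_review(
    transaction_ids: list[str], sample_size: int = 100, seed: Optional[int] = None
) -> list[str]:
    """Pick a random sample of transactions for manual review."""
    rng = random.Random(seed)
    return rng.sample(transaction_ids, min(sample_size, len(transaction_ids)))


def review_accuracy(predicted: dict[str, str], reviewed: dict[str, str]) -> float:
    """Share of reviewed transactions where the agent matched the reviewer."""
    if not reviewed:
        raise ValueError("no reviewed transactions")
    correct = sum(predicted[tid] == label for tid, label in reviewed.items())
    return correct / len(reviewed)


def needs_retraining(
    accuracy: float, baseline: float = 0.96, tolerance: float = 0.02
) -> bool:
    """Signal drift when accuracy falls more than `tolerance` below baseline."""
    return accuracy < baseline - tolerance
```

With these defaults, the 93% reading mentioned above would trigger retraining, while a month at 95% would not.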
Replication Guide
To implement a similar system:
Requirements:
- 10,000+ historical labeled transactions (for fine-tuning) OR start with RAG
- API access to expense system (Ramp, Brex, Expensify, etc.)
- API access to accounting system (QuickBooks, Xero, NetSuite)
Timeline:
- With fine-tuning: 10-14 weeks
- With RAG: 6-8 weeks
Team:
- 1-2 engineers (full-time for 8-12 weeks)
- 1 finance lead (25% time for requirements and validation)
Cost:
- Engineering: £40K-60K
- Training (if fine-tuning): £1,500-3,000
- Ongoing API costs: £800-1,500/month (for 50K transactions)
Expected Results:
- 75-85% automation rate
- 90-96% accuracy (with iteration)
- 10-15 hours/week saved
- 3-6 month payback period
Conclusion
Ramp's expense automation agent demonstrates that AI agents can reliably handle high-volume, judgment-based workflows when implemented thoughtfully.
Key success factors:
- Sufficient training data (50K labeled examples)
- Human-in-the-loop for edge cases
- Parallel execution for performance
- Continuous monitoring and retraining
If you're considering similar automation: Start with shadow mode, measure accuracy rigorously, and expand autonomy gradually as trust builds.
The technology works. The challenge is implementation discipline.
---
Frequently Asked Questions
Q: What's the typical ROI timeline for AI agent implementations?
Most organisations see positive ROI within 3-6 months of deployment. Initial productivity gains of 20-40% are common, with improvements compounding as teams optimise prompts and workflows based on production experience.
Q: How do AI agents handle errors and edge cases?
Well-designed agent systems include fallback mechanisms, human-in-the-loop escalation, and retry logic. The key is defining clear boundaries for autonomous action versus requiring human approval for sensitive or unusual situations.
Q: How long does it take to implement an AI agent workflow?
Implementation timelines vary based on complexity, but most teams see initial results within 2-4 weeks for simple workflows. More sophisticated multi-agent systems typically require 6-12 weeks for full deployment with proper testing and governance.