Case Study: How Ramp Automated 83% of Expense Categorization

A deep dive into Ramp's AI agent implementation: architecture, challenges, results, and lessons learned from automating expense categorization at scale.

Max Beech · Founder · 9 min read

TL;DR

  • Ramp automated 83% of expense categorization using a multi-agent system, reducing finance team workload by 12 hours/week.
  • Implementation timeline: 12 weeks from kickoff to production (4 weeks design, 6 weeks build, 2 weeks testing).
  • Results: 96% categorization accuracy, £127K wasteful SaaS spend flagged annually, monthly close time reduced from 4-6 hours to 10 minutes.
  • Architecture: Three-agent parallel execution (categorizer, department assigner, anomaly detector) with human oversight for edge cases.
  • Key lesson: Start with historical data (2+ years of labeled transactions) for training. Accuracy improved from 76% (zero-shot) to 96% (fine-tuned).

Ramp processes millions of corporate card transactions monthly. Before agent automation, their finance team spent 15-20 hours weekly manually categorizing expenses: clicking dropdown menus, cross-referencing merchant names with departments, and flagging unusual charges.

In Q4 2024, they deployed an AI agent system that handles 83% of this work autonomously. Here's how they did it, what went wrong, and what they learned.

The Problem

Manual expense categorization bottleneck:

  • 45,000 transactions/month across 800 companies
  • Average 2 minutes per transaction for complex cases (international charges, new vendors, ambiguous merchants)
  • Finance team: 3 people spending 50% time on categorization
  • Monthly close delayed 2-3 days waiting for categorization completion
  • Errors: 8-12% miscategorization rate (audit findings)

Cost of status quo:

  • 60 hours/month × £45/hour = £2,700/month labor cost
  • Delayed close = delayed financial reporting
  • Miscategorization = wrong budgets, tax issues

"Agent orchestration is where the real value lives. Individual AI capabilities matter less than how well you coordinate them into coherent workflows." - James Park, Founder of AI Infrastructure Labs

Solution Architecture

Ramp built a three-agent parallel execution system:

Agent 1: Expense Categorizer

  • Task: Assign accounting category (software, ads, travel, meals, office, contractor, other)
  • Input: Merchant name, amount, description, date
  • Model: Fine-tuned GPT-4 on 50,000 labeled historical transactions
  • Output: Category + confidence score

Agent 2: Department Assigner

  • Task: Attribute expense to department (engineering, sales, marketing, ops)
  • Input: Transaction + employee data (title, department, manager)
  • Model: GPT-4 Turbo with few-shot examples
  • Output: Department + reasoning

Agent 3: Anomaly Detector

  • Task: Flag unusual patterns (duplicates, amount >2x median, new vendors, international charges)
  • Input: Transaction + 12-month spending history
  • Model: GPT-4 Turbo + rule-based checks
  • Output: Anomaly flags with explanation
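
The rule-based half of the anomaly detector can be sketched in a few lines. This is a minimal illustration, not Ramp's implementation: the `Transaction` fields, the flag names, the choice to compute the median per merchant, and the use of currency as the "international" signal are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass(frozen=True)
class Transaction:
    merchant: str
    amount: float
    day: date
    currency: str = "GBP"  # assumption: a non-home currency marks an international charge


def rule_flags(txn: Transaction, history: list[Transaction], home_currency: str = "GBP") -> list[str]:
    """Return rule-based anomaly flags for one transaction, given prior history."""
    flags = []
    same_merchant = [h for h in history if h.merchant == txn.merchant]

    # Duplicate: same merchant, amount, and date already present in history.
    if any(h.amount == txn.amount and h.day == txn.day for h in same_merchant):
        flags.append("duplicate")

    # Amount more than 2x the median of past charges (assumed here: per merchant).
    if same_merchant and txn.amount > 2 * median(h.amount for h in same_merchant):
        flags.append("amount_over_2x_median")

    # New vendor: no prior charges from this merchant.
    if not same_merchant:
        flags.append("new_vendor")

    if txn.currency != home_currency:
        flags.append("international")
    return flags
```

In a hybrid setup like the one described, flags of this kind would be passed to the model alongside the 12-month history so that it can write the explanation.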

Orchestrator:

  • Runs 3 agents in parallel (reduces latency from 6s to 2s)
  • Aggregates results
  • If any agent confidence <85% OR anomaly flagged → escalate to human
  • Otherwise: Auto-categorize and update QuickBooks
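
As a rough sketch of the orchestrator pattern, the three agents can be run concurrently with `asyncio.gather` and the escalation rule applied to the combined results. The agent functions below are hypothetical stubs that return fixed values; a real system would call the model APIs and then write to the accounting system.

```python
import asyncio
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # from the escalation rule above


@dataclass
class AgentResult:
    agent: str
    value: str
    confidence: float = 1.0
    anomaly: bool = False


# Hypothetical stand-ins for the three agents; real ones would call a model API.
async def categorize(txn: dict) -> AgentResult:
    await asyncio.sleep(0.01)
    return AgentResult("categorizer", "software", confidence=0.93)


async def assign_department(txn: dict) -> AgentResult:
    await asyncio.sleep(0.01)
    return AgentResult("department", "engineering", confidence=0.97)


async def detect_anomalies(txn: dict) -> AgentResult:
    await asyncio.sleep(0.01)
    flagged = txn["amount"] > 10_000  # placeholder rule for illustration only
    return AgentResult("anomaly", "large amount" if flagged else "none", anomaly=flagged)


async def process(txn: dict) -> dict:
    # Run the three agents concurrently; total latency tracks the slowest agent.
    results = await asyncio.gather(
        categorize(txn), assign_department(txn), detect_anomalies(txn)
    )
    # Escalate if any agent is under the threshold or an anomaly was flagged.
    escalate = any(r.confidence < CONFIDENCE_THRESHOLD or r.anomaly for r in results)
    return {
        "action": "escalate_to_human" if escalate else "auto_categorize",
        "results": {r.agent: r.value for r in results},
    }
```

Calling `asyncio.run(process({"merchant": "GitHub", "amount": 40.0}))` with these stubs returns an `auto_categorize` action, because no agent falls below the threshold and nothing is flagged.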

Implementation Timeline

Week 1-4: Data Preparation & Design

  • Exported 2 years of transaction history (120,000 transactions)
  • Manually labeled 10,000 for validation set
  • Designed agent architecture (initial plan: sequential, changed to parallel for speed)
  • Compared fine-tuning with RAG (chose fine-tuning because the category taxonomy is stable)

Week 5-10: Build & Training

  • Fine-tuned GPT-4 on categorization (50K examples, £1,800 training cost)
  • Built orchestrator logic with parallel execution
  • Integrated with Ramp API and QuickBooks API
  • Implemented human approval queue for low-confidence cases

Week 11-12: Testing & Iteration

  • Shadow mode: Agent categorized but didn't write to QuickBooks (finance team reviewed)
  • Measured accuracy: 84% initially
  • Iterated on prompts and added edge case handling: 96% accuracy
  • Load testing: 1,000 transactions/hour with <2s latency

Week 13: Production Rollout

  • Deployed to 10 pilot customers (3,000 transactions)
  • Monitored for errors, accuracy held at 96%
  • Full rollout to all customers

Total: 12 weeks, £45,000 engineering cost + £2,500 training

Results (After 6 Months)

Automation Rate:

  • 83% of transactions auto-categorized (37,350/month)
  • 17% escalated to humans (7,650/month) - complex cases, low confidence, or anomalies

Accuracy:

  • 96% categorization accuracy (vs 92% human baseline from audits)
  • 99.2% department assignment accuracy

Time Savings:

  • Finance team: 15 hrs/week → 3 hrs/week (reviewing escalations only)
  • 12 hours/week saved = 624 hours/year
  • Annual value: £28,080 (at £45/hour)

Additional Value:

  • £127K wasteful spend flagged: Unused SaaS seats, duplicate tools, forgotten subscriptions
  • Monthly close time: 4-6 hours → 10 minutes (automated report generation)
  • Error rate: 8% → 4% (fewer miscategorizations)

ROI:

  • Build cost: £47,500
  • Annual ongoing cost: £14,400 (API costs, maintenance)
  • Annual value: £28,080 (time saved) + £127K (wasteful spend flagged) = £155,080
  • Payback: 3.7 months
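
The payback figure can be reproduced from the numbers above. The short calculation below shows that 3.7 months corresponds to build cost divided by gross monthly value. Netting out the ongoing cost, which is an extra step not taken in the original figure, gives a slightly longer payback. Both versions assume the full £127K of flagged spend is actually recovered.

```python
build_cost = 47_500             # £, engineering + training
ongoing_cost = 14_400           # £ per year, API costs and maintenance
hours_saved = 12 * 52           # 624 hours per year
time_value = hours_saved * 45   # £28,080 at £45/hour
waste_flagged = 127_000         # £ per year
annual_value = time_value + waste_flagged  # £155,080

# Gross payback: ignores ongoing costs (matches the 3.7 months quoted above).
payback_gross = build_cost / (annual_value / 12)

# Net payback: subtracts ongoing costs from annual value first.
payback_net = build_cost / ((annual_value - ongoing_cost) / 12)

print(f"Gross payback: {payback_gross:.1f} months")  # Gross payback: 3.7 months
print(f"Net payback: {payback_net:.1f} months")      # Net payback: 4.1 months
```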

Technical Insights

Why fine-tuning over RAG:

Ramp initially considered RAG (retrieving similar past transactions and including them in the prompt) but chose fine-tuning because:

  • Category taxonomy stable (8 categories, rarely change)
  • 50K labeled examples available (strong training signal)
  • Wanted low latency (<2s) - RAG adds vector search overhead
  • Fine-tuned model achieved 96% vs 89% for RAG

Parallel vs sequential execution:

The system was originally designed to run sequentially (categorize → assign department → detect anomalies). Ramp changed to parallel execution because:

  • Agents don't depend on each other's outputs
  • Parallel reduced latency: 6s → 2s
  • Slight implementation complexity but worth the speed gain

Human-in-the-loop design:

Tier 1 (autonomous): Confidence ≥85%, no anomalies (83% of transactions)

Tier 2 (notify): Confidence 70-85% (12% of transactions) - auto-categorize but notify finance team

Tier 3 (approve): Confidence <70% OR anomaly flagged (5% of transactions) - requires human review before categorizing

This tiered approach built trust: the finance team saw that the agent wasn't blindly categorizing everything.
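
The three tiers reduce to a small routing function. This is a sketch under the thresholds stated above; the `Tier` enum and function name are illustrative, and the boundary handling (exactly 85% is autonomous, exactly 70% is notify) follows the "≥85%" and "<70%" wording.

```python
from enum import Enum


class Tier(Enum):
    AUTONOMOUS = 1  # auto-categorize silently
    NOTIFY = 2      # auto-categorize, then notify the finance team
    APPROVE = 3     # hold for human review before categorizing


def route(confidence: float, anomaly: bool) -> Tier:
    """Map an agent's confidence (0.0-1.0) and anomaly flag to a review tier."""
    if anomaly or confidence < 0.70:
        return Tier.APPROVE
    if confidence < 0.85:
        return Tier.NOTIFY
    return Tier.AUTONOMOUS
```

Checking the anomaly flag first means a flagged transaction always goes to human review, however confident the categorizer is.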

Challenges & Solutions

Challenge 1: International merchant names

Problem: Agent struggled with non-English merchant names (e.g., "株式会社ABC" instead of "ABC Corporation")

Solution: Added a translation step: detect the language, translate to English, then categorize

Result: Accuracy on international transactions improved from 68% to 91%

Challenge 2: Ambiguous merchants

Problem: "Amazon" could be AWS (software), Amazon Business (office supplies), or Amazon Marketplace (various)

Solution: Added category hints to prompt based on amount patterns:

  • <£50 typically office supplies
  • £50-500 could be office or software
  • >£500 likely AWS

Also checked employee department (engineers → likely AWS, ops → likely supplies)

Result: Amazon categorization accuracy: 73% → 94%
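
One way to picture the hinting step is a helper that turns the amount bands and the employee's department into a line of text appended to the categorization prompt. The function name, the hint wording, and the treatment of exactly £50 and £500 are assumptions for illustration; the source does not show the actual prompt.

```python
def amazon_prompt_hint(amount_gbp: float, department: str) -> str:
    """Return a hint line to append to the categorization prompt for Amazon charges."""
    # Amount bands from the list above; handling of exactly £50 and £500 is an assumption.
    if amount_gbp < 50:
        amount_hint = "typically office supplies"
    elif amount_gbp <= 500:
        amount_hint = "could be office supplies or software"
    else:
        amount_hint = "likely AWS (software)"

    # Department signal from the text: engineers lean AWS, ops lean supplies.
    department_hints = {"engineering": "likely AWS", "ops": "likely supplies"}

    parts = [f"Amount £{amount_gbp:.2f}: {amount_hint}."]
    if department in department_hints:
        parts.append(f"Employee department {department}: {department_hints[department]}.")
    return "Hint: " + " ".join(parts)
```

The hints are phrased as likelihoods rather than hard rules, so the model can still override them when the transaction description points elsewhere.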

Challenge 3: New vendor false positives

Problem: Anomaly detector flagged every new vendor as suspicious

Solution: Changed the logic to flag only if the vendor is new AND the amount is >£500

Result: False positive rate: 42% → 8%

Challenge 4: Finance team resistance

Problem: The team was initially skeptical: "AI will make mistakes, I'll have to fix them anyway"

Solution:

  • Ran shadow mode for 2 weeks, showing that the agent's 96% accuracy matched human performance
  • Positioned as "handles boring stuff, you focus on complex cases"
  • No headcount reduction (redeployed to financial analysis)

Result: Full buy-in after shadow mode demonstration

Key Lessons Learned

1. Historical data is gold

Access to 2+ years of labeled transactions enabled fine-tuning. Companies without historical data should start with RAG or zero-shot and gradually build a labeled dataset.

2. Start with high-confidence only

Week 1 of production: Only auto-categorized transactions with ≥95% confidence (40% of volume). Gradually lowered the threshold to 85% as the team gained trust.

3. Anomaly detection requires domain rules

Pure LLM anomaly detection had a 38% false positive rate. A hybrid approach (LLM + rule-based checks) reduced it to 8%.

4. Parallel execution worth the complexity

The 3x speedup (6s → 2s) made the user experience dramatically better. Implementation took an extra 2 weeks but paid off.

5. Monthly accuracy reviews essential

Ramp reviews 100 random transactions monthly to ensure accuracy hasn't degraded. The team found minor drift after 3 months (96% → 93%), retrained the model, and brought accuracy back to 96%.

Replication Guide

To implement a similar system:

Requirements:

  • 10,000+ historical labeled transactions (for fine-tuning) OR start with RAG
  • API access to expense system (Ramp, Brex, Expensify, etc.)
  • API access to accounting system (QuickBooks, Xero, NetSuite)

Timeline:

  • With fine-tuning: 10-14 weeks
  • With RAG: 6-8 weeks

Team:

  • 1-2 engineers (full-time for 8-12 weeks)
  • 1 finance lead (25% time for requirements and validation)

Cost:

  • Engineering: £40K-60K
  • Training (if fine-tuning): £1,500-3,000
  • Ongoing API costs: £800-1,500/month (for 50K transactions)

Expected Results:

  • 75-85% automation rate
  • 90-96% accuracy (with iteration)
  • 10-15 hours/week saved
  • 3-6 month payback period

Conclusion

Ramp's expense automation agent demonstrates that AI agents can reliably handle high-volume, judgment-based workflows when implemented thoughtfully.

Key success factors:

  • Sufficient training data (50K labeled examples)
  • Human-in-the-loop for edge cases
  • Parallel execution for performance
  • Continuous monitoring and retraining

If you're considering similar automation: Start with shadow mode, measure accuracy rigorously, and expand autonomy gradually as trust builds.

The technology works. The challenge is implementation discipline.

---

Frequently Asked Questions

Q: What's the typical ROI timeline for AI agent implementations?

Most organisations see positive ROI within 3-6 months of deployment. Initial productivity gains of 20-40% are common, with improvements compounding as teams optimise prompts and workflows based on production experience.

Q: How do AI agents handle errors and edge cases?

Well-designed agent systems include fallback mechanisms, human-in-the-loop escalation, and retry logic. The key is defining clear boundaries for autonomous action versus requiring human approval for sensitive or unusual situations.

Q: How long does it take to implement an AI agent workflow?

Implementation timelines vary based on complexity, but most teams see initial results within 2-4 weeks for simple workflows. More sophisticated multi-agent systems typically require 6-12 weeks for full deployment with proper testing and governance.
