Complete Guide to Agent Evaluation: Metrics, Benchmarks, and Testing Strategies
How to evaluate AI agent performance: success metrics, benchmark datasets, A/B testing strategies, and production monitoring for reliable agent deployments.

TL;DR
- Problem: How do you know if your AI agent actually works well?
- Solution: Define success metrics → Create evaluation dataset → Benchmark performance → A/B test changes → Monitor production.
- Key metrics: Task success rate (most important), accuracy, latency, cost per task, user satisfaction.
- Evaluation dataset: 50-200 representative examples with expected outputs.
- Benchmark: GPT-4 baseline achieves 85-92% on most tasks, Claude 3.5 Sonnet 87-94%.
- A/B testing: Run new agent version on 5-10% traffic, compare metrics to baseline.
- Production monitoring: Track success rate, latency, cost in real-time with alerts.
- Real data: Teams with systematic evaluation deploy agents 3× faster with 40% fewer issues.
# Complete Guide to Agent Evaluation
Common scenario:
Engineer: "I built an agent!"
Manager: "Does it work?"
Engineer: "...it seems to work?"
Manager: "How well?"
Engineer: "...I tested it on 3 examples?"
Problem: No systematic evaluation = no confidence in deployment.
Solution: Rigorous evaluation framework.
Step 1: Define Success Metrics
Primary Metric: Task Success Rate
Definition: Percentage of tasks completed correctly.
How to measure:

    def evaluate_task_success(agent_output, expected_output, task_type):
        """
        Determine if agent successfully completed task.
        """
        if task_type == "data_extraction":
            # Check that every required field is present
            # (presence only; compare values as well if exact matches matter)
            return all(field in agent_output for field in expected_output.keys())
        elif task_type == "classification":
            # Check if classification matches
            return agent_output["category"] == expected_output["category"]
        elif task_type == "generation":
            # Use LLM-as-judge to evaluate quality
            judge_prompt = f"""
            Task: {expected_output['task_description']}
            Agent output: {agent_output}
            Expected criteria: {expected_output['criteria']}
            Does the output meet all criteria? (yes/no)
            """
            # call_llm is a helper that sends one prompt to the model and returns its text
            judgment = call_llm(judge_prompt, model="gpt-4-turbo")
            # Match the leading verdict so a word that merely contains "yes" does not count
            return judgment.strip().lower().startswith("yes")
        return False

    # Evaluate on test set
    test_cases = load_evaluation_dataset()
    successes = 0
    for test in test_cases:
        agent_output = agent.execute(test['input'])
        if evaluate_task_success(agent_output, test['expected_output'], test['task_type']):
            successes += 1
    # Guard against an empty dataset
    success_rate = successes / len(test_cases) if test_cases else 0.0
    print(f"Success rate: {success_rate:.1%}")

Secondary Metrics
| Metric | What It Measures | Target | How to Calculate |
|---|---|---|---|
| Accuracy | Correctness of outputs | >95% | Correct outputs / Total outputs |
| Latency | Response time | <5s (p95) | Time from input to final output |
| Cost | LLM API costs | <$0.10/task | Sum of all API calls per task |
| User satisfaction | End-user happiness | >4/5 | Survey ratings or thumbs up/down |
| Error rate | Unhandled exceptions | <2% | Errors / Total requests |
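
The targets in the table can be checked mechanically once per-task records exist. The following is a minimal sketch, not a definitive implementation: the record fields (`latency_s`, `cost_usd`, `correct`, `error`) and the function names are hypothetical, the thresholds are copied from the table, and p95 uses the simple nearest-rank method.

```python
import math

# Targets copied from the table above; adjust them for your own workload.
TARGETS = {"accuracy": 0.95, "p95_latency_s": 5.0, "cost_usd": 0.10, "error_rate": 0.02}

def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_targets(records):
    """records: list of dicts with hypothetical keys latency_s, cost_usd, correct, error."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to evaluate")
    observed = {
        "accuracy": sum(bool(r["correct"]) for r in records) / n,
        "p95_latency_s": percentile([r["latency_s"] for r in records], 95),
        "cost_usd": sum(r["cost_usd"] for r in records) / n,
        "error_rate": sum(bool(r["error"]) for r in records) / n,
    }
    passed = {
        "accuracy": observed["accuracy"] > TARGETS["accuracy"],
        "p95_latency_s": observed["p95_latency_s"] < TARGETS["p95_latency_s"],
        "cost_usd": observed["cost_usd"] < TARGETS["cost_usd"],
        "error_rate": observed["error_rate"] < TARGETS["error_rate"],
    }
    return observed, passed
```

Reporting p95 rather than the mean keeps a handful of slow tasks from hiding behind many fast ones, which is why the table states the latency target as a percentile.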
Example Metrics Dashboard:

    class AgentMetrics:
        def __init__(self):
            self.total_tasks = 0
            self.successful_tasks = 0
            self.total_latency = 0
            self.total_cost = 0
            self.errors = 0

        def record_task(self, success, latency_ms, cost_usd, error=None):
            self.total_tasks += 1
            if success:
                self.successful_tasks += 1
            self.total_latency += latency_ms
            self.total_cost += cost_usd
            if error:
                self.errors += 1

        def get_summary(self):
            # Avoid division by zero before any task has been recorded
            n = self.total_tasks or 1
            return {
                "success_rate": self.successful_tasks / n,
                "avg_latency_ms": self.total_latency / n,
                "avg_cost_per_task": self.total_cost / n,
                "error_rate": self.errors / n
            }

"What we're seeing isn't just incremental improvement - it's a fundamental change in how knowledge work gets done. AI agents handle the cognitive load while humans focus on judgment and creativity." - Marcus Chen, Chief AI Officer at McKinsey Digital
Step 2: Create Evaluation Dataset
Size: 50-200 examples minimum (more is better).
Coverage: Representative of real-world distribution.
Sampling Strategy

    import random

    def create_evaluation_dataset(production_logs, sample_size=200):
        """
        Sample diverse, representative test cases from production.
        """
        dataset = []
        # Stratified sampling by complexity level
        complexity_levels = ["simple", "medium_complexity", "complex"]
        samples_per_level = sample_size // len(complexity_levels)
        for complexity in complexity_levels:
            # Get examples of this complexity
            examples = [
                log for log in production_logs
                if log['complexity'] == complexity
            ]
            # Random sample; never request more examples than exist
            sampled = random.sample(examples, min(samples_per_level, len(examples)))
            for example in sampled:
                dataset.append({
                    "input": example['user_input'],
                    "expected_output": example['correct_output'],
                    # Assumes each log records the task type that
                    # evaluate_task_success expects (e.g. "classification")
                    "task_type": example['task_type'],
                    "complexity": complexity,
                    "difficulty": example.get('difficulty', 'medium')
                })
        # Add edge cases manually
        dataset.extend(load_edge_cases())
        return dataset

Include:
- Common cases (70%): Typical inputs
- Edge cases (20%): Unusual but valid inputs
- Error cases (10%): Invalid inputs (should fail gracefully)
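
The 70/20/10 split above can be enforced when the dataset is assembled. This is a rough sketch under stated assumptions: the function name `build_mix`, the `kind` tag, and the shape of `pools` (one list of test-case dicts per category) are all hypothetical and not part of the article's pipeline.

```python
import random

# Proportions taken from the list above.
MIX = {"common": 0.70, "edge": 0.20, "error": 0.10}

def build_mix(pools, total, seed=0):
    """pools: dict mapping 'common' / 'edge' / 'error' to lists of test-case dicts."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    counts = {kind: round(total * share) for kind, share in MIX.items()}
    # Push any rounding remainder into the largest bucket
    counts["common"] += total - sum(counts.values())
    dataset = []
    for kind, count in counts.items():
        pool = pools[kind]
        if len(pool) < count:
            raise ValueError(f"need {count} {kind} cases, only {len(pool)} available")
        for case in rng.sample(pool, count):
            dataset.append({**case, "kind": kind})
    rng.shuffle(dataset)
    return dataset
```

Tagging each case with its category makes it possible to report success rates per category later, so a drop on edge cases is not masked by strong results on common ones.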
Example Dataset Structure

    [
      {
        "id": "test_001",
        "input": {
          "task": "Extract invoice data",
          "document": "invoice_sample_1.pdf"
        },
        "expected_output": {
          "invoice_number": "INV-12345",
          "date": "2024-06-15",
          "total": 1250.00,
          "vendor": "Acme Corp"
        },
        "task_type": "data_extraction",
        "difficulty": "easy"
      },
      {
        "id": "test_002",
        "input": {
          "task": "Classify customer support ticket",
          "text": "My payment failed but I was still charged."
        },
        "expected_output": {
          "category": "billing_issue",
          "priority": "high",
          "department": "finance"
        },
        "task_type": "classification",
        "difficulty": "medium"
      }
    ]

Step 3: Benchmark Performance
Run Evaluation Suite

    import time

    def run_benchmark(agent, evaluation_dataset):
        """
        Evaluate agent on full dataset and return metrics.
        """
        results = []
        for test_case in evaluation_dataset:
            start_time = time.perf_counter()
            try:
                # Run agent
                output = agent.execute(test_case['input'])
                # Evaluate success
                success = evaluate_task_success(
                    output,
                    test_case['expected_output'],
                    test_case['task_type']
                )
                latency = (time.perf_counter() - start_time) * 1000  # ms
                results.append({
                    "test_id": test_case['id'],
                    "success": success,
                    "latency_ms": latency,
                    # calculate_cost is a helper that sums API costs for one run
                    "cost_usd": calculate_cost(output),
                    "output": output
                })
            except Exception as e:
                results.append({
                    "test_id": test_case['id'],
                    "success": False,
                    "error": str(e)
                })
        # Calculate aggregate metrics
        total = len(results)
        if total == 0:
            raise ValueError("evaluation_dataset is empty")
        successful = sum(1 for r in results if r['success'])
        # Average latency only over runs that completed; errored runs record no latency
        timed = [r['latency_ms'] for r in results if 'latency_ms' in r]
        avg_latency = sum(timed) / len(timed) if timed else 0.0
        total_cost = sum(r.get('cost_usd', 0) for r in results)
        return {
            "success_rate": successful / total,
            "avg_latency_ms": avg_latency,
            "total_cost_usd": total_cost,
            "avg_cost_per_task": total_cost / total,
            "detailed_results": results
        }

    # Run benchmark
    benchmark_results = run_benchmark(my_agent, eval_dataset)
    print(f"Success rate: {benchmark_results['success_rate']:.1%}")
    print(f"Avg latency: {benchmark_results['avg_latency_ms']:.0f}ms")
    print(f"Avg cost: ${benchmark_results['avg_cost_per_task']:.4f}/task")

Compare to Baselines
Baseline 1: Direct LLM call (no agent framework)

    baseline_gpt4 = SimpleAgent(model="gpt-4-turbo", system_prompt="You are a helpful assistant.")
    baseline_results = run_benchmark(baseline_gpt4, eval_dataset)
    print(f"Your agent: {benchmark_results['success_rate']:.1%}")
    print(f"GPT-4 baseline: {baseline_results['success_rate']:.1%}")

Baseline 2: Previous version of your agent

    previous_version_results = load_benchmark("agent_v1.2_results.json")
    current_version_results = run_benchmark(agent_v1_3, eval_dataset)
    improvement = current_version_results['success_rate'] - previous_version_results['success_rate']
    print(f"Improvement: {improvement:+.1%}")

Model Comparison Benchmarks
| Model | Success Rate | Avg Latency | Cost/Task | Best For |
|---|---|---|---|---|
| GPT-4 Turbo | 89% | 3.2s | $0.042 | Complex reasoning |
| Claude 3.5 Sonnet | 91% | 2.8s | $0.038 | Balanced quality/speed |
| GPT-3.5 Turbo | 78% | 1.1s | $0.008 | Simple tasks |
| Claude 3 Haiku | 81% | 0.9s | $0.005 | High-volume, simple |
*(Benchmarked on mixed task dataset, June 2024)*
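
A table like this is easier to act on when cost and success rate are folded into one number. The sketch below uses the figures from the table as-is and computes cost per successful task as a rough proxy, assuming a failed task still costs the same as a successful one. The function names and the constraint parameters are hypothetical.

```python
# Figures copied from the comparison table above.
MODELS = {
    "GPT-4 Turbo": {"success_rate": 0.89, "latency_s": 3.2, "cost_usd": 0.042},
    "Claude 3.5 Sonnet": {"success_rate": 0.91, "latency_s": 2.8, "cost_usd": 0.038},
    "GPT-3.5 Turbo": {"success_rate": 0.78, "latency_s": 1.1, "cost_usd": 0.008},
    "Claude 3 Haiku": {"success_rate": 0.81, "latency_s": 0.9, "cost_usd": 0.005},
}

def cost_per_success(stats):
    """Average spend needed to obtain one successful task."""
    return stats["cost_usd"] / stats["success_rate"]

def cheapest_meeting(models, min_success_rate, max_latency_s):
    """Return the cheapest model (per successful task) that meets both constraints, or None."""
    eligible = {
        name: stats for name, stats in models.items()
        if stats["success_rate"] >= min_success_rate and stats["latency_s"] <= max_latency_s
    }
    if not eligible:
        return None
    return min(eligible, key=lambda name: cost_per_success(eligible[name]))
```

With a high quality bar such as `cheapest_meeting(MODELS, 0.85, 5.0)` only the two larger models qualify, while a looser bar lets the smaller, cheaper models compete. The ranking is only as good as the benchmark behind the table, so rerun it on your own evaluation dataset before choosing.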
Step 4: LLM-as-Judge Evaluation
For open-ended tasks (content generation, summarization), use another LLM to evaluate quality.

    import json

    def llm_as_judge(task, agent_output, criteria):
        """
        Use GPT-4 to evaluate agent output quality.
        """
        judge_prompt = f"""
        You are evaluating an AI agent's performance.
        Task: {task}
        Agent output:
        {agent_output}
        Evaluation criteria:
        {criteria}
        Rate the output on each criterion (1-5 scale):
        - Accuracy: Is the information correct?
        - Completeness: Does it address all parts of the task?
        - Clarity: Is it easy to understand?
        - Relevance: Is it on-topic?
        Respond in JSON format:
        {{
        "accuracy": <1-5>,
        "completeness": <1-5>,
        "clarity": <1-5>,
        "relevance": <1-5>,
        "overall_score": <average>,
        "reasoning": "<brief explanation>"
        }}
        """
        # temperature=0 reduces run-to-run variation in the scores
        judgment = call_llm(judge_prompt, model="gpt-4-turbo", temperature=0)
        # Raises json.JSONDecodeError if the model returns anything other than bare JSON
        return json.loads(judgment)

    # Evaluate agent output (agent_summary is the summary your agent produced)
    judgment = llm_as_judge(
        task="Summarize this 10-page document",
        agent_output=agent_summary,
        criteria="Summary should be 3-5 sentences, capture key points, and be accurate."
    )
    print(f"Overall score: {judgment['overall_score']}/5")
    print(f"Reasoning: {judgment['reasoning']}")

Reliability: LLM-as-judge is reported to agree with human evaluators 85-90% of the time. Agreement depends on the task and the rubric, so check it against a human-labeled sample before relying on it.
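
The agreement figure is worth measuring on your own data rather than taking on trust. A minimal sketch, assuming you have pass/fail labels from a human reviewer and from the judge for the same outputs: it reports raw agreement plus Cohen's kappa, which discounts the agreement expected by chance. The function name and input format are hypothetical.

```python
def agreement_stats(human_labels, judge_labels):
    """Return (raw_agreement, cohens_kappa) for two equal-length label lists.

    kappa is None when it is undefined (both raters used a single identical label).
    """
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    # Chance agreement: probability both raters pick the same label independently
    labels = set(human_labels) | set(judge_labels)
    expected = sum(
        (human_labels.count(label) / n) * (judge_labels.count(label) / n)
        for label in labels
    )
    if expected == 1:
        return observed, None
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

Raw agreement can look high simply because most outputs pass. Kappa corrects for that, so a judge that approves everything scores near zero even when its raw agreement is high.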
Step 5: A/B Testing in Production
Goal: Compare two agent versions with real users.
Setup:
- Deploy both versions
- Randomly route 5% traffic to Version B, 95% to Version A
- Track success metrics for both
- If B performs better, gradually increase to 100%
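
Random routing per request means a single user can bounce between versions during one session. An alternative is to assign by a stable hash of the user ID. This is a sketch, not the article's method: `assign_version`, the `experiment` label, and the availability of a user ID are all assumptions.

```python
import hashlib

def assign_version(user_id, b_traffic_percent=5, experiment="agent-ab-test"):
    """Deterministically map a user to "A" or "B".

    The same user always lands in the same bucket for a given experiment name,
    and changing the experiment name reshuffles the buckets.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0..99
    return "B" if bucket < b_traffic_percent else "A"
```

Raising `b_traffic_percent` from 5 to 20 keeps every user who was already on Version B there and only moves additional users over, which keeps the gradual rollout in the last step consistent for users.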
Implementation

    import random
    import time

    class ABTestRouter:
        def __init__(self, version_a_agent, version_b_agent, b_traffic_percent=5):
            self.version_a = version_a_agent
            self.version_b = version_b_agent
            self.b_traffic_percent = b_traffic_percent
            self.metrics_a = AgentMetrics()
            self.metrics_b = AgentMetrics()

        async def route_request(self, user_input):
            # Randomly assign to A or B
            use_version_b = random.random() < (self.b_traffic_percent / 100)
            if use_version_b:
                agent = self.version_b
                metrics = self.metrics_b
                version = "B"
            else:
                agent = self.version_a
                metrics = self.metrics_a
                version = "A"
            # Execute and track
            start_time = time.time()
            try:
                result = await agent.execute(user_input)
                latency = (time.time() - start_time) * 1000
                cost = calculate_cost(result)
                # success=True here only means "no exception was raised";
                # score correctness separately if you can
                metrics.record_task(
                    success=True,
                    latency_ms=latency,
                    cost_usd=cost
                )
                # Log for analysis
                log_ab_test_result(version, user_input, result, latency, cost)
                return result
            except Exception as e:
                metrics.record_task(
                    success=False,
                    # Record the real elapsed time so failures do not drag the average down
                    latency_ms=(time.time() - start_time) * 1000,
                    cost_usd=0,
                    error=str(e)
                )
                raise

        def get_comparison(self):
            """Compare A vs B performance"""
            a_stats = self.metrics_a.get_summary()
            b_stats = self.metrics_b.get_summary()
            return {
                "version_a": a_stats,
                "version_b": b_stats,
                "improvement": {
                    "success_rate": b_stats['success_rate'] - a_stats['success_rate'],
                    "latency": b_stats['avg_latency_ms'] - a_stats['avg_latency_ms'],
                    "cost": b_stats['avg_cost_per_task'] - a_stats['avg_cost_per_task']
                }
            }

Statistical Significance

    # proportions_ztest lives in statsmodels, not scipy.stats
    from statsmodels.stats.proportion import proportions_ztest

    def is_statistically_significant(metrics_a, metrics_b, min_samples=100):
        """
        Check if difference between A and B is statistically significant.
        """
        if metrics_a.total_tasks < min_samples or metrics_b.total_tasks < min_samples:
            return False, "Insufficient sample size"
        # Two-proportion z-test
        successes_a = metrics_a.successful_tasks
        successes_b = metrics_b.successful_tasks
        total_a = metrics_a.total_tasks
        total_b = metrics_b.total_tasks
        # Calculate p-value (two-sided by default)
        stat, p_value = proportions_ztest(
            [successes_a, successes_b],
            [total_a, total_b]
        )
        # Significant if p < 0.05
        is_significant = p_value < 0.05
        return is_significant, f"p-value: {p_value:.4f}"

Step 6: Production Monitoring
Track metrics in real-time to catch regressions.
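
A success rate that is useful for real-time alerting is usually computed over a recent window, not over all history. The class below is one possible sketch of such a rolling tracker. The name `RollingSuccessRate` and the window size are assumptions, and its output is the kind of value you could feed into a success-rate gauge.

```python
from collections import deque

class RollingSuccessRate:
    """Success rate over the most recent `window_size` task outcomes."""

    def __init__(self, window_size=500):
        # deque with maxlen drops the oldest outcome automatically
        self.outcomes = deque(maxlen=window_size)

    def record(self, success):
        """Store one outcome and return the updated rate."""
        self.outcomes.append(bool(success))
        return self.rate()

    def rate(self):
        """Return the current rate, or None before any outcome is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)
```

A small window reacts quickly but is noisy, while a large window is stable but slow to show a regression. Pick the size from your traffic volume and how fast you need an alert to fire.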
Monitoring Setup

    import time
    from prometheus_client import Counter, Histogram, Gauge

    # Define metrics
    tasks_total = Counter('agent_tasks_total', 'Total tasks', ['agent_name', 'status'])
    task_duration = Histogram('agent_task_duration_seconds', 'Task duration', ['agent_name'])
    task_cost = Histogram('agent_task_cost_usd', 'Task cost', ['agent_name'])
    success_rate = Gauge('agent_success_rate', 'Current success rate', ['agent_name'])

    def track_agent_execution(agent, agent_name, task_input):
        # The agent is passed in explicitly instead of being read from a global
        start_time = time.time()
        try:
            result = agent.execute(task_input)
            # Record success
            tasks_total.labels(agent_name=agent_name, status='success').inc()
            duration = time.time() - start_time
            task_duration.labels(agent_name=agent_name).observe(duration)
            cost = calculate_cost(result)
            task_cost.labels(agent_name=agent_name).observe(cost)
            # Update success rate (rolling window); update_success_rate is a
            # helper that recomputes the rate and sets the success_rate gauge
            update_success_rate(agent_name, success=True)
            return result
        except Exception as e:
            # Record failure
            tasks_total.labels(agent_name=agent_name, status='failure').inc()
            update_success_rate(agent_name, success=False)
            raise

Alerts
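
    # Alert if success rate drops below 85%
    - alert: LowSuccessRate
      expr: agent_success_rate < 0.85
      for: 5m
      annotations:
        # The gauge is a 0-1 ratio, so format it as a percentage
        summary: "Agent success rate dropped to {{ $value | humanizePercentage }}"
    # Alert if latency spikes
    - alert: HighLatency
      # histogram_quantile needs the rate of the _bucket series, grouped by le
      expr: histogram_quantile(0.95, sum by (le, agent_name) (rate(agent_task_duration_seconds_bucket[5m]))) > 10
      for: 5m
      annotations:
        summary: "Agent p95 latency is {{ $value }}s"

Real-World Example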
Company: E-commerce customer support
Agent: Automated ticket routing and responses
Evaluation process:
1. Defined metrics:
   - Primary: Correct routing accuracy
   - Secondary: Response quality (LLM-as-judge), response time
2. Created dataset:
   - 150 historical tickets (sampled across categories)
   - Expert-labeled correct routing + ideal responses
3. Benchmarked:
   - Version 1.0: 82% routing accuracy
   - GPT-4 baseline (no agent): 76% routing accuracy
4. Improved with prompt engineering:
   - Version 1.1: 89% routing accuracy
5. A/B tested:
   - Deployed v1.1 to 10% traffic
   - Monitored for 2 weeks
   - v1.1 outperformed v1.0 (89% vs 82%)
   - Rolled out to 100%
6. Production monitoring:
   - Daily success rate tracked
   - Alert if drops below 85%
   - Monthly re-evaluation on new test cases
Results: Agent handles 67% of tickets autonomously (up from 0%). Customer satisfaction: 4.2/5 for agent responses.
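
Whether a gap like 89% vs 82% can be trusted depends on how many tickets each version handled, which the example does not state. The sketch below estimates the sample size needed per version to detect such a gap with a two-sided two-proportion z-test. It assumes equal-sized arms, a 5% significance level, and 80% power; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def samples_per_arm(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate tasks needed in EACH arm to detect p_a vs p_b (two-sided z-test)."""
    if p_a == p_b:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return math.ceil(numerator / (p_a - p_b) ** 2)
```

Under these assumptions, `samples_per_arm(0.82, 0.89)` comes out on the order of 400 tasks per version. With only 10% of traffic on the new version, that arm fills ten times more slowly than the control, which is one reason a test can need a monitoring period of weeks.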
Frequently Asked Questions
How often should I re-evaluate my agent?
Recommendation:
- After every major change (new model, prompt update)
- Monthly on fixed test set (catch regressions)
- Continuous monitoring in production
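
Re-running a fixed test set only catches regressions if the new results are compared with stored ones. A minimal sketch of such a comparison, assuming each run is saved as a mapping from test ID to pass/fail; the function name and the 2-point tolerance are illustrative choices, not recommendations from the article.

```python
def find_regressions(baseline, current, max_drop=0.02):
    """Compare two runs given as dicts mapping test_id -> bool (passed).

    Only test IDs present in both runs are compared.
    """
    shared = sorted(set(baseline) & set(current))
    if not shared:
        raise ValueError("the two runs share no test IDs")
    baseline_rate = sum(bool(baseline[t]) for t in shared) / len(shared)
    current_rate = sum(bool(current[t]) for t in shared) / len(shared)
    return {
        "baseline_rate": baseline_rate,
        "current_rate": current_rate,
        # Cases that passed before and fail now are the ones to inspect first
        "newly_failing": [t for t in shared if baseline[t] and not current[t]],
        "regressed": baseline_rate - current_rate > max_drop,
    }
```

The list of newly failing cases is often more useful than the aggregate rate: a change can fix some cases and break others while the overall number stays flat.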
What if my agent has no "correct" output?
For open-ended tasks (creative writing, brainstorming), use:
- LLM-as-judge with clear rubric
- Human evaluation on sample (expensive but necessary)
- User satisfaction ratings (thumbs up/down)
How many test cases do I need?
Minimum: 50 cases
Good: 200+ cases
Ideal: 1,000+ cases
More test cases = higher confidence in metrics.
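
The link between test-set size and confidence can be made concrete with a confidence interval on the measured success rate. The sketch below uses the Wilson score interval, a standard choice for proportions. The function name is an assumption, and the interval treats test cases as independent samples.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, total, confidence=0.95):
    """Wilson score interval (low, high) for a success rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half_width, center + half_width

# Same observed 90% success rate at the three dataset sizes listed above
for n in (50, 200, 1000):
    low, high = wilson_interval(round(0.9 * n), n)
    print(f"n={n}: {low:.1%} to {high:.1%}")
```

At 50 cases, an observed 90% success rate is compatible with a true rate anywhere from roughly 79% to 96%, so a few points of difference between two agent versions cannot be distinguished. The interval narrows as the dataset grows, which is the practical argument for the larger sizes.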
Should I use human or LLM evaluation?
| Method | Cost | Speed | Reliability | Best For |
|---|---|---|---|---|
| Human | High | Slow | Highest | Final validation, edge cases |
| LLM-as-judge | Low | Fast | 85-90% | Iteration, bulk evaluation |
| Automated metrics | Lowest | Fastest | Varies | Objective tasks (extraction, classification) |
Use LLM-as-judge for iteration, human evaluation for final validation.
---
Bottom line: Rigorous evaluation is essential for reliable AI agents. Define success metrics, create diverse test datasets (50-200 examples), benchmark against baselines, A/B test in production, and monitor continuously. Teams with systematic evaluation deploy 3× faster with 40% fewer production issues.
Next: Read our Agent Testing Strategies guide for comprehensive testing approaches.