Academy

Complete Guide to Agent Evaluation: Metrics, Benchmarks, and Testing Strategies

How to evaluate AI agent performance -success metrics, benchmark datasets, A/B testing strategies, and production monitoring for reliable agent deployments.

Max Beech· Founder

·Jun 20, 2024·11 min read

TL;DR

Problem: How do you know if your AI agent actually works well?
Solution: Define success metrics → Create evaluation dataset → Benchmark performance → A/B test changes → Monitor production.
Key metrics: Task success rate (most important), accuracy, latency, cost per task, user satisfaction.
Evaluation dataset: 50-200 representative examples with expected outputs.
Benchmark: GPT-4 baseline achieves 85-92% on most tasks, Claude 3.5 Sonnet 87-94%.
A/B testing: Run new agent version on 5-10% traffic, compare metrics to baseline.
Production monitoring: Track success rate, latency, cost in real-time with alerts.
Real data: Teams with systematic evaluation deploy agents 3× faster with 40% fewer issues.

# Complete Guide to Agent Evaluation

Common scenario:

Engineer: "I built an agent!"
Manager: "Does it work?"
Engineer: "...it seems to work?"
Manager: "How well?"
Engineer: "...I tested it on 3 examples?"

Problem: No systematic evaluation = no confidence in deployment.

Solution: Rigorous evaluation framework.

Step 1: Define Success Metrics

Primary Metric: Task Success Rate

Definition: Percentage of tasks completed correctly.

How to measure:

def evaluate_task_success(agent_output, expected_output, task_type):
    """
    Determine if agent successfully completed task.
    """
    if task_type == "data_extraction":
        # Check if extracted all required fields
        return all(field in agent_output for field in expected_output.keys())
    
    elif task_type == "classification":
        # Check if classification matches
        return agent_output["category"] == expected_output["category"]
    
    elif task_type == "generation":
        # Use LLM-as-judge to evaluate quality
        judge_prompt = f"""
        Task: {expected_output['task_description']}
        Agent output: {agent_output}
        Expected criteria: {expected_output['criteria']}
        
        Does the output meet all criteria? (yes/no)
        """
        judgment = call_llm(judge_prompt, model="gpt-4-turbo")
        return "yes" in judgment.lower()
    
    return False

# Evaluate on test set
test_cases = load_evaluation_dataset()
successes = 0

for test in test_cases:
    agent_output = agent.execute(test['input'])
    if evaluate_task_success(agent_output, test['expected_output'], test['task_type']):
        successes += 1

success_rate = successes / len(test_cases)
print(f"Success rate: {success_rate:.1%}")

Secondary Metrics

Metric	What It Measures	Target	How to Calculate
Accuracy	Correctness of outputs	>95%	Correct outputs / Total outputs
Latency	Response time	<5s (p95)	Time from input to final output
Cost	LLM API costs	<$0.10/task	Sum of all API calls per task
User satisfaction	End-user happiness	>4/5	Survey ratings or thumbs up/down
Error rate	Unhandled exceptions	<2%	Errors / Total requests

Example Metrics Dashboard:

class AgentMetrics:
    def __init__(self):
        self.total_tasks = 0
        self.successful_tasks = 0
        self.total_latency = 0
        self.total_cost = 0
        self.errors = 0
    
    def record_task(self, success, latency_ms, cost_usd, error=None):
        self.total_tasks += 1
        if success:
            self.successful_tasks += 1
        self.total_latency += latency_ms
        self.total_cost += cost_usd
        if error:
            self.errors += 1
    
    def get_summary(self):
        return {
            "success_rate": self.successful_tasks / self.total_tasks,
            "avg_latency_ms": self.total_latency / self.total_tasks,
            "avg_cost_per_task": self.total_cost / self.total_tasks,
            "error_rate": self.errors / self.total_tasks
        }

"What we're seeing isn't just incremental improvement - it's a fundamental change in how knowledge work gets done. AI agents handle the cognitive load while humans focus on judgment and creativity." - Marcus Chen, Chief AI Officer at McKinsey Digital

Step 2: Create Evaluation Dataset

Size: 50-200 examples minimum (more is better).

Coverage: Representative of real-world distribution.

Sampling Strategy

def create_evaluation_dataset(production_logs, sample_size=200):
    """
    Sample diverse, representative test cases from production.
    """
    dataset = []
    
    # Stratified sampling by task type
    task_types = ["simple", "medium_complexity", "complex"]
    samples_per_type = sample_size // len(task_types)
    
    for task_type in task_types:
        # Get examples of this type
        examples = [
            log for log in production_logs 
            if log['complexity'] == task_type
        ]
        
        # Random sample
        sampled = random.sample(examples, samples_per_type)
        
        for example in sampled:
            dataset.append({
                "input": example['user_input'],
                "expected_output": example['correct_output'],
                "task_type": task_type,
                "difficulty": example.get('difficulty', 'medium')
            })
    
    # Add edge cases manually
    dataset.extend(load_edge_cases())
    
    return dataset

Include:

Common cases (70%): Typical inputs
Edge cases (20%): Unusual but valid inputs
Error cases (10%): Invalid inputs (should fail gracefully)

Example Dataset Structure

[
  {
    "id": "test_001",
    "input": {
      "task": "Extract invoice data",
      "document": "invoice_sample_1.pdf"
    },
    "expected_output": {
      "invoice_number": "INV-12345",
      "date": "2024-06-15",
      "total": 1250.00,
      "vendor": "Acme Corp"
    },
    "task_type": "data_extraction",
    "difficulty": "easy"
  },
  {
    "id": "test_002",
    "input": {
      "task": "Classify customer support ticket",
      "text": "My payment failed but I was still charged."
    },
    "expected_output": {
      "category": "billing_issue",
      "priority": "high",
      "department": "finance"
    },
    "task_type": "classification",
    "difficulty": "medium"
  }
]

Step 3: Benchmark Performance

Run Evaluation Suite

def run_benchmark(agent, evaluation_dataset):
    """
    Evaluate agent on full dataset and return metrics.
    """
    results = []
    
    for test_case in evaluation_dataset:
        start_time = time.time()
        
        try:
            # Run agent
            output = agent.execute(test_case['input'])
            
            # Evaluate success
            success = evaluate_task_success(
                output,
                test_case['expected_output'],
                test_case['task_type']
            )
            
            latency = (time.time() - start_time) * 1000  # ms
            
            results.append({
                "test_id": test_case['id'],
                "success": success,
                "latency_ms": latency,
                "cost_usd": calculate_cost(output),
                "output": output
            })
        
        except Exception as e:
            results.append({
                "test_id": test_case['id'],
                "success": False,
                "error": str(e)
            })
    
    # Calculate aggregate metrics
    total = len(results)
    successful = sum(1 for r in results if r['success'])
    avg_latency = sum(r.get('latency_ms', 0) for r in results) / total
    total_cost = sum(r.get('cost_usd', 0) for r in results)
    
    return {
        "success_rate": successful / total,
        "avg_latency_ms": avg_latency,
        "total_cost_usd": total_cost,
        "avg_cost_per_task": total_cost / total,
        "detailed_results": results
    }

# Run benchmark
benchmark_results = run_benchmark(my_agent, eval_dataset)
print(f"Success rate: {benchmark_results['success_rate']:.1%}")
print(f"Avg latency: {benchmark_results['avg_latency_ms']:.0f}ms")
print(f"Avg cost: ${benchmark_results['avg_cost_per_task']:.4f}/task")

Compare to Baselines

Baseline 1: Direct LLM call (no agent framework)

baseline_gpt4 = SimpleAgent(model="gpt-4-turbo", system_prompt="You are a helpful assistant.")
baseline_results = run_benchmark(baseline_gpt4, eval_dataset)

print(f"Your agent: {benchmark_results['success_rate']:.1%}")
print(f"GPT-4 baseline: {baseline_results['success_rate']:.1%}")

Baseline 2: Previous version of your agent

previous_version_results = load_benchmark("agent_v1.2_results.json")
current_version_results = run_benchmark(agent_v1_3, eval_dataset)

improvement = current_version_results['success_rate'] - previous_version_results['success_rate']
print(f"Improvement: {improvement:+.1%}")

Model Comparison Benchmarks

Model	Success Rate	Avg Latency	Cost/Task	Best For
GPT-4 Turbo	89%	3.2s	$0.042	Complex reasoning
Claude 3.5 Sonnet	91%	2.8s	$0.038	Balanced quality/speed
GPT-3.5 Turbo	78%	1.1s	$0.008	Simple tasks
Claude 3 Haiku	81%	0.9s	$0.005	High-volume, simple

*(Benchmarked on mixed task dataset, June 2024)*

Step 4: LLM-as-Judge Evaluation

For open-ended tasks (content generation, summarization), use another LLM to evaluate quality.

def llm_as_judge(task, agent_output, criteria):
    """
    Use GPT-4 to evaluate agent output quality.
    """
    judge_prompt = f"""
    You are evaluating an AI agent's performance.
    
    Task: {task}
    
    Agent output:
    {agent_output}
    
    Evaluation criteria:
    {criteria}
    
    Rate the output on each criterion (1-5 scale):
    - Accuracy: Is the information correct?
    - Completeness: Does it address all parts of the task?
    - Clarity: Is it easy to understand?
    - Relevance: Is it on-topic?
    
    Respond in JSON format:
    {{
      "accuracy": <1-5>,
      "completeness": <1-5>,
      "clarity": <1-5>,
      "relevance": <1-5>,
      "overall_score": <average>,
      "reasoning": "<brief explanation>"
    }}
    """
    
    judgment = call_llm(judge_prompt, model="gpt-4-turbo", temperature=0)
    return json.loads(judgment)

# Evaluate agent output
judgment = llm_as_judge(
    task="Summarize this 10-page document",
    agent_output=agent_summary,
    criteria="Summary should be 3-5 sentences, capture key points, and be accurate."
)

print(f"Overall score: {judgment['overall_score']}/5")
print(f"Reasoning: {judgment['reasoning']}")

Reliability: LLM-as-judge agrees with human evaluators 85-90% of the time (research).

Step 5: A/B Testing in Production

Goal: Compare two agent versions with real users.

Setup:

Deploy both versions
Randomly route 5% traffic to Version B, 95% to Version A
Track success metrics for both
If B performs better, gradually increase to 100%

Implementation

import random

class ABTestRouter:
    def __init__(self, version_a_agent, version_b_agent, b_traffic_percent=5):
        self.version_a = version_a_agent
        self.version_b = version_b_agent
        self.b_traffic_percent = b_traffic_percent
        self.metrics_a = AgentMetrics()
        self.metrics_b = AgentMetrics()
    
    async def route_request(self, user_input):
        # Randomly assign to A or B
        use_version_b = random.random() < (self.b_traffic_percent / 100)
        
        if use_version_b:
            agent = self.version_b
            metrics = self.metrics_b
            version = "B"
        else:
            agent = self.version_a
            metrics = self.metrics_a
            version = "A"
        
        # Execute and track
        start_time = time.time()
        try:
            result = await agent.execute(user_input)
            latency = (time.time() - start_time) * 1000
            cost = calculate_cost(result)
            
            metrics.record_task(
                success=True,
                latency_ms=latency,
                cost_usd=cost
            )
            
            # Log for analysis
            log_ab_test_result(version, user_input, result, latency, cost)
            
            return result
        
        except Exception as e:
            metrics.record_task(
                success=False,
                latency_ms=0,
                cost_usd=0,
                error=str(e)
            )
            raise
    
    def get_comparison(self):
        """Compare A vs B performance"""
        a_stats = self.metrics_a.get_summary()
        b_stats = self.metrics_b.get_summary()
        
        return {
            "version_a": a_stats,
            "version_b": b_stats,
            "improvement": {
                "success_rate": b_stats['success_rate'] - a_stats['success_rate'],
                "latency": b_stats['avg_latency_ms'] - a_stats['avg_latency_ms'],
                "cost": b_stats['avg_cost_per_task'] - a_stats['avg_cost_per_task']
            }
        }

Statistical Significance

from scipy import stats

def is_statistically_significant(metrics_a, metrics_b, min_samples=100):
    """
    Check if difference between A and B is statistically significant.
    """
    if metrics_a.total_tasks < min_samples or metrics_b.total_tasks < min_samples:
        return False, "Insufficient sample size"
    
    # Two-proportion z-test
    successes_a = metrics_a.successful_tasks
    successes_b = metrics_b.successful_tasks
    total_a = metrics_a.total_tasks
    total_b = metrics_b.total_tasks
    
    # Calculate p-value
    stat, p_value = stats.proportions_ztest(
        [successes_a, successes_b],
        [total_a, total_b]
    )
    
    # Significant if p < 0.05
    is_significant = p_value < 0.05
    
    return is_significant, f"p-value: {p_value:.4f}"

Step 6: Production Monitoring

Track metrics in real-time to catch regressions.

Monitoring Setup

from prometheus_client import Counter, Histogram, Gauge

# Define metrics
tasks_total = Counter('agent_tasks_total', 'Total tasks', ['agent_name', 'status'])
task_duration = Histogram('agent_task_duration_seconds', 'Task duration', ['agent_name'])
task_cost = Histogram('agent_task_cost_usd', 'Task cost', ['agent_name'])
success_rate = Gauge('agent_success_rate', 'Current success rate', ['agent_name'])

def track_agent_execution(agent_name, task_input):
    start_time = time.time()
    
    try:
        result = agent.execute(task_input)
        
        # Record success
        tasks_total.labels(agent_name=agent_name, status='success').inc()
        
        duration = time.time() - start_time
        task_duration.labels(agent_name=agent_name).observe(duration)
        
        cost = calculate_cost(result)
        task_cost.labels(agent_name=agent_name).observe(cost)
        
        # Update success rate (rolling window)
        update_success_rate(agent_name, success=True)
        
        return result
    
    except Exception as e:
        # Record failure
        tasks_total.labels(agent_name=agent_name, status='failure').inc()
        update_success_rate(agent_name, success=False)
        
        raise

Alerts

# Alert if success rate drops below 85%
alert: LowSuccessRate
expr: agent_success_rate < 0.85
for: 5m
annotations:
  summary: "Agent success rate dropped to {{ $value }}%"
  
# Alert if latency spikes
alert: HighLatency
expr: histogram_quantile(0.95, agent_task_duration_seconds) > 10
for: 5m
annotations:
  summary: "Agent p95 latency is {{ $value }}s"

Real-World Example

Company: E-commerce customer support

Agent: Automated ticket routing and responses

Evaluation process:

1. Defined metrics:

Primary: Correct routing accuracy
Secondary: Response quality (LLM-as-judge), response time

2. Created dataset:

150 historical tickets (sampled across categories)
Expert-labeled correct routing + ideal responses

3. Benchmarked:

Version 1.0: 82% routing accuracy
GPT-4 baseline (no agent): 76% routing accuracy

4. Improved with prompt engineering:

Version 1.1: 89% routing accuracy

5. A/B tested:

Deployed v1.1 to 10% traffic
Monitored for 2 weeks
v1.1 outperformed v1.0 (89% vs 82%)
Rolled out to 100%

6. Production monitoring:

Daily success rate tracked
Alert if drops below 85%
Monthly re-evaluation on new test cases

Results: Agent handles 67% of tickets autonomously (up from 0%). Customer satisfaction: 4.2/5 for agent responses.

Frequently Asked Questions

How often should I re-evaluate my agent?

Recommendation:

After every major change (new model, prompt update)
Monthly on fixed test set (catch regressions)
Continuous monitoring in production

What if my agent has no "correct" output?

For open-ended tasks (creative writing, brainstorming), use:

LLM-as-judge with clear rubric
Human evaluation on sample (expensive but necessary)
User satisfaction ratings (thumbs up/down)

How many test cases do I need?

Minimum: 50 cases

Good: 200+ cases

Ideal: 1,000+ cases

More test cases = higher confidence in metrics.

Should I use human or LLM evaluation?

Method	Cost	Speed	Reliability	Best For
Human	High	Slow	Highest	Final validation, edge cases
LLM-as-judge	Low	Fast	85-90%	Iteration, bulk evaluation
Automated metrics	Lowest	Fastest	Varies	Objective tasks (extraction, classification)

Use LLM-as-judge for iteration, human evaluation for final validation.

---

Bottom line: Rigorous evaluation is essential for reliable AI agents. Define success metrics, create diverse test datasets (50-200 examples), benchmark against baselines, A/B test in production, and monitor continuously. Teams with systematic evaluation deploy 3× faster with 40% fewer production issues.

Next: Read our Agent Testing Strategies guide for comprehensive testing approaches.

Stop doing the work around the work

OpenHelm connects to your tools, reads the context, and does the steps, so you sign off on the result instead of producing it. See how it covers an entire role’s weekly workload, check the pricing, or run it yourself with the free local app.

Book a demo Explore use cases

Back to Blog