
Complete Guide to Agent Evaluation: Metrics, Benchmarks, and Testing Strategies

How to evaluate AI agent performance: success metrics, benchmark datasets, A/B testing strategies, and production monitoring for reliable agent deployments.

Max Beech · Founder

TL;DR

  • Problem: How do you know if your AI agent actually works well?
  • Solution: Define success metrics → Create evaluation dataset → Benchmark performance → A/B test changes → Monitor production.
  • Key metrics: Task success rate (most important), accuracy, latency, cost per task, user satisfaction.
  • Evaluation dataset: 50-200 representative examples with expected outputs.
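  • Benchmark: As a rough reference point, a GPT-4 baseline lands around 85-92% on typical tasks and Claude 3.5 Sonnet around 87-94%; measure both on your own dataset before relying on these figures.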
  • A/B testing: Run new agent version on 5-10% traffic, compare metrics to baseline.
  • Production monitoring: Track success rate, latency, cost in real-time with alerts.
  • Real data: Teams with systematic evaluation deploy agents 3× faster with 40% fewer issues.

# Complete Guide to Agent Evaluation

Common scenario:

Engineer: "I built an agent!"
Manager: "Does it work?"
Engineer: "...it seems to work?"
Manager: "How well?"
Engineer: "...I tested it on 3 examples?"

Problem: No systematic evaluation = no confidence in deployment.

Solution: Rigorous evaluation framework.

Step 1: Define Success Metrics

Primary Metric: Task Success Rate

Definition: Percentage of tasks completed correctly.

How to measure:

def evaluate_task_success(agent_output, expected_output, task_type):
    """
    Determine if agent successfully completed task.
    """
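    if task_type == "data_extraction":
        # Every expected field must be present and carry the expected value
        # (checking only for the key would pass wrong extractions)
        return all(
            agent_output.get(field) == value
            for field, value in expected_output.items()
        )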
    
    elif task_type == "classification":
        # Check if classification matches
        return agent_output["category"] == expected_output["category"]
    
    elif task_type == "generation":
        # Use LLM-as-judge to evaluate quality
        judge_prompt = f"""
        Task: {expected_output['task_description']}
        Agent output: {agent_output}
        Expected criteria: {expected_output['criteria']}
        
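        Does the output meet all criteria? Answer with a single word: yes or no.
        """
        # call_llm is a placeholder for your own LLM client wrapper
        judgment = call_llm(judge_prompt, model="gpt-4-turbo")
        # Match the leading word so a reply like "no, although yes in part" is not a pass
        return judgment.strip().lower().startswith("yes")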
    
    return False

# Evaluate on test set
test_cases = load_evaluation_dataset()
successes = 0

for test in test_cases:
    agent_output = agent.execute(test['input'])
    if evaluate_task_success(agent_output, test['expected_output'], test['task_type']):
        successes += 1

success_rate = successes / len(test_cases)
print(f"Success rate: {success_rate:.1%}")

Secondary Metrics

| Metric | What It Measures | Target | How to Calculate |
|---|---|---|---|
| Accuracy | Correctness of outputs | >95% | Correct outputs / Total outputs |
| Latency | Response time | <5s (p95) | Time from input to final output |
| Cost | LLM API costs | <$0.10/task | Sum of all API calls per task |
| User satisfaction | End-user happiness | >4/5 | Survey ratings or thumbs up/down |
| Error rate | Unhandled exceptions | <2% | Errors / Total requests |

Example Metrics Dashboard:

class AgentMetrics:
    def __init__(self):
        self.total_tasks = 0
        self.successful_tasks = 0
        self.total_latency = 0
        self.total_cost = 0
        self.errors = 0
    
    def record_task(self, success, latency_ms, cost_usd, error=None):
        self.total_tasks += 1
        if success:
            self.successful_tasks += 1
        self.total_latency += latency_ms
        self.total_cost += cost_usd
        if error:
            self.errors += 1
    
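    def get_summary(self):
        # Guard against division by zero before any task has been recorded
        if self.total_tasks == 0:
            return {
                "success_rate": 0.0,
                "avg_latency_ms": 0.0,
                "avg_cost_per_task": 0.0,
                "error_rate": 0.0
            }
        return {
            "success_rate": self.successful_tasks / self.total_tasks,
            "avg_latency_ms": self.total_latency / self.total_tasks,
            "avg_cost_per_task": self.total_cost / self.total_tasks,
            "error_rate": self.errors / self.total_tasks
        }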
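
The latency target in the metrics table is a p95 figure, while the AgentMetrics class above only tracks a mean. A minimal sketch of a nearest-rank percentile over recorded latencies follows; the `percentile` helper and the sample values are illustrative assumptions, not part of the original class.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds for ten tasks
latencies_ms = [820, 910, 1040, 1100, 1250, 1300, 1480, 2100, 3900, 7200]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
print(f"mean: {mean_ms:.0f}ms, p95: {p95_ms}ms")
```

A couple of slow tasks barely move the mean but dominate the p95, which is why the target is stated as a percentile.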

"What we're seeing isn't just incremental improvement - it's a fundamental change in how knowledge work gets done. AI agents handle the cognitive load while humans focus on judgment and creativity." - Marcus Chen, Chief AI Officer at McKinsey Digital

Step 2: Create Evaluation Dataset

Size: 50-200 examples minimum (more is better).

Coverage: Representative of real-world distribution.

Sampling Strategy

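import random

def create_evaluation_dataset(production_logs, sample_size=200):
    """
    Sample diverse, representative test cases from production.
    """
    dataset = []
    
    # Stratified sampling by complexity level
    complexity_levels = ["simple", "medium_complexity", "complex"]
    samples_per_level = sample_size // len(complexity_levels)
    
    for complexity in complexity_levels:
        # Get examples of this complexity
        examples = [
            log for log in production_logs 
            if log['complexity'] == complexity
        ]
        
        # Random sample, capped at what is available: random.sample raises
        # ValueError when asked for more items than the list holds
        sampled = random.sample(examples, min(samples_per_level, len(examples)))
        
        for example in sampled:
            dataset.append({
                "id": f"test_{len(dataset) + 1:03d}",
                "input": example['user_input'],
                "expected_output": example['correct_output'],
                # Assumes each log records the kind of task ("data_extraction",
                # "classification", ...) that evaluate_task_success dispatches on
                "task_type": example['task_type'],
                "complexity": complexity,
                "difficulty": example.get('difficulty', 'medium')
            })
    
    # Add edge cases manually
    dataset.extend(load_edge_cases())
    
    return dataset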

Include:

  • Common cases (70%): Typical inputs
  • Edge cases (20%): Unusual but valid inputs
  • Error cases (10%): Invalid inputs (should fail gracefully)
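
A quick way to check a dataset against this 70/20/10 mix is to count cases by kind. The sketch below assumes each test case carries a `case_kind` label ("common", "edge" or "error"); that field and the 5-point tolerance are hypothetical choices for illustration.

```python
from collections import Counter

# Target mix from the list above; "case_kind" is a hypothetical label on each test case
TARGET_MIX = {"common": 0.70, "edge": 0.20, "error": 0.10}

def composition_report(dataset, tolerance=0.05):
    """Return, per case kind, (actual share, target share, within tolerance?)."""
    counts = Counter(case["case_kind"] for case in dataset)
    total = len(dataset)
    report = {}
    for kind, target in TARGET_MIX.items():
        actual = counts.get(kind, 0) / total if total else 0.0
        report[kind] = (actual, target, abs(actual - target) <= tolerance)
    return report

# Illustrative dataset of 100 cases
dataset = (
    [{"case_kind": "common"}] * 72
    + [{"case_kind": "edge"}] * 18
    + [{"case_kind": "error"}] * 10
)
for kind, (actual, target, ok) in composition_report(dataset).items():
    status = "ok" if ok else "off target"
    print(f"{kind}: {actual:.0%} (target {target:.0%}) {status}")
```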

Example Dataset Structure

[
  {
    "id": "test_001",
    "input": {
      "task": "Extract invoice data",
      "document": "invoice_sample_1.pdf"
    },
    "expected_output": {
      "invoice_number": "INV-12345",
      "date": "2024-06-15",
      "total": 1250.00,
      "vendor": "Acme Corp"
    },
    "task_type": "data_extraction",
    "difficulty": "easy"
  },
  {
    "id": "test_002",
    "input": {
      "task": "Classify customer support ticket",
      "text": "My payment failed but I was still charged."
    },
    "expected_output": {
      "category": "billing_issue",
      "priority": "high",
      "department": "finance"
    },
    "task_type": "classification",
    "difficulty": "medium"
  }
]
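
Before running a benchmark it can help to validate the dataset file, since a missing field otherwise surfaces as an error halfway through a run. The following is a minimal sketch; `validate_dataset` and the set of required keys are assumptions based on the structure shown above.

```python
import json

# Keys every test case is assumed to need, based on the example structure
REQUIRED_KEYS = {"id", "input", "expected_output", "task_type"}

def validate_dataset(cases):
    """Return a list of problems found; an empty list means the dataset looks usable."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"case {index}: missing {sorted(missing)}")
        case_id = case.get("id")
        if case_id in seen_ids:
            problems.append(f"case {index}: duplicate id {case_id!r}")
        seen_ids.add(case_id)
    return problems

# Deliberately flawed sample: the second case reuses an id and lacks task_type
raw = """[
  {"id": "test_001", "input": {"task": "Extract invoice data"},
   "expected_output": {"invoice_number": "INV-12345"}, "task_type": "data_extraction"},
  {"id": "test_001", "input": {"task": "Classify customer support ticket"},
   "expected_output": {"category": "billing_issue"}}
]"""
problems = validate_dataset(json.loads(raw))
for problem in problems:
    print(problem)
```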

Step 3: Benchmark Performance

Run Evaluation Suite

import time

def run_benchmark(agent, evaluation_dataset):
    """
    Evaluate agent on full dataset and return metrics.
    """
    results = []
    
    for test_case in evaluation_dataset:
        start_time = time.time()
        
        try:
            # Run agent
            output = agent.execute(test_case['input'])
            
            # Evaluate success
            success = evaluate_task_success(
                output,
                test_case['expected_output'],
                test_case['task_type']
            )
            
            latency = (time.time() - start_time) * 1000  # ms
            
            results.append({
                "test_id": test_case['id'],
                "success": success,
                "latency_ms": latency,
                "cost_usd": calculate_cost(output),
                "output": output
            })
        
        except Exception as e:
            results.append({
                "test_id": test_case['id'],
                "success": False,
                "error": str(e)
            })
    
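    # Calculate aggregate metrics
    total = len(results)
    successful = sum(1 for r in results if r['success'])
    # Average latency over completed runs only; errored cases carry no timing
    # and would otherwise pull the average toward zero
    latencies = [r['latency_ms'] for r in results if 'latency_ms' in r]
    avg_latency = sum(latencies) / len(latencies) if latencies else 0.0
    total_cost = sum(r.get('cost_usd', 0) for r in results)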
    
    return {
        "success_rate": successful / total,
        "avg_latency_ms": avg_latency,
        "total_cost_usd": total_cost,
        "avg_cost_per_task": total_cost / total,
        "detailed_results": results
    }

# Run benchmark
benchmark_results = run_benchmark(my_agent, eval_dataset)
print(f"Success rate: {benchmark_results['success_rate']:.1%}")
print(f"Avg latency: {benchmark_results['avg_latency_ms']:.0f}ms")
print(f"Avg cost: ${benchmark_results['avg_cost_per_task']:.4f}/task")
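
The benchmark calls calculate_cost without defining it. One way such a helper might work is to sum token-based charges over every LLM call a task made. The sketch below assumes the agent exposes a list of call records with token counts; the function name `cost_from_calls`, the model names and the prices are all hypothetical placeholders, so substitute your provider's current rates.

```python
# Hypothetical prices in USD per 1,000 tokens; not real provider pricing
PRICES_PER_1K = {
    "example-large": {"input": 0.010, "output": 0.030},
    "example-small": {"input": 0.001, "output": 0.002},
}

def cost_from_calls(calls):
    """Sum the cost of every LLM call made while handling one task.

    Each call is a dict with "model", "input_tokens" and "output_tokens".
    """
    total = 0.0
    for call in calls:
        price = PRICES_PER_1K[call["model"]]
        total += call["input_tokens"] / 1000 * price["input"]
        total += call["output_tokens"] / 1000 * price["output"]
    return total

# One task that made a planning call and a cheaper follow-up call
calls = [
    {"model": "example-large", "input_tokens": 2000, "output_tokens": 500},
    {"model": "example-small", "input_tokens": 1000, "output_tokens": 1000},
]
task_cost_usd = cost_from_calls(calls)
print(f"Cost for this task: ${task_cost_usd:.4f}")
```

Summing per call rather than per task matters for agents, because a single task often triggers several model calls.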

Compare to Baselines

Baseline 1: Direct LLM call (no agent framework)

baseline_gpt4 = SimpleAgent(model="gpt-4-turbo", system_prompt="You are a helpful assistant.")
baseline_results = run_benchmark(baseline_gpt4, eval_dataset)

print(f"Your agent: {benchmark_results['success_rate']:.1%}")
print(f"GPT-4 baseline: {baseline_results['success_rate']:.1%}")

Baseline 2: Previous version of your agent

previous_version_results = load_benchmark("agent_v1.2_results.json")
current_version_results = run_benchmark(agent_v1_3, eval_dataset)

improvement = current_version_results['success_rate'] - previous_version_results['success_rate']
print(f"Improvement: {improvement:+.1%}")

Model Comparison Benchmarks

| Model | Success Rate | Avg Latency | Cost/Task | Best For |
|---|---|---|---|---|
| GPT-4 Turbo | 89% | 3.2s | $0.042 | Complex reasoning |
| Claude 3.5 Sonnet | 91% | 2.8s | $0.038 | Balanced quality/speed |
| GPT-3.5 Turbo | 78% | 1.1s | $0.008 | Simple tasks |
| Claude 3 Haiku | 81% | 0.9s | $0.005 | High-volume, simple |

*(Benchmarked on mixed task dataset, June 2024)*

Step 4: LLM-as-Judge Evaluation

For open-ended tasks (content generation, summarization), use another LLM to evaluate quality.

import json

def llm_as_judge(task, agent_output, criteria):
    """
    Use GPT-4 to evaluate agent output quality.
    """
    judge_prompt = f"""
    You are evaluating an AI agent's performance.
    
    Task: {task}
    
    Agent output:
    {agent_output}
    
    Evaluation criteria:
    {criteria}
    
    Rate the output on each criterion (1-5 scale):
    - Accuracy: Is the information correct?
    - Completeness: Does it address all parts of the task?
    - Clarity: Is it easy to understand?
    - Relevance: Is it on-topic?
    
    Respond in JSON format:
    {{
      "accuracy": <1-5>,
      "completeness": <1-5>,
      "clarity": <1-5>,
      "relevance": <1-5>,
      "overall_score": <average>,
      "reasoning": "<brief explanation>"
    }}
    """
    
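    judgment = call_llm(judge_prompt, model="gpt-4-turbo", temperature=0)
    # Raises json.JSONDecodeError if the reply is not pure JSON
    scores = json.loads(judgment)
    # Recompute the average locally rather than trusting the model's arithmetic
    ratings = [scores[k] for k in ("accuracy", "completeness", "clarity", "relevance")]
    scores["overall_score"] = sum(ratings) / len(ratings)
    return scores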

# Evaluate agent output
judgment = llm_as_judge(
    task="Summarize this 10-page document",
    agent_output=agent_summary,
    criteria="Summary should be 3-5 sentences, capture key points, and be accurate."
)

print(f"Overall score: {judgment['overall_score']}/5")
print(f"Reasoning: {judgment['reasoning']}")

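Reliability: Reported agreement between LLM judges and human evaluators is roughly 85-90%, but it varies with the task and the rubric, so spot-check judge verdicts against a human-labeled sample before trusting them.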
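
Agreement with human evaluators is something you can measure on your own data. The sketch below computes the raw agreement rate and Cohen's kappa (agreement corrected for chance) between judge verdicts and human labels; the function names and the ten sample labels are illustrative assumptions.

```python
from collections import Counter

def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for the agreement expected by chance alone."""
    observed = agreement_rate(judge_labels, human_labels)
    n = len(judge_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge_labels) | set(human_labels)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts on the same ten outputs
judge = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass"]

print(f"Agreement: {agreement_rate(judge, human):.0%}")
print(f"Cohen's kappa: {cohens_kappa(judge, human):.2f}")
```

Kappa is worth reporting alongside raw agreement: when most outputs pass, two raters agree often by chance, and raw agreement overstates how reliable the judge is.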

Step 5: A/B Testing in Production

Goal: Compare two agent versions with real users.

Setup:

  1. Deploy both versions
  2. Randomly route 5% traffic to Version B, 95% to Version A
  3. Track success metrics for both
  4. If B performs better, gradually increase to 100%
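
Step 2 can route per request or per user. Per-request randomness means the same user may bounce between versions, so a common alternative is to hash a stable identifier into a bucket. The sketch below assumes a `user_id` is available on each request; `assign_version` and the experiment name are hypothetical.

```python
import hashlib

def assign_version(user_id, b_traffic_percent=5, experiment="agent-rollout"):
    """Map a user to "A" or "B" deterministically, so repeat visits see the same version."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    return "B" if bucket < b_traffic_percent else "A"

# The same user always lands in the same arm
print(assign_version("user-42"), assign_version("user-42"))

# Across many synthetic users, roughly b_traffic_percent end up in B
users = [f"user-{i}" for i in range(10000)]
share_b = sum(assign_version(u) == "B" for u in users) / len(users)
print(f"Share routed to B: {share_b:.1%}")
```

Including the experiment name in the hash keeps bucket assignments independent between experiments, so the same users are not always the first to receive every new version.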

Implementation

import random
import time

class ABTestRouter:
    def __init__(self, version_a_agent, version_b_agent, b_traffic_percent=5):
        self.version_a = version_a_agent
        self.version_b = version_b_agent
        self.b_traffic_percent = b_traffic_percent
        self.metrics_a = AgentMetrics()
        self.metrics_b = AgentMetrics()
    
    async def route_request(self, user_input):
        # Randomly assign to A or B
        use_version_b = random.random() < (self.b_traffic_percent / 100)
        
        if use_version_b:
            agent = self.version_b
            metrics = self.metrics_b
            version = "B"
        else:
            agent = self.version_a
            metrics = self.metrics_a
            version = "A"
        
        # Execute and track
        start_time = time.time()
        try:
            result = await agent.execute(user_input)
            latency = (time.time() - start_time) * 1000
            cost = calculate_cost(result)
            
            metrics.record_task(
                success=True,
                latency_ms=latency,
                cost_usd=cost
            )
            
            # Log for analysis
            log_ab_test_result(version, user_input, result, latency, cost)
            
            return result
        
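        except Exception as e:
            # Record the real elapsed time so failures do not drag
            # the average latency toward zero
            metrics.record_task(
                success=False,
                latency_ms=(time.time() - start_time) * 1000,
                cost_usd=0,
                error=str(e)
            )
            raise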
    
    def get_comparison(self):
        """Compare A vs B performance"""
        a_stats = self.metrics_a.get_summary()
        b_stats = self.metrics_b.get_summary()
        
        return {
            "version_a": a_stats,
            "version_b": b_stats,
            "improvement": {
                "success_rate": b_stats['success_rate'] - a_stats['success_rate'],
                "latency": b_stats['avg_latency_ms'] - a_stats['avg_latency_ms'],
                "cost": b_stats['avg_cost_per_task'] - a_stats['avg_cost_per_task']
            }
        }

Statistical Significance

from statsmodels.stats.proportion import proportions_ztest

def is_statistically_significant(metrics_a, metrics_b, min_samples=100):
    """
    Check if difference between A and B is statistically significant.
    """
    if metrics_a.total_tasks < min_samples or metrics_b.total_tasks < min_samples:
        return False, "Insufficient sample size"
    
    # Two-proportion z-test
    # (proportions_ztest is provided by statsmodels, not scipy.stats)
    successes_a = metrics_a.successful_tasks
    successes_b = metrics_b.successful_tasks
    total_a = metrics_a.total_tasks
    total_b = metrics_b.total_tasks
    
    # Calculate p-value
    stat, p_value = proportions_ztest(
        [successes_a, successes_b],
        [total_a, total_b]
    )
    
    # Significant if p < 0.05
    is_significant = p_value < 0.05
    
    return is_significant, f"p-value: {p_value:.4f}"
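
To see what the test is doing, here is a dependency-free sketch of the same two-proportion z-test with a pooled standard error. The counts are hypothetical, chosen to mirror an 82% vs 89% comparison at two different sample sizes.

```python
import math

def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Two-sided two-proportion z-test using the pooled standard error."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0, 1.0  # identical, degenerate samples: no evidence of a difference
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value

# Hypothetical early read: B has only 100 tasks (5% of traffic)
z_small, p_small = two_proportion_z_test(1558, 1900, 89, 100)
print(f"early: z={z_small:.2f}, p={p_small:.4f}")

# Same success rates after five times as much traffic
z_large, p_large = two_proportion_z_test(7790, 9500, 445, 500)
print(f"later: z={z_large:.2f}, p={p_large:.6f}")
```

With these numbers the seven-point gap is not significant at the 0.05 level in the early read and becomes significant once B has accumulated more tasks, which is why a small traffic split needs time before a decision.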

Step 6: Production Monitoring

Track metrics in real-time to catch regressions.

Monitoring Setup

import time

from prometheus_client import Counter, Histogram, Gauge

# Define metrics
tasks_total = Counter('agent_tasks_total', 'Total tasks', ['agent_name', 'status'])
task_duration = Histogram('agent_task_duration_seconds', 'Task duration', ['agent_name'])
task_cost = Histogram('agent_task_cost_usd', 'Task cost', ['agent_name'])
success_rate = Gauge('agent_success_rate', 'Current success rate', ['agent_name'])

def track_agent_execution(agent, agent_name, task_input):
    # The agent is passed in explicitly. update_success_rate (used below) is a
    # helper you supply that keeps the success_rate gauge current.
    start_time = time.time()
    
    try:
        result = agent.execute(task_input)
        
        # Record success
        tasks_total.labels(agent_name=agent_name, status='success').inc()
        
        duration = time.time() - start_time
        task_duration.labels(agent_name=agent_name).observe(duration)
        
        cost = calculate_cost(result)
        task_cost.labels(agent_name=agent_name).observe(cost)
        
        # Update success rate (rolling window)
        update_success_rate(agent_name, success=True)
        
        return result
    
    except Exception as e:
        # Record failure
        tasks_total.labels(agent_name=agent_name, status='failure').inc()
        update_success_rate(agent_name, success=False)
        
        raise
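
The monitoring code calls update_success_rate without defining it. A minimal sketch of the rolling-window logic it implies follows; the `RollingSuccessRate` class and the window size are assumptions. In the Prometheus setup, the returned value would be written to the gauge with `success_rate.labels(agent_name=agent_name).set(rate)`.

```python
from collections import deque

class RollingSuccessRate:
    """Success rate over the most recent `window_size` task outcomes."""

    def __init__(self, window_size=100):
        # deque with maxlen drops the oldest outcome automatically
        self.outcomes = deque(maxlen=window_size)

    def record(self, success):
        """Store one outcome and return the updated rate."""
        self.outcomes.append(bool(success))
        return self.rate()

    def rate(self):
        if not self.outcomes:
            return None  # nothing recorded yet
        return sum(self.outcomes) / len(self.outcomes)

# Small window so the effect is visible
window = RollingSuccessRate(window_size=5)
for outcome in [True, True, False, True, True]:
    rate_before = window.record(outcome)
for outcome in [False, False]:
    rate_after = window.record(outcome)
print(f"before: {rate_before:.0%}, after two failures: {rate_after:.0%}")
```

A rolling window reacts to a recent run of failures much faster than a lifetime average, which is what an alert on a success-rate threshold needs.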

Alerts

groups:
  - name: agent_alerts
    rules:
      # Alert if success rate drops below 85%
      - alert: LowSuccessRate
        expr: agent_success_rate < 0.85
        for: 5m
        annotations:
          summary: "Agent success rate dropped to {{ $value | humanizePercentage }}"

      # Alert if p95 latency exceeds 10 seconds
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le, agent_name) (rate(agent_task_duration_seconds_bucket[5m]))) > 10
        for: 5m
        annotations:
          summary: "Agent p95 latency is {{ $value }}s"

Real-World Example

Company: E-commerce customer support

Agent: Automated ticket routing and responses

Evaluation process:

1. Defined metrics:

  • Primary: Correct routing accuracy
  • Secondary: Response quality (LLM-as-judge), response time

2. Created dataset:

  • 150 historical tickets (sampled across categories)
  • Expert-labeled correct routing + ideal responses

3. Benchmarked:

  • Version 1.0: 82% routing accuracy
  • GPT-4 baseline (no agent): 76% routing accuracy

4. Improved with prompt engineering:

  • Version 1.1: 89% routing accuracy

5. A/B tested:

  • Deployed v1.1 to 10% traffic
  • Monitored for 2 weeks
  • v1.1 outperformed v1.0 (89% vs 82%)
  • Rolled out to 100%

6. Production monitoring:

  • Daily success rate tracked
  • Alert if drops below 85%
  • Monthly re-evaluation on new test cases

Results: Agent handles 67% of tickets autonomously (up from 0%). Customer satisfaction: 4.2/5 for agent responses.

Frequently Asked Questions

How often should I re-evaluate my agent?

Recommendation:

  • After every major change (new model, prompt update)
  • Monthly on fixed test set (catch regressions)
  • Continuous monitoring in production
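
A scheduled re-evaluation is most useful when it fails loudly. The sketch below compares two benchmark summaries (shaped like the output of run_benchmark) and reports regressions; `check_regression` and both thresholds are hypothetical and should be tuned to your own tolerance.

```python
def check_regression(previous, current, max_success_drop=0.02, max_latency_increase=0.20):
    """Compare two benchmark summaries and return a list of regression messages."""
    issues = []

    # Absolute drop in success rate, in fractional points
    drop = previous["success_rate"] - current["success_rate"]
    if drop > max_success_drop:
        issues.append(f"success rate fell by {drop:.1%}")

    # Relative increase in average latency
    if previous["avg_latency_ms"] > 0:
        increase = current["avg_latency_ms"] / previous["avg_latency_ms"] - 1
        if increase > max_latency_increase:
            issues.append(f"average latency rose by {increase:.0%}")

    return issues

# Hypothetical summaries from last month's run and this month's run
previous = {"success_rate": 0.89, "avg_latency_ms": 2800}
current = {"success_rate": 0.85, "avg_latency_ms": 2900}

issues = check_regression(previous, current)
print(issues if issues else "no regression detected")
```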

What if my agent has no "correct" output?

For open-ended tasks (creative writing, brainstorming), use:

  • LLM-as-judge with clear rubric
  • Human evaluation on sample (expensive but necessary)
  • User satisfaction ratings (thumbs up/down)

How many test cases do I need?

Minimum: 50 cases

Good: 200+ cases

Ideal: 1,000+ cases

More test cases = higher confidence in metrics.
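
The gain in confidence can be quantified with the margin of error of a measured success rate. The sketch below uses the normal approximation for a proportion at 95% confidence; the 85% success rate is an illustrative value.

```python
import math

def margin_of_error(success_rate, n, z=1.96):
    """Half-width of the normal-approximation 95% confidence interval for a proportion."""
    return z * math.sqrt(success_rate * (1 - success_rate) / n)

# Illustrative: a measured 85% success rate at the three dataset sizes above
for n in (50, 200, 1000):
    moe = margin_of_error(0.85, n)
    print(f"n={n}: 85% ± {moe:.1%}")
```

At 50 cases the interval is roughly ±10 points, at 200 about ±5, and at 1,000 about ±2, so a 3-point improvement measured on 50 cases is indistinguishable from noise.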

Should I use human or LLM evaluation?

| Method | Cost | Speed | Reliability | Best For |
|---|---|---|---|---|
| Human | High | Slow | Highest | Final validation, edge cases |
| LLM-as-judge | Low | Fast | 85-90% | Iteration, bulk evaluation |
| Automated metrics | Lowest | Fastest | Varies | Objective tasks (extraction, classification) |

Use LLM-as-judge for iteration, human evaluation for final validation.

---

Bottom line: Rigorous evaluation is essential for reliable AI agents. Define success metrics, create diverse test datasets (50-200 examples), benchmark against baselines, A/B test in production, and monitor continuously. Teams with systematic evaluation deploy 3× faster with 40% fewer production issues.

Next: Read our Agent Testing Strategies guide for comprehensive testing approaches.
