Agent Testing Strategies: Unit, Integration, and End-to-End Testing for AI Systems
Comprehensive testing strategies for AI agents: unit tests for components, integration tests for workflows, and E2E tests for full systems, with mocking patterns and CI/CD integration.

TL;DR
- Three test levels: Unit (individual components), Integration (component interactions), E2E (full agent workflows).
- Unit tests: Fast, deterministic, test logic without LLM calls. Mock LLM responses.
- Integration tests: Test agent + external services (APIs, databases). Use staging environment.
- E2E tests: Test complete user workflow. Slow, expensive, but catches real issues.
- Mocking: Replace LLM calls with fixed responses for fast, cheap tests. 95% of tests should use mocks.
- LLM-as-judge: For E2E tests, use another LLM to evaluate output quality (can't use exact string matching).
- CI/CD: Run unit tests on every commit (seconds), integration tests on PR (minutes), E2E tests nightly (hours).
- Real data: Teams with comprehensive testing deploy 2.5× more frequently with 60% fewer production bugs.
# Agent Testing Strategies
Without testing:
Engineer: "I updated the prompt"
Deploy to production
User: "Agent is broken!"
Engineer: "Oops, didn't test that"

With testing:
Engineer: "I updated the prompt"
Run tests → 12/15 tests fail
Engineer: "Found the issue, fixing..."
Run tests → 15/15 pass
Deploy with confidence

Testing Pyramid for AI Agents
Traditional software testing pyramid:
  E2E (few)
  Integration (some)
  Unit (many)

AI agent testing pyramid (same structure):
  E2E: Full workflow (10 tests, run nightly)
  Integration: Agent + services (50 tests, run on PR)
  Unit: Components (200 tests, run every commit)

Why the pyramid shape: unit tests are fast and cheap, so they run constantly; E2E tests are slow and expensive, so they run sparingly.
"Agent orchestration is where the real value lives. Individual AI capabilities matter less than how well you coordinate them into coherent workflows." - James Park, Founder of AI Infrastructure Labs
Level 1: Unit Tests
What: Test individual components in isolation (parsers, validators, formatters, tool functions).
Goal: Fast, deterministic, no external dependencies.
Example: Test Tool Function
# tool: search_database.py
# Assumes `db` is a database handle imported from a `tools` package.
def search_database(query: str, limit: int = 10) -> list:
    """Search database for records matching query"""
    if not query or len(query) < 3:
        raise ValueError("Query must be at least 3 characters")
    if limit < 1 or limit > 100:
        raise ValueError("Limit must be between 1 and 100")
    # Use a parameterized query rather than string interpolation to avoid SQL injection.
    # The `?` placeholder style depends on the database driver.
    results = db.execute(
        "SELECT * FROM records WHERE content LIKE ? LIMIT ?",
        (f"%{query}%", limit),
    )
    return results

# test_search_database.py
import pytest
from unittest.mock import patch
from search_database import search_database  # adjust to your package layout

def test_search_database_valid_query():
    """Test search with valid query returns results"""
    with patch('tools.db.execute') as mock_db:
        mock_db.return_value = [{"id": 1, "content": "test result"}]
        results = search_database("test", limit=10)
        assert len(results) == 1
        assert results[0]["content"] == "test result"
        mock_db.assert_called_once()

def test_search_database_short_query():
    """Test search rejects query < 3 chars"""
    with pytest.raises(ValueError, match="at least 3 characters"):
        search_database("ab")

def test_search_database_invalid_limit():
    """Test search rejects invalid limit"""
    with pytest.raises(ValueError, match="between 1 and 100"):
        search_database("test", limit=200)

Run time: <1 second for 100 unit tests.
Mock LLM Responses
Problem: LLM calls are slow, expensive, and non-deterministic.
Solution: Mock responses in unit tests.
# agent.py
import json

class CustomerSupportAgent:
    async def classify_ticket(self, ticket_text):
        """Classify support ticket into category"""
        response = await call_llm(
            f"Classify this ticket: {ticket_text}\nCategories: billing, technical, account",
            model="gpt-3.5-turbo"
        )
        return json.loads(response)

# test_agent.py
import pytest
from unittest.mock import AsyncMock, patch

@pytest.mark.asyncio
async def test_classify_ticket_billing():
    """Test classification of billing ticket"""
    agent = CustomerSupportAgent()
    # Mock LLM response (AsyncMock because call_llm is awaited)
    with patch('agent.call_llm', new_callable=AsyncMock) as mock_llm:
        mock_llm.return_value = '{"category": "billing", "confidence": 0.95}'
        result = await agent.classify_ticket("My payment failed")
        assert result["category"] == "billing"
        assert result["confidence"] == 0.95
        mock_llm.assert_awaited_once()

@pytest.mark.asyncio
async def test_classify_ticket_technical():
    """Test classification of technical ticket"""
    agent = CustomerSupportAgent()
    with patch('agent.call_llm', new_callable=AsyncMock) as mock_llm:
        mock_llm.return_value = '{"category": "technical", "confidence": 0.89}'
        result = await agent.classify_ticket("App is crashing")
        assert result["category"] == "technical"

Benefits:
- Fast (no LLM API call)
- Free (no API costs)
- Deterministic (same input = same output)
Limitations:
- Doesn't test actual LLM behavior
- Might pass even if prompt is broken
Rule: 95% of tests should mock LLMs, 5% use real LLM calls (integration/E2E tests).
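One way to narrow the "might pass even if prompt is broken" limitation is to assert on the prompt the mock receives, so a test fails when a prompt edit drops a category or the ticket text. The sketch below is self-contained: `classify_ticket` here is a simplified stand-in that takes the LLM callable as a parameter, and `check_prompt_lists_all_categories` is a hypothetical helper name, not part of the article's agent.

```python
import asyncio
import json
from unittest.mock import AsyncMock

CATEGORIES = ["billing", "technical", "account"]


async def classify_ticket(ticket_text: str, call_llm) -> dict:
    """Stand-in classifier: builds the prompt and parses the JSON reply."""
    prompt = (
        f"Classify this ticket: {ticket_text}\n"
        f"Categories: {', '.join(CATEGORIES)}"
    )
    response = await call_llm(prompt)
    return json.loads(response)


def check_prompt_lists_all_categories() -> str:
    """Run the classifier against a mock and return the prompt it was sent."""
    mock_llm = AsyncMock(return_value='{"category": "billing", "confidence": 0.95}')
    result = asyncio.run(classify_ticket("My payment failed", mock_llm))

    # The mocked reply still flows through the real parsing logic.
    assert result["category"] == "billing"

    # Inspect the prompt itself, which a return-value-only test never sees.
    mock_llm.assert_awaited_once()
    prompt = mock_llm.await_args.args[0]
    for category in ("billing", "technical", "account"):
        assert category in prompt, f"prompt no longer mentions {category!r}"
    assert "My payment failed" in prompt
    return prompt
```

This still says nothing about how a real model responds to the prompt; it only guards the prompt's contents against accidental edits.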
Level 2: Integration Tests
What: Test agent with real external services (databases, APIs, LLMs) in staging environment.
Goal: Catch integration issues before production.
Example: Test Agent + Database + LLM
@pytest.mark.integration  # Mark as integration test
@pytest.mark.asyncio
async def test_agent_end_to_end_search():
    """Test agent can search database and format results"""
    # Setup: Staging database with test data
    test_db = setup_staging_db()
    try:
        test_db.insert("test_record", {"id": 1, "content": "Integration test data"})
        # Create agent connected to staging DB
        agent = SearchAgent(database=test_db)
        # Execute agent with REAL LLM call
        result = await agent.execute("Find records about integration test")
        # Verify
        assert "integration test data" in result.lower()
        assert len(result) > 50  # Agent formatted response (not just raw data)
    finally:
        # Cleanup runs even when an assertion fails
        test_db.cleanup()

Run time: 5-10 seconds per test (LLM call adds latency).
Cost: $0.001-0.01 per test (LLM API calls).
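As a rough worked example, the per-test cost above can be combined with the 50 integration tests from the pyramid to get a per-run budget. Both inputs are the article's estimates rather than measurements, and `run_cost_range` is a hypothetical helper.

```python
from typing import Tuple

INTEGRATION_TESTS = 50              # test count from the pyramid above
COST_PER_TEST_USD = (0.001, 0.01)   # low and high per-test estimates


def run_cost_range(num_tests: int, cost_range: Tuple[float, float]) -> Tuple[float, float]:
    """Return the (low, high) LLM cost in USD for one full run of the suite."""
    low, high = cost_range
    return round(num_tests * low, 4), round(num_tests * high, 4)


low, high = run_cost_range(INTEGRATION_TESTS, COST_PER_TEST_USD)
print(f"Estimated LLM cost per integration run: ${low:.2f} to ${high:.2f}")
# prints: Estimated LLM cost per integration run: $0.05 to $0.50
```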
Test Agent Workflow
@pytest.mark.integration
@pytest.mark.asyncio
async def test_customer_support_workflow():
    """Test full support ticket workflow"""
    agent = CustomerSupportAgent()
    # Step 1: Classify
    classification = await agent.classify_ticket("My payment failed but I was charged")
    assert classification["category"] == "billing"
    # Step 2: Retrieve context
    context = await agent.get_customer_context(user_id="test_user_123")
    assert "payment_method" in context
    # Step 3: Generate response
    response = await agent.generate_response(classification, context)
    # Verify response quality (fuzzy matching, not exact)
    assert "refund" in response.lower() or "charge" in response.lower()
    assert len(response) > 100  # Substantial response

When to run: On pull requests, before merging to main.
Level 3: E2E (End-to-End) Tests
What: Test complete user workflow from input to final output, including all agent steps.
Goal: Verify agent works as users experience it.
Example: Multi-Step Research Agent
@pytest.mark.e2e
@pytest.mark.asyncio
@pytest.mark.slow  # Mark slow tests
async def test_research_agent_full_workflow():
    """
    Test research agent:
    1. Receives research query
    2. Searches web
    3. Analyzes sources
    4. Generates report
    """
    agent = ResearchAgent()
    # Execute full workflow (takes minutes)
    report = await agent.research("What are the latest developments in quantum computing?")
    # Verify report structure
    assert "## Summary" in report
    assert "## Key Findings" in report
    assert "## Sources" in report
    # Verify quality with LLM-as-judge (score is on a 1-10 scale)
    quality_score = await evaluate_report_quality(report)
    # Compare against 7, not 7/10: in Python 7/10 is 0.7, which every score passes
    assert quality_score >= 7  # At least 7 out of 10

LLM-as-Judge for E2E Tests
Problem: Can't use exact string matching (LLM outputs vary).
Solution: Use another LLM to evaluate output quality.
async def evaluate_report_quality(report: str) -> float:
    """Use GPT-4 to score report quality 1-10"""
    judge_prompt = f"""
Evaluate this research report on a scale of 1-10.
Criteria:
- Accuracy: Information appears correct
- Completeness: Covers topic thoroughly
- Clarity: Well-organized and readable
- Sources: Includes credible citations
Report:
{report}
Respond with just a number 1-10.
"""
    score = await call_llm(judge_prompt, model="gpt-4-turbo", temperature=0)
    # float() raises ValueError if the judge replies with anything but a bare number
    return float(score.strip())

Reliability: LLM-as-judge agrees with humans 85-90% of the time.
Cost: Doubles test cost (2 LLM calls instead of 1).
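Judges do not always follow the "just a number" instruction; replies such as "8/10" or "Score: 8" would make a bare `float()` conversion fail. A more tolerant parser could look like the sketch below. `parse_judge_score` is a hypothetical helper, and taking the first number in the reply is an assumption that suits short replies but not long free-text answers.

```python
import re


def parse_judge_score(raw: str, low: float = 1.0, high: float = 10.0) -> float:
    """Extract the first number from a judge reply and check it is in range."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"No numeric score found in judge reply: {raw!r}")
    score = float(match.group())
    if not low <= score <= high:
        raise ValueError(f"Score {score} is outside the expected range {low}-{high}")
    return score
```

Raising on an unparseable or out-of-range reply, rather than defaulting to some score, keeps a misbehaving judge from silently passing or failing a test.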
CI/CD Integration
Continuous Integration pipeline:
# .github/workflows/test.yml
# Add Python setup and dependency installation steps as your project requires.
name: Agent Tests
on:
  push:
  pull_request:
  schedule:
    - cron: "0 2 * * *"  # Nightly trigger for E2E tests (example time, UTC)
jobs:
  unit-tests:
    # Run on every commit
    # Time: 10-30 seconds
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: pytest tests/unit -v
  integration-tests:
    # Run on pull requests
    # Time: 5-10 minutes
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    needs: unit-tests  # Only if unit tests pass
    steps:
      - uses: actions/checkout@v4
      - name: Setup staging environment
        run: ./scripts/setup-staging.sh
      - name: Run integration tests
        run: pytest tests/integration -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  e2e-tests:
    # Run nightly (scheduled workflows run on the default branch)
    # Time: 30-60 minutes
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run E2E tests
        run: pytest tests/e2e -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Test execution frequency:
- Unit: Every commit (seconds)
- Integration: Every PR (minutes)
- E2E: Nightly or pre-release (hours)
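The `integration`, `e2e`, and `slow` markers used in the examples are custom, and pytest warns about unknown markers unless they are registered. A minimal sketch of a `conftest.py` that registers them follows; the marker descriptions are illustrative wording, not taken from the article.

```python
# conftest.py (sketch)
MARKERS = {
    "integration": "talks to staging services or a real LLM",
    "e2e": "runs a complete agent workflow",
    "slow": "takes minutes rather than seconds",
}


def pytest_configure(config):
    """pytest hook: register custom markers so they do not trigger warnings."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, a fast local run can deselect the expensive levels with `pytest -m "not integration and not e2e"`.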
Test Data Management
Golden Datasets
Create fixed test datasets for consistent evaluation.
# tests/data/golden_dataset.json
[
  {
    "id": "test_001",
    "input": "Analyze Q3 revenue trends",
    "expected_contains": ["revenue", "Q3", "trend"],
    "expected_min_length": 200,
    "expected_quality_score": 7
  },
  {
    "id": "test_002",
    "input": "Summarize customer feedback from last month",
    "expected_contains": ["customer", "feedback", "summary"],
    "expected_min_length": 150,
    "expected_quality_score": 7
  }
]

# test_agent_golden_dataset.py
@pytest.mark.parametrize("test_case", load_golden_dataset())
@pytest.mark.asyncio
async def test_agent_on_golden_dataset(test_case):
    """Test agent on curated golden dataset"""
    agent = AnalysisAgent()
    result = await agent.analyze(test_case["input"])
    # Verify expected keywords present
    for keyword in test_case["expected_contains"]:
        assert keyword.lower() in result.lower()
    # Verify minimum length
    assert len(result) >= test_case["expected_min_length"]
    # Verify quality
    quality = await evaluate_quality(result)
    assert quality >= test_case["expected_quality_score"]

Benefits:
- Consistent benchmarking
- Catch regressions (new version performs worse)
- Track improvement over time
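The golden dataset test relies on a `load_golden_dataset()` helper that the article does not show. One possible shape for it is sketched below, assuming the JSON layout above; the default path and the validation rules (required keys, unique ids) are assumptions for illustration.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {
    "id",
    "input",
    "expected_contains",
    "expected_min_length",
    "expected_quality_score",
}


def load_golden_dataset(path="tests/data/golden_dataset.json"):
    """Load golden test cases and fail early on malformed entries."""
    cases = json.loads(Path(path).read_text(encoding="utf-8"))
    seen_ids = set()
    for case in cases:
        missing = REQUIRED_KEYS - set(case)
        if missing:
            raise ValueError(
                f"Case {case.get('id', '?')} is missing keys: {sorted(missing)}"
            )
        if case["id"] in seen_ids:
            raise ValueError(f"Duplicate test case id: {case['id']}")
        seen_ids.add(case["id"])
    return cases
```

Passing `ids=lambda case: case["id"]` to `pytest.mark.parametrize` makes failures show up under names like `test_001` instead of positional indices.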
Regression Testing
After each change, re-run golden dataset:
def test_no_regression():
    """Ensure new version performs at least as well as previous version"""
    current_scores = run_golden_dataset(current_agent)
    previous_scores = load_previous_scores("v1.2_scores.json")
    avg_current = sum(current_scores) / len(current_scores)
    avg_previous = sum(previous_scores) / len(previous_scores)
    # Allow 5% degradation tolerance
    assert avg_current >= avg_previous * 0.95, "Performance regression detected"

Testing Best Practices
1. Test pyramid ratio: 70% unit, 25% integration, 5% E2E.
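2. Mock by default: Mock LLM calls in unit tests and in any integration test that does not need to exercise the model. Reserve real LLM calls for the integration and E2E tests that depend on actual model behavior.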
3. Test failure modes:
def test_agent_handles_api_timeout():
    """Verify agent handles API timeout gracefully"""
    agent = Agent()
    with patch('agent.call_llm', side_effect=TimeoutError):
        result = agent.execute("test")
    # Should return error message, not crash
    assert "error" in result.lower()
    assert "timeout" in result.lower()

4. Test edge cases:
- Empty input
- Very long input (exceeds context window)
- Invalid JSON responses from LLM
- External API down
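For the invalid-JSON case, one approach is to isolate parsing in a small function with an explicit fallback, so the edge cases can be unit tested without any LLM call. This is a sketch under assumptions: `parse_classification` and the `"unknown"` fallback category are hypothetical, and the valid categories are the three named in the classification prompt earlier.

```python
import json

VALID_CATEGORIES = {"billing", "technical", "account"}


def parse_classification(raw) -> dict:
    """Parse an LLM classification reply, falling back to 'unknown' on bad output."""
    fallback = {"category": "unknown", "confidence": 0.0}
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Covers empty strings, prose replies, and non-string input such as None.
        return fallback
    category = parsed.get("category") if isinstance(parsed, dict) else None
    if not isinstance(category, str) or category not in VALID_CATEGORIES:
        return fallback
    return parsed


def test_parse_classification_edge_cases():
    ok = parse_classification('{"category": "billing", "confidence": 0.9}')
    assert ok["category"] == "billing"
    # Empty reply
    assert parse_classification("")["category"] == "unknown"
    # Prose instead of JSON
    assert parse_classification("Sure! It is billing.")["category"] == "unknown"
    # Valid JSON, unexpected category
    assert parse_classification('{"category": "weather"}')["category"] == "unknown"
    # Valid JSON, wrong shape
    assert parse_classification('["billing"]')["category"] == "unknown"
```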
5. Parameterized tests for multiple scenarios:
@pytest.mark.parametrize("ticket,expected_category", [
    ("Payment failed", "billing"),
    ("App crashes on startup", "technical"),
    ("Can't reset password", "account"),
    ("Upgrade to pro plan", "sales")
])
def test_classify_various_tickets(ticket, expected_category):
    result = classify_ticket(ticket)
    assert result["category"] == expected_category

Measuring Test Coverage
# Run tests with coverage report
pytest --cov=agents --cov-report=html
# View coverage (macOS; use xdg-open on Linux)
open htmlcov/index.html

Target coverage:
- Unit tests: >90% code coverage
- Integration tests: >70% workflow coverage
- E2E tests: >50% user journey coverage
Frequently Asked Questions
How do I test non-deterministic LLM outputs?
Approaches:
- Fuzzy matching: Check keywords present, not exact string
- LLM-as-judge: Use another LLM to evaluate quality
- Seed/temperature=0: Reduce output variation (not fully deterministic, and a seed parameter is not always available)
- Statistical testing: Run 10 times, verify >80% pass threshold
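The statistical approach can be wrapped in a small helper that repeats a check and reports the fraction of passing runs. The sketch below is one way to do it; `pass_rate` is a hypothetical name, and `check` stands for any zero-argument callable that returns a boolean or raises `AssertionError` on failure.

```python
from typing import Callable


def pass_rate(check: Callable[[], bool], runs: int = 10) -> float:
    """Run `check` `runs` times and return the fraction of runs that passed."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    passed = 0
    for _ in range(runs):
        try:
            if check():
                passed += 1
        except AssertionError:
            pass  # a failed assertion counts as a failed run
    return passed / runs
```

A test would then assert on the rate, for example `assert pass_rate(check) > 0.8`, instead of requiring every run to succeed. Each run that calls a real LLM adds cost, so this fits the integration and E2E levels better than unit tests.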
Should I test prompts?
Yes. Prompt changes can break agents.
@pytest.mark.asyncio
async def test_prompt_produces_valid_json():
    """Verify prompt reliably produces parseable JSON"""
    for _ in range(10):  # Run 10 times (account for variability)
        # call_llm is async in the earlier examples, so it must be awaited
        response = await call_llm(classification_prompt, temperature=0)
        # Should be valid JSON
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            pytest.fail("Prompt produced invalid JSON")
        assert "category" in parsed

How often should I run E2E tests?
Recommendation:
- Nightly (automated)
- Before every release (manual trigger)
- After major changes (on-demand)
Don't run on every commit (too slow/expensive).
---
Bottom line: Comprehensive testing requires unit (fast, many), integration (medium, some), E2E (slow, few) tests. Mock LLM calls in 95% of tests for speed. Use LLM-as-judge for E2E quality evaluation. Run unit tests on every commit, integration on PRs, E2E nightly. Teams with systematic testing deploy 2.5× more frequently with 60% fewer bugs.
Next: Read our Agent Evaluation guide for performance measurement strategies.