Agent Testing Strategies: Unit, Integration, and End-to-End Testing for AI Systems

Comprehensive testing strategies for AI agents: unit tests for components, integration tests for workflows, and E2E tests for full systems, with mocking patterns and CI/CD integration.

Max Beech · Founder · 10 min read
TL;DR

  • Three test levels: Unit (individual components), Integration (component interactions), E2E (full agent workflows).
  • Unit tests: Fast, deterministic, test logic without LLM calls. Mock LLM responses.
  • Integration tests: Test agent + external services (APIs, databases). Use a staging environment.
  • E2E tests: Test complete user workflow. Slow, expensive, but catches real issues.
  • Mocking: Replace LLM calls with fixed responses for fast, cheap tests. 95% of tests should use mocks.
  • LLM-as-judge: For E2E tests, use another LLM to evaluate output quality (can't use exact string matching).
  • CI/CD: Run unit tests on every commit (seconds), integration tests on PR (minutes), E2E tests nightly (hours).
  • Real data: Teams with comprehensive testing deploy 2.5× more frequently with 60% fewer production bugs.

# Agent Testing Strategies

Without testing:

Engineer: "I updated the prompt"
Deploy to production
User: "Agent is broken!"
Engineer: "Oops, didn't test that"

With testing:

Engineer: "I updated the prompt"
Run tests → 12/15 tests fail
Engineer: "Found the issue, fixing..."
Run tests → 15/15 pass
Deploy with confidence

Testing Pyramid for AI Agents

Traditional software testing pyramid:

      E2E (few)
   Integration (some)
  Unit (many)

AI agent testing pyramid (same structure):

      E2E: Full workflow (10 tests, run nightly)
   Integration: Agent + services (50 tests, run on PR)
  Unit: Components (200 tests, run every commit)

Why the pyramid shape: unit tests are fast and cheap, so they run constantly; E2E tests are slow and expensive, so they run sparingly.

"Agent orchestration is where the real value lives. Individual AI capabilities matter less than how well you coordinate them into coherent workflows." - James Park, Founder of AI Infrastructure Labs

Level 1: Unit Tests

What: Test individual components in isolation (parsers, validators, formatters, tool functions).

Goal: Fast, deterministic, no external dependencies.

Example: Test Tool Function

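# tool: search_database.py
from tools import db  # Assumption: `tools.db` is the project's database module

def search_database(query: str, limit: int = 10) -> list:
    """Search database for records matching query"""
    if not query or len(query) < 3:
        raise ValueError("Query must be at least 3 characters")
    
    if limit < 1 or limit > 100:
        raise ValueError("Limit must be between 1 and 100")
    
    # Use a parameterized query so user input is never interpolated into SQL.
    # Assumption: db.execute accepts (sql, params); the "?" placeholder style
    # depends on the driver (some use "%s").
    results = db.execute(
        "SELECT * FROM records WHERE content LIKE ? LIMIT ?",
        (f"%{query}%", limit),
    )
    return results

# test_search_database.py
import pytest
from unittest.mock import patch

from tools.search_database import search_database  # Assumed module path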

def test_search_database_valid_query():
    """Test search with valid query returns results"""
    with patch('tools.db.execute') as mock_db:
        mock_db.return_value = [{"id": 1, "content": "test result"}]
        
        results = search_database("test", limit=10)
        
        assert len(results) == 1
        assert results[0]["content"] == "test result"
        mock_db.assert_called_once()

def test_search_database_short_query():
    """Test search rejects query < 3 chars"""
    with pytest.raises(ValueError, match="at least 3 characters"):
        search_database("ab")

def test_search_database_invalid_limit():
    """Test search rejects invalid limit"""
    with pytest.raises(ValueError, match="between 1 and 100"):
        search_database("test", limit=200)

Run time: <1 second for 100 unit tests.

Mock LLM Responses

Problem: LLM calls are slow, expensive, and non-deterministic.

Solution: Mock responses in unit tests.

# agent.py
import json

class CustomerSupportAgent:
    async def classify_ticket(self, ticket_text):
        """Classify support ticket into category"""
        response = await call_llm(
            f"Classify this ticket: {ticket_text}\nCategories: billing, technical, account",
            model="gpt-3.5-turbo"
        )
        return json.loads(response)

# test_agent.py
import pytest
from unittest.mock import AsyncMock, patch

from agent import CustomerSupportAgent

@pytest.mark.asyncio
async def test_classify_ticket_billing():
    """Test classification of billing ticket"""
    agent = CustomerSupportAgent()
    
    # Mock the async LLM call so awaiting it returns a fixed response
    with patch('agent.call_llm', new_callable=AsyncMock) as mock_llm:
        mock_llm.return_value = '{"category": "billing", "confidence": 0.95}'
        
        result = await agent.classify_ticket("My payment failed")
        
        assert result["category"] == "billing"
        assert result["confidence"] == 0.95
        mock_llm.assert_awaited_once()

@pytest.mark.asyncio
async def test_classify_ticket_technical():
    """Test classification of technical ticket"""
    agent = CustomerSupportAgent()
    
    with patch('agent.call_llm', new_callable=AsyncMock) as mock_llm:
        mock_llm.return_value = '{"category": "technical", "confidence": 0.89}'
        
        result = await agent.classify_ticket("App is crashing")
        
        assert result["category"] == "technical"

Benefits:

  • Fast (no LLM API call)
  • Free (no API costs)
  • Deterministic (same input = same output)

Limitations:

  • Doesn't test actual LLM behavior
  • Might pass even if prompt is broken

Rule: 95% of tests should mock LLMs, 5% use real LLM calls (integration/E2E tests).
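
One way to narrow the "might pass even if prompt is broken" limitation is to assert on the prompt the mock receives, so a test fails when a category is accidentally dropped from the prompt. The following is a minimal sketch, not the agent's actual code: it assumes a simplified stand-in for `classify_ticket` that takes the LLM callable as a parameter, and the `llm` parameter and `CATEGORIES` constant are hypothetical names introduced for illustration.

```python
import asyncio
import json
from unittest.mock import AsyncMock

CATEGORIES = ["billing", "technical", "account"]  # hypothetical constant


async def classify_ticket(ticket_text: str, llm) -> dict:
    """Simplified stand-in for CustomerSupportAgent.classify_ticket."""
    prompt = (
        f"Classify this ticket: {ticket_text}\n"
        f"Categories: {', '.join(CATEGORIES)}"
    )
    response = await llm(prompt)
    return json.loads(response)


def check_prompt_contents() -> dict:
    """Run the classifier against a mock and inspect the prompt it was given."""
    mock_llm = AsyncMock(return_value='{"category": "billing", "confidence": 0.95}')
    result = asyncio.run(classify_ticket("My payment failed", mock_llm))

    # The mock records its arguments, so the prompt itself can be checked.
    mock_llm.assert_awaited_once()
    prompt = mock_llm.await_args.args[0]
    assert "My payment failed" in prompt
    for category in CATEGORIES:
        assert category in prompt
    return result


if __name__ == "__main__":
    print(check_prompt_contents())
```

This still does not exercise real LLM behavior; it only guards the parts of the prompt that the code constructs.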

Level 2: Integration Tests

What: Test agent with real external services (databases, APIs, LLMs) in staging environment.

Goal: Catch integration issues before production.

Example: Test Agent + Database + LLM

@pytest.mark.integration  # Mark as integration test
@pytest.mark.asyncio
async def test_agent_search_integration():
    """Test agent can search database and format results"""
    # Setup: Staging database with test data
    test_db = setup_staging_db()
    try:
        test_db.insert("test_record", {"id": 1, "content": "Integration test data"})
        
        # Create agent connected to staging DB
        agent = SearchAgent(database=test_db)
        
        # Execute agent with REAL LLM call
        result = await agent.execute("Find records about integration test")
        
        # Verify
        assert "integration test data" in result.lower()
        assert len(result) > 50  # Agent formatted response (not just raw data)
    finally:
        # Cleanup runs even when an assertion fails
        test_db.cleanup()

Run time: 5-10 seconds per test (LLM call adds latency).

Cost: $0.001-0.01 per test (LLM API calls).

Test Agent Workflow

@pytest.mark.integration
@pytest.mark.asyncio
async def test_customer_support_workflow():
    """Test full support ticket workflow"""
    agent = CustomerSupportAgent()
    
    # Step 1: Classify
    classification = await agent.classify_ticket("My payment failed but I was charged")
    assert classification["category"] == "billing"
    
    # Step 2: Retrieve context
    context = await agent.get_customer_context(user_id="test_user_123")
    assert "payment_method" in context
    
    # Step 3: Generate response
    response = await agent.generate_response(classification, context)
    
    # Verify response quality (fuzzy matching, not exact)
    assert "refund" in response.lower() or "charge" in response.lower()
    assert len(response) > 100  # Substantial response

When to run: On pull requests, before merging to main.

Level 3: E2E (End-to-End) Tests

What: Test complete user workflow from input to final output, including all agent steps.

Goal: Verify agent works as users experience it.

Example: Multi-Step Research Agent

@pytest.mark.e2e
@pytest.mark.asyncio
@pytest.mark.slow  # Mark slow tests
async def test_research_agent_full_workflow():
    """
    Test research agent:
    1. Receives research query
    2. Searches web
    3. Analyzes sources
    4. Generates report
    """
    agent = ResearchAgent()
    
    # Execute full workflow (takes minutes)
    report = await agent.research("What are the latest developments in quantum computing?")
    
    # Verify report structure
    assert "## Summary" in report
    assert "## Key Findings" in report
    assert "## Sources" in report
    
    # Verify quality with LLM-as-judge
    quality_score = await evaluate_report_quality(report)
    assert quality_score >= 7  # At least 7 on the judge's 1-10 scale

LLM-as-Judge for E2E Tests

Problem: Can't use exact string matching (LLM outputs vary).

Solution: Use another LLM to evaluate output quality.

async def evaluate_report_quality(report: str) -> float:
    """Use GPT-4 to score report quality 1-10"""
    judge_prompt = f"""
    Evaluate this research report on a scale of 1-10.
    
    Criteria:
    - Accuracy: Information appears correct
    - Completeness: Covers topic thoroughly
    - Clarity: Well-organized and readable
    - Sources: Includes credible citations
    
    Report:
    {report}
    
    Respond with just a number 1-10.
    """
    
    score = await call_llm(judge_prompt, model="gpt-4-turbo", temperature=0)
    return float(score.strip())

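Reliability: LLM-as-judge agrees with humans roughly 85-90% of the time, though agreement varies with the task and the judge model.

Cost: Adds one judge call per evaluated output, which doubles test cost for a single-call agent (2 LLM calls instead of 1).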
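
Judge models do not always reply with a bare number, so `float(score.strip())` can raise on replies such as "Score: 8/10". A minimal sketch of a more forgiving parser follows; `parse_judge_score` is a hypothetical helper, not part of the original code, and it assumes the first number in the reply is the score.

```python
import re


def parse_judge_score(raw: str, low: float = 1.0, high: float = 10.0) -> float:
    """Extract the first number from a judge reply and check it is in range."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"No numeric score found in judge reply: {raw!r}")
    score = float(match.group())
    if not low <= score <= high:
        raise ValueError(f"Score {score} outside expected range {low}-{high}")
    return score


if __name__ == "__main__":
    for reply in ["8", "Score: 7.5/10", "I would rate this 9."]:
        print(reply, "->", parse_judge_score(reply))
```

Because it takes the first number it finds, a reply like "Out of 10, I give it 8" would be misread as 10. Keeping the "Respond with just a number" instruction in the judge prompt keeps that case rare, and raising on unparseable replies makes a misbehaving judge fail loudly instead of silently passing.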

CI/CD Integration

Continuous Integration pipeline:

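# .github/workflows/test.yml
name: Agent Tests

on:
  push:
  pull_request:
  schedule:
    - cron: '0 2 * * *'  # Nightly at 02:00 UTC (example time)

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - name: Install dependencies  # Assumes requirements.txt lists pytest and its plugins
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/unit -v
    # Runs on every push, pull request, and nightly run
    # Time: 10-30 seconds
  
  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests  # Only if unit tests pass
    if: github.event_name == 'pull_request'  # Only on pull requests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Setup staging environment
        run: ./scripts/setup-staging.sh
      - name: Run integration tests
        run: pytest tests/integration -v
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    # Time: 5-10 minutes
  
  e2e-tests:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule'  # Only on the nightly cron run
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run E2E tests
        run: pytest tests/e2e -v -m e2e
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    # Time: 30-60 minutes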

Test execution frequency:

  • Unit: Every commit (seconds)
  • Integration: Every PR (minutes)
  • E2E: Nightly or pre-release (tens of minutes to hours)
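
The `integration`, `e2e`, and `slow` markers used in the examples are custom, so pytest warns about them unless they are registered. A minimal sketch of registering them in a `conftest.py` follows; the marker descriptions are illustrative wording, and the file location is an assumption about the project layout.

```python
# conftest.py (sketch)
MARKERS = {
    "integration": "uses real external services in staging",
    "e2e": "runs a complete agent workflow",
    "slow": "takes minutes to finish",
}


def pytest_configure(config):
    """Register custom markers so pytest recognizes them."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, each stage can select its tests by marker expression, for example `pytest -m "not integration and not e2e"` on every commit, `pytest -m integration` on pull requests, and `pytest -m e2e` in the nightly run.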

Test Data Management

Golden Datasets

Create fixed test datasets for consistent evaluation.

# tests/data/golden_dataset.json
[
  {
    "id": "test_001",
    "input": "Analyze Q3 revenue trends",
    "expected_contains": ["revenue", "Q3", "trend"],
    "expected_min_length": 200,
    "expected_quality_score": 7
  },
  {
    "id": "test_002",
    "input": "Summarize customer feedback from last month",
    "expected_contains": ["customer", "feedback", "summary"],
    "expected_min_length": 150,
    "expected_quality_score": 7
  }
]

# test_agent_golden_dataset.py
@pytest.mark.parametrize("test_case", load_golden_dataset())
@pytest.mark.asyncio
async def test_agent_on_golden_dataset(test_case):
    """Test agent on curated golden dataset"""
    agent = AnalysisAgent()
    
    result = await agent.analyze(test_case["input"])
    
    # Verify expected keywords present
    for keyword in test_case["expected_contains"]:
        assert keyword.lower() in result.lower()
    
    # Verify minimum length
    assert len(result) >= test_case["expected_min_length"]
    
    # Verify quality
    quality = await evaluate_quality(result)
    assert quality >= test_case["expected_quality_score"]

Benefits:

  • Consistent benchmarking
  • Catch regressions (new version performs worse)
  • Track improvement over time
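
The golden dataset test calls `load_golden_dataset()` without showing it. A minimal sketch of what such a loader could look like follows, together with a helper that reports which expected keywords are missing. Both function bodies and the default path are assumptions based on the JSON file shown earlier, and `missing_keywords` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_golden_dataset(path: str = "tests/data/golden_dataset.json") -> list:
    """Load golden test cases from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def missing_keywords(result: str, case: dict) -> list:
    """Return expected keywords that do not appear in the result (case-insensitive)."""
    text = result.lower()
    return [kw for kw in case["expected_contains"] if kw.lower() not in text]
```

Asserting `missing_keywords(result, test_case) == []` gives a failure message that names the absent keywords. Passing `ids=lambda case: case["id"]` to `pytest.mark.parametrize` makes each case show up under its dataset id in the test report.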

Regression Testing

After each change, re-run golden dataset:

def test_no_regression():
    """Ensure new version performs at least as well as previous version"""
    current_scores = run_golden_dataset(current_agent)
    previous_scores = load_previous_scores("v1.2_scores.json")
    
    avg_current = sum(current_scores) / len(current_scores)
    avg_previous = sum(previous_scores) / len(previous_scores)
    
    # Allow 5% degradation tolerance
    assert avg_current >= avg_previous * 0.95, "Performance regression detected"
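
The regression test relies on stored baseline scores. The sketch below shows one way to save and load them and to express the 5% tolerance as a reusable check. `save_scores` and `has_regressed` are hypothetical helpers, and storing scores as a flat JSON list is an assumption about the baseline file format.

```python
import json
from pathlib import Path
from statistics import mean


def save_scores(scores: list, path: str) -> None:
    """Write a list of quality scores to a JSON baseline file."""
    Path(path).write_text(json.dumps(scores), encoding="utf-8")


def load_previous_scores(path: str) -> list:
    """Read a list of quality scores from a JSON baseline file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def has_regressed(current: list, previous: list, tolerance: float = 0.05) -> bool:
    """True if the current mean falls more than `tolerance` below the previous mean."""
    if not current or not previous:
        raise ValueError("Both score lists must be non-empty")
    return mean(current) < mean(previous) * (1 - tolerance)
```

Comparing means hides per-case drops: one test case can get much worse while the average stays within tolerance, so it may be worth also comparing scores case by case.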

Testing Best Practices

1. Test pyramid ratio: 70% unit, 25% integration, 5% E2E.

2. Mock by default: Mock LLM calls in unit tests and in most integration tests; use a real LLM only in E2E tests and the few integration tests that need to exercise real model behavior.

3. Test failure modes:

@pytest.mark.asyncio
async def test_agent_handles_api_timeout():
    """Verify agent handles API timeout gracefully"""
    agent = Agent()
    
    with patch('agent.call_llm', side_effect=TimeoutError):
        # execute is async, so it must be awaited to get the result string
        result = await agent.execute("test")
        
        # Should return error message, not crash
        assert "error" in result.lower()
        assert "timeout" in result.lower()

4. Test edge cases:

  • Empty input
  • Very long input (exceeds context window)
  • Invalid JSON responses from LLM
  • External API down
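
The "invalid JSON responses from LLM" case can be unit tested without any LLM call by isolating the parsing step. A minimal sketch follows; `parse_classification`, `VALID_CATEGORIES`, and the `"unknown"` fallback are hypothetical names and behavior chosen for illustration, not the agent's actual design.

```python
import json

VALID_CATEGORIES = ("billing", "technical", "account")


def parse_classification(response: str) -> dict:
    """Parse an LLM classification reply, falling back to 'unknown' on bad output."""
    fallback = {"category": "unknown", "confidence": 0.0}
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(parsed, dict) or parsed.get("category") not in VALID_CATEGORIES:
        return fallback
    return parsed


def test_parse_classification_edge_cases():
    # Empty reply
    assert parse_classification("")["category"] == "unknown"
    # Prose instead of JSON
    assert parse_classification("Sure! The category is billing.")["category"] == "unknown"
    # Valid JSON, but not an object
    assert parse_classification("[1, 2]")["category"] == "unknown"
    # Valid JSON object with an unexpected category
    assert parse_classification('{"category": "weather"}')["category"] == "unknown"
    # Well-formed reply passes through unchanged
    good = parse_classification('{"category": "billing", "confidence": 0.9}')
    assert good == {"category": "billing", "confidence": 0.9}
```

Whether to fall back or raise is a design choice; a fallback keeps the workflow running, while raising surfaces prompt problems sooner.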

5. Parameterized tests for multiple scenarios:

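@pytest.mark.integration  # No mock here, so this calls the real LLM
@pytest.mark.asyncio
@pytest.mark.parametrize("ticket,expected_category", [
    ("Payment failed", "billing"),
    ("App crashes on startup", "technical"),
    ("Can't reset password", "account"),
    ("Upgrade to pro plan", "sales"),  # Requires "sales" in the prompt's category list
])
async def test_classify_various_tickets(ticket, expected_category):
    agent = CustomerSupportAgent()
    result = await agent.classify_ticket(ticket)
    assert result["category"] == expected_category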

Measuring Test Coverage

# Run tests with coverage report
pytest --cov=agents --cov-report=html

# View coverage
open htmlcov/index.html  # macOS; use xdg-open on Linux

Target coverage:

  • Unit tests: >90% code coverage
  • Integration tests: >70% workflow coverage
  • E2E tests: >50% user journey coverage

Frequently Asked Questions

How do I test non-deterministic LLM outputs?

Approaches:

  1. Fuzzy matching: Check keywords present, not exact string
  2. LLM-as-judge: Use another LLM to evaluate quality
  3. Seed/temperature=0: Reduce variability (seeds are not always available, and temperature=0 does not guarantee identical outputs)
  4. Statistical testing: Run 10 times, verify >80% pass threshold
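
The statistical approach can be sketched as a small helper that repeats a check and reports the fraction of passes. `pass_rate` and `make_scripted_check` are hypothetical names; the scripted check stands in for a real agent call so the example runs without an LLM.

```python
import asyncio
from typing import Awaitable, Callable


async def pass_rate(check: Callable[[], Awaitable[bool]], runs: int = 10) -> float:
    """Run an async check `runs` times and return the fraction that passed."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    results = [await check() for _ in range(runs)]
    return sum(results) / runs


def make_scripted_check(outcomes):
    """Hypothetical stand-in for a real agent call: replays fixed outcomes."""
    remaining = iter(outcomes)

    async def check() -> bool:
        return next(remaining)

    return check


if __name__ == "__main__":
    # Nine passes and one failure, mimicking an occasionally flaky output
    check = make_scripted_check([True] * 9 + [False])
    rate = asyncio.run(pass_rate(check, runs=10))
    assert rate > 0.8
    print(f"pass rate: {rate:.0%}")
```

In a real test, `check` would call the agent and apply a fuzzy or LLM-as-judge assertion. Each run is a paid LLM call, so this pattern fits the integration or E2E suites better than the per-commit unit suite.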

Should I test prompts?

Yes. Prompt changes can break agents.

@pytest.mark.integration  # Uses real LLM calls
@pytest.mark.asyncio
async def test_prompt_produces_valid_json():
    """Verify prompt reliably produces parseable JSON"""
    for _ in range(10):  # Run 10 times (account for variability)
        response = await call_llm(classification_prompt, temperature=0)
        
        # Should be valid JSON
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            pytest.fail("Prompt produced invalid JSON")
        assert "category" in parsed

How often should I run E2E tests?

Recommendation:

  • Nightly (automated)
  • Before every release (manual trigger)
  • After major changes (on-demand)

Don't run on every commit (too slow/expensive).

---

Bottom line: Comprehensive testing requires unit (fast, many), integration (medium, some), E2E (slow, few) tests. Mock LLM calls in 95% of tests for speed. Use LLM-as-judge for E2E quality evaluation. Run unit tests on every commit, integration on PRs, E2E nightly. Teams with systematic testing deploy 2.5× more frequently with 60% fewer bugs.

Next: Read our Agent Evaluation guide for performance measurement strategies.
