PMtheBuilder
· 2/5/2026 · 5 min read

# How to Evaluate AI Agents: The Missing Guide for Product Engineers

Everyone's shipping AI agents. Almost nobody is evaluating them properly.

You've seen the pattern: a team builds a customer support agent, demos it to leadership, gets applause, ships it to production — and three weeks later, users are complaining about hallucinated refund policies and conversations that go in circles.

The gap isn't in *building* agents. It's in *evaluating* them. This guide is the missing manual. Whether you're a product engineer, an AI PM, or someone who just inherited an agent that's already in production, this is how you build an evaluation system that actually works.

## Why Agent Evals Are Different from Model Evals

If you've worked with LLMs, you've probably seen benchmarks like MMLU, HELM, or Chatbot Arena. These are **model-level evaluations** — they tell you whether GPT-4o is generally smarter than Claude 3.5 Sonnet at reasoning tasks.

Agent evals are fundamentally different. Here's why:

| Dimension | Model Evals | Agent Evals |
|-----------|-------------|-------------|
| What you're testing | Raw capability | End-to-end behavior |
| Input/Output | Prompt → Completion | Goal → Multi-step outcome |
| Determinism | Mostly deterministic | Highly stochastic |
| Scope | Single turn | Multi-turn, tool use, memory |
| Failure modes | Wrong answer | Wrong action, loops, hallucinated tool calls |

An agent might use the right model but still fail because of bad tool orchestration, poor retrieval, or broken memory management. Model benchmarks won't catch any of that.
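The table's first two rows show up directly in code: a model eval compares one completion against a reference, while an agent eval has to score a whole trajectory, including tool order and step count. A minimal sketch (`ModelResult`, `AgentTrace`, and the 0.5/0.5 weights are hypothetical, just to make the shape difference concrete):

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    completion: str          # a single-turn output

@dataclass
class AgentTrace:
    tools_called: list[str]  # ordered tool invocations
    final_response: str
    steps: int

def score_model(result: ModelResult, reference: str) -> float:
    """Model eval: compare one completion to a reference answer."""
    return 1.0 if result.completion.strip() == reference.strip() else 0.0

def score_agent(trace: AgentTrace, expected_tools: list[str], max_steps: int) -> float:
    """Agent eval: score the trajectory, not just the final text."""
    tools_ok = trace.tools_called == expected_tools  # right tools, right order
    within_budget = trace.steps <= max_steps         # no loops or wandering
    return (0.5 if tools_ok else 0.0) + (0.5 if within_budget else 0.0)
```

A real agent scorer adds many more dimensions (the full rubric below), but even this toy version can fail a run whose final answer looks fine while the trajectory was wrong.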
## The Three Layers of Agent Evaluation

Think of agent evals as a pyramid:

### Layer 1: Component Evals (Unit Tests for AI)

Test each piece in isolation:

- **LLM quality:** Is the base model producing good completions for your prompts?
- **Retrieval quality:** Is your RAG pipeline returning relevant documents?
- **Tool accuracy:** When the agent calls a function, does it pass correct parameters?

### Layer 2: Trajectory Evals (Integration Tests for AI)

Test the agent's decision-making path:

- Did the agent take the right *sequence* of steps?
- Did it use the right tools in the right order?
- Did it know when to ask for clarification vs. when to act?

### Layer 3: Outcome Evals (End-to-End Tests for AI)

Test whether the agent achieved the goal:

- Did the customer's issue get resolved?
- Was the final output correct and complete?
- How long did it take? How many steps?

Most teams only do Layer 3 (if they evaluate at all). The magic is in combining all three.

## A Practical Example: Evaluating "SupportBot"

Let's make this concrete. You're the PM for **SupportBot**, a customer support AI agent at a B2B SaaS company. SupportBot handles:

- Account questions ("What plan am I on?")
- Billing issues ("Why was I charged twice?")
- Feature requests ("Can you add dark mode?")
- Bug reports ("The export button is broken")

Here's how to build a real eval system for it.

### Step 1: Build Your Eval Dataset

You need test cases. Not 5. Not 50. **At minimum 200**, spread across your agent's capabilities.
```python
# eval_dataset.py

eval_cases = [
    {
        "id": "billing_001",
        "category": "billing",
        "input": "I was charged $99 but I'm on the free plan",
        "expected_tools": ["lookup_account", "check_billing_history"],
        "expected_behavior": "Verify account status, check for billing discrepancy, escalate if confirmed",
        "expected_outcome": "Agent identifies billing error and initiates refund OR correctly explains charge",
        "golden_response_keywords": ["billing", "account", "refund OR charge explanation"],
        "difficulty": "medium"
    },
    {
        "id": "account_001",
        "category": "account",
        "input": "What plan am I on and when does it renew?",
        "expected_tools": ["lookup_account"],
        "expected_behavior": "Look up account, return plan name and renewal date",
        "expected_outcome": "Correct plan name and exact renewal date",
        "golden_response_keywords": ["plan", "renewal", "date"],
        "difficulty": "easy"
    },
    # ... 198 more cases
]
```

**Pro tip:** Seed your dataset from real conversations. Pull 500 actual support tickets, categorize them, and turn the best examples into eval cases. This is 10x more valuable than synthetic data.

### Step 2: Define Your Scoring Rubric

You need metrics that actually mean something. Here's the rubric I recommend:

```python
# scoring.py
from dataclasses import dataclass
from enum import Enum

class Score(Enum):
    FAIL = 0
    PARTIAL = 1
    PASS = 2

@dataclass
class EvalResult:
    case_id: str

    # Layer 1: Component scores
    retrieval_relevance: float      # 0-1, were the right docs fetched?
    tool_selection_accuracy: Score  # Did it pick the right tools?

    # Layer 2: Trajectory scores
    step_efficiency: float          # optimal_steps / actual_steps
    no_hallucinated_actions: bool   # Did it call tools that don't exist?
    appropriate_escalation: bool    # Did it escalate when it should have?

    # Layer 3: Outcome scores
    task_completed: bool
    response_quality: float         # 0-1, LLM-as-judge score
    factual_accuracy: Score         # Did it state correct facts?
    tone_appropriate: bool

    @property
    def composite_score(self) -> float:
        weights = {
            'factual_accuracy': 0.25,
            'task_completed': 0.25,
            'response_quality': 0.20,
            'tool_selection_accuracy': 0.15,
            'step_efficiency': 0.10,
            'tone_appropriate': 0.05,
        }
        raw = (
            (self.factual_accuracy.value / 2) * weights['factual_accuracy']
            + float(self.task_completed) * weights['task_completed']
            + self.response_quality * weights['response_quality']
            + (self.tool_selection_accuracy.value / 2) * weights['tool_selection_accuracy']
            + self.step_efficiency * weights['step_efficiency']
            + float(self.tone_appropriate) * weights['tone_appropriate']
        )
        return round(raw, 3)
```

### Step 3: Implement LLM-as-Judge

For subjective metrics (response quality, tone), use another LLM as a judge. This is the most scalable approach, but it needs calibration.

```python
# llm_judge.py
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI customer support agent's response.

## Context
Customer query: {query}
Agent response: {response}
Expected behavior: {expected}

## Evaluation Criteria
Rate each dimension from 0.0 to 1.0:

1. **Helpfulness**: Did the response address the customer's actual need?
2. **Accuracy**: Are all stated facts correct? (0.0 if any hallucination detected)
3. **Completeness**: Did it cover everything needed, without over-explaining?
4. **Tone**: Professional, empathetic, appropriate for the situation?
5. **Actionability**: Does the customer know exactly what to do next?

Return JSON:
{{"helpfulness": 0.0-1.0, "accuracy": 0.0-1.0, "completeness": 0.0-1.0, "tone": 0.0-1.0, "actionability": 0.0-1.0, "reasoning": "brief explanation"}}
"""

def judge_response(query: str, response: str, expected: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                query=query, response=response, expected=expected
            )
        }],
        temperature=0.1  # Low temp for consistency
    )
    return json.loads(result.choices[0].message.content)
```

**Critical:** Calibrate your judge. Run it against 50 cases you've manually scored, and check the correlation. If the LLM judge disagrees with humans more than 20% of the time, your judge prompt needs work.

### Step 4: Build the Eval Harness

Now wire it all together:

```python
# eval_harness.py
import asyncio
from datetime import datetime
from typing import List, Optional

from your_agent import SupportBot       # your actual agent
from scoring import EvalResult, Score   # rubric from Step 2
from llm_judge import judge_response    # judge from Step 3

# score_tool_selection, score_retrieval, check_escalation, and
# aggregate_by_category are domain-specific helpers you define.

async def run_eval_suite(
    agent: SupportBot,
    cases: List[dict],
    run_id: Optional[str] = None
) -> dict:
    run_id = run_id or f"eval_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    results = []

    for case in cases:
        # Run the agent
        trace = await agent.run(
            message=case["input"],
            trace_enabled=True  # Capture tool calls, intermediate steps
        )

        # Score components
        tool_accuracy = score_tool_selection(
            actual_tools=trace.tools_called,
            expected_tools=case["expected_tools"]
        )

        # LLM judge for quality
        judge_scores = judge_response(
            query=case["input"],
            response=trace.final_response,
            expected=case["expected_outcome"]
        )

        # Check for hallucinated tool calls
        valid_tools = agent.get_available_tools()
        hallucinated = any(t not in valid_tools for t in trace.tools_called)

        result = EvalResult(
            case_id=case["id"],
            retrieval_relevance=score_retrieval(trace),
            tool_selection_accuracy=tool_accuracy,
            step_efficiency=len(case["expected_tools"]) / max(len(trace.tools_called), 1),
            no_hallucinated_actions=not hallucinated,
            appropriate_escalation=check_escalation(trace, case),
            task_completed=judge_scores["completeness"] > 0.7,
            response_quality=judge_scores["helpfulness"],
            factual_accuracy=Score.PASS if judge_scores["accuracy"] > 0.9 else Score.FAIL,
            tone_appropriate=judge_scores["tone"] > 0.7,
        )
        results.append(result)

    # Aggregate
    avg_score = sum(r.composite_score for r in results) / len(results)
    pass_rate = sum(1 for r in results if r.composite_score > 0.7) / len(results)

    return {
        "run_id": run_id,
        "total_cases": len(results),
        "avg_composite_score": round(avg_score, 3),
        "pass_rate": f"{pass_rate:.1%}",
        "by_category": aggregate_by_category(results, cases),
        "worst_cases": sorted(results, key=lambda r: r.composite_score)[:10],
        "results": results,
    }
```

### Step 5: Run It in CI/CD

The eval is only useful if it runs automatically. Here's a basic GitHub Actions setup:

```yaml
# .github/workflows/agent-eval.yml
name: Agent Eval Suite

on:
  pull_request:
    paths:
      - 'agent/**'
      - 'prompts/**'
      - 'tools/**'
  schedule:
    - cron: '0 6 * * *'  # Daily at 6am UTC

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - name: Run eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python -m evals.run --suite full --output results.json
      - name: Check regression
        run: |
          python -m evals.check_regression \
            --current results.json \
            --baseline evals/baseline.json \
            --max-regression 0.05
      - name: Post results to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./results.json');
            const body = `## 🤖 Agent Eval Results
            | Metric | Score |
            |--------|-------|
            | Composite | ${results.avg_composite_score} |
            | Pass Rate | ${results.pass_rate} |
            | Factual Accuracy | ${results.factual_accuracy_rate} |
            | Regression | ${results.regression_detected ? '⚠️ YES' : '✅ None'} |
            `;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
```

## Offline vs. Online Evals

Everything above is **offline evaluation** — you run it against a fixed dataset before deployment. But you also need **online evaluation** — monitoring the agent in production.

### Offline Evals (Pre-deployment)

- ✅ Controlled, reproducible
- ✅ Can test edge cases and adversarial inputs
- ✅ Blocks bad changes from shipping
- ❌ Doesn't capture real user behavior
- ❌ Dataset may not reflect production distribution

### Online Evals (Post-deployment)

- ✅ Real user interactions
- ✅ Catches distribution shift and novel inputs
- ✅ Measures actual business outcomes
- ❌ By definition, users see failures
- ❌ Harder to attribute regressions

**You need both.** The ratio depends on your risk tolerance. Customer support? Heavy offline evals + conservative deploy gates. Internal productivity tool? Lighter offline, heavier online monitoring.
### Online Monitoring Essentials

```typescript
// monitor.ts — lightweight production eval
interface AgentTrace {
  sessionId: string;
  query: string;
  response: string;
  toolCalls: ToolCall[];
  escalatedToHuman: boolean;  // detectAnomalies() reads this off the trace
  latencyMs: number;
  tokenCount: number;
  timestamp: Date;
}

interface QualitySignals {
  // Implicit signals (no user effort)
  conversationLength: number;  // Long = possibly struggling
  toolCallFailures: number;    // Failed API calls
  selfCorrections: number;     // Agent contradicted itself
  escalatedToHuman: boolean;   // Had to bail out

  // Explicit signals (user feedback)
  thumbsUp: boolean | null;
  csatScore: number | null;    // 1-5
}

// ToolCall, Alert, filterByWindow, detectHallucination, and percentile
// are your own types and helpers.
function detectAnomalies(traces: AgentTrace[], window: string = '1h'): Alert[] {
  const alerts: Alert[] = [];
  const recent = filterByWindow(traces, window);

  // Hallucination proxy: response references tools/data the agent doesn't have
  const hallRate = recent.filter(t => detectHallucination(t)).length / recent.length;
  if (hallRate > 0.05) {
    alerts.push({ severity: 'high', type: 'hallucination_spike', rate: hallRate });
  }

  // Latency regression
  const p95Latency = percentile(recent.map(t => t.latencyMs), 95);
  if (p95Latency > 10000) {
    alerts.push({ severity: 'medium', type: 'latency_regression', p95: p95Latency });
  }

  // Escalation rate spike
  const escalationRate = recent.filter(t => t.escalatedToHuman).length / recent.length;
  if (escalationRate > 0.3) {
    alerts.push({ severity: 'high', type: 'escalation_spike', rate: escalationRate });
  }

  return alerts;
}
```

## Hallucination Monitoring

Hallucinations are the #1 risk in production agents. Three practical approaches:

### 1. Factual Grounding Check

After the agent responds, run a verification pass: does every factual claim in the response trace back to a retrieved document or tool output?

### 2. Consistency Check

Ask the same question 3 times with slight paraphrasing. If the agent gives materially different answers, something's wrong.

### 3. Entailment Scoring

Use an NLI (Natural Language Inference) model to check if the agent's response is *entailed by* its retrieved context. If the response says things the context doesn't support, flag it.

```python
# hallucination_check.py
from openai import OpenAI

client = OpenAI()

def check_grounding(response: str, sources: list[str]) -> float:
    """Score 0-1 for how well the response is grounded in sources."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Given these source documents:

{chr(10).join(sources)}

And this agent response:

{response}

Score from 0.0 to 1.0: what fraction of claims in the response are directly
supported by the sources? Return just the number."""
        }],
        temperature=0
    )
    return float(result.choices[0].message.content.strip())
```

## Model Drift Detection

Your agent will degrade over time, even if you change nothing. Why?

- **Model provider updates:** OpenAI/Anthropic quietly update models
- **Data drift:** User questions shift (seasonal, product changes)
- **Context drift:** Your RAG documents get stale

### Canary Eval Pattern

Run a fixed set of 50 "canary" test cases every day. Track scores over time. If the 7-day moving average drops by more than 5%, trigger an alert.
```python
# canary.py
import json
from datetime import date

# run_single_eval, load_history, and send_alert are your own helpers.

def run_canary(canary_cases: list, history_file: str = "canary_history.jsonl"):
    today_scores = []
    for case in canary_cases:
        result = run_single_eval(case)
        today_scores.append(result.composite_score)

    avg = sum(today_scores) / len(today_scores)

    # Append to history
    with open(history_file, "a") as f:
        f.write(json.dumps({"date": str(date.today()), "avg_score": avg}) + "\n")

    # Check 7-day trend against the preceding baseline window
    history = load_history(history_file)
    if len(history) >= 7:
        recent_avg = sum(h["avg_score"] for h in history[-7:]) / 7
        baseline_window = history[-30:-7]
        baseline_avg = sum(h["avg_score"] for h in baseline_window) / max(len(baseline_window), 1)
        if baseline_avg - recent_avg > 0.05:
            send_alert(f"⚠️ Agent quality regression: {baseline_avg:.3f} → {recent_avg:.3f}")
```

## The Eval Frameworks Landscape

Here's what's out there and when to use each:

| Framework | Best For | Approach |
|-----------|----------|----------|
| **Chatbot Arena (LMSYS)** | Comparing base models | Crowdsourced human preference (Elo ratings) |
| **HELM** (Stanford) | Holistic model assessment | Multi-metric benchmark across scenarios |
| **DeepEval** | Agent & LLM eval in CI/CD | Python framework, LLM-as-judge, 14+ metrics |
| **LangSmith** | LangChain-based agents | Tracing + eval built into the LangChain ecosystem |
| **Langfuse** | Open-source observability | Tracing, scoring, prompt management |
| **Braintrust** | Production LLM apps | Logging, eval, prompt playground |
| **Opik** (Comet) | Open-source LLM eval | Tracing, automated scoring, CI integration |

**My recommendation for most teams:** Start with DeepEval or Langfuse for structure, but build your domain-specific scoring rubric from scratch. No off-the-shelf framework knows that "SupportBot should never promise a refund without checking the billing system first."

## Regression Testing for AI: The Non-Obvious Parts

Traditional regression testing is binary: it works or it doesn't. AI regression testing is probabilistic. Here's how to handle it:

### 1. Statistical Significance

Don't fail a PR because one score dipped by 0.01. Use a paired t-test or bootstrap confidence interval to determine if the regression is statistically significant.

### 2. Category-Level Regression

Your overall score might stay flat while one category tanks. Always break results down by category, difficulty level, and tool dependency.

### 3. The Baseline Problem

What's "good enough"? Set your baseline from a human-evaluated golden set. Have 3 humans rate 100 agent responses, take the average — that's your target. If the agent scores within 0.05 of human quality, it ships.

### 4. Version Pinning

Always pin your eval against a specific model version, tool version, and prompt version. When something regresses, you need to know *which* change caused it.

## Putting It All Together: The Eval Maturity Model

**Level 0 — Vibes:** "It seems to work well in demos." (Most teams are here.)

**Level 1 — Manual Spot Checks:** Someone reviews 20 conversations per week.

**Level 2 — Automated Offline Evals:** Eval suite runs in CI, blocks regressions.

**Level 3 — Online Monitoring:** Production quality signals, drift detection, alerting.

**Level 4 — Continuous Improvement:** Eval results feed back into training data, prompt optimization, and product decisions. The eval system itself gets evaluated.

Most teams should aim for Level 2 within the first month of shipping an agent, and Level 3 within three months. Level 4 is where the best AI teams operate.

## The Bottom Line

Building an AI agent without evals is like launching a product without analytics. You're flying blind, and the first time you notice a problem is when a customer complains.

The good news: you don't need to boil the ocean. Start with 50 test cases, a simple scoring rubric, and a daily canary eval. That alone puts you ahead of 90% of teams shipping agents today.

The teams that master evals will ship better agents, faster, with fewer production fires.
That's not a prediction — it's already happening at the companies that take this seriously.

---

*Building AI products and want more practical guides like this? Subscribe to the PMtheBuilder newsletter for weekly frameworks and templates.*

---

**Related reading:**

- Aakash Gupta, ["The One Skill Every AI PM Needs"](https://www.news.aakashg.com/p/ai-evals) — excellent overview of eval types for PMs
- Stanford HELM: [crfm.stanford.edu/helm](https://crfm.stanford.edu/helm/)
- DeepEval docs: [deepeval.com](https://deepeval.com)
- Chatbot Arena: [lmarena.ai](https://lmarena.ai)