PMtheBuilder
· 2/5/2026 · 5 min read

# How to Evaluate AI Agents: The Missing Guide for Product Engineers

Everyone's shipping AI agents. Almost nobody is evaluating them properly.

You've seen the pattern: a team builds a customer support agent, demos it to leadership, gets applause, ships it to production — and three weeks later, users are complaining about hallucinated refund policies and conversations that go in circles.

The gap isn't in *building* agents. It's in *evaluating* them. This guide is the missing manual. Whether you're a product engineer, an AI PM, or someone who just inherited an agent that's already in production, this is how you build an evaluation system that actually works.

## Why Agent Evals Are Different from Model Evals

If you've worked with LLMs, you've probably seen benchmarks like MMLU, HELM, or Chatbot Arena. These are **model-level evaluations** — they tell you whether GPT-4o is generally smarter than Claude 3.5 Sonnet at reasoning tasks.

Agent evals are fundamentally different. Here's why:

| Dimension | Model Evals | Agent Evals |
|-----------|-------------|-------------|
| What you're testing | Raw capability | End-to-end behavior |
| Input/Output | Prompt → Completion | Goal → Multi-step outcome |
| Determinism | Mostly deterministic | Highly stochastic |
| Scope | Single turn | Multi-turn, tool use, memory |
| Failure modes | Wrong answer | Wrong action, loops, hallucinated tool calls |

An agent might use the right model but still fail because of bad tool orchestration, poor retrieval, or broken memory management. Model benchmarks won't catch any of that.
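The table's first two rows show up directly in code: a model eval compares one completion against a reference, while an agent eval has to score a whole trajectory, including tool order and step count. A minimal sketch (`ModelResult`, `AgentTrace`, and the 0.5/0.5 weights are hypothetical, just to make the shape difference concrete):

```python
from dataclasses import dataclass

@dataclass
class ModelResult:
    completion: str          # a single-turn output

@dataclass
class AgentTrace:
    tools_called: list[str]  # ordered tool invocations
    final_response: str
    steps: int

def score_model(result: ModelResult, reference: str) -> float:
    """Model eval: compare one completion to a reference answer."""
    return 1.0 if result.completion.strip() == reference.strip() else 0.0

def score_agent(trace: AgentTrace, expected_tools: list[str], max_steps: int) -> float:
    """Agent eval: score the trajectory, not just the final text."""
    tools_ok = trace.tools_called == expected_tools  # right tools, right order
    within_budget = trace.steps <= max_steps         # no loops or wandering
    return (0.5 if tools_ok else 0.0) + (0.5 if within_budget else 0.0)
```

A real agent scorer adds many more dimensions (the full rubric below), but even this toy version can fail a run whose final answer looks fine while the trajectory was wrong.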
## The Three Layers of Agent Evaluation

Think of agent evals as a pyramid:

### Layer 1: Component Evals (Unit Tests for AI)

Test each piece in isolation:

- **LLM quality:** Is the base model producing good completions for your prompts?
- **Retrieval quality:** Is your RAG pipeline returning relevant documents?
- **Tool accuracy:** When the agent calls a function, does it pass correct parameters?

### Layer 2: Trajectory Evals (Integration Tests for AI)

Test the agent's decision-making path:

- Did the agent take the right *sequence* of steps?
- Did it use the right tools in the right order?
- Did it know when to ask for clarification vs. when to act?

### Layer 3: Outcome Evals (End-to-End Tests for AI)

Test whether the agent achieved the goal:

- Did the customer's issue get resolved?
- Was the final output correct and complete?
- How long did it take? How many steps?

Most teams only do Layer 3 (if they evaluate at all). The magic is in combining all three.

## A Practical Example: Evaluating "SupportBot"

Let's make this concrete. You're the PM for **SupportBot**, a customer support AI agent at a B2B SaaS company. SupportBot handles:

- Account questions ("What plan am I on?")
- Billing issues ("Why was I charged twice?")
- Feature requests ("Can you add dark mode?")
- Bug reports ("The export button is broken")

Here's how to build a real eval system for it.

### Step 1: Build Your Eval Dataset

You need test cases. Not 5. Not 50. **At minimum 200**, spread across your agent's capabilities.
```python
# eval_dataset.py

eval_cases = [
    {
        "id": "billing_001",
        "category": "billing",
        "input": "I was charged $99 but I'm on the free plan",
        "expected_tools": ["lookup_account", "check_billing_history"],
        "expected_behavior": "Verify account status, check for billing discrepancy, escalate if confirmed",
        "expected_outcome": "Agent identifies billing error and initiates refund OR correctly explains charge",
        "golden_response_keywords": ["billing", "account", "refund OR charge explanation"],
        "difficulty": "medium"
    },
    {
        "id": "account_001",
        "category": "account",
        "input": "What plan am I on and when does it renew?",
        "expected_tools": ["lookup_account"],
        "expected_behavior": "Look up account, return plan name and renewal date",
        "expected_outcome": "Correct plan name and exact renewal date",
        "golden_response_keywords": ["plan", "renewal", "date"],
        "difficulty": "easy"
    },
    # ... 198 more cases
]
```

**Pro tip:** Seed your dataset from real conversations. Pull 500 actual support tickets, categorize them, and turn the best examples into eval cases. This is 10x more valuable than synthetic data.

### Step 2: Define Your Scoring Rubric

You need metrics that actually mean something. Here's the rubric I recommend:

```python
# scoring.py
from dataclasses import dataclass
from enum import Enum

class Score(Enum):
    FAIL = 0
    PARTIAL = 1
    PASS = 2

@dataclass
class EvalResult:
    case_id: str

    # Layer 1: Component scores
    retrieval_relevance: float      # 0-1, were the right docs fetched?
    tool_selection_accuracy: Score  # Did it pick the right tools?

    # Layer 2: Trajectory scores
    step_efficiency: float          # optimal_steps / actual_steps
    no_hallucinated_actions: bool   # Did it call tools that don't exist?
    appropriate_escalation: bool    # Did it escalate when it should have?

    # Layer 3: Outcome scores
    task_completed: bool
    response_quality: float         # 0-1, LLM-as-judge score
    factual_accuracy: Score         # Did it state correct facts?
    tone_appropriate: bool

    @property
    def composite_score(self) -> float:
        weights = {
            'factual_accuracy': 0.25,
            'task_completed': 0.25,
            'response_quality': 0.20,
            'tool_selection_accuracy': 0.15,
            'step_efficiency': 0.10,
            'tone_appropriate': 0.05,
        }
        raw = (
            (self.factual_accuracy.value / 2) * weights['factual_accuracy']
            + float(self.task_completed) * weights['task_completed']
            + self.response_quality * weights['response_quality']
            + (self.tool_selection_accuracy.value / 2) * weights['tool_selection_accuracy']
            + self.step_efficiency * weights['step_efficiency']
            + float(self.tone_appropriate) * weights['tone_appropriate']
        )
        return round(raw, 3)
```

### Step 3: Implement LLM-as-Judge

For subjective metrics (response quality, tone), use another LLM as a judge. This is the most scalable approach, but it needs calibration.

```python
# llm_judge.py
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI customer support agent's response.

## Context
Customer query: {query}
Agent response: {response}
Expected behavior: {expected}

## Evaluation Criteria
Rate each dimension from 0.0 to 1.0:

1. **Helpfulness**: Did the response address the customer's actual need?
2. **Accuracy**: Are all stated facts correct? (0.0 if any hallucination detected)
3. **Completeness**: Did it cover everything needed, without over-explaining?
4. **Tone**: Professional, empathetic, appropriate for the situation?
5. **Actionability**: Does the customer know exactly what to do next?

Return JSON:
{{"helpfulness": 0.0-1.0, "accuracy": 0.0-1.0, "completeness": 0.0-1.0, "tone": 0.0-1.0, "actionability": 0.0-1.0, "reasoning": "brief explanation"}}
"""

def judge_response(query: str, response: str, expected: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                query=query, response=response, expected=expected
            )
        }],
        temperature=0.1  # Low temp for consistency
    )
    return json.loads(result.choices[0].message.content)
```

**Critical:** Calibrate your judge. Run it against 50 cases you've manually scored, and check the correlation. If the LLM judge disagrees with humans more than 20% of the time, your judge prompt needs work.

### Step 4: Build the Eval Harness

Now wire it all together:

```python
# eval_harness.py
import asyncio
from datetime import datetime
from typing import List, Optional

from your_agent import SupportBot       # your actual agent
from scoring import EvalResult, Score   # rubric from Step 2
from llm_judge import judge_response    # judge from Step 3

# score_tool_selection, score_retrieval, check_escalation, and
# aggregate_by_category are domain-specific helpers you define.

async def run_eval_suite(
    agent: SupportBot,
    cases: List[dict],
    run_id: Optional[str] = None
) -> dict:
    run_id = run_id or f"eval_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    results = []

    for case in cases:
        # Run the agent
        trace = await agent.run(
            message=case["input"],
            trace_enabled=True  # Capture tool calls, intermediate steps
        )

        # Score components
        tool_accuracy = score_tool_selection(
            actual_tools=trace.tools_called,
            expected_tools=case["expected_tools"]
        )

        # LLM judge for quality
        judge_scores = judge_response(
            query=case["input"],
            response=trace.final_response,
            expected=case["expected_outcome"]
        )

        # Check for hallucinated tool calls
        valid_tools = agent.get_available_tools()
        hallucinated = any(t not in valid_tools for t in trace.tools_called)

        result = EvalResult(
            case_id=case["id"],
            retrieval_relevance=score_retrieval(trace),
            tool_selection_accuracy=tool_accuracy,
            step_efficiency=len(case["expected_tools"]) / max(len(trace.tools_called), 1),
            no_hallucinated_actions=not hallucinated,
            appropriate_escalation=check_escalation(trace, case),
            task_completed=judge_scores["completeness"] > 0.7,
            response_quality=judge_scores["helpfulness"],
            factual_accuracy=Score.PASS if judge_scores["accuracy"] > 0.9 else Score.FAIL,
            tone_appropriate=judge_scores["tone"] > 0.7,
        )
        results.append(result)

    # Aggregate
    avg_score = sum(r.composite_score for r in results) / len(results)
    pass_rate = sum(1 for r in results if r.composite_score > 0.7) / len(results)

    return {
        "run_id": run_id,
        "total_cases": len(results),
        "avg_composite_score": round(avg_score, 3),
        "pass_rate": f"{pass_rate:.1%}",
        "by_category": aggregate_by_category(results, cases),
        "worst_cases": sorted(results, key=lambda r: r.composite_score)[:10],
        "results": results,
    }
```

### Step 5: Run It in CI/CD

The eval is only useful if it runs automatically. Here's a basic GitHub Actions setup:

```yaml
# .github/workflows/agent-eval.yml
name: Agent Eval Suite

on:
  pull_request:
    paths:
      - 'agent/**'
      - 'prompts/**'
      - 'tools/**'
  schedule:
    - cron: '0 6 * * *'  # Daily at 6am UTC

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - name: Run eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python -m evals.run --suite full --output results.json
      - name: Check regression
        run: |
          python -m evals.check_regression \
            --current results.json \
            --baseline evals/baseline.json \
            --max-regression 0.05
      - name: Post results to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./results.json');
            const body = `## 🤖 Agent Eval Results
            | Metric | Score |
            |--------|-------|
            | Composite | ${results.avg_composite_score} |
            | Pass Rate | ${results.pass_rate} |
            | Factual Accuracy | ${results.factual_accuracy_rate} |
            | Regression | ${results.regression_detected ? '⚠️ YES' : '✅ None'} |
            `;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
```

## Offline vs. Online Evals

Everything above is **offline evaluation** — you run it against a fixed dataset before deployment. But you also need **online evaluation** — monitoring the agent in production.

### Offline Evals (Pre-deployment)

- ✅ Controlled, reproducible
- ✅ Can test edge cases and adversarial inputs
- ✅ Blocks bad changes from shipping
- ❌ Doesn't capture real user behavior
- ❌ Dataset may not reflect production distribution

### Online Evals (Post-deployment)

- ✅ Real user interactions
- ✅ Catches distribution shift and novel inputs
- ✅ Measures actual business outcomes
- ❌ By definition, users see failures
- ❌ Harder to attribute regressions

**You need both.** The ratio depends on your risk tolerance. Customer support? Heavy offline evals + conservative deploy gates. Internal productivity tool? Lighter offline, heavier online monitoring.
### Online Monitoring Essentials

```typescript
// monitor.ts — lightweight production eval
interface AgentTrace {
  sessionId: string;
  query: string;
  response: string;
  toolCalls: ToolCall[];
  escalatedToHuman: boolean;  // detectAnomalies() reads this off the trace
  latencyMs: number;
  tokenCount: number;
  timestamp: Date;
}

interface QualitySignals {
  // Implicit signals (no user effort)
  conversationLength: number;  // Long = possibly struggling
  toolCallFailures: number;    // Failed API calls
  selfCorrections: number;     // Agent contradicted itself
  escalatedToHuman: boolean;   // Had to bail out

  // Explicit signals (user feedback)
  thumbsUp: boolean | null;
  csatScore: number | null;    // 1-5
}

// ToolCall, Alert, filterByWindow, detectHallucination, and percentile
// are your own types and helpers.
function detectAnomalies(traces: AgentTrace[], window: string = '1h'): Alert[] {
  const alerts: Alert[] = [];
  const recent = filterByWindow(traces, window);

  // Hallucination proxy: response references tools/data the agent doesn't have
  const hallRate = recent.filter(t => detectHallucination(t)).length / recent.length;
  if (hallRate > 0.05) {
    alerts.push({ severity: 'high', type: 'hallucination_spike', rate: hallRate });
  }

  // Latency regression
  const p95Latency = percentile(recent.map(t => t.latencyMs), 95);
  if (p95Latency > 10000) {
    alerts.push({ severity: 'medium', type: 'latency_regression', p95: p95Latency });
  }

  // Escalation rate spike
  const escalationRate = recent.filter(t => t.escalatedToHuman).length / recent.length;
  if (escalationRate > 0.3) {
    alerts.push({ severity: 'high', type: 'escalation_spike', rate: escalationRate });
  }

  return alerts;
}
```

## Hallucination Monitoring

Hallucinations are the #1 risk in production agents. Three practical approaches:

### 1. Factual Grounding Check

After the agent responds, run a verification pass: does every factual claim in the response trace back to a retrieved document or tool output?

### 2. Consistency Check

Ask the same question 3 times with slight paraphrasing. If the agent gives materially different answers, something's wrong.

### 3. Entailment Scoring

Use an NLI (Natural Language Inference) model to check if the agent's response is *entailed by* its retrieved context. If the response says things the context doesn't support, flag it.

```python
# hallucination_check.py
from openai import OpenAI

client = OpenAI()

def check_grounding(response: str, sources: list[str]) -> float:
    """Score 0-1 for how well the response is grounded in sources."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Given these source documents:

{chr(10).join(sources)}

And this agent response:

{response}

Score from 0.0 to 1.0: what fraction of claims in the response are directly
supported by the sources? Return just the number."""
        }],
        temperature=0
    )
    return float(result.choices[0].message.content.strip())
```

## Model Drift Detection

Your agent will degrade over time, even if you change nothing. Why?

- **Model provider updates:** OpenAI/Anthropic quietly update models
- **Data drift:** User questions shift (seasonal, product changes)
- **Context drift:** Your RAG documents get stale

### Canary Eval Pattern

Run a fixed set of 50 "canary" test cases every day. Track scores over time. If the 7-day moving average drops by more than 5%, trigger an alert.
```python
# canary.py
import json
from datetime import date

# run_single_eval, load_history, and send_alert are your own helpers.

def run_canary(canary_cases: list, history_file: str = "canary_history.jsonl"):
    today_scores = []
    for case in canary_cases:
        result = run_single_eval(case)
        today_scores.append(result.composite_score)

    avg = sum(today_scores) / len(today_scores)

    # Append to history
    with open(history_file, "a") as f:
        f.write(json.dumps({"date": str(date.today()), "avg_score": avg}) + "\n")

    # Check 7-day trend against the preceding baseline window
    history = load_history(history_file)
    if len(history) >= 7:
        recent_avg = sum(h["avg_score"] for h in history[-7:]) / 7
        baseline_window = history[-30:-7]
        baseline_avg = sum(h["avg_score"] for h in baseline_window) / max(len(baseline_window), 1)
        if baseline_avg - recent_avg > 0.05:
            send_alert(f"⚠️ Agent quality regression: {baseline_avg:.3f} → {recent_avg:.3f}")
```

## The Eval Frameworks Landscape

Here's what's out there and when to use each:

| Framework | Best For | Approach |
|-----------|----------|----------|
| **Chatbot Arena (LMSYS)** | Comparing base models | Crowdsourced human preference (Elo ratings) |
| **HELM** (Stanford) | Holistic model assessment | Multi-metric benchmark across scenarios |
| **DeepEval** | Agent & LLM eval in CI/CD | Python framework, LLM-as-judge, 14+ metrics |
| **LangSmith** | LangChain-based agents | Tracing + eval built into the LangChain ecosystem |
| **Langfuse** | Open-source observability | Tracing, scoring, prompt management |
| **Braintrust** | Production LLM apps | Logging, eval, prompt playground |
| **Opik** (Comet) | Open-source LLM eval | Tracing, automated scoring, CI integration |

**My recommendation for most teams:** Start with DeepEval or Langfuse for structure, but build your domain-specific scoring rubric from scratch. No off-the-shelf framework knows that "SupportBot should never promise a refund without checking the billing system first."

## Regression Testing for AI: The Non-Obvious Parts

Traditional regression testing is binary: it works or it doesn't. AI regression testing is probabilistic. Here's how to handle it:

### 1. Statistical Significance

Don't fail a PR because one score dipped by 0.01. Use a paired t-test or bootstrap confidence interval to determine if the regression is statistically significant.

### 2. Category-Level Regression

Your overall score might stay flat while one category tanks. Always break results down by category, difficulty level, and tool dependency.

### 3. The Baseline Problem

What's "good enough"? Set your baseline from a human-evaluated golden set. Have 3 humans rate 100 agent responses, take the average — that's your target. If the agent scores within 0.05 of human quality, it ships.

### 4. Version Pinning

Always pin your eval against a specific model version, tool version, and prompt version. When something regresses, you need to know *which* change caused it.

## Putting It All Together: The Eval Maturity Model

**Level 0 — Vibes:** "It seems to work well in demos." (Most teams are here.)

**Level 1 — Manual Spot Checks:** Someone reviews 20 conversations per week.

**Level 2 — Automated Offline Evals:** Eval suite runs in CI, blocks regressions.

**Level 3 — Online Monitoring:** Production quality signals, drift detection, alerting.

**Level 4 — Continuous Improvement:** Eval results feed back into training data, prompt optimization, and product decisions. The eval system itself gets evaluated.

Most teams should aim for Level 2 within the first month of shipping an agent, and Level 3 within three months. Level 4 is where the best AI teams operate.

## The Bottom Line

Building an AI agent without evals is like launching a product without analytics. You're flying blind, and the first time you notice a problem is when a customer complains.

The good news: you don't need to boil the ocean. Start with 50 test cases, a simple scoring rubric, and a daily canary eval. That alone puts you ahead of 90% of teams shipping agents today.

The teams that master evals will ship better agents, faster, with fewer production fires.
That's not a prediction — it's already happening at the companies that take this seriously.

---

*Building AI products and want more practical guides like this? Subscribe to the PMtheBuilder newsletter for weekly frameworks and templates.*

---

**Related reading:**

- Aakash Gupta, ["The One Skill Every AI PM Needs"](https://www.news.aakashg.com/p/ai-evals) — excellent overview of eval types for PMs
- Stanford HELM: [crfm.stanford.edu/helm](https://crfm.stanford.edu/helm/)
- DeepEval docs: [deepeval.com](https://deepeval.com)
- Chatbot Arena: [lmarena.ai](https://lmarena.ai)