PMtheBuilder logoPMtheBuilder
ยท2/1/2026ยท8 min read

Evals Are The New Prd

Guide

In the Before Times (2022), the artifact of PM work was the PRD. Product Requirements Document. The thing that specified what to build.

You wrote the PRD. Engineering built to spec. QA tested it matched the spec. Ship.

For AI features, this falls apart.

You can't spec your way to a good AI feature. The AI doesn't read your requirements. It does whatever it does, and your job is to figure out if that's good enough.

The new artifact of AI PM work is the eval.

Not the PRD. The eval.


Why PRDs Don't Work for AI

Traditional PRDs assume determinism. "When user clicks X, show Y." You can spec this precisely because the behavior is predictable.

AI features aren't deterministic. Same input, different output. No spec can capture every possible response.

Try writing a PRD for a customer support chatbot:

"The AI shall respond to customer inquiries accurately and helpfully."

Okay, but what's accurate? What's helpful? How do you know if it's working?

"The AI shall not hallucinate."

Great, how do you test that? How do you know when it is hallucinating?

"The AI shall handle edge cases gracefully."

What edge cases? All of them?

See the problem? You can describe what you want, but you can't spec it into existence. The AI is going to do what it does, and you need a way to evaluate whether what it does is acceptable.

That's the eval.


What Is an Eval, Actually?

An eval is a systematic way to measure whether your AI feature is good.

It has three components:

1. Test Cases A set of inputs that you'll run through the AI. These should cover:

  • Happy path (typical usage)
  • Edge cases (unusual but valid inputs)
  • Adversarial cases (inputs designed to break things)
  • Fairness cases (inputs across different demographics)

2. Evaluation Criteria How you judge each output. This could be:

  • Ground truth comparison (is it factually correct?)
  • Rubric scoring (rate 1-5 on helpfulness, accuracy, tone)
  • Pass/fail rules (does it contain X? does it avoid Y?)

3. Measurement Method How you actually do the evaluation:

  • Automated (using rules, another LLM, or metrics)
  • Human review (people score the outputs)
  • User signals (implicit feedback from real usage)

Put together: Run test cases โ†’ Apply criteria โ†’ Measure results โ†’ Decide if it's good enough.

That's an eval.


The Eval-Driven Development Workflow

Here's how AI PM workflow changes when evals are central:

Before: PRD-Driven

  1. Write PRD specifying the feature
  2. Engineering builds it
  3. QA tests against spec
  4. Ship and hope

After: Eval-Driven

  1. Write the eval first โ€” define what "good" looks like before building
  2. Engineering builds + tests against eval
  3. Iterate until eval passes
  4. Ship with eval as ongoing monitor
  5. Expand eval as you learn

The eval isn't a checkpoint at the end. It's the north star from the beginning.


Why Evals First?

Writing the eval first forces clarity.

"We want helpful AI responses" is vague.

An eval forces you to answer:

  • Helpful to whom? (persona)
  • In what situations? (test cases)
  • What does helpful look like? (rubric)
  • What's the minimum acceptable helpfulness? (threshold)
  • How do you know when it's not helpful? (failure criteria)

You can't write a good eval without answering these questions. Which means the eval reveals fuzzy thinking before you waste engineering time.

It's the same principle as TDD (test-driven development), but for AI products.


The Eval Suite Structure

Here's how I structure evals for AI features:

Offline Evals (Before Launch)

Golden Dataset

  • 50-100 hand-crafted test cases
  • Each has: input, expected output (or rubric), pass criteria
  • Covers full range of inputs you expect
  • Run these before every change

Regression Tests

  • Cases that previously failed
  • Cases from production incidents
  • "Never break these again" examples

Adversarial Tests

  • Prompt injection attempts
  • Jailbreak attempts
  • Confusing/ambiguous inputs
  • Edge cases that typically fail

Fairness Tests

  • Same queries with different names/genders/locations
  • Check for bias in responses
  • Ensure quality is consistent across groups

Online Evals (After Launch)

Automated Monitoring

  • Latency, error rates, timeouts
  • LLM-as-judge scoring (sample of outputs)
  • Keyword detection for safety issues

User Signal Monitoring

  • Thumbs up/down ratios
  • Edit rates (did user modify the output?)
  • Retry rates (did user ask again?)
  • Completion rates (did user use the output?)

Human Review

  • Sample X% of outputs for manual review
  • Label by quality, correctness, safety
  • Feed back into training data

How to Write Good Eval Criteria

This is where most PMs struggle. Writing test cases is easy. Writing criteria that actually differentiate good from bad is hard.

Bad criteria:

  • "Response should be helpful" โ€” too vague
  • "Response should be correct" โ€” correct how?
  • "Response should be good" โ€” meaningless

Good criteria:

For a customer support chatbot:

  • "Response must acknowledge the customer's specific issue (not generic)"
  • "Response must provide actionable next steps"
  • "Response must not hallucinate product features that don't exist"
  • "Response must match our brand tone guide (friendly, professional, not corporate)"
  • "Response must be under 150 words unless complexity requires more"

See the difference? Good criteria are specific enough that two different evaluators would mostly agree.

The test: Can someone unfamiliar with the project read your criteria and evaluate outputs consistently? If not, criteria need more specificity.


LLM-as-Judge: Your Eval Superpower

Here's the unlock: you can use AI to evaluate AI.

Instead of manually reviewing every output, you can have another LLM score outputs against your criteria.

How it works:

SYSTEM: You are an evaluator for a customer support chatbot. Score the following response on these criteria:

1. Acknowledgment (1-5): Did the response acknowledge the customer's specific issue?
2. Actionability (1-5): Did the response provide clear next steps?
3. Accuracy (1-5): Is the information in the response correct?
4. Tone (1-5): Does the response match our friendly, professional tone?

Provide scores and brief justification for each.

USER: 
Customer query: [input]
Chatbot response: [output]

Run this on your test set. You get scores at scale.

Caveats:

  • LLM-as-judge has its own biases (validate against human judgment)
  • Use a different model than your production model (avoid self-grading)
  • Calibrate with human evaluation periodically
  • Works better for some criteria than others

The Eval-Driven PM Skillset

If you want to be an AI PM, these are the eval skills you need:

1. Test Case Design

  • Can you generate comprehensive, diverse test cases?
  • Do you think about edge cases and adversarial inputs?
  • Can you create fairness-focused test cases?

2. Criteria Writing

  • Can you articulate what "good" looks like specifically?
  • Can you create rubrics that enable consistent evaluation?
  • Can you translate business requirements into eval criteria?

3. Measurement Interpretation

  • Can you analyze eval results and draw conclusions?
  • Can you identify patterns in failures?
  • Can you prioritize which failures matter most?

4. Iteration

  • Can you use eval results to guide improvement?
  • Can you update evals as requirements evolve?
  • Can you balance eval coverage with practical constraints?

These skills are increasingly what separates AI PMs from traditional PMs.


The Minimum Viable Eval

I know this all sounds like a lot of work. Start simple.

Day 1 eval:

  1. Write down 20 test inputs covering main use cases
  2. For each, write what a "good" response looks like
  3. Run your AI on all 20
  4. Manually grade each response
  5. Calculate % passing

That's it. You now have an eval.

Week 1 iteration:

  • Add 10 more edge cases
  • Add 10 adversarial cases
  • Create a simple rubric (1-5 scale)
  • Track scores over time

Month 1:

  • Expand to 100+ test cases
  • Implement LLM-as-judge for automation
  • Add real user feedback signals
  • Integrate with CI/CD (run evals on every change)

Start scrappy. Improve over time. But start.


What This Means for PRDs

PRDs aren't dead for AI features. They just change focus.

Traditional PRD focus:

  • Feature specifications
  • User flows
  • Acceptance criteria

AI PRD focus:

  • Problem definition and user context
  • Eval criteria and pass thresholds
  • Failure modes and fallbacks
  • Safety requirements
  • Monitoring and iteration plan

The spec becomes less "what the AI should do" and more "how we'll know if the AI is doing well."

Your PRD describes the eval. Engineering builds to pass the eval. The eval is the contract.


Key Takeaways

  1. You can't spec AI into correctness โ€” AI features are evaluated, not specified

  2. Write the eval before building โ€” it forces clarity and becomes your north star

  3. Eval skills are AI PM skills โ€” test case design, criteria writing, measurement interpretation

๐Ÿงช

Free Tool

How strong are your AI PM skills?

8 real production scenarios. LLM-judged across 5 dimensions. Takes ~15 minutes. See exactly where your gaps are.

Take the Free Eval โ†’
๐Ÿ› ๏ธ

PM the Builder

Practical AI product management โ€” backed by PM leaders who build AI products, hire AI PMs, and ship every day. Building what we wish existed when we started.

๐Ÿงช

Benchmark your AI PM skills

8 production scenarios. Free. LLM-judged. See where you stand.

Take the Eval โ†’
๐Ÿ“˜

Go deeper with the full toolkit

Playbooks, interview prep, prompt libraries, and production frameworks โ€” built by the teams who hire AI PMs.

Browse Products โ†’
โšก

Free: 68-page AI PM Prompt Library

Production-ready prompts for evals, architecture reviews, stakeholder comms, and shipping. Enter your email, get the PDF.

Get It Free โ†’

Want more like this?

Get weekly tactics for AI product managers.