Evals Are the New PRD

In the Before Times (2022), the artifact of PM work was the PRD. Product Requirements Document. The thing that specified what to build.

You wrote the PRD. Engineering built to spec. QA tested that it matched the spec. Ship.

For AI features, this falls apart.

You can't spec your way to a good AI feature. The AI doesn't read your requirements. It does whatever it does, and your job is to figure out if that's good enough.

The new artifact of AI PM work is the eval.

Not the PRD. The eval.

Why PRDs Don't Work for AI

Traditional PRDs assume determinism. "When user clicks X, show Y." You can spec this precisely because the behavior is predictable.

AI features aren't deterministic. The same input can produce different outputs. No spec can capture every possible response.

Try writing a PRD for a customer support chatbot:

"The AI shall respond to customer inquiries accurately and helpfully."

Okay, but what's accurate? What's helpful? How do you know if it's working?

"The AI shall not hallucinate."

Great, how do you test that? How do you know when it is hallucinating?

"The AI shall handle edge cases gracefully."

What edge cases? All of them?

See the problem? You can describe what you want, but you can't spec it into existence. The AI is going to do what it does, and you need a way to evaluate whether what it does is acceptable.

That's the eval.

What Is an Eval, Actually?

An eval is a systematic way to measure whether your AI feature is good.

It has three components:

1. Test Cases: A set of inputs that you'll run through the AI. These should cover:

Happy path (typical usage)
Edge cases (unusual but valid inputs)
Adversarial cases (inputs designed to break things)
Fairness cases (inputs across different demographics)

2. Evaluation Criteria: How you judge each output. This could be:

Ground truth comparison (is it factually correct?)
Rubric scoring (rate 1-5 on helpfulness, accuracy, tone)
Pass/fail rules (does it contain X? does it avoid Y?)

3. Measurement Method: How you actually do the evaluation:

Automated (using rules, another LLM, or metrics)
Human review (people score the outputs)
User signals (implicit feedback from real usage)

Put together: Run test cases → Apply criteria → Measure results → Decide if it's good enough.

That's an eval.
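To make the loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `EvalCase`, `passes`, `run_eval`, and `fake_model` are made-up names, the criteria are simple pass/fail substring rules, and `fake_model` is a stand-in for whatever actually calls your AI feature.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One test case: an input plus simple pass/fail rules (hypothetical shape)."""
    name: str
    prompt: str
    must_contain: list[str]
    must_avoid: list[str]


def passes(case: EvalCase, output: str) -> bool:
    """Apply criteria: every required phrase present, no forbidden phrase present."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_forbidden = any(s.lower() in text for s in case.must_avoid)
    return has_required and not has_forbidden


def run_eval(cases: list[EvalCase], model: Callable[[str], str], threshold: float) -> dict:
    """Run test cases, apply criteria, measure the pass rate, decide."""
    if not cases:
        raise ValueError("an eval needs at least one test case")
    results = {c.name: passes(c, model(c.prompt)) for c in cases}
    pass_rate = sum(results.values()) / len(results)
    return {"results": results, "pass_rate": pass_rate, "ship": pass_rate >= threshold}


def fake_model(prompt: str) -> str:
    """Stand-in for the real AI feature (assumption: replace with your own call)."""
    if "refund" in prompt.lower():
        return "You can request a refund from the Orders page within 30 days."
    return "I'm not sure. Let me connect you with a human agent."


cases = [
    EvalCase("happy-path", "How do I get a refund?", ["refund"], ["guarantee"]),
    EvalCase("unknown-topic", "Do you support fax?", ["human"], ["yes, we support fax"]),
    # Suppose the real policy has no 30-day window: the model must not invent one.
    EvalCase("no-invented-policy", "What is your refund policy?", ["refund"], ["30 days"]),
]
report = run_eval(cases, fake_model, threshold=0.9)
print(report)
```

Substring rules are the crudest possible criteria. The point is the shape: cases in, a number out, and a threshold that turns the number into a decision.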

The Eval-Driven Development Workflow

Here's how the AI PM workflow changes when evals are central:

Before: PRD-Driven

Write PRD specifying the feature
Engineering builds it
QA tests against spec
Ship and hope

After: Eval-Driven

Write the eval first — define what "good" looks like before building
Engineering builds and tests against the eval
Iterate until eval passes
Ship with eval as ongoing monitor
Expand eval as you learn

The eval isn't a checkpoint at the end. It's the north star from the beginning.

Why Evals First?

Writing the eval first forces clarity.

"We want helpful AI responses" is vague.

An eval forces you to answer:

Helpful to whom? (persona)
In what situations? (test cases)
What does helpful look like? (rubric)
What's the minimum acceptable helpfulness? (threshold)
How do you know when it's not helpful? (failure criteria)

You can't write a good eval without answering these questions. Which means the eval reveals fuzzy thinking before you waste engineering time.

It's the same principle as TDD (test-driven development), but for AI products.

The Eval Suite Structure

Here's how I structure evals for AI features:

Offline Evals (Before Launch)

Golden Dataset

50-100 hand-crafted test cases
Each has: input, expected output (or rubric), pass criteria
Covers full range of inputs you expect
Run these before every change
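One way to store a golden dataset is a JSON Lines file, one test case per line. The sketch below assumes that format and a set of field names (`id`, `input`, `expected_output`, `rubric`, `pass_criteria`) chosen for illustration. Use whatever fits your stack.

```python
import json

# Assumed schema: every case needs these, plus either expected_output or rubric.
REQUIRED_FIELDS = {"id", "input", "pass_criteria"}


def load_golden(lines):
    """Parse JSON Lines into test cases, failing loudly on incomplete records."""
    cases = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # allow blank lines between records
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {n}: missing fields {sorted(missing)}")
        if "expected_output" not in record and "rubric" not in record:
            raise ValueError(f"line {n}: needs expected_output or rubric")
        cases.append(record)
    return cases


# Two illustrative records; a real file would hold the full hand-crafted set.
SAMPLE = """\
{"id": "refund-01", "input": "How do I get a refund?", "expected_output": "Points to the refund flow", "pass_criteria": "mentions refund steps"}
{"id": "tone-01", "input": "Your app is garbage.", "rubric": "stays calm, offers help", "pass_criteria": "tone score >= 4"}
"""

golden = load_golden(SAMPLE.splitlines())
print(len(golden), "cases loaded")
```

Validating on load is a small design choice that pays off: a case with no pass criteria is not a test, it's a note.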

Regression Tests

Cases that previously failed
Cases from production incidents
"Never break these again" examples

Adversarial Tests

Prompt injection attempts
Jailbreak attempts
Confusing/ambiguous inputs
Edge cases that typically fail

Fairness Tests

Same queries with different names/genders/locations
Check for bias in responses
Ensure quality is consistent across groups
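A rough sketch of how the fairness checks above might be automated: generate the same query with different names and locations, score each response, then compare average scores per group. The template, the name and location lists, and the scores are all illustrative assumptions. The scores would come from your own rubric or judge.

```python
from itertools import product
from statistics import mean

TEMPLATE = "Hi, my name is {name} and I live in {location}. Can I get a refund?"
NAMES = ["Emily", "Jamal", "Wei", "Priya"]   # illustrative only
LOCATIONS = ["Ohio", "Lagos"]                # illustrative only


def fairness_variants(template, names, locations):
    """Same query, every combination of name and location."""
    return [
        {"name": n, "location": loc, "prompt": template.format(name=n, location=loc)}
        for n, loc in product(names, locations)
    ]


def max_gap_by(variants, scores, key):
    """Largest difference in mean score between groups defined by `key`."""
    groups = {}
    for variant, score in zip(variants, scores):
        groups.setdefault(variant[key], []).append(score)
    means = {group: mean(vals) for group, vals in groups.items()}
    return max(means.values()) - min(means.values()), means


variants = fairness_variants(TEMPLATE, NAMES, LOCATIONS)
# Made-up rubric scores (1-5), one per variant, in the same order as `variants`.
scores = [5, 5, 4, 4, 4, 4, 4, 4]
gap, per_name = max_gap_by(variants, scores, key="name")
print(gap, per_name)
```

A gap on a handful of samples proves nothing by itself, but a gap that persists as you add variants is worth a closer look.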

Online Evals (After Launch)

Automated Monitoring

Latency, error rates, timeouts
LLM-as-judge scoring (sample of outputs)
Keyword detection for safety issues
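Keyword detection can start as a few regular expressions run over every output. This is a deliberately naive sketch: the pattern names and patterns are made up for illustration, and real safety monitoring needs far more than keywords.

```python
import re

# Hypothetical patterns; build your own list from real incidents.
SAFETY_PATTERNS = {
    "legal_advice": re.compile(r"\b(sue|lawsuit|legal advice)\b", re.IGNORECASE),
    "pii_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-shaped numbers
}


def flag_output(text):
    """Return the sorted names of all patterns that match the output."""
    return sorted(name for name, pattern in SAFETY_PATTERNS.items() if pattern.search(text))


print(flag_output("You should sue them. Your SSN 123-45-6789 is on file."))
print(flag_output("Your issue has been resolved."))
```

The `\b` word boundaries matter: without them, "issue" would trip the "sue" pattern.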

User Signal Monitoring

Thumbs up/down ratios
Edit rates (did user modify the output?)
Retry rates (did user ask again?)
Completion rates (did user use the output?)
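These signals reduce to a few ratios over an event log. The sketch below assumes a hypothetical log where each event is a dict with a `type` field. The event names are invented here, so map them to whatever your analytics pipeline actually emits.

```python
from collections import Counter


def signal_rates(events):
    """Turn raw events into the four ratios above (event names are assumptions)."""
    counts = Counter(e["type"] for e in events)
    responses = counts["response_shown"]
    if responses == 0:
        return {}
    votes = counts["thumbs_up"] + counts["thumbs_down"]
    return {
        # Share of explicit votes that were positive; None if nobody voted.
        "thumbs_up_ratio": counts["thumbs_up"] / votes if votes else None,
        "edit_rate": counts["edited"] / responses,
        "retry_rate": counts["retried"] / responses,
        "completion_rate": counts["used_output"] / responses,
    }


# Made-up sample: 10 responses shown, with assorted follow-up events.
events = (
    [{"type": "response_shown"}] * 10
    + [{"type": "thumbs_up"}] * 3
    + [{"type": "thumbs_down"}] * 1
    + [{"type": "edited"}] * 2
    + [{"type": "retried"}] * 1
    + [{"type": "used_output"}] * 6
)
print(signal_rates(events))
```

Note the different denominators: thumbs ratios are over votes, while the other rates are over responses shown. Mixing them up is an easy way to mislead yourself.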

Human Review

Sample X% of outputs for manual review
Label by quality, correctness, safety
Feed back into training data
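For the sampling step, one option is to hash each output's ID and keep those that fall in a fixed bucket range. The same output is then always in or out of the sample, no matter when or where the check runs. This is a sketch under assumptions: `in_review_sample` is a made-up helper, and the 5% used below is only an example value.

```python
import hashlib


def in_review_sample(output_id: str, percent: float) -> bool:
    """Deterministically select roughly `percent`% of outputs by hashing their ID."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100


# Hypothetical output IDs; 5 is an illustrative sampling percentage.
ids = [f"out-{i}" for i in range(2000)]
picked = [i for i in ids if in_review_sample(i, 5)]
print(len(picked), "of", len(ids), "selected for human review")
```

Compared with random sampling at request time, hashing makes the sample reproducible, which helps when two reviewers need to look at the same outputs.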

How to Write Good Eval Criteria

This is where most PMs struggle. Writing test cases is easy. Writing criteria that actually differentiate good from bad is hard.

Bad criteria:

"Response should be helpful" — too vague
"Response should be correct" — correct how?
"Response should be good" — meaningless

Good criteria:

For a customer support chatbot:

"Response must acknowledge the customer's specific issue (not generic)"
"Response must provide actionable next steps"
"Response must not hallucinate product features that don't exist"
"Response must match our brand tone guide (friendly, professional, not corporate)"
"Response must be under 150 words unless complexity requires more"

See the difference? Good criteria are specific enough that two different evaluators would mostly agree.

The test: Can someone unfamiliar with the project read your criteria and evaluate outputs consistently? If not, the criteria need more specificity.
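You can put a number on "would mostly agree." Have two people grade the same outputs independently, then compute their agreement. The sketch below uses Cohen's kappa, a standard statistic that corrects raw agreement for the agreement you'd expect by chance. The grades are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up grades from two evaluators on the same eight outputs.
rater_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_2 = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail"]
print(round(cohens_kappa(rater_1, rater_2), 2))
```

A kappa near 1 means the criteria are doing their job. A kappa near 0 means your evaluators agree about as often as coin flips would, and the criteria need rewriting.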

LLM-as-Judge: Your Eval Superpower

Here's the unlock: you can use AI to evaluate AI.

Instead of manually reviewing every output, you can have another LLM score outputs against your criteria.

How it works:

SYSTEM: You are an evaluator for a customer support chatbot. Score the following response on these criteria:

1. Acknowledgment (1-5): Did the response acknowledge the customer's specific issue?
2. Actionability (1-5): Did the response provide clear next steps?
3. Accuracy (1-5): Is the information in the response correct?
4. Tone (1-5): Does the response match our friendly, professional tone?

Provide scores and brief justification for each.

USER: 
Customer query: [input]
Chatbot response: [output]

Run this on your test set. You get scores at scale.
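Here is a rough sketch of the plumbing around that prompt. Two assumptions to flag: the prompt is extended to ask the judge for a JSON reply so the scores can be parsed, and `call_judge` is a hypothetical function you would implement with your model provider's client. A stub stands in for it here so the example runs without any API.

```python
import json
from statistics import mean

CRITERIA = ["acknowledgment", "actionability", "accuracy", "tone"]

SYSTEM_PROMPT = (
    "You are an evaluator for a customer support chatbot. "
    "Score the response from 1 to 5 on: " + ", ".join(CRITERIA) + ". "
    'Reply with JSON only, shaped like {"tone": {"score": 4, "why": "reason"}} '
    "with one entry per criterion."
)


def build_user_message(query, response):
    return f"Customer query: {query}\nChatbot response: {response}"


def parse_scores(raw):
    """Extract and validate one 1-5 integer score per criterion."""
    data = json.loads(raw)
    scores = {}
    for name in CRITERIA:
        score = data[name]["score"]
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{name}: invalid score {score!r}")
        scores[name] = score
    return scores


def judge_all(pairs, call_judge):
    """Score every (query, response) pair and average each criterion."""
    per_case = [
        parse_scores(call_judge(SYSTEM_PROMPT, build_user_message(q, r)))
        for q, r in pairs
    ]
    return {name: mean(s[name] for s in per_case) for name in CRITERIA}


def stub_judge(system, user):
    """Hypothetical stand-in: a real judge would send both messages to an LLM."""
    return json.dumps({name: {"score": 4, "why": "stub"} for name in CRITERIA})


pairs = [("Where is my order?", "Your order shipped yesterday."),
         ("Cancel my plan.", "Done. You will not be billed again.")]
print(judge_all(pairs, stub_judge))
```

Validating the judge's reply matters more than it looks. Judges occasionally return malformed or out-of-range scores, and you want those to fail loudly rather than skew an average.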

Caveats:

LLM-as-judge has its own biases (validate against human judgment)
Use a different model than your production model (avoid self-grading)
Calibrate with human evaluation periodically
Works better for some criteria than others (a judge can only score accuracy if it is given the correct facts to check against)
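Validating against human judgment can be as simple as scoring the same outputs both ways and comparing. A minimal sketch, with made-up scores and a made-up `calibration_report` helper:

```python
def calibration_report(human, judge, tolerance=1):
    """Compare judge scores with human scores on the same outputs."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [j - h for h, j in zip(human, judge)]
    n = len(diffs)
    return {
        # Average size of the disagreement, ignoring direction.
        "mean_abs_error": sum(abs(d) for d in diffs) / n,
        # Positive bias means the judge is more lenient than the humans.
        "bias": sum(diffs) / n,
        # Share of outputs where the judge is within `tolerance` points of the human.
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / n,
    }


# Made-up 1-5 scores on the same four outputs.
human_scores = [5, 4, 2, 3]
judge_scores = [5, 5, 4, 3]
print(calibration_report(human_scores, judge_scores))
```

The bias figure is the one to watch. A judge that is consistently generous will make every release look better than it is.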

The Eval-Driven PM Skillset

If you want to be an AI PM, these are the eval skills you need:

1. Test Case Design

Can you generate comprehensive, diverse test cases?
Do you think about edge cases and adversarial inputs?
Can you create fairness-focused test cases?

2. Criteria Writing

Can you articulate what "good" looks like specifically?
Can you create rubrics that enable consistent evaluation?
Can you translate business requirements into eval criteria?

3. Measurement Interpretation

Can you analyze eval results and draw conclusions?
Can you identify patterns in failures?
Can you prioritize which failures matter most?

4. Iteration

Can you use eval results to guide improvement?
Can you update evals as requirements evolve?
Can you balance eval coverage with practical constraints?

These skills are increasingly what separates AI PMs from traditional PMs.

The Minimum Viable Eval

I know this all sounds like a lot of work. Start simple.

Day 1 eval:

Write down 20 test inputs covering main use cases
For each, write what a "good" response looks like
Run your AI on all 20
Manually grade each response
Calculate % passing

That's it. You now have an eval.
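If your manual grades live in a spreadsheet, the last step is a few lines of Python. The sketch assumes a CSV export with `case_id`, `grade`, and `note` columns (column names invented here) and uses four rows instead of 20 to keep it short.

```python
import csv
import io

# Illustrative export of manual grades; a real Day 1 eval would have 20 rows.
GRADES_CSV = """\
case_id,grade,note
refund-01,pass,
billing-02,fail,invented a discount code
login-03,pass,
shipping-04,pass,
"""


def tally(csv_text):
    """Return the pass rate and the failing rows from a grades CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no graded cases found")
    failures = [r for r in rows if r["grade"].strip().lower() != "pass"]
    pass_rate = (len(rows) - len(failures)) / len(rows)
    return pass_rate, failures


pass_rate, failures = tally(GRADES_CSV)
print(f"{pass_rate:.0%} passing")
for row in failures:
    print("FAILED:", row["case_id"], "-", row["note"])
```

Keep the notes column. The percentage tells you whether you have a problem, and the notes tell you what the problem is.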

Week 1 iteration:

Add 10 more edge cases
Add 10 adversarial cases
Create a simple rubric (1-5 scale)
Track scores over time

Month 1:

Expand to 100+ test cases
Implement LLM-as-judge for automation
Add real user feedback signals
Integrate with CI/CD (run evals on every change)
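For the CI/CD step, the eval only needs to produce an exit code: zero to let the change through, non-zero to block it. A minimal sketch, where the threshold, the baseline, and the allowed drop are all numbers you would choose yourself:

```python
def gate(pass_rate, threshold, baseline=None, max_drop=0.02):
    """Return (exit_code, reasons): 0 lets the change through, 1 blocks it."""
    reasons = []
    if pass_rate < threshold:
        reasons.append(f"pass rate {pass_rate:.2f} is below threshold {threshold:.2f}")
    # Optional regression check against the last accepted run.
    if baseline is not None and baseline - pass_rate > max_drop:
        reasons.append(f"pass rate dropped from {baseline:.2f} to {pass_rate:.2f}")
    return (1 if reasons else 0), reasons


# Illustrative numbers only.
code, reasons = gate(pass_rate=0.85, threshold=0.90, baseline=0.90)
for reason in reasons:
    print("BLOCKED:", reason)
# In a CI job you would finish with: sys.exit(code)
```

The regression check catches a case the threshold misses: a change that stays above the bar but quietly makes things worse.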

Start scrappy. Improve over time. But start.

What This Means for PRDs

PRDs aren't dead for AI features. They just change focus.

Traditional PRD focus:

Feature specifications
User flows
Acceptance criteria

AI PRD focus:

Problem definition and user context
Eval criteria and pass thresholds
Failure modes and fallbacks
Safety requirements
Monitoring and iteration plan

The spec becomes less "what the AI should do" and more "how we'll know if the AI is doing well."

Your PRD describes the eval. Engineering builds to pass the eval. The eval is the contract.

Key Takeaways

You can't spec AI into correctness — AI features are evaluated, not specified
Write the eval before building — it forces clarity and becomes your north star
Eval skills are AI PM skills — test case design, criteria writing, measurement interpretation
