Stop Writing PRDs. Write Evals.
I haven't written a traditional PRD in 6 months.
Not because I'm lazy. Not because I think documentation doesn't matter. Because PRDs are built for a world where software does exactly what you tell it to. And I don't build that kind of software anymore.
I run an AI platform at a $7B+ SaaS company. The platform saves customers over 800,000 hours. Every feature I ship has non-deterministic outputs, meaning the same input can produce different results every time. And when your outputs aren't predictable, a document that says "the system shall do X" is basically fiction.
So I replaced PRDs with something better. Let me show you what that looks like.
The Problem with PRDs in an AI World
A traditional PRD is a contract. It says: here's what we're building, here's how it should behave, here's the acceptance criteria. Engineering builds to the spec. QA tests against the spec. Everyone agrees on what "done" looks like.
This works when outputs are deterministic. When you say "clicking this button saves the form," that's testable. Binary. It either saves or it doesn't.
Now try writing a PRD for an AI feature that generates personalized recommendations for business users.
What would the acceptance criteria look like? "Recommendations should be relevant"? That's not testable. "Recommendations should be accurate"? Accurate compared to what? "Recommendations should be helpful"? By whose definition?
I've seen teams try. They write acceptance criteria like:
- The AI should generate relevant recommendations
- Recommendations should be factually accurate
- The tone should match our brand voice
- The output should be appropriate for the user's context
Those aren't acceptance criteria. Those are wishes. They're subjective, unmeasurable, and they mean different things to different people on the team.
| Traditional PRD | Eval Spec |
|---|---|
| Describes intent: "should be relevant" | Defines measurement: "relevance score ≥ 90% across 500 test cases" |
| Acceptance criteria are binary (pass/fail) | Quality is a spectrum with thresholds per dimension |
| Tested manually on 5–50 examples | Scored automatically across 200–500+ cases |
| "Done" = PM approved it | "Done" = eval thresholds met |
| Works for deterministic software | Built for non-deterministic AI outputs |
| Edge cases discovered in production | Edge cases built into the test set upfront |
The result? Engineering builds something. PM looks at a few outputs and says "this seems fine." They ship it. Two weeks later, customer support is flooded with complaints about bizarre recommendations for specific edge cases nobody tested.
I lived this. Early on, before I changed my process, we shipped a feature with a PRD that had what I thought were solid acceptance criteria. We tested maybe 50 outputs manually. Looked great.
In production, it handled common cases beautifully. But for a specific segment of users with unusual data patterns (maybe 8% of our user base) the outputs were genuinely bad. Not dangerous, but clearly wrong. The kind of wrong that makes users lose trust in your product permanently.
That failure taught me the PRD was the problem. Not because it was a bad PRD; it was thorough, well-written, stakeholder-approved. But it was the wrong tool for the job. Like using a ruler to measure temperature.
What Replaced the PRD
I now write three documents for every AI feature instead of one PRD. Together they take about the same time, but they actually work.
1. The Eval Spec
This is the big one. The eval spec is my PRD replacement. It defines quality not as prose descriptions but as measurable criteria with test cases.
An eval spec has:
Quality dimensions with definitions. Not "accuracy" in the abstract, but "Factual accuracy: every claim in the output must be verifiable against the user's actual data. Scoring: binary pass/fail per claim, with an overall score of claims_passed / total_claims."
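As a sketch, that per-claim scoring can be made concrete. Everything here is illustrative: the claim format (a metric name plus a claimed value) and the 1% tolerance mirror the accuracy dimension later in this post, and a real pipeline would need a claim-extraction step in front of it.

```python
def verify_claim(metric: str, claimed: float, source: dict, tol: float = 0.01) -> bool:
    """A claim passes if the metric exists in the source data and
    matches it within tolerance. A metric absent from the source is a
    hallucination and fails outright."""
    actual = source.get(metric)
    if actual is None:
        return False
    return abs(claimed - actual) <= tol * abs(actual)

def accuracy_score(claims: list[tuple[str, float]], source: dict) -> float:
    """Binary pass/fail per claim, aggregated as claims_passed / total_claims."""
    if not claims:
        return 0.0
    passed = sum(verify_claim(m, v, source) for m, v in claims)
    return passed / len(claims)

# Illustrative data: two verifiable claims, one hallucinated metric.
source = {"open_rate": 0.42, "ctr": 0.05}
claims = [("open_rate", 0.42), ("ctr", 0.05), ("revenue", 100.0)]
score = accuracy_score(claims, source)  # 2 of 3 claims pass
```

The useful property is that the aggregate is fully explained by per-claim results, so a failing output points directly at the claim that sank it.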
A test set. Hundreds of input-output pairs covering the full distribution. Common cases, edge cases, adversarial cases, different user segments. Each one has scoring criteria for every quality dimension.
Thresholds. "We ship when accuracy ≥ 95%, relevance ≥ 90%, and safety = 100% across the full test set." These are non-negotiable. They're the quality gate.
Automated scoring where possible. For dimensions that can be machine-scored, the eval spec defines the scoring logic. For dimensions requiring human judgment, it defines the rubric reviewers use.
2. The Model Comparison Doc
Before we commit to a model for a feature, I run candidates against the eval suite and document the results.
This is a simple table: Model A vs. Model B vs. Model C across every quality dimension, plus latency and cost per output. The decision is usually obvious once you see the data.
I've had cases where the "obvious" model choice was wrong. We assumed we needed a frontier model for a specific task. Ran the eval. A model that cost 1/10th as much scored within 2% on quality. That's not a technical decision; it's a product decision with direct P&L impact. The PM needs to make it.
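The comparison itself is simple enough to sketch. All the numbers below are invented; the point is the shape of the decision: filter to models that clear the quality floor, then let cost break the tie.

```python
# Hypothetical eval-suite results per candidate model (invented numbers).
candidates = {
    "model_a": {"quality": 0.94, "latency_s": 2.1, "cost_per_1k": 12.00},
    "model_b": {"quality": 0.92, "latency_s": 0.8, "cost_per_1k": 1.20},
    "model_c": {"quality": 0.85, "latency_s": 0.5, "cost_per_1k": 0.60},
}

def viable(stats: dict, quality_floor: float = 0.90) -> bool:
    """Only models above the quality floor are candidates at all."""
    return stats["quality"] >= quality_floor

# Among viable models, prefer the cheapest -- the "within 2% on quality
# at 1/10th the cost" call described above.
best = min(
    (name for name, s in candidates.items() if viable(s)),
    key=lambda name: candidates[name]["cost_per_1k"],
)
```

Here the cheapest model overall fails the quality floor, so the pick is the mid-priced one: the eval data, not the brand of the model, drives the choice.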
3. The Edge Case Registry
This one evolved organically. It's a living document of every interesting failure case we discover: in evals, in production, from customer reports. Each entry has the input, what went wrong, why it went wrong, and whether we've addressed it.
The edge case registry is where product intuition gets encoded into institutional knowledge. When new team members join, they read it. When we update models, we check every case in it. It's the closest thing to tribal knowledge written down.
A Real Example: Eval Spec vs. PRD
Let me show you the difference concretely. I'll use an anonymized version of a feature we built: an AI system that generates action plans for business users based on their performance data.
What a PRD would say:
Feature: AI Action Plan Generator
User Story: As a business user, I want AI-generated action plans so I can improve my performance metrics.
Acceptance Criteria:
- Action plans should be relevant to the user's specific data
- Recommendations should be actionable and specific
- Tone should be professional and encouraging
- Plans should include 3-5 recommended actions
- Each action should explain the expected impact
That's fine. It communicates intent. But it's not testable at scale.
What my eval spec says (abbreviated):
Feature: AI Action Plan Generator
Eval Version: 2.3
Quality Dimensions:
Data Accuracy (weight: 0.30)
- Every metric referenced in the plan must match the user's actual data within 1% tolerance
- No hallucinated metrics or fabricated trends
- Scoring: automated check against source data, binary per-claim, aggregate as percentage
Relevance (weight: 0.25)
- Each recommended action must logically connect to a specific underperformance in the user's data
- Actions must be feasible given the user's current tier and available features
- Scoring: model-graded on 1-5 scale per action, threshold ≥ 4.0 average
Specificity (weight: 0.20)
- Actions must reference specific features, specific metrics, and specific timeframes
- "Improve your email strategy" = fail. "Increase your welcome series open rate by adding a personalized subject line using your customer's first name" = pass
- Scoring: human eval, rubric-graded
Tone (weight: 0.10)
- Professional, direct, confident. Not condescending, not hedging
- No phrases: "you might want to consider," "it could be beneficial to"
- Yes phrases: "Do this," "Here's your priority," "Focus on"
- Scoring: automated keyword check + model-graded for overall tone
Safety (weight: 0.15)
- No recommendations that could harm the user's business if followed
- No recommendations requiring capabilities the user doesn't have
- No references to competitor products
- Scoring: automated rules + human spot-check, binary, threshold = 100%
Test Set: 500 cases
- 350 common profiles (balanced across segments)
- 100 edge cases (new users, power users, unusual data patterns, sparse data)
- 50 adversarial cases (contradictory data, extreme outliers, empty fields)
Ship Threshold:
- Data Accuracy ≥ 97%
- Relevance ≥ 92%
- Specificity ≥ 88%
- Tone ≥ 90%
- Safety = 100%
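A minimal sketch of the ship gate this spec implies, assuming per-dimension scores have already been aggregated across the full test set (the scores below are illustrative):

```python
# Per-dimension ship thresholds from the spec above.
THRESHOLDS = {
    "data_accuracy": 0.97,
    "relevance": 0.92,
    "specificity": 0.88,
    "tone": 0.90,
    "safety": 1.00,  # non-negotiable
}

def ship_decision(scores: dict) -> tuple[bool, list]:
    """Return (ship?, failing dimensions). Every dimension must meet
    its threshold; a missing score counts as a failure."""
    failing = [d for d, t in THRESHOLDS.items() if scores.get(d, 0.0) < t]
    return (not failing, failing)

# Illustrative run: four dimensions pass, tone misses its 0.90 threshold.
scores = {"data_accuracy": 0.98, "relevance": 0.93,
          "specificity": 0.91, "tone": 0.89, "safety": 1.00}
ok, failing = ship_decision(scores)
```

The gate is deliberately boring: no weighting, no averaging across dimensions, because a strong accuracy score shouldn't be allowed to paper over a tone or safety miss.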
See the difference? The eval spec doesn't leave quality up to interpretation. It defines exactly what good means, how to measure it, and what threshold we need to hit before we ship.
Engineering doesn't build to a vague description of "relevant and actionable." They build until the evals pass. And when someone asks "is this feature ready to ship?" the answer isn't a PM's gut feel โ it's a dashboard showing pass rates across every dimension.
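For instance, the "automated keyword check" half of the Tone dimension can be a few lines, using the banned hedging phrases from the spec above (a real check would pair this with model grading for overall tone):

```python
# Banned hedging phrases, taken from the Tone dimension in the spec.
BANNED = ["you might want to consider", "it could be beneficial to"]

def tone_keyword_check(output: str) -> bool:
    """Fail fast (case-insensitively) if any banned phrase appears."""
    lowered = output.lower()
    return not any(phrase in lowered for phrase in BANNED)
```

A check like this is crude on its own, but it turns one slice of a fuzzy dimension into a deterministic pass/fail that runs on every test case for free.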
The Mini-Template
Here's a stripped-down eval spec template you can steal and use immediately. Adapt it to your product.
Feature Name:
Eval Version:
Owner:
Last Updated:
Quality Dimensions:
For each dimension, define:
- Name and weight (how much does this dimension matter relative to others?)
- Definition (what does this dimension mean, specifically, for this feature?)
- Scoring method (automated / model-graded / human eval?)
- Scoring criteria (rubric, scale, or binary; be explicit)
- Threshold (minimum score to ship)
Test Set:
- Total size: (minimum 200 for any non-trivial feature)
- Common case split: (% covering typical usage)
- Edge case split: (% covering unusual but valid inputs)
- Adversarial split: (% covering intentionally tricky inputs)
- Segment coverage: (list user segments and minimum cases per segment)
Ship Criteria:
- All dimensions must meet threshold: Yes/No
- Regression check against previous version: Yes/No
- Human review sample size before final sign-off: (number)
Edge Cases to Watch: (List specific scenarios you're worried about; this seeds your edge case registry)
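If it helps, the template can live as a data structure rather than prose, so a CI job can sanity-check it. Field names mirror the template above; every value is a placeholder, not a recommendation.

```python
# The mini-template expressed as data (placeholder values throughout).
EVAL_SPEC = {
    "feature_name": "AI Action Plan Generator",
    "eval_version": "2.3",
    "dimensions": [
        {"name": "data_accuracy", "weight": 0.30,
         "scoring": "automated", "threshold": 0.97},
        # ... one entry per dimension
    ],
    "test_set": {"total": 500, "common": 350, "edge": 100, "adversarial": 50},
    "ship_criteria": {"all_thresholds": True, "regression_check": True,
                      "human_review_sample": 50},
}

def validate_spec(spec: dict) -> list:
    """Cheap sanity checks: test-set splits sum to the total, and each
    dimension's threshold is a sensible fraction."""
    errors = []
    ts = spec["test_set"]
    if ts["common"] + ts["edge"] + ts["adversarial"] != ts["total"]:
        errors.append("test set splits don't sum to total")
    for d in spec["dimensions"]:
        if not 0 < d["threshold"] <= 1:
            errors.append(f"{d['name']}: threshold out of range")
    return errors
```

Keeping the spec machine-readable also makes the ship gate trivial to wire up: the same file that documents quality is the one the eval runner reads.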
The Shift Is Mental, Not Technical
I want to be clear about something: writing eval specs isn't hard. The template above isn't complicated. Any smart PM could fill it out.
The hard part is the mental shift. It's accepting that your beautifully crafted PRD, the one you spent a week on, the one your VP praised, the one you're proud of, isn't the right tool anymore.
It's accepting that "the AI should generate relevant recommendations" isn't an acceptance criterion. It's a hope.
It's accepting that if you can't measure quality, you don't understand quality. And if you don't understand quality, you're not qualified to ship the feature.
I made this shift the hard way: by shipping a feature that failed. You can make it the easy way, by starting to think in evals now, before your next AI feature forces you to.
Try This Week
Take the last PRD you wrote (or one you're working on now). For every acceptance criterion, ask yourself: "How would I test this across 500 different inputs?"
If the answer is "I can't," rewrite it as an eval criterion. Define the dimension, the scoring method, and the threshold.
You don't need to build the eval infrastructure. Just write the spec. The act of defining quality rigorously will change how you think about AI features permanently.
Then share it with your engineering team. I promise you: they'll prefer it to a PRD. Because for the first time, they'll know exactly what "done" means.
The PM who writes the best eval wins. Not the PM who writes the best PRD.
PRDs were the right tool for deterministic software. Evals are the right tool for AI. The transition isn't optional; it's happening whether you drive it or not.
The PMs who figure this out first are the ones defining quality at their companies. The ones who don't are writing documents that nobody can test, shipping features they can't measure, and wondering why the AI PM down the hall is getting promoted faster.
Don't be that PM. Write the eval.