Stop Writing PRDs. Write Evals.
I haven't written a traditional PRD in 6 months.
Not because I'm lazy. Not because I think documentation doesn't matter. Because PRDs are built for a world where software does exactly what you tell it to. And I don't build that kind of software anymore.
I run an AI platform at a $7B+ SaaS company. The platform saves customers over 800,000 hours. Every feature I ship has non-deterministic outputs, meaning the same input can produce different results every time. And when your outputs aren't predictable, a document that says "the system shall do X" is basically fiction.
So I replaced PRDs with something better. Let me show you what that looks like.
The Problem with PRDs in an AI World
A traditional PRD is a contract. It says: here's what we're building, here's how it should behave, here's the acceptance criteria. Engineering builds to the spec. QA tests against the spec. Everyone agrees on what "done" looks like.
This works when outputs are deterministic. When you say "clicking this button saves the form," that's testable. Binary. It either saves or it doesn't.
Now try writing a PRD for an AI feature that generates personalized recommendations for business users.
What would the acceptance criteria look like? "Recommendations should be relevant"? That's not testable. "Recommendations should be accurate"? Accurate compared to what? "Recommendations should be helpful"? By whose definition?
I've seen teams try. They write acceptance criteria like:
- The AI should generate relevant recommendations
- Recommendations should be factually accurate
- The tone should match our brand voice
- The output should be appropriate for the user's context
Those aren't acceptance criteria. Those are wishes. They're subjective, unmeasurable, and they mean different things to different people on the team.
| Traditional PRD | Eval Spec |
|---|---|
| Describes intent: "should be relevant" | Defines measurement: "relevance score ≥ 90% across 500 test cases" |
| Acceptance criteria are binary (pass/fail) | Quality is a spectrum with thresholds per dimension |
| Tested manually on 5–50 examples | Scored automatically across 200–500+ cases |
| "Done" = PM approved it | "Done" = eval thresholds met |
| Works for deterministic software | Built for non-deterministic AI outputs |
| Edge cases discovered in production | Edge cases built into the test set upfront |
The result? Engineering builds something. PM looks at a few outputs and says "this seems fine." They ship it. Two weeks later, customer support is flooded with complaints about bizarre recommendations for specific edge cases nobody tested.
I lived this. Early on, before I changed my process, we shipped a feature with a PRD that had what I thought were solid acceptance criteria. We tested maybe 50 outputs manually. Looked great.
In production, it handled common cases beautifully. But for a specific segment of users with unusual data patterns (maybe 8% of our user base) the outputs were genuinely bad. Not dangerous, but clearly wrong. The kind of wrong that makes users lose trust in your product permanently.
That failure taught me the PRD was the problem. Not because it was a bad PRD; it was thorough, well-written, stakeholder-approved. But it was the wrong tool for the job. Like using a ruler to measure temperature.
What Replaced the PRD
I now write three documents for every AI feature instead of one PRD. Together they take about the same time, but they actually work.
1. The Eval Spec
This is the big one. The eval spec is my PRD replacement. It defines quality not as prose descriptions but as measurable criteria with test cases.
An eval spec has:
Quality dimensions with definitions. Not "accuracy" in the abstract, but "Factual accuracy: every claim in the output must be verifiable against the user's actual data. Scoring: binary pass/fail per claim, with an overall score of claims_passed / total_claims."
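As a sketch, that per-claim scoring can be made concrete. Everything here is illustrative: the claim format (a metric name plus a claimed value) and the 1% tolerance mirror the accuracy dimension later in this post, and a real pipeline would need a claim-extraction step in front of it.

```python
def verify_claim(metric: str, claimed: float, source: dict, tol: float = 0.01) -> bool:
    """A claim passes if the metric exists in the source data and
    matches it within tolerance. A metric absent from the source is a
    hallucination and fails outright."""
    actual = source.get(metric)
    if actual is None:
        return False
    return abs(claimed - actual) <= tol * abs(actual)

def accuracy_score(claims: list[tuple[str, float]], source: dict) -> float:
    """Binary pass/fail per claim, aggregated as claims_passed / total_claims."""
    if not claims:
        return 0.0
    passed = sum(verify_claim(m, v, source) for m, v in claims)
    return passed / len(claims)

# Illustrative data: two verifiable claims, one hallucinated metric.
source = {"open_rate": 0.42, "ctr": 0.05}
claims = [("open_rate", 0.42), ("ctr", 0.05), ("revenue", 100.0)]
score = accuracy_score(claims, source)  # 2 of 3 claims pass
```

The useful property is that the aggregate is fully explained by per-claim results, so a failing output points directly at the claim that sank it.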
A test set. Hundreds of input-output pairs covering the full distribution. Common cases, edge cases, adversarial cases, different user segments. Each one has scoring criteria for every quality dimension.
Thresholds. "We ship when accuracy ≥ 95%, relevance ≥ 90%, and safety = 100% across the full test set." These are non-negotiable. They're the quality gate.
Automated scoring where possible. For dimensions that can be machine-scored, the eval spec defines the scoring logic. For dimensions requiring human judgment, it defines the rubric reviewers use.
2. The Model Comparison Doc
Before we commit to a model for a feature, I run candidates against the eval suite and document the results.
This is a simple table: Model A vs. Model B vs. Model C across every quality dimension, plus latency and cost per output. The decision is usually obvious once you see the data.
I've had cases where the "obvious" model choice was wrong. We assumed we needed a frontier model for a specific task. Ran the eval. A model that cost 1/10th as much scored within 2% on quality. That's not a technical decision; it's a product decision with direct P&L impact. The PM needs to make it.
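The comparison itself is simple enough to sketch. All the numbers below are invented; the point is the shape of the decision: filter to models that clear the quality floor, then let cost break the tie.

```python
# Hypothetical eval-suite results per candidate model (invented numbers).
candidates = {
    "model_a": {"quality": 0.94, "latency_s": 2.1, "cost_per_1k": 12.00},
    "model_b": {"quality": 0.92, "latency_s": 0.8, "cost_per_1k": 1.20},
    "model_c": {"quality": 0.85, "latency_s": 0.5, "cost_per_1k": 0.60},
}

def viable(stats: dict, quality_floor: float = 0.90) -> bool:
    """Only models above the quality floor are candidates at all."""
    return stats["quality"] >= quality_floor

# Among viable models, prefer the cheapest -- the "within 2% on quality
# at 1/10th the cost" call described above.
best = min(
    (name for name, s in candidates.items() if viable(s)),
    key=lambda name: candidates[name]["cost_per_1k"],
)
```

Here the cheapest model overall fails the quality floor, so the pick is the mid-priced one: the eval data, not the brand of the model, drives the choice.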
3. The Edge Case Registry
This one evolved organically. It's a living document of every interesting failure case we discover: in evals, in production, from customer reports. Each entry has the input, what went wrong, why it went wrong, and whether we've addressed it.
The edge case registry is where product intuition gets encoded into institutional knowledge. When new team members join, they read it. When we update models, we check every case in it. It's the closest thing to tribal knowledge written down.
A Real Example: Eval Spec vs. PRD
Let me show you the difference concretely. I'll use an anonymized version of a feature we built: an AI system that generates action plans for business users based on their performance data.
What a PRD would say:
Feature: AI Action Plan Generator
User Story: As a business user, I want AI-generated action plans so I can improve my performance metrics.
Acceptance Criteria:
- Action plans should be relevant to the user's specific data
- Recommendations should be actionable and specific
- Tone should be professional and encouraging
- Plans should include 3-5 recommended actions
- Each action should explain the expected impact
That's fine. It communicates intent. But it's not testable at scale.
What my eval spec says (abbreviated):
Feature: AI Action Plan Generator
Eval Version: 2.3
Quality Dimensions:
Data Accuracy (weight: 0.30)
- Every metric referenced in the plan must match the user's actual data within 1% tolerance
- No hallucinated metrics or fabricated trends
- Scoring: automated check against source data, binary per-claim, aggregate as percentage
Relevance (weight: 0.25)
- Each recommended action must logically connect to a specific underperformance in the user's data
- Actions must be feasible given the user's current tier and available features
- Scoring: model-graded on 1-5 scale per action, threshold ≥ 4.0 average
Specificity (weight: 0.20)
- Actions must reference specific features, specific metrics, and specific timeframes
- "Improve your email strategy" = fail. "Increase your welcome series open rate by adding a personalized subject line using your customer's first name" = pass
- Scoring: human eval, rubric-graded
Tone (weight: 0.10)
- Professional, direct, confident. Not condescending, not hedging
- No phrases: "you might want to consider," "it could be beneficial to"
- Yes phrases: "Do this," "Here's your priority," "Focus on"
- Scoring: automated keyword check + model-graded for overall tone
Safety (weight: 0.15)
- No recommendations that could harm the user's business if followed
- No recommendations requiring capabilities the user doesn't have
- No references to competitor products
- Scoring: automated rules + human spot-check, binary, threshold = 100%
Test Set: 500 cases
- 350 common profiles (balanced across segments)
- 100 edge cases (new users, power users, unusual data patterns, sparse data)
- 50 adversarial cases (contradictory data, extreme outliers, empty fields)
Ship Threshold:
- Data Accuracy ≥ 97%
- Relevance ≥ 92%
- Specificity ≥ 88%
- Tone ≥ 90%
- Safety = 100%
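A minimal sketch of the ship gate this spec implies, assuming per-dimension scores have already been aggregated across the full test set (the scores below are illustrative):

```python
# Per-dimension ship thresholds from the spec above.
THRESHOLDS = {
    "data_accuracy": 0.97,
    "relevance": 0.92,
    "specificity": 0.88,
    "tone": 0.90,
    "safety": 1.00,  # non-negotiable
}

def ship_decision(scores: dict) -> tuple[bool, list]:
    """Return (ship?, failing dimensions). Every dimension must meet
    its threshold; a missing score counts as a failure."""
    failing = [d for d, t in THRESHOLDS.items() if scores.get(d, 0.0) < t]
    return (not failing, failing)

# Illustrative run: four dimensions pass, tone misses its 0.90 threshold.
scores = {"data_accuracy": 0.98, "relevance": 0.93,
          "specificity": 0.91, "tone": 0.89, "safety": 1.00}
ok, failing = ship_decision(scores)
```

The gate is deliberately boring: no weighting, no averaging across dimensions, because a strong accuracy score shouldn't be allowed to paper over a tone or safety miss.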
See the difference? The eval spec doesn't leave quality up to interpretation. It defines exactly what good means, how to measure it, and what threshold we need to hit before we ship.
Engineering doesn't build to a vague description of "relevant and actionable." They build until the evals pass. And when someone asks "is this feature ready to ship?" the answer isn't a PM's gut feel โ it's a dashboard showing pass rates across every dimension.
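For instance, the "automated keyword check" half of the Tone dimension can be a few lines, using the banned hedging phrases from the spec above (a real check would pair this with model grading for overall tone):

```python
# Banned hedging phrases, taken from the Tone dimension in the spec.
BANNED = ["you might want to consider", "it could be beneficial to"]

def tone_keyword_check(output: str) -> bool:
    """Fail fast (case-insensitively) if any banned phrase appears."""
    lowered = output.lower()
    return not any(phrase in lowered for phrase in BANNED)
```

A check like this is crude on its own, but it turns one slice of a fuzzy dimension into a deterministic pass/fail that runs on every test case for free.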
The Mini-Template
Here's a stripped-down eval spec template you can steal and use immediately. Adapt it to your product.
Feature Name:
Eval Version:
Owner:
Last Updated:
Quality Dimensions:
For each dimension, define:
- Name and weight (how much does this dimension matter relative to others?)
- Definition (what does this dimension mean, specifically, for this feature?)
- Scoring method (automated / model-graded / human eval?)
- Scoring criteria (rubric, scale, or binary; be explicit)
- Threshold (minimum score to ship)
Test Set:
- Total size: (minimum 200 for any non-trivial feature)
- Common case split: (% covering typical usage)
- Edge case split: (% covering unusual but valid inputs)
- Adversarial split: (% covering intentionally tricky inputs)
- Segment coverage: (list user segments and minimum cases per segment)
Ship Criteria:
- All dimensions must meet threshold: Yes/No
- Regression check against previous version: Yes/No
- Human review sample size before final sign-off: (number)
Edge Cases to Watch: (List specific scenarios you're worried about; this seeds your edge case registry)
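If it helps, the template can live as a data structure rather than prose, so a CI job can sanity-check it. Field names mirror the template above; every value is a placeholder, not a recommendation.

```python
# The mini-template expressed as data (placeholder values throughout).
EVAL_SPEC = {
    "feature_name": "AI Action Plan Generator",
    "eval_version": "2.3",
    "dimensions": [
        {"name": "data_accuracy", "weight": 0.30,
         "scoring": "automated", "threshold": 0.97},
        # ... one entry per dimension
    ],
    "test_set": {"total": 500, "common": 350, "edge": 100, "adversarial": 50},
    "ship_criteria": {"all_thresholds": True, "regression_check": True,
                      "human_review_sample": 50},
}

def validate_spec(spec: dict) -> list:
    """Cheap sanity checks: test-set splits sum to the total, and each
    dimension's threshold is a sensible fraction."""
    errors = []
    ts = spec["test_set"]
    if ts["common"] + ts["edge"] + ts["adversarial"] != ts["total"]:
        errors.append("test set splits don't sum to total")
    for d in spec["dimensions"]:
        if not 0 < d["threshold"] <= 1:
            errors.append(f"{d['name']}: threshold out of range")
    return errors
```

Keeping the spec machine-readable also makes the ship gate trivial to wire up: the same file that documents quality is the one the eval runner reads.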
The Shift Is Mental, Not Technical
I want to be clear about something: writing eval specs isn't hard. The template above isn't complicated. Any smart PM could fill it out.
The hard part is the mental shift. It's accepting that your beautifully crafted PRD, the one you spent a week on, the one your VP praised, the one you're proud of, isn't the right tool anymore.
It's accepting that "the AI should generate relevant recommendations" isn't an acceptance criterion. It's a hope.
It's accepting that if you can't measure quality, you don't understand quality. And if you don't understand quality, you're not qualified to ship the feature.
I made this shift the hard way: by shipping a feature that failed. You can make it the easy way, by starting to think in evals now, before your next AI feature forces you to.
Try This Week
Take the last PRD you wrote (or one you're working on now). For every acceptance criterion, ask yourself: "How would I test this across 500 different inputs?"
If the answer is "I can't," rewrite it as an eval criterion. Define the dimension, the scoring method, and the threshold.
You don't need to build the eval infrastructure. Just write the spec. The act of defining quality rigorously will change how you think about AI features permanently.
Then share it with your engineering team. I promise you: they'll prefer it to a PRD. Because for the first time, they'll know exactly what "done" means.
The PM who writes the best eval wins. Not the PM who writes the best PRD.
PRDs were the right tool for deterministic software. Evals are the right tool for AI. The transition isn't optional; it's happening whether you drive it or not.
The PMs who figure this out first are the ones defining quality at their companies. The ones who don't are writing documents that nobody can test, shipping features they can't measure, and wondering why the AI PM down the hall is getting promoted faster.
Don't be that PM. Write the eval.