Product Growth

Why 90% of LLM Judges Fail - And How PMs Can Fix Them

Most teams get evals wrong. Here’s how PMs build judges that actually work

Aakash Gupta, Hamel Husain, and Shreya Shankar
Sep 15, 2025

2025 has been a wild year for PMs. New skills to learn have been sprouting up everywhere. And at the top of the list is AI Evals.

You don’t have to just take it from me. The CPO at OpenAI has said Evals are one of the most important new skills for PMs.

When new PM skills develop, I want to give you the web’s most tactical playbooks to execute them:

  1. In my first post on AI Evals, I broke down the "Analyze, Measure, Improve" lifecycle for systematically improving your product

  2. Last week, I dove deep into the “Analyze” phase with observability

  3. Today, it's time for a deep dive into the "Measure" phase

The question we answer: Once you know what's broken, how do you measure it at scale?


TL;DR

  • A/B tests = real-world performance

  • Evals = predicted performance on known test cases


Re-Introducing Hamel and Shreya

I’ve brought back my Evals teachers for a follow-up piece. Hamel Husain and Shreya Shankar are the evals experts who even OpenAI and Anthropic go to.

I personally took their Evals course earlier this year and highly recommend it. Here’s an enormous 35% off for my community:

Get 35% Off


But First, Do You Even Need Evals?

For the past 4-5 months, it seemed like there was widespread consensus that evals were an important topic for PMs.

Then, this tweet happened:

It shared an enticing idea, in exactly the wrapper that AI engineers wanted to see:

  • The AI-pilled engineers have noticed evals aren’t a strict requirement for success and may even be anticorrelated with it.

  • Claude Code, the leading AI product for developers right now, doesn’t even do evals.

So do you really need evals? And how do you defend them to your AI engineers?

Here's the thing: everyone is already doing evals; they just don't call them "evals."

When you dogfood your product, analyze what feels off, and iterate based on user feedback, that's evaluation. It's systematic error analysis happening continuously.

The real question isn't whether you evaluate, but when you can afford to be less rigorous and when you cannot.

There are two cases where you can get away with lighter evals:

  1. Your task is already well-represented in posttraining. Building a coding agent? Model providers have already done extensive coding evals during posttraining. You're benefiting from billions of dollars of upstream evaluation work.

  2. Your team has deep domain expertise and religiously dogfoods. If you can look at outputs, immediately spot what's wrong, and iterate effectively based on feel, you might not need much more.

This explains why many successful AI products appear to be built without evals. They're either riding on upstream evaluation work or driven by teams who know their space inside and out.

The problem is that most teams don’t fall into these categories.

The majority of teams building AI products today are:

  • Working on novel applications not well-covered in posttraining

  • Still developing their domain expertise and product intuition

  • New to systematic error analysis and iterative improvement processes

For these teams—which is most teams—dismissing evals removes the very tools that help you understand what's working, what's not, and how to make progress.

This is why we see anti-eval sentiment as genuinely harmful for the community. It's steering new builders away from techniques that could accelerate their learning and improve their products.

We’re not going to say you need evals for everything, but we’re going to show you the PM side of evals for when you do need them.


Today’s Deep Dive

We’ve put together the AI PM’s ultimate guide to LLM Judges:

  1. Why You Need LLM Judges

  2. Why 90% of LLM Judges Still Fail

  3. The Playbook for Trustworthy LLM Judges

  4. Why Evals Are Your Competitive Moat

  5. Why PMs Must Be Involved


Before we go any further, to make sure you walk away with lots of value, here’s a full 38-page FAQ on Evals:

Evals FAQ (PDF, 3.39MB) ∙ Download


1. Why You Need LLM Judges

Here's the brutal truth: manual evaluation doesn't scale.

You can manually review 50 customer support conversations to understand what's broken. But you can't manually review 50,000. And without scale, you're flying blind on whether your latest prompt tweak actually moved the needle.

This is why every serious AI team eventually builds LLM judges - using one LLM to evaluate another LLM's output.

The appeal is obvious:

  • Scale infinitely: Evaluate thousands of responses in minutes

  • Catch subjective issues: Unlike rule-based checks, LLMs can assess tone, relevance, and reasoning quality

  • Cost effective: Cheaper than hiring human evaluators for every response
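
To make "one LLM grading another" concrete, here's a minimal sketch of a binary pass/fail judge for a customer-support reply. It assumes the OpenAI Python client and an API key in your environment; the model name, prompt wording, and example data are illustrative placeholders, not the prescribed setup from the playbook below.

```python
# Minimal LLM-judge sketch: one model grades another model's output
# against a single, binary failure mode ("did the reply answer the
# customer's actual question?"). Model, prompt, and data are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are reviewing a customer-support reply.

Question from customer:
{question}

Reply from the assistant:
{reply}

Did the reply directly address the customer's question?
Answer with exactly one word, PASS or FAIL, then a one-sentence reason."""


def judge(question: str, reply: str, model: str = "gpt-4o-mini") -> dict:
    """Return a binary verdict plus the judge's stated reason."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the judge as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, reply=reply),
        }],
    )
    text = resp.choices[0].message.content.strip()
    verdict, _, reason = text.partition("\n")
    return {"pass": verdict.strip().upper().startswith("PASS"),
            "reason": reason.strip() or verdict}


# Usage: run the judge over a batch of logged conversations, report a rate.
if __name__ == "__main__":
    examples = [
        {"question": "How do I reset my password?",
         "reply": "You can update billing info under Settings > Billing."},
    ]
    results = [judge(e["question"], e["reply"]) for e in examples]
    print(f"pass rate: {sum(r['pass'] for r in results) / len(results):.0%}")
```

Even a naive judge like this gives you a number you can track across prompt changes. Whether that number means anything is the harder part.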

But most teams implement LLM judges wrong, get garbage metrics, and conclude "evals don't work."

The real problem isn't evaluation - it's bad evaluation.

So we're going to show you exactly how to build LLM judges that actually work. The step-by-step playbook that elite teams use to ship AI with complete confidence.

A guest post by Hamel Husain (machine learning engineer with over 20 years of experience; more at https://hamel.dev) and Shreya Shankar.