AI PM's Ultimate Guide: Observability

Everything you need to know about debugging AI products in production

Sep 07, 2025

Every week, I talk to PMs whose AI products work flawlessly in demos but fail mysteriously in production. They share the same frustrations:

"It worked yesterday, now it's booking flights to the wrong city."
"Our AI is responding to the same ticket 50 times."
"Users say it's slow, but we can't figure out why."

These aren’t rare bugs. They’re systematic failure modes that creep in when you ship AI without observability.

And here’s the first big myth to clear up:
👉 Evals alone won’t save you.

Think of evals as exams. They’re great at telling you if your system passes or fails a test. But observability is the day-to-day monitoring that makes those exams meaningful. Without observability, evals are like taking a test in the dark.

Introducing Aman Khan

So I brought in the expert on observability for AI PMs: Aman Khan.

His Maven course Prototype to Production starts October 13th. Use my code AAKASHxMAVEN to get $100 off.

What You'll Master in This Complete Guide

We’ve put together a battle-tested playbook (full of real stories) that will transform how you ship AI products:

The Basics of Observability
- The New AI PM Superpower
- The ROI of Observability
How to Implement Observability
- Observability 101
- Your 5-Minute Setup Plan
- 3 Levels of AI Observability
- Real Production Failures and Their Traces
Advanced Observability Topics
- The PM vs Engineer’s Role
- Advanced Techniques
- Implementation Path

Before we go any further, here’s a tactical takeaway to make sure you get value:

The Complete AI Agents Metrics Playbook

Now let’s continue with our guide…

1. Why Observability is a Key AI PM Skill

1a. The New AI PM Superpower

Traditional PM:

The AI seems off today

AI PM who understands observability:

The AI fails on queries longer than 50 words when the context exceeds 10 documents, specifically hallucinating technical features between 2 and 4 PM PST, when our traffic peaks

The difference? The second one gets fixed.

The traditional PM files a vague bug report that sits in the backlog. Engineers can't reproduce "seems off."

They try a few random fixes - maybe adjust the temperature, maybe try a different model - but without clear success criteria, nothing really improves.

The PM gets frustrated, the engineers get defensive. And users? They keep suffering.

The AI PM who understands observability comes to the engineering discussion with receipts:

They show specific traces where long queries caused retrieval to pull too many documents, overwhelming the context window
They point to the timestamp pattern - the failures cluster during peak traffic when the cache hit rate drops
They even have a hypothesis: "Looks like our cache eviction during high traffic is forcing more retrievals, which pulls more documents, which breaks our context window"
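
To make those "receipts" concrete, here is a minimal sketch of the slicing behind them. It assumes you can export traces as simple records; the `Trace` fields, the thresholds, and the sample data are all hypothetical and not tied to any particular observability tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical trace record; field names are assumptions for illustration."""
    hour: int            # local hour of the request, 0-23
    query_words: int     # length of the user query in words
    docs_retrieved: int  # documents pulled into the context
    cache_hit: bool      # whether retrieval was served from cache
    failed: bool         # whether a reviewer or eval flagged the response


def failure_rate(traces):
    """Share of traces flagged as failures (0.0 for an empty list)."""
    return sum(t.failed for t in traces) / len(traces) if traces else 0.0


def summarize_by_hour(traces):
    """Per-hour failure rate and cache hit rate, to spot time-of-day clusters."""
    buckets = defaultdict(list)
    for t in traces:
        buckets[t.hour].append(t)
    return {
        hour: {
            "failure_rate": failure_rate(group),
            "cache_hit_rate": sum(t.cache_hit for t in group) / len(group),
        }
        for hour, group in sorted(buckets.items())
    }


def suspect_slice(traces, min_words=50, min_docs=10):
    """Traces matching the pattern in the example: long query, many documents."""
    return [
        t for t in traces
        if t.query_words > min_words and t.docs_retrieved > min_docs
    ]


# Made-up sample data, only to show the shape of the analysis.
traces = [
    Trace(hour=10, query_words=12, docs_retrieved=4, cache_hit=True, failed=False),
    Trace(hour=10, query_words=60, docs_retrieved=5, cache_hit=True, failed=False),
    Trace(hour=14, query_words=72, docs_retrieved=14, cache_hit=False, failed=True),
    Trace(hour=14, query_words=55, docs_retrieved=12, cache_hit=False, failed=True),
    Trace(hour=14, query_words=20, docs_retrieved=6, cache_hit=True, failed=False),
]

by_hour = summarize_by_hour(traces)
overall_rate = failure_rate(traces)
slice_rate = failure_rate(suspect_slice(traces))
print(by_hour)
print(f"overall={overall_rate:.0%}, long-query/many-docs slice={slice_rate:.0%}")
```

If the slice fails far more often than the overall population, and the bad hours line up with a lower cache hit rate, you have a specific, reproducible claim to bring to engineering instead of "seems off."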

We’ll help you be exactly that by the end of today’s piece.

We'd argue that observability is the second level of AI PM skills:

Here are links to all those AI PM articles.

Don't rush to build the higher-level skills before you have observability in place.

1b. The ROI of Observability

The ROI is very high, and we’ll demonstrate it by dismantling the three most common objections we hear:

Objection 1 - "This is for developers - It’s too technical"

This is the top blocker. PMs picture digging through raw logs or writing Python scripts.

But in reality, modern observability tools are built for product teams. They show user journeys, not stack traces.

Instead of “Null pointer exception at line 187”
→ you see “AI retrieved 7 docs, generated response in 2.4s, ignored 3 docs.”
Instead of SQL joins and token counts
→ you see “Response took 3x longer than average” or “Hallucinated missing fields.”

One PM told Aman:

I was intimidated until I realized traces are just clickstream analytics for AI. Easier to read than Amplitude.

If you can read a funnel chart, you can read a trace.

Objection 2 - "We don't have time"

This one sounds reasonable. Shipping AI features is fast-paced. Why stop to add “extra tooling”?

But here’s the math:

Setup: 30 minutes
Debugging one real incident without observability: 2–3 days
Multiply that across quarters: weeks of lost velocity
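
Here is that math worked through as a small sketch. The 8-hour workday and the incident count in the last line are our assumptions for illustration; only the 30 minutes and the 2–3 days come from the figures above.

```python
HOURS_PER_WORKDAY = 8  # assumption: the figures above do not define a "day"

setup_hours = 0.5  # "Setup: 30 minutes"

# "Debugging one real incident without observability: 2-3 days"
incident_hours_low = 2 * HOURS_PER_WORKDAY
incident_hours_high = 3 * HOURS_PER_WORKDAY

# How many times over a single avoided incident pays back the setup time.
payback_low = incident_hours_low / setup_hours
payback_high = incident_hours_high / setup_hours


def lost_workdays(incidents, days_per_incident):
    """Working days lost to debugging for a given number of incidents."""
    return incidents * days_per_incident


print(f"One incident costs {payback_low:.0f}x to {payback_high:.0f}x the setup time")
# Hypothetical scenario: 4 incidents at 2.5 days each.
print(f"Lost working days: {lost_workdays(4, 2.5)}")
```

Under these assumptions, one avoided incident repays the setup time 32 to 48 times over, and a handful of incidents adds up to two working weeks.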

A fintech startup Aakash worked with spent two weeks debugging why their agent was giving reckless investment advice. Observability would’ve revealed in 5 minutes that a stray “be aggressive” string had slipped into a prompt template.

The fix: one line.
The damage: months of lost trust.

You don’t have time not to do this. The ROI is clear.

Objection 3 - "Our AI isn’t hallucinating or failing much"

The deadliest myth. AI fails silently: it looks right even when it’s wrong.

A travel bot books the wrong city 10% of the time
→ only the unlucky few complain
A recommendation engine suggests swimsuits to Alaskans and coats to Miamians
→ CTRs look fine, revenue silently leaks
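
A rough sketch of how traces surface this kind of silent failure: compare what the user asked for with what the agent actually did, instead of waiting for complaints. The record fields (`requested_city`, `booked_city`, `complained`) and the sample data are hypothetical.

```python
def normalize(city):
    """Case- and whitespace-insensitive comparison key for a city name."""
    return city.strip().lower()


def wrong_city_rate(bookings):
    """Share of bookings where the booked city differs from the requested one."""
    if not bookings:
        return 0.0
    mismatches = [
        b for b in bookings
        if normalize(b["requested_city"]) != normalize(b["booked_city"])
    ]
    return len(mismatches) / len(bookings)


def complaint_rate(bookings):
    """Share of bookings where the user filed a complaint."""
    if not bookings:
        return 0.0
    return sum(b["complained"] for b in bookings) / len(bookings)


# Made-up data: 18 correct bookings, 2 wrong ones, only 1 of which was reported.
bookings = (
    [{"requested_city": "Paris", "booked_city": "paris ", "complained": False}] * 18
    + [
        {"requested_city": "Portland", "booked_city": "Poland", "complained": True},
        {"requested_city": "San Jose", "booked_city": "San Juan", "complained": False},
    ]
)

print(f"wrong city: {wrong_city_rate(bookings):.0%}")
print(f"complaints: {complaint_rate(bookings):.0%}")
```

In this made-up sample, the trace-level check finds twice as many failures as the support queue does. The gap between those two numbers is the part of the problem you can't see without observability.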

Without observability, you’re the frog in boiling water. Everything “seems fine” until it’s not.

Let’s help you avoid these key mistakes.

2. How to Implement Observability

Now that we've established why observability is essential for AI PMs, let's get practical.

You might be wondering: "This sounds great in theory, but how do I actually start?" The good news: you can go from zero to observability faster than you think.

We'll start with a dead-simple 5-minute setup that works today, then build up to production-grade observability that scales with your product.

A guest post by

Aman Khan

AI Product Guy