How to measure AI features beyond vanity metrics

[Figure: The Three-Layer Evaluation Stack. Layer 1: Offline Evals, a golden set of real examples with LLM-as-a-Judge. Layer 2: Online Experiments, controlled experiments tied to business outcomes. Layer 3: Production Monitoring, watching for drift, hallucinations, and model changes.]

Let's be honest for a second. Most AI teams right now are running on vibes.

We've all seen the roadmap. Everyone has at least one shiny generative AI feature in production. It demos great. The executives love seeing the text stream onto the screen. But if you ask the hard question—"is this actually making money or saving serious time?"—the room usually goes quiet.

The reality is pretty sobering. We're looking at a landscape where 78% of companies say they use AI in at least one function, yet only about 1% describe themselves as "mature" enough to see major business impact. Another study from 2025 hit us with the stat that 42% of companies actually abandoned most of their AI projects this year, up significantly from the year before. And despite tens of billions—specifically $33.9 billion in 2024—poured into this tech, roughly 95% of these projects fail to produce meaningful outcomes.

78% of companies use AI · 1% are mature enough for impact · 95% of projects fail

That gap exists because we are measuring the wrong things. We are obsessed with vanity metrics.

We look at prompt volume, feature click-through rates, session counts, or vague "delight" scores. But for product managers in 2025, the job has changed. It's no longer enough to say "people are using it." You have to be able to say "this feature changed a business KPI by X percent."

If you are a PM trying to ship LLM-powered products that actually survive the next budget cut, you need a different playbook.

The Vanity Trap

Here is the problem with how we used to measure things. Traditional accuracy metrics like precision, recall, and F1 scores are just the starting line. They don't tell you if the product is good. You could have a model achieving 95% accuracy, but if it takes 10 seconds to respond, the user experience is trash. Compare that to an 85% accurate model that responds in 200 milliseconds. Users might actually prefer the dumber, faster one because it doesn't break their flow.

Yet teams obsess over that accuracy number while users quietly abandon the product.

Key Insight
We need to stop treating AI features like science experiments and start treating them like product features. That means moving from "model scores" to "business outcomes."

Start with the Business, Not the Bot

With agentic AI and copilots spreading across the stack, evaluation feels overwhelmingly technical. It's easy to get lost in the weeds of token counts and cosine similarity. But the anchor has to be a metric you already care about.

Think about the "IMPACT" of the feature. You need to tie every AI PRD to a primary business metric. If you can't articulate this in a single sentence, you are probably still at the toy stage.

For a PM, this lives in familiar territory. Are you driving revenue? Stitch Fix grew revenue 88%, driven largely by AI personalization that boosted average order value. That is a real number. You should be looking at incremental revenue per active user or conversion rate lift.

Or maybe you are focused on cost and efficiency. If you ship an AI support assistant, a bad metric is "Agent adoption" or "AI used in 60% of tickets." Who cares? A good metric is "Average handle time down 20% at the same CSAT."

You also have to watch the margins. AI-native products typically show gross margins of 50-65%, compared to the comfy 70-85% of traditional SaaS. You need to track the cost of goods sold, including those third-party inference costs. If your AI feature costs more to run than the value of the time it saves the user, you don't have a product. You have a subsidy.
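
To make the margin point concrete, here is a back-of-the-envelope sketch in Python. Every number in it is a made-up placeholder, not a benchmark; plug in your own price, blended inference cost, and usage.

```python
# Back-of-the-envelope unit economics for an AI feature.
# All numbers below are illustrative placeholders, not benchmarks.
monthly_price_per_user = 30.00        # what the customer pays per month
inference_cost_per_request = 0.012    # blended third-party model cost per request
requests_per_user_per_month = 900
other_cogs_per_user = 4.00            # hosting, vector store, support tooling, etc.

inference_cogs = inference_cost_per_request * requests_per_user_per_month
total_cogs = inference_cogs + other_cogs_per_user
gross_margin = (monthly_price_per_user - total_cogs) / monthly_price_per_user

print(f"inference COGS per user: ${inference_cogs:.2f}")
print(f"total COGS per user: ${total_cogs:.2f}")
print(f"gross margin: {gross_margin:.0%}")  # compare against the 50-65% AI-native range
```

If that last number keeps sliding toward zero at realistic usage, you are looking at the subsidy problem, not a product.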

The Workflow is the Product

AI usually works inside a workflow—drafting, summarizing, suggesting next actions. That is where you measure real change. You can't just measure the AI step; you have to measure the whole flow.

A report from Atlassian found that developers save over 10 hours per week with AI tools. Sounds great, right? But they lose a similar amount of time to organizational friction, which neutralizes the benefit. If you only measured the AI tool, you'd think you were winning. If you measure the workflow, you realize you haven't actually moved the needle.

For workflows, try putting your metrics into a simple grid.

Outcome Metrics: Look at task success rate. What percentage of tasks reach the intended end state without a human having to step in and fix it? For an AI document assistant, track the percentage of drafts that ship with only minor edits.

Time Saved: But be careful here. Measure the median time on task before versus after the AI was introduced. And watch out for the "thinking time" trade-off. With new reasoning models like OpenAI's o1, the model takes longer to "think." You have to balance that depth against the user's tolerance for staring at a spinning wheel.

Quality Metrics: This is where the "Rework Rate" comes in, and it's my favorite metric: the percentage of AI outputs that users discard and rewrite from scratch. If you see high usage but a high rework rate, your "usage" metric is a lie. Your users are just clicking the button, realizing the output is bad, and doing it themselves anyway. A rough sketch of computing all three metrics from an event log follows below.
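
Here is that sketch, pulling the three numbers out of a task-level event log. The field names (accepted, minor_edits_only, seconds_on_task, discarded_and_rewritten) are hypothetical stand-ins for whatever your analytics events actually capture.

```python
# Computing the three workflow metrics from a task-level event log.
# The records below are dummy data; in practice this would be a warehouse query.
from statistics import median

tasks = [
    {"accepted": True,  "minor_edits_only": True,  "seconds_on_task": 210, "discarded_and_rewritten": False},
    {"accepted": True,  "minor_edits_only": False, "seconds_on_task": 480, "discarded_and_rewritten": False},
    {"accepted": False, "minor_edits_only": False, "seconds_on_task": 530, "discarded_and_rewritten": True},
]

# Outcome: share of tasks that reached the end state with only minor edits
task_success_rate = sum(t["accepted"] and t["minor_edits_only"] for t in tasks) / len(tasks)

# Time saved: compare this median against the pre-AI baseline, not against zero
median_time_on_task = median(t["seconds_on_task"] for t in tasks)

# Quality: rework rate, i.e. outputs the user threw away and redid from scratch
rework_rate = sum(t["discarded_and_rewritten"] for t in tasks) / len(tasks)

print(f"task success rate: {task_success_rate:.0%}")
print(f"median time on task: {median_time_on_task:.0f}s")
print(f"rework rate: {rework_rate:.0%}")
```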

The Three-Layer Evaluation Stack

So how do you actually measure this stuff without going crazy? Best practices emerging in 2025 point to a three-layer stack.

1. Offline Evals: The Golden Set

Before you roll anything out, you need a stable offline suite. You need a "golden set" of real examples—maybe 200 to 1,000 inputs covering key segments and edge cases. You run your model against these and score them.

Use "LLM-as-a-Judge" frameworks here. Let a stronger model evaluate the output of your production model. It's faster than humans and lets you iterate rapidly. But keep human spot checks. Models can be confident and wrong at the same time.

2. Online Experiments: Move Past CTR

Once offline evals look good, you run controlled experiments. Please, for the love of data, avoid "AI on by default for everyone" without a control group. Without a control, you will end up paralyzed, afraid to change prompts because you can't predict what will happen.

Design experiments around business outcomes. Randomly assign half your users to get the AI summary and half to do it the old way. Then track the downstream metric—like deal throughput or win rate—not just whether they clicked "summarize."
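
For illustration, a small sketch of the mechanics, assuming you bucket users by a hash of their id so assignment stays stable across sessions, and that win rate is the downstream metric. Swap in whatever outcome you actually care about.

```python
# Stable 50/50 assignment plus a downstream outcome comparison.
# Hash-based bucketing keeps a user in the same arm across sessions.
import hashlib

def assign_variant(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "ai_summary" if bucket < 50 else "control"

def win_rate(deals: list[dict]) -> float:
    return sum(d["won"] for d in deals) / len(deals)

# In practice these rows come from your warehouse, keyed by the assignment above.
deals_by_variant = {
    "ai_summary": [{"won": True}, {"won": False}, {"won": True}],
    "control":    [{"won": False}, {"won": True}, {"won": False}],
}

for variant, deals in deals_by_variant.items():
    print(f"{variant}: win rate {win_rate(deals):.0%} across {len(deals)} deals")
# Before declaring victory, run a proper significance test (e.g. a
# two-proportion z-test) and let at least one full sales cycle elapse.
```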

3. Production Monitoring: The Living System

No test suite covers everything. You need to treat AI like a living system. Watch for drift. Watch for spikes in hallucination flags.

This is especially important because providers ship new model versions all the time. Your prompt that worked perfectly in May might start outputting garbage in June because the underlying model shifted slightly. You need a dashboard that catches this before your users do.
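
A minimal sketch of that kind of check, assuming you already flag suspected hallucinations somehow (judge model, user reports, heuristics) and log them. The baseline and alert threshold are arbitrary numbers to tune for your product.

```python
# Compare the current hallucination-flag rate against a trailing baseline
# and alert on a spike. Thresholds here are placeholders to tune.
def flag_rate(outputs: list[dict]) -> float:
    return sum(o["flagged_hallucination"] for o in outputs) / len(outputs)

baseline_rate = 0.02       # trailing 30-day average, recomputed on a schedule
alert_multiplier = 2.0     # alert if the current rate more than doubles

this_week = [{"flagged_hallucination": False}] * 95 + [{"flagged_hallucination": True}] * 5
current_rate = flag_rate(this_week)

if current_rate > baseline_rate * alert_multiplier:
    print(
        f"ALERT: hallucination rate {current_rate:.1%} vs baseline {baseline_rate:.1%} "
        "-- did the provider ship a new model version?"
    )
```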

The Portfolio View

If you are a head of product or a platform PM, looking at single features isn't enough. We are wrestling with "AI sprawl." Surveys suggest over 72% of organizations now integrate AI tools into multiple functions. You have disconnected copilots and chatbots popping up everywhere, leading to fragmented experiences and rising costs.

You need a portfolio-level view. Create a scorecard that rolls up the total AI-attributed revenue uplift or cost savings. Look at the aggregate risk posture. How many of your features are touching regulated or high-risk domains? This helps you decide where to double down and where to kill duplicative assistants.
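
The rollup itself can be embarrassingly simple. The feature entries and risk labels below are invented; the point is one view across every AI feature you own.

```python
# Portfolio-level scorecard rollup across AI features. All figures are invented.
features = [
    {"name": "call summarizer",  "annual_uplift_usd": 400_000, "annual_run_cost_usd": 120_000, "high_risk_domain": False},
    {"name": "contract copilot", "annual_uplift_usd": 150_000, "annual_run_cost_usd": 200_000, "high_risk_domain": True},
]

total_uplift = sum(f["annual_uplift_usd"] for f in features)
total_run_cost = sum(f["annual_run_cost_usd"] for f in features)
high_risk = [f["name"] for f in features if f["high_risk_domain"]]
underwater = [f["name"] for f in features if f["annual_run_cost_usd"] > f["annual_uplift_usd"]]

print(f"portfolio uplift ${total_uplift:,} vs run cost ${total_run_cost:,}")
print(f"high-risk features needing extra review: {high_risk}")
print(f"candidates to consolidate or kill: {underwater}")
```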

Managing the New Weirdness

As we move deeper into 2025, the tech is getting weirder and more powerful, which complicates measurement.

We have Reasoning Models now. These things chain thoughts together. Benchmarks show they score incredibly high on math and coding tests, but does that matter for your user? You need to track transparency. Can the user see the reasoning steps? Does the extended wait time result in a better outcome, or just a slower wrong answer?

Then there are Agentic Workflows. Gartner predicts a third of enterprise apps will use autonomous agents by 2028. Measuring an agent is different. You need to track "Autonomous decision quality"—what percentage of agent decisions required a human override? You also need to look at safety in a new way. Is the agent doing what you asked, or is it finding a "clever" shortcut that violates policy?
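
A tiny sketch of that number, assuming you log one record per agent decision with an overridden flag, plus a separate policy_violation flag for the "clever shortcut" problem. Both field names are hypothetical.

```python
# Autonomous decision quality: share of agent decisions a human overrode,
# plus a separate count of policy violations. Dummy data for illustration.
decisions = (
    [{"overridden": False, "policy_violation": False}] * 85
    + [{"overridden": True, "policy_violation": False}] * 12
    + [{"overridden": True, "policy_violation": True}] * 3
)

override_rate = sum(d["overridden"] for d in decisions) / len(decisions)
policy_violations = sum(d["policy_violation"] for d in decisions)

print(f"human override rate: {override_rate:.0%}")   # a rising trend is an early warning
print(f"policy violations this period: {policy_violations}")
```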

And don't forget Multimodal inputs. We are processing text, images, and video all at once now. Measure whether these capabilities are actually enhancing the workflow or just adding complexity. Just because the model can analyze an image doesn't mean it should for every use case.

A Concrete Example

Let's make this real. Imagine you are adding an AI summarizer to a sales call recording product.

Don't just say "Goal: Launch AI summarizer." That's weak.

Step 1: Define the Business Goal

Goal: Increase account executive capacity so each rep can handle 20% more pipeline without lower close rates. Metric: Deals per rep per month.

Step 2: Design Workflow Metrics

Task Success: Percentage of calls where the AI summary is accepted with only minor edits. Guardrail: Customer NPS on "clarity of follow-ups." If the AI writes bad summaries and customers get confused, you lose.

Step 3: Build Offline Evals

Collect 500 real call transcripts. Run your prompt. Use a superior model (or human experts) to grade the summaries for factual correctness and coverage.

Step 4: Run the Experiment

Give the AI to half the sales team. Track their win rates over a full sales cycle.

Step 5: Monitor

Watch for hallucinations. If the AI invents a discount that the sales rep didn't offer, that's a legal exposure. Track the model cost per call. Is the summary worth the 15 cents it cost to generate?

If, after 90 days, you see a 15% lift in deals per rep and stable win rates, you have a story. You can go to the board and say, "We didn't just build a cool feature. We printed money."

The Bottom Line

Leading PM education programs are already teaching AI evaluation as a core skill, not a specialist concern. The teams that win the next few years will be the ones that treat AI evaluation as product work, not as an afterthought.

If your AI feature looks impressive on a slide but you cannot answer "what did this change in the business, by how much, and how do we know," you are still in the vanity phase. And in 2025, the vanity phase is over. The novelty has worn off. It's time to show the receipts.