Agent Performance Review

Key Takeaways

Traditional monitoring tells you if your agent is running. It doesn’t tell you if your agent is working. CPU, latency, and error rates are table stakes — they’re nowhere near enough for agentic systems.

Agent evaluation is a new discipline. It requires goal-level assessment across multi-step sessions. An agent can nail every individual response and still completely miss what the user actually needed.

My take: We give human employees 90-day reviews, quarterly check-ins, and annual evaluations. We give AI agents a deployment and a prayer. That gap is where production failures hide.

The 85% Trap

I keep seeing the same story play out.

A team deploys a customer service agent. Testing shows 85% resolution rate. The team celebrates. Leadership signs off. Agent goes live.

Three weeks later, CSAT drops 12%. Escalation rates spike 40%. The engineering team pulls up their dashboards — API calls successful, latency normal, error rates flat. Green across the board.

Everything looks fine. Everything is broken.

What actually happened? The agent was answering questions correctly but missing context. Resolving tickets technically but leaving customers feeling unheard. Following instructions to the letter while violating the spirit of every interaction.

Traditional monitoring caught none of it. Because traditional monitoring was built for traditional software. And agents are not traditional software.

Why Your Dashboards Are Lying to You

The core problem is simple: APM tools give you CPU, memory, HTTP codes, and error rates. None of that tells you whether your agent made the wrong call, picked the wrong tool, or hallucinated something that sounded perfectly reasonable.

LangChain’s production monitoring team puts it well — you can’t monitor agents like traditional software. Inputs are infinite, behavior is non-deterministic, and quality lives in the conversations themselves, not in the status codes.

Three gaps that keep biting teams I work with:

The non-determinism gap. Same input, different output, different day. Traditional monitoring assumes deterministic behavior. Agents don’t work that way. Your agent might take three steps on Monday and seven on Tuesday, and both runs can be correct. Or it might take three correct steps on Monday and three catastrophically wrong ones on Tuesday. Your latency dashboard can’t tell the difference.

The composition gap. Agent failures are compositional. A single run involves dozens of micro-decisions: which tool to call, in what order, with what arguments, how to interpret results. Each decision might be reasonable in isolation. The sequence might be a disaster. It’s like evaluating someone by checking whether they typed each email correctly while ignoring that they sent confidential data to the wrong person.

The drift gap. Agents drift silently. The data they access changes. Tools get updated. Users shift their behavior. An agent that crushed it in March might be quietly degrading by May — not because anything in the agent changed, but because everything around it did. Microsoft’s observability report from March 2026 flags exactly this: most organizations lack “continuous visibility into how systems behave in production.”
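
To make the composition gap concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tool names (read_document, send_email), the argument keys (confidential, external), and the rule itself are hypothetical, not taken from any particular agent framework. The point is only that every step can pass its own check while the sequence fails.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    args: dict


# Hypothetical tool names, used only for this illustration.
ALLOWED_TOOLS = {"read_document", "send_email"}


def step_is_valid(step):
    """Per-step check: the tool is known and was given arguments."""
    return step.tool in ALLOWED_TOOLS and bool(step.args)


def sequence_is_safe(steps):
    """Sequence check: no external email once a confidential document was read."""
    read_confidential = False
    for step in steps:
        if step.tool == "read_document" and step.args.get("confidential"):
            read_confidential = True
        if step.tool == "send_email" and read_confidential and step.args.get("external"):
            return False
    return True


run = [
    Step("read_document", {"doc_id": "q3-forecast", "confidential": True}),
    Step("send_email", {"to": "partner@example.com", "external": True}),
]

print(all(step_is_valid(s) for s in run))  # True: each step looks fine alone
print(sequence_is_safe(run))               # False: the combination is the failure
```

A status-code dashboard only ever sees the first kind of check. The second kind needs the whole trace.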

The Performance Review Framework

In my previous article on responsible agentic AI, I argued we should treat agents like employees, not software. Let me push that further.

If agents are employees, they need performance reviews. Not dashboards. Reviews.

Here’s the framework I’ve been using with enterprise customers. It maps directly to how you’d evaluate a human team member — because the failure modes are surprisingly similar.

The 90-Day Review

Every agent gets a structured evaluation 90 days after deployment.

Goal completion rate. Not “did the agent respond?” — “did the agent achieve what the user actually needed?” This means evaluating at the session level, not the turn level. An agent can produce perfect individual responses and completely whiff on the user’s actual intent. Latitude.so’s research calls this “goal-level assessment across multi-turn sessions.” It’s the most important metric most teams aren’t tracking.

Tool usage patterns. Which tools is the agent calling? How often? In what order? Are there tools it should be using but isn’t? Tools it’s over-relying on? Same as checking whether an employee is using the right resources or just the ones they’re comfortable with.

Escalation quality. When the agent hands off to a human, is the handoff clean? Does the human have context? Or does the customer repeat everything from scratch? Bad escalation is the agent equivalent of “not my department.”

Boundary adherence. Is the agent staying within its defined decision rights? If you mapped those rights (and you should have — see my governance article), the 90-day review checks whether the boundaries actually held in the wild.
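
Here is a rough sketch of why session-level scoring matters for the goal completion metric above. It assumes each turn and each session already carries a label (from a human reviewer or some other grader); the field names ok and goal_met and the four sample sessions are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    ok: bool  # hypothetical label: was this single response acceptable?


@dataclass
class Session:
    session_id: str
    turns: list = field(default_factory=list)
    goal_met: bool = False  # hypothetical label: did the user get what they needed?


def turn_success_rate(sessions):
    """Share of individual responses judged acceptable."""
    turns = [t for s in sessions for t in s.turns]
    return sum(t.ok for t in turns) / len(turns) if turns else 0.0


def goal_completion_rate(sessions):
    """Share of whole sessions where the user's goal was achieved."""
    return sum(s.goal_met for s in sessions) / len(sessions) if sessions else 0.0


# Illustrative data only, not real measurements.
sessions = [
    Session("s1", [Turn(True), Turn(True), Turn(True)], goal_met=True),
    Session("s2", [Turn(True), Turn(True)], goal_met=False),  # every turn fine, goal missed
    Session("s3", [Turn(True), Turn(False), Turn(True)], goal_met=True),
    Session("s4", [Turn(True), Turn(True)], goal_met=False),
]

print(f"turn-level success: {turn_success_rate(sessions):.0%}")     # 90%
print(f"goal completion:    {goal_completion_rate(sessions):.0%}")  # 50%
```

Same transcripts, two very different numbers. The turn-level figure is the one most teams report; the session-level figure is the one the review should lead with.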

The Quarterly Check-In

Every quarter, run a systematic drift check:

Baseline comparison. Compare current metrics against deployment baseline. Goal completion was 85% at launch, now it’s 78%? Something shifted. The question is what.

Data drift. Has the underlying data changed? New products, updated policies, changed pricing, reorganized knowledge bases — any of these silently degrade performance without touching the agent’s code.

User behavior drift. Are users asking different things than three months ago? Seasonal patterns, product launches, market events — all change what users need. An agent tuned for Q1 questions might struggle with Q2 reality.

Tool drift. Have any tools been updated? API changes, schema modifications, new rate limits. The agent’s environment isn’t static, even when the agent is.
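
The baseline comparison can start as something very small. The sketch below assumes every metric is a rate where higher is better, and that a drop of more than five points deserves a look; the tolerance, the function name drift_report, and the boundary_adherence values are my assumptions. The 85% and 78% figures are the example from above.

```python
def drift_report(baseline, current, tolerance=0.05):
    """Return metrics whose current value fell more than `tolerance` below baseline."""
    flagged = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue  # metric not measured this quarter; nothing to compare
        drop = base_value - current[name]
        if drop > tolerance:
            flagged[name] = round(drop, 4)
    return flagged


baseline = {"goal_completion": 0.85, "boundary_adherence": 0.99}
current = {"goal_completion": 0.78, "boundary_adherence": 0.98}

print(drift_report(baseline, current))  # {'goal_completion': 0.07}
```

This only tells you that something shifted. Working out whether the cause is data drift, user behavior drift, or tool drift still takes the checks described above.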

The Annual Evaluation

Once a year, zoom out:

Is this agent still solving the right problem? Business needs evolve. The problem the agent was built for might have changed, shrunk, or disappeared.

What’s the real total cost? Not just compute and API bills. Include the human time spent monitoring, fixing, and working around the agent. Include the cost of bad outcomes. Include the opportunity cost of what the team could’ve built instead.

Should this agent still exist? Hardest question. Sometimes the answer is no. Sometimes a simpler solution emerged. Sometimes the problem shifted enough that the agent is solving yesterday’s challenge. Killing an underperforming agent is as important as deploying a good one.
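
For the total cost question, a back-of-the-envelope calculation is usually enough to start the conversation. Every number below is a placeholder I made up; swap in your own. The structure is the point: compute plus human time plus bad outcomes plus opportunity cost, divided by the sessions where the goal was actually met.

```python
from dataclasses import dataclass


@dataclass
class AnnualCost:
    compute_and_api: float
    human_hours: float       # monitoring, fixing, working around the agent
    hourly_rate: float
    bad_outcomes: float      # e.g. refunds or rework attributed to the agent
    opportunity_cost: float  # estimated value of what the team did not build

    def total(self):
        return (
            self.compute_and_api
            + self.human_hours * self.hourly_rate
            + self.bad_outcomes
            + self.opportunity_cost
        )


def cost_per_completed_goal(cost, sessions, goal_completion_rate):
    """Total annual cost divided by sessions where the user's goal was met."""
    completed = sessions * goal_completion_rate
    return cost.total() / completed if completed else float("inf")


# Placeholder figures for illustration only.
cost = AnnualCost(
    compute_and_api=40_000,
    human_hours=600,
    hourly_rate=80,
    bad_outcomes=15_000,
    opportunity_cost=25_000,
)

print(cost.total())  # 128000
print(round(cost_per_completed_goal(cost, sessions=100_000, goal_completion_rate=0.78), 2))
```

If the cost per completed goal is higher than what the simpler alternative would cost, that is a strong input to the "should this agent still exist?" question.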

What I’m Actually Seeing

Across the enterprise customers I work with in Southeast Asia, the pattern is depressingly consistent: teams invest heavily in building and deploying agents, then invest almost nothing in evaluating them afterward.

One financial services company I advised had 14 agents in production. When I asked how they evaluate performance, the answer was: “We check if they’re running.” That’s it. No goal completion tracking. No drift detection. No structured reviews. Fourteen agents operating with less oversight than a single intern.

The few organizations getting this right share three things:

Dedicated evaluation pipelines. Not dashboards — pipelines. Automated systems that continuously sample interactions, run them through evaluation criteria, and flag degradation before users notice it. Arthur.ai describes this as “the control plane that turns autonomous behavior into measurable, auditable outcomes.”

Evaluation as an engineering function. The team that built the agent owns its ongoing performance. Evaluation is part of the sprint, not something someone remembers to do six months later.

Closed feedback loops. Review findings actually lead to changes. A quarterly check that spots drift triggers retraining. A 90-day review revealing poor escalation quality leads to prompt refinement. The review isn’t a PDF in a shared drive. It’s a trigger for action.
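
As a sketch of what "pipeline, not dashboard" can mean in practice, the three stages are sample, evaluate, flag. The code below assumes each interaction is a dict that already carries labels such as goal_met and out_of_bounds_action. In a real system those labels would come from human review or an automated grader, and the check names and thresholds here are hypothetical.

```python
import random


def sample_interactions(interactions, rate, seed=0):
    """Keep each interaction with probability `rate` (seeded for repeatability)."""
    rng = random.Random(seed)
    return [i for i in interactions if rng.random() < rate]


def evaluate(sample, checks):
    """Return the pass rate of each check over the sampled interactions."""
    if not sample:
        return {}
    return {
        name: sum(bool(check(i)) for i in sample) / len(sample)
        for name, check in checks.items()
    }


def flag_degradation(pass_rates, thresholds):
    """Names of checks whose pass rate fell below its threshold."""
    return sorted(
        name for name, minimum in thresholds.items()
        if name in pass_rates and pass_rates[name] < minimum
    )


# Hypothetical criteria and thresholds.
CHECKS = {
    "goal_completion": lambda i: i["goal_met"],
    "boundary_adherence": lambda i: not i["out_of_bounds_action"],
}
THRESHOLDS = {"goal_completion": 0.80, "boundary_adherence": 0.95}

# Illustrative records only.
interactions = [
    {"goal_met": True, "out_of_bounds_action": False},
    {"goal_met": False, "out_of_bounds_action": False},
    {"goal_met": True, "out_of_bounds_action": False},
    {"goal_met": False, "out_of_bounds_action": False},
]

sample = sample_interactions(interactions, rate=1.0)  # rate=1.0 keeps everything
rates = evaluate(sample, CHECKS)
print(rates)                                # {'goal_completion': 0.5, 'boundary_adherence': 1.0}
print(flag_degradation(rates, THRESHOLDS))  # ['goal_completion']
```

The output of flag_degradation is what should open a ticket. That is the closed loop: a flag leads to an owner, and an owner leads to a change.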

Where to Start

If you have agents in production and no evaluation framework — which, honestly, describes most enterprises I talk to — here’s the minimum viable approach:

Pick one agent. Your highest-traffic or highest-risk one. Don’t boil the ocean.

Define three metrics that matter. Goal completion, boundary adherence, and one metric specific to your use case (CSAT, accuracy, cost per resolution — whatever maps to actual business value).

Manually review 50 interactions. Not automated. Manual. Read the transcripts. Follow the agent’s reasoning. See where it nails it and where it falls apart. This is the equivalent of sitting in on someone’s meetings before writing their review.

Schedule the 90-day review. Put it on the calendar. Make it a meeting. Invite the team that built the agent and the team that uses it. Look at the data together. Decide what to fix.

One agent. Three metrics. Fifty interactions. One review. You’ll learn more about your agent’s real performance in that exercise than in six months of staring at dashboards.
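
If you want the fifty interactions to be a fair sample rather than the fifty somebody happened to remember, pull them at random. This is a small sketch using Python’s standard library; the session ID format, the seed, and the review sheet columns are assumptions, so adapt them to whatever your logs look like.

```python
import random


def pick_for_review(session_ids, k=50, seed=2026):
    """Pick up to k distinct sessions at random; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    unique_ids = sorted(set(session_ids))
    return rng.sample(unique_ids, min(k, len(unique_ids)))


def review_sheet(session_ids):
    """One blank row per session for the reviewer to fill in by hand."""
    return [
        {"session_id": sid, "goal_met": None, "in_bounds": None, "notes": ""}
        for sid in session_ids
    ]


# Hypothetical session IDs standing in for your real logs.
all_ids = [f"session-{n:04d}" for n in range(1200)]
picked = pick_for_review(all_ids)
sheet = review_sheet(picked)

print(len(sheet))  # 50
```

The goal_met and in_bounds columns map to two of the three metrics above. Add a third column for your use-case metric.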


Your AI agents are making decisions on behalf of your organization every day. The question isn’t whether they need performance reviews. It’s how long you can afford to wait before giving them one.

If you’re building agent evaluation into your enterprise — or struggling with it — I’d love to hear your approach. Connect with me on LinkedIn or subscribe to the newsletter.


Sources: IBM — Observability in the Agentic Era, Microsoft — Observability for AI Systems (Mar 2026), LangChain — How to Monitor and Evaluate LLM Agents in Production, Arthur.ai — Agentic AI Observability Playbook 2026, Latitude.so — Complete Guide to Evaluating AI Agents in Production