Observability Is Broken for AI Systems
Traces, metrics, and logs were designed for deterministic systems. When an agent makes 40 tool calls across three services to complete a task, your existing observability stack tells you almost nothing useful.
I have a well-instrumented system. OpenTelemetry traces end-to-end, Prometheus metrics on every service, structured JSON logs with correlation IDs. I can trace a request through eight microservices in under a second. I know exactly what broke and when.
Then I added an AI agent layer and my observability became nearly useless for the problems that actually matter.
The traces are there. The logs are there. But the questions I need to answer — why did the agent do that, where did it go wrong, what state was it in when it made that call — those questions don't have answers in my existing instrumentation.
What observability was built for
Traditional observability assumes a deterministic execution graph. A request comes in, it follows a predictable path through your system, you trace that path. When something breaks, the trace shows you where the latency was, which service threw the error, which database query ran slow.
The entire mental model is: deterministic system, observable state, predictable failure modes. Your job as an operator is to instrument the execution and reconstruct what happened from the data.
AI agents break every assumption in that model.
An agent's execution path is not predetermined. Given the same task in a slightly different context, it might make a completely different sequence of tool calls. It might revisit earlier steps. It might take a roundabout path that happens to produce the correct output. It might produce a wrong output with clean, successful traces at every step, because nothing in your system failed; the agent just reasoned incorrectly.
A successful trace through an agent pipeline can mask a completely wrong outcome. In a deterministic system, a clean trace is strong evidence of a correct result, and most engineers assume that holds everywhere. It doesn't.
The three gaps your current stack has
Gap 1: You can trace execution but not reasoning.
When your agent makes a tool call — reads a file, queries a database, calls an API — you can trace that call. Latency, status code, payload. Standard stuff.
What you can't observe: why the agent decided to make that call. What information in its context led to that decision. Whether the information it acted on was correct. Whether its interpretation of the tool's output was accurate.
You have a complete trace of the "what." You have zero observability into the "why." In a deterministic system, the "why" is implicit in the code. In an agent system, the "why" is a sequence of reasoning steps that happened inside a model context window and left no artifact.
Gap 2: Latency is not a meaningful proxy for quality.
Your APM dashboard shows the HTTP response time for calls to your LLM provider. That number is nearly useless as an operational metric.
A fast response can contain completely wrong reasoning. A slow response can mean the model was doing deep, correct analysis of a complex problem. Response time tells you almost nothing about reasoning quality. The metric you care about, whether the agent accomplished the task correctly, is not observable from timing data.
Gap 3: Error rates don't capture agent failure modes.
Your standard error rate metric counts exceptions and HTTP errors. Agent failure modes are mostly invisible to that metric:
- The agent completed its task but did the wrong thing
- The agent got stuck in a loop of redundant tool calls
- The agent confidently produced output based on misunderstood context
- The agent took a 15-step path to something that should have taken 3 steps
None of these show up as errors. They show up as costs you don't understand, latency you can't explain, and outcomes you discover later in review.
What you actually need to instrument
The shift is from instrumenting execution to instrumenting reasoning state.
Capture the full context window at decision points.
When your agent makes a significant decision, such as choosing which tool to call or deciding a task is complete, log the context state that led to that decision. Not just the output. The input: what the agent knew, what it had already done, what it was trying to accomplish.
This is expensive in storage. It's also the only way to reconstruct why an agent did something when you need to investigate it. You're essentially keeping a reasoning journal alongside the execution trace.
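A reasoning journal of this kind might look like the sketch below. The schema is an assumption for illustration: `DecisionRecord`, `ReasoningJournal`, and every field name are hypothetical, not part of any library. The one detail that matters is that each record carries the same trace ID as your execution trace, so the two can be joined later.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class DecisionRecord:
    """One journal entry, captured at a decision point (hypothetical schema)."""
    trace_id: str                            # same ID the execution trace uses
    step: int                                # position in the agent's run
    goal: str                                # what the agent was trying to accomplish
    context_messages: list[dict[str, Any]]   # what the agent knew at that moment
    prior_tool_calls: list[str]              # what it had already done
    decision: str                            # e.g. "call_tool:read_file", "task_complete"
    timestamp: float = field(default_factory=time.time)


class ReasoningJournal:
    """In-memory journal; a real one would write to durable storage."""

    def __init__(self) -> None:
        self._records: list[DecisionRecord] = []

    def record(self, rec: DecisionRecord) -> None:
        self._records.append(rec)

    def for_trace(self, trace_id: str) -> list[DecisionRecord]:
        """Return the decisions for one trace, ordered by step."""
        matches = [r for r in self._records if r.trace_id == trace_id]
        return sorted(matches, key=lambda r: r.step)

    def to_jsonl(self) -> str:
        """Serialize every record as JSON Lines, one decision per line."""
        return "\n".join(json.dumps(asdict(r)) for r in self._records)
```

Storing the full message list at every step is what makes this expensive. Sampling, or storing only a diff against the previous step, are possible ways to reduce that cost, at the price of a harder reconstruction.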
Measure task-level outcomes, not step-level success.
The granularity that matters isn't individual tool calls. It's: did the agent accomplish the task it was given, and how efficient was the path? Define this differently per task type. For a code-generation agent: did the output pass the test suite? How many iterations were required? How many tool calls per correct output? These are the metrics that tell you whether your agent is operating effectively.
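For the code-generation case, those three questions could be turned into numbers along these lines. This is a minimal sketch: `TaskRun` and `summarize` are hypothetical names, and it assumes you already record, per task, whether the test suite passed and how many iterations and tool calls were used.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Outcome of one agent task (hypothetical record shape)."""
    task_id: str
    passed_tests: bool   # did the output pass the test suite?
    iterations: int      # how many attempts were required
    tool_calls: int      # total tool calls spent on the task


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate task-level outcome metrics across a batch of runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    passed = [r for r in runs if r.passed_tests]
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "pass_rate": len(passed) / len(runs),
        # Iterations are averaged over successful runs only.
        "mean_iterations_when_passed": (
            sum(r.iterations for r in passed) / len(passed) if passed else float("nan")
        ),
        # Calls spent on failed runs count too: they are part of the cost
        # of each correct output.
        "tool_calls_per_correct_output": (
            total_calls / len(passed) if passed else float("inf")
        ),
    }
```

Charging failed runs against the correct outputs is a deliberate choice here. It makes the last metric rise when the agent wastes work, even if the pass rate holds steady.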
Track context utilization.
Token count per session is a cost metric. What you want is context utilization rate: what fraction of the context window was spent on work directly relevant to the task versus orientation, re-reading, and redundant operations? A high-quality agent working in a well-structured codebase spends most of its context doing the work. A struggling agent in a messy environment can spend half its context trying to figure out where it is.
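As a rough sketch, the rate is a ratio of token counts once each chunk of context has a label. The labels below are assumptions, and so is the premise that you can label spans at all. In practice that would take heuristics, a classifier, or manual review, and it is the hard part.

```python
from dataclasses import dataclass

# Hypothetical overhead labels; anything not listed counts as task work.
OVERHEAD_LABELS = {"orientation", "re_read", "redundant"}


@dataclass
class ContextSpan:
    """A labeled chunk of the context window."""
    label: str    # e.g. "task_work", "orientation", "re_read", "redundant"
    tokens: int


def context_utilization(spans: list[ContextSpan]) -> float:
    """Fraction of context tokens spent on work directly relevant to the task."""
    total = sum(s.tokens for s in spans)
    if total == 0:
        return 0.0
    useful = sum(s.tokens for s in spans if s.label not in OVERHEAD_LABELS)
    return useful / total
```

A session with 600 tokens of task work, 300 of orientation, and 100 of re-reading would score 0.6 under this definition.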
Instrument for backtracking and loops.
When an agent returns to a tool it already called, with the same or similar inputs, flag it. Loops are one of the more expensive agent failure modes and they're largely invisible without explicit instrumentation. A simple counter on repeated tool calls per session gives you an early signal.
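The counter could be as small as the sketch below. `RepeatCallDetector` and its default threshold are hypothetical. It only catches exact repeats, after normalizing argument key order; "similar" inputs would need a fuzzier key that is not attempted here.

```python
import json
from collections import Counter
from typing import Any


class RepeatCallDetector:
    """Counts identical tool calls within one session and flags repeats."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold   # flag on the Nth identical call
        self._counts = Counter()

    def observe(self, tool: str, args: dict[str, Any]) -> bool:
        """Record one call; return True once it has repeated `threshold` times."""
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} the same call;
        # default=str keeps non-JSON values from raising.
        key = tool + ":" + json.dumps(args, sort_keys=True, default=str)
        self._counts[key] += 1
        return self._counts[key] >= self.threshold
```

Create one detector per session and call `observe` from the same wrapper that already traces tool calls, so the flag lands next to the span it describes.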
The deeper problem: evaluation is the missing layer
What I've described above handles operational observability — are my agents working correctly in production. There's a more fundamental gap: most teams have no continuous evaluation layer at all.
For deterministic systems, your test suite is your evaluation layer. Run it, pass or fail, merge or don't. The system's behavior is defined by the tests.
An agent's behavior is defined by the interaction between the model, the prompt, the tools, and the incoming context — none of which are fully captured by a unit test suite. Your agent can pass all its tests and silently regress in production when the model provider updates the underlying model, when the tool schema changes, or when the real-world inputs differ from your test cases in ways you didn't anticipate.
The teams that have solved this run continuous evaluation against a fixed set of representative tasks — real tasks from their backlog, not synthetic benchmarks — and track quality metrics over time. Not as a one-time eval before shipping. As a persistent signal that runs on every deployment and alerts when agent quality drops.
This isn't optional for systems where the agent is doing consequential work. It's the equivalent of your health checks, but for reasoning quality.
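The shape of such a check, reduced to its core, might look like this. Everything here is an assumption for illustration: `run_eval`, `quality_dropped`, the grading function, and the 0.05 tolerance are hypothetical, and a real setup would also store scores per task and per model version.

```python
from statistics import mean
from typing import Callable


def run_eval(
    tasks: dict[str, str],
    agent: Callable[[str], str],
    grade: Callable[[str, str], float],
) -> float:
    """Run the agent on a fixed task set and return the mean score in [0, 1].

    tasks maps a task ID to its prompt; grade(task_id, output) scores one output.
    """
    return mean(grade(task_id, agent(prompt)) for task_id, prompt in tasks.items())


def quality_dropped(history: list[float], current: float, tolerance: float = 0.05) -> bool:
    """True if the current score is more than `tolerance` below the historical mean."""
    if not history:
        return False   # nothing to compare against yet
    return current < mean(history) - tolerance
```

Run `run_eval` on every deployment, append the score to the history, and alert when `quality_dropped` returns true. Comparing against a running mean rather than the last score alone makes the alert less sensitive to a single noisy run.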
What the tooling landscape looks like right now
The good news: the category is real and maturing. LangSmith, Braintrust, Langfuse, and Arize all offer observability tooling that extends beyond standard APM for LLM-based systems. They're not complete solutions, but they've built for the gaps I've described — context capture, quality metrics, eval pipelines.
The bad news: in my experience, none of them slots cleanly into an existing observability stack. You end up with two parallel systems, OpenTelemetry for your services and something agent-specific for your AI layer, with manual correlation between them. It works, but it's fragile.
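The manual correlation usually comes down to joining the two record streams on a shared ID. The sketch below assumes both systems can export records as dictionaries carrying a `trace_id` field. That field name and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Any

Record = dict[str, Any]


def correlate(
    spans: list[Record], decisions: list[Record], key: str = "trace_id"
) -> dict[str, dict[str, list[Record]]]:
    """Group execution spans and agent decision records by a shared trace ID."""
    merged = defaultdict(lambda: {"spans": [], "decisions": []})
    for span in spans:
        merged[span[key]]["spans"].append(span)
    for decision in decisions:
        merged[decision[key]]["decisions"].append(decision)
    return dict(merged)


def orphans(merged: dict[str, dict[str, list[Record]]]) -> list[str]:
    """Trace IDs that appear in only one of the two systems."""
    return sorted(
        trace_id
        for trace_id, group in merged.items()
        if not group["spans"] or not group["decisions"]
    )
```

The orphan list is a cheap fragility check: every ID in it is a request where one of the two systems lost the thread.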
The teams that have this working well have treated it as a first-class engineering problem, not an afterthought. They've built custom instrumentation that captures the reasoning state they care about, integrated it into their existing trace infrastructure, and defined quality metrics specific to their agent's task types.
That's more work than dropping in a library. It's also the work that separates teams that know their AI systems are operating correctly from teams that are hoping they are.