Your AI System Is Running. That Doesn't Mean It's Working.
I watched a consulting firm implement an agentic AI system for proposal generation and contract compliance. The vendor promised magic: connect your data sources, and the system would help with past performance generation and contract review.
The monitoring dashboard showed green across the board. 99% uptime. Response times under 200ms. Zero errors logged.
The AI was fabricating case studies.
When asked to list contracts meeting specific criteria, it returned incomplete results—sometimes missing half the relevant agreements. Not consistently. Randomly. Different contracts dropped each time.
The system was operationally healthy but behaviorally wrong.
Traditional monitoring tools told them the service was up. What they needed to know was whether the service was behaving correctly. Two fundamentally different questions requiring different instruments.
The 200 OK Status Lie
An AI agent can have 99% uptime and still fail to follow user intent.
Traditional monitoring confirms a request succeeded with a 200 OK status and acceptable latency. It cannot detect when an agent selects the wrong tool, gets trapped in a reasoning loop, or confidently generates information that is false.
The technical infrastructure says "working" while the reasoning layer says "broken."
This isn't a theoretical problem. Over 50% of organizations have already deployed AI agents, but most lack continuous runtime monitoring of how these systems actually behave in production.
You're flying on instruments that measure altitude and fuel but can't tell you whether you're headed toward the right destination.
How the Discovery Actually Happens
The proposal generation system I mentioned ran for weeks before anyone caught the fabrications. The discovery didn't come from monitoring alerts or automated tests.
It came from an insurance audit.
The team needed contracts meeting specific criteria. The AI returned a list. Someone on the team knew there had to be more contracts than what appeared. They started digging. Manual research. Subject matter expert review. The gaps became obvious.
No matter what prompt engineering they tried, the problem persisted. The reality: there were too many contracts for a search-based RAG tool to reason over with that level of detail at runtime.
The system never logged a warning. It just quietly underperformed.
This is what AI introduces that traditional software doesn't: silent degradation. A model stays technically operational while gradually producing outputs that have quietly stopped being useful.
The Inference Gap Traditional Tools Can't See
Typical RAG search systems used by most agents retrieve context on the fly. They work well for simple use cases where the answer is stated explicitly in a single document.
They underperform when asked a question that requires understanding context and relationships between concepts that aren't explicit in the document itself.
Humans infer relationships almost instantly when answering questions. We connect dots that aren't drawn for us. RAG search systems look for similarity—which breaks down the moment you need synthesis instead of retrieval.
How did we discover this limitation? Trial and error. There wasn't a log or metric that raised suspicion. The model itself felt confident it had found everything.
You can't alert on what you can't measure. And you can't measure reasoning quality with infrastructure metrics.
When One Error Cascades Through the Entire Workflow
I've seen a customer ask a question about a contract clause. The request was misinterpreted at step one. That misinterpretation shaped the retrieval in step two, which determined the reasoning in step five, which drove the tool call in step eight.
The final output was wrong. But you couldn't trace it back to the initial mistake without behavioral telemetry.
Traditional monitoring sees the symptom. It can't see the causal chain.
Agent failures compound across steps. A flawed retrieval early in the process creates downstream reasoning errors that appear unrelated. Without visibility into the agentic layer, you're debugging in the dark.
What Traditional Monitoring Actually Tells You
I've asked Prometheus, Datadog, and similar tools plenty of questions they fundamentally can't answer:
Which pieces of organizational knowledge did this agent actually rely on when it produced this answer, and how did it combine them step by step?
These tools can tell you CPU, memory, latency, error rates, and which endpoint or model was called. They cannot tell you:
Which proposals, decks, emails, or graph nodes the AI treated as relevant inside your firm's institutional memory
How it traversed your knowledge graph or workflows—what it looked at first, what it discarded, what it doubled down on
Where it overrode or ignored your consulting playbooks and methodologies
Infrastructure metrics answer whether your system is running. They don't answer whether it's running correctly.
That gap is the difference between operational health and behavioral correctness.
The Telemetry Explosion You're Not Prepared For
A typical RAG pipeline hitting a vector database, retrieving context, calling an LLM, and post-processing the response generates 10-50x more telemetry data than an equivalent traditional API call.
AI workflows create vastly more decision points that need monitoring. Each one is a potential failure node.
But volume isn't the problem. The problem is that traditional tools were designed for systems that fail discretely and loudly. Modern AI systems fail continuously and quietly.
Threshold-based monitoring doesn't work when there's no threshold to cross. The system just drifts.
What Behavioral Telemetry Actually Looks Like
When I talk about capturing what the model actually did with the context, I mean recording a full behavioral trace of the agent:
Which parts of the firm's knowledge graph it touched
What intermediate plans and tool calls it made
Which documents it relied on
Which fields or artifacts it ultimately changed
Instead of just storing prompts and answers, you log perception, reasoning steps, actions in your stack, and the guardrails that fired.
That gives you replayable, auditable telemetry for every AI-driven workflow.
You stop asking "did it load?" and start asking "did it understand and store this in the right place, with the right meaning and protections?"
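As a concrete sketch, here's one way to record such a trace. Everything below is illustrative: the event kinds, the field names, and the BehavioralTrace structure are assumptions made for the example, not any particular product's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import json

@dataclass
class TraceEvent:
    """One step of agent behavior: a retrieval, a reasoning step,
    a tool call, or a guardrail firing."""
    step: int
    kind: str     # "retrieval" | "reasoning" | "tool_call" | "guardrail"
    detail: dict  # e.g. graph nodes touched, documents scored, tool arguments

@dataclass
class BehavioralTrace:
    """A replayable record of everything the agent did for one request."""
    request_id: str
    user_intent: str
    events: list[TraceEvent] = field(default_factory=list)

    def log(self, kind: str, **detail) -> None:
        self.events.append(TraceEvent(len(self.events), kind, detail))

    def to_json(self) -> str:
        return json.dumps({
            "request_id": self.request_id,
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "user_intent": self.user_intent,
            "events": [vars(e) for e in self.events],
        })

# Usage: the agent emits one trace per request, alongside the answer.
trace = BehavioralTrace("req-042", "List EU banking contracts, 2022-2024")
trace.log("retrieval", graph_nodes=["client:acme", "sector:banking"],
          documents=["msa-2022-007.pdf", "sow-2023-112.pdf"])
trace.log("reasoning", plan="filter by region, then by opex reduction")
trace.log("guardrail", name="confidential_data_filter", fired=False)
print(trace.to_json())
```

The point is the shape of the record: every retrieval, plan, tool call, and guardrail decision lands in one replayable object keyed to the request, so a wrong answer can be traced back to the step that caused it.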
How Testing Changes When You Measure Behavior
If you're measuring "is it behaving correctly" instead of "is it responding," your testing approach shifts in concrete ways:
From connectivity checks to semantic checks
Instead of: "Did we connect to SharePoint and pull N documents?"
You test: "For this specific proposal, did the system extract the right client, industry, deal size, and outcomes, and link them to the correct project node in the knowledge graph?"
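As a sketch, the semantic version can be an ordinary pytest case. Here ingest_proposal, the fixture file, and the field names are all hypothetical stand-ins for your pipeline's real entry points:

```python
# ingest_proposal is a hypothetical entry point into your ingestion
# pipeline; the fixture path and field names are illustrative.
def test_proposal_extraction_is_semantically_correct():
    result = ingest_proposal("fixtures/acme-cost-out-2023.pdf")

    # Connectivity is not the question; meaning is.
    assert result.client == "Acme Bank"
    assert result.industry == "banking"
    assert result.deal_size_eur == 2_400_000
    assert "12% opex reduction" in result.outcomes
    # The record must link to the right project node, not a near-match.
    assert result.graph_node_id == "project:acme-cost-out-2023"
```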
From "no errors" to "no silent corruption"
Instead of: "The ingestion job finished without exceptions."
You test: "Did we drop sections that matter—pricing, risks, lessons learned—because of a parser edge case? Did we mis-tag confidential client data so it becomes searchable by the wrong team?"
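A sketch of that check, again with a hypothetical ingest_contract entry point and illustrative section names and access tags:

```python
# ingest_contract is hypothetical; section names and tags are illustrative.
def test_no_silent_corruption_on_ingestion():
    doc = ingest_contract("fixtures/msa-with-pricing-annex.pdf")

    # "Finished without exceptions" is not enough: the sections that
    # matter must actually survive parser edge cases.
    for section in ("pricing", "risks", "lessons_learned"):
        assert section in doc.sections, f"parser silently dropped '{section}'"

    # Confidential material must keep the tag that scopes who can search it.
    assert doc.access_tag == "client-confidential"
```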
From throughput to future queryability
Instead of: "We can ingest 10k docs/hour with latency under X."
You test: "After ingestion, can a consultant retrieve this case by asking in natural language: 'Show me European banking cost-out projects 2022–2024 with greater than 10% opex reduction'—and does the system actually hit the right graph slice?"
You turn ingestion tests into search tests. Define scenario queries and expected document sets. Run them after ingestion and fail the build if recall drops below thresholds.
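A minimal sketch of that build gate, assuming a hypothetical search function and an illustrative scenario set; the threshold and expected documents are examples, not recommendations:

```python
# Scenario queries paired with the documents a correct system must surface.
# search is a hypothetical query entry point into your retrieval stack.
SCENARIOS = {
    "European banking cost-out projects 2022-2024 with >10% opex reduction":
        {"acme-cost-out-2023.pdf", "nordbank-opex-2022.pdf"},
}
RECALL_THRESHOLD = 0.9  # fail the build if recall falls below this

def test_post_ingestion_recall():
    for query, expected in SCENARIOS.items():
        retrieved = {doc.name for doc in search(query, top_k=20)}
        recall = len(expected & retrieved) / len(expected)
        assert recall >= RECALL_THRESHOLD, (
            f"recall {recall:.0%} for {query!r}; missing {expected - retrieved}"
        )
```

Run this suite after every ingestion job. A green build then means the new documents are actually findable, not merely loaded.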
The Cost of Silent Failures
Organizations without AI observability face significant financial exposure. Individual documented incidents have cost $1.5 million or more.
When expertise-based systems fail quietly, the blast radius is massive. A fabricated case study in a proposal. An incomplete contract list during an audit. A misinterpreted clause that cascades into bad advice.
The failure doesn't announce itself. It just compounds.
Why Deployment Is Outpacing Visibility
Most organizations are deploying AI agents faster than they're building the observability to monitor them.
The result is a growing surface of silent failures, data exposure, and uncontrolled automation.
You're not just facing a monitoring gap. You're facing an institutional risk that scales with every agent you deploy.
Traditional monitoring was built for predictable systems. AI breaks that model completely. No error codes. No deterministic paths. Just subtle drift, hallucinations, and non-reproducible edge cases that legacy sampling strategies will never catch.
You need instruments designed for the system you're actually running.
What This Means for Your Firm
If you're deploying AI agents in your consulting practice, ask yourself:
Can you trace how your AI arrived at a specific recommendation?
Do you know which organizational knowledge it relied on and which it ignored?
Can you detect when it's confidently wrong before a client does?
If the answer is no, you're not monitoring AI behavior. You're monitoring infrastructure health and hoping the rest works out.
That's not a strategy. That's exposure.
The firms that recognize this gap early get an advantage. The firms that wait get disruption.
Your AI system might be running. But unless you can see what it's actually doing with your institutional memory, you have no idea if it's working.