The security conversation around AI agents is missing the point.

Scroll through any CISO’s 2026 predictions and you’ll see the same concerns: agent identity, zero trust for non-human entities, least privilege access, prompt injection. These are real issues. They’re also the easy part.

The harder problem isn’t whether an agent is authorized to take an action. It’s whether the agent should take that action - whether its reasoning is sound, whether its confidence is warranted, whether it knows the limits of its own knowledge.

We’re deploying agents that make decisions with real consequences. They don’t just generate text; they execute. They book flights, move money, write code that goes to production, recommend courses of action that humans follow. And they do all of this with the same confident tone whether they’re right or catastrophically wrong.

This is the uncertainty problem, and almost nobody is working on it.

The Confidence Gap

Large language models are miscalibrated by design. When a model says “I’m 90% confident,” that number is essentially meaningless. It doesn’t track actual accuracy. A model can be 90% confident and wrong 40% of the time. Or right 99% of the time. The expressed confidence and the actual reliability are decoupled.
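To see what “decoupled” means in practice, calibration is something you can measure. Here’s a minimal sketch, assuming you have a log of (stated confidence, was actually correct) pairs for a model - a hypothetical dataset, not anything from a specific system: bin the stated confidences, compare each bin’s average confidence to its observed accuracy, and the gap is the miscalibration.

```python
# Minimal sketch: expected calibration error (ECE) over a hypothetical log of
# (stated_confidence, was_correct) pairs. A well-calibrated model's 0.9 bucket
# should be right ~90% of the time; the gap is what "miscalibrated" means.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of samples in the bin
    return ece

# e.g. a model that says 0.9 but is right only 60% of the time:
conf = [0.9] * 10
hits = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(expected_calibration_error(conf, hits))  # ~0.3
```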

For chatbots, this is annoying. For agents, it’s dangerous.

```mermaid
flowchart LR
    subgraph reality [What Actually Happens]
        A1[Agent receives task] --> A2[Processes with hidden uncertainty]
        A2 --> A3[Outputs confident answer]
        A3 --> A4[Human trusts confident output]
        A4 --> A5[Action taken]
    end
    U[Actual uncertainty never surfaced] -.->|Invisible| A2
    style U fill:#f66,stroke:#333
    style A3 fill:#9f9
```

The confidence gap: internal uncertainty never surfaces. Humans see confident output and assume competence.

Consider what’s happening right now across enterprises: agents are being deployed for customer support, code review, financial analysis, legal research. Each of these involves judgment calls. Each involves situations where the right answer is “I don’t know enough to decide this” - but agents almost never say that. They produce an answer. They sound confident. They move on.

The humans in the loop see confident output and assume competence. Why wouldn’t they? The agent completed the task. It didn’t flag uncertainty. It didn’t ask for help. It just… decided.

This is how you get agents that confidently execute the wrong action, humans who trust the confident output, and errors that compound through systems before anyone notices.

Why Identity Isn’t Enough

The current security playbook for agents borrows from identity and access management. Treat agents like employees: verify who they are, limit what they can access, log what they do. Microsoft’s pushing it. CyberArk’s pushing it. NIST just dropped an RFI on agent security that reads like an IAM framework with “user” find-and-replaced with “agent.”

This addresses one failure mode: agents doing things they shouldn’t be allowed to do. It doesn’t address another: agents doing things they’re allowed to do, but shouldn’t.

```mermaid
flowchart TD
    subgraph current [Current Security Model]
        Q1{Is agent authenticated?}
        Q2{Is action authorized?}
        Q3{Is action logged?}
        Q1 -->|Yes| Q2
        Q2 -->|Yes| Q3
        Q3 -->|Yes| E1[Execute]
    end
    subgraph missing ["What's Missing"]
        Q4{Is reasoning sound?}
        Q5{Is confidence warranted?}
        Q6{Should agent ask for help?}
        Q4 -.->|Not checked| X1["???"]
        Q5 -.->|Not checked| X2["???"]
        Q6 -.->|Not checked| X3["???"]
    end
    style Q4 fill:#ff9
    style Q5 fill:#ff9
    style Q6 fill:#ff9
```

Current security checks authorization. It doesn’t check whether the agent should act in this specific instance.

An agent with permission to execute trades can still make a terrible trade. An agent authorized to respond to customers can still give confidently wrong advice. An agent cleared to summarize intelligence can still miss the key insight or overstate confidence in weak sources.

Permissions are necessary. They’re not sufficient. You can lock down what an agent can do and still have no insight into whether the agent should do it in any given instance.

The Problem With Verbalized Confidence

One obvious response: just ask the agent how confident it is. Include uncertainty in the output. Force the model to express doubt.

This doesn’t work, and we have the research to prove it.

When you prompt a model to express confidence, you get verbalized confidence - the model’s stated belief about its own reliability. Studies consistently show this is poorly calibrated. Models are reluctant to express uncertainty. When they do express it, the levels don’t track accuracy. A model that says “I’m fairly confident” isn’t meaningfully more likely to be correct than one that says “I think” or “probably.”

Reinforcement learning from human feedback makes this worse. RLHF optimizes for responses that humans rate highly, and humans rate confident-sounding responses higher than hedged ones. We’ve trained models to sound sure of themselves even when they shouldn’t be.

You can’t solve this by prompting harder. The miscalibration is structural.

Looking Inside the Model

The alternative is to stop asking the model what it thinks and start observing what it’s actually doing.

During inference, transformer models build internal representations layer by layer. These representations encode more than what surfaces in the output. They encode the model’s processing state - including, it turns out, information that correlates with reliability.

```mermaid
flowchart TD
    subgraph model [Transformer Inference]
        I[Input] --> L1[Layer 1]
        L1 --> L2[Layer 2]
        L2 --> LN[Layer N]
        LN --> LK[Layer K]
        LK --> O[Output]
    end
    LK -.->|Extract activations| P[Uncertainty Probe]
    P --> U{Actual uncertainty?}
    U -->|Low| OK[Proceed]
    U -->|High| Flag[Flag for review]
    style P fill:#9f9
    style Flag fill:#ff9
```

Don’t ask the model - observe it. Activation probes can detect uncertainty the model doesn’t verbalize.

The research direction I’ve been pursuing asks a simple question: can you train probes on intermediate activations to predict when a model is likely to be wrong?

Early results suggest yes. There are directions in activation space that correlate with response accuracy, independent of what the model says about its own confidence. You can detect uncertainty that the model doesn’t verbalize. You can catch cases where the model “knows” it’s on shaky ground but produces confident output anyway.
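The core recipe is simpler than it sounds. Here is a minimal sketch, assuming a small open model (gpt2 as a stand-in), an arbitrary probe layer, and an offline set of prompts labeled with whether the model answered them correctly - all placeholder assumptions, not a specific published setup.

```python
# Minimal sketch: fit a linear probe on hidden-state activations to predict
# whether the model's answer will be correct. Model name, probe layer, and the
# labeled dataset are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"   # stand-in; any causal LM with accessible hidden states
PROBE_LAYER = 8       # assumption: a mid-to-late layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at PROBE_LAYER."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, dim]
    return out.hidden_states[PROBE_LAYER][0, -1]

def train_probe(labeled_examples):
    """labeled_examples: list of (prompt, was_model_correct) pairs collected offline."""
    X = torch.stack([last_token_activation(p) for p, _ in labeled_examples]).numpy()
    y = [int(correct) for _, correct in labeled_examples]
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, y)   # learns a direction in activation space that tracks accuracy
    return probe      # probe.predict_proba(x)[:, 1] estimates P(answer is reliable)
```

The probe itself is tiny; the heavy lifting is the forward pass you were already running.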

This isn’t science fiction. It’s a natural extension of interpretability research that’s been progressing for years. The same techniques that let you identify what concepts a model has learned can help you identify when those concepts are being applied shakily.

What This Means for Agents

Apply this to agents and you get something useful: runtime uncertainty monitoring that doesn’t depend on the agent’s self-report.

Imagine an agent system where:

  • Every decision-relevant inference is monitored for activation patterns that predict unreliability
  • High-uncertainty states trigger human review rather than autonomous execution
  • Confidence estimates are calibrated against actual outcomes and updated continuously
  • Agents that “know they don’t know” can be distinguished from agents that are confidently wrong

This isn’t a replacement for identity controls or access management. It’s a different layer - one that addresses whether the agent’s reasoning is trustworthy, not just whether the agent is authorized.

```mermaid
flowchart TD
    subgraph layers [Complete Agent Security Stack]
        L1["Layer 1: Identity<br/>Who is the agent?"]
        L2["Layer 2: Authorization<br/>What can it access?"]
        L3["Layer 3: Behavioral<br/>Is it acting normally?"]
        L4["Layer 4: Uncertainty<br/>Should it act now?"]
        L5["Layer 5: Audit<br/>What did it do?"]
    end
    L1 --> L2 --> L3 --> L4 --> L5
    style L4 fill:#9f9,stroke:#333
```

Uncertainty monitoring is Layer 4 - between behavioral analysis and audit. It catches bad decisions before they execute.

The practical architecture looks like this: lightweight probe models running alongside agent inference, sampling activations at key layers, flagging anomalies that suggest the agent is operating outside its reliable range. The compute overhead is minimal. The information gain is significant.
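As a sketch of what that gate could look like in an agent loop, assuming a trained probe like the one above - run_inference_with_activations, escalate_to_human, and execute are hypothetical hooks standing in for whatever your agent framework actually exposes:

```python
# Minimal sketch of a runtime uncertainty gate. `probe` is a trained classifier
# over activations (e.g., the logistic-regression probe sketched earlier).
# run_inference_with_activations, escalate_to_human, and execute are
# hypothetical hooks, not real framework APIs.
import numpy as np

UNCERTAINTY_THRESHOLD = 0.35   # assumption: tuned against held-out outcomes

def guarded_step(agent, task, probe):
    # Run the agent's normal inference, capturing activations at the probe layer.
    action, activation = run_inference_with_activations(agent, task)

    # The probe estimates P(this inference is reliable) from the activation,
    # independent of whatever confidence the agent verbalizes.
    features = np.asarray(activation).reshape(1, -1)
    p_reliable = float(probe.predict_proba(features)[0, 1])

    if p_reliable < UNCERTAINTY_THRESHOLD:
        # High-uncertainty state: don't execute autonomously.
        return escalate_to_human(task, action, p_reliable)
    return execute(action)
```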

The Harder Problem: Uncertainty Propagation

Single-inference uncertainty is tractable. The harder problem is what happens when agents reason through multiple steps.

Most agent architectures involve chains: observe, reason, plan, act, observe again. Uncertainty at early steps compounds. If the agent’s situation assessment is 70% reliable, and its planning given that assessment is 80% reliable, and its action selection given that plan is 90% reliable, you’re looking at roughly 50% end-to-end reliability. And that product assumes each step performs as well on a flawed upstream output as on a clean one, which is optimistic.

```mermaid
flowchart LR
    subgraph chain [Reasoning Chain]
        S1["Observe<br/>70% reliable"] --> S2["Reason<br/>80% reliable"]
        S2 --> S3["Plan<br/>85% reliable"]
        S3 --> S4["Act<br/>90% reliable"]
    end
    S4 --> E["End-to-end:<br/>~43% reliable"]
    style E fill:#f66
```

Uncertainty compounds through reasoning chains: individually reliable steps (70-90%) yield roughly 43% reliability end to end.

Current agent frameworks don’t track this. They don’t propagate uncertainty through reasoning chains. They don’t distinguish “I’m confident because the whole chain is solid” from “I’m confident but three steps back I was guessing.”

This matters most for the agents taking the most consequential actions: decision-support agents that synthesize information, generate options, evaluate tradeoffs, and recommend actions. Every step is an opportunity for uncertainty to accumulate invisibly.

Solving this requires moving beyond single-point confidence estimates to something more like belief propagation through the agent’s reasoning graph. It’s tractable, but it’s not what anyone is building. The agent frameworks getting deployed in 2026 have no concept of uncertainty propagation. They just execute.
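Even a crude version beats nothing. A toy sketch, assuming each step carries a probe-derived reliability estimate: propagate a running figure through the chain and report that, not the last step’s confidence. The running product is the optimistic baseline from the text; a real system would also model how a flawed upstream output degrades the steps that consume it.

```python
# Toy sketch: propagate per-step reliability estimates through a reasoning chain
# instead of reporting only the final step's confidence. The simple product is
# the optimistic baseline discussed above; it assumes each step's estimate holds
# regardless of what flowed in from upstream.
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    p_reliable: float   # probe-derived estimate for this inference

def chain_reliability(steps):
    """Running product: P(every step so far was reliable)."""
    p = 1.0
    trace = []
    for step in steps:
        p *= step.p_reliable
        trace.append((step.name, step.p_reliable, round(p, 3)))
    return p, trace

steps = [Step("observe", 0.70), Step("reason", 0.80),
         Step("plan", 0.85), Step("act", 0.90)]
p_end, trace = chain_reliability(steps)
for name, p_step, p_cum in trace:
    print(f"{name:8s} step={p_step:.2f} cumulative={p_cum:.3f}")
print(f"end-to-end: {p_end:.2f}")   # ~0.43, even though no single step is below 0.70
```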

Why This Matters Now

Three things are converging:

Agents are actually deploying. 2025 was pilots and proofs of concept. 2026 is production. The MIT Sloan review is right that agents hit the Gartner trough of disillusionment - but disillusionment doesn’t mean abandonment. It means the serious deployments are filtering out the hype and moving forward. Real agents, real decisions, real consequences.

Reasoning models change the stakes. DeepSeek R1 and its descendants show that models can develop sophisticated reasoning through reinforcement learning alone. These models think longer, explore alternatives, self-correct. They’re more capable. They’re also more opaque. A model that reasons through a complex chain before answering gives you more opportunities for something to go wrong invisibly.

The security conversation is stuck. Everyone’s talking about agent identity because that’s what we know how to talk about. It maps to existing frameworks. It’s actionable with existing tools. But it doesn’t address the core problem: agents that are authorized to act but shouldn’t because their reasoning is unreliable in the specific instance.

The organizations that figure out agent uncertainty in 2026 will have a meaningful advantage. Not because they’ll deploy agents faster - everyone’s doing that - but because they’ll deploy agents they can actually trust. And when agent failures start making headlines, the gap between “agents with calibrated confidence” and “agents that just execute” will become very visible.

What’s Missing

I don’t want to oversell this. There are real gaps:

Calibration without ground truth. In production, you often don’t know whether the agent was right until much later, if ever. Calibrating uncertainty estimates requires some ground truth. Novel situations - the ones where uncertainty matters most - are exactly the ones where historical data is sparse.

Cost of monitoring. Activation-level monitoring is cheap but not free. For high-throughput agent systems, even small per-inference costs add up. There’s a tradeoff between monitoring fidelity and operational overhead.

Human-agent interface. Even if you can detect agent uncertainty, communicating it effectively to human operators is its own problem. A dashboard full of confidence scores doesn’t help if operators don’t know what to do with them.

Adversarial robustness. If attackers know you’re using activation monitoring to detect uncertainty, they might find ways to manipulate activations to hide unreliable states. This needs red-teaming that hasn’t happened yet.

These are research problems, not blockers. They’re solvable. But they’re not solved yet.

Where This Goes

The trajectory is clear: agents need metacognition. Not the fake metacognition of “let me reflect on my answer” prompting, but genuine self-knowledge - awareness of their own reliability, calibrated to reality, available at inference time.

The models themselves are getting more capable every quarter. The agent frameworks are maturing. The deployments are scaling. What’s not keeping pace is our ability to know when to trust them.

The current conversation - identity, permissions, audit logs - is about controlling agents. The next conversation needs to be about understanding agents. Not just what they did, but why. Not just what they said, but how confident we should be that they’re right.

That’s the work. Activation-level uncertainty detection. Calibration methods that work without exhaustive ground truth. Propagation through reasoning chains. Interfaces that communicate uncertainty in ways humans can act on.

If you’re building agents that make decisions with real consequences, this should be on your roadmap. The tools are emerging. The research is progressing. The question is whether you’ll have calibrated confidence in your agents before something goes wrong, or after.


Research on this problem is ongoing at rotalabs.ai. If you’re working on agent reliability or uncertainty quantification, I’m interested in comparing notes: [email protected]