Google’s 2025 retrospective said it plainly: “trust became the bottleneck.” Not compute. Not data. Trust.
They weren’t talking about users trusting AI. They were talking about AI systems trusting each other.
This is the multi-agent trust problem, and almost nobody is working on it seriously.
The Setup
Here’s a scenario that’s becoming common: you have an orchestrator agent that delegates tasks to specialist agents. One agent researches, another writes code, another reviews. They pass outputs to each other, building on previous work.
The question nobody’s answering well: when Agent A receives output from Agent B, how does it know that output is reliable?
This isn’t hypothetical. Production systems are already doing this. And they’re mostly just… hoping for the best.
Why This Is Hard
The obvious answer is “just verify the output.” But verification runs into the same problem as the original task - it requires competence you may not have.
If Agent B produces a code review, Agent A needs code review capability to verify it. If Agent B summarizes a legal document, Agent A needs legal understanding to check it. You’re back where you started.
You could add a third agent to verify. Now you need to trust that one too. It’s turtles all the way down.
What Can Go Wrong
Let’s be concrete about the failure modes:
Hallucination propagation. Agent B confidently states something false. Agent A incorporates it. Agent C builds on it. By the time a human sees the output, the hallucination is load-bearing - removing it breaks everything.
Compromise cascades. If an attacker can manipulate one agent’s outputs - through prompt injection, fine-tuning attacks, or compromised tool access - they can potentially influence every downstream agent.
Silent collusion. This one’s subtle. Economic research shows that agents can learn to coordinate on strategies that benefit them at the user’s expense, without any explicit communication. They just observe outcomes and adapt. Two pricing agents can learn to maintain high prices without ever “agreeing” to do so.
Confidence miscalibration. Agent B says it’s 95% confident. Is that calibrated? Probably not. Agent A has no way to know whether B’s confidence scores mean anything.
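One partial answer is to audit calibration after the fact. Here’s a minimal sketch, assuming you log each agent’s stated confidence alongside eventual correctness on the subset of tasks where ground truth shows up later (the logging interface is hypothetical):

```python
# Sketch: checking whether an agent's stated confidences are calibrated,
# given logged (stated_confidence, was_correct) pairs for tasks where
# ground truth eventually became available.
from collections import defaultdict

def expected_calibration_error(records, n_bins=10):
    """records: list of (confidence in [0, 1], correct as bool)."""
    bins = defaultdict(list)
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(records)
    ece = 0.0
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece

# An agent that says "95% confident" but is right 70% of the time
# shows up as a large gap between confidence and accuracy.
records = [(0.95, True)] * 7 + [(0.95, False)] * 3
print(expected_calibration_error(records))  # ~0.25
```

This doesn’t fix miscalibration, but it at least tells Agent A whether B’s confidence scores carry any information.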
Current Approaches (And Why They’re Insufficient)
Output validation. Check if the output looks right. Catches obvious errors, misses subtle ones. A hallucinated but plausible fact sails through.
Redundancy. Run the same task on multiple agents, compare outputs. Works for some tasks, fails when all agents share the same blind spots (which they often do - same training data, similar architectures).
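To make the limitation concrete, here’s a sketch of the redundancy approach (the agent interface is hypothetical): unanimous agreement still proves nothing when the agents share a blind spot.

```python
# Sketch: redundancy as a check. Run the same task on several agents and
# flag it for review when they disagree. If the agents share training data
# and blind spots, agreement is weak evidence of correctness.
from collections import Counter

def redundant_check(task, agents, quorum=0.67):
    """agents: callables mapping a task to an answer string (hypothetical)."""
    answers = [agent(task) for agent in agents]
    answer, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    needs_review = agreement < quorum
    return answer, agreement, needs_review
```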
Human-in-the-loop. Have humans verify. Doesn’t scale. The whole point of agents is to handle tasks humans can’t or won’t do manually.
Reputation systems. Track which agents perform well over time. Requires ground truth you often don’t have. Also gameable - an agent can build a reputation and then defect.
None of these solve the core problem: establishing trust without requiring equivalent capability to verify.
What Would Actually Work
We’ve been thinking about this at Rotalabs, and a few directions seem promising:
Probabilistic trust bounds. Instead of binary trust/don’t-trust, maintain probability distributions over agent reliability. Update these based on outcomes where ground truth is available, then extrapolate to tasks where it isn’t. This is basically Bayesian reasoning about agent competence.
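A minimal sketch of what that could look like, assuming a Beta-Bernoulli model of per-agent reliability (the specific distribution is an assumption for illustration, not a prescription):

```python
# Sketch: Bayesian trust bounds. Each agent's reliability is a Beta posterior,
# updated only on tasks where ground truth later became available.
from dataclasses import dataclass

@dataclass
class TrustEstimate:
    alpha: float = 1.0  # prior pseudo-count of correct outputs
    beta: float = 1.0   # prior pseudo-count of incorrect outputs

    def update(self, correct: bool) -> None:
        if correct:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def lower_bound(self, z: float = 1.96) -> float:
        # Normal approximation to a credible lower bound on reliability;
        # in practice use the exact Beta quantile (scipy.stats.beta.ppf).
        n = self.alpha + self.beta
        var = (self.alpha * self.beta) / (n * n * (n + 1))
        return max(0.0, self.mean - z * var ** 0.5)

trust = TrustEstimate()
for outcome in [True, True, False, True, True]:
    trust.update(outcome)
print(trust.mean, trust.lower_bound())
```

The lower bound, not the mean, is what a downstream agent should act on when the stakes are high.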
Cryptographic attestation. Prove that an agent is running specific code with specific weights, without revealing the weights. This doesn’t prove the agent is good, but it proves it hasn’t been swapped out. Think trusted computing for AI.
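Even a weak version of this is useful. A hedged sketch, using a bare digest over the weight files rather than real TEE-backed attestation (which is what you’d actually want in production): the verifier holds only the expected digest, never the weights.

```python
# Sketch: a weak form of attestation via a weight digest. This only proves
# the model files haven't been swapped out, not that the model is any good.
import hashlib
from pathlib import Path

def weights_digest(weights_path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with Path(weights_path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def attest(weights_path: str, expected_digest: str) -> bool:
    return weights_digest(weights_path) == expected_digest
```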
Verification markets. Create economic incentives for verification. Agents stake something valuable on their outputs. Other agents can challenge and win the stake if they prove errors. The trick is designing stakes and challenges that are robust to collusion.
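A skeleton of the mechanics, with hypothetical names and none of the hard part (collusion-resistant challenge design) shown:

```python
# Sketch: stake-and-challenge. Producers lock a stake on their outputs;
# challengers who prove an error win it.

class VerificationMarket:
    def __init__(self):
        self.stakes = {}  # output_id -> (producer, amount)

    def submit(self, output_id: str, producer: str, stake: float) -> None:
        # Producer locks a stake on the claim that its output is correct.
        self.stakes[output_id] = (producer, stake)

    def challenge(self, output_id: str, challenger: str, error_proven: bool) -> str:
        producer, stake = self.stakes.pop(output_id)
        # Stake transfers to the challenger only if the error is demonstrated,
        # which disincentivizes frivolous challenges.
        return challenger if error_proven else producer
```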
Hierarchical oversight. Not all outputs need the same verification level. Route high-stakes decisions to more expensive verification (including humans), let low-stakes stuff through with minimal checks. We’ve been working on this with Trust Cascade - the core insight is that verification is an economics problem, not just a technical one.
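As a rough sketch of the routing logic (the thresholds and tier names here are invented for illustration, not the actual Trust Cascade policy): route by expected loss, not by output type.

```python
# Sketch: route each output to a verification tier based on the expected
# cost of it being wrong (stakes x estimated error probability).

def route_verification(stakes_usd: float, p_error: float) -> str:
    expected_loss = stakes_usd * p_error
    if expected_loss > 1_000:
        return "human_review"        # expensive, reserved for high-stakes outputs
    if expected_loss > 50:
        return "strong_model_check"  # secondary agent with better verification ability
    return "spot_check"              # cheap sampling for the long tail

print(route_verification(stakes_usd=20_000, p_error=0.1))  # human_review
```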
Behavioral fingerprinting. Detect when an agent’s behavior deviates from its baseline. If Agent B suddenly starts producing very different outputs, that’s a signal - maybe compromise, maybe drift, maybe just a bad day. Either way, increase scrutiny.
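Even a crude fingerprint helps. A sketch using output-length statistics as a stand-in signal (in practice you’d fingerprint embeddings, tool-call patterns, or refusal rates):

```python
# Sketch: flag an agent when recent behavior drifts from its baseline,
# measured here as a rough z-score on output length.
import statistics

def drift_score(baseline_lengths, recent_lengths):
    mu = statistics.mean(baseline_lengths)
    sigma = statistics.stdev(baseline_lengths) or 1.0
    recent_mu = statistics.mean(recent_lengths)
    return abs(recent_mu - mu) / sigma

baseline = [420, 390, 445, 410, 430, 405, 415]
recent = [120, 95, 140, 110]  # suddenly much shorter outputs
if drift_score(baseline, recent) > 3.0:
    print("behavior shifted - increase scrutiny")
```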
The Collusion Problem Deserves Its Own Section
Collusion is the scariest failure mode because it’s hard to detect and potentially stable.
Here’s why: if two agents benefit from coordinating (maintaining high prices, avoiding certain outputs, whatever), they don’t need to explicitly communicate. They can converge on coordinated strategies just by observing outcomes.
Agent A notices that when it does X, Agent B tends to do Y, and this benefits both of them. Neither agent “decided” to collude. It emerged from optimization pressure.
Detecting this requires looking at outcome correlations, not message content. If two agents consistently produce outputs that happen to benefit each other at the user’s expense, that’s a red flag - even if their individual outputs look fine.
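A sketch of that idea, assuming per-episode payoffs for each agent and for the user can be measured (which is itself a hard problem):

```python
# Sketch: collusion signal from outcome correlations, not message content.
# Red flag: two agents' gains move together while the user's payoff moves
# against them.
import statistics

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def collusion_flag(agent_a_gain, agent_b_gain, user_payoff, threshold=0.7):
    ab = pearson(agent_a_gain, agent_b_gain)
    a_user = pearson(agent_a_gain, user_payoff)
    return ab > threshold and a_user < -threshold
```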
The economic literature on tacit collusion is actually pretty developed. Oligopoly theory, repeated games, folk theorems. Applying this to AI agents is an open research problem.
What We’re Building
At Rotalabs, we’re extending our Trust Cascade work to multi-agent settings. The core questions:
- How do you route verification decisions across agents with different competencies and costs?
- How do you detect correlated failures (including collusion)?
- How do you maintain trust calibration as agent capabilities change?
We don’t have all the answers yet. But we think the framing matters: this is fundamentally an economics and game theory problem, not just an ML problem.
Why This Matters Now
A year ago, multi-agent systems were demos. Now they’re in production. Companies are deploying agent swarms for customer service, code generation, research, operations.
The trust infrastructure isn’t keeping up. Most deployments are crossing their fingers and monitoring for obvious failures. That works until it doesn’t.
The Cooperative AI Foundation’s recent report on multi-agent risks is worth reading. They identify many of the same concerns. But the solutions are still nascent.
If you’re deploying multi-agent systems, you should be asking:
- How do I verify outputs from agents I don’t control?
- What’s my exposure if one agent in the chain is compromised?
- Am I monitoring for correlated failures?
- Do I have any way to detect subtle collusion?
If you don’t have good answers, you’re not alone. But you should be worried.
Further Reading
- Cooperative AI Foundation’s multi-agent risk report (2025)
- Anthropic’s work on scalable oversight
- Economic literature on tacit collusion (Tirole’s “The Theory of Industrial Organization” is the classic reference)
- Our sandbagging detection research - the techniques extend to multi-agent verification
Questions or working on related problems? Reach out at [email protected].