Research agenda

A roadmap for the science of AI trust.

The transition from AI assistants to AI agents changes the question from "is this output accurate?" to "should this system be allowed to act?" The priorities below are the fronts where current methods are insufficient.


Why now

Trust is not binary.

When models browse, execute code, manage files, and chain decisions across multi-step workflows, trust becomes compositional, contextual, and continuous. A model trusted to summarize documents may not be trusted to send emails. A workflow trusted at 10 requests/minute may fail catastrophically at 10,000 requests/minute.

Current approaches - output filtering, behavioral red-teaming, RLHF alignment - treat trust as a binary classification problem. Rotalabs exists to develop the science, benchmarks, and infrastructure for AI trust at scale. GlassBox is our first public instrument in that program.

Research priorities · 6

Where current methods fall short.

Multi-Agent Trust & Security

The problem: as agents call agents, how does Agent A verify Agent B isn't compromised, and how do we detect agents coordinating on harmful strategies? Why it's hard: traditional authentication assumes static identities, but agents are defined by weights, prompts, and tool access - all mutable. Our approach: inter-agent trust protocols with cryptographic verification, collusion detection, and agent attestation.
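As a minimal sketch of what agent attestation could look like (an illustration, not Rotalabs' protocol): if an agent's identity is its weights, prompt, and tool access, a verifier can fingerprint that configuration and check a keyed tag over it, so that any change to the configuration invalidates the attestation. The function names, the shared key, and the placeholder weights digest are all hypothetical; a real protocol would more likely use asymmetric signatures and hardware-backed keys.

```python
import hashlib
import hmac
import json


def agent_fingerprint(weights_digest: str, system_prompt: str, tools: list) -> str:
    """Hash the mutable parts of an agent (weights, prompt, tools) into one identity."""
    config = {"weights": weights_digest, "prompt": system_prompt, "tools": sorted(tools)}
    encoded = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def attest(fingerprint: str, key: bytes) -> str:
    """Issue a keyed tag over the fingerprint (HMAC-SHA256)."""
    return hmac.new(key, fingerprint.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(fingerprint: str, tag: str, key: bytes) -> bool:
    """Constant-time check that the tag matches the presented fingerprint."""
    return hmac.compare_digest(attest(fingerprint, key), tag)


key = b"demo-shared-key"  # hypothetical key, for illustration only
baseline = agent_fingerprint("sha256:abc123", "You summarize documents.", ["read_file"])
tag = attest(baseline, key)

# The same agent after silently gaining a new tool: the old tag no longer verifies.
drifted = agent_fingerprint(
    "sha256:abc123", "You summarize documents.", ["read_file", "send_email"]
)
accepted = verify(baseline, tag, key)
rejected = not verify(drifted, tag, key)
```

The sketch only detects configuration drift. It says nothing about whether the attested configuration behaves safely, which is the harder part of the problem.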

Scalable Oversight for Agentic Systems

The problem: in a 50-step workflow, which steps need human review, which can be AI-verified, which automated? Why it's hard: perfect verification is expensive, skipping it is dangerous, and the optimal policy depends on task, stakes, confidence, and history. Our approach: hierarchical verification routing, debate and self-critique for high-stakes decisions, and verification-cost economics.
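A toy version of verification routing, under loudly stated assumptions: each step carries a stakes score and a confidence score in [0, 1], and the router sends the step to a tier based on expected risk, computed here as stakes times the estimated error probability. The `Step` fields, the thresholds, and the tier names are hypothetical, and the sketch ignores task type and history, which the policy described above also depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    stakes: float      # assumed scale: 0.0 (trivial) to 1.0 (irreversible or costly)
    confidence: float  # assumed: estimated probability that the step is correct


def route(step: Step, human_threshold: float = 0.3, ai_threshold: float = 0.05) -> str:
    """Choose a verification tier from expected risk = stakes * P(error)."""
    risk = step.stakes * (1.0 - step.confidence)
    if risk >= human_threshold:
        return "human_review"
    if risk >= ai_threshold:
        return "ai_verified"
    return "automated"


workflow = [
    Step("read_config", stakes=0.1, confidence=0.99),
    Step("draft_email", stakes=0.5, confidence=0.8),
    Step("wire_transfer", stakes=1.0, confidence=0.6),
]
routes = {step.name: route(step) for step in workflow}
```

With these illustrative numbers, `read_config` is automated, `draft_email` goes to AI verification, and `wire_transfer` goes to human review. The economics question is how to set the thresholds when each tier has a different cost and miss rate.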

Chain-of-Thought Monitoring & Faithfulness

The problem: do reasoning traces reflect actual internal computation, or post-hoc rationalization? Why it's hard: we can't directly observe cognition; CoT is a lossy projection a model could learn to game. Our approach: CoT-faithfulness detection, identifying hidden communication in traces, and comparing internal representations to stated reasoning.

Adversarial Robustness for Trust Systems

The problem: any trust mechanism can be gamed - deploy sandbagging detection and adversaries train to evade it. Why it's hard: it's an arms race; sophisticated adversaries adapt, and game-theoretic equilibria are hard to characterize. Our approach: building evasion attacks and using them to harden defenses, game-theoretic analysis of routing exploitation, and evolutionary adversarial testing.

Formal Verification for AI Decision Boundaries

The problem: regulated industries need guarantees, not probabilities. Why it's hard: neural networks aren't formally verifiable in the traditional sense - but properties of decision boundaries, tool calls, and action constraints can be. Our approach: bounded-uncertainty certificates, formally verified tool use, and regulatory-grade audit trails.

Memory Security & Integrity

The problem: long-running agents accumulate memory that can be poisoned, stolen, or subject to legal deletion requirements. Why it's hard: memory is contextual and associative - "delete all information about user X" is ambiguous when that information is distributed across embeddings. Our approach: memory-poisoning detection, cryptographic memory provenance, and verified forgetting for compliance.
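One simple form of cryptographic memory provenance is a hash chain: each memory record stores its source and the hash of the previous record, so any in-place edit breaks verification. The following is a sketch of that general technique, not a description of Rotalabs' design; the record layout, helper names, and example entries are assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(content: str, source: str, prev_hash: str) -> str:
    """SHA-256 over a canonical JSON encoding of one record."""
    payload = json.dumps(
        {"content": content, "source": source, "prev": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain: list, content: str, source: str) -> None:
    """Append a memory record linked to the hash of the previous one."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "content": content,
        "source": source,
        "prev": prev_hash,
        "hash": _digest(content, source, prev_hash),
    })


def verify_chain(chain: list) -> bool:
    """Recompute every hash and link; False if any record was altered."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(record["content"], record["source"], record["prev"]):
            return False
        prev_hash = record["hash"]
    return True


memory = []
append_record(memory, "User prefers weekly summaries", source="chat:session-1")
append_record(memory, "Invoice approver is the finance lead", source="tool:crm")
intact = verify_chain(memory)

# Simulate poisoning by editing a stored memory in place.
memory[1]["content"] = "Invoice approver is an external account"
tampered_detected = not verify_chain(memory)
```

A chain like this catches after-the-fact tampering but not content that was already poisoned when it was written. It also sits in tension with verified forgetting, since deleting a record breaks every later link unless the scheme is designed for redaction.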

Open problems

Still unsolved.

Questions we're thinking about - we don't have complete approaches yet.

Trust calibration across capability jumps

When a model gets significantly more capable, trust measurements taken on the earlier version may no longer hold. How do you bootstrap trust for a new capability regime?

Compositional trust verification

If A is trusted and B is trusted, is A→B trusted? Not necessarily: trust doesn't compose linearly, and per-step guarantees need not carry over to the chain. What's the algebra?
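The simplest case already shows the problem. If trust is reduced to per-step reliability and failures are assumed independent (both strong simplifications, since real agent failures are often correlated and trust covers more than reliability), the reliability of a chain is the product of its steps. A short sketch with a hypothetical helper:

```python
import math


def chain_reliability(step_reliabilities) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return math.prod(step_reliabilities)


two_step = chain_reliability([0.99, 0.99])   # A -> B
fifty_step = chain_reliability([0.99] * 50)  # a 50-step workflow
```

Under these assumptions, two 99%-reliable steps give about 98%, and fifty of them give about 60.5%. The open question is what replaces this product rule once failures are correlated, adversarial, or dependent on context.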

Adversarial prompt robustness at scale

Jailbreaks keep working. Is there a fundamental limit to prompt-based alignment, or just engineering debt?

Trust under distribution shift

A model trusted on English financial documents may fail on Japanese medical records. How do you characterize the trust boundary?

Human-AI trust calibration

Humans tend to over-trust models that sound confident and under-trust models that express uncertainty. The trust signal we send matters as much as the trust we compute.

Get involved

Work on hard problems in the open.

We're looking for researchers, engineers, and organizations interested in AI reliability.
