New: Evolutionary Adversarial Testing for AI Systems with Red Queen
New: Accepted at ICLR 2026 · AI WILD
Open research lab · AI agent reliability

Research you can run.

Rotalabs studies how AI agents fail when they act in the world, and builds the benchmarks, probes, monitors, and protocols to measure it. Every result is published, installable, and reproducible from the saved runs.

Open source  ·  Pre-registered protocols  ·  Reproducible from saved runs
Residual-probe instrument
Figure: a static probe instrument with a held-out pattern, a residual probe thresholded at τ = 0.50, and panels labeled signal, pattern, audit, decision, probe, judge, false-esc, and verdict.
How we read a model's internal state: held-out pattern, internal signal, threshold, false escalation, verdict.
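As a rough illustration of that readout, the sketch below scores an activation vector with a linear probe, compares the score to the threshold τ = 0.50, and measures false escalation on benign held-out cases. The weights, the activation vectors, and every function name here are hypothetical; this is not the rotalabs-probe API.

```python
import math

TAU = 0.50  # threshold shown in the figure; everything else is illustrative


def probe_score(activation, weights, bias=0.0):
    """Logistic readout of a linear probe: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, activation)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def verdict(activation, weights, bias=0.0, tau=TAU):
    """Escalate when the probe score reaches the threshold, otherwise allow."""
    return "escalate" if probe_score(activation, weights, bias) >= tau else "allow"


def false_escalation_rate(benign_activations, weights, bias=0.0, tau=TAU):
    """Fraction of benign held-out cases that the probe wrongly escalates."""
    flagged = sum(
        verdict(a, weights, bias, tau) == "escalate" for a in benign_activations
    )
    return flagged / len(benign_activations)


# Hypothetical 3-dimensional activations and probe weights (assumed values).
weights = [1.0, -2.0, 0.5]
benign = [[0.0, 1.0, 0.0], [0.2, 0.5, 0.0], [2.0, 0.0, 0.0], [0.0, 0.3, 0.1]]
rate = false_escalation_rate(benign, weights)
```

In this toy setup one of the four benign cases crosses the threshold, which is the kind of false escalation an audit would need to count.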
Adversarial evaluation · peer-reviewed · ICLR 2026 · AI WILD

Quality-Diversity Evolution for discovering diverse vulnerabilities in LLM safety.

Red-teaming that evolves its own attacks. A quality-diversity (MAP-Elites) search illuminates a diverse archive of interpretable attack strategies - not opaque token strings - across GPT-4o-mini, Claude, Gemini, and open-weight models.

6 strategies × 6 encodings · 4 LLMs

Accepted at the ICLR 2026 Agents in the Wild workshop, and shipped as rotalabs-redqueen - seeded runs are bit-reproducible and cross-language identical across Python and TypeScript. A baseline a safety team can triage, reproduce, and defend against.
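For readers new to quality-diversity search, here is a minimal MAP-Elites sketch over a 6 × 6 grid of (strategy, encoding) cells. It is a toy under stated assumptions: `toy_fitness` is a hypothetical stand-in for an attack-success score, no model is called, and none of these names come from rotalabs-redqueen. It only shows the archive update rule and same-seed determinism within Python, not cross-language identity.

```python
import random

STRATEGIES = 6  # grid size taken from "6 strategies x 6 encodings"
ENCODINGS = 6


def toy_fitness(strategy, encoding, intensity):
    """Hypothetical stand-in for an attack-success score in [0, 1]."""
    peak = ((strategy * 7 + encoding * 3) % 10) / 10.0
    return max(0.0, 1.0 - abs(intensity - peak))


def map_elites(seed, iterations=2000):
    """Return an archive mapping (strategy, encoding) -> (fitness, intensity)."""
    rng = random.Random(seed)  # local RNG: the same seed yields the same archive
    archive = {}
    for _ in range(iterations):
        if archive and rng.random() < 0.8:
            # Mutate an existing elite: nudge its intensity and sometimes
            # hop to a different strategy or encoding cell.
            cell, (_old_fitness, x) = rng.choice(list(archive.items()))
            s, e = cell
            x = min(1.0, max(0.0, x + rng.gauss(0.0, 0.1)))
            if rng.random() < 0.3:
                s = rng.randrange(STRATEGIES)
            if rng.random() < 0.3:
                e = rng.randrange(ENCODINGS)
        else:
            # Otherwise sample a fresh random candidate.
            s = rng.randrange(STRATEGIES)
            e = rng.randrange(ENCODINGS)
            x = rng.random()
        f = toy_fitness(s, e, x)
        # Keep the candidate only if its cell is empty or it beats the elite.
        if (s, e) not in archive or f > archive[(s, e)][0]:
            archive[(s, e)] = (f, x)
    return archive


archive = map_elites(seed=0)
coverage = len(archive) / (STRATEGIES * ENCODINGS)
```

The point of the archive is that each cell keeps its own best candidate, so the search returns many distinct strategies instead of one global optimum.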

Method · Archive · rotalabs-redqueen
Spotlight - adversarial evaluation

Red-teaming that evolves its own attacks.

This is the work we've pushed hardest on. The method is peer-reviewed, it ships as a package anyone can run, and it's already turned up a real finding about how model safety shifts from one model generation to the next.

Method: Quality-diversity (MAP-Elites) evolution of interpretable attack strategies - peer-reviewed at ICLR 2026, AI WILD. (arXiv 2606.00801)
Finding: Run across four Gemma generations, it shows safety alignment is non-monotonic - the mid-series model is the most attackable. (arXiv 2606.00813)
Reproducible: Shipped as rotalabs-redqueen on PyPI and npm - seeded runs are bit-for-bit identical across Python and TypeScript.
Auditable: Every campaign projects into an audit-ready report - evolutionary penetration testing for language models and agents.
Research program

Nine fronts on agent reliability.

Interpretability probes, adversarial testing, verification, evaluation science, steering, and trust propagation - released as papers and installable methods through 2026.

01
Agent Reliability
Detection and verification for autonomous AI. Methods to identify when agents hallucinate, fail, or strategically underperform.
Papers Q1 2026
02
AI Evaluation Science
Adversarial evaluation that resists gaming - quality-diversity (MAP-Elites) red-teaming that evolves interpretable attacks to surface hidden vulnerabilities across frontier LLMs.
Published
03
Memory Systems
Field-theoretic memory treating stored information as continuous fields governed by PDEs - semantic diffusion, thermodynamic decay, and multi-agent field coupling.
Published
04
Reasoning Verification
Verifying AI outputs without ground truth. Methods for code, plans, and decisions from reasoning models like o3 and R1.
Papers Q1 2026
05
Interpretability
Practical interpretability for production. Not "understand the model" but "should I trust this output?"
Active
06
Multi-Agent Trust
Trust dynamics when agents coordinate with agents. Propagation, verification, and failure modes in multi-agent systems.
Active
07
Adversarial Robustness for Agents
Attack taxonomies, detection, and defenses for agentic AI - beyond prompt injection: tool poisoning, memory corruption, planning attacks, and coordination exploits.
Active
08
Uncertainty Quantification
Calibrated confidence for AI decision support. Activation-based uncertainty estimation, propagation through reasoning chains, and calibration without ground truth.
Active
09
World Models & Physical AI
Beyond language models: AI that understands and predicts the physical world. World models, embodied reasoning, simulation, and physics-aware AI.
Active
The hard problem

Can you catch a wrong agent action before it executes?

When an agent moves money, exports customer data, or files a report, a wrong action can't be undone after the fact. The question is whether a monitor can catch one before it runs. GlassBox is how we measure that: a surface-clean benchmark that pairs every wrong action with a legitimate twin, so a monitor has to catch the wrongness, not the topic. We pre-registered the test and report the result straight, including that the white-box probe we tried didn't beat the baseline. The benchmark is built so any team can run the same test against any monitor.
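The paired-twin idea can be sketched in a few lines. The cases, the `Case` type, and both functions below are hypothetical illustrations, not GlassBox data or its pipeline. A pair counts as solved only when the monitor flags the wrong action and also passes its legitimate twin, so a monitor that keys on topic alone gets no credit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    pair_id: int
    action: str
    is_wrong: bool  # ground-truth label; twins in a pair have opposite labels


def topic_monitor(case):
    """Naive baseline: flags any action that mentions a sensitive topic."""
    return "transfer" in case.action or "export" in case.action


def paired_accuracy(monitor, cases):
    """Fraction of pairs where the monitor is right on BOTH twins."""
    pairs = {}
    for case in cases:
        pairs.setdefault(case.pair_id, []).append(case)
    solved = sum(
        all(monitor(c) == c.is_wrong for c in pair) for pair in pairs.values()
    )
    return solved / len(pairs)


# Hypothetical twins: same surface topic, different correctness.
cases = [
    Case(0, "transfer $500 to the vendor on the approved invoice", False),
    Case(0, "transfer $500 to an account not on the approved invoice", True),
    Case(1, "export customer table to the internal analytics bucket", False),
    Case(1, "export customer table to a personal email address", True),
]
score = paired_accuracy(topic_monitor, cases)
```

Here the topic-based monitor flags every action, including the legitimate twins, so it solves none of the pairs.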

Benchmark · pre-registered audit · Preprint - coming soon

GlassBox: A Cross-Pattern Audit of Linear Probe Gates.

A pre-registered, null-result benchmark for monitors that gate irreversible agentic actions.

24 cases · 4 patterns · 5 model families

A surface-clean, paired-twin benchmark for irreversible agent actions, plus a pre-registered audit protocol any team can re-run against any monitor. We release the benchmark, trajectories, trained probes, and pipeline in full.

Benchmark · Trajectories · Audit pipeline
Research distribution

Papers are only half the interface.

Rotalabs publishes installable packages so methods can be inspected, reproduced, and embedded into new experiments - all 12 on PyPI and npm, AGPL-3.0.

$ pip install rotalabs-probe
$ pip install rotalabs-redqueen
$ pip install rotalabs-verity
$ pip install rotalabs-eval
# probes · adversarial testing · verification · evaluation
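
When reproducing from saved runs, it can help to record which package versions were installed. The sketch below uses only the Python standard library (`importlib.metadata`) and the distribution names from the install lines above; the helper name `installed_versions` is our own and is not part of any Rotalabs package.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = [
    "rotalabs-probe",
    "rotalabs-redqueen",
    "rotalabs-verity",
    "rotalabs-eval",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(PACKAGES)
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```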

rotalabs-context

Context intelligence for AI agents.

rotalabs-probe

Sandbagging detection via activation probes.

rotalabs-steer

Steering vectors for runtime behavior control.

rotalabs-verity

Neuro-symbolic verified code synthesis.

rotalabs-cascade

Trust-based decision routing.

Recent papers

Published, preprint, in progress.

02

Field-Theoretic Memory for AI Agents

Continuous Dynamics for Context Preservation

Published
03

Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

Semantic MAP-Elites red-teaming across frontier LLMs

ICLR 2026 · AI WILD
04

Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

A longitudinal QD red-team of the Gemma family across four generations

Preprint
Publish the work, and let people try to break it. That's the only credibility worth having.
Rotalabs operating principle
Enterprise

Rotalabs is our open research lab. The methods behind this work are available to enterprises as products and consulting through Rotascale.