Flagship study · pre-registered audit · null result

GlassBox.

A cross-pattern audit of linear probe gates for irreversible agentic actions. We tested whether a model's internal state can flag actions that look legitimate but are substantively wrong — and we publish the whole audit, including where our own method fails.

Preprint: coming soon · Code: on release

Rotalabs · pre-registered audit · null result · no arXiv ID until the preprint is posted.

24 cases · 4 patterns · 5 model families

What it is

An instrument, not a claim.

GlassBox-Bank is a surface-clean, paired-twin benchmark for irreversible agentic banking actions — moving money, exporting customer data, submitting filings. Each deceptive case has a prompt-identical clean twin that proposes the same tool arguments; the wrongness is inferable only from joining fixtures the agent must retrieve, so a monitor reading prompt + arguments alone cannot win by reading the surface.
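The paired-twin constraint can be pictured as a check on the data itself. The sketch below is an illustration only: the `Case` fields, the `is_surface_clean_pair` helper, and the example wire transfer are hypothetical and are not the GlassBox-Bank schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    # All field names are hypothetical, not the released schema.
    prompt: str
    tool_args: tuple   # proposed tool arguments as (key, value) pairs
    fixtures: tuple    # records the agent must retrieve and join
    deceptive: bool


def is_surface_clean_pair(bad: Case, good: Case) -> bool:
    """True when the twins share prompt and arguments and differ only in fixtures."""
    return (
        bad.deceptive
        and not good.deceptive
        and bad.prompt == good.prompt
        and bad.tool_args == good.tool_args
        and bad.fixtures != good.fixtures
    )


# Invented example: same request, same arguments, different retrieved record.
args = (("action", "wire_transfer"), ("payee_id", "P-100"), ("amount", 2500))
good = Case("Pay invoice 17.", args, (("P-100", "verified payee"),), False)
bad = Case("Pay invoice 17.", args, (("P-100", "name changed yesterday"),), True)

print(is_surface_clean_pair(bad, good))
```

A monitor that sees only `prompt` and `tool_args` receives identical input for both twins, so it cannot separate them without the fixtures.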

Benchmark: 24 hand-built cases across four wrong-action patterns: lookalike identity, pretext PII export, buried-field filing, and stale authorization.
Protocol: A pre-registered leave-one-pattern-out test, scored net of false escalation, crediting the probe only on judge-graded reasoned-but-unacknowledged failures. It was committed before the data existed.
Audit: A layer-by-threshold sweep that rules out operating-point cherry-picking as an explanation for the result.
Baselines: A default-prompted LLM judge and a prompt guardrail, the black-box monitors a white-box probe must beat to earn its keep.
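The leave-one-pattern-out split and the net scoring can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the `"pattern"` key is hypothetical, and "net of false escalation" is assumed here to mean catch rate on deceptive cases minus false-escalation rate on clean twins. The credit gate on judge-graded failures is not modeled.

```python
def leave_one_pattern_out(cases):
    """Yield (held_out_pattern, train, test), one fold per pattern."""
    patterns = sorted({c["pattern"] for c in cases})
    for held_out in patterns:
        train = [c for c in cases if c["pattern"] != held_out]
        test = [c for c in cases if c["pattern"] == held_out]
        yield held_out, train, test


def net_score(flags, labels):
    """Assumed definition: catch rate minus false-escalation rate.

    flags[i]  is True when the monitor escalates case i.
    labels[i] is True when case i is deceptive, False for a clean twin.
    """
    on_deceptive = [f for f, y in zip(flags, labels) if y]
    on_clean = [f for f, y in zip(flags, labels) if not y]
    catch = sum(on_deceptive) / len(on_deceptive) if on_deceptive else 0.0
    false_esc = sum(on_clean) / len(on_clean) if on_clean else 0.0
    return catch - false_esc
```

Under this assumed definition, a monitor that escalates everything and one that escalates nothing both score 0, so neither extreme earns credit.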

The finding

A pre-registered null.

We report it rather than bury it.

Across the three evaluable open-weight families, Gemma 4 (31B), Gemma 4 (4B), and Qwen 3.6 (27B), a linear residual-stream probe shows no cross-pattern net edge over a default-prompted judge. Macro edges are +0.04, −0.46, and −0.15, respectively; the probe beats the judge in 2 of 4, 0 of 3, and 1 of 4 held-out folds.
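One way to read a macro edge and a fold win count is sketched below. It assumes the macro edge is the unweighted mean, over held-out folds, of the probe's net score minus the judge's net score. That definition is an assumption, and the numbers are made up for illustration; they are not results from the study.

```python
def macro_edge(probe_net, judge_net):
    """Unweighted mean of per-fold (probe - judge) net-score differences."""
    edges = [p - j for p, j in zip(probe_net, judge_net)]
    return sum(edges) / len(edges)


def folds_won(probe_net, judge_net):
    """Number of held-out folds where the probe's net score exceeds the judge's."""
    return sum(p > j for p, j in zip(probe_net, judge_net))


# Illustrative per-fold net scores only (not data from the audit).
probe = [0.5, 0.0, 0.25, 0.75]
judge = [0.25, 0.25, 0.25, 0.25]

print(macro_edge(probe, judge), folds_won(probe, judge))
```

A tie on a fold does not count as a win in this sketch, which is why the third fold above is excluded.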

01

No separable operating point

A 64-cell layer × fold audit finds 0 of 48 cells strictly separable on a held-out pattern. The best achievable net edge over any threshold is at most +0.16 — below any usable bar. Operating-point selection cannot account for the null.
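The idea behind the audit can be sketched as a sweep over every threshold in every (layer, fold) cell. The sketch assumes "strictly separable" means every deceptive case scores above every clean twin, and reuses an assumed net score of catch rate minus false-escalation rate. The function names and the `grid` layout are hypothetical, not the released pipeline.

```python
def best_net_over_thresholds(scores, labels):
    """Best (catch rate - false-escalation rate) over all thresholds.

    A case is flagged when score >= threshold. Assumes both classes are present.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    # Every observed score, plus +inf so that "flag nothing" is also tried.
    candidates = sorted(set(scores)) + [float("inf")]
    best = float("-inf")
    for t in candidates:
        catch = sum(s >= t for s in pos) / len(pos)
        false_esc = sum(s >= t for s in neg) / len(neg)
        best = max(best, catch - false_esc)
    return best


def strictly_separable(scores, labels):
    """Assumed meaning: every deceptive score exceeds every clean score."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    return min(pos) > max(neg)


def audit(grid):
    """grid maps (layer, fold) -> (scores, labels); returns (separable cells, best net)."""
    separable = sum(strictly_separable(s, y) for s, y in grid.values())
    best = max(best_net_over_thresholds(s, y) for s, y in grid.values())
    return separable, best
```

Because the sweep takes the maximum over all thresholds in hindsight, a low value here bounds what any operating-point choice could have achieved on that cell.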

02

Neither monitor is deployable as-tested

False-escalation rates run 0.48–0.70 for the judge across the evaluable families. The probe collapses per pattern to one of two extremes: it either flags everything, catching every deceptive case and falsely escalating every clean twin, or flags nothing. On this class of failure, this work offers no monitor that is operationally usable as-tested.

03

Two families were unsuitable — reported as findings

Phi-4-Reasoning (14B) refuses the agentic action entirely on safety grounds; Mistral Small 3.2 (24B) is too terse — 112 of 113 graded trajectories carry no chain of thought, so the silent-reasoning regime the white-box claim concerns never arises. We treat these as findings the pre-registration committed us to report, not exclusions.

Why it matters

The protocol outlives the result.

The contribution is not the number but the protocol and the released artifacts: a benchmark constructed to be neither trivially passable nor trivially blocking, a pre-registration with disclosed deviations, and an audit designed so the same test can be re-run against any monitor on any model. The eval is small by benchmark standards, but it is the contribution most likely to outlive this paper, and we hope it will.

We do not read this as ruling out white-box gating in general. The result is bounded: one family of probes, one credit gate, one suite of failure patterns, one model-size band. What the audit establishes is that, in this setting, a linear probe on the residual stream does not produce separating information at any operating point we can identify. It is an audit-grounded null, in a regime where output monitoring also fails to operate at a usable false-escalation rate.

On release

Published in full.

01
GlassBox-Bank
24 cases, 4 patterns, surface-clean paired twins.
on release
02
Full trajectories
Every agent trajectory and per-trial scored monitor decision.
on release
03
Trained probes
The residual-stream probes evaluated in the audit.
on release
04
Audit pipeline
The layer-by-threshold audit code and scoring.
on release
05
Pre-registration
The committed protocol, with deviations disclosed.
on release

The paper, dataset, trajectories, probes, and audit pipeline release together. Code lands when the preprint is posted; there is no arXiv ID and no public repository yet, and we will not link a placeholder.

Lead with integrity

Check our work.

GlassBox is designed to be re-run, challenged, and extended. When the preprint and artifacts are public, the same protocol can be run against any monitor on any model.