GlassBox: A Cross-Pattern Audit of Linear Probe Gates
A pre-registered, null-result benchmark for monitors that gate irreversible agentic actions.
Research on AI agent reliability, evaluation, verification, and memory. We link a paper only when there's a real artifact behind it — no placeholder IDs, no fabricated venues.
Continuous Dynamics for Context Preservation
A memory architecture treating stored information as continuous fields governed by partial differential equations — semantic diffusion, thermodynamic decay, and field coupling for persistent, composable agent memory.
Semantic MAP-Elites red-teaming across frontier LLMs
A quality-diversity evolutionary framework that red-teams LLMs at the semantic level — evolving interpretable attack strategies (not token sequences) and using a MAP-Elites archive to illuminate distinct vulnerability profiles across GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and an open-weight model. The contribution is an interpretable, reproducible baseline a safety team can triage and defend against.
CE2P translates formal-verification failures into structured LLM feedback. The benefit is inversely correlated with model capability — weaker models gain the most.
A formal framework for ROI-based decision routing in multi-tier verification systems.
A comprehensive threat model covering input, state, tool, planning, and coordination attacks, with empirical evaluation across agent architectures.
Interpretability methods to detect adversarial attacks on agentic systems — extending metacognitive probing to security.
Uncertainty quantification that tracks actual accuracy — activation-based estimation and propagation through multi-step reasoning chains.
Defensive publications by the founding team during prior work at Google. Published via TD Commons, CC BY 4.0.
We welcome collaborators across AI safety, interpretability, formal methods, and adversarial ML.
Get in touch →
Deploy reliability infrastructure in production. All packages are open source, with enterprise support via Rotascale.
View packages →
All research, benchmarks, and tools are open source. Issues, PRs, and discussions welcome.
GitHub →