New: Open-source packages now available on PyPI and npm.

Coming Soon

SandbagEval

Benchmark for detecting strategic underperformance in AI models.

Launching Q2 2026

TrustBench

Comprehensive evaluation of AI system trustworthiness.

In development

AgentTrust

Benchmark for multi-agent system trust properties.

In development

Available Now

DistSynth-50

50 distributed-systems synthesis tasks for evaluating verified code generation.
