SandbagEval
Benchmark for detecting strategic underperformance in AI models.
Open benchmarks for AI evaluation. Coming Q2 2026.