Steering Vectors: Runtime Behavior Control for LLMs
How to extract behavioral directions from language models and apply them at inference time. A practical guide to rotalabs-steer.
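The basic steering-vector recipe is: record activations for contrastive prompt pairs, take the mean difference as a behavioral direction, and add that (scaled) direction to a layer's hidden state at inference. A minimal numpy sketch of that idea follows; the random arrays stand in for real model activations, and none of the names here are the rotalabs-steer API.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Stand-ins for layer activations captured from prompts that do / don't
# exhibit the target behavior (in practice: forward hooks on a real model).
acts_pos = rng.normal(loc=1.0, size=(32, d_model))
acts_neg = rng.normal(loc=0.0, size=(32, d_model))

def extract_steering_vector(pos, neg):
    """Mean-difference direction between the two activation sets."""
    return pos.mean(axis=0) - neg.mean(axis=0)

def apply_steering(hidden, vector, strength=1.0):
    """Add the scaled direction to a hidden state at inference time."""
    return hidden + strength * vector

v = extract_steering_vector(acts_pos, acts_neg)
h = rng.normal(size=d_model)
steered = apply_steering(h, v, strength=2.0)
print(steered.shape)  # (16,)
```

In a real setup the extraction step runs once offline, and the apply step is a hook registered on the chosen transformer layer during generation; `strength` trades off behavioral effect against output degradation.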
Technical deep-dives, research updates, and tutorials.
A technical analysis of trust dynamics, emergent behaviors, and security vulnerabilities in Moltbook, the first large-scale agent-to-agent social network...
We're releasing 12 packages for AI trust, evaluation, and reliability. Available on PyPI and npm, all AGPL-3.0 licensed.
The security conversation around AI agents is stuck on identity and permissions. The harder problem is whether an agent should...
The standard architecture for coordinating autonomous systems assumes a command node. That works until the link goes down. Here's how...
Prompting is not a control mechanism. When AI operates in kill chains and beyond reliable comms, 'the model usually follows...
The Model Context Protocol connects AI agents to the world. Everyone's focused on securing the connections. But you can authenticate...
Chain-of-thought monitoring was supposed to let us supervise AI reasoning. But new research shows models only faithfully report their reasoning...
A single compromised agent poisoned 87% of downstream decisions in 4 hours. As AI agents gain persistent memory, attackers are...
A joint paper from OpenAI, Anthropic, and DeepMind bypassed 12 published defenses. NAACL 2025 broke 8 more. Here's where we...
As AI agents collaborate on complex tasks, a critical question emerges: how does Agent A know Agent B isn't compromised,...
First empirical demonstration of activation-level sandbagging detection. Linear probes achieve 90-96% accuracy across Mistral, Gemma, and Qwen models. Sandbagging representations...
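A linear probe of the kind described above is just a linear classifier trained on a model's internal activations. The following numpy sketch shows the shape of that experiment with synthetic data standing in for real activations; the separation direction, dimensions, and training loop are all illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 200

# Synthetic "activations": sandbagging vs honest runs shifted along one direction.
direction = rng.normal(size=d)
X = np.vstack([rng.normal(size=(n, d)) + direction,
               rng.normal(size=(n, d)) - direction])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic-regression probe trained with plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient of log-loss w.r.t. w
    b -= lr * (p - y).mean()                 # gradient w.r.t. bias

acc = (((X @ w + b) > 0) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

Because the probe is linear, high accuracy is evidence that the behavior is encoded along a direction in activation space, which is what makes detection cheap at runtime.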