TL;DR: Steering vectors let you modify model behavior at inference time without fine-tuning. Extract a direction from contrast pairs, add it to the residual stream, and the model shifts accordingly. We’ve packaged this into rotalabs-steer with support for multiple extraction methods and easy integration.
The Problem
You have a language model. It mostly does what you want, but sometimes it’s too verbose. Or too cautious. Or not cautious enough. Or it refuses things it shouldn’t refuse.
The traditional fix is fine-tuning. Collect examples of the behavior you want, train for a while, hope you don’t break anything else.
But fine-tuning is slow, expensive, and permanent. What if you could just… turn a knob at inference time?
That’s what steering vectors give you.
How Steering Works
Here’s the key insight: language models build up internal representations as they process text. By layer 15 or 20 (in a 32-layer model), the model has a pretty good sense of what kind of response it’s going to give.
These representations live in a high-dimensional space. And it turns out that many behavioral properties are encoded as directions in this space.
“Helpful vs unhelpful” is a direction. “Verbose vs concise” is a direction. “Formal vs casual” is a direction.
If you can find these directions, you can push the model along them during generation.
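The push itself is just vector addition in hidden-state space. A toy numpy sketch (the 4-dimensional vectors and the `direction` here are made up for illustration; real hidden states have thousands of dimensions):

```python
import numpy as np

# Toy hidden state and a unit "behavior" direction (illustrative values only)
hidden = np.array([0.2, -1.0, 0.5, 0.3])
direction = np.array([1.0, 0.0, 0.0, 0.0])  # pretend this encodes "verbose vs concise"

# Steering: move the hidden state along the direction
strength = 1.5
steered = hidden + strength * direction

# The component along the direction grows; everything orthogonal is untouched
print(steered @ direction)  # 1.7 = 0.2 + 1.5
print(hidden @ direction)   # 0.2
```

Everything not aligned with the direction passes through unchanged, which is why a well-chosen vector can shift one behavior without wrecking the rest.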
Extracting a Steering Vector
The simplest approach is contrastive activation extraction:
- Create prompt pairs that differ only in the behavior you care about
- Run both through the model
- Take the difference in activations
- Average across pairs
```python
from rotalabs_steer import SteeringExtractor

extractor = SteeringExtractor(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    layer=20,  # middle-to-late layers work best
)

# Contrastive pairs for "helpfulness"
positive_prompts = [
    "You are an extremely helpful assistant who provides thorough, detailed answers.",
    "Always give complete information and go above and beyond to help.",
    "Be maximally useful and informative in your responses.",
]

negative_prompts = [
    "You are unhelpful and give minimal information.",
    "Provide brief, incomplete answers without elaboration.",
    "Be vague and avoid giving useful details.",
]

vector = extractor.extract_contrastive(
    positive_prompts=positive_prompts,
    negative_prompts=negative_prompts,
)

# Save for later use
vector.save("helpfulness_vector.pt")
```
The vector is just a tensor with the same dimension as the model’s hidden state (4096 for Mistral-7B).
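Stripped of the library plumbing, contrastive extraction is just a difference of means over activations. A numpy sketch with random stand-ins for the activations (in practice these come from the model's hidden states at the chosen layer, and the dimension would be 4096):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8  # 4096 for Mistral-7B; small here for illustration

# Stand-ins for per-prompt activations at the extraction layer;
# positives are artificially shifted along dimension 0
positive_acts = rng.normal(size=(3, hidden_dim)) + np.eye(hidden_dim)[0]
negative_acts = rng.normal(size=(3, hidden_dim))

# Difference of means: the candidate steering direction
vector = positive_acts.mean(axis=0) - negative_acts.mean(axis=0)

# Often normalized so `strength` means the same thing across vectors
vector = vector / np.linalg.norm(vector)
print(vector.shape)  # (8,)
```

With only a handful of pairs the estimate is noisy; more pairs average out the prompt-specific variation and leave the shared behavioral direction.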
Applying Steering at Inference
Once you have a vector, applying it is straightforward:
```python
from rotalabs_steer import SteeringController, SteeringVector

controller = SteeringController(model)  # model: an already-loaded HF model

# Load the vector
helpfulness = SteeringVector.load("helpfulness_vector.pt")

# Generate with steering
response = controller.generate(
    prompt="Explain quantum entanglement.",
    steering_vector=helpfulness,
    strength=1.5,  # how hard to push
    max_new_tokens=200,
)
```
The `strength` parameter controls how far you push along the direction. Higher values produce stronger effects, but push too hard and the output becomes incoherent. The typical useful range is 0.5 to 2.0.
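Mechanically, injection is usually implemented as a forward hook that adds `strength * vector` to the residual stream at the chosen layer on every forward pass. A framework-free numpy sketch of that hook logic (the `layer` function is a toy stand-in for a transformer block, not the library's implementation):

```python
import numpy as np

def layer(h):
    """Stand-in for a transformer block's residual-stream output."""
    return h * 0.9  # arbitrary toy transformation

def steered_layer(h, vector, strength):
    """Wraps the layer and adds the steering term to its output."""
    return layer(h) + strength * vector

h = np.ones(4)
v = np.array([1.0, 0.0, 0.0, 0.0])

baseline = layer(h)
steered = steered_layer(h, v, strength=1.5)
print(steered - baseline)  # [1.5, 0., 0., 0.]: exactly strength * vector
```

Because the addition happens at every decoding step, the nudge compounds across the generation rather than only affecting the first token.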
What Can You Steer?
We’ve successfully extracted vectors for:
Tone and style
- Formal vs casual
- Verbose vs concise
- Technical vs accessible
Behavioral properties
- Helpful vs unhelpful
- Confident vs hedging
- Creative vs factual
Safety-relevant
- Refusal vs compliance
- Cautious vs direct
- Sandbagging vs genuine effort
Some properties steer better than others. Tone is easy. Deep reasoning changes are harder.
Layer Selection Matters
Where you extract from (and where you inject) affects what happens.
Early layers (1-10): More about surface features and basic patterns. Steering here changes word choice but not deep behavior.
Middle layers (10-20): Semantic content and planning. This is usually the sweet spot for behavioral steering.
Late layers (20+): Close to output. Changes here are more direct but can be less stable.
You can extract from one layer and inject at another:
```python
vector = extractor.extract_contrastive(
    positive_prompts=positive,
    negative_prompts=negative,
    extraction_layer=15,
)

response = controller.generate(
    prompt=prompt,
    steering_vector=vector,
    injection_layer=18,  # inject slightly later
    strength=1.0,
)
```
Multiple Vectors
You can combine multiple steering vectors:
```python
helpful = SteeringVector.load("helpful.pt")
concise = SteeringVector.load("concise.pt")
formal = SteeringVector.load("formal.pt")

# Combine with different weights
combined = helpful * 1.0 + concise * 0.5 + formal * 0.3

response = controller.generate(
    prompt=prompt,
    steering_vector=combined,
)
```
Vectors are additive. You can push in multiple directions at once.
Watch out for contradictions though. “Verbose + concise” will just confuse things.
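The weighted combination is plain linear algebra, which also shows why contradictory vectors cancel rather than "average out". A numpy sketch with made-up one-hot directions:

```python
import numpy as np

helpful = np.array([1.0, 0.0, 0.0])
concise = np.array([0.0, 1.0, 0.0])
verbose = -concise  # the opposite direction

combined = 1.0 * helpful + 0.5 * concise
print(combined)  # [1.  0.5 0. ]: pushes on both behaviors at once

# "Verbose + concise" at equal weight cancels to nothing
muddled = 1.0 * verbose + 1.0 * concise
print(muddled)  # [0. 0. 0.]
```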
Advanced: CAA and RepE
Beyond simple contrastive extraction, rotalabs-steer supports:
Contrastive Activation Addition (CAA) Uses a dataset of paired examples rather than just prompts. More robust but requires more data.
```python
vector = extractor.extract_caa(
    positive_examples=helpful_conversations,
    negative_examples=unhelpful_conversations,
)
```
Representation Engineering (RepE) Finds directions through PCA on labeled activations. Good when you have many examples.
```python
vector = extractor.extract_repe(
    examples=all_examples,
    labels=behavior_labels,  # 1 for positive, 0 for negative
)
```
Limitations
Model-specific. A vector extracted from Mistral won’t work on Llama. Each model needs its own vectors.
Not magic. You can’t steer a model to do things it fundamentally can’t do. Steering works within the model’s existing capability envelope.
Requires weight access. This is an open-weight technique. No way to apply it to API models like GPT-4 or Claude.
Can break things. Push too hard and outputs become incoherent. Always test with reasonable strength values.
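One cheap sanity check while tuning: compare the norm of the steering term against the norm of a typical hidden state. If the steering term rivals the activation itself, incoherence is likely. This heuristic is our suggestion, not part of the library:

```python
import numpy as np

def steering_ratio(hidden, vector, strength):
    """Fraction of the hidden state's norm contributed by the steering term."""
    return strength * np.linalg.norm(vector) / np.linalg.norm(hidden)

h = np.full(4096, 0.5)              # stand-in hidden state
v = np.ones(4096) / np.sqrt(4096)   # unit-norm steering vector

print(steering_ratio(h, v, strength=1.5))  # ~0.047: a small relative nudge
```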
Installation
```shell
pip install rotalabs-steer

# With all optional dependencies
pip install "rotalabs-steer[all]"
```
Requires PyTorch and transformers. Tested on Python 3.9+.
Related Work
This builds on representation engineering research (Zou et al.), Nina Rimsky and collaborators' work on contrastive activation addition, and the broader interpretability community.
Our contribution is packaging it into something production-ready with consistent APIs and support for multiple extraction methods.
Questions? Reach out at [email protected].