TL;DR: Steering vectors let you modify model behavior at inference time without fine-tuning. Extract a direction from contrast pairs, add it to the residual stream, and the model shifts accordingly. We’ve packaged this into rotalabs-steer with support for multiple extraction methods and easy integration.
The Problem
You have a language model. It mostly does what you want, but sometimes it’s too verbose. Or too cautious. Or not cautious enough. Or it refuses things it shouldn’t refuse.
The traditional fix is fine-tuning. Collect examples of the behavior you want, train for a while, hope you don’t break anything else.
But fine-tuning is slow, expensive, and permanent. What if you could just… turn a knob at inference time?
That’s what steering vectors give you.
How Steering Works
Here’s the key insight: language models build up internal representations as they process text. By layer 15 or 20 (in a 32-layer model), the model has a pretty good sense of what kind of response it’s going to give.
These representations live in a high-dimensional space. And it turns out that many behavioral properties are encoded as directions in this space.
“Helpful vs unhelpful” is a direction. “Verbose vs concise” is a direction. “Formal vs casual” is a direction.
If you can find these directions, you can push the model along them during generation.
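The push itself is just vector addition in hidden-state space. A toy numpy sketch (the 4-dimensional vectors and the `direction` here are made up for illustration; real hidden states have thousands of dimensions):

```python
import numpy as np

# Toy hidden state and a unit "behavior" direction (illustrative values only)
hidden = np.array([0.2, -1.0, 0.5, 0.3])
direction = np.array([1.0, 0.0, 0.0, 0.0])  # pretend this encodes "verbose vs concise"

# Steering: move the hidden state along the direction
strength = 1.5
steered = hidden + strength * direction

# The component along the direction grows; everything orthogonal is untouched
print(steered @ direction)  # 1.7 = 0.2 + 1.5
print(hidden @ direction)   # 0.2
```

Everything not aligned with the direction passes through unchanged, which is why a well-chosen vector can shift one behavior without wrecking the rest.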
Extracting a Steering Vector
The simplest approach is contrastive activation extraction:
- Create prompt pairs that differ only in the behavior you care about
- Run both through the model
- Take the difference in activations
- Average across pairs
```python
from rotalabs_steer import SteeringExtractor

extractor = SteeringExtractor(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    layer=20,  # middle-to-late layers work best
)

# Contrastive pairs for "helpfulness"
positive_prompts = [
    "You are an extremely helpful assistant who provides thorough, detailed answers.",
    "Always give complete information and go above and beyond to help.",
    "Be maximally useful and informative in your responses.",
]

negative_prompts = [
    "You are unhelpful and give minimal information.",
    "Provide brief, incomplete answers without elaboration.",
    "Be vague and avoid giving useful details.",
]

vector = extractor.extract_contrastive(
    positive_prompts=positive_prompts,
    negative_prompts=negative_prompts,
)

# Save for later use
vector.save("helpfulness_vector.pt")
```
The vector is just a tensor with the same dimension as the model’s hidden state (4096 for Mistral-7B).
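Stripped of the library plumbing, contrastive extraction is just a difference of means over activations. A numpy sketch with random stand-ins for the activations (in practice these come from the model's hidden states at the chosen layer, and the dimension would be 4096):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8  # 4096 for Mistral-7B; small here for illustration

# Stand-ins for per-prompt activations at the extraction layer;
# positives are artificially shifted along dimension 0
positive_acts = rng.normal(size=(3, hidden_dim)) + np.eye(hidden_dim)[0]
negative_acts = rng.normal(size=(3, hidden_dim))

# Difference of means: the candidate steering direction
vector = positive_acts.mean(axis=0) - negative_acts.mean(axis=0)

# Often normalized so `strength` means the same thing across vectors
vector = vector / np.linalg.norm(vector)
print(vector.shape)  # (8,)
```

With only a handful of pairs the estimate is noisy; more pairs average out the prompt-specific variation and leave the shared behavioral direction.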
Applying Steering at Inference
Once you have a vector, applying it is straightforward:
```python
from rotalabs_steer import SteeringController, SteeringVector

controller = SteeringController(model)  # model: an already-loaded HF model

# Load the vector
helpfulness = SteeringVector.load("helpfulness_vector.pt")

# Generate with steering
response = controller.generate(
    prompt="Explain quantum entanglement.",
    steering_vector=helpfulness,
    strength=1.5,  # how hard to push
    max_new_tokens=200,
)
```
The `strength` parameter controls how far you push along the direction. Higher values produce stronger effects, but push too hard and the output becomes incoherent. The typical useful range is 0.5 to 2.0.
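Mechanically, injection is usually implemented as a forward hook that adds `strength * vector` to the residual stream at the chosen layer on every forward pass. A framework-free numpy sketch of that hook logic (the `layer` function is a toy stand-in for a transformer block, not the library's implementation):

```python
import numpy as np

def layer(h):
    """Stand-in for a transformer block's residual-stream output."""
    return h * 0.9  # arbitrary toy transformation

def steered_layer(h, vector, strength):
    """Wraps the layer and adds the steering term to its output."""
    return layer(h) + strength * vector

h = np.ones(4)
v = np.array([1.0, 0.0, 0.0, 0.0])

baseline = layer(h)
steered = steered_layer(h, v, strength=1.5)
print(steered - baseline)  # [1.5, 0., 0., 0.]: exactly strength * vector
```

Because the addition happens at every decoding step, the nudge compounds across the generation rather than only affecting the first token.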
What Can You Steer?
We’ve successfully extracted vectors for:
Tone and style
- Formal vs casual
- Verbose vs concise
- Technical vs accessible
Behavioral properties
- Helpful vs unhelpful
- Confident vs hedging
- Creative vs factual
Safety-relevant
- Refusal vs compliance
- Cautious vs direct
- Sandbagging vs genuine effort
Some properties steer better than others. Tone is easy. Deep reasoning changes are harder.
Layer Selection Matters
Where you extract from (and where you inject) affects what happens.
Early layers (1-10): More about surface features and basic patterns. Steering here changes word choice but not deep behavior.
Middle layers (10-20): Semantic content and planning. This is usually the sweet spot for behavioral steering.
Late layers (20+): Close to output. Changes here are more direct but can be less stable.
You can extract from one layer and inject at another:
```python
vector = extractor.extract_contrastive(
    positive_prompts=positive,
    negative_prompts=negative,
    extraction_layer=15,
)

response = controller.generate(
    prompt=prompt,
    steering_vector=vector,
    injection_layer=18,  # inject slightly later
    strength=1.0,
)
```
Multiple Vectors
You can combine multiple steering vectors:
```python
helpful = SteeringVector.load("helpful.pt")
concise = SteeringVector.load("concise.pt")
formal = SteeringVector.load("formal.pt")

# Combine with different weights
combined = helpful * 1.0 + concise * 0.5 + formal * 0.3

response = controller.generate(
    prompt=prompt,
    steering_vector=combined,
)
```
Vectors are additive. You can push in multiple directions at once.
Watch out for contradictions though. “Verbose + concise” will just confuse things.
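The weighted combination is plain linear algebra, which also shows why contradictory vectors cancel rather than "average out". A numpy sketch with made-up one-hot directions:

```python
import numpy as np

helpful = np.array([1.0, 0.0, 0.0])
concise = np.array([0.0, 1.0, 0.0])
verbose = -concise  # the opposite direction

combined = 1.0 * helpful + 0.5 * concise
print(combined)  # [1.  0.5 0. ]: pushes on both behaviors at once

# "Verbose + concise" at equal weight cancels to nothing
muddled = 1.0 * verbose + 1.0 * concise
print(muddled)  # [0. 0. 0.]
```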
Advanced: CAA and RepE
Beyond simple contrastive extraction, rotalabs-steer supports:
Contrastive Activation Addition (CAA) Uses a dataset of paired examples rather than just prompts. More robust but requires more data.
```python
vector = extractor.extract_caa(
    positive_examples=helpful_conversations,
    negative_examples=unhelpful_conversations,
)
```
Representation Engineering (RepE) Finds directions through PCA on labeled activations. Good when you have many examples.
```python
vector = extractor.extract_repe(
    examples=all_examples,
    labels=behavior_labels,  # 1 for positive, 0 for negative
)
```
Limitations
Model-specific. A vector extracted from Mistral won’t work on Llama. Each model needs its own vectors.
Not magic. You can’t steer a model to do things it fundamentally can’t do. Steering works within the model’s existing capability envelope.
Requires weight access. This is an open-weight technique. No way to apply it to API models like GPT-4 or Claude.
Can break things. Push too hard and outputs become incoherent. Always test with reasonable strength values.
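One cheap sanity check while tuning: compare the norm of the steering term against the norm of a typical hidden state. If the steering term rivals the activation itself, incoherence is likely. This heuristic is our suggestion, not part of the library:

```python
import numpy as np

def steering_ratio(hidden, vector, strength):
    """Fraction of the hidden state's norm contributed by the steering term."""
    return strength * np.linalg.norm(vector) / np.linalg.norm(hidden)

h = np.full(4096, 0.5)              # stand-in hidden state
v = np.ones(4096) / np.sqrt(4096)   # unit-norm steering vector

print(steering_ratio(h, v, strength=1.5))  # ~0.047: a small relative nudge
```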
Installation
```shell
pip install rotalabs-steer

# With all optional dependencies
pip install "rotalabs-steer[all]"
```
Requires PyTorch and transformers. Tested on Python 3.9+.
Related Work
This builds on representation engineering research (Zou et al.), Nina Rimsky and collaborators' work on contrastive activation addition, and the broader interpretability community.
Our contribution is packaging it into something production-ready with consistent APIs and support for multiple extraction methods.
Questions? Reach out at [email protected].