Prompting is not a control mechanism. It’s a suggestion.

This matters more in defense than anywhere else. When AI is embedded in kill chains, informing targeting decisions, or running on autonomous platforms operating beyond reliable comms range, “the model usually follows the system prompt” is not an acceptable assurance. ROE compliance can’t be probabilistic. You can’t tell a JAG officer that the system was instructed to follow the law of armed conflict and it mostly does.

I’ve spent the last year working on activation steering and runtime interpretability. The research started in AI safety, but the defense applications became obvious fast. What follows is what I’ve learned about controlling AI behavior at inference time - not through prompting, not through fine-tuning, but through direct intervention in model internals.

Why Prompts Fail When It Matters

System prompts are the default control mechanism for deployed LLMs. They’re also insufficient for anything high-stakes.

The failure modes are well-documented: prompt injection, context window overflow, inconsistent adherence under distribution shift. Commercial applications tolerate this because the cost of failure is low. A customer service bot goes off-script, someone gets a refund, maybe there’s a PR issue.

Defense doesn’t have that margin. Consider the AI systems coming online across the joint force:

  • Decision support tools processing multi-INT feeds and generating COAs
  • Autonomous systems making engagement recommendations or executing pre-authorized responses
  • Logistics optimization under contested conditions
  • ISR processing at scale that humans can’t manually review

Every one of these needs behavioral guarantees that prompting can’t provide. “We told it to follow ROE” isn’t a control. It’s hope with documentation.

flowchart LR
    subgraph prompt [Prompt-Based Control]
        P[System Prompt] --> M1[Model]
        I1[User Input] --> M1
        A1[Adversarial Input] -.->|Injection| M1
        M1 --> O1[Output]
    end
    subgraph steering [Activation Steering]
        S[Steering Vector] --> L[Layer N]
        I2[Any Input] --> M2[Model]
        M2 --> L
        L --> O2[Controlled Output]
        A2[Adversarial Input] -.->|Blocked| L
    end
    style A1 fill:#f66
    style A2 fill:#f66
    style S fill:#6f6

Left: Prompts live in the same space as attacks. Right: Steering operates below the text layer, resistant to injection.

Steering Vectors: Behavioral Control Below the Text Layer

The alternative I’ve been working on operates at the activation level.

During inference, transformer models build internal representations layer by layer. These representations don’t just encode “what token comes next” - they encode the model’s contextual understanding, behavioral tendencies, risk posture, and more. Activation steering identifies directions in this representation space that correspond to specific behaviors, then intervenes during inference to push the model along those directions.

Practically: you compute a steering vector offline that corresponds to a behavioral property you care about. At runtime, you add that vector to the model’s activations at specified layers. The model’s behavior shifts accordingly - more consistently than prompt engineering, without consuming context window, and adjustable in real time.
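
A minimal sketch of the runtime half, assuming a PyTorch model whose decoder blocks are exposed as model.layers and a precomputed vector steer_vec (both of which are placeholders, not a specific framework’s API):

    import torch

    def add_steering_hook(model, layer_idx, steer_vec, alpha=4.0):
        """Register a forward hook that shifts one layer's activations along steer_vec.

        Assumes the decoder blocks live in model.layers and return either a tensor
        or a tuple whose first element is the hidden state [batch, seq, d_model].
        """
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            steered = hidden + alpha * steer_vec.to(hidden.device, hidden.dtype)
            return (steered, *output[1:]) if isinstance(output, tuple) else steered

        return model.layers[layer_idx].register_forward_hook(hook)

    # handle = add_steering_hook(model, layer_idx=20, steer_vec=constraint_vec)
    # ... run inference as usual ...
    # handle.remove()  # or swap in a different vector when the context changes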

I’ve tested this specifically on agent behaviors relevant to defense contexts:

  • Constraint compliance (does the system respect boundaries even under adversarial prompting?)
  • Risk tolerance calibration (conservative vs. aggressive posture in ambiguous situations)
  • Information handling (classification-aware response filtering)
  • Escalation thresholds (when does the system flag for human review vs. proceed?)

Steering vectors produce more calibrated, more predictable behavior modification than prompting across most of these dimensions. Not universally superior, but meaningfully more reliable. And reliability is what matters when the system is operating inside someone’s OODA loop.

flowchart TD
    subgraph transformer [Transformer Forward Pass]
        I[Input Tokens] --> E[Embedding]
        E --> L1[Layer 1]
        L1 --> L2[Layer 2]
        L2 --> LN[...]
        LN --> LK[Layer K]
        LK --> LM[...]
        LM --> O[Output]
    end
    SV[Steering Vector<br/>e.g. +compliance] -->|Add to activations| LK
    subgraph offline [Offline Computation]
        D1[Compliant examples] --> Extract
        D2[Non-compliant examples] --> Extract
        Extract[Extract activations] --> Diff[Compute difference]
        Diff --> SV
    end
    style SV fill:#6f6
    style LK fill:#9f9

Steering vectors are computed offline from contrast pairs, then added at runtime to shift behavior.
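
For the offline half, a common recipe is a difference of means over contrast pairs. A sketch, assuming a Hugging Face-style model that returns hidden states; the prompt sets and layer index are illustrative:

    import torch

    @torch.no_grad()
    def compute_steering_vector(model, tokenizer, positive_texts, negative_texts, layer_idx):
        """Difference-of-means steering vector from behavioral contrast pairs."""
        def mean_activation(texts):
            acts = []
            for text in texts:
                inputs = tokenizer(text, return_tensors="pt").to(model.device)
                out = model(**inputs, output_hidden_states=True)
                acts.append(out.hidden_states[layer_idx].mean(dim=1))  # pool over tokens
            return torch.cat(acts).mean(dim=0)                         # [d_model]

        # e.g. positive_texts = constraint-compliant transcripts,
        #      negative_texts = matched non-compliant transcripts
        return mean_activation(positive_texts) - mean_activation(negative_texts)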

The JADC2 Problem

Joint All-Domain Command and Control assumes AI-enabled decision advantage. It also assumes that AI operates across a network that’s under active attack, frequently degraded, and spans classification boundaries.

Think about what this means for AI control:

Disconnected operations are the norm, not the exception. Edge AI on platforms and forward nodes can’t phone home for guidance. The model you deploy is what you have. Behavioral constraints need to be baked in and robust to isolation.

The adversary is trying to break your AI. Prompt injection isn’t theoretical - it’s a natural extension of information warfare. If your behavioral controls are text in a context window, they’re one clever input away from subversion. Chinese and Russian EW doctrine already emphasizes cognitive effects; attacking AI decision aids is an obvious evolution.

Context changes faster than you can redeploy. ROE can change mid-mission. Operational phases shift. New threats emerge. You need to modify AI behavior without pushing new models, without retraining, ideally without even restarting the system.

Everything needs to be auditable. When a system informs a strike decision, the kill chain review will want to know exactly what controls were in place. “The prompt said to follow ROE” won’t survive legal review. You need to demonstrate that specific, verified controls were active at the time of the decision.

Steering vectors address all of these constraints:

  • They’re computed offline and stored as small tensors - trivial to embed on edge hardware
  • They operate below the text interface, resistant to prompt injection
  • They can be hot-swapped at runtime (new ROE = load a different steering vector; see the sketch after the diagram below)
  • They’re inspectable and auditable (you can verify which vectors were active)

flowchart LR
    subgraph constraints [JADC2 Constraints]
        C1[Disconnected ops]
        C2[Adversarial environment]
        C3[Dynamic context]
        C4[Audit requirements]
    end
    subgraph solutions [Steering Solutions]
        S1[Small tensors<br/>embed on edge]
        S2[Below text layer<br/>injection resistant]
        S3[Hot-swappable<br/>no redeploy]
        S4[Inspectable<br/>which vectors active]
    end
    C1 --> S1
    C2 --> S2
    C3 --> S3
    C4 --> S4
    style S1 fill:#9f9
    style S2 fill:#9f9
    style S3 fill:#9f9
    style S4 fill:#9f9
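
To make the hot-swap and audit points concrete, here is a sketch of a mission steering registry built on the hook helper sketched earlier; the vector names and file paths are illustrative:

    import torch

    class SteeringRegistry:
        """Tracks named steering vectors and which ones are currently active."""

        def __init__(self, model, layer_idx):
            self.model, self.layer_idx = model, layer_idx
            self.handles, self.active = {}, []

        def load(self, name, path):
            # Vectors are small tensors on disk - trivial to stage on edge hardware.
            vec = torch.load(path, map_location="cpu")
            self.handles[name] = add_steering_hook(self.model, self.layer_idx, vec)
            self.active.append(name)

        def unload(self, name):
            self.handles.pop(name).remove()
            self.active.remove(name)

    # New ROE mid-mission, no redeploy:
    # registry.unload("roe_phase1"); registry.load("roe_phase2", "vectors/roe_phase2.pt")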

Detecting Anomalous AI Behavior in Operations

Control is half the problem. The other half is knowing when your AI is behaving unexpectedly.

This is where activation monitoring comes in. The same internal representations that enable steering also enable detection. You can train lightweight probe models on intermediate activations to flag:

  • Uncertainty that doesn’t surface in output confidence scores
  • Behavioral drift from established baselines
  • Situational awareness anomalies (is the model behaving differently because it’s detected something about its operating context?)
  • Potential adversarial influence on model reasoning
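
Such a probe can be as lightweight as a linear layer over pooled activations. A minimal sketch, assuming you have labeled activation samples from baseline and anomalous runs:

    import torch
    import torch.nn as nn

    class ActivationProbe(nn.Module):
        """Linear probe over pooled hidden states - cheap enough to run alongside inference."""

        def __init__(self, d_model):
            super().__init__()
            self.linear = nn.Linear(d_model, 1)

        def forward(self, hidden):          # hidden: [batch, seq, d_model]
            pooled = hidden.mean(dim=1)     # mean-pool over the sequence
            return torch.sigmoid(self.linear(pooled)).squeeze(-1)  # score in [0, 1]

    def train_probe(probe, activations, labels, epochs=200, lr=1e-3):
        """activations: [n, seq, d_model]; labels: [n] with 1 = anomalous run."""
        opt = torch.optim.Adam(probe.parameters(), lr=lr)
        loss_fn = nn.BCELoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(probe(activations), labels.float()).backward()
            opt.step()
        return probe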

I built a toolkit for this, focused on detecting metacognitive behaviors: cases where the model is reasoning about its own situation in ways that might lead to unexpected actions. The canonical example is sandbagging, where a model deliberately underperforms when it detects evaluation conditions. This started as alignment research, but the same techniques generalize to operational monitoring.

If your autonomous ISR system starts behaving differently because an adversary has found a way to signal “this is a test,” you want to know. If your decision support AI’s internal uncertainty is spiking while its output confidence stays high, that’s a red flag. These things don’t show up in output logs. They show up in activations.

flowchart TD
    subgraph model [Model Inference]
        I[Input] --> L1[Layer 1]
        L1 --> L2[Layer 2]
        L2 --> LN[Layer N]
        LN --> O[Output]
    end
    L2 -.->|Sample activations| P1[Uncertainty Probe]
    LN -.->|Sample activations| P2[Drift Probe]
    LN -.->|Sample activations| P3[Anomaly Probe]
    P1 --> D{Thresholds}
    P2 --> D
    P3 --> D
    D -->|Normal| Continue[Continue]
    D -->|Flagged| Alert[Alert + Log]
    D -->|Critical| Fallback[Fallback behavior]
    style Alert fill:#ff9
    style Fallback fill:#f66

Lightweight probes sample activations without blocking inference. Anomalies trigger alerts or fallbacks.

Operational Architecture

What this looks like deployed:

Baseline steering vectors ship with the model package. These encode standing behavioral constraints - ROE foundations, classification handling rules, escalation thresholds, domain restrictions. They’re the default operating mode.

Mission-configurable steering layer loads additional vectors based on operational context. Different phases of an operation, different threat environments, different authorities - each maps to a steering configuration. Operators or C2 systems can push these without touching the underlying model.

Runtime activation monitoring samples inference passes and runs probe models to detect anomalies. This doesn’t need to hit every token - statistical sampling catches drift without killing latency. When probes flag something, the system can alert operators, increase logging fidelity, or trigger fallback behaviors.
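
A sketch of that sampling and escalation logic; the sample rate, thresholds, and probe interface are placeholders to be set during T&E, not recommended values:

    import random

    def monitor_step(hidden, probes, sample_rate=0.1,
                     alert_threshold=0.7, critical_threshold=0.95):
        """Sample a fraction of inference passes and map probe scores to an action.

        Returns "continue", "alert" (notify operators, raise logging fidelity),
        or "fallback" (trigger a pre-defined safe behavior).
        """
        if random.random() > sample_rate:       # statistical sampling, not every token
            return "continue"
        worst = max(probe(hidden).max().item() for probe in probes)
        if worst >= critical_threshold:
            return "fallback"
        if worst >= alert_threshold:
            return "alert"
        return "continue"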

Audit infrastructure captures which steering vectors were active, what the probes detected, and how confident the system was in its outputs. This feeds kill chain review, after-action analysis, and continuous T&E.
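
The audit record itself can be small. A sketch of what might be captured per monitored inference; the field names are illustrative, not a program standard:

    import json, time
    from dataclasses import dataclass, field, asdict

    @dataclass
    class SteeringAuditRecord:
        """One entry per monitored inference - enough to show which controls were active."""
        timestamp: float = field(default_factory=time.time)
        active_vectors: list = field(default_factory=list)   # e.g. ["roe_baseline", "roe_phase2"]
        probe_scores: dict = field(default_factory=dict)     # e.g. {"uncertainty": 0.31, "drift": 0.12}
        monitor_action: str = "continue"                      # continue | alert | fallback
        output_confidence: float = 0.0

    def log_record(record, path="steering_audit.jsonl"):
        with open(path, "a") as f:
            f.write(json.dumps(asdict(record)) + "\n")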

The compute overhead is modest. Steering vectors are small additions to forward passes. Probe models are lightweight classifiers. In my testing, the total increase in inference cost is in the single-digit percentage range.

flowchart TB
    subgraph deployed [Deployed System Architecture]
        direction TB
        subgraph vectors [Steering Layer]
            BV[Baseline Vectors<br/>ROE, classification, escalation]
            MV[Mission Vectors<br/>Phase-specific, threat-specific]
        end
        subgraph model [Model + Monitoring]
            M[Base Model]
            BV --> M
            MV --> M
            M --> P[Probes]
        end
        subgraph audit [Audit Infrastructure]
            P --> L[Logs]
            M --> L
            L --> R[Kill chain review]
            L --> AA[After-action analysis]
        end
    end
    C2[C2 System] -.->|Push new vectors| MV
    OP[Operator] -.->|Override| MV
    style BV fill:#9f9
    style MV fill:#9cf

Complete operational stack: baseline constraints ship with the model, mission-specific vectors are pushed from C2, everything is logged for audit.

Integration with Existing Programs

This isn’t rip-and-replace. It’s an additional control layer that sits alongside existing AI infrastructure.

If you’re running models through GenAI.mil or CDAO’s reference architectures, steering and monitoring integrate at the inference wrapper level. If you’re deploying commercial models (Anthropic, OpenAI, open-source) in classified environments, the same applies - the steering layer is model-agnostic in principle, though vectors must be computed per-model.

The key architectural assumption: you have access to model internals during inference. This works with open-weight models and with API providers who expose activation access. It doesn’t work with fully black-box APIs - one more reason defense deployments should favor models with transparent internals.
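
Under that assumption, the integration point can be a thin wrapper that composes the pieces sketched above (a model with exposed hidden states, the steering registry, the probes, and the audit helpers). This is an illustration of where the layer sits, not a reference implementation:

    class SteeredInferenceWrapper:
        """Thin inference wrapper: steering is already hooked into the model via the
        registry; this adds monitoring and audit logging around each forward pass."""

        def __init__(self, model, registry, probes):
            self.model, self.registry, self.probes = model, registry, probes

        def run(self, inputs):
            out = self.model(**inputs, output_hidden_states=True)
            action = monitor_step(out.hidden_states[-1], self.probes)
            log_record(SteeringAuditRecord(
                active_vectors=list(self.registry.active),
                monitor_action=action,
            ))
            return out, action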

Limitations

I’m not going to oversell this. Real constraints exist:

Model-specific computation. Steering vectors don’t transfer across models. Swap your base model, recompute your vectors. This is tractable (hours to days of compute depending on vector complexity) but not free.

Incomplete coverage. Not every behavior maps cleanly to an activation direction. Some things you want to control are entangled with capabilities you need. The field is improving here, but it’s not solved.

Adversarial robustness is undertested. Steering vectors resist casual prompt injection. I haven’t seen rigorous evaluation against sophisticated, defense-relevant attacks. This needs red-teaming before anyone relies on it for kinetic decisions.

Detection isn’t explanation. Activation monitoring tells you “something’s anomalous,” not always “here’s why.” Sometimes flagging is enough. For legal and ethical review, you often need more.

TEVV gaps. There’s no established framework for testing and evaluating activation-level controls. The DoD’s Responsible AI guidelines don’t yet address this layer. That’s a policy gap, not a technical one, but it matters for program adoption.

Where This Is Going

The trajectory is clear: AI systems need real control surfaces, not just prompts and prayers.

Traditional software security doesn’t rely on input sanitization alone. You build access controls, capability restrictions, monitoring - defense in depth. AI systems need the same layered approach. Prompting is input handling. It’s necessary but not sufficient. The control layer needs to operate independently of input.

Activation engineering is young - maybe two years old as a serious research area. Steering vectors, representation engineering, concept erasure, activation patching - the techniques are improving faster than most program offices can track. What I’ve described here is already more capable than what was possible a year ago.

flowchart TD
    subgraph layers [Defense in Depth for AI]
        L1[Layer 1: Input Filtering<br/>Prompts, sanitization]
        L2[Layer 2: Activation Steering<br/>Behavioral constraints]
        L3[Layer 3: Activation Monitoring<br/>Anomaly detection]
        L4[Layer 4: Output Validation<br/>Safety checks]
        L5[Layer 5: Audit Trail<br/>Full logging]
    end
    Attack[Attack Vector] --> L1
    L1 -->|Some bypass| L2
    L2 -->|Constrained| L3
    L3 -->|Flagged if anomalous| L4
    L4 --> L5
    style L2 fill:#9f9
    style L3 fill:#9f9

Layered security for AI: prompting alone is Layer 1. Activation steering and monitoring add Layers 2 and 3.

For contested environments specifically, I expect this becomes standard infrastructure. The current approach of deploying models with prompt-based guardrails won’t survive contact with peer adversaries who understand how to attack AI systems. Something like runtime behavioral control with activation-level monitoring isn’t optional - it’s where this has to go.


If you’re working on AI deployment for defense - whether at a program office, a prime, or a lab - I’m interested in comparing notes. The research is at rotalabs.ai, and the underlying code is public on GitHub.