Last October, a team of 14 researchers from OpenAI, Anthropic, and Google DeepMind published a paper that should have been a wake-up call. They took 12 published defenses against prompt injection and jailbreaking, applied adaptive attacks, and bypassed all of them.
A month later, NAACL 2025 published similar findings. Eight defenses tested. All bypassed. Attack success rates over 50% across the board.
The message is clear: we don’t have a solution to prompt injection. And with 53% of companies now running RAG and agentic pipelines according to OWASP, this isn’t an academic problem anymore.
The Current State of Defenses
Let’s look at what’s been tried and why it keeps failing.
Input filtering. Check incoming text for known attack patterns. The problem is obvious - attackers just rephrase. You’re playing whack-a-mole with an adversary who has infinite creativity and you have finite rules.
Output filtering. Catch suspicious responses before they reach the user. Better than nothing, but you’re already too late. The model has processed the malicious input. Any side effects (tool calls, state changes) may have already happened.
Instruction hierarchy. Tell the model “system instructions override user input.” Models don’t reliably follow this. A sufficiently clever prompt can convince the model that the “real” system instruction is actually the injected one.
Spotlighting. Microsoft’s approach - use delimiters and formatting to clearly separate trusted from untrusted content. Helps, but doesn’t solve the fundamental issue. The model still processes everything together.
Perplexity detection. Flag inputs that look statistically unusual. Attackers can craft low-perplexity injections that read naturally.
The pattern here is that every defense assumes you can identify or isolate malicious content at the text level. But prompt injection isn’t really a text problem. It’s a semantic problem. The model can’t distinguish between “data to process” and “instructions to follow” because both are just tokens.
What’s Actually Working (Sort Of)
The most promising results come from multi-agent architectures. A paper from late 2025 describes a pipeline where specialized agents inspect inputs before they reach the main model. In their tests, they hit 100% mitigation against 55 attack types across 8 categories.
That sounds great until you remember the OpenAI/Anthropic/DeepMind finding: defenses fail against adaptive attacks. Test against known attack patterns and you’ll do well. Face an attacker who can probe and adapt, and your defense crumbles.
Microsoft’s production approach combines multiple layers - Prompt Shields for detection, Defender for Cloud integration, hardened system prompts, data governance controls. Defense in depth. It’s the right philosophy, but it’s also an admission that no single technique works.
The January 2026 comprehensive review proposes PALADIN, a five-layer defense framework. Again, layered defense. The authors explicitly state that prompt injection is a “fundamental architectural vulnerability” requiring defense-in-depth rather than singular solutions.
The RAG Problem Makes Everything Worse
Here’s a number that should worry you: researchers demonstrated that just five carefully crafted documents can manipulate AI responses 90% of the time through RAG poisoning.
Five documents. Ninety percent success rate.
When your AI agent retrieves context from a database, knowledge base, or the internet, every piece of retrieved content is a potential injection vector. The agent can’t tell the difference between legitimate context and context designed to hijack its behavior.
This is why OWASP’s 2025 update added two new entries: System Prompt Leakage (LLM07:2025) and Vector and Embedding Weaknesses (LLM08:2025). The threat surface has expanded beyond direct user input to include everything the model touches.
A Real Vulnerability: GitHub Copilot CVE-2025-53773
In case this all sounds theoretical, consider CVE-2025-53773. GitHub Copilot had a remote code execution vulnerability with a CVSS score of 9.6 out of 10.
The attack worked by injecting malicious instructions into code context that Copilot would process. The model would then generate code containing the attacker’s payload, which could execute on the developer’s machine.
This is prompt injection leading to RCE in a tool used by millions of developers. It doesn’t get more concrete than that.
Why Detection Might Beat Prevention
Here’s where our work at Rotalabs comes in.
We’ve been developing activation-level probes that can detect intent by looking at model internals rather than outputs. Our sandbagging detection work showed that you can identify when a model is planning to behave deceptively with 90-96% accuracy, before it generates any output.
The same approach might work for prompt injection.
Think about what happens when a model processes a prompt injection. Somewhere in its activations, there’s a representation of “I’m being instructed to do X” where X is the injected behavior. If we can identify that representation, we can detect the injection regardless of how it’s phrased.
This sidesteps the whole text-level filtering problem. We’re not looking for attack patterns in the input. We’re looking for attack effects in the model’s internal state.
Early experiments suggest this is viable. The challenge is that prompt injection is more varied than sandbagging - there’s no single “injection direction” in activation space. But the methodology extends naturally, and we’re actively working on it.
What You Should Do Now
If you’re deploying agents or RAG systems, here’s practical advice:
Assume injection will happen. Design your system so that successful injection has limited impact. Principle of least privilege for tool access. Sandbox execution environments. Don’t give agents access to capabilities they don’t absolutely need.
Layer your defenses. No single technique works. Combine input filtering, output monitoring, behavioral analysis, and human oversight for high-stakes actions.
Monitor for anomalies. Track what your agents actually do. Sudden changes in behavior patterns might indicate successful injection. Log extensively.
Treat retrieved content as untrusted. Everything from RAG, web browsing, or external APIs is a potential vector. Filter and validate before letting the model process it.
Stay current. This field moves fast. The defenses that work today may be broken tomorrow. Build systems that can adapt.
The Uncomfortable Truth
Prompt injection might not be fully solvable at the current paradigm of LLM architecture. The model processes all input through the same mechanism. There’s no hardware-level separation between code and data, between instructions and content.
We’re essentially running user-provided code on a system with no memory protection, no privilege levels, no process isolation. Of course it’s vulnerable.
The long-term solution might require architectural changes to how models work - explicit separation of instruction processing from data processing, formal verification of instruction sources, cryptographic attestation of trusted prompts.
Until then, we’re in a defensive battle. Layered defenses, continuous monitoring, limited blast radius, and detection mechanisms that look at what the model is actually doing rather than just what it’s saying.
That’s not a satisfying answer. But it’s an honest one.
Further Reading:
- Comprehensive Review: Prompt Injection Attacks in LLMs and AI Agent Systems (Jan 2026)
- Adaptive Attacks Break Defenses (NAACL 2025)
- Multi-Agent Defense Pipeline (arXiv)
- Microsoft’s Defense Approach
- OWASP Top 10 for LLM Applications 2025
Working on prompt injection detection or defense? We’d like to hear about it: [email protected]