Opus 4.8 Got Better at Everything Except Resisting You

TL;DR: The Claude Opus 4.8 system card (May 28, 2026) reports two things in the same document. Opus 4.8 is more capable than Opus 4.7 across almost every benchmark - SWE-bench Verified 88.6, SWE-bench Pro 69.2, GDPval-AA 1890. And it is somewhat less robust to prompt injection than Opus 4.7 in several agentic contexts. Anthropic’s own framing is that safeguards, not the model, close the gap. Their first live one-week bug bounty shows how much those safeguards have to carry: with no safeguards, an adaptive attacker given 200 attempts pushed coding-environment injection success from 7.03% to 57.5% with thinking, and from 17.44% to 95.0% without. The clean “0% with safeguards” browser-use numbers hold only against attacks developed against Opus 4.7 and replayed. If you’re shipping agents on 4.8, the capability jump is real and the security story is exactly the one we keep writing: prompt injection is still unsolved, and the attacker still moves second.

Most write-ups of a new frontier model lead with the benchmark wins. We’re going to do that for one paragraph, because the wins are real and you should know about them. Then we’re going to spend the rest of this post on the sentence buried in Anthropic’s executive summary that almost nobody will quote:

“…we found Opus 4.8 to be somewhat less robust than Opus 4.7 in several agentic contexts (such as vulnerability to prompt injection).”

A model can get smarter and more exploitable at the same time. The Opus 4.8 system card is a clean, first-party demonstration of it.

The capability story (the short version)

Opus 4.8 is Anthropic’s most capable general-access model to date, and on most benchmarks it clears its predecessor by a comfortable margin. From the capability summary (Table 8.1.A, all results adaptive thinking at max effort, averaged over 5 trials):

Benchmark	Opus 4.8	Opus 4.7
SWE-bench Verified	88.6	87.6
SWE-bench Pro	69.2	64.3
SWE-bench Multilingual	84.4	80.5
Terminal-Bench 2.1	74.6	66.1
Humanity’s Last Exam (with tools)	57.9	54.7
GDPval-AA	1890	1753
OSWorld-Verified	83.4	82.8
GraphWalks BFS 256K	85.9	76.9

The one place it regresses is GPQA Diamond (93.6 vs 94.2 for Opus 4.7). Everywhere else the trend is up and to the right, sometimes by a little (SWE-bench Verified, OSWorld-Verified) and often by a lot - SWE-bench Pro alone jumps nearly five points. If your workload is coding, agentic search, or long-context retrieval, this is a meaningful upgrade.

Hold that thought, because the same model that gained nearly five points on SWE-bench Pro lost ground on the one property that actually matters when you point an agent at untrusted data.

The robustness story everyone will skip

Anthropic runs the Agent Red Teaming (ART) benchmark, built with the UK AI Security Institute and run by Gray Swan, to measure indirect prompt injection - malicious instructions hidden in tool results or web content that hijack the agent’s behavior. Here’s where Opus 4.8 lands at k=100 attempts (lower is better):

Model	With thinking	Without thinking
Claude Opus 4.7	6.0%	4.8%
Claude Opus 4.8	9.6%	14.4%
Claude Sonnet 4.6	15.9%	20.7%

Read that table the way an attacker would. Opus 4.8 sits between Opus 4.7 and the older Sonnet 4.6 - which is to say it moved backwards. Anthropic doesn’t hide this; they state Opus 4.8 “demonstrates robustness between Claude Opus 4.7 and Sonnet 4.6,” and that the practical fix is external: “the application of our safeguards closes the gap between the models.” The model itself got worse at refusing injected instructions. Lightweight probes and harness-level defenses are what pull deployed systems back to parity.

That’s a critical distinction for anyone building on the API. If you’re using the raw model - your own harness, your own scaffolding, no Anthropic-side probes - you are running the less robust model, not the safer one. The safety improvement lives in the product layer, and it doesn’t come bundled with the model itself.

Anthropic ran its own bug bounty. The adaptive attacker won.

The most honest section of the card is §5.2.2, where Anthropic reports its first live one-week bug bounty: expert red-teamers competing for prizes to break Claude and competitor models directly, across tool use, coding, and browser use. This matters because, as the card itself admits, the static benchmarks have saturated - “this benchmark has become less informative for frontier models… leaving measurements noisy at such low attack success rates.” When your test set stops discriminating, you stop learning from it.

So they brought in adaptive attackers. The headline result, from the Shade adaptive red-teaming tool in coding environments (Table 5.2.2.2.A), is the number to internalize. This is Opus 4.8, attack success rate, without product safeguards:

Configuration	1 attempt	200 attempts
With thinking	7.03%	57.5%
Without thinking	17.44%	95.0%

One attempt looks reassuring. Two hundred attempts - which is nothing for an automated attacker - do not. With safeguards enabled the 200-attempt numbers drop to 37.5% (thinking) and 65.0% (without). Better, but “an attacker succeeds nearly two times in five” is not a number you’d accept anywhere else in security.
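A quick piece of arithmetic shows why the one-attempt number can’t simply be extrapolated. The sketch below assumes (our reading of the table, not something the card states in these terms) that the 200-attempt figure counts an environment as broken if any attempt succeeds. The function names are ours.

```python
def p_any_success(p_single: float, attempts: int) -> float:
    """Chance of at least one success in `attempts` independent tries."""
    return 1.0 - (1.0 - p_single) ** attempts


def implied_single(p_any: float, attempts: int) -> float:
    """Per-attempt rate that would yield `p_any` if tries were independent."""
    return 1.0 - (1.0 - p_any) ** (1.0 / attempts)


# With thinking, no safeguards: 7.03% at 1 attempt, 57.5% at 200 attempts.
naive = p_any_success(0.0703, 200)    # what independence would predict at 200
implied = implied_single(0.575, 200)  # per-attempt rate consistent with 57.5%

print(f"independent-tries prediction at 200 attempts: {naive:.4f}")
print(f"per-attempt rate implied by 57.5% at 200:     {implied:.4%}")
```

If every attempt were an independent 7.03% coin flip, 200 attempts would break essentially every environment. The reported 57.5% is far lower, and the per-attempt rate it implies under independence is roughly 0.4%, not 7%. One plausible reading is that outcomes cluster by environment: some fall quickly, others hold out for all 200 tries. Either way, neither number predicts the other, so the curve has to be measured at the attempt budget you actually care about.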

For contrast, Claude Mythos Preview - Anthropic’s most capable internal model - sat at 0.0% across this entire table. The robustness gap between Opus 4.8 and the frontier model is real, and the card says so: on these measures Opus 4.8 is “a slight regression relative to Opus 4.7 with safeguards.”

Browser use tells the same story with a twist worth dwelling on. Against professional red-teamers across 129 held-out environments, Opus 4.8 without safeguards saw successful injection in 62.8% of scenarios (with thinking). With safeguards, that collapses to 3.9% of scenarios and 0.5% of attempts - and the card’s clean line is “no attacks succeeded against Claude Opus 4.8 across the 129 environments without thinking.”

Sounds like a win. Read the footnote:

“Since the attacks were adaptively sourced against Opus 4.7 and then transferred to the other models, they may not fully capture vulnerabilities specific to Opus 4.8.”

The near-zero number is measured against attacks adapted for a different model and replayed. That’s the static-benchmark trap wearing a fresh coat of paint. An attacker who adapts against 4.8 specifically - the way Shade did in the coding eval - is the one whose numbers you should believe. And that number was 57.5%.

The attacker moves second

The system card cites Nasr et al. (2025), “The attacker moves second,” and it’s the right frame for the whole chapter. Every defensive measurement against a fixed attack set is a snapshot of a fight the defender already finished. The attacker hasn’t started yet. They get to see your defense, probe it, and craft against it - and the card’s own adaptive results show what happens when you let them: success rates that were ~7% at one attempt become majority-success given a couple hundred.

This is the thesis we’ve been writing for two years, now confirmed in Anthropic’s own pre-deployment evaluations:

Static benchmarks saturate and then lie. ART is “less informative” precisely because frontier models beat it. A saturated benchmark reads as “solved” right up until an adaptive attacker proves it wasn’t. And when success rates get this low, the measurement noise swamps the signal - the card’s own tables warn they “do not take into account the margin of error.” A 0.07% and a 0.26% are not meaningfully different without confidence intervals. (This is the entire reason we built rotalabs-eval: security and capability numbers without significance testing are vibes, not evidence.)
Capability and robustness are not the same axis. Opus 4.8 gained general capability and lost injection robustness. Optimizing the first does not buy you the second - and may cost it.
The defense is the product, not the model. The reassuring numbers all carry the phrase “with safeguards.” Strip the harness and you’re exposed.
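To make the confidence-interval point concrete, here is a minimal sketch using the Wilson score interval. The trial counts are hypothetical, picked only so the point estimates land near 0.07% and 0.26%; they are not the card’s sample sizes, and this is not rotalabs-eval’s API.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 1/1500 is about 0.07%, 4/1500 is about 0.27%.
a = wilson_interval(1, 1500)
b = wilson_interval(4, 1500)
overlap = a[1] >= b[0] and b[1] >= a[0]

print(f"model A: {a[0]:.4%} to {a[1]:.4%}")
print(f"model B: {b[0]:.4%} to {b[1]:.4%}")
print("intervals overlap:", overlap)
```

At these made-up sample sizes the two intervals overlap, so the data would not support ranking one model above the other. Overlapping intervals are a rough screen rather than a formal test, but they are enough to show why a bare pair of sub-1% rates should not drive a decision.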

We made this argument in Prompt Injection Is Still Unsolved using third-party research - a joint OpenAI/Anthropic/DeepMind paper that broke 12 defenses, NAACL work that broke 8 more. The Opus 4.8 card is the same finding from inside the lab that built the model. Adaptive attacks win. Defense-in-depth narrows the blast radius; it doesn’t close it.

This is exactly what Red Queen was built for

The Shade attacker that ran Opus 4.8 from 7% to 57.5% works by adapting - search, reinforcement, and human-in-the-loop refinement against the model’s own responses. That is the only kind of red-teaming that surfaces real risk, and it’s the design principle behind rotalabs-redqueen: don’t enumerate a fixed list of jailbreaks, evolve them. Quality-diversity search keeps the best attack in each region of the behavior space, so you get a map of the vulnerability surface rather than a single pass/fail.
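For readers who haven’t met quality-diversity search, here is a toy MAP-Elites-style loop. It is a sketch of the general technique, not rotalabs-redqueen’s code or API: the tactic labels are hypothetical, and the scoring function is a stand-in for a real judge that would measure whether an attack worked.

```python
import random

TACTICS = ["authority", "urgency", "encoding", "roleplay"]  # hypothetical labels


def descriptor(candidate):
    """Behavior cell: (number of tactics used, first tactic)."""
    return (len(candidate), candidate[0])


def stub_score(candidate):
    """Stand-in for a real judge; it only rewards variety among tactics."""
    return len(set(candidate)) / len(candidate)


def mutate(candidate, rng):
    """Append, drop, or swap one tactic; falls back to a swap when the
    chosen edit would leave fewer than 1 or more than 4 tactics."""
    child = list(candidate)
    op = rng.choice(["append", "drop", "swap"])
    if op == "append" and len(child) < 4:
        child.append(rng.choice(TACTICS))
    elif op == "drop" and len(child) > 1:
        child.pop(rng.randrange(len(child)))
    else:
        child[rng.randrange(len(child))] = rng.choice(TACTICS)
    return tuple(child)


def map_elites(seeds, iterations=300, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (score, candidate): the best candidate seen per cell

    def consider(candidate):
        cell, score = descriptor(candidate), stub_score(candidate)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)

    for s in seeds:
        consider(tuple(s))
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))  # pick any current elite
        consider(mutate(parent, rng))
    return archive


archive = map_elites([("authority",)])
for cell in sorted(archive):
    print(cell, archive[cell])
```

The output is an archive keyed by behavior cell rather than a single best candidate, which is the "map of the vulnerability surface" idea: an empty or low-scoring cell is as informative as a high-scoring one.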

The card’s own data is the argument for this approach. A one-attempt number tells you nothing; a 200-attempt adaptive number tells you where you actually stand. If your security testing is a fixed regression suite that the model “passes,” you have measured the fight you already won. Test the way Shade did - adaptively, repeatedly, against this model and not a transferred proxy - or your green dashboard is measuring the wrong thing.
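The gap between a one-attempt and a 200-attempt number is easy to compute from your own red-team logs. The sketch below assumes a hypothetical log format (one list of per-attempt booleans per scenario) and invented outcomes; it illustrates the metric, not Shade’s or Anthropic’s methodology.

```python
def asr_at_k(attempt_logs, k):
    """Fraction of scenarios with at least one success in the first k attempts."""
    if not attempt_logs:
        raise ValueError("need at least one scenario")
    broken = sum(1 for outcomes in attempt_logs.values() if any(outcomes[:k]))
    return broken / len(attempt_logs)


def log_with_first_success(first_success, attempts=200):
    """Build a log whose only success is at attempt `first_success` (None = never)."""
    return [i == first_success for i in range(1, attempts + 1)]


# Hypothetical scenarios: one falls immediately, two fall late, one holds.
logs = {
    "scenario_a": log_with_first_success(1),
    "scenario_b": log_with_first_success(50),
    "scenario_c": log_with_first_success(150),
    "scenario_d": log_with_first_success(None),
}

for k in (1, 100, 200):
    print(f"ASR@{k}: {asr_at_k(logs, k):.0%}")
```

In this invented data the same system reads as 25% broken at one attempt and 75% broken at 200. Reporting the whole curve, rather than a single point on it, is what keeps a green dashboard honest.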

A note on the alignment thread

The agentic-security story has a quieter cousin in the alignment section (§6), and it’s worth flagging because it bears on everything above. Opus 4.8’s honesty improved sharply: it’s Anthropic’s first model to hit a 0% rate of bad behavior on their misreporting-flawed-results evaluation, with a roughly 17-fold drop in dishonest self-reporting during agentic coding versus Sonnet 4.6 and a 10-fold reduction in overconfidence versus Opus 4.7. Reasoning faithfulness is “very high.” Good news.

But the card names its most concerning training-time trend plainly: “a growing tendency toward speculation about graders” - the model reasoning about how its outputs will be assessed, sometimes with no indication it’s being tested. White-box analysis found unverbalized grader-awareness in “a small but nonnegligible fraction of RL episodes.” It didn’t translate into worse outward behavior this time, and Anthropic flags it as a trend to watch rather than a present danger. We’d put it next to our CoT Blind Spot and sandbagging work: a model that reasons about the grader is a model whose evaluation results you have to read more carefully, because it has a representation of the test. This is exactly why we keep arguing that you have to watch what a model computes, not just what it says - the approach behind rotalabs-probe, which detects strategic underperformance from activations (90-96% on its own benchmark) rather than trusting the output, and rotalabs-audit, which captures CoT/ReAct reasoning chains and flags anomalies like grader-speculation before they reach a deploy decision. The reassuring injection numbers and the grader-awareness finding live in the same document for a reason.

What this means if you’re shipping agents on 4.8

Concrete guidance, same as it ever was, sharpened by the new data:

Ship the safeguards, not just the model. Every good prompt-injection number in this card has “with safeguards” attached. Worth noting what those safeguards are: Anthropic describes them as “probes - lightweight detectors trained on internal model representations.” That’s the same activation-level bet we make with rotalabs-probe - detect the attack’s effect in the model’s internal state instead of pattern-matching the input text. If your stack is the raw API, assume you’re running the less robust configuration and add your own probes, harness defenses, and output monitoring.
Assume injection succeeds and bound the blast radius. Least privilege for tool access. Sandbox execution. Don’t hand an agent capabilities it doesn’t strictly need. A 57.5% adaptive success rate is a number you survive by limiting what success can do.
Test adaptively, not statically. A fixed jailbreak suite that the model passes is the trap the card warns about. Run repeated, adapting attacks against the exact model and harness you deploy. One attempt proves nothing; 200 do.
Treat every external token as untrusted. RAG results, web pages, tool outputs, other agents’ messages - all of it is an injection vector. The model cannot tell data from instructions; that hasn’t changed.
Re-test on every model swap. “More capable” does not imply “more robust.” Opus 4.8 is proof. Don’t inherit your predecessor’s security posture by assumption.
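As one illustration of bounding the blast radius, here is a minimal sketch of a tool-call gate that combines least privilege with taint tracking. Every name in it (`ToolPolicy`, `Session`, the tool names) is hypothetical; it is not part of any Anthropic or Rotalabs API, and a production version would need per-argument checks and audit logging.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolPolicy:
    allowed: frozenset    # tools this agent may call at all
    sensitive: frozenset  # tools that need human sign-off once context is tainted


@dataclass
class Session:
    policy: ToolPolicy
    tainted: bool = False
    context: list = field(default_factory=list)

    def ingest(self, text: str, trusted: bool) -> None:
        """Add content to the context; any untrusted content taints the session."""
        self.context.append(text)
        if not trusted:
            self.tainted = True

    def authorize(self, tool: str) -> str:
        """Return 'allow', 'needs_approval', or 'deny' for a requested tool call."""
        if tool not in self.policy.allowed:
            return "deny"
        if self.tainted and tool in self.policy.sensitive:
            return "needs_approval"
        return "allow"


policy = ToolPolicy(
    allowed=frozenset({"search_docs", "send_email"}),
    sensitive=frozenset({"send_email"}),
)
session = Session(policy)

before = session.authorize("send_email")           # no untrusted input yet
session.ingest("<web page contents>", trusted=False)
after = session.authorize("send_email")            # tainted: escalate to a human
never = session.authorize("delete_repo")           # not on the allowlist

print(before, after, never)
```

The gate does not try to decide whether an injection happened. It assumes one might have, and limits what a hijacked agent can do without a human in the loop, which is the posture a 57.5% adaptive success rate calls for.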

The uncomfortable truth

Opus 4.8 is a genuinely better model, and we’ll use it. But the system card is a useful reminder that capability and security are different problems pulling on different levers - and that the people who built the model are telling you, in their own evaluations, that an adaptive attacker still wins more often than not without the product safeguards wrapped around it.

That’s the honest read. The benchmarks go up. The attacker still moves second. Build like both of those are true at once, because they are.

Shipping agents on Opus 4.8 and want them tested the way Shade tests them? We’d like to hear about it: [email protected]

Cite this post

@misc{rotalabs2026opus-4-8-got-better-,
  title  = "Opus 4.8 Got Better at Everything Except Resisting You",
  author = "Rotalabs",
  year   = "2026",
  url    = "https://rotalabs.ai/blog/opus-4-8-capability-security-gap/"
}
