The evidence is in, and it's damning. Academic studies and industry security assessments published through early 2026 reveal that the safety guardrails on widely deployed open-weight models are systemically fragile. Adaptive, multi-turn, and fine-tuning-based attacks can bypass alignment with near-100% success rates. Companies that self-host these models and serve EU users now face concrete regulatory risk under the EU AI Act.
The headline numbers are stark. An ICLR 2025 study achieved 100% attack success rates on Llama-2-Chat (7B, 13B, and 70B), Gemma-7B, and other leading safety-aligned models using simple adaptive techniques judged by GPT-4 . A separate NeurIPS paper using Adaptive Dense-to-Sparse Constrained Optimization (ADC) reported the highest attack success rates on seven of eight open-weight models tested
.
The real-world vulnerability deepens when attackers use multi-turn conversations. Cisco AI Defense tested eight open-weight models and found multi-turn jailbreak success rates ranging from 25.86% to 92.78% — a 2x to 10x increase over single-turn baselines . The affected models included Llama 3.3 70B, Gemma 1B, and others
. The researchers concluded there is a "systemic inability of current open-weight models to maintain safety guardrails across extended interactions"
.
Even fine-tuning intended for innocent use cases can destroy safety alignment. One study showed that mixing small amounts of unsafe data with benign fine-tuning data significantly weakens guardrails . Another paper confirmed that both open-weight fine-tuning and closed fine-tuning APIs can produce models with safeguards entirely removed
.
Several recently documented techniques demonstrate just how easy jailbreaking has become.
Sockpuppeting injects a fake "acceptance" into the assistant's prefilled response, exploiting a model's tendency toward self-consistency. It requires no optimization, no model weights, and no specialized tooling—just API access that supports assistant prefill. In April 2026 tests, every model that accepted the prefill was at least partially vulnerable, including GPT-4o, Claude 4 Sonnet, and Gemini 2.5 Flash .
Paper-derived attacks represent an alarming meta-vulnerability. A 2026 study found that using content from published LLM safety papers as prompts achieves 97–98% attack success rates on well-aligned models, including closed-weight systems like Claude 3.5 Sonnet .
Safety steering amplification demonstrates how techniques meant to improve safety can backfire. Inference-time activation steering intended to reduce "over-refusal" on benign queries was found to inadvertently amplify jailbreak vulnerabilities in models like Llama 3.1 8B and Gemma 2 2B .
Reasoning guardrail subversion is among the most concerning new vectors. A March 2026 study discovered that adding just a few template tokens to an input prompt can hijack reasoning-based safety guardrails. Once compromised, these reasoning systems can produce even more harmful outputs than models without such guardrails .
The EU AI Act's General-Purpose AI (GPAI) rules entered into force in August 2025 . Any model trained above 10²⁵ floating-point operations (FLOPs) — a threshold that captures Llama 4.2 Ultra and every major commercial model — is classified as presenting systemic risk
.
The implications for companies are immediate:
Open-source carve-outs exist but have clear limits. Models released under free and open-source licenses without monetization sit largely outside the strictest obligations , but the exemption vanishes immediately if the model poses systemic risk
. The EU's May 2026 rewrite reaffirmed this boundary
. Meta's Llama community license has already been flagged as not qualifying for the open-source exemption
.
Enforcement is now live, not theoretical. In early 2026, the EU launched high-stakes systemic risk investigations into major platforms, including Meta, demanding unprecedented transparency into training datasets and safety guardrails .
The evidence of vulnerability is fueling market pressure for stronger safety retrofitting. One 2025 study demonstrated that training on just 2,000 safety samples — costing approximately $3 for 8B models and $20 for 72B models — could reduce attack success rates by 10–30%. The most successful attack methods were reduced to around 5% success after retrofitting .
The economics suggest that low-cost retrofitting is feasible, but it has not yet become standard practice across the open-weight ecosystem. As regulatory pressure mounts and the attack landscape sharpens, companies deploying these models in production may find the $20 insurance policy increasingly hard to justify skipping.
Studio Global AI
Use this topic as a starting point for a fresh source-backed answer, then compare citations before you share it.
Jailbreak attacks on open weight models like Meta's Llama and Google's Gemma now achieve near 100% success rates, with multi turn methods proving 2x to 10x more effective than single attempts.
Jailbreak attacks on open weight models like Meta's Llama and Google's Gemma now achieve near 100% success rates, with multi turn methods proving 2x to 10x more effective than single attempts. The EU AI Act's General Purpose AI rules are now being enforced, with systemic risk investigations into major platforms already underway.
Low cost retrofitting techniques using as few as 2,000 safety samples can reduce attack success rates by 10 30%, but adoption is not yet standard across the industry.
Loading comments...
Comments
0 comments