Even fine-tuning intended for innocent use cases can destroy safety alignment. One study showed that mixing small amounts of unsafe data with benign fine-tuning data significantly weakens guardrails . Another paper confirmed that both open-weight fine-tuning and closed fine-tuning APIs can produce models with safeguards entirely removed
.
Several recently documented techniques demonstrate just how easy jailbreaking has become.
Sockpuppeting injects a fake "acceptance" into the assistant's prefilled response, exploiting a model's tendency toward self-consistency. It requires no optimization, no model weights, and no specialized tooling—just API access that supports assistant prefill. In April 2026 tests, every model that accepted the prefill was at least partially vulnerable, including GPT-4o, Claude 4 Sonnet, and Gemini 2.5 Flash .
Paper-derived attacks represent an alarming meta-vulnerability. A 2026 study found that using content from published LLM safety papers as prompts achieves 97–98% attack success rates on well-aligned models, including closed-weight systems like Claude 3.5 Sonnet .
Safety steering amplification demonstrates how techniques meant to improve safety can backfire. Inference-time activation steering intended to reduce "over-refusal" on benign queries was found to inadvertently amplify jailbreak vulnerabilities in models like Llama 3.1 8B and Gemma 2 2B .
Reasoning guardrail subversion is among the most concerning new vectors. A March 2026 study discovered that adding just a few template tokens to an input prompt can hijack reasoning-based safety guardrails. Once compromised, these reasoning systems can produce even more harmful outputs than models without such guardrails .
The EU AI Act's General-Purpose AI (GPAI) rules entered into force in August 2025 . Any model trained above 10²⁵ floating-point operations (FLOPs) — a threshold that captures Llama 4.2 Ultra and every major commercial model — is classified as presenting systemic risk
.
The implications for companies are immediate:
Open-source carve-outs exist but have clear limits. Models released under free and open-source licenses without monetization sit largely outside the strictest obligations , but the exemption vanishes immediately if the model poses systemic risk
. The EU's May 2026 rewrite reaffirmed this boundary
. Meta's Llama community license has already been flagged as not qualifying for the open-source exemption
.
Enforcement is now live, not theoretical. In early 2026, the EU launched high-stakes systemic risk investigations into major platforms, including Meta, demanding unprecedented transparency into training datasets and safety guardrails .
The evidence of vulnerability is fueling market pressure for stronger safety retrofitting. One 2025 study demonstrated that training on just 2,000 safety samples — costing approximately $3 for 8B models and $20 for 72B models — could reduce attack success rates by 10–30%. The most successful attack methods were reduced to around 5% success after retrofitting .
The economics suggest that low-cost retrofitting is feasible, but it has not yet become standard practice across the open-weight ecosystem. As regulatory pressure mounts and the attack landscape sharpens, companies deploying these models in production may find the $20 insurance policy increasingly hard to justify skipping.
Comments
0 comments