Answer published 2 months ago. Last edited last month. 29 sources.

Safety guardrails on open-weight models are all but useless, and EU enforcement is about to land

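Jailbreak attacks on open-weight models such as Meta's Llama and Google's Gemma now succeed close to 100% of the time, and multi-turn methods are 2 to 10 times as effective as single-turn attacks. The general-purpose AI rules of the EU AI Act are formally in effect, and investigations into system-wide risk assessments of major platforms such as Meta have begun. A low-cost safety-hardening technique that needs only 2,000 safety samples (about $3 for an 8B model) can cut attack success rates by 10% to 30%, but it has not yet become an industry standard.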

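The evidence is conclusive and alarming. As of early 2026, multiple academic studies and industry security assessments point to the same harsh reality: the safety guardrails on widely deployed mainstream open-weight large language models are systematically fragile. Adaptive attacks, multi-turn attacks, and fine-tuning-based attacks can each bypass safety alignment with success rates approaching 100%. For companies that self-host these models and serve European users, the EU AI Act now hangs overhead like a regulatory sword.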

How bad is the jailbreak problem?

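The headline numbers are uncomfortably blunt. A paper accepted at ICLR 2025, a top machine learning conference, shows that simple adaptive techniques are enough for attackers to reach a 100% attack success rate against Llama-2-Chat (7B, 13B, and 70B parameter versions), Gemma-7B, and other mainstream safety-aligned models. Another paper, published at NeurIPS, uses an adaptive dense-to-sparse constrained optimization method and achieves the highest attack success rate on seven of the eight open-weight models tested.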

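Real-world vulnerability gets worse when attackers use multi-turn strategies. Cisco's AI Defense team tested eight open-weight models and found multi-turn jailbreak success rates ranging from 25.86% to 92.78%, which is 2 to 10 times the single-turn baseline. Affected models include Llama 3.3 70B and Gemma 1B, among other mainstream choices. The researchers concluded that this reveals "a systemic inability of current open-weight models to maintain safety guardrails across extended interactions."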
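To make the "2 to 10 times" comparison concrete, here is a minimal sketch of how an attack success rate (ASR) and a multi-turn multiplier could be computed from raw evaluation tallies. The `EvalTally` type, the function names, and the counts are all hypothetical illustrations; they are not taken from the Cisco evaluation or any other study cited here.

```python
from dataclasses import dataclass


@dataclass
class EvalTally:
    """Outcome counts for one model under one attack setting (hypothetical)."""
    attempts: int   # adversarial prompts or conversations tried
    successes: int  # attempts judged to have bypassed the guardrails


def attack_success_rate(tally: EvalTally) -> float:
    """Return the ASR as a percentage of attempts that succeeded."""
    if tally.attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * tally.successes / tally.attempts


def multi_turn_multiplier(single: EvalTally, multi: EvalTally) -> float:
    """Return how many times higher the multi-turn ASR is than the single-turn ASR."""
    single_asr = attack_success_rate(single)
    if single_asr == 0:
        raise ValueError("single-turn ASR is zero; the ratio is undefined")
    return attack_success_rate(multi) / single_asr


# Made-up counts for illustration only, not measured results.
single_turn = EvalTally(attempts=200, successes=30)
multi_turn = EvalTally(attempts=200, successes=120)

print(f"single-turn ASR: {attack_success_rate(single_turn):.2f}%")
print(f"multi-turn ASR:  {attack_success_rate(multi_turn):.2f}%")
print(f"multiplier:      {multi_turn_multiplier(single_turn, multi_turn):.1f}x")
```

With these illustrative counts, the single-turn ASR is 15%, the multi-turn ASR is 60%, and the multiplier is 4x, which would sit inside the 2 to 10 times range reported above. How a real evaluation judges "success" (human review, a classifier, or a rubric) is a separate design choice that this sketch does not model.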

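Even well-intentioned fine-tuning can destroy safety alignment. One study shows that mixing a small number of unsafe samples into otherwise benign fine-tuning data significantly weakens a model's safety guardrails. Another paper confirms that fine-tuning, whether on open weights or through a closed-source API, can produce models whose safety mechanisms have been removed entirely.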
