Answer · Posted 2 months ago · Last edited last month · 29 sources

Open-Weight AI Guardrails Show Fundamental Flaws as EU Draws Its Regulatory Sword

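Jailbreak attacks against open-weight models such as Meta's Llama and Google's Gemma reach success rates approaching 100%, and multi-turn techniques have proven up to 10 times more effective than single-turn attempts. As the EU AI Act's general-purpose AI rules take full effect, systemic-risk investigations targeting major platforms have already begun.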

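The evidence is clear, and it is alarming. Academic research and industry security evaluations published through early 2026 show that the safety guardrails on widely deployed open-weight AI models are structurally fragile. Adaptive, multi-turn, and fine-tuning-based attacks defeat safety alignment with success rates close to 100%. Companies that self-host these models and serve EU users now face real regulatory risk under the EU AI Act.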

How Serious Are Jailbreak Attacks?

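The headline numbers are striking. A study presented at ICLR 2025 used simple adaptive techniques, with success judged by GPT-4, to achieve a 100% attack success rate against Llama-2-Chat (7B, 13B, 70B), Gemma-7B, and other leading safety-aligned models. A separate NeurIPS paper on Adaptive Dense-to-Sparse Constrained Optimization (ADC) reported the highest attack success rate on 7 of the 8 open-weight models it tested.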

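The vulnerability deepens when attackers use multi-turn conversations. When Cisco AI Defense tested eight open-weight models, multi-turn jailbreak success rates ranged from 25.86% to 92.78%, a 2x to 10x increase over single-turn attempts. The models tested included Llama 3.3 70B and Gemma 1B. The researchers concluded that they had identified "a systemic inability of current open-weight models to maintain safety guardrails across extended interactions."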
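The "2x to 10x" comparison is a ratio of two attack success rates. The sketch below shows how such a ratio could be computed from raw attempt counts. It is illustrative only: the `AttackTally` type, the model names, and every count in it are hypothetical and are not taken from the Cisco evaluation or any other study.

```python
from dataclasses import dataclass


@dataclass
class AttackTally:
    """Hypothetical per-model tally of jailbreak attempts (illustrative only)."""
    model: str
    attempts: int                # prompts tried in each setting
    single_turn_successes: int   # harmful completions from one-shot prompts
    multi_turn_successes: int    # harmful completions from multi-turn dialogues


def attack_success_rate(successes: int, attempts: int) -> float:
    """Return the attack success rate (ASR) as a percentage."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


def multi_turn_uplift(tally: AttackTally) -> float:
    """Return multi-turn ASR divided by single-turn ASR."""
    single = attack_success_rate(tally.single_turn_successes, tally.attempts)
    multi = attack_success_rate(tally.multi_turn_successes, tally.attempts)
    if single == 0:
        # No single-turn successes: the ratio is unbounded.
        return float("inf")
    return multi / single


# Made-up counts chosen only to show the arithmetic.
tallies = [
    AttackTally("model-a", attempts=200, single_turn_successes=20, multi_turn_successes=180),
    AttackTally("model-b", attempts=200, single_turn_successes=26, multi_turn_successes=52),
]

for t in tallies:
    single = attack_success_rate(t.single_turn_successes, t.attempts)
    multi = attack_success_rate(t.multi_turn_successes, t.attempts)
    print(f"{t.model}: single-turn {single:.2f}%, multi-turn {multi:.2f}%, "
          f"uplift {multi_turn_uplift(t):.1f}x")
```

A ratio like this depends heavily on the single-turn baseline: a model that already fails often in one turn can show a small uplift while still having a high absolute multi-turn success rate, so the two figures are best read together.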

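Even fine-tuning for benign use cases can break safety alignment. One study found that merely mixing a small amount of unsafe data into otherwise normal fine-tuning data substantially weakens the guardrails. Another paper confirmed that both open-weight fine-tuning and closed fine-tuning APIs can produce models whose safeguards have been removed entirely.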
