Answer · Published 4 days ago · Last edited 2 days ago · 30 sources

Gemma 4 QAT: A 31-Billion-Parameter Model Now Runs on Consumer GPUs and Smartphones

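Google's Gemma 4 quantization-aware training (QAT) checkpoints cut memory use by roughly 72% relative to 16-bit precision, which lets the 31-billion-parameter model run on a single consumer GPU and shrinks the E2B model to just 1 GB. The release covers five models, E2B, E4B, 12B, 26B A4B (MoE), and 31B, and supports several deployment formats, including compressed-tensors, GGUF/Q4_0, and a mobile-optimized schema. Naive conversion to Q4_0 can cause a large drop in accuracy, so the official checkpoints are recommended.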

Image: Google's QAT checkpoints compress Gemma 4 models by roughly 72%, enabling deployment on hardware from smartphones to consumer GPUs.

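Google has released official Quantization-Aware Training (QAT) checkpoints for the entire Gemma 4 model family. This is more than a routine technical update: it changes where high-performance AI models can run. Previously, compressing a finished 16-bit model after the fact (Post-Training Quantization, PTQ) came with an unavoidable loss of quality. QAT instead simulates quantization during training itself, so the model learns to compensate for the loss of precision. As a result, the final 4-bit deployment version cuts memory use by roughly 72% while staying close to the original's performance.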

The release spans five parameter sizes and also introduces a mobile-specific quantization format for smartphones. For developers and researchers whose hardware could not accommodate large models, it removes a significant barrier to entry.

Why this differs from conventional quantization

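Standard post-training quantization (PTQ) converts the weights of a fully trained model to lower precision, for example from bfloat16 to int4. The problem is that the model was never trained to operate at that lower precision, so quality degrades noticeably.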
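To make the source of that degradation concrete, here is a minimal sketch of round-to-nearest PTQ on a handful of weights. The symmetric code range of -7 to 7, the single shared scale, and the sample weights are illustrative assumptions, not details of Google's pipeline.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization of floats to 4-bit integer codes."""
    max_abs = max(abs(w) for w in weights)
    # Assumption: symmetric int4 range [-7, 7] with one scale for all weights.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Hypothetical weights from an already trained layer.
weights = [0.82, -0.40, 0.05, -0.03, 0.27, -0.66]

codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:    ", codes)
print("restored: ", [round(r, 3) for r in restored])
print("max error:", round(max_error, 4))
```

In this toy case the two smallest weights (0.05 and -0.03) collapse to code 0. The model was never trained to tolerate that kind of rounding, which is the weakness the paragraph above describes.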

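QAT, by contrast, builds quantization simulation into the training loop. The model sees quantized values during the forward and backward passes and learns to be robust to the narrower numeric range. As a result, it delivers "near original performance" even in 4-bit form: not a degraded copy of the 16-bit version, but a model of nearly equivalent quality.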
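The mechanism can be sketched with a toy one-weight model. This is an illustration of the general "fake quantization" idea using a straight-through estimator, a common way to implement QAT. The article does not say which estimator Google uses, and the fixed step size, data, and learning rate below are all hypothetical.

```python
SCALE = 0.1   # hypothetical fixed quantization step
QMAX = 7      # assumed symmetric 4-bit code range [-7, 7]


def fake_quant(w):
    """Forward pass: snap w onto the 4-bit grid, but keep it as a float."""
    code = max(-QMAX, min(QMAX, round(w / SCALE)))
    return code * SCALE


def ste_grad(w, upstream):
    """Backward pass: straight-through estimator.

    round() has zero gradient almost everywhere, so the upstream gradient is
    passed through unchanged, and blocked where the weight was clipped.
    """
    return upstream if abs(w / SCALE) <= QMAX else 0.0


# Toy regression data generated by a "true" weight of 0.52 (not on the grid).
xs = [1.0, 2.0]
ys = [0.52 * x for x in xs]


def loss(q):
    """Mean squared error of the model y = q * x."""
    return sum((q * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w = 0.0  # latent full-precision weight kept during training
initial_loss = loss(fake_quant(w))

for _ in range(200):
    q = fake_quant(w)  # the loss only ever sees the quantized weight
    grad_q = sum(2 * (q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= 0.05 * ste_grad(w, grad_q)  # update the latent weight

final_loss = loss(fake_quant(w))
print("quantized weight:", round(fake_quant(w), 2))
print("loss before/after:", round(initial_loss, 4), round(final_loss, 4))
```

The key design point is that the loss is always computed on the quantized weight while updates accumulate in a full-precision latent copy, so training optimizes the behavior the deployed 4-bit model will actually have.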

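The official checkpoints use the W4A16 scheme: 4-bit integer weights with 16-bit activations, a group_size of 32, and the compressed-tensors format. This matches the approach Google documents for vLLM-based inference, and pairing low-bit weights with high-precision activations balances memory savings against throughput.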
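A short sketch shows what a group size of 32 means in practice and what it costs in storage. It assumes one 16-bit scale per group and no zero point, which the article does not confirm. Under that assumption the arithmetic lands close to the roughly 72% figure quoted above, though real checkpoints may keep some tensors at higher precision.

```python
GROUP_SIZE = 32    # from the checkpoint description
WEIGHT_BITS = 4    # W4: 4-bit integer weights
SCALE_BITS = 16    # assumption: one 16-bit scale per group, no zero point


def quantize_groupwise(weights, group_size=GROUP_SIZE):
    """Quantize weights in groups, each with its own symmetric scale."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-7, min(7, round(w / scale))) for w in group)
    return codes, scales


def bits_per_weight(weight_bits=WEIGHT_BITS, scale_bits=SCALE_BITS,
                    group_size=GROUP_SIZE):
    """Average storage per weight, including the shared per-group scale."""
    return weight_bits + scale_bits / group_size


# Hypothetical weights: 64 values, so exactly two groups of 32.
weights = [((i * 37) % 64 - 32) / 100 for i in range(64)]
codes, scales = quantize_groupwise(weights)

bpw = bits_per_weight()            # 4 + 16/32 = 4.5 bits
reduction = 1 - bpw / 16           # relative to 16-bit weights
est_gb_31b = 31e9 * bpw / 8 / 1e9  # rough weight-only estimate for 31B

print("groups:", len(scales), "codes:", len(codes))
print("bits per weight:", bpw)
print("reduction vs 16-bit:", f"{reduction:.1%}")
print("31B weight footprint (GB, approx.):", round(est_gb_31b, 1))
```

Smaller groups track local weight ranges more closely at the price of more scale metadata; at a group size of 32 that overhead is half a bit per weight under the assumptions above.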
