Answer published 4 days ago · Last edited 2 days ago · 30 sources

Hands-On With Google's Gemma 4 QAT Models: Running on a Phone in 1GB and a 31B Model on an RTX 3090. Where Do the Memory Savings Come From?

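Google's Quantization-Aware Training (QAT) checkpoints cut Gemma 4 memory use by about 72% compared with 16-bit, so the 31B model can run on a single consumer GPU, while the E2B mobile format compresses down to about 1GB [5][6]. Five sizes (E2B, E4B, 12B, 26B A4B, 31B) launched together, with deployment formats including compressed-tensors, GGUF/Q4_0, and a new mobile-specific format. Note, however, that naively converting QAT weights straight to Q4_0 can cause a large accuracy drop [5][7][18].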

[Figure: hero image for the article on Google's June 4 release of Gemma 4 QAT models, showing a large neural network being compressed into a smartphone. Caption: Google's QAT checkpoints compress Gemma 4 models by roughly 72%, enabling deployment on hardware from smartphones to consumer GPUs.]
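Google has released official Quantization-Aware Training (QAT) checkpoints for the entire Gemma 4 family, a move that changes where these models can practically run. The older approach trains a 16-bit model and compresses it afterward, which usually causes a noticeable drop in quality. QAT instead simulates the effect of quantization during training. The model learns to compensate for the loss of precision as it trains, so the final 4-bit deployment version performs very close to the original while using about 72% less memory.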

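The release covers five parameter sizes and introduces a new quantization format designed specifically for phones, pushing the limits further. For developers and researchers who have avoided large models because of hardware constraints, the practical impact is immediate.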

Why does QAT matter more than ordinary quantization?

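Ordinary post-training quantization (PTQ) takes the weights of a fully trained model and converts them to a lower precision, for example from bfloat16 to int4. The problem is that the model has never operated at that precision, so quality usually drops noticeably after conversion.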
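To make the conversion step concrete, here is a minimal sketch of round-to-nearest PTQ on one group of weights. It assumes symmetric int4 quantization with a single scale per group; the function names and the sample weights are illustrative and are not taken from any Gemma tooling.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Round-to-nearest symmetric quantization of one weight group (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1       # 7 for int4
    qmin = -(2 ** (bits - 1))        # -8 for int4
    max_abs = max(abs(w) for w in weights)
    # One scale per group, chosen so the largest weight maps to qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Hypothetical weights standing in for one group of a trained layer.
weights = [0.82, -0.11, 0.035, -0.47, 0.3, 0.005, -0.66, 0.19]
codes, scale = quantize_group(weights)
restored = dequantize_group(codes, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, max_err)
```

Each restored weight can be off by up to half a quantization step. A model trained only at 16-bit precision never saw that perturbation, which is one way to picture the quality drop described above.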

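QAT embeds quantization simulation directly in the training loop. During the forward and backward passes the model sees quantized values, so it learns to be more robust to the narrower numerical representation. The result is a model that delivers "near-original performance" in 4-bit form, rather than a degraded version of a 16-bit model.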
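The mechanics of "simulating quantization inside the training loop" can be sketched with a toy one-weight model. This example assumes a fixed scale and a straight-through estimator (STE) for the gradient, a common way to implement fake quantization; the source does not say which estimator Google used, and every name and value here is hypothetical.

```python
def fake_quant(w: float, scale: float, bits: int = 4) -> float:
    """Forward pass: snap the latent weight onto the int4 grid, then dequantize."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def ste_grad(w: float, scale: float, upstream: float, bits: int = 4) -> float:
    """Backward pass (STE assumption): treat rounding as identity inside the clip range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return upstream if qmin <= w / scale <= qmax else 0.0


def train(target: float = 0.33, scale: float = 0.1, lr: float = 0.1, steps: int = 200):
    """Fit y = target * x while the forward pass only ever sees the quantized weight."""
    xs = [-1.0, -0.5, 0.5, 1.0]
    w = 0.0  # latent full-precision weight kept by the optimizer
    for _ in range(steps):
        wq = fake_quant(w, scale)
        # Gradient of the mean squared error with respect to the quantized weight.
        grad_wq = sum(2 * (wq * x - target * x) * x for x in xs) / len(xs)
        w -= lr * ste_grad(w, scale, grad_wq)
    return w, fake_quant(w, scale)


latent_w, deployed_w = train()
print(latent_w, deployed_w)
```

The optimizer updates a full-precision latent weight, but the loss is always computed from its quantized value, so the deployed weight lands on the int4 grid next to the target. A single weight cannot show the accuracy benefit of QAT, which comes from many weights adjusting jointly; this sketch only illustrates the forward and backward plumbing.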

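The official checkpoints use a W4A16 scheme: 4-bit integer weights with 16-bit activations, group_size set to 32, stored in the compressed-tensors format. This is the approach Google documents in its vLLM inference guide; combining low-bit weights with high-precision activations balances memory savings against throughput.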
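A back-of-the-envelope calculation shows how W4A16 with group_size 32 relates to the memory figures quoted in this article. The sketch assumes one 16-bit scale per group of 32 weights, no zero-point, every weight quantized, and a nominal 31e9 parameters; real checkpoints may keep some tensors at higher precision, and the KV cache and activations need memory on top of the weights.

```python
def bits_per_weight(weight_bits: int = 4, scale_bits: int = 16, group_size: int = 32) -> float:
    """Effective storage per weight: the code itself plus its share of the group scale."""
    return weight_bits + scale_bits / group_size


def weight_memory_gib(num_params: float, bpw: float) -> float:
    """Weights-only memory in GiB for a given bits-per-weight figure."""
    return num_params * bpw / 8 / 2 ** 30


bpw = bits_per_weight()            # 4 + 16/32 = 4.5 bits per weight
saving = 1 - bpw / 16              # fraction saved relative to 16-bit weights
params_31b = 31e9                  # nominal parameter count (assumption)
fp16_gib = weight_memory_gib(params_31b, 16)
w4_gib = weight_memory_gib(params_31b, bpw)
print(f"{bpw} bits/weight, {saving:.1%} saved, {fp16_gib:.1f} GiB -> {w4_gib:.1f} GiB")
```

Under these assumptions the saving comes out just under 72%, which is consistent with the figure quoted above, and the 31B weights alone drop below the 24 GB of an RTX 3090. It is only an estimate and does not reproduce Google's own accounting.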
