Answer published 4 days ago. Last edited the day before yesterday. 30 sources.

Gemma 4 QAT: How to Run a 31B Model on a Consumer GPU, and a 1GB Model on a Phone

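Google has released official Gemma 4 checkpoints trained with quantization-aware training (QAT). Compared with 16-bit precision, they cut memory use by roughly 72%, so the 31B model can run on a single consumer GPU and the E2B model shrinks to only about 1GB. The release covers five sizes: E2B, E4B, 12B, 26B A4B (MoE), and 31B. It ships in compressed-tensors, GGUF/Q4_0, and a new format designed specifically for phones. Note, however, that a careless conversion to Q4_0 can cause a large drop in accuracy.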

[Image: Gemma 4 QAT model compression, illustrated as a large neural network being compressed into a smartphone.] Google's QAT checkpoints compress Gemma 4 models by roughly 72%, enabling deployment on hardware from smartphones to consumer GPUs.

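Google recently released official quantization-aware training (QAT) checkpoints for the entire Gemma 4 family, a move that fundamentally changes the hardware these models can be deployed on. The conventional approach compresses a fully trained 16-bit model after the fact, which usually causes a noticeable loss in quality. QAT instead simulates the low-precision quantized environment during training, so the model adapts to the loss of numerical precision while it is still learning. The resulting 4-bit deployment version retains inference performance close to the original while cutting memory requirements by roughly 72%.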

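The release covers five parameter sizes and introduces, for the first time, a newly designed quantization format for mobile devices that pushes hardware efficiency to its limit. For developers and researchers who were previously locked out of large models by hardware requirements, the practical impact is immediate.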

Why Does QAT Matter More Than Standard Quantization?

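Standard post-training quantization (PTQ) converts a model's weights from high precision (for example bfloat16) to lower precision (for example int4) only after training is complete. The main problem is that the model never learned to maintain inference quality at such low precision, so the degradation after conversion is often significant.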
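To make the PTQ step concrete, here is a minimal sketch of symmetric round-to-nearest quantization applied to already-trained weights, one group at a time. The function names, the symmetric signed int4 range, and the per-group scale are illustrative assumptions, not the procedure of any particular conversion tool.

```python
import random


def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of floats."""
    qmax = 2 ** (bits - 1) - 1          # 7 for signed 4-bit
    qmin = -(2 ** (bits - 1))           # -8 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each weight to the nearest integer code and clamp to the int range.
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


def ptq_roundtrip(weights, group_size=32, bits=4):
    """Quantize then dequantize a flat weight list, group by group."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale = quantize_group(group, bits)
        restored.extend(dequantize_group(codes, scale))
    return restored


if __name__ == "__main__":
    random.seed(0)
    # Stand-in for trained weights (assumption: small Gaussian values).
    w = [random.gauss(0.0, 0.02) for _ in range(128)]
    w_hat = ptq_roundtrip(w)
    max_err = max(abs(a - b) for a, b in zip(w, w_hat))
    print(f"max absolute rounding error: {max_err:.6f}")
```

The rounding error per weight is bounded by half the group scale, but nothing in training prepared the model for that error, which is the weakness the paragraph above describes.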

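QAT, by contrast, builds quantization simulation directly into the training loop. During the forward and backward passes, weights are computed in their quantized state, so the model learns to work within the narrower numeric range and becomes robust to low-precision arithmetic. The result is a 4-bit model that comes close to the original's performance, rather than a crudely cut-down 16-bit model.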
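One common way to put quantization inside the training loop is "fake quantization" with a straight-through estimator (STE). The article does not say which technique Google used, so treat the following as an illustrative assumption on a toy linear model: the forward pass only ever sees quantized weights, and the gradient is passed straight through to full-precision latent weights. The names `fake_quantize` and `train_qat` are hypothetical.

```python
def fake_quantize(weights, bits=4):
    """Quantize then dequantize: floats that sit exactly on the int grid.

    The whole list is treated as a single group with one shared scale.
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [min(qmax, max(qmin, round(w / scale))) * scale for w in weights]


def train_qat(inputs, targets, steps=200, lr=0.05, bits=4):
    """Fit y = w . x while the forward pass uses only quantized weights."""
    latent = [0.1] * len(inputs[0])      # full-precision latent weights
    for _ in range(steps):
        w_q = fake_quantize(latent, bits)
        grads = [0.0] * len(latent)
        for x, y in zip(inputs, targets):
            pred = sum(wq * xi for wq, xi in zip(w_q, x))
            err = pred - y
            for i, xi in enumerate(x):
                # Gradient of mean squared error w.r.t. the quantized weight.
                grads[i] += 2.0 * err * xi / len(inputs)
        # Straight-through estimator: treat d(w_q)/d(latent) as 1 and
        # apply the gradient directly to the latent weights.
        latent = [w - lr * g for w, g in zip(latent, grads)]
    return latent, fake_quantize(latent, bits)


if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ys = [0.7, -0.3, 0.4]                # generated by w = [0.7, -0.3]
    latent, w_q = train_qat(xs, ys)
    print("latent weights:   ", latent)
    print("quantized weights:", w_q)
```

Because the loss is always measured on the quantized weights, the optimizer steers the latent weights toward values that still work after rounding, which is the adaptation that PTQ lacks.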

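According to the official specification, the checkpoints use a W4A16 configuration: weights are 4-bit integers, activations stay at 16 bits, weights are grouped with group_size=32, and the result is packaged in the compressed-tensors format. This is the same design that Google uses in its technical documentation on inference with vLLM. Combining low-precision weights with high-precision activations strikes a good balance between memory savings and model throughput.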
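A back-of-the-envelope calculation shows how W4A16 with group_size=32 relates to the roughly 72% figure. The sketch assumes one 16-bit scale per group, no zero point, every parameter quantized, and a nominal count of 31 billion parameters. None of these details is confirmed by the article, and real checkpoints may keep some tensors at higher precision.

```python
def bits_per_weight(weight_bits=4, group_size=32, scale_bits=16):
    """Effective storage per weight: its code plus its share of the group scale."""
    return weight_bits + scale_bits / group_size


def weight_memory_gb(num_params, bits):
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    params = 31e9                        # assumption: nominal 31B parameters
    bf16_gb = weight_memory_gb(params, 16)
    w4_gb = weight_memory_gb(params, bits_per_weight())
    saving = 1 - w4_gb / bf16_gb
    print(f"16-bit: {bf16_gb:.1f} GB, W4 g32: {w4_gb:.1f} GB, saving: {saving:.1%}")
```

Under these assumptions each weight costs 4.5 bits instead of 16, a saving of about 71.9% on weight storage, which is consistent with the roughly 72% reduction quoted above. Activations and the KV cache remain at 16 bits and are not counted here.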
