
Gemma 4 QAT: A 31B Model on a Consumer GPU, and a Roughly 1 GB Model for Phones

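Google has released Quantization-Aware Training (QAT) versions of the entire Gemma 4 family, cutting model memory footprint by about 72%. The 31B model can now run on a single consumer GPU, and the small E2B model is compressed to roughly 1 GB. The release covers five sizes (E2B, E4B, 12B, 26B A4B, and 31B) and ships in compressed-tensors, GGUF/Q4_0, and a new mobile-specific quantization format. One caveat: a naive Q4_0 conversion of the original weights can lose substantial accuracy (the 26B model's top-1 accuracy is only 70.2%).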


Image: Gemma 4 QAT model compression, illustrated as a large neural network compressed into a smartphone. Google's QAT checkpoints, released June 4, compress Gemma 4 models by roughly 72%, enabling deployment on hardware from smartphones to consumer GPUs.

For a long time, running a "large" model such as a 31B on a personal computer seemed reserved for hardware enthusiasts. A recent Google release changes that.

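Google has officially released Quantization-Aware Training (QAT) checkpoints for the entire Gemma 4 family. This is not compression bolted on after training is finished; the model learns during training to cope with a low-precision environment. As a result, the 4-bit models perform almost on par with the original 16-bit versions while using about 72% less memory.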

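The release covers five model sizes and introduces a new quantization format designed specifically for phones. For developers and researchers whose hardware has kept them watching large models from a distance, it makes hands-on experimentation practical.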

Why does QAT beat traditional quantization?

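Traditional post-training quantization (PTQ) first trains the model at full precision and then cuts the precision of its parameters, for example from bfloat16 to int4. Because the model never learned to operate at low precision, quality often drops noticeably.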
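To make the PTQ step concrete, here is a minimal sketch of symmetric round-to-nearest quantization to the int4 range. It is an illustration only, not the conversion pipeline used for Gemma 4: the function names, the single per-tensor scale, and the weight values are all assumptions made up for this example.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to the int4 range [-8, 7].

    Hypothetical helper: one scale for the whole list (per-tensor).
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Divide by the scale, round to the nearest integer, clamp to int4.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map int4 codes back to approximate float weights."""
    return [c * scale for c in codes]


# Made-up weights standing in for an already-trained layer.
weights = [0.82, -0.41, 0.05, -0.93, 0.27, 0.64]

codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)

# Rounding error the model was never trained to tolerate.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each restored weight can be off by up to half a quantization step. PTQ applies this rounding after training ends, so nothing in the model has had a chance to compensate for it.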

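QAT instead simulates quantization during training. The model sees quantized values in its forward and backward passes and actively learns to preserve accuracy within the narrower numeric representation. The resulting 4-bit version performs close to the original rather than being a degraded copy of it.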
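One common way to simulate quantization during training is "fake quantization" combined with a straight-through estimator (STE): the forward pass snaps the weight to the low-precision grid, and the backward pass treats the rounding as if it were the identity. The sketch below shows a single training step for one scalar weight. The source does not say which recipe Google used, so the STE, the fixed scale, and every function name here are assumptions for illustration.

```python
def fake_quant(w, scale, qmin=-8, qmax=7):
    """Forward pass: snap w to the int4 grid but keep it as a float."""
    code = max(qmin, min(qmax, round(w / scale)))
    return code * scale


def ste_grad(w, upstream_grad, scale, qmin=-8, qmax=7):
    """Backward pass (assumed STE variant): pass the gradient through
    unchanged inside the clipping range, and block it outside."""
    inside = qmin * scale <= w <= qmax * scale
    return upstream_grad if inside else 0.0


def train_step(w, x, y, scale, lr):
    """One SGD step on squared error, with the weight fake-quantized."""
    q = fake_quant(w, scale)             # the model only "sees" q
    pred = q * x
    grad_q = 2.0 * (pred - y) * x        # d(loss)/d(q)
    grad_w = ste_grad(w, grad_q, scale)  # straight-through to the float weight
    return w - lr * grad_w


scale = 0.25   # hypothetical fixed grid step
w = 0.30       # full-precision "shadow" weight kept during training
w_next = train_step(w, x=1.0, y=0.8, scale=scale, lr=0.1)
print(fake_quant(w, scale), w_next)
```

The loss is computed from the quantized value, so the update already accounts for rounding error. The full-precision weight is kept only so that small gradient steps can accumulate until the weight crosses to the next grid point.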

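The official checkpoints in this release use a W4A16 scheme: 4-bit integer weights with 16-bit activations, a group_size of 32, and the compressed-tensors format. Pairing low-bit weights with higher-precision activations balances memory savings against inference throughput.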
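A rough back-of-the-envelope calculation shows how a 4-bit, group_size 32 scheme relates to the reported savings. The sketch assumes one 16-bit scale per group of 32 weights, no zero-points, every parameter quantized, and "31B" meaning 31e9 parameters. None of these details are confirmed by the source, and the estimate covers weights only, not activations or the KV cache.

```python
def bits_per_weight(weight_bits=4, group_size=32, scale_bits=16):
    """Effective storage per weight: the code itself plus its share of
    the per-group scale (assumed 16-bit, one per group)."""
    return weight_bits + scale_bits / group_size


def weight_memory_gb(num_params, bits):
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


bpw = bits_per_weight()        # 4 + 16/32 bits per weight
reduction = 1 - bpw / 16       # relative to 16-bit weights

params_31b = 31e9              # assumed parameter count for "31B"
bf16_gb = weight_memory_gb(params_31b, 16)
w4_gb = weight_memory_gb(params_31b, bpw)

print(f"{bpw} bits/weight, {reduction:.1%} smaller")
print(f"31B weights: {bf16_gb:.1f} GB at 16-bit, {w4_gb:.1f} GB at W4")
```

Under these assumptions the overhead of the scales brings the effective cost to 4.5 bits per weight, which works out to a reduction of about 72% versus 16-bit storage. That is consistent with the figure quoted for the release, although the source does not say how its number was derived.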
