Five model sizes received QAT checkpoints, plus matching drafter models for speculative decoding. Each is available in multiple formats (discussed below), and the practical memory footprints shift dramatically between BF16 and QAT 4-bit.
| Model | Architecture | Active Params | BF16 Memory | QAT 4-bit Memory | Key Hardware Fit |
|---|---|---|---|---|---|
| E2B | Dense + PLE | ~2.3B effective (5.1B with embeddings) | ~9.6 GB | ~3.2 GB (Q4_0); 1 GB (mobile format) | Smartphones, edge devices, browsers |
| E4B | Dense + PLE | ~4.5B effective (8B with embeddings) | ~15 GB | ~5 GB (Q4_0) | Mid-range GPUs, mobile devices with more RAM |
| 12B | Dense, encoder-free unified multimodal | 11.95B | ~24 GB | ~7 GB (Q4_0) | 8 GB GPUs, laptops with dedicated graphics |
| 26B A4B | Mixture of Experts | ~3.8B active (26B total) | ~48 GB | ~15 GB (Q4_0) | 12–16 GB GPUs, high-end workstations |
| 31B | Dense | 30.7B | ~58 GB | ~17–18 GB (Q4_0) | 24 GB GPUs (RTX 3090/4090), high-VRAM setups |
Memory figures come from Google's official model overview and Unsloth documentation, with the Q4_0 numbers representing the popular GGUF quantization level. The mobile-format E2B figure of roughly 1 GB is the headline-grabbing number: Google specifically engineered a custom schema with targeted 2-bit decoding layers and optimized KV caches to get there. For text-only models without Per-Layer Embeddings, the footprint can reportedly dip below 1 GB.
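As a quick sanity check on the table, the savings from BF16 to Q4_0 can be computed directly from the listed figures. This is a minimal sketch: the numbers are copied from the table above, the "~17–18 GB" entry is taken as 17.5 GB, and the function names are illustrative rather than part of any library.

```python
# BF16 and Q4_0 footprints in GB, copied from the table above.
# Assumption: the "~17-18 GB" entry for 31B is taken as its midpoint, 17.5.
FOOTPRINTS_GB = {
    "E2B": (9.6, 3.2),
    "E4B": (15.0, 5.0),
    "12B": (24.0, 7.0),
    "26B A4B": (48.0, 15.0),
    "31B": (58.0, 17.5),
}


def reduction_percent(bf16_gb, q4_gb):
    """Percentage of memory saved when moving from BF16 to 4-bit weights."""
    return 100.0 * (1.0 - q4_gb / bf16_gb)


def summarize(footprints=FOOTPRINTS_GB):
    """Map each model name to its memory reduction, rounded to one decimal."""
    return {
        name: round(reduction_percent(bf16, q4), 1)
        for name, (bf16, q4) in footprints.items()
    }


if __name__ == "__main__":
    for name, pct in summarize().items():
        print(f"{name}: about {pct}% smaller at Q4_0")
```

With these inputs, every model lands between roughly two-thirds and 71% smaller, which suggests the table's approximate figures are at least internally consistent.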
The 26B A4B model deserves special attention. It is a Mixture of Experts architecture that activates only about 3.8 billion of its 26 billion parameters per token. Its per-token compute is therefore closer to that of a 4B model, while its reasoning quality is roughly comparable to a much larger dense model. In 4-bit form, at roughly 15 GB, it is aimed at 12–16 GB GPUs, the kind of hardware many developers already own.
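The split between compute and memory in a Mixture of Experts model can be illustrated with a little arithmetic: per-token compute tracks the active parameters, while weight memory tracks the total. The sketch below assumes about 4.5 bits per weight for Q4_0 (GGUF stores each block of 32 four-bit weights with one 16-bit scale); the helper names are hypothetical, and real files differ somewhat because some tensors are kept at higher precision.

```python
def active_fraction(active_params_b, total_params_b):
    """Share of parameters that participate in each token's forward pass."""
    return active_params_b / total_params_b


def weight_memory_gb(total_params_b, bits_per_weight):
    """Approximate weight memory in GB for a parameter count given in billions.

    Memory depends on the TOTAL parameter count, because every expert must be
    resident even though only a few are used per token.
    """
    return total_params_b * bits_per_weight / 8


if __name__ == "__main__":
    frac = active_fraction(3.8, 26.0)
    print(f"Active per token: {frac:.1%} of parameters")
    # Assumption: ~4.5 bits/weight for Q4_0 (4-bit values plus per-block scale).
    print(f"Estimated Q4_0 weights: {weight_memory_gb(26.0, 4.5):.1f} GB")
```

Under these assumptions the estimate comes out near 14.6 GB, close to the roughly 15 GB listed in the table, even though only about 15% of the parameters are used for any given token.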
Google released QAT checkpoints in four distinct forms, and the choice of format directly affects quality.
The most important caveat in the entire release concerns naive format conversion. Converting QAT weights directly to Q4_0 without proper handling can drastically reduce accuracy. According to Unsloth's documentation, a naive Q4_0 conversion of the 26B QAT model achieves only about 70.2% top-1 accuracy. Their own Dynamic quantization method pushes that to 85.6%, a 15.4-percentage-point improvement. The broader point stands: format selection and conversion methodology are critical to preserving the quality that QAT is supposed to deliver.
For most users, the official compressed-tensors or GGUF checkpoints are the safest starting point.
QAT doesn't just reduce memory — it reshapes the hardware landscape for local AI inference. Models that previously required data-center GPUs can now run on consumer hardware and even smartphones.
Smartphones and edge devices: E2B is purpose-built for mobile. Google's LiteRT-LM framework can run E2B in under 1.5 GB of RAM with 2-bit and 4-bit quantization, and Google's own AI Edge Gallery app on the Play Store lets users select and run E2B or E4B entirely on-device. Both models support text, image, and audio input, so real-time speech translation, visual question answering, and on-device assistants become plausible without a cloud connection.
8 GB GPUs: The sweet spot for QAT deployment. E2B (~3.2 GB), E4B (~5 GB), and the 12B model (~7 GB) all fit in 8 GB of VRAM at Q4_0 quantization, though the 12B model leaves only about 1 GB to spare. This means a mid-range laptop with a mobile 4060 or an older desktop 2070 can now run a unified multimodal model with a 256K context window, something that would have required 24 GB or more at 16-bit precision.
12–16 GB GPUs: The 26B A4B MoE model lands here at roughly 15 GB in Q4_0 form. A 16 GB card such as the RTX 4080 can hold the weights entirely, while 12 GB cards like the RTX 3080 or 4070 Ti would likely need some layers offloaded to system RAM. Its MoE architecture also gives it lower inference latency than a dense model of similar footprint, because only a fraction of parameters activate per token.
20–24 GB GPUs: The 31B dense model requires about 17–18 GB at Q4_0 quantization, putting it within reach of RTX 3090 and 4090 owners with some headroom for KV cache and batch size. At full 16-bit precision, this model demands nearly 60 GB, entirely out of reach for consumer GPUs. QAT makes the largest Gemma 4 model genuinely practical on a single high-end consumer card.
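The hardware tiers above amount to a simple lookup: compare each model's Q4_0 weight size against the card's VRAM minus some reserve. The following is a rough sketch, with weights taken from the table, the 31B figure assumed to be 17.5 GB, and a default 1 GB reserve chosen purely for illustration.

```python
# Q4_0 weight sizes in GB from the table above (31B taken as 17.5).
Q4_WEIGHTS_GB = {
    "E2B": 3.2,
    "E4B": 5.0,
    "12B": 7.0,
    "26B A4B": 15.0,
    "31B": 17.5,
}


def models_that_fit(vram_gb, headroom_gb=1.0, weights=Q4_WEIGHTS_GB):
    """Return the models whose weights fit in VRAM after reserving headroom.

    headroom_gb is an illustrative reserve for runtime overhead; it is an
    assumption, not a measured value, and long contexts need far more.
    """
    budget = vram_gb - headroom_gb
    return [name for name, size in weights.items() if size <= budget]


if __name__ == "__main__":
    for vram in (8, 12, 16, 24):
        print(f"{vram} GB card: {', '.join(models_that_fit(vram)) or 'none'}")
```

Raising `headroom_gb` is the easiest way to see how quickly a tier shrinks once real runtime overhead is taken into account.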
Important reality check: Memory figures discussed here represent model weight sizes, not total VRAM consumption. Runtime overhead, particularly KV cache for long context windows, can add gigabytes on top. The 31B model with a 256K context will consume significantly more memory than the base weight size, and community reports suggest context-heavy workloads may push requirements into the low 20 GB range. Always budget extra headroom beyond the listed Q4_0 weight footprint.
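To see why context length matters so much, a textbook estimate of KV cache size is useful. The sketch below uses the standard formula for full attention on every layer (two tensors, keys and values, per layer). The layer count, KV head count, and head dimension are hypothetical placeholders, not Gemma 4's published configuration, and architectures that use sliding-window or shared KV layers will need less than this upper bound.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem=2):
    """Upper-bound KV cache size for one sequence with full attention.

    The leading factor of 2 covers keys and values; bytes_per_elem=2
    corresponds to a 16-bit cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem


def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_len,
                 bytes_per_elem=2):
    """Same estimate expressed in GiB."""
    return kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                          bytes_per_elem) / 2**30


if __name__ == "__main__":
    # Hypothetical architecture values chosen only to illustrate the scaling.
    for ctx in (8_192, 32_768, 131_072):
        print(f"{ctx:>7} tokens: {kv_cache_gib(60, 8, 128, ctx):.2f} GiB")
```

Because the estimate grows linearly with context length, quadrupling the context quadruples the cache, which is why a long-context session can add many gigabytes on top of the weight footprint.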
QAT's core promise is near-original performance at dramatically reduced memory, and the benchmarks broadly bear this out. Google's own documentation describes performance as "near original" at around 72% memory reduction, and community benchmarks suggest quality loss in the 3–5% range for Q4 quantization compared to BF16.
But the devil is in the details. The naive conversion warning from Unsloth (70.2% top-1 accuracy on the 26B model versus 85.6% after their Dynamic optimization) demonstrates that the quality you get depends heavily on how you convert and deploy the QAT weights. If you simply pull a QAT checkpoint and run it through a standard GGUF converter without QAT-aware handling, you may not get the quality you expect.
For production use, the safest approach is to use Google's official QAT checkpoints directly in their compressed-tensors format (for vLLM) or the official GGUF files from Hugging Face. If you need custom quantization beyond what Google provides, budget time for benchmarking: QAT weights are more sensitive to conversion methodology than standard post-training quantized weights.
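One way to structure that benchmarking is to score each candidate conversion against a reference run on the same prompts. The sketch below is an illustration only: it assumes you have already collected the top-1 token IDs from a reference model and from each converted model, it does not reproduce Unsloth's exact evaluation method, and all names and sample data are hypothetical.

```python
def top1_accuracy(predicted_ids, reference_ids):
    """Fraction of positions where the predicted token matches the reference."""
    if len(predicted_ids) != len(reference_ids):
        raise ValueError("prediction and reference lengths differ")
    if not reference_ids:
        raise ValueError("reference is empty")
    matches = sum(p == r for p, r in zip(predicted_ids, reference_ids))
    return matches / len(reference_ids)


def rank_conversions(reference_ids, candidates):
    """Return (name, accuracy) pairs sorted from best to worst agreement."""
    scored = [
        (name, top1_accuracy(ids, reference_ids))
        for name, ids in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy token IDs standing in for real model outputs.
    reference = [5, 9, 2, 7, 7, 1, 3, 8, 4, 6]
    candidates = {
        "naive-convert": [5, 9, 0, 7, 0, 1, 0, 8, 4, 6],
        "qat-aware": [5, 9, 2, 7, 7, 1, 0, 8, 4, 6],
    }
    for name, acc in rank_conversions(reference, candidates):
        print(f"{name}: {acc:.0%} top-1 agreement")
```

In practice the token lists would come from running each checkpoint over a fixed evaluation set; the value of the harness is that every conversion is scored against the same reference.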
On a practical level, this release changes the default answer to "can I run this model locally?" For the first time, a major open-weights model family ships with QAT checkpoints as a first-class citizen, not as an afterthought. The implications cascade across several application categories:
Privacy-sensitive workloads: Medical, legal, and personal assistant applications that previously required a cloud API can now run entirely on-device on a laptop or phone, with QAT preserving enough quality to make local inference genuinely useful.
Offline and edge deployment: Field research, disaster response, and industrial settings without reliable connectivity can deploy capable multimodal models on commodity hardware. E2B's audio support paired with the roughly 1 GB mobile format makes real-time speech translation on a mid-range phone a practical reality.
Developer tooling and IDEs: The 12B and 26B models fit on the hardware developers already own, enabling code completion, refactoring, and documentation generation that runs locally without network latency or per-request cost. Google specifically positioned the quantized versions for "IDEs, coding assistants and agentic workflows".
Experimentation and fine-tuning: Smaller research teams and independent developers who couldn't afford A100 or H100 clusters can now work with models in the 12B–31B range on consumer hardware, dramatically lowering the barrier to entry for model customization and domain-specific fine-tuning.
Google released the checkpoints under the same Apache 2.0 license as the base Gemma 4 models, and they're available immediately on Hugging Face across all five model sizes.