Gemma 4 Quantization Compared: Q4 vs QAT vs Q8 (Speed & Quality)
Gemma 4 ships several quantization methods for the same model. I measured all three — Q4_K_M (4-bit), QAT (Quantization-Aware Training), and Q8_0 (8-bit) — for both speed and quality on an RTX 3090 + RTX 3060 setup.
This article looks at how speed and quality change when you switch quantization methods. The short version: on the 26B MoE, QAT matched Q4_K_M’s speed while improving the GPQA score, and Q8_0 was slower but answered every question correctly. If you have VRAM to spare, go with Q8_0; if not, QAT is a strong pick.
Measured June 2026, using Ollama. This is a follow-up to the earlier article, “Gemma 4 vs Qwen 3.6: which should you run on your GPU?”
How the quantization methods differ
| Method | Summary | Approx. size |
|---|---|---|
| Q4_K_M | Compressed to 4-bit after training (Post-Training Quantization). Ollama’s default | ~30% of FP16 |
| QAT | Trained with quantization in mind (Google’s official blog). Released June 5, 2026. Higher quality than Q4 at the same size | ~30% of FP16 |
| Q8_0 | 8-bit quantization. Quality close to FP16, but noticeably larger than Q4 (about 1.6 to 1.7 times in the files measured below) | ~50% of FP16 |
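The size column can be sanity-checked from bits per weight. The following is a minimal sketch, assuming llama.cpp's Q8_0 layout (blocks of 32 int8 weights sharing one 16-bit scale, which works out to 8.5 bits per weight) and an average of roughly 4.8 bits per weight for Q4_K_M, which mixes several block types. The 4.8 figure is an assumption for illustration, not something measured in this article.

```python
# Rough size of a quantized model relative to FP16, from bits per weight (bpw).
FP16_BPW = 16.0

# Q8_0: 32 int8 weights + one fp16 scale per block -> (32*8 + 16) / 32 bpw.
Q8_0_BPW = (32 * 8 + 16) / 32

# Q4_K_M mixes several block types; ~4.8 bpw is an ASSUMED average.
Q4_K_M_BPW_ASSUMED = 4.8


def fraction_of_fp16(bpw: float) -> float:
    """Return quantized size as a fraction of the FP16 size."""
    return bpw / FP16_BPW


for name, bpw in [("Q4_K_M", Q4_K_M_BPW_ASSUMED), ("Q8_0", Q8_0_BPW)]:
    print(f"{name}: {bpw:.2f} bpw, about {fraction_of_fp16(bpw):.0%} of FP16")
```

Real files usually come out somewhat larger than this estimate, since some tensors are typically kept at higher precision.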
Test setup
Models compared
| Model | Quantization | File size | VRAM (measured) | GPU placement |
|---|---|---|---|---|
| Gemma 4 26B MoE | Q4_K_M | 18GB | 19GB | 3090 only |
| Gemma 4 26B MoE | QAT | 15GB | 17GB | 2-GPU split |
| Gemma 4 26B MoE | Q8_0 | 28GB | 29GB | 2-GPU split |
| Gemma 4 31B dense | Q4_K_M | 19GB | 26GB | 2-GPU split |
| Gemma 4 31B dense | QAT | 18GB | 25GB | 2-GPU split |
| Gemma 4 31B dense | Q8_0 | 33GB | 35GB | 2-GPU split (ctx=2048) |
With MoE (Mixture of Experts), only 3.8B of the full 26B parameters are active at a time — but all 26B still have to fit in VRAM. A dense model always runs all of its parameters.
Speed comparison
Generation speed (tok/s) — by quantization method
Measured with Thinking OFF, same prompt.
MoE Q4_K_M was the fastest at 107 tok/s, and Q8_0 fell to 62 tok/s. Even the dense model dropped about 37%, from 24.5 to 15.5 tok/s going Q4→Q8. QAT is the same size as Q4_K_M, but on the 26B MoE it came in a bit slower at 90 tok/s. That’s because Ollama chose a 2-GPU split — an effect of the placement strategy, not of the quantization method itself.
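One way to get tok/s figures like these is from the timing fields Ollama returns with a non-streaming `/api/generate` response: `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below assumes a local Ollama server on the default port. It is not necessarily how the numbers in this article were collected, and the `measure` helper is a name made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def percent_drop(before: float, after: float) -> float:
    """Relative slowdown going from `before` to `after`, in percent."""
    return (before - after) / before * 100


def measure(model: str, prompt: str) -> float:
    """Send one non-streaming request and return tok/s. Needs a running Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))


# Figures from the article: 31B dense, Q4_K_M vs Q8_0.
print(f"{percent_drop(24.5, 15.5):.0f}% slower")
```

Calling `measure("gemma4:26b-a4b-it-qat", "...")` several times and averaging would smooth out run-to-run variation.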
Reasoning accuracy (GPQA Diamond, 5 questions)
I had each model solve five PhD-level science questions (physics, chemistry, biology) in Thinking mode.
| Model | Q1 Biology | Q2 Chemistry | Q3 Chemistry | Q4 Physics | Q5 Physics | Accuracy |
|---|---|---|---|---|---|---|
| 26B MoE Q4_K_M | D ✓ | B ✓ | A ✓ | B ✗ | C ✓ | 80% |
| 26B MoE QAT | D ✓ | B ✓ | A ✓ | D ✓ | C ✓ | 100% |
| 26B MoE Q8_0 | D ✓ | B ✓ | A ✓ | D ✓ | C ✓ | 100% |
| 31B dense Q4_K_M | D ✓ | B ✓ | A ✓ | D ✓ | C ✓ | 100% |
| 31B dense QAT | D ✓ | B ✓ | A ✓ | B ✗ | C ✓ | 80% |
| 31B dense Q8_0 | D ✓ | B ✓ | A ✓ | D ✓ | C ✓ | 100% |
GPQA Diamond sample of 5 questions (Thinking ON). A small sample, so treat it as a rough guide.
On the 26B MoE, QAT and Q8_0 correctly answered the question Q4_K_M missed (Q4, a photon-energy problem). On the 31B dense, it was QAT alone that got Q4 wrong — with only five questions, there’s some noise in the results. Even so, the fact that Q8_0 aced every question on both the 26B MoE and the 31B dense hints that quantization precision may be affecting reasoning quality.
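To put a number on how noisy five questions are, here is a small sketch that computes a Wilson score interval for 4/5 and 5/5. The choice of the Wilson interval and the 95% level (z = 1.96) are my assumptions for illustration, not part of the original measurements.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for label, correct in [("4/5", 4), ("5/5", 5)]:
    low, high = wilson_interval(correct, 5)
    print(f"{label}: {low:.0%} to {high:.0%}")
```

The two intervals overlap heavily, which is why a one-question difference between variants should be read as a hint rather than a ranking.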
Output quality comparison
I compared output quality with a single prompt (“the pros and cons of running a local LLM on a home PC, in about 500 words”). The differences between quantization methods were tiny. Every variant produced natural prose, and they shared a tendency to structure their answers with bold subheadings.
Coding quality
When I asked for an implementation of a Trie data structure, all six variants returned a correct class definition and method implementations. I saw no difference in coding quality across quantization methods.
What happens when you run out of VRAM
I also tried loading 31B Q8_0 (a 33GB file) onto the RTX 3090 + RTX 3060 (36GB combined).
| Setting | Result |
|---|---|
| Default (num_ctx=32768) | CUDA out of memory — won’t run |
| num_ctx=2048 + all layers on GPU | Runs — 15.5 tok/s (VRAM 35.3GB / 36.9GB used) |
At the default context length (32768 tokens), the KV cache doesn’t fit in VRAM and you get an OOM error. Capping it at num_ctx=2048 lets it run, but then it’s useless for processing long documents.
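Why context length matters so much: for plain full attention, the KV cache grows linearly with num_ctx. The sketch below uses the standard estimate (keys plus values, per layer, per position) with HYPOTHETICAL architecture numbers, since this article does not list Gemma 4's layer or head counts. Models that use sliding-window attention or a quantized KV cache need less than this, so treat the output as an upper-bound illustration of the scaling, not as Gemma 4's real footprint.

```python
def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Full-attention KV cache: keys and values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_elem


# HYPOTHETICAL architecture values, for illustration only (not Gemma 4's real config).
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

for ctx in (2048, 32768):
    gib = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"num_ctx={ctx}: about {gib:.2f} GiB of KV cache")
```

Whatever the real per-token cost is, going from 2048 to 32768 tokens multiplies it by 16, which is more than the roughly 1.6GB of headroom left in the table above can absorb.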
I also measured how the number of GPU layers (num_gpu) — that is, the amount of CPU offload — relates to speed.
31B Q8_0: GPU layers vs. generation speed
num_ctx fixed at 2048. Units: tok/s. More GPU layers means faster generation; the mostly-CPU configuration drops to 1.7 tok/s.
The fewer layers on the GPU, the more sharply speed falls. At 20 layers (almost all CPU) it managed just 1.7 tok/s — about one-ninth of the 15.7 tok/s you get with all layers on the GPU. A model that doesn’t fit in VRAM slows down, in proportion to how much is offloaded to the CPU, until it’s effectively unusable.
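For anyone who wants to repeat this kind of sweep, num_ctx and num_gpu can be passed per request in the `options` object of Ollama's `/api/generate` endpoint. Below is a minimal sketch that only builds the request bodies. The model tag is a placeholder and the layer counts are illustrative, not the exact values swept in this article.

```python
import json


def generate_payload(model: str, prompt: str, num_ctx: int, num_gpu: int) -> str:
    """JSON body for Ollama's /api/generate with context and GPU-layer overrides."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_gpu": num_gpu},
    })


# PLACEHOLDER model tag; substitute whatever `ollama list` shows locally.
MODEL_TAG = "your-model:tag"

for layers in (20, 40, 60):  # illustrative layer counts, not the article's exact sweep
    print(generate_payload(MODEL_TAG, "Hello", num_ctx=2048, num_gpu=layers))

# Slowdown reported in the article: all layers on GPU vs. 20 layers on GPU.
print(f"{15.7 / 1.7:.1f}x slower")
```

Each body can be POSTed to `http://localhost:11434/api/generate`; the response's `eval_count` and `eval_duration` fields then give the generation speed for that setting.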
Summary — how to choose a quantization method
| Situation | Recommendation | Why |
|---|---|---|
| Plenty of VRAM (Q8 fits fully) | Q8_0 | Aced GPQA. Slower, but the highest quality |
| Moderate VRAM (enough for Q4) | QAT | Same size as Q4_K_M with better quality. Just swap it in |
| Speed above all | Q4_K_M (MoE) | Fastest at 107 tok/s. Good enough quality for everyday use |
| Q8 only barely fits (or not) | Stick with QAT | Risk of big speed drops from context limits or CPU offload |
QAT is an option you can switch to from Q4_K_M with no size penalty: the file is the same size or smaller. On the 26B MoE it ran somewhat slower (90 vs. 107 tok/s) because Ollama split it across two GPUs, but I also saw improved reasoning accuracy there (on the 31B dense there was a one-in-five wobble). If you’re using Q4_K_M in Ollama, it’s worth trying ollama pull gemma4:26b-a4b-it-qat.
Q8_0 is the option when you have VRAM to spare. The 26B MoE Q8_0 (28GB) won’t fit on an RTX 3090 alone, but split across two GPUs it runs at a practical 62 tok/s. For quality-focused work (specialized reasoning, code generation, and the like), it’s worth considering.
Forcing a model that doesn’t fit into VRAM is best avoided. 31B Q8_0 (33GB) OOMs on 36GB of VRAM unless you cap the context length at 2048, and even then speed drops to about 63% of Q4_K_M. “A little short” is synonymous with “a lot slower.”