Gemma 4 Quantization Compared: Q4 vs QAT vs Q8 (Speed & Quality)

2026年7月5日

Gemma 4 ships several quantization methods for the same model. I measured all three — Q4_K_M (4-bit), QAT (Quantization-Aware Training), and Q8_0 (8-bit) — for both speed and quality on an RTX 3090 + RTX 3060 setup.

This article looks at how speed and quality change when you switch quantization methods. The short version: on the 26B MoE, QAT matched Q4_K_M’s speed while improving the GPQA score, and Q8_0 was slower but answered every question correctly. If you have VRAM to spare, go with Q8_0; if not, QAT is a strong pick.

Measured June 2026, using Ollama. This is a follow-up to the earlier article, “Gemma 4 vs Qwen 3.6: which should you run on your GPU?“

1. How the quantization methods differ
2. Test setup
3. Models compared
4. Speed comparison
5. Reasoning accuracy (GPQA Diamond, 5 questions)
6. Output quality comparison
7. Coding quality
8. What happens when you run out of VRAM
9. Summary — how to choose a quantization method
- 9.1. References
- 9.2. Hardware used for testing

How the quantization methods differ

Method	Summary	Approx. size
Q4_K_M	Compressed to 4-bit after training (Post-Training Quantization). Ollama’s default	~30% of FP16
QAT	Trained with quantization in mind (Google’s official blog). Released June 5, 2026. Higher quality than Q4 at the same size	~30% of FP16
Q8_0	8-bit quantization. Quality close to FP16, but about twice the size of Q4	~50% of FP16

Test setup

GPU (main)

RTX 3090 24GB

GPU (secondary)

RTX 3060 12GB

CPU / RAM

Ryzen 9 3950X / 64GB

Inference engine

Ollama

Models compared

Model	Quantization	File size	VRAM (measured)	GPU placement
Gemma 4 26B MoE	Q4_K_M	18GB	19GB	3090 only
Gemma 4 26B MoE	QAT	15GB	17GB	2-GPU split
Gemma 4 26B MoE	Q8_0	28GB	29GB	2-GPU split
Gemma 4 31B dense	Q4_K_M	19GB	26GB	2-GPU split
Gemma 4 31B dense	QAT	18GB	25GB	2-GPU split
Gemma 4 31B dense	Q8_0	33GB	35GB	2-GPU split (ctx=2048)

With MoE (Mixture of Experts), only 3.8B of the full 26B parameters are active at a time — but all 26B still have to fit in VRAM. A dense model always runs all of its parameters.

Speed comparison

Generation speed (tok/s) — by quantization method

26B MoE Q4_K_M

107 tok/s

26B MoE QAT

90 tok/s

26B MoE Q8_0

62 tok/s

31B dense QAT

26 tok/s

31B dense Q4_K_M

24.5 tok/s

31B dense Q8_0

15.5 tok/s

Measured with Thinking OFF, same prompt. Blue = Q4_K_M, green = QAT, orange = Q8_0.

MoE Q4_K_M was the fastest at 107 tok/s, and Q8_0 fell to 62 tok/s. Even the dense model dropped about 37%, from 24.5 to 15.5 tok/s going Q4→Q8. QAT is the same size as Q4_K_M, but on the 26B MoE it came in a bit slower at 90 tok/s. That’s because Ollama chose a 2-GPU split — an effect of the placement strategy, not of the quantization method itself.

Reasoning accuracy (GPQA Diamond, 5 questions)

I had each model solve five PhD-level science questions (physics, chemistry, biology) in Thinking mode.

Model	Q1 Biology	Q2 Chemistry	Q3 Chemistry	Q4 Physics	Q5 Physics	Accuracy
26B MoE Q4_K_M	D ✓	B ✓	A ✓	B ✗	C ✓	80%
26B MoE QAT	D ✓	B ✓	A ✓	D ✓	C ✓	100%
26B MoE Q8_0	D ✓	B ✓	A ✓	D ✓	C ✓	100%
31B dense Q4_K_M	D ✓	B ✓	A ✓	D ✓	C ✓	100%
31B dense QAT	D ✓	B ✓	A ✓	B ✗	C ✓	80%
31B dense Q8_0	D ✓	B ✓	A ✓	D ✓	C ✓	100%

GPQA Diamond sample of 5 questions (Thinking ON). A small sample, so treat it as a rough guide. On the 26B MoE, only Q4_K_M missed Q4 (a photon-energy question); QAT and Q8_0 got it right. On the 31B dense, only QAT missed Q4.

On the 26B MoE, QAT and Q8_0 correctly answered the question Q4_K_M missed (Q4, a photon-energy problem). On the 31B dense, it was QAT alone that got Q4 wrong — with only five questions, there’s some noise in the results. Even so, the fact that Q8_0 aced every question on both the 26B MoE and the 31B dense hints that quantization precision may be affecting reasoning quality.

Output quality comparison

I compared output quality with a single prompt (“the pros and cons of running a local LLM on a home PC, in about 500 words"). The differences between quantization methods were tiny. Every variant produced natural prose, and they shared a tendency to structure their answers with bold subheadings.

Coding quality

When I asked for an implementation of a Trie data structure, all six variants returned a correct class definition and method implementations. I saw no difference in coding quality across quantization methods.

What happens when you run out of VRAM

I also tried loading 31B Q8_0 (a 33GB file) onto the RTX 3090 + RTX 3060 (36GB combined).

Setting	Result
Default (num_ctx=32768)	CUDA out of memory — won’t run
num_ctx=2048 + all layers on GPU	Runs — 15.5 tok/s (VRAM 35.3GB / 36.9GB used)

At the default context length (32768 tokens), the KV cache doesn’t fit in VRAM and you get an OOM error. Capping it at num_ctx=2048 lets it run, but then it’s useless for processing long documents.

I also measured how the number of GPU layers (num_gpu) — that is, the amount of CPU offload — relates to speed.

31B Q8_0: GPU layers vs. generation speed

20 layers (mostly CPU)

1.7

40 layers

3.0

55 layers

7.1

60 layers

12.8

All layers on GPU

15.7

num_ctx=2048 fixed. Units: tok/s. More GPU layers = faster. Mostly-CPU drops to 1.7 tok/s.

The fewer layers on the GPU, the more sharply speed falls. At 20 layers (almost all CPU) it managed just 1.7 tok/s — about one-ninth of the 15.7 tok/s you get with all layers on the GPU. A model that doesn’t fit in VRAM slows down, in proportion to how much is offloaded to the CPU, until it’s effectively unusable.

Summary — how to choose a quantization method

Situation	Recommendation	Why
Plenty of VRAM (Q8 fits fully)	Q8_0	Aced GPQA. Slower, but the highest quality
Moderate VRAM (enough for Q4)	QAT	Same size as Q4_K_M with better quality. Just swap it in
Speed above all	Q4_K_M (MoE)	Fastest at 107 tok/s. Good enough quality for everyday use
Q8 only barely fits (or not)	Stick with QAT	Risk of big speed drops from context limits or CPU offload

QAT is an option you can switch to from Q4_K_M with no downside in size or speed. The file is the same size or smaller, and on the 26B MoE I even saw improved reasoning accuracy (on the 31B dense there was a one-in-five wobble). If you’re using Q4_K_M in Ollama, it’s worth trying ollama pull gemma4:26b-a4b-it-qat.

Q8_0 is the option when you have VRAM to spare. The 26B MoE Q8_0 (28GB) won’t fit on an RTX 3090 alone, but split across two GPUs it runs at a practical 62 tok/s. For quality-focused work (specialized reasoning, code generation, and the like), it’s worth considering.

Forcing a model that doesn’t fit into VRAM is best avoided. 31B Q8_0 (33GB) OOMs on 36GB of VRAM unless you cap the context length at 2048, and even then speed drops to about 63% of Q4_K_M. “A little short" is synonymous with “a lot slower."

References

Hardware used for testing

[kimono_product id="15761″]

[kimono_product id="15759″]

▶ Go deeper on local AI (related)

Local AI,Models