Gemma 4 vs Qwen 3.6: Which Should You Run on Your GPU?

2026年7月5日

I dropped Google’s Gemma 4 and Alibaba’s Qwen 3.6 onto my home PC with an RTX 3090 and compared their performance. Both are open-weight LLMs (large language models) released in April 2026, and each comes in several sizes.

In this article I measured generation speed, output quality, math reasoning, and coding ability to figure out “which model fits your GPU’s VRAM." Bottom line: on a 24GB GPU, Qwen 3.6 27B was the best balance of quality and speed; on a 12GB GPU, Gemma 4 12B was.

Measured June 2026, via Ollama with Q4_K_M quantization.

1. Test setup
2. The models compared
3. Generation speed across all models
4. If you’re on a 24GB GPU (RTX 3090 / 4090)
5. If you’re on a 12GB GPU (RTX 3060 / 4060)
6. If you’re on an 8GB-or-less GPU
7. Output quality: a hands-on comparison
- 7.1. Gemma 4 31B’s answer
- 7.2. Qwen 3.6 27B’s answer
8. Where they stand against paid models
- 8.1. How to read the benchmarks
9. Differences in Thinking mode
10. Summary — the best model for each VRAM tier
- 10.1. References
- 10.2. Hardware used for testing

Test setup

GPU (main)

RTX 3090 24GB

GPU (secondary)

RTX 3060 12GB

CPU / RAM

Ryzen 9 3950X / 64GB

Inference engine / quantization

Ollama / Q4_K_M

The models compared

In addition to every size of Gemma 4 and Qwen 3.6, I included the previous-generation Qwen 3 / 3.5.

Models under test

Model	Type	Parameters	File size	Approx. VRAM needed
Gemma 4 E4B	dense	8B	9.6GB	~6GB
Gemma 4 12B	dense	12B	7.4GB	~8GB
Gemma 4 26B	MoE (4B active)	26B	17GB	~18GB
Gemma 4 31B	dense	31B	19GB	~20GB
Qwen 3.6 27B	dense	27B	17GB	~19GB
Qwen 3.6 35B-A3B	MoE (3B active)	35B	23GB	~25GB
Qwen 3 8B	dense (prev. gen)	8B	5.2GB	~6GB
Qwen 3.5 9B	dense (prev. gen)	9B	6.6GB	~7GB
Qwen 3 14B	dense (prev. gen)	14B	9.3GB	~14GB

VRAM figures are measured values at Q4_K_M quantization. With MoE, all parameters sit in VRAM, but fewer of them run per inference step (= faster).

Qwen 3.6 comes in only two sizes, 27B and 35B-A3B, with nothing in the 8–14B range. Since Qwen 3.6 isn’t an option on 12GB-or-less GPUs, you’ll be using the previous-generation Qwen 3 / 3.5 there.

Generation speed across all models

Generation speed (tok/s) — all models, measured

Gemma 4 E4B local

110 tok/s

Gemma 4 26B MoE local

107 tok/s

Qwen 3.6 35B-A3B MoE local

97 tok/s

GPT-5.5 API

~61 tok/s

Gemini 2.5 Flash API

58 tok/s

Claude Sonnet 4.6 API

~48 tok/s

Qwen 3.6 27B local

39 tok/s

Gemma 4 12B local

31 tok/s

Gemma 4 31B local

24 tok/s

Gemini 2.5 Pro API

15 tok/s

Blue = Gemma 4 / green = Qwen / orange = Gemini / gray = published API figures (Artificial Analysis). Local models measured on RTX 3090 + Q4_K_M + Thinking OFF. API speeds depend on your network.

MoE (Mixture of Experts) models run only a fraction of their parameters per inference step, so they’re far faster than dense models. Gemma 4 26B MoE, with about 4B active parameters, hit 107 tok/s — beating the paid APIs GPT-5.5 (about 61 tok/s) and Claude Sonnet 4.6 (about 48 tok/s), a clear illustration of the speed advantage of local MoE models. That said, API speeds swing with server load and network conditions, so treat them as reference points only (speed data source: Artificial Analysis).

If you’re on a 24GB GPU (RTX 3090 / 4090)

With 24GB of VRAM, every model is on the table. The four main candidates were:

For 24GB GPUs — all candidate models compared

Metric	Qwen 3.6 27B	Gemma 4 31B	Gemma 4 26B MoE	Qwen 3.6 35B-A3B MoE
Type	dense	dense	MoE (4B active)	MoE (3B active)
Speed (measured)	39 tok/s	24 tok/s	107 tok/s	97 tok/s
GPQA 5Q (measured)	4/5	4/5	4/5	4/5
VRAM used	~19GB	~20GB (2 GPUs)	~18GB	~25GB (2 GPUs)
Coding	◎	○	○	○
Multimodal	Images & video	Images	Images	Images & video
BenchLM overall	73	64	−	−

For quality, Qwen 3.6 27B. Its BenchLM overall score beat Gemma 4 31B, 73 to 64, and on coding benchmarks the gap was wide — an average of 70.6 vs 41.6. Speed is a practical 39 tok/s, and it fits on a single 24GB GPU. Its SWE-bench Verified pass rate was 77.2%. That still trails the paid Claude Opus 4.8 (88.6%) and GPT-5.5 (88.7%), but it’s a remarkable level for a model you can run locally for free.

For speed, Gemma 4 26B MoE. At 107 tok/s it’s about three times faster than Qwen 3.6 27B. Being MoE it has only 4B active parameters, but output quality and math reasoning had no practical issues. It’s plenty for chat and light writing.

For dense-model quality, Gemma 4 31B. This model runs all of its parameters rather than using MoE, and on benchmarks it came close to Qwen 3.6 27B. But at 24 tok/s it was the slowest, and even on a 24GB GPU it needs a 2-GPU split, so you have to be ready for the speed hit.

If you’re on a 12GB GPU (RTX 3060 / 4060)

With 12GB of VRAM, no size of Qwen 3.6 fits (even the smallest, 27B, needs about 19GB). Your options are Gemma 4’s smaller models or the previous-generation Qwen 3 / 3.5.

For 12GB GPUs — candidate models compared

Metric	Gemma 4 12B	Gemma 4 E4B	Qwen 3.5 9B	Qwen 3 8B
Speed	35 tok/s	71 tok/s	99 tok/s	127 tok/s
VRAM used	~8GB	~6GB	~7GB	~6GB
Math reasoning	◎	◎	◎	◎
Quality	◎	○	○	○
Multimodal	Images + audio	Images + audio	Images	None
Generation	Latest	Latest	Prev. gen	2 gens back

For quality, Gemma 4 12B. It’s the newest model in the 12B class, with the highest output quality and well-structured answers. Speed is a modest 35 tok/s, but it fit comfortably on a single 12GB GPU.

For speed, Gemma 4 E4B. It manages a practical 71 tok/s. It accepts image and audio input — both E4B and 12B of Gemma 4 handle audio (Qwen 3.5 9B supports image input). Output quality is slightly below Gemma 4 12B, but it was plenty for chat use.

If you’re on an 8GB-or-less GPU

On 8GB or less (RTX 3050, GTX 1070, and the like), the realistic options are Gemma 4 E4B (~6GB VRAM) or Qwen 3 8B (~6GB VRAM). Qwen 3 8B is the fastest at 127 tok/s, but as an older generation it lagged on benchmarks. If you don’t need multimodal, picking Qwen 3 8B for the speed is one reasonable move.

Output quality: a hands-on comparison

I compared with the same prompt: “the pros and cons of running a local LLM on a home PC, in about 500 words." Here are the answers from the two main 24GB-class models.

Gemma 4 31B’s answer

$ ollama chat gemma4:31b
The biggest advantages of running a local LLM on your home PC are complete privacy and customizability. Unlike cloud services, your data isn’t sent to an external server, so you can process confidential documents or personal information with peace of mind. And if you choose a model without content filters, you’re free to write creatively however you like, or fine-tune it for a specific specialized field.
On the other hand, the drawbacks are hardware cost and performance limits. Running it comfortably needs a high-end GPU (one with lots of VRAM), which means an outlay of tens of thousands to hundreds of thousands of yen to get started. (truncated)

Qwen 3.6 27B’s answer

$ ollama chat qwen3.6:27b
A local LLM’s biggest advantages are data privacy and offline operation. For instance, you can summarize and analyze a company’s confidential contracts or an individual’s medical records safely inside your PC, without sending them to the cloud. And because no internet connection is required, you can use it reliably even on a plane or somewhere with poor connectivity. On cost, after the initial investment there are no subscription fees, so it’s economical over the long run.
On the other hand, the drawbacks are the high hardware requirements and inference speed. You need a high-end GPU (16GB+ VRAM recommended) and plenty of memory, and it struggles to run on an older PC. (truncated)

Gemma 4 31B tends to structure with bold text and headings, while Qwen 3.6 27B summarized the key points more concisely. Both produced natural prose, and both held up well on a formal-rewrite test too. Models of 12B and under are fine as far as basic language goes, but 27B-and-up models pull ahead in the depth of their answers and the richness of their examples.

Where they stand against paid models

How to read the benchmarks

A note on the benchmarks cited in this article.

Benchmark	What it measures	Questions	Difficulty
GPQA Diamond	PhD-level science reasoning. Four-choice questions written by physics, chemistry, and biology experts; non-experts score poorly even with Google	198	Random guessing = 25%. Even experts ~65%
SWE-bench Verified	Read an issue (bug report) from a real GitHub Python project, generate a code patch, and fix it. Correct if the tests pass	500	Measures real-world software-engineering ability
MMLU / MMLU-Pro	A knowledge test spanning 57 fields (STEM, humanities, social science, and more). Pro is a harder 10-choice version	~14,000 / 12,000	Broad knowledge, college to expert level

Here’s how the models actually did on five sample GPQA Diamond questions (biology, chemistry, physics), run locally.

GPQA Diamond, 5 sample questions: number correct (measured)

Gemma 4 31B

4/5 (80%)

Qwen 3.6 27B

4/5 (80%)

Gemma 4 26B MoE

4/5 (80%)

Qwen 3.6 35B-A3B MoE

4/5 (80%)

Gemma 4 12B

2/5 (40%)

Gemma 4 E4B

2/5 (40%)

Gemini 2.5 Flash API

1/5 (20%)

Run with Thinking ON. Only 5 questions, so treat as a rough guide. Gemini 2.5 Flash was via API (thinking not specified).

What stands out is that the MoE models (Gemma 4 26B MoE, Qwen 3.6 35B-A3B) matched the dense models at 80%. MoE has only 3–4B active parameters, but because it picks the best expert from all parameters for each token, its reasoning accuracy was higher than that small active count would suggest. Below 12B it fell to 40%, a clear sign of the model-size wall. The paid-model comparison below cites official benchmark results.

I’ve organized the gap between locally runnable models and paid cloud APIs using public benchmarks.

Compared with paid models (public benchmarks)

Benchmark	Qwen 3.6 27B	Gemma 4 31B	Claude Sonnet 4.6	Claude Opus 4.8	GPT-5.5	Gemini 3 Pro
GPQA Diamond	87.8%	85.7%	89.9%	93.6%	−	91.9%
SWE-bench Verified	77.2%	−	−	88.6%	88.7%	~78%
MMLU	−	−	−	−	92.4%	~90%
MMLU-Pro	86.2%	85.2%	−	−	−	−
Pricing	Free (local)	Free (local)	Metered API	Metered API	Metered API	Metered API

Sources: BenchLM.ai, Anthropic, OpenAI, Google official (April–June 2026). Green headers are local models.

On GPQA Diamond (PhD-level science reasoning), the locally runnable Qwen 3.6 27B (87.8%) and Gemma 4 31B (85.7%) have closed in on Claude Sonnet 4.6 (89.9%). When I put Gemini 2.5 Flash through the same five-question sample via API, it scored 1/5 (20%). Flash is a speed-focused model, so it gives ground on reasoning accuracy. On SWE-bench Verified, Qwen 3.6 27B’s 77.2% trails Claude Opus 4.8 (88.6%) and GPT-5.5 (88.7%) — on coding, the paid models lead by more than 10 points.

Differences in Thinking mode

Both Gemma 4 31B and Qwen 3.6 27B have a Thinking mode (think before answering). In Ollama you control it with "think": true/false.

Their behavior differed. Gemma 4 31B generates about 1,000 characters of thinking before answering. Qwen 3.6 27B tended to generate more than 5,000 characters of detailed thinking. Thinking tokens count toward the generation cap (num_predict), so if you turn Thinking ON for Qwen 3.6 27B, you need to set num_predict to 2048 or higher — otherwise the thinking can eat the whole budget and the answer comes back empty.

Summary — the best model for each VRAM tier

Recommended model by VRAM

VRAM	Best for quality	Best for speed	Notes
24GB	Qwen 3.6 27B (39 tok/s)	Gemma 4 26B MoE (107 tok/s)	For dense-model quality, Gemma 4 31B
12GB	Gemma 4 12B (35 tok/s)	Gemma 4 E4B (71 tok/s)	Qwen 3.6 doesn’t fit in 12GB
8GB or less	Gemma 4 E4B (71 tok/s)	Qwen 3 8B (127 tok/s)	For audio input too, Gemma 4 E4B

If you have a 24GB GPU, Qwen 3.6 27B is the overall best. It’s well-balanced on speed, quality, and benchmarks, and on SWE-bench it showed performance approaching Claude Opus 4.8. Ideally, pair it with Gemma 4 26B MoE for the moments when you want speed.

On a 12GB GPU, Gemma 4 12B is the solid pick. Since Qwen 3.6 doesn’t offer a model in this VRAM tier, Gemma 4 is effectively the best option. When you need speed, switch to Gemma 4 E4B.

References

In 2026, open-weight models run at a level rivaling paid models if you have a 24GB GPU. Even at 12GB you can now get practical quality. Start by checking your GPU’s VRAM and picking a model from the tables above.

Hardware used for testing

[kimono_product id="15761″]

[kimono_product id="15759″]

▶ Go deeper on local AI (related)

Local AI,Models