Gemma 4 vs Qwen 3.6: Which Should You Run on Your GPU?

I dropped Google’s Gemma 4 and Alibaba’s Qwen 3.6 onto my home PC with an RTX 3090 and compared their performance. Both are open-weight LLMs (large language models) released in April 2026, and each comes in several sizes.

In this article I measured generation speed, output quality, math reasoning, and coding ability to figure out “which model fits your GPU’s VRAM." Bottom line: on a 24GB GPU, Qwen 3.6 27B was the best balance of quality and speed; on a 12GB GPU, Gemma 4 12B was.

Measured June 2026, via Ollama with Q4_K_M quantization.

Test setup

Test setup
GPU (main)
RTX 3090 24GB
GPU (secondary)
RTX 3060 12GB
CPU / RAM
Ryzen 9 3950X / 64GB
Inference engine / quantization
Ollama / Q4_K_M

The models compared

In addition to every size of Gemma 4 and Qwen 3.6, I included the previous-generation Qwen 3 / 3.5.

Models under test

Model Type Parameters File size Approx. VRAM needed
Gemma 4 E4B dense 8B 9.6GB ~6GB
Gemma 4 12B dense 12B 7.4GB ~8GB
Gemma 4 26B MoE (4B active) 26B 17GB ~18GB
Gemma 4 31B dense 31B 19GB ~20GB
Qwen 3.6 27B dense 27B 17GB ~19GB
Qwen 3.6 35B-A3B MoE (3B active) 35B 23GB ~25GB
Qwen 3 8B dense (prev. gen) 8B 5.2GB ~6GB
Qwen 3.5 9B dense (prev. gen) 9B 6.6GB ~7GB
Qwen 3 14B dense (prev. gen) 14B 9.3GB ~14GB

VRAM figures are measured values at Q4_K_M quantization. With MoE, all parameters sit in VRAM, but fewer of them run per inference step (= faster).

Qwen 3.6 comes in only two sizes, 27B and 35B-A3B, with nothing in the 8–14B range. Since Qwen 3.6 isn’t an option on 12GB-or-less GPUs, you’ll be using the previous-generation Qwen 3 / 3.5 there.

Generation speed across all models

Generation speed (tok/s) — all models, measured

Gemma 4 E4B local

110 tok/s
Gemma 4 26B MoE local

107 tok/s
Qwen 3.6 35B-A3B MoE local

97 tok/s
GPT-5.5 API

~61 tok/s
Gemini 2.5 Flash API

58 tok/s
Claude Sonnet 4.6 API

~48 tok/s
Qwen 3.6 27B local

39 tok/s
Gemma 4 12B local

31 tok/s
Gemma 4 31B local

24 tok/s
Gemini 2.5 Pro API

15 tok/s

Blue = Gemma 4 / green = Qwen / orange = Gemini / gray = published API figures (Artificial Analysis). Local models measured on RTX 3090 + Q4_K_M + Thinking OFF. API speeds depend on your network.

MoE (Mixture of Experts) models run only a fraction of their parameters per inference step, so they’re far faster than dense models. Gemma 4 26B MoE, with about 4B active parameters, hit 107 tok/s — beating the paid APIs GPT-5.5 (about 61 tok/s) and Claude Sonnet 4.6 (about 48 tok/s), a clear illustration of the speed advantage of local MoE models. That said, API speeds swing with server load and network conditions, so treat them as reference points only (speed data source: Artificial Analysis).

If you’re on a 24GB GPU (RTX 3090 / 4090)

With 24GB of VRAM, every model is on the table. The four main candidates were:

For 24GB GPUs — all candidate models compared

Metric Qwen 3.6 27B Gemma 4 31B Gemma 4 26B MoE Qwen 3.6 35B-A3B MoE
Type dense dense MoE (4B active) MoE (3B active)
Speed (measured) 39 tok/s 24 tok/s 107 tok/s 97 tok/s
GPQA 5Q (measured) 4/5 4/5 4/5 4/5
VRAM used ~19GB ~20GB (2 GPUs) ~18GB ~25GB (2 GPUs)
Coding
Multimodal Images & video Images Images Images & video
BenchLM overall 73 64

For quality, Qwen 3.6 27B. Its BenchLM overall score beat Gemma 4 31B, 73 to 64, and on coding benchmarks the gap was wide — an average of 70.6 vs 41.6. Speed is a practical 39 tok/s, and it fits on a single 24GB GPU. Its SWE-bench Verified pass rate was 77.2%. That still trails the paid Claude Opus 4.8 (88.6%) and GPT-5.5 (88.7%), but it’s a remarkable level for a model you can run locally for free.

For speed, Gemma 4 26B MoE. At 107 tok/s it’s about three times faster than Qwen 3.6 27B. Being MoE it has only 4B active parameters, but output quality and math reasoning had no practical issues. It’s plenty for chat and light writing.

For dense-model quality, Gemma 4 31B. This model runs all of its parameters rather than using MoE, and on benchmarks it came close to Qwen 3.6 27B. But at 24 tok/s it was the slowest, and even on a 24GB GPU it needs a 2-GPU split, so you have to be ready for the speed hit.

If you’re on a 12GB GPU (RTX 3060 / 4060)

With 12GB of VRAM, no size of Qwen 3.6 fits (even the smallest, 27B, needs about 19GB). Your options are Gemma 4’s smaller models or the previous-generation Qwen 3 / 3.5.

For 12GB GPUs — candidate models compared

Metric Gemma 4 12B Gemma 4 E4B Qwen 3.5 9B Qwen 3 8B
Speed 35 tok/s 71 tok/s 99 tok/s 127 tok/s
VRAM used ~8GB ~6GB ~7GB ~6GB
Math reasoning
Quality
Multimodal Images + audio Images + audio Images None
Generation Latest Latest Prev. gen 2 gens back

For quality, Gemma 4 12B. It’s the newest model in the 12B class, with the highest output quality and well-structured answers. Speed is a modest 35 tok/s, but it fit comfortably on a single 12GB GPU.

For speed, Gemma 4 E4B. It manages a practical 71 tok/s. It accepts image and audio input — both E4B and 12B of Gemma 4 handle audio (Qwen 3.5 9B supports image input). Output quality is slightly below Gemma 4 12B, but it was plenty for chat use.

If you’re on an 8GB-or-less GPU

On 8GB or less (RTX 3050, GTX 1070, and the like), the realistic options are Gemma 4 E4B (~6GB VRAM) or Qwen 3 8B (~6GB VRAM). Qwen 3 8B is the fastest at 127 tok/s, but as an older generation it lagged on benchmarks. If you don’t need multimodal, picking Qwen 3 8B for the speed is one reasonable move.

Output quality: a hands-on comparison

I compared with the same prompt: “the pros and cons of running a local LLM on a home PC, in about 500 words." Here are the answers from the two main 24GB-class models.

Gemma 4 31B’s answer

$ ollama chat gemma4:31b
The biggest advantages of running a local LLM on your home PC are complete privacy and customizability. Unlike cloud services, your data isn’t sent to an external server, so you can process confidential documents or personal information with peace of mind. And if you choose a model without content filters, you’re free to write creatively however you like, or fine-tune it for a specific specialized field.

On the other hand, the drawbacks are hardware cost and performance limits. Running it comfortably needs a high-end GPU (one with lots of VRAM), which means an outlay of tens of thousands to hundreds of thousands of yen to get started. (truncated)

Qwen 3.6 27B’s answer

$ ollama chat qwen3.6:27b
A local LLM’s biggest advantages are data privacy and offline operation. For instance, you can summarize and analyze a company’s confidential contracts or an individual’s medical records safely inside your PC, without sending them to the cloud. And because no internet connection is required, you can use it reliably even on a plane or somewhere with poor connectivity. On cost, after the initial investment there are no subscription fees, so it’s economical over the long run.

On the other hand, the drawbacks are the high hardware requirements and inference speed. You need a high-end GPU (16GB+ VRAM recommended) and plenty of memory, and it struggles to run on an older PC. (truncated)

Gemma 4 31B tends to structure with bold text and headings, while Qwen 3.6 27B summarized the key points more concisely. Both produced natural prose, and both held up well on a formal-rewrite test too. Models of 12B and under are fine as far as basic language goes, but 27B-and-up models pull ahead in the depth of their answers and the richness of their examples.

Where they stand against paid models

How to read the benchmarks

A note on the benchmarks cited in this article.

Benchmark What it measures Questions Difficulty
GPQA Diamond PhD-level science reasoning. Four-choice questions written by physics, chemistry, and biology experts; non-experts score poorly even with Google 198 Random guessing = 25%. Even experts ~65%
SWE-bench Verified Read an issue (bug report) from a real GitHub Python project, generate a code patch, and fix it. Correct if the tests pass 500 Measures real-world software-engineering ability
MMLU / MMLU-Pro A knowledge test spanning 57 fields (STEM, humanities, social science, and more). Pro is a harder 10-choice version ~14,000 / 12,000 Broad knowledge, college to expert level

Here’s how the models actually did on five sample GPQA Diamond questions (biology, chemistry, physics), run locally.

GPQA Diamond, 5 sample questions: number correct (measured)

Gemma 4 31B

4/5 (80%)
Qwen 3.6 27B

4/5 (80%)
Gemma 4 26B MoE

4/5 (80%)
Qwen 3.6 35B-A3B MoE

4/5 (80%)
Gemma 4 12B

2/5 (40%)
Gemma 4 E4B

2/5 (40%)
Gemini 2.5 Flash API

1/5 (20%)

Run with Thinking ON. Only 5 questions, so treat as a rough guide. Gemini 2.5 Flash was via API (thinking not specified).

What stands out is that the MoE models (Gemma 4 26B MoE, Qwen 3.6 35B-A3B) matched the dense models at 80%. MoE has only 3–4B active parameters, but because it picks the best expert from all parameters for each token, its reasoning accuracy was higher than that small active count would suggest. Below 12B it fell to 40%, a clear sign of the model-size wall. The paid-model comparison below cites official benchmark results.

I’ve organized the gap between locally runnable models and paid cloud APIs using public benchmarks.

Compared with paid models (public benchmarks)

Benchmark Qwen 3.6 27B Gemma 4 31B Claude Sonnet 4.6 Claude Opus 4.8 GPT-5.5 Gemini 3 Pro
GPQA Diamond 87.8% 85.7% 89.9% 93.6% 91.9%
SWE-bench Verified 77.2% 88.6% 88.7% ~78%
MMLU 92.4% ~90%
MMLU-Pro 86.2% 85.2%
Pricing Free (local) Free (local) Metered API Metered API Metered API Metered API

Sources: BenchLM.ai, Anthropic, OpenAI, Google official (April–June 2026). Green headers are local models.

On GPQA Diamond (PhD-level science reasoning), the locally runnable Qwen 3.6 27B (87.8%) and Gemma 4 31B (85.7%) have closed in on Claude Sonnet 4.6 (89.9%). When I put Gemini 2.5 Flash through the same five-question sample via API, it scored 1/5 (20%). Flash is a speed-focused model, so it gives ground on reasoning accuracy. On SWE-bench Verified, Qwen 3.6 27B’s 77.2% trails Claude Opus 4.8 (88.6%) and GPT-5.5 (88.7%) — on coding, the paid models lead by more than 10 points.

Differences in Thinking mode

Both Gemma 4 31B and Qwen 3.6 27B have a Thinking mode (think before answering). In Ollama you control it with "think": true/false.

Their behavior differed. Gemma 4 31B generates about 1,000 characters of thinking before answering. Qwen 3.6 27B tended to generate more than 5,000 characters of detailed thinking. Thinking tokens count toward the generation cap (num_predict), so if you turn Thinking ON for Qwen 3.6 27B, you need to set num_predict to 2048 or higher — otherwise the thinking can eat the whole budget and the answer comes back empty.

Summary — the best model for each VRAM tier

Recommended model by VRAM

VRAM Best for quality Best for speed Notes
24GB Qwen 3.6 27B (39 tok/s) Gemma 4 26B MoE (107 tok/s) For dense-model quality, Gemma 4 31B
12GB Gemma 4 12B (35 tok/s) Gemma 4 E4B (71 tok/s) Qwen 3.6 doesn’t fit in 12GB
8GB or less Gemma 4 E4B (71 tok/s) Qwen 3 8B (127 tok/s) For audio input too, Gemma 4 E4B

If you have a 24GB GPU, Qwen 3.6 27B is the overall best. It’s well-balanced on speed, quality, and benchmarks, and on SWE-bench it showed performance approaching Claude Opus 4.8. Ideally, pair it with Gemma 4 26B MoE for the moments when you want speed.

On a 12GB GPU, Gemma 4 12B is the solid pick. Since Qwen 3.6 doesn’t offer a model in this VRAM tier, Gemma 4 is effectively the best option. When you need speed, switch to Gemma 4 E4B.

References

In 2026, open-weight models run at a level rivaling paid models if you have a 24GB GPU. Even at 12GB you can now get practical quality. Start by checking your GPU’s VRAM and picking a model from the tables above.

Hardware used for testing

[kimono_product id="15761″]

[kimono_product id="15759″]