Gemma 4 vs Qwen 3.6: Which Should You Run on Your GPU?
I dropped Google’s Gemma 4 and Alibaba’s Qwen 3.6 onto my home PC with an RTX 3090 and compared their performance. Both are open-weight LLMs (large language models) released in April 2026, and each comes in several sizes.
In this article I measured generation speed, output quality, math reasoning, and coding ability to figure out “which model fits your GPU’s VRAM.” Bottom line: on a 24GB GPU, Qwen 3.6 27B was the best balance of quality and speed; on a 12GB GPU, Gemma 4 12B was.
Measured June 2026, via Ollama with Q4_K_M quantization.
- 1. Test setup
- 2. The models compared
- 3. Generation speed across all models
- 4. If you’re on a 24GB GPU (RTX 3090 / 4090)
- 5. If you’re on a 12GB GPU (RTX 3060 / 4060)
- 6. If you’re on an 8GB-or-less GPU
- 7. Output quality: a hands-on comparison
- 8. Where they stand against paid models
- 9. Differences in Thinking mode
- 10. Summary — the best model for each VRAM tier
Test setup
The models compared
In addition to every size of Gemma 4 and Qwen 3.6, I included the previous-generation Qwen 3 / 3.5.
Models under test
| Model | Type | Parameters | File size | Approx. VRAM needed |
|---|---|---|---|---|
| Gemma 4 E4B | dense | 8B | 9.6GB | ~6GB |
| Gemma 4 12B | dense | 12B | 7.4GB | ~8GB |
| Gemma 4 26B | MoE (4B active) | 26B | 17GB | ~18GB |
| Gemma 4 31B | dense | 31B | 19GB | ~20GB |
| Qwen 3.6 27B | dense | 27B | 17GB | ~19GB |
| Qwen 3.6 35B-A3B | MoE (3B active) | 35B | 23GB | ~25GB |
| Qwen 3 8B | dense (prev. gen) | 8B | 5.2GB | ~6GB |
| Qwen 3.5 9B | dense (prev. gen) | 9B | 6.6GB | ~7GB |
| Qwen 3 14B | dense (prev. gen) | 14B | 9.3GB | ~14GB |
VRAM figures are measured values at Q4_K_M quantization. With MoE, all parameters sit in VRAM, but fewer of them run per inference step (= faster).
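As a rough cross-check of the file sizes in the table, weight size can be approximated as parameters × bits per weight ÷ 8. The sketch below assumes an average of about 4.85 bits per weight for Q4_K_M (an assumption; the effective figure varies by architecture) and ignores the KV cache and runtime overhead, which is one reason VRAM use differs from file size. The function name is illustrative.

```python
# Rough estimate of quantized weight size; not a measurement.
# ASSUMPTION: Q4_K_M averages ~4.85 bits per weight (varies by architecture).
Q4_K_M_BITS_PER_WEIGHT = 4.85


def estimate_weights_gb(params_billion: float,
                        bits_per_weight: float = Q4_K_M_BITS_PER_WEIGHT) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Total parameter counts from the table above. For MoE models the total
    # (not the active) count is what has to sit in VRAM.
    for name, params in [("Gemma 4 12B", 12), ("Qwen 3.6 27B", 27), ("Gemma 4 31B", 31)]:
        print(f"{name}: ~{estimate_weights_gb(params):.1f} GB of weights")
```

For the 12B, 27B, and 31B dense models this gives roughly 7.3, 16.4, and 18.8 GB, close to the file sizes listed in the table.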
Qwen 3.6 comes in only two sizes, 27B and 35B-A3B, with nothing in the 8–14B range. Since Qwen 3.6 isn’t an option on 12GB-or-less GPUs, you’ll be using the previous-generation Qwen 3 / 3.5 there.
Generation speed across all models
Generation speed (tok/s) — all models, measured
Chart series: Gemma 4, Qwen, Gemini, and published API figures (Artificial Analysis). Local models were measured on RTX 3090 + Q4_K_M + Thinking OFF. API speeds depend on your network.
MoE (Mixture of Experts) models run only a fraction of their parameters per inference step, so they’re far faster than dense models. Gemma 4 26B MoE, with about 4B active parameters, hit 107 tok/s — beating the paid APIs GPT-5.5 (about 61 tok/s) and Claude Sonnet 4.6 (about 48 tok/s), a clear illustration of the speed advantage of local MoE models. That said, API speeds swing with server load and network conditions, so treat them as reference points only (speed data source: Artificial Analysis).
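One way to reproduce this kind of tok/s measurement is to read the timing fields Ollama returns with each non-streaming generation: `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below is a minimal example under those assumptions, not necessarily how the figures above were collected; the endpoint is Ollama's default local address, and any model tag you pass in is up to you.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model: str, prompt: str) -> float:
    """Run one non-streaming generation and return tok/s (needs a running Ollama)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Calling `measure(tag, prompt)` with a tag shown by `ollama list` returns the generation-only speed. It excludes prompt processing and model load time, so a few repeated runs give a steadier number than a single call.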
If you’re on a 24GB GPU (RTX 3090 / 4090)
With 24GB of VRAM, every model is on the table, although two of them (Gemma 4 31B and Qwen 3.6 35B-A3B) ran split across two GPUs. The four main candidates were:
For 24GB GPUs — all candidate models compared
| Metric | Qwen 3.6 27B | Gemma 4 31B | Gemma 4 26B MoE | Qwen 3.6 35B-A3B MoE |
|---|---|---|---|---|
| Type | dense | dense | MoE (4B active) | MoE (3B active) |
| Speed (measured) | 39 tok/s | 24 tok/s | 107 tok/s | 97 tok/s |
| GPQA 5Q (measured) | 4/5 | 4/5 | 4/5 | 4/5 |
| VRAM used | ~19GB | ~20GB (2 GPUs) | ~18GB | ~25GB (2 GPUs) |
| Coding | ◎ | ○ | ○ | ○ |
| Multimodal | Images & video | Images | Images | Images & video |
| BenchLM overall | 73 | 64 | − | − |
For quality, Qwen 3.6 27B. Its BenchLM overall score beat Gemma 4 31B, 73 to 64, and on coding benchmarks the gap was wide — an average of 70.6 vs 41.6. Speed is a practical 39 tok/s, and it fits on a single 24GB GPU. Its SWE-bench Verified pass rate was 77.2%. That still trails the paid Claude Opus 4.8 (88.6%) and GPT-5.5 (88.7%), but it’s a remarkable level for a model you can run locally for free.
For speed, Gemma 4 26B MoE. At 107 tok/s it’s about 2.7 times as fast as Qwen 3.6 27B (39 tok/s). Being MoE it has only 4B active parameters, but output quality and math reasoning had no practical issues. It’s plenty for chat and light writing.
For dense-model quality, Gemma 4 31B. This model runs all of its parameters rather than using MoE, and on GPQA Diamond and MMLU-Pro it came close to Qwen 3.6 27B (85.7% vs 87.8% and 85.2% vs 86.2%). But at 24 tok/s it was the slowest, and even on a 24GB GPU it needed a 2-GPU split, so you have to be ready for the speed hit.
If you’re on a 12GB GPU (RTX 3060 / 4060)
With 12GB of VRAM, no size of Qwen 3.6 fits (even the smallest, 27B, needs about 19GB). Your options are Gemma 4’s smaller models or the previous-generation Qwen 3 / 3.5.
For 12GB GPUs — candidate models compared
| Metric | Gemma 4 12B | Gemma 4 E4B | Qwen 3.5 9B | Qwen 3 8B |
|---|---|---|---|---|
| Speed | 35 tok/s | 71 tok/s | 99 tok/s | 127 tok/s |
| VRAM used | ~8GB | ~6GB | ~7GB | ~6GB |
| Math reasoning | ◎ | ◎ | ◎ | ◎ |
| Quality | ◎ | ○ | ○ | ○ |
| Multimodal | Images + audio | Images + audio | Images | None |
| Generation | Latest | Latest | Prev. gen | 2 gens back |
For quality, Gemma 4 12B. It’s the newest model in the 12B class, with the highest output quality and well-structured answers. Speed is a modest 35 tok/s, but it fit comfortably on a single 12GB GPU.
For speed, Gemma 4 E4B. It manages a practical 71 tok/s and accepts both image and audio input, as does Gemma 4 12B (Qwen 3.5 9B accepts images only). Qwen 3.5 9B and Qwen 3 8B are faster still, but they are older generations. Output quality is slightly below Gemma 4 12B, but it was plenty for chat use.
If you’re on an 8GB-or-less GPU
On 8GB or less (RTX 3050, GTX 1070, and the like), the realistic options are Gemma 4 E4B (~6GB VRAM) or Qwen 3 8B (~6GB VRAM). Qwen 3 8B is the fastest at 127 tok/s, but as an older generation it lagged on benchmarks. If you don’t need multimodal, picking Qwen 3 8B for the speed is one reasonable move.
Output quality: a hands-on comparison
I compared the models with the same prompt: “the pros and cons of running a local LLM on a home PC, in about 500 words.” Here are the answers from the two main 24GB-class models.
Gemma 4 31B’s answer
On the other hand, the drawbacks are hardware cost and performance limits. Running it comfortably needs a high-end GPU (one with lots of VRAM), which means an outlay of tens of thousands to hundreds of thousands of yen to get started. (truncated)
Qwen 3.6 27B’s answer
On the other hand, the drawbacks are the high hardware requirements and inference speed. You need a high-end GPU (16GB+ VRAM recommended) and plenty of memory, and it struggles to run on an older PC. (truncated)
Gemma 4 31B tends to structure with bold text and headings, while Qwen 3.6 27B summarized the key points more concisely. Both produced natural prose, and both held up well on a formal-rewrite test too. Models of 12B and under are fine as far as basic language goes, but 27B-and-up models pull ahead in the depth of their answers and the richness of their examples.
Where they stand against paid models
How to read the benchmarks
A note on the benchmarks cited in this article.
| Benchmark | What it measures | Questions | Difficulty |
|---|---|---|---|
| GPQA Diamond | PhD-level science reasoning. Four-choice questions written by physics, chemistry, and biology experts; non-experts score poorly even with Google | 198 | Random guessing = 25%. Even experts ~65% |
| SWE-bench Verified | Read an issue (bug report) from a real GitHub Python project, generate a code patch, and fix it. Correct if the tests pass | 500 | Measures real-world software-engineering ability |
| MMLU / MMLU-Pro | A knowledge test spanning 57 fields (STEM, humanities, social science, and more). Pro is a harder 10-choice version | ~14,000 / 12,000 | Broad knowledge, college to expert level |
Here’s how the models actually did on five sample GPQA Diamond questions (biology, chemistry, physics), run locally.
GPQA Diamond, 5 sample questions: number correct (measured)
Run with Thinking ON. Only 5 questions, so treat as a rough guide. Gemini 2.5 Flash was via API (thinking not specified).
What stands out is that the MoE models (Gemma 4 26B MoE, Qwen 3.6 35B-A3B) matched the dense models at 4/5 (80%). MoE has only 3–4B active parameters, but because the router selects experts from the full parameter set for each token, reasoning accuracy was higher than that small active count would suggest. Below 12B it fell to 2/5 (40%), which points to a model-size wall, though five questions are too few to be sure. The paid-model comparison below cites official benchmark results.
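To put the five-question caveat in numbers, the sketch below computes a Wilson score interval for a pass rate. This is a standard statistical formula added here as an illustration, not part of the original test procedure.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for correct in (4, 2):
        low, high = wilson_interval(correct, 5)
        print(f"{correct}/5 correct: plausible accuracy {low:.0%} to {high:.0%}")
```

With five questions, 4/5 is compatible with a true accuracy of roughly 38% to 96%, and 2/5 with roughly 12% to 77%. The ranges overlap, so the sample shows a tendency rather than a firm ranking.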
I’ve organized the gap between locally runnable models and paid cloud APIs using public benchmarks.
Compared with paid models (public benchmarks)
| Benchmark | Qwen 3.6 27B | Gemma 4 31B | Claude Sonnet 4.6 | Claude Opus 4.8 | GPT-5.5 | Gemini 3 Pro |
|---|---|---|---|---|---|---|
| GPQA Diamond | 87.8% | 85.7% | 89.9% | 93.6% | − | 91.9% |
| SWE-bench Verified | 77.2% | − | − | 88.6% | 88.7% | ~78% |
| MMLU | − | − | − | − | 92.4% | ~90% |
| MMLU-Pro | 86.2% | 85.2% | − | − | − | − |
| Pricing | Free (local) | Free (local) | Metered API | Metered API | Metered API | Metered API |
Sources: BenchLM.ai, Anthropic, OpenAI, Google official (April–June 2026). The first two columns (Qwen 3.6 27B and Gemma 4 31B) are local models.
On GPQA Diamond (PhD-level science reasoning), the locally runnable Qwen 3.6 27B (87.8%) and Gemma 4 31B (85.7%) have closed in on Claude Sonnet 4.6 (89.9%). When I put Gemini 2.5 Flash through the same five-question sample via API, it scored 1/5 (20%). Flash is a speed-focused model, so it gives ground on reasoning accuracy. On SWE-bench Verified, Qwen 3.6 27B’s 77.2% trails Claude Opus 4.8 (88.6%) and GPT-5.5 (88.7%) — on coding, the paid models lead by more than 10 points.
Differences in Thinking mode
Both Gemma 4 31B and Qwen 3.6 27B have a Thinking mode (think before answering). In Ollama you control it with "think": true/false.
Their behavior differed. Gemma 4 31B generates about 1,000 characters of thinking before answering. Qwen 3.6 27B tended to generate more than 5,000 characters of detailed thinking. Thinking tokens count toward the generation cap (num_predict), so if you turn Thinking ON for Qwen 3.6 27B, you need to set num_predict to 2048 or higher — otherwise the thinking can eat the whole budget and the answer comes back empty.
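The Thinking toggle and the generation cap described above can be sketched as a request body for Ollama's `/api/generate`. The helper names are illustrative, and the check assumes the reply carries the reasoning in a `thinking` field and the final answer in `response`; verify that against your Ollama version.

```python
def build_request(model: str, prompt: str, think: bool, num_predict: int = 2048) -> dict:
    """JSON body for Ollama's /api/generate with Thinking toggled."""
    return {
        "model": model,
        "prompt": prompt,
        "think": think,  # Thinking mode ON/OFF
        "stream": False,
        # The cap covers thinking tokens as well as the answer.
        "options": {"num_predict": num_predict},
    }


def thinking_ate_budget(response: dict) -> bool:
    """True when the reply has thinking text but an empty final answer.

    ASSUMPTION: reasoning arrives in "thinking" and the answer in "response".
    """
    has_thinking = bool(response.get("thinking", "").strip())
    has_answer = bool(response.get("response", "").strip())
    return has_thinking and not has_answer
```

If `thinking_ate_budget` returns true for a reply, raising `num_predict` and retrying is the simplest fix.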
Summary — the best model for each VRAM tier
Recommended model by VRAM
| VRAM | Best for quality | Best for speed | Notes |
|---|---|---|---|
| 24GB | Qwen 3.6 27B (39 tok/s) | Gemma 4 26B MoE (107 tok/s) | For dense-model quality, Gemma 4 31B |
| 12GB | Gemma 4 12B (35 tok/s) | Gemma 4 E4B (71 tok/s) | Qwen 3.6 doesn’t fit in 12GB |
| 8GB or less | Gemma 4 E4B (71 tok/s) | Qwen 3 8B (127 tok/s) | For audio input too, Gemma 4 E4B |
If you have a 24GB GPU, Qwen 3.6 27B is the overall best. It’s well-balanced on speed, quality, and benchmarks, and on SWE-bench Verified it scored 77.2%, about 11 points behind Claude Opus 4.8 (88.6%). Ideally, pair it with Gemma 4 26B MoE for the moments when you want speed.
On a 12GB GPU, Gemma 4 12B is the solid pick. Since Qwen 3.6 doesn’t offer a model in this VRAM tier, Gemma 4 is effectively the best option. When you need speed, switch to Gemma 4 E4B.
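The VRAM-tier reasoning above can be sketched as a filter over the figures from this article's tables. It only checks the approximate VRAM number, so it does not capture output quality or the 2-GPU split noted for Gemma 4 31B; the function name and the optional `headroom_gb` parameter are illustrative.

```python
# (name, approx. VRAM in GB, measured tok/s), copied from the tables in this article.
MODELS = [
    ("Gemma 4 E4B", 6, 71),
    ("Gemma 4 12B", 8, 35),
    ("Gemma 4 26B MoE", 18, 107),
    ("Gemma 4 31B", 20, 24),
    ("Qwen 3.6 27B", 19, 39),
    ("Qwen 3.6 35B-A3B", 25, 97),
    ("Qwen 3 8B", 6, 127),
    ("Qwen 3.5 9B", 7, 99),
]


def candidates(vram_gb: float, headroom_gb: float = 0.0) -> list[tuple[str, int, int]]:
    """Models whose approximate VRAM need fits the budget, fastest first."""
    fitting = [m for m in MODELS if m[1] + headroom_gb <= vram_gb]
    return sorted(fitting, key=lambda m: m[2], reverse=True)


if __name__ == "__main__":
    for budget in (8, 12, 24):
        names = ", ".join(name for name, _, _ in candidates(budget))
        print(f"{budget}GB: {names}")
```

Sorting by speed is only one possible ordering. For a quality-first pick, use the recommendations in the summary table instead.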
References
- BenchLM.ai — Gemma 4 31B vs Qwen 3.6 27B comparison
- Google — Gemma 4 announcement blog
- Alibaba — Qwen 3.6 27B announcement blog
- Anthropic — Claude Opus 4.8 announcement
- OpenAI — GPT-5.5 announcement
- Epoch AI — GPQA Diamond benchmark explainer
- Epoch AI — SWE-bench Verified benchmark explainer
- Ollama — Gemma 4 model page
- Ollama — Qwen 3.6 model page
In 2026, open-weight models run at a level rivaling paid models if you have a 24GB GPU. Even at 12GB you can now get practical quality. Start by checking your GPU’s VRAM and picking a model from the tables above.




