5 Local LLMs Benchmarked at Home: 31B to 235B (2026)

July 5, 2026

When I run local LLMs (large language models) on my home PC, the question I always wrestle with is “which model size is actually practical?” On a single machine, I ran five models from 31B to 235B and measured generation speed, the GPU/CPU split, and a bit of “smartness.”

Bottom line first: what decides speed isn’t the model’s total parameter count, but how much of the active parameters (the weights actually used to generate one token) sits in fast memory (VRAM). A MoE with a small active set that fits in VRAM was the fastest, and models that read a lot per token slowed to a crawl the moment they spilled out of VRAM, however large their total size.

* The figures in this article are measured on the physical machine below (verified: June 16, 2026). They change with the model, quantization, settings, and environment.

1. The test environment
2. Measurement method
3. Results: generation speed of the five models
4. What decided the speed: active parameters × memory
5. Why the 70B slowed down so much
6. The MoE “breakthrough”: 120B faster than 70B
7. The limit challenge: taking on 235B
- 7.1. Active 22B, so why slow?
8. How was the smartness?
9. Conclusion: which size is practical
10. Gear used for testing
11. References

The test environment

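Here’s the PC I ran everything on: an NVIDIA RTX 3090 (24GB VRAM) plus an NVIDIA RTX 3060 (12GB VRAM) for 36GB of VRAM in total, a Zen 2 CPU, and 64GB of DDR4 system RAM (counted as 62GB in the totals below).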

For how to choose a GPU, see the full GPU spec list; for a real dual-card example, see the dual-GPU article; and for setting up Ollama, see the Ollama setup article.

Measurement method

To keep it fair, I matched the conditions across all models.

I used Ollama’s API, fixed the context length at 4096, set temperature=0, and measured with the same prompt. For generation speed (tokens per second), I ran the same generation multiple times and took the median, with no heavy processing running in the background. For smartness, I had the models solve small problems (simultaneous equations, logic, code, hallucination traps, instruction-following) and checked the answers.
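As a rough illustration of this procedure, the sketch below calls Ollama’s `/api/generate` endpoint with the fixed context length and temperature, converts the `eval_count` and `eval_duration` fields of the response into tokens per second, and takes the median. The URL is Ollama’s default local endpoint; the run count of 5 and the helper names are assumptions, since the text above only says “multiple times.”

```python
import json
import statistics
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's response fields (eval_duration is in nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)


def run_once(model: str, prompt: str) -> float:
    """Send one non-streaming request with the fixed settings and return tok/s."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": 4096, "temperature": 0},
    }
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return tokens_per_second(body["eval_count"], body["eval_duration"])


def median_speed(model: str, prompt: str, runs: int = 5) -> float:
    """Median tok/s over several identical runs (runs=5 is an assumed value)."""
    return statistics.median(run_once(model, prompt) for _ in range(runs))
```

With a model already pulled, `median_speed("gemma4:31b", "your fixed prompt")` would return a single figure comparable across models, provided nothing else is loading the machine.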

Let me also define the terms. VRAM is the GPU’s own memory; active parameters are the subset of weights actually used to produce one token in a MoE (Mixture of Experts) model; quantization is a technique that stores weights at lower precision to shrink the model (e.g., Q2 is about 2-bit, Q4 about 4-bit).
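To see how loose “about 2-bit” and “about 4-bit” are, this small sketch backs out the effective bits per weight from the file sizes quoted later in this article. It assumes 1GB = 10^9 bytes and that the whole file is weights, so treat the results as approximations.

```python
def bits_per_weight(file_gb: float, params_billion: float) -> float:
    """Effective bits per weight, assuming the file is all weights and 1 GB = 1e9 bytes."""
    return file_gb * 8 / params_billion


print(round(bits_per_weight(42.0, 70.0), 2))   # llama3.3:70b at Q4, about 42 GB -> 4.8
print(round(bits_per_weight(85.0, 235.0), 2))  # Qwen3-235B-A22B at Q2, about 85 GB -> 2.89
```

Both come out above the nominal bit count, which is why file size is a better guide than the Q-number when checking whether a model fits.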

Results: generation speed of the five models

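In descending order of measured generation speed: qwen3.6:35b-a3b at 95.5 tok/s, gemma4:31b at 24.5 tok/s, gpt-oss:120b at 17.6 tok/s, Qwen3-235B-A22B (Q2) at about 3.9 tok/s, and llama3.3:70b at 3.5 tok/s. Let’s go through what each one means.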

What decided the speed: active parameters × memory

When an LLM generates text (decode: producing one token at a time), it reads the model’s weights from memory for every token. The time per token is roughly “the amount of weights read per token ÷ the bandwidth of that memory,” and speed is the inverse of that. Between VRAM (about 936GB/s on the RTX 3090) and system DDR4 RAM (roughly 40–50GB/s), there’s about a 20x difference in bandwidth.
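The relation above can be sketched as a simple ceiling calculation. The 45GB/s DDR4 figure is an assumed midpoint of the 40–50GB/s range, and the function ignores compute and overhead, so it gives an upper bound rather than a prediction.

```python
VRAM_GBPS = 936.0  # RTX 3090 figure quoted in the article
DDR4_GBPS = 45.0   # assumed midpoint of the article's 40-50 GB/s range


def decode_ceiling(gb_read_per_token: float, bandwidth_gbps: float) -> float:
    """Upper bound on tokens/s if reading weights were the only cost."""
    return bandwidth_gbps / gb_read_per_token


print(round(VRAM_GBPS / DDR4_GBPS, 1))            # bandwidth ratio -> 20.8
print(round(decode_ceiling(21.0, VRAM_GBPS), 1))  # 21 GB read per token from VRAM -> 44.6
print(round(decode_ceiling(21.0, DDR4_GBPS), 1))  # same 21 GB read from DDR4 -> 2.1
```

The 21GB case corresponds to gemma4:31b; its measured 24.5 tok/s sits below the VRAM ceiling, as you would expect from a bound that counts memory reads only.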

There are two key points.

Does it fit in VRAM, or does it spill? If the model fits in the 36GB of VRAM, it all runs on fast VRAM. Whatever spills is processed by the CPU on the slower DDR4 side, and that’s what dragged things down.

The amount read each time = active parameters. A dense model reads all parameters every token, but a MoE uses only some of its “experts,” so it reads far less.
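A minimal sketch of the second point, using the sizes and parameter counts quoted in this article. It assumes every weight is stored at the same bits per weight, so the per-token read is simply the file size scaled by the active fraction; real MoE files are not perfectly uniform, so this is only a rough estimate.

```python
def gb_read_per_token(file_gb: float, total_b: float, active_b: float) -> float:
    """Approximate GB of weights read per token, assuming uniform bits per weight."""
    return file_gb * (active_b / total_b)


dense_70b = gb_read_per_token(42.0, 70.0, 70.0)  # dense: every weight is active
moe_120b = gb_read_per_token(69.0, 120.0, 5.1)   # MoE: only the active 5.1B

print(dense_70b)                        # 42.0
print(round(moe_120b, 2))               # 2.93
print(round(dense_70b / moe_120b, 1))   # 14.3
```

Under these assumptions the 120B MoE touches roughly one fourteenth as much memory per token as the 70B dense model, despite being the larger file.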

Why the 70B slowed down so much

The clearest case was llama3.3:70b (dense, about 42GB at Q4). It measured just 3.5 tok/s, and the cause is clear: the 42GB of weights didn’t fit in the 36GB of VRAM, so about 23% was offloaded to the DDR4 RAM side (Ollama showed “23%/77% CPU/GPU”). The slow DDR4 became the limiting factor, holding the whole thing to single-digit speed.

gemma4:31b (21GB) fit in VRAM and ran at 24.5 tok/s on 100% GPU. A cliff of “fast if it fits, about 1/7 the speed if it spills” showed up clearly within the same PC.

Also, increasing the context length puts even more pressure on VRAM via the KV cache. The 70B dropped from 3.5 tok/s at a 4k setting to 2.4 tok/s at a 32k setting. The longer the text you handle, the worse the penalty.
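To make the bottleneck concrete, here is a back-of-the-envelope sketch for a model split between VRAM and DDR4. It assumes the two portions are read one after the other for each token, uses 45GB/s as an assumed DDR4 midpoint, and ignores compute, so it is an estimate of the ceiling and not a model of what Ollama actually does.

```python
def split_decode_ceiling(file_gb: float, cpu_fraction: float,
                         vram_gbps: float = 936.0, ddr4_gbps: float = 45.0) -> float:
    """Ceiling on tokens/s when cpu_fraction of a dense model is read from DDR4."""
    cpu_gb = file_gb * cpu_fraction
    gpu_gb = file_gb - cpu_gb
    seconds_per_token = gpu_gb / vram_gbps + cpu_gb / ddr4_gbps
    return 1.0 / seconds_per_token


print(round(split_decode_ceiling(42.0, 0.23), 1))  # 23% on DDR4 -> 4.0
print(round(split_decode_ceiling(42.0, 0.0), 1))   # hypothetical: all in VRAM -> 22.3
```

Under these assumptions the 23% that sits in DDR4 accounts for most of the time per token, and the estimated ceiling of about 4.0 tok/s lands close to the measured 3.5.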
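For a sense of scale, the KV cache size can be estimated from the model’s shape. The defaults below (80 layers, 8 key/value heads, head dimension 128, 16-bit cache values) are my assumptions for a Llama 3.3 70B class model and are not figures from this article’s measurements.

```python
def kv_cache_gib(context_tokens: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a full context window."""
    # Factor 2 covers keys plus values, stored per layer and per KV head.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30


print(kv_cache_gib(4096))   # 1.25
print(kv_cache_gib(32768))  # 10.0
```

If those assumptions hold, moving from 4k to 32k adds close to 9GiB of cache, which on a 36GB setup that is already over capacity pushes still more of the model out of VRAM.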

The MoE “breakthrough”: 120B faster than 70B

The interesting one is gpt-oss:120b (MoE, active 5.1B, 69GB). It’s larger than the 70B, and more than half (51%) spilled to RAM, yet generation ran at 17.6 tok/s, about 5x the 70B dense.

The reason is the active parameters. Even at 120B, each token only reads the active 5.1B. Whereas the 70B dense reads all 70B every time, the 120B reads an order of magnitude less. So even spilling to RAM, it kept running fast. “Even if the total is huge and it doesn’t all fit in VRAM, a MoE with a small active set delivers practical speed”: this was the key to running large models locally.

The limit challenge: taking on 235B

Finally, I took on the one with the largest total parameters, Qwen3-235B-A22B (Q2 quantized, about 85GB). This PC has 36GB VRAM + 62GB RAM = 98GB total, so compressed to Q2 it just barely fits in theory.

But getting it to run was a struggle.

Ollama can’t directly read a split GGUF (the 235B Q2 comes as a two-file set), and even when given a local path to the first file it couldn’t pick up the second. Another runtime (the CUDA build of llama-cpp-python) wasn’t compatible with this CPU’s (Zen 2) instruction set and crashed on startup. In the end, I built llama.cpp’s tools from source, merged the two files into one, imported the result into Ollama, and got it running.
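The merge-and-import step might look roughly like the sketch below, which only assembles the commands and does not run them. It assumes the llama.cpp tool used was `llama-gguf-split` with its `--merge` option (the article does not name the tool), and every file name and the model name are hypothetical placeholders.

```python
import shlex


def merge_and_import_commands(first_part: str, merged_path: str,
                              model_name: str, modelfile: str = "Modelfile") -> list:
    """Build the two commands: merge the split GGUF, then register it with Ollama."""
    merge = ["llama-gguf-split", "--merge", first_part, merged_path]
    create = ["ollama", "create", model_name, "-f", modelfile]
    return [merge, create]


def modelfile_text(merged_path: str) -> str:
    """Minimal Modelfile pointing Ollama at the merged GGUF."""
    return f"FROM {merged_path}\n"


# Hypothetical file and model names, for illustration only.
commands = merge_and_import_commands(
    "model-q2-00001-of-00002.gguf", "./model-q2-merged.gguf", "qwen3-235b-q2")
for command in commands:
    print(shlex.join(command))
print(modelfile_text("./model-q2-merged.gguf"), end="")
```

Printing the commands first makes it easy to check the paths before running them by hand, which matters when each file is tens of gigabytes.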

The result was about 3.9 tok/s of generation. Loading took 97 seconds, memory totaled 119GB resident, and the split was 71% on the RAM/CPU side, only 29% in VRAM. The output held together reasonably well, and it correctly set up the simultaneous equations. But because I compressed all the way to Q2, very occasionally I saw garbled characters.

Active 22B, so why slow?

You might think “if it’s a MoE with active 22B, shouldn’t it be fast?” In practice, two disadvantages piled up.

One is that active 22B is “not small.” It’s about 4x gpt-oss’s active 5.1B, so the amount read and computed per token is also about 4x. A simple calculation gives about 1/4 of gpt-oss’s 17.6 tok/s = around 4.4 tok/s, which roughly matches the measured 3.9. The other is that most of it (71%) is CPU-processed on the slow DDR4 side. These two combined dropped the 235B, the largest model by total parameters, down to about the speed of the 70B dense.

It’s not that the 235B is inherently slow; it’s slowed by this PC’s constraint of 36GB VRAM + slow DDR4. If the 235B all sat in fast, large-capacity memory (like 256–512GB of unified memory), even at active 22B it should run much faster. This is why large-capacity, high-bandwidth unified-memory machines (a Mac Studio or a large-capacity mini PC) exist. A comparison of large-memory machines is collected in the large-memory PC comparison article.
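The same scaling can be written out without rounding the ratio to 4x. This is only the simple proportional estimate from the paragraph above; it assumes speed scales inversely with active parameters and ignores the differences in quantization and in the RAM/VRAM split between the two models.

```python
def scale_by_active(reference_tok_s: float, reference_active_b: float,
                    target_active_b: float) -> float:
    """Estimate speed by assuming tokens/s scales inversely with active parameters."""
    return reference_tok_s * reference_active_b / target_active_b


estimate = scale_by_active(17.6, 5.1, 22.0)  # gpt-oss:120b -> Qwen3-235B-A22B
print(round(estimate, 2))  # 4.08
```

Using the exact ratio (22 / 5.1, about 4.3) gives roughly 4.1 tok/s, slightly closer to the measured 3.9 than the rounded 4.4.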

How was the smartness?

On this round’s small problems (simultaneous equations, logic, code, hallucination traps, instruction-following), all five models answered the basic tasks largely correctly. Each set up the simultaneous equations correctly, all models got the “pick the thing that doesn’t exist (a lunar city)” trap right, and they obediently followed the instruction to output JSON. For basic uses, my impression was that the gap in “does it answer properly?” is small relative to the gap in size. The 235B, due to Q2 compression, very rarely had glitches; it’s worth keeping in mind that the harder you push quantization, the more quality drops.

Conclusion: which size is practical

Here’s my takeaway on this PC (36GB VRAM + 62GB RAM).

- The most comfortable were a MoE with a small active set that fits in VRAM (qwen3.6:35b-a3b = 95.5 tok/s) and a mid-size dense model (gemma4:31b = 24.5 tok/s). For everyday use, this band feels like the realistic answer.
- Even for large models, a MoE with a small active set (gpt-oss:120b = 17.6 tok/s) delivered practical speed even spilling out of VRAM. It’s the “sweet” band where you get both knowledge and speed.
- The 70B dense and the 235B ran at single-digit speeds because they didn’t fit in VRAM and offloaded large amounts to slow RAM. They run, but everyday use feels rough. To use them fast, you’d need more VRAM (like 24GB×2) or a large-capacity, high-bandwidth unified-memory machine.

Local LLMs aren’t “bigger is better.” The shortcut to practicality was to pick a model whose “active set fits” in your VRAM.

Gear used for testing

The measurements in this article were done on this PC. Prices and stock fluctuate, so check the latest information before buying.

- NVIDIA RTX 3090 (VRAM 24GB, used street price about ¥130–180k): the core for comfortably running mid-size models that fit in VRAM
- NVIDIA RTX 3060 (VRAM 12GB): for support and small models
- DDR4 memory (64GB): if you just want to “make it run” for large models that don’t fit in VRAM, capacity helps (though speed is hard to get)
- For options in large-memory PCs that run big LLMs on a single machine, see the large-memory PC comparison article

References

Introducing gpt-oss (OpenAI) — confirming gpt-oss 120b’s total and active parameters
Qwen3-235B-A22B (Hugging Face) — confirming the MoE configuration of total 235B / active 22B
Llama-3.3-70B-Instruct-GGUF (Hugging Face) — confirming the ~42GB file size at Q4_K_M
GeForce RTX 3090 (NVIDIA) — confirming the ~936GB/s memory bandwidth
