Running huge LLMs (235B, 745B) on consumer & mini-PC hardware — hands-on

July 4, 2026

Can consumer and mini-PC hardware actually run huge language models — 100B, 235B, even 745B? I bought the gear, ran the models, and measured it. Here are the real numbers from two very different machines, plus the one setting that decides whether a big model runs at a usable speed or stalls forever.

Contents

1. The two test machines
2. The big finding: mmap is what stalls large models
3. Forcing the “impossible”: a 745B model on one RTX 3090
4. What this means if you’re buying hardware

The two test machines

Desktop: a single RTX 3090 (24 GB VRAM) + 62 GB system RAM, NVMe SSD (~1.7 GB/s sequential read).
Mini-PC: a GMKtec EVO-X2 (AMD Ryzen AI Max, Radeon 8060S iGPU) with 128 GB unified memory and a fast SSD (~3.1 GB/s). Its Vulkan device exposes ~116 GB as “VRAM”.

Everything below is hands-on: same prompts, measured tokens per second, using Ollama and llama.cpp.

The big finding: `mmap` is what stalls large models

On the EVO-X2, large models (73 GB and up) appeared to “hang”: they loaded, then never produced a token. Sequential disk throughput isn’t the bottleneck (the SSD reads at 3.1 GB/s). The culprit is memory-mapped loading (mmap enabled, the default), which faults model weights in on demand as slow, random reads. Turning it off with use_mmap: false changes everything:

Model	Size	Default (mmap on)	mmap off (use_mmap: false)
GLM-4.5-Air	73 GB	stalls	✅ 22s load, 14.7 tok/s
Qwen3-235B (Q2)	86 GB	stalls	✅ 27s load, 19.9 tok/s
MiniMax-M2 (108B-class)	108 GB	stalls	❌ OOM (too big)
gpt-oss:120b	65 GB	✅ 35 tok/s	—

So on a 128 GB mini-PC, the practical ceiling is around 86–90 GB — and a 235B mixture-of-experts model runs at a genuinely usable ~20 tokens/second. The 108 GB model is just over the line and OOMs.
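For readers who want to reproduce the mmap comparison, here is a minimal sketch of sending a request to a local Ollama server with memory-mapping disabled. It assumes Ollama is listening on its default port and that the model tag you pass has already been pulled; the function names (`build_payload`, `generate`, `tokens_per_second`) are illustrative, not the exact script used for the table above.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumption: server runs on this machine).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, use_mmap: bool = False) -> dict:
    """Build a non-streaming /api/generate request body.

    The per-request "options" object carries use_mmap; False asks the
    runner to read the weights into memory instead of memory-mapping them.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"use_mmap": use_mmap},
    }


def generate(model: str, prompt: str, use_mmap: bool = False,
             timeout: float = 600.0) -> dict:
    """POST the request and return the decoded JSON response."""
    body = json.dumps(build_payload(model, prompt, use_mmap)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


# Example (needs a running server; the model tag is a placeholder):
#   r = generate("your-model-tag", "What is the capital of Japan?")
#   print(round(tokens_per_second(r), 1), "tok/s")
```

Running the same prompt once with `use_mmap=True` and once with `use_mmap=False` gives a like-for-like comparison on your own hardware.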

Forcing the “impossible”: a 745B model on one RTX 3090

GLM-5.2 is a 745B model. Even at the smallest 1.5-bit quantization it is 202 GB — far beyond any single consumer GPU. Can you brute-force it with SSD as virtual memory? Yes, with two tricks:

1. Pass --no-warmup. Without it, llama.cpp’s startup warmup tries to touch all 202 GB of experts and never finishes (my first attempt produced zero tokens in five hours). With it, the model loads in ~40 seconds.
2. Keep the MoE experts on the CPU (-cmoe) and let the SSD page them in on demand.

Result: GLM-5.2 745B answered correctly (“The capital of Japan is Tokyo.”) at 0.21 tok/s on the RTX 3090 desktop, and 0.55 tok/s on the EVO-X2 (2.6× faster, thanks to more fast memory and a quicker SSD). With light tuning (-ncmoe 50 plus fewer active experts) the EVO-X2 reached 0.76 tok/s.
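The two tricks translate into a short llama.cpp command line. The sketch below only assembles that command, so it can be inspected before anything is launched. The model path and prompt are hypothetical placeholders, and any other flags used in the actual runs (context size, GPU layer count, the expert-count override) are not shown because the text above does not specify them.

```python
import shlex


def llama_cli_command(model_path, prompt, n_cpu_moe=None, binary="llama-cli"):
    """Assemble a llama.cpp command using the flags named in the article.

    --no-warmup   skip the startup warmup pass
    -cmoe         keep all MoE expert weights on the CPU
    -ncmoe N      keep the expert weights of only the first N layers on the CPU
    """
    cmd = [binary, "-m", model_path, "--no-warmup"]
    if n_cpu_moe is None:
        cmd.append("-cmoe")
    else:
        cmd += ["-ncmoe", str(n_cpu_moe)]
    cmd += ["-p", prompt]
    return cmd


# Hypothetical path and prompt, for illustration only.
MODEL = "/models/glm-5.2-745b.gguf"
PROMPT = "What is the capital of Japan?"

baseline = llama_cli_command(MODEL, PROMPT)
tuned = llama_cli_command(MODEL, PROMPT, n_cpu_moe=50)

print(shlex.join(baseline))
print(shlex.join(tuned))
```

Passing the resulting list to `subprocess.run` would start the run; printing it first makes it easy to check the flags against your llama.cpp build’s `--help` output.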

Is 0.2–0.76 tok/s practical? No. At 0.21 tok/s a 30-token sentence takes over two minutes, and even at 0.76 tok/s it takes about 40 seconds. But the point stands: the limit of “brute-forcing” a giant model isn’t whether it starts, it’s how slow generation is. In these tests, that speed rose with fast-memory size and SSD speed.
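To put those rates in perspective, here is a quick back-of-the-envelope calculation in Python. The 30-token sentence length is an assumption chosen for illustration; the tokens-per-second figures are the measurements reported above.

```python
def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady rate."""
    return tokens / tok_per_s


# Measured generation rates from this article (tokens per second).
RATES = {
    "GLM-5.2 745B, RTX 3090 desktop": 0.21,
    "GLM-5.2 745B, EVO-X2": 0.55,
    "GLM-5.2 745B, EVO-X2 tuned": 0.76,
    "Qwen3-235B (Q2), EVO-X2": 19.9,
}

SENTENCE_TOKENS = 30  # assumed length of one sentence, for illustration

for name, rate in RATES.items():
    print(f"{name}: {seconds_for(SENTENCE_TOKENS, rate):.1f} s per sentence")

# Ratio of the two untuned 745B results.
speedup = RATES["GLM-5.2 745B, EVO-X2"] / RATES["GLM-5.2 745B, RTX 3090 desktop"]
print(f"EVO-X2 vs RTX 3090 desktop: {speedup:.1f}x")
```

The gap between the 745B rows and the 235B row is the difference between a model that spills to disk and one that fits in fast memory.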

What this means if you’re buying hardware

Want to actually use 100B-class models? A high-unified-memory mini-PC (like the EVO-X2) runs ~86 GB models at usable speed — including a 235B MoE at ~20 tok/s.
Have a single 24 GB GPU? You’re comfortable up to ~30B-class models; anything huge only “runs” as a very slow curiosity.
Chasing the biggest models? Fast memory capacity and SSD throughput matter more than raw GPU horsepower once the model spills to disk.

All figures are first-party measurements (Ollama / llama.cpp, Vulkan build), July 2026. Your results will vary with quantization, drivers and cooling.
