Running huge LLMs (235B, 745B) on consumer & mini-PC hardware — hands-on

Can consumer and mini-PC hardware actually run huge language models — 100B, 235B, even 745B? I bought the gear, ran the models, and measured it. Here are the real numbers from two very different machines, plus the one setting that decides whether a big model runs at a usable speed or stalls forever.

The two test machines

  • Desktop: a single RTX 3090 (24 GB VRAM) + 62 GB system RAM, NVMe SSD (~1.7 GB/s sequential read).
  • Mini-PC: a GMKtec EVO-X2 (AMD Ryzen AI Max, Radeon 8060S iGPU) with 128 GB unified memory and a fast SSD (~3.1 GB/s). Its Vulkan device exposes ~116 GB as “VRAM”.

Everything below is hands-on: same prompts, measured tokens per second, using Ollama and llama.cpp.
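For readers who want to reproduce the tokens-per-second figures, here is a minimal sketch of one way to derive them from Ollama. It assumes a non-streaming /api/generate response, which reports eval_count (tokens generated) and eval_duration (nanoseconds). The exact method behind the numbers in this article is not spelled out, so treat this as an illustration; the sample values are made up.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed computed from an Ollama /api/generate response."""
    eval_count = response["eval_count"]            # number of tokens generated
    eval_duration_ns = response["eval_duration"]   # generation time in nanoseconds
    return eval_count / (eval_duration_ns / 1e9)


# Illustrative values only, not a measurement from this article:
# 200 tokens generated in 10 seconds.
sample = {"eval_count": 200, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

Load time is reported separately (load_duration in the same response), which is why a model can load quickly yet still generate slowly.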

The big finding: mmap is what stalls large models

On the EVO-X2, large models (73 GB and up) appeared to “hang”: they loaded, then never produced a token. The disk isn’t the bottleneck (it reads at 3.1 GB/s sequentially). The culprit is memory-mapped loading (mmap=true, the default), which pulls model weights in on demand through page faults, so the SSD sees slow, random reads instead of one fast sequential pass. Turning it off changes everything:

Model                     Size     Default (mmap)   use_mmap:false
GLM-4.5-Air               73 GB    stalls           ✅ 22 s load, 14.7 tok/s
Qwen3-235B (Q2)           86 GB    stalls           ✅ 27 s load, 19.9 tok/s
MiniMax-M2 (108B-class)   108 GB   stalls           ❌ OOM (too big)
gpt-oss:120b              65 GB    ✅ 35 tok/s      -

So on a 128 GB mini-PC, the practical ceiling is around 86–90 GB — and a 235B mixture-of-experts model runs at a genuinely usable ~20 tokens/second. The 108 GB model is just over the line and OOMs.
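As a rough sketch of how the use_mmap:false setting can be applied per request, the snippet below builds an Ollama /api/generate payload with the option turned off. It assumes Ollama is listening on its default local port (11434); the helper names build_request and generate are made up for this example, and whether the option is honored can depend on the Ollama version and backend.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(model: str, prompt: str, use_mmap: bool = False) -> dict:
    """Build a non-streaming generate payload with memory-mapped loading disabled."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                    # return one JSON object instead of chunks
        "options": {"use_mmap": use_mmap},  # False = read the weights fully into memory
    }


def generate(model: str, prompt: str) -> dict:
    """Send the request to a running Ollama server and return the parsed reply."""
    body = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (needs a running Ollama server with the model already pulled):
#   reply = generate("gpt-oss:120b", "Say hello.")
#   print(reply["response"])
```

With mmap off, the whole model has to fit in memory at load time, which is consistent with the 108 GB model failing with OOM rather than stalling.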

Forcing the “impossible”: a 745B model on one RTX 3090

GLM-5.2 is a 745B model. Even at the smallest 1.5-bit quantization it is 202 GB — far beyond any single consumer GPU. Can you brute-force it with SSD as virtual memory? Yes, with two tricks:

  1. --no-warmup. Without it, llama.cpp’s startup warmup tries to touch all 202 GB of experts and never finishes (my first attempt produced zero tokens in five hours). With it, the model loads in ~40 seconds.
  2. Keep the MoE experts on the CPU (-cmoe) and let the SSD page them in on demand.
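The two tricks above can be sketched as a command line. The snippet assembles the arguments in Python so they are easy to tweak and log. The llama-cli binary name, the model file name, and the prompt are assumptions for illustration, since the exact invocation is not given here. The optional n_cpu_moe argument switches from -cmoe (all experts on the CPU) to -ncmoe N, which keeps only the experts of the first N layers on the CPU.

```python
import shlex


def build_llama_command(model_path: str, prompt: str, n_cpu_moe=None) -> list:
    """Assemble a llama.cpp command that skips warmup and keeps MoE experts on the CPU."""
    cmd = ["llama-cli", "-m", model_path, "-p", prompt]
    cmd.append("--no-warmup")              # skip the startup pass that touches every expert
    if n_cpu_moe is None:
        cmd.append("-cmoe")                # keep all MoE expert weights on the CPU
    else:
        cmd += ["-ncmoe", str(n_cpu_moe)]  # keep experts of the first N layers on the CPU
    return cmd


# Hypothetical file name and prompt, used only to show the shape of the command.
cmd = build_llama_command("model.gguf", "What is the capital of Japan?")
print(shlex.join(cmd))

# To actually run it (requires llama.cpp installed and the model on disk):
#   import subprocess
#   subprocess.run(cmd, check=True)
```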

Result: GLM-5.2 745B answered correctly (“The capital of Japan is Tokyo.”) at 0.21 tok/s on the RTX 3090 desktop, and 0.55 tok/s on the EVO-X2 (2.6× faster, thanks to more fast memory and a quicker SSD). With light tuning (-ncmoe 50 plus fewer active experts) the EVO-X2 reached 0.76 tok/s.

Is 0.21–0.76 tok/s practical? No: at the low end a single sentence takes minutes. But the point stands: the limit of “brute-forcing” a giant model isn’t whether it starts, it’s how slow generation is. And in these two tests, that speed tracked fast-memory size and SSD speed.
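To put those rates in wall-clock terms, here is a small back-of-the-envelope calculation. The 25-token sentence length is an assumption chosen for illustration; the tok/s figures are the ones measured above.

```python
def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Time needed to generate `tokens` tokens at a steady rate."""
    return tokens / tok_per_s


SENTENCE_TOKENS = 25  # assumed length of one short sentence, not a measured value

rates = [
    ("GLM-5.2 on RTX 3090 desktop", 0.21),
    ("GLM-5.2 on EVO-X2", 0.55),
    ("GLM-5.2 on EVO-X2, tuned", 0.76),
    ("Qwen3-235B (Q2) on EVO-X2", 19.9),
]
for label, rate in rates:
    print(f"{label}: {seconds_for(SENTENCE_TOKENS, rate):.0f} s per sentence")

# Ratio between the two untuned GLM-5.2 runs.
print(f"EVO-X2 vs desktop: {0.55 / 0.21:.1f}x")
```

Under that assumption, the desktop needs about two minutes per sentence at 0.21 tok/s, while the 235B model at 19.9 tok/s finishes the same sentence in roughly a second.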

What this means if you’re buying hardware

  • Want to actually use 100B-class models? A high-unified-memory mini-PC (like the EVO-X2) runs ~86 GB models at usable speed — including a 235B MoE at ~20 tok/s.
  • Have a single 24 GB GPU? You’re comfortable up to ~30B-class models; anything huge only “runs” as a very slow curiosity.
  • Chasing the biggest models? Fast memory capacity and SSD throughput matter more than raw GPU horsepower once the model spills to disk.

All figures are first-party measurements (Ollama / llama.cpp, Vulkan build), July 2026. Your results will vary with quantization, drivers and cooling.
