Running a Local AI Chatbot at Home: A Budget-by-Budget Guide
I run two GPUs in my main PC and run generative AI locally on them. I use ChatGPT and Claude too, but when I have them summarize work documents, I’ve increasingly caught myself wondering, “Is it really OK to send this outside?” And the monthly fees slowly add up.
So I set out to map, budget by budget, just how far you can take an AI chatbot on nothing but a home GPU.
* This article focuses on consumer GPUs that fit in an ordinary desktop PC (NVIDIA GeForce / AMD Radeon series). It doesn’t cover server/data-center GPUs like the NVIDIA A100 or H100 (40–80GB VRAM, ¥1M+). That’s why the VRAM ceiling here stops at 32GB.
- 1. Cloud AI vs. local AI
- 2. Local AI comes down to VRAM
- 3. By GPU brand: which is easiest to get running?
- 4. Getting started: pick from four apps
- 5. What changes across Windows, Mac, and Linux?
- 6. The used-GPU option
- 7. By budget: what your GPU can do
- 8. Value-for-money charts by GPU
- 9. Tokens and text length, roughly
- 10. Value quick-reference
- 11. So which should you actually buy?
- 12. Related articles
- 13. Next steps
Cloud AI vs. local AI
| | Cloud AI | Local AI |
|---|---|---|
| Privacy | Your conversations are sent to a server | Everything stays on your PC. Nothing leaves |
| Monthly cost | ChatGPT Plus ¥3,000/mo / Claude Pro ¥3,000/mo | ¥0 (electricity only; ~50–150W while the GPU runs) |
| Upfront cost | ¥0 | GPU: ¥60k–400k |
| Total cost over one year | about ¥36,000 | GPU + electricity ~¥3,000–6,000/yr |
| Internet | Required | Not needed (works offline) |
| Model smarts | The latest models like GPT-4o / Claude 3.5 | 8B–32B models (depending on your GPU’s VRAM) |
| Response speed | 40–80 tok/s | 20–130 tok/s (depending on GPU) |
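* At ¥3,000 a month, one subscription comes to about ¥36,000 a year, so a 16GB GPU (about ¥90k) pays for itself in roughly three years once electricity is counted. If you currently pay for both ChatGPT Plus and Claude Pro, the payback drops to under a year and a half.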
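To check the payback period against your own numbers, the arithmetic can be sketched as below. The inputs are the figures from the table (a ¥90k GPU, a ¥3,000/month plan, ¥3,000–6,000/yr of electricity); the function name is only for illustration.

```python
def break_even_years(gpu_price_yen: float,
                     cloud_fee_per_month_yen: float,
                     electricity_per_year_yen: float) -> float:
    """Years until the GPU's upfront cost is covered by saved subscription fees."""
    yearly_saving = cloud_fee_per_month_yen * 12 - electricity_per_year_yen
    if yearly_saving <= 0:
        raise ValueError("local running costs are not lower than the cloud fee")
    return gpu_price_yen / yearly_saving


# Figures from the table: a ¥90k 16GB GPU against one ¥3,000/month plan,
# with electricity at the low and high ends of the ¥3,000-6,000/yr range.
low = break_even_years(90_000, 3_000, 3_000)
high = break_even_years(90_000, 3_000, 6_000)
print(f"{low:.1f} to {high:.1f} years")
```

Passing 6,000 as the monthly fee models the case where you pay for two subscriptions.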
For me the biggest thing is that conversations never leave the machine. Summarizing meeting notes, personal questions — being able to use it without a second thought is local AI’s real strength.
Local AI comes down to VRAM
What really struck me after trying local LLMs is that what you can do is decided almost entirely by how much VRAM (GPU memory) you have.
What determines a local LLM’s performance (ranked by impact, largest first)
| Factor | Effect |
|---|---|
| 1. VRAM capacity | Sets how large a model you can run (most important) |
| 2. Memory bandwidth | Directly drives how fast text comes out |
| 3. GPU compute | Surprisingly little difference |
| 4. CPU / RAM | Secondary |
If you don’t have enough VRAM, you simply can’t run a smart model. Conversely, as long as you have the VRAM, even middling GPU compute runs at practical speed.
What do “27B" and “8B" mean?
In local-AI articles you often see labels like “8B model” or “27B model.” The B stands for billion and gives the model’s parameter count, the “size of its brain,” so to speak. Bigger numbers mean a smarter model, but they also eat more GPU memory (VRAM).
Comparing to AIs you already know makes it easier to picture.
| Model size | Parameters | VRAM needed | A familiar comparison |
|---|---|---|---|
| 2–4B | 2–4 billion | ~2–4GB | About the level of on-phone AI (Apple Intelligence, Gemini Nano). Can summarize text and handle simple exchanges, but weak on anything intricate |
| 8B | 8 billion | ~5–6GB | On par with the free ChatGPT’s lightweight model (GPT-4o mini). Practical for everyday chat and simple questions |
| 14B | 14 billion | ~10–11GB | The line where it starts to surpass the free ChatGPT (GPT-4o mini). Its language gets noticeably more natural. Personally, this is where it becomes genuinely usable |
| 27–32B | 27–32 billion | ~17–22GB | Quality approaching ChatGPT Plus (GPT-4o class). The “wait, this runs locally?" level |
| 70B+ | 70 billion+ | 45GB+ | On par with ChatGPT Plus or better. But it won’t run on a single ordinary GPU |
* ChatGPT’s models (GPT-4o, etc.) don’t publish exact parameter counts, so this is a felt comparison based on benchmarks. Even at the same parameter count, quality varies a lot with the quality and volume of training data and with tuning.
The relationship between VRAM and model size is simple. A model’s parameters have to sit in the GPU’s VRAM, and if there isn’t enough, that model won’t run. For example, 8GB of VRAM handles up to an 8B model, 16GB up to 14B, and 24GB up to 32B. In other words, the amount of VRAM = the ceiling on model size = the ceiling on how smart your AI can be.
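As a rough illustration of that ceiling, here is a small lookup built only from the VRAM tiers this article uses (8GB, 12GB, 16GB, 24GB, 32GB). The names `VRAM_TIERS` and `largest_model_class` are made up for the sketch, and the tiers are the article's rules of thumb, not hard limits.

```python
from bisect import bisect_right
from typing import Optional

# (minimum VRAM in GB, largest model class), taken from this article's tiers.
VRAM_TIERS = [
    (8, "8B"),
    (12, "8B-12B"),
    (16, "14B"),
    (24, "32B"),
    (32, "32B + long context"),
]


def largest_model_class(vram_gb: float) -> Optional[str]:
    """Return the largest model class the article lists for this much VRAM."""
    thresholds = [minimum for minimum, _ in VRAM_TIERS]
    # Index of the highest tier whose minimum VRAM is <= vram_gb.
    idx = bisect_right(thresholds, vram_gb) - 1
    return VRAM_TIERS[idx][1] if idx >= 0 else None


for vram in (8, 12, 16, 24):
    print(f"{vram}GB -> {largest_model_class(vram)}")
```

A card that falls between tiers (say 20GB) is treated as the tier below it, which is the conservative reading.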
Here are numbers I measured myself.
| GPU | Model | Generation speed | VRAM used |
|---|---|---|---|
| RTX 3090 24GB + RTX 3060 12GB | qwen3.5:27b | ★ 25.5 tok/s | 18.2GB (split across 2 cards) |
| RTX 3090 24GB | qwen3:8b | ★ 126.4 tok/s | 10.3GB |
| RTX 3060 12GB | qwen3:8b | ★ 60.1 tok/s | 5.5GB |
★ = author-measured values (RTX 3090 / RTX 3060, April 2026). Others are estimates from the estimation formula.
My PC has an RTX 3090 and an RTX 3060 in it. On the RTX 3090 (24GB), an 8B model screams along at 126 tok/s. Even the RTX 3060 (12GB) runs an 8B comfortably at 60 tok/s. A 27B model slows down on the 3090 alone for lack of VRAM, but split across two cards it runs at a practical 25.5 tok/s. The VRAM gap maps directly onto how smart a model you can use.
When choosing a GPU, put “how much VRAM does it have" first.
Here’s a table of what you can run and how it performs, by VRAM.
| VRAM | Runnable models | Typical models | Speed (approx.) | GPU price range |
|---|---|---|---|---|
| 8GB | 8B | Qwen 3 8B, Llama 3.1 8B, Gemma 3 4B | 60–130 tok/s | ¥60k–70k |
| 12GB | 8B–12B | Gemma 3 12B, Qwen 3 8B (with room) | 35–130 tok/s | ¥50k–80k |
| 16GB | 14B | Qwen 3 14B, DeepSeek-R1 14B, Gemma 3 12B | 23–72 tok/s | ¥80k–160k |
| 24GB | 32B | Qwen 3 32B, Gemma 3 27B, DeepSeek-R1 32B | 20–35 tok/s | ¥180k–250k |
| 32GB | 32B + long context | Qwen 3 32B (32K context) | 50–60 tok/s | ¥400k+ |
★ = author-measured values (RTX 3090 / RTX 3060, April 2026). Others are estimates from the estimation formula.
How to read this table: as VRAM climbs 8GB → 16GB → 24GB, the size (= smarts) of the models you can run steps up. If you want practical everyday quality, 16GB (a 14B model) is the minimum line.
Measured: generation speed by model
How to read this chart: a longer bar means faster generation (= more comfortable). gemma4 is the fastest, but for output quality qwen3.5:27b is the best. Speed and smarts are a trade-off.
[kimono_bar title="" unit="tok/s" color="#1e90ff"]
qwen3.5:27b (3090+3060)|26
qwen3.5:9b (3060)|98.8
qwen3:8b (3090)|127
gemma4:9b (3090)|133
[/kimono_bar]
* Test setup: RTX 3090 (24GB) + RTX 3060 12GB / Linux / Ollama / measured April 2026. The 27b model used a 2-GPU split load.
How much VRAM do you need?
| What you want to do | VRAM needed | Model | Speed |
|---|---|---|---|
| Just try out AI | 8GB | 8B (uses 5–6GB) | 60–130 tok/s |
| Use it for practical everyday work | 16GB | 14B (uses 10–11GB) | 23–72 tok/s |
| Rely on it seriously for work | 24GB | 32B (uses 22GB) | 20–35 tok/s |
| The works (AI + VR + image gen) | 32GB | 32B + long context | 50–60 tok/s |
* tok/s = tokens generated per second. At 20 tok/s it’s “a slight wait, but readable”; at 40+ tok/s it “comes back instantly.”
By GPU brand: which is easiest to get running?
After VRAM, the next thing that matters is “will it actually run on that GPU?" The amount of setup effort varies quite a bit by GPU brand.
| GPU brand | Setup | Windows | Mac | Linux |
|---|---|---|---|---|
| NVIDIA (CUDA) | Just install the driver | ◎ | – | ◎ |
| AMD | Good on Linux. On Windows, AMD’s AI compute stack (ROCm) is still incomplete, so setup takes effort | △ | – | ○ |
| Apple Silicon | Just install Ollama. Shared memory lets you run large models too | – | ◎ | – |
| Intel (iGPU) | Limited support, and on the slow side | △ | – | △ |
The easiest are NVIDIA (Windows/Linux) and Apple Silicon (Mac).
If you’re on Windows or Linux like me, you can’t go wrong choosing an NVIDIA GPU. Just install the driver and Ollama auto-detects it.
AMD’s appeal is that you can buy the same VRAM cheaper than NVIDIA, but on Windows the software stack for AI (ROCm) is still incomplete and takes fiddling to set up. It’s not yet “install the driver and it works" the way NVIDIA’s CUDA is. If you’re prepared to run Linux, the value for money is unbeatable.
* ROCm = AMD’s software stack for running AI on its GPUs, equivalent to CUDA on NVIDIA. NVIDIA’s CUDA has years of proven stability, while AMD’s ROCm is still maturing and support is limited, especially on Windows.
For Mac users, Apple Silicon’s unified memory is a surprising strength. With 24GB or more, you can run 32B-class models. Speed lags a dedicated NVIDIA GPU, but “a 32B running on a laptop" is a pretty interesting experience.
Getting started: pick from four apps
There are several apps for running local LLMs. I use Ollama, but the best choice is whatever suits you.
Local LLM apps compared
| App | What it’s like | Best for | OS |
|---|---|---|---|
| LM Studio | Everything from model search to chat in a GUI. The most approachable | First-timers | Win/Mac/Linux |
| Ollama + Open WebUI | Set up from the command line; add a browser UI with Open WebUI | People who want to build their own setup | Win/Mac/Linux |
| Jan | Privacy-focused. A self-contained desktop app | People who want it simple | Win/Mac/Linux |
| GPT4All | Lightweight. Few settings, so nothing to get lost in | People who just want a quick try | Win/Mac/Linux |
My personal take: LM Studio to start, Ollama + Open WebUI once you’re in deep.
With LM Studio, you can search, download, and chat with a model right after installing, so if you’re not used to the terminal it’s the easier way in.
I chose Ollama for the nimbleness of switching between models from the command line and for its extensibility, which suit my taste. Day to day, I chat with it from a terminal app.
How to get started with Ollama (for reference)
- Download the installer from ollama.com
- Install it (Windows / Mac / Linux)
- Type `ollama run qwen3:8b` in the terminal
- Chat begins
On my setup, it auto-detected the GPU right after install and just worked. I never had to fuss with detailed settings.
On my machine (RTX 3090), qwen3:8b generates at about 126 tok/s. It feels like “the reply starts the instant I hit enter." On the RTX 3060 it’s 60 tok/s — the bandwidth gap shows up directly as speed, but it still feels plenty comfortable.
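If you want a tok/s figure for your own machine, one way is to ask a running Ollama server through its local HTTP API and divide the generated token count by the generation time. This is a minimal sketch, not necessarily how the numbers in this article were measured: it assumes Ollama is listening on its default port 11434, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from a token count and a duration in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)


def measure(model: str, prompt: str) -> float:
    """Send one non-streaming request to Ollama and return its generation tok/s."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # eval_count = tokens generated, eval_duration = time spent generating (ns).
    return tokens_per_second(body["eval_count"], body["eval_duration"])
```

With the server running and the model already pulled, `measure("qwen3:8b", "Explain VRAM in one paragraph.")` returns a single float. Run it a few times, since the first call also pays for loading the model.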
What changes across Windows, Mac, and Linux?
The experience differs quite a bit by OS, so let me lay it out.
| OS | Pros | Cons | Best for |
|---|---|---|---|
| Windows | With NVIDIA, setup is the easiest. Plenty of GUI apps like LM Studio too | Slightly more VRAM overhead than Linux. AMD GPUs take effort to set up | NVIDIA GPU owners who want an easy start |
| Mac | Apple Silicon’s unified memory runs large models. Power-efficient | Slower generation than a dedicated GPU. Pricey hardware | People whose main machine is a Mac; people who want portability |
| Linux | The most memory-efficient. AMD’s AI stack (ROCm) runs stably on Linux too. Easy to run with Docker | Requires technical know-how to set up | AMD GPU owners; people who want to run it server-style |
For beginners or first-timers, my suggestions are:
Windows users → NVIDIA GPU
Mac users → lean on Apple Silicon
Linux users → AMD GPUs come into play too
That’s roughly how it shakes out.
I run mine on Linux with an RTX 3090 + RTX 3060 in tandem. I can run Ollama (chat AI) on one and ComfyUI (image generation) on the other at the same time, and I’m quite fond of this setup.
The used-GPU option
New isn’t the only option. My RTX 3090 was bought at launch for about ¥300k at list price; my secondary RTX 3060 12GB was about ¥40k used.
The best values on the used market are:
| GPU | VRAM | Used price (shops) | Runs | Notes |
|---|---|---|---|---|
| RTX 3060 12GB | 12GB | ¥20k–35k | 8B models | Cheapest entry point. 12GB for around ¥20k |
| RTX 4060 Ti 16GB | 16GB | ¥70k–100k | 14B models | A hidden gem. 16GB at half the new price |
| RTX 3090 24GB | 24GB | ¥130k–200k | 32B models | Staying high on AI demand |
Note: the RTX 30 series is a generation where many cards were run hard during the mining boom. That said, the RTX 3060 12GB shipped with a mining limiter (LHR) from the start, and its 12GB of VRAM wasn’t needed for mining, so heavily-abused units are relatively rare. The RTX 3080/3090, by contrast, were popular for mining, so take more care. I’d recommend buying from a used shop with a warranty.
By budget: what your GPU can do
From here I’ll break down, by concrete budget tier, which GPU runs what. As noted above, the top criterion is “how many GB of VRAM," and the next is “is it NVIDIA?" I’ve organized this around new-card prices; if you’re also considering used, see the comparison table above.
¥60k–70k tier (RTX 5060 / RTX 5060 Ti 8GB)
What you can do with 8GB of VRAM:
| Task | Doable? | How it feels |
|---|---|---|
| Everyday Q&A (weather, cooking, small talk) | ◎ | Plenty practical |
| Simple coding help | ○ | OK for short snippets |
| Proofreading text | ○ | Decent even on an 8B |
| Summarizing long text (papers, minutes) | △ | Short context (2K–4K tokens) |
| Complex reasoning / analysis | △ | The limit of an 8B model |
| Translation | ○ | OK for simple sentences |
Runnable models:
| Model | VRAM used | Speed (approx.) | Quality |
|---|---|---|---|
| Qwen 3 8B | ~5.2GB | 65 tok/s | Decent |
| Llama 3.1 8B | ~6.2GB | 56 tok/s | Better in English |
| Gemma 3 4B | ~3.6GB | 112 tok/s | Basic |
★ = author-measured (RTX 3090 / RTX 3060, April 2026). Others are estimates from the estimation formula, using the RTX 5060 Ti 8GB (448 GB/s) as the representative GPU.
Enough to experience “so this is what AI is like.” But quality is only so-so, and it tends to lose the thread in long conversations. Ideal for trying things out, but too shaky to rely on for work.
Value: ★★★☆☆ (fine for trying it out)
¥90k–110k tier (RTX 5060 Ti 16GB / RX 9070)
What you can do with 16GB of VRAM:
| Task | Doable? | How it feels |
|---|---|---|
| Everyday Q&A | ◎ | Comfortable |
| Coding help (moderate) | ◎ | Practical at the function level |
| Proofreading and rewriting | ◎ | A 14B’s language is quite good |
| Summarizing long text | ○ | Up to 8K–16K tokens |
| Drafting emails | ◎ | Practical |
| Technical Q&A | ○ | As deep as a 14B gets |
| Drafting fiction or blog posts | ○ | Usable as a first draft |
Runnable models:
| Model | VRAM used | Speed (approx.) | Quality | Notes |
|---|---|---|---|---|
| Qwen 3 14B | ~10.7GB | 36–72 tok/s | Good | A notch better in language. Personally, “usable" starts here |
| Gemma 3 12B | ~12.4GB | 27–54 tok/s | Good | Google’s 12B. A balanced pick |
| DeepSeek-R1-Distill 14B | ~11GB | 31–61 tok/s | Fairly good | Strong at reasoning (thinks before answering) |
★ = author-measured (RTX 3090 / RTX 3060, April 2026). Others are estimates from the estimation formula, estimated across the bandwidth range from RTX 5060 Ti 16GB (448 GB/s) to RTX 5070 Ti (896 GB/s).
This is the entrance to practical use. A 14B feels clearly smarter than an 8B: naturalness of language, grasp of the question, and accuracy of summaries are on another level. This is the line where you start thinking, “maybe I can drop the paid ChatGPT subscription and get by with this.”
That said, the RTX 5060 Ti 16GB has a 128-bit bus, so token generation is slower than higher-end GPUs. Think of it as “a smart friend who talks a little slowly."
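Why bus width matters can be shown with a common rule of thumb: generating each token means reading roughly the whole model from VRAM once, so memory bandwidth divided by model size gives a ceiling on tok/s. The sketch below is that rule of thumb only, not necessarily the estimation formula used for the tables here, and the `efficiency` parameter is a hypothetical fudge factor.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float,
                          efficiency: float = 1.0) -> float:
    """Rough ceiling on generation speed when memory bandwidth is the bottleneck.

    Assumes every generated token requires reading all model weights once.
    `efficiency` (0-1) is a hypothetical allowance for real-world losses.
    """
    return bandwidth_gb_s / model_size_gb * efficiency


# Qwen 3 14B (~10.7GB) at the two bandwidths quoted in the table note above.
for name, bandwidth in [("RTX 5060 Ti 16GB", 448), ("RTX 5070 Ti", 896)]:
    ceiling = max_tokens_per_second(bandwidth, 10.7)
    print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

Real speeds land below these ceilings, but the ratio between two cards tends to track their bandwidth ratio, which is why doubling the bandwidth roughly doubles the speed at the same VRAM size.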
The AMD RX 9070 (16GB / about ¥80k) is the cheapest per gigabyte of VRAM, but AMD’s AI stack isn’t as mature as NVIDIA’s. On Windows, setup can take an extra step.
Value: ★★★★☆ (the best-balanced entry to practical use)
¥160k tier (RTX 5070 Ti 16GB)
What you can do with 16GB of VRAM (fast):
You can do the same things as the 16GB tier, but the speed is different.
| Comparison | RTX 5060 Ti 16GB | RTX 5070 Ti 16GB |
|---|---|---|
| Qwen 3 14B speed | ~23 tok/s | ~72 tok/s |
| How it feels | “A slight wait" | “Comes back instantly" |
| Doubling as AI image gen | A bit slow | Comfortable |
| Doubling as VR | Entry level | Comfortable |
The most comfortable of the 16GB options. If you also want VR or AI image generation, the ¥60k premium over the 5060 Ti is well worth it. “Overkill for local AI alone, ideal if you’re doubling up with other uses."
Value: ★★★★☆ (best if you’re doubling up)
¥120k–300k tier (RX 7900 XTX 24GB / RTX 5080 16GB)
This is where “serious local AI" begins.
What you can do with 24GB of VRAM (RX 7900 XTX):
| Task | Doable? | How it feels |
|---|---|---|
| Everything above | ◎ | Comfortable |
| 32B models (Qwen 3 32B, etc.) | ◎ | Surprisingly “smarter than expected" |
| Analyzing / summarizing long text | ◎ | 16K–32K tokens is practical |
| Cross-document analysis | ○ | Doable, but slower |
| Coding help (whole files) | ◎ | A 32B’s code comprehension is high |
| Specialized Q&A | ◎ | Solid accuracy on medicine, law, tech, and more |
Runnable models:
| Model | VRAM used | Speed (approx.) | Quality | Notes |
|---|---|---|---|---|
| Qwen 3 32B | ~22.2GB | 32 tok/s | Very good | The “this runs locally?" level |
| Gemma 3 27B | ~22.5GB | 41 tok/s | Very good | Google’s large model |
| DeepSeek-R1-Distill 32B | ~22GB | 32 tok/s | Good | Deep reasoning chains |
★ = author-measured (RTX 3090 / RTX 3060, April 2026). Others are estimates from the estimation formula, using RTX 3090 (936 GB/s) bandwidth for the estimate.
A 32B model changes everything. Up to 14B it was “AI-ish, but, well, about what you’d expect”; a 32B brings the “wait, this is running locally?” surprise. Language quality, reasoning depth, and context retention are on another level.
The RX 7900 XTX (24GB / about ¥120k–150k) blows past NVIDIA on price per gigabyte of VRAM, but running AI stably calls for a Linux environment. On Windows, be ready for some configuration.
The RTX 5080 (16GB / about ¥190k–300k) is top-class in speed, but with only 16GB of VRAM it can’t run 32B models. “A fast 14B" or “a VRAM-rich 32B" — this is the biggest fork in the road.
Value: ★★★★★ (the best value tier if you’re serious about local AI)
¥400k–610k tier (RTX 5090 32GB)
What you can do with 32GB of VRAM:
Run 32B models comfortably at very long context (32K+ tokens). Even 32GB isn’t enough for 70B models (which need 45GB+).
Overkill to buy purely for local AI. It’s for the “the works" crowd who want to do VR (120Hz max settings) + AI image generation (FLUX Dev) + local LLM (32B) all on one card.
Value: ★★☆☆☆ (makes sense for the works, too expensive for AI alone)
Value-for-money charts by GPU
Local LLM value ranking
How to read this chart: a longer bar means higher performance for the price — better value. The value metric is “practical performance score ÷ price (in ¥10k units)."
[kimono_bar title="" color="#1e90ff"]
RTX 5090 32GB [New]|2.4
RTX 4090 24GB [Used]|2.9
RX 7900XTX 24GB [New]|3.3
RTX 5080 16GB [New]|3.5
RTX 4080S 16GB [Used]|4.2
RTX 5070Ti 16GB [New]|4.5
RTX 5060Ti 16GB [New]|4.7
RTX 4070TiS 16GB [Used]|5
RX 9070 16GB [New]|5
RTX 4060Ti 16GB [Used]|3.8
RTX 4070S 12GB [Used]|5.6
RTX 5060Ti 8GB [New]|5.7
RTX 5060 8GB [New]|5.8
RTX 3090 24GB [Used]|5.8
RTX 5070 12GB [New]|6
RTX 3080 12GB [Used]|7.5
RTX 3060 12GB [Used]|8
[/kimono_bar]
* How the practical performance score (out of 100) is computed: 50% for the ceiling model size you can run (VRAM-dependent), 30% for generation speed (bandwidth-dependent), and 20% for context-length headroom (VRAM-headroom-dependent), weighted and summed. Dividing that score by GPU price (in ¥10k units) gives the value metric. The bigger the number, the more performance per ¥10k.
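The weighting described above can be written out as follows. The article doesn't say how each of the three sub-scores is derived, so the inputs here are hypothetical placeholders (each out of 100) for an imaginary ¥100k card; only the 50/30/20 weights and the division by price in ¥10k units come from the text.

```python
def practical_score(model_size_score: float, speed_score: float,
                    context_score: float) -> float:
    """Weighted sum (out of 100) using the 50/30/20 weights described above."""
    return 0.5 * model_size_score + 0.3 * speed_score + 0.2 * context_score


def value_metric(score: float, price_yen: float) -> float:
    """Practical performance score divided by price in units of ¥10k."""
    return score / (price_yen / 10_000)


# Hypothetical sub-scores for an imaginary ¥100k card (not a real GPU).
score = practical_score(model_size_score=60, speed_score=40, context_score=50)
print(score, round(value_metric(score, 100_000), 1))
```

Because the price sits in the denominator, an expensive card needs a proportionally higher score just to hold its value metric steady.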
What the chart tells us
- The highest scores overall go to used 12GB cards (RTX 3060 12GB at 8.0, RTX 3080 12GB at 7.5), but 12GB limits them to 8B–12B models
- Among new 16GB cards, the RX 9070 (16GB / ¥80k) is the best value (5.0). But AMD’s AI stack (ROCm) is Linux-recommended, and on Windows setup takes effort
- Among new NVIDIA 16GB cards, the RTX 5060 Ti 16GB (¥90k–110k) is the value champion (4.7). It is the cheapest way to get the 16GB of VRAM needed to run a 14B model practically
- The RTX 5080 (¥190k–300k) and RTX 5090 (¥400k–610k) are poor value. Performance is high but so is the price, so the metric comes out low. They’re for people with budget to spare, or who double up with non-AI uses (VR, gaming)
- The RTX 5090 in particular is for “the works.” Too expensive to buy for LLMs alone, but it makes sense if you’re combining VR + image gen + LLM
Model value ranking (quality per VRAM)
How to read this chart: a longer bar means “higher quality for less VRAM" — better value. The metric is “quality (5-point scale) ÷ required VRAM (GB) × 10." The ★ 14B models (Qwen 3 14B, DeepSeek-R1 14B) are the practical line. Models above them have higher quality but need more VRAM, so their value metric drops.
[kimono_bar title="" color="#1e90ff"]
Gemma 3 27B (22.5GB)|2
Qwen 3 32B (22.2GB)|2
Gemma 3 12B (12.4GB)|2.8
★ DeepSeek-R1 14B (11.0GB)|3.2
★ Qwen 3 14B (10.7GB)|3.7
Llama 3.1 8B (6.2GB)|4
Gemma 3 4B (3.6GB)|5.6
Qwen 3 8B (5.2GB)|5.8
[/kimono_bar]
* The 14B models need about 10–11GB of VRAM and score 4.0/5.0 on quality. Models of 8B and under need less VRAM, so their value metric looks high, but their quality is “so-so." Consider the absolute quality level, not just the value metric. Personally, I feel 14B and up is the minimum line for practical quality.
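The model value metric is simple enough to reproduce by hand. A minimal sketch, using the Qwen 3 14B figures given in this article (quality 4.0 out of 5, about 10.7GB of VRAM); the function name is just for illustration.

```python
def model_value(quality_out_of_5: float, vram_gb: float) -> float:
    """Quality (5-point scale) divided by required VRAM (GB), times 10."""
    return quality_out_of_5 / vram_gb * 10


# Qwen 3 14B: quality 4.0, ~10.7GB of VRAM.
print(round(model_value(4.0, 10.7), 1))  # 3.7, the value shown in the chart
```

Since VRAM is in the denominator, small models score well almost regardless of quality, which is why the metric should be read alongside the absolute quality level.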
Tokens and text length, roughly
Throughout the article I use the unit “tok/s” (tokens per second). A token is a chunk of text: very roughly, a word or a few characters. At 126 tok/s the text arrives far faster than anyone can read.
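To turn tok/s into waiting time, divide the length of the reply by the speed. The 500-token reply below is an arbitrary example length, not a figure from the article.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time in seconds to generate num_tokens at a steady tokens_per_second."""
    return num_tokens / tokens_per_second


# A hypothetical 500-token reply at three speeds mentioned in the article.
for speed in (20, 40, 126):
    wait = seconds_to_generate(500, speed)
    print(f"{speed} tok/s: {wait:.1f} s for a 500-token reply")
```

Because the text streams as it is generated, you start reading right away; the total only matters when you need the whole answer before moving on.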
Value quick-reference
| Budget | GPU | Runs | Quality | Recommendation |
|---|---|---|---|---|
| ¥60k–70k | RTX 5060 Ti 8GB | 8B | So-so | Try it out |
| ¥90k | RTX 5060 Ti 16GB | 14B | Good | Best entry |
| ¥80k | RX 9070 16GB | 14B | Good | For Linux users |
| ¥160k | RTX 5070 Ti 16GB | 14B (fast) | Good | Best for doubling up |
| ¥120k–150k | RX 7900 XTX 24GB | 32B | Very good | For serious AI |
| ¥200k | RTX 5080 16GB | 14B (fastest) | Good | Speed-focused |
| ¥400k+ | RTX 5090 32GB | 32B (with room) | Very good | The works |
So which should you actually buy?
If you just want to try it: install LM Studio or Ollama on the PC you already have. It runs on CPU and main memory alone, even without a GPU. Speed drops to roughly a tenth to a twentieth of a GPU, but it’s fast enough to read along as the text appears. That’s plenty to experience “so this is local AI.” If it leaves you wanting faster, smarter models, look at a GPU then. There’s no rush; that order works fine.
* For reference: running an 8B model CPU-only (no GPU) on my PC gave about 8 tok/s (AMD Ryzen 9 3950X / 64GB DDR4 / a CPU released in 2019). That’s about 1/16 of the 126 tok/s with a GPU. In CPU mode the model loads into main memory, so it won’t run if you don’t have enough. An 8B model uses about 5–6GB, so with OS overhead you want at least 16GB of RAM, ideally 32GB+. My PC has a generous 64GB, so there was room to spare, but 8GB PCs may struggle. CPU inference speed also depends on memory bandwidth, so a newer PC with DDR5 should be a bit faster.
Runs even when VRAM is short: partial offload
LLMs have a trait that image-generation AI doesn’t. Image generation (Stable Diffusion, etc.) needs the whole model in VRAM to run, but an LLM can put just part of the model on the GPU and keep the rest in main memory — “partial offload."
For example, you can run a 27B model (which normally needs 18GB+ of VRAM) on a single 12GB GPU. It’s slower, but “better than not running at all" is an option you have.
| Load method | On GPU | VRAM used | Speed |
|---|---|---|---|
| All layers on GPU (2-card split) | All 64 layers | 26.2GB | 25.5 tok/s |
| 30 GPU layers + main memory | 30 of 64 layers | 11.8GB | 2.9 tok/s |
| 15 GPU layers + main memory | 15 of 64 layers | 7.0GB | 2.1 tok/s |
| CPU only | 0 layers | 0GB | 1.7 tok/s |
* Measured on qwen3.5:27b. GPU: RTX 3090 + RTX 3060 / CPU: Ryzen 9 3950X / 64GB DDR4. Measured April 2026.
Getting even half onto the GPU makes it faster than CPU-only (1.7 tok/s) — 2–3 tok/s. But compared with the whole thing on the GPU (25.5 tok/s), it drops to about a tenth, so it’s hard to call comfortable.
Because of this, even when your VRAM is “just barely short," you don’t have to give up on a model over its size. If you can tolerate the speed, you can attempt models beyond your VRAM ceiling. Ollama automatically pushes whatever doesn’t fit in VRAM to main memory, so no special configuration is needed.
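If you would rather set the split yourself than leave it to Ollama, its API accepts an option for the number of layers to place on the GPU. The sketch below only builds the request body. It assumes the option is named `num_gpu`, which is my understanding of Ollama's options and not something this article states; the article also doesn't say how the layer counts in the table were set. The helper name is hypothetical.

```python
import json
from typing import Optional


def build_generate_request(model: str, prompt: str,
                           gpu_layers: Optional[int] = None) -> dict:
    """Build a request body for Ollama's /api/generate endpoint.

    gpu_layers=None leaves the GPU/CPU split to Ollama. An integer is passed
    as the `num_gpu` option (assumed: number of layers placed on the GPU).
    """
    body = {"model": model, "prompt": prompt, "stream": False}
    if gpu_layers is not None:
        body["options"] = {"num_gpu": gpu_layers}
    return body


# 30 of 64 layers on the GPU, as in the second row of the table above.
print(json.dumps(build_generate_request("qwen3.5:27b", "Hello", gpu_layers=30), indent=2))
```

Setting `gpu_layers=0` would correspond to the CPU-only row, which makes it easy to compare the two on your own hardware.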
To start with a 14B model: the RTX 5060 Ti 16GB (about ¥90k–110k). It’s the most affordable 16GB NVIDIA in this range. But it’s in short supply and prices are climbing, so grab one while stock lasts.
To run a 32B model: the RX 7900 XTX (24GB / about ¥120k+) is the most realistic on price. Getting 24GB of VRAM for ¥120k is a steal as of April 2026. Note that AMD’s AI stack (ROCm) is Linux-recommended. If you want Windows, the NVIDIA route to 32B is a used RTX 3090 (24GB / ¥130k–200k). The RTX 5080 (16GB / about ¥190k+) is faster, but at 16GB it tops out at 14B models.
Related articles
- Starting local AI with a used GPU: the value of the RTX 30/40 series
- The full GPU spec list for local AI (2026)
Next steps
Once you’re running local AI, here’s what else you can do:
- AI image generation: run ComfyUI on the same GPU to make images from text
- Voice AI: transcribe with Whisper, read aloud with TTS
- Coding help: use a local LLM as a Copilot replacement via the Continue extension in VS Code
- Combine with VR: talk to AI avatars, put AI to work inside VR spaces
As a foundation for “bridging the virtual world and reality," a home GPU is the most versatile investment there is.
The specs and prices in this article are as of April 2026.