Running a Local AI Chatbot at Home: A Budget-by-Budget Guide

July 5, 2026

I run two GPUs in my main PC and use generative AI locally as well. I use ChatGPT and Claude too, but when I have them summarize work documents, I increasingly catch myself wondering, “Is it really OK to send this outside?” And the monthly fees slowly add up.

So I set out to map, budget by budget, just how far you can take an AI chatbot on nothing but a home GPU.

* This article focuses on consumer GPUs that fit in an ordinary desktop PC (NVIDIA GeForce / AMD Radeon series). It doesn’t cover server/data-center GPUs like the NVIDIA A100 or H100 (40–80GB VRAM, ¥1M+). That’s why the VRAM ceiling here stops at 32GB.

1. Cloud AI vs. local AI
2. Local AI comes down to VRAM
- 2.1. What do “27B” and “8B” mean?
- 2.2. Measured: generation speed by model
3. By GPU brand: which is easiest to get running?
4. Getting started: pick from four apps
- 4.1. How to get started with Ollama (for reference)
5. What changes across Windows, Mac, and Linux?
6. The used-GPU option
7. By budget: what your GPU can do
8. Value-for-money charts by GPU
9. Tokens and text length, roughly
10. Value quick-reference
11. So which should you actually buy?
- 11.1. Runs even when VRAM is short: partial offload
12. Related articles
13. Next steps

Cloud AI vs. local AI

	Cloud AI	Local AI
Privacy	Your conversations are sent to a server	Everything stays on your PC. Nothing leaves
Monthly cost	ChatGPT Plus ¥3,000/mo / Claude Pro ¥3,000/mo	¥0 (electricity only; ~50–150W while the GPU runs)
Upfront cost	¥0	GPU: ¥60k–400k
Total cost over one year	about ¥36,000	GPU + electricity ~¥3,000–6,000/yr
Internet	Required	Not needed (works offline)
Model smarts	The latest models like GPT-4o / Claude 3.5	8B–32B models (depending on your GPU’s VRAM)
Response speed	40–80 tok/s	20–130 tok/s (depending on GPU)

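* Break-even takes longer than a year. At ¥3,000 a month in fees against ¥3,000–6,000 a year in electricity, a 16GB GPU (about ¥90k) pays for itself in roughly three years against one subscription, or in around 16 months if it replaces both.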
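The payback period can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures from the table above; the function name break_even_months is made up for illustration, and it assumes the subscription is cancelled outright once the GPU arrives.

```python
import math

def break_even_months(gpu_price_yen, cloud_fee_per_month_yen, electricity_per_year_yen):
    """Months until the GPU purchase is offset by cancelled subscription fees."""
    monthly_saving = cloud_fee_per_month_yen - electricity_per_year_yen / 12
    if monthly_saving <= 0:
        return math.inf  # the GPU never pays for itself
    return gpu_price_yen / monthly_saving

# Figures from the table: a ¥90k 16GB GPU, ¥3,000/mo per subscription,
# and ¥3,000-6,000 a year in electricity.
for subscriptions in (1, 2):
    fee = 3_000 * subscriptions
    fastest = break_even_months(90_000, fee, 3_000)
    slowest = break_even_months(90_000, fee, 6_000)
    print(f"{subscriptions} subscription(s): {fastest:.1f} to {slowest:.1f} months")
```

The result is sensitive to how many subscriptions the GPU actually replaces, so it is worth plugging in your own numbers.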

For me the biggest thing is that conversations never leave the machine. Summarizing meeting notes, personal questions — being able to use it without a second thought is local AI’s real strength.

Local AI comes down to VRAM

The thing I really felt after trying local LLMs is that “what you can do” is decided almost entirely by how much VRAM (GPU memory) you have.

What determines a local LLM’s performance (listed from biggest impact to smallest)

1. VRAM capacity	Sets how large a model you can run (most important)
2. Memory bandwidth	Directly drives how fast text comes out
3. GPU compute	Surprisingly little difference
4. CPU / RAM	Secondary

If you don’t have enough VRAM, you simply can’t run a smart model. Conversely, as long as you have the VRAM, even middling GPU compute runs at practical speed.

What do “27B” and “8B” mean?

In local-AI articles you often see labels like “8B model” or “27B model.” The B stands for billion and gives the model’s parameter count: the “size of its brain,” so to speak. A bigger number generally means a smarter model, but it also eats more GPU memory (VRAM).

Comparing to AIs you already know makes it easier to picture.

Model size	Parameters	VRAM needed	A familiar comparison
2–4B	2–4 billion	~2–4GB	About the level of on-phone AI (Apple Intelligence, Gemini Nano). Can summarize text and handle simple exchanges, but weak on anything intricate
8B	8 billion	~5–6GB	On par with the free ChatGPT’s lightweight model (GPT-4o mini). Practical for everyday chat and simple questions
14B	14 billion	~10–11GB	The line where it starts to surpass the free ChatGPT (GPT-4o mini). Its language gets noticeably more natural. Personally, this is where it becomes genuinely usable
27–32B	27–32 billion	~17–22GB	Quality approaching ChatGPT Plus (GPT-4o class). The “wait, this runs locally?” level
70B+	70 billion+	45GB+	On par with ChatGPT Plus or better. But it won’t run on a single ordinary GPU

* ChatGPT’s models (GPT-4o, etc.) don’t publish exact parameter counts, so this is a felt comparison based on benchmarks. Even at the same parameter count, quality varies a lot with the quality and volume of training data and with tuning.

The relationship between VRAM and model size is simple. A model’s parameters have to sit in the GPU’s VRAM, and if there isn’t enough, that model won’t run. For example, 8GB of VRAM handles up to an 8B model, 16GB up to 14B, and 24GB up to 32B. In other words, the amount of VRAM = the ceiling on model size = the ceiling on how smart your AI can be.

Here are numbers I measured myself.

GPU	Model	Generation speed	VRAM used
RTX 3090 24GB + RTX 3060 12GB	qwen3.5:27b	★ 25.5 tok/s	18.2GB (split across 2 cards)
RTX 3090 24GB	qwen3:8b	★ 126.4 tok/s	10.3GB
RTX 3060 12GB	qwen3:8b	★ 60.1 tok/s	5.5GB

★ = author-measured values (RTX 3090 / RTX 3060, April 2026). Others are estimates from the estimation formula.

My PC has an RTX 3090 and an RTX 3060 in it. On the RTX 3090 (24GB), an 8B model screams along at 126 tok/s. Even the RTX 3060 (12GB) runs an 8B comfortably at 60 tok/s. A 27B model slows down on the 3090 alone for lack of VRAM, but split across the two cards it runs at a practical 25.5 tok/s. The VRAM gap maps directly onto “how smart a model you can use.”

When choosing a GPU, put “how much VRAM does it have?” first.

Here’s a table of what you can run and how it performs, by VRAM.

VRAM	Runnable models	Typical models	Speed (approx.)	GPU price range
8GB	8B	Qwen 3 8B, Llama 3.1 8B, Gemma 3 4B	60–130 tok/s	¥60k–70k
12GB	8B–12B	Gemma 3 12B, Qwen 3 8B (with room)	35–130 tok/s	¥50k–80k
16GB	14B	Qwen 3 14B, DeepSeek-R1 14B, Gemma 3 12B	23–72 tok/s	¥80k–160k
24GB	32B	Qwen 3 32B, Gemma 3 27B, DeepSeek-R1 32B	20–35 tok/s	¥180k–250k
32GB	32B + long context	Qwen 3 32B (32K context)	50–60 tok/s	¥400k+

★ = author-measured values (RTX 3090 / RTX 3060, April 2026). Others are estimates from the estimation formula.

How to read this table: as VRAM climbs 8GB → 16GB → 24GB, the size (= smarts) of the models you can run steps up. If you want practical everyday quality, 16GB (a 14B model) is the minimum line.
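The table above boils down to a simple lookup. Here is a minimal sketch of it; the tier list is copied from the table, while the names VRAM_TIERS and largest_model_class are made up for illustration.

```python
# (minimum VRAM in GB, model class the table lists for that tier)
VRAM_TIERS = [
    (8, "8B"),
    (12, "8B-12B"),
    (16, "14B"),
    (24, "32B"),
    (32, "32B + long context"),
]

def largest_model_class(vram_gb):
    """Return the model class of the highest tier that fits, or None below 8GB."""
    best = None
    for min_vram, model_class in VRAM_TIERS:
        if vram_gb >= min_vram:
            best = model_class
    return best

for vram in (8, 12, 16, 24, 32):
    print(f"{vram}GB -> {largest_model_class(vram)}")
```

A card that falls between tiers, such as a 20GB one, simply gets the class of the tier below it.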

Measured: generation speed by model

How to read this chart: a longer bar means faster generation (= more comfortable). gemma4 is the fastest, but for output quality qwen3.5:27b is the best. Speed and smarts are a trade-off.

[kimono_bar title="" unit="tok/s" color="#1e90ff"]
qwen3.5:27b (3090+3060)|26
qwen3.5:9b (3060)|98.8
qwen3:8b (3090)|127
gemma4:9b (3090)|133
[/kimono_bar]

* Test setup: RTX 3090 (24GB) + RTX 3060 12GB / Linux / Ollama / measured April 2026. The 27b model used a 2-GPU split load.
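If you want to take comparable measurements on your own machine, one option is to read the timing fields that Ollama returns with each reply. This is a minimal sketch, not necessarily the method behind the figures above: it assumes an Ollama server running on its default local port (11434) and relies on the eval_count and eval_duration fields of a non-streaming /api/generate response. The helper names are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumption)

def tokens_per_second(response):
    """Generation speed from a non-streaming /api/generate response.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9

def measure(model, prompt):
    """Send one prompt to a locally running Ollama server and return tok/s."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))

# Example (needs a running Ollama server with the model already pulled):
# print(f"{measure('qwen3:8b', 'Explain VRAM in three sentences.'):.1f} tok/s")
```

Running the same prompt a few times and averaging smooths out the first-load delay, which is not part of generation speed.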

How much VRAM do you need?

What you want to do	VRAM needed	Model	Speed
Just try out AI	8GB	8B (uses 5–6GB)	60–130 tok/s
Use it for practical everyday work	16GB	14B (uses 10–11GB)	23–72 tok/s
Rely on it seriously for work	24GB	32B (uses 22GB)	20–35 tok/s
The works (AI + VR + image gen)	32GB	32B + long context	50–60 tok/s

* tok/s = tokens generated per second. At 20 tok/s it’s “a slight wait, but readable”; at 40+ tok/s it “comes back instantly.”

By GPU brand: which is easiest to get running?

After VRAM, the next thing that matters is “will it actually run on that GPU?” The amount of setup effort varies quite a bit by GPU brand.

GPU brand	Setup	Windows	Mac	Linux
NVIDIA (CUDA)	Just install the driver	◎	–	◎
AMD	Good on Linux. On Windows, AMD’s AI compute stack (ROCm) is still incomplete, so setup takes effort	△	–	○
Apple Silicon	Just install Ollama. Shared memory lets you run large models too	–	◎	–
Intel (iGPU)	Limited support, and on the slow side	△	–	△

The easiest are NVIDIA (Windows/Linux) and Apple Silicon (Mac).

If you’re on Windows or Linux like me, you can’t go wrong choosing an NVIDIA GPU. Just install the driver and Ollama auto-detects it.

AMD’s appeal is that you can buy the same VRAM cheaper than NVIDIA, but on Windows the software stack for AI (ROCm) is still incomplete and takes fiddling to set up. It’s not yet “install the driver and it works” the way NVIDIA’s CUDA is. If you’re prepared to run Linux, the value for money is unbeatable.

* ROCm = AMD’s software stack for running AI on its GPUs, equivalent to CUDA on NVIDIA. NVIDIA’s CUDA has years of proven stability, while AMD’s ROCm is still maturing and support is limited, especially on Windows.

For Mac users, Apple Silicon’s unified memory is a surprising strength. With 24GB or more, you can run 32B-class models. Speed lags a dedicated NVIDIA GPU, but “a 32B running on a laptop” is a pretty interesting experience.

Getting started: pick from four apps

There are several apps for running local LLMs. I use Ollama, but the best choice is whatever suits you.

Local LLM apps compared

App	What it’s like	Best for	OS
LM Studio	Everything from model search to chat in a GUI. The most approachable	First-timers	Win/Mac/Linux
Ollama + Open WebUI	Set up from the command line; add a browser UI with Open WebUI	People who want to build their own setup	Win/Mac/Linux
Jan	Privacy-focused. A self-contained desktop app	People who want it simple	Win/Mac/Linux
GPT4All	Lightweight. Few settings, so nothing to get lost in	People who just want a quick try	Win/Mac/Linux

My personal take: LM Studio to start, Ollama + Open WebUI once you’re in deep.

With LM Studio, you can search, download, and chat with a model right after installing, so if you’re not used to the terminal it’s the easier way in.

I chose Ollama for the nimbleness of switching between models from the command line and for its extensibility, which suit my taste. Day to day, I chat with it from a terminal app.

How to get started with Ollama (for reference)

1. Download the installer from ollama.com
2. Install it (Windows / Mac / Linux)
3. Type ollama run qwen3:8b in the terminal
4. Start chatting

On my setup, it auto-detected the GPU right after install and just worked. I never had to fuss with detailed settings.

On my machine (RTX 3090), qwen3:8b generates at about 126 tok/s. It feels like “the reply starts the instant I hit enter.” On the RTX 3060 it’s 60 tok/s. The bandwidth gap shows up directly as speed, but it still feels plenty comfortable.

What changes across Windows, Mac, and Linux?

The experience differs quite a bit by OS, so let me lay it out.

OS	Pros	Cons	Best for
Windows	With NVIDIA, setup is the easiest. Plenty of GUI apps like LM Studio too	Slightly more VRAM overhead than Linux. AMD GPUs take effort to set up	NVIDIA GPU owners who want an easy start
Mac	Apple Silicon’s unified memory runs large models. Power-efficient	Slower generation than a dedicated GPU. Pricey hardware	People whose main machine is a Mac; people who want portability
Linux	The most memory-efficient. AMD’s AI stack (ROCm) runs stably on Linux too. Easy to run with Docker	Requires technical know-how to set up	AMD GPU owners; people who want to run it server-style

For beginners or first-timers, my suggestions are:

Windows users → NVIDIA GPU

Mac users → lean on Apple Silicon

Linux users → AMD GPUs come into play too

That’s roughly how it shakes out.

I run mine on Linux with an RTX 3090 + RTX 3060 in tandem. I can run Ollama (chat AI) on one and ComfyUI (image generation) on the other at the same time, and I’m quite fond of this setup.

The used-GPU option

New isn’t the only option. I bought my RTX 3090 at launch for about ¥300k at list price; my secondary RTX 3060 12GB cost about ¥40k used.

The three best values on the used market are:

GPU	VRAM	Used price (shops)	Runs	Notes
RTX 3060 12GB	12GB	¥20k–35k	8B models	Cheapest entry point. 12GB for around ¥20k
RTX 4060 Ti 16GB	16GB	¥70k–100k	14B models	A hidden gem. 16GB at half the new price
RTX 3090 24GB	24GB	¥130k–200k	32B models	Staying high on AI demand

Note: the RTX 30 series is a generation where many cards were run hard during the mining boom. That said, the RTX 3060 12GB shipped with a mining limiter (LHR) from the start, and its 12GB of VRAM wasn’t needed for mining, so heavily-abused units are relatively rare. The RTX 3080/3090, by contrast, were popular for mining, so take more care. I’d recommend buying from a used shop with a warranty.

By budget: what your GPU can do

From here I’ll break down, by concrete budget tier, which GPU runs what. As noted above, the top criterion is “how many GB of VRAM,” and the next is “is it NVIDIA?” I’ve organized this around new-card prices; if you’re also considering used, see the comparison table above.

¥60k–70k tier (RTX 5060 / RTX 5060 Ti 8GB)

[kimono_product id="15770"]

What you can do with 8GB of VRAM:

Task	Doable?	How it feels
Everyday Q&A (weather, cooking, small talk)	◎	Plenty practical
Simple coding help	○	OK for short snippets
Proofreading text	○	Decent even on an 8B
Summarizing long text (papers, minutes)	△	Short context (2K–4K tokens)
Complex reasoning / analysis	△	The limit of an 8B model
Translation	○	OK for simple sentences

Runnable models:

Model	VRAM used	Speed (approx.)	Quality
Qwen 3 8B	~5.2GB	65 tok/s	Decent
Llama 3.1 8B	~6.2GB	56 tok/s	Better in English
Gemma 3 4B	~3.6GB	112 tok/s	Basic

* All speeds in this table are estimates from the estimation formula, using the RTX 5060 Ti 8GB (448 GB/s) as the representative GPU.

Enough to experience “so this is what AI is like.” But quality is only so-so, and it tends to lose the thread in long conversations. Ideal for trying things out, but too shaky to rely on for work.

Value: ★★★☆☆ (fine for trying it out)

¥90k–110k tier (RTX 5060 Ti 16GB / RX 9070)

[kimono_product id="15760"]

What you can do with 16GB of VRAM:

Task	Doable?	How it feels
Everyday Q&A	◎	Comfortable
Coding help (moderate)	◎	Practical at the function level
Proofreading and rewriting	◎	A 14B’s language is quite good
Summarizing long text	○	Up to 8K–16K tokens
Drafting emails	◎	Practical
Technical Q&A	○	As deep as a 14B gets
Drafting fiction or blog posts	○	Usable as a first draft

Runnable models:

Model	VRAM used	Speed (approx.)	Quality	Notes
Qwen 3 14B	~10.7GB	36–72 tok/s	Good	A notch better in language. Personally, “usable” starts here
Gemma 3 12B	~12.4GB	27–54 tok/s	Good	Google’s 12B. A balanced pick
DeepSeek-R1-Distill 14B	~11GB	31–61 tok/s	Fairly good	Strong at reasoning (thinks before answering)

* All speeds in this table are estimates from the estimation formula, spanning the bandwidth range from the RTX 5060 Ti 16GB (448 GB/s) to the RTX 5070 Ti (896 GB/s).
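The estimation formula itself isn’t spelled out in this article, but the reason bandwidth matters can be sketched with a common rule of thumb: generating each token requires reading roughly the whole model from VRAM once, so speed is capped at about bandwidth ÷ model size. The snippet below is that rule of thumb and nothing more. The efficiency factor of 0.86 is an assumption fitted so that the Qwen 3 14B row (~10.7GB) lands near its 36–72 tok/s range; it is not a published figure, and the other rows imply somewhat different factors.

```python
def estimate_tok_per_s(bandwidth_gb_s, model_gb, efficiency=0.86):
    """Bandwidth-bound estimate: one pass over the weights per generated token.

    efficiency is a fitted fudge factor (assumption), not a measured value.
    """
    return efficiency * bandwidth_gb_s / model_gb

# Bandwidth figures from the footnote above; 10.7GB is the Qwen 3 14B row.
for name, bandwidth in (("RTX 5060 Ti 16GB", 448), ("RTX 5070 Ti", 896)):
    print(f"{name}: {estimate_tok_per_s(bandwidth, 10.7):.0f} tok/s")
```

Because the estimate scales linearly with bandwidth, doubling the bandwidth doubles the speed for the same model, which is the pattern the table shows.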

This is the “entrance to practical use.” A 14B feels clearly smarter than an 8B: naturalness of language, grasp of the question, and accuracy of summaries are on another level. This is the line where you start thinking, “maybe I can drop the paid ChatGPT subscription and get by with this.”

That said, the RTX 5060 Ti 16GB has a 128-bit bus, so token generation is slower than on higher-end GPUs. Think of it as “a smart friend who talks a little slowly.”

The AMD RX 9070 (16GB / about ¥80k) is the cheapest per gigabyte of VRAM, but AMD’s AI stack isn’t as mature as NVIDIA’s. On Windows, setup can take an extra step.

Value: ★★★★☆ (the best-balanced entry to practical use)

¥160k tier (RTX 5070 Ti 16GB)

[kimono_product id="15762"]

What you can do with 16GB of VRAM (fast):

You can do the same things as in the ¥90k–110k tier, but at a different speed.

Comparison	RTX 5060 Ti 16GB	RTX 5070 Ti 16GB
Qwen 3 14B speed	~23 tok/s	~72 tok/s
How it feels	“A slight wait”	“Comes back instantly”
Doubling as AI image gen	A bit slow	Comfortable
Doubling as VR	Entry level	Comfortable

The most comfortable of the 16GB options. If you also want VR or AI image generation, the ¥60k premium over the 5060 Ti is well worth it. In short: overkill for local AI alone, ideal if you’re doubling up with other uses.

Value: ★★★★☆ (best if you’re doubling up)

¥120k–300k tier (RX 7900 XTX 24GB / RTX 5080 16GB)

[kimono_product id="15771"]
[kimono_product id="15763"]

This is where “serious local AI” begins.

What you can do with 24GB of VRAM (RX 7900 XTX):

Task	Doable?	How it feels
Everything above	◎	Comfortable
32B models (Qwen 3 32B, etc.)	◎	Smarter than you’d expect
Analyzing / summarizing long text	◎	16K–32K tokens is practical
Cross-document analysis	○	Doable, but slower
Coding help (whole files)	◎	A 32B’s code comprehension is high
Specialized Q&A	◎	Solid accuracy on medicine, law, tech, and more

Runnable models:

Model	VRAM used	Speed (approx.)	Quality	Notes
Qwen 3 32B	~22.2GB	32 tok/s	Very good	The “this runs locally?” level
Gemma 3 27B	~22.5GB	41 tok/s	Very good	Google’s large model
DeepSeek-R1-Distill 32B	~22GB	32 tok/s	Good	Deep reasoning chains

* All speeds in this table are estimates from the estimation formula, using the RTX 3090’s bandwidth (936 GB/s).

A 32B model changes everything. Up to 14B it was “AI-ish, but, well, about what you’d expect”; a 32B brings the “wait, this is running locally?” surprise. Language quality, reasoning depth, and context retention are on another level.

The RX 7900 XTX (24GB / about ¥120k–150k) blows past NVIDIA on price per gigabyte of VRAM, but running AI stably calls for a Linux environment. On Windows, be ready for some configuration.

The RTX 5080 (16GB / about ¥190k–300k) is top-class in speed, but with only 16GB of VRAM it can’t fit a 32B model. “A fast 14B” or “a VRAM-rich 32B”: this is the biggest fork in the road.

Value: ★★★★★ (the best value tier if you’re serious about local AI)

¥400k–610k tier (RTX 5090 32GB)

[kimono_product id="15772"]

What you can do with 32GB of VRAM:

Run 32B models comfortably at very long context (32K+ tokens). Even 32GB isn’t enough for 70B models (which need 45GB+).

Overkill to buy purely for local AI. It’s for the “the works” crowd who want VR (120Hz max settings) + AI image generation (FLUX Dev) + a local LLM (32B) all on one card.

Value: ★★☆☆☆ (makes sense for the works, too expensive for AI alone)

Value-for-money charts by GPU

Local LLM value ranking

How to read this chart: a longer bar means more performance for the price, that is, better value. The value metric is “practical performance score ÷ price (in ¥10k units).”

* How the practical performance score (out of 100) is computed: 50% for the ceiling model size you can run (VRAM-dependent), 30% for generation speed (bandwidth-dependent), and 20% for context-length headroom (VRAM-headroom-dependent), weighted and summed. Dividing that score by GPU price (in ¥10k units) gives the value metric. The bigger the number, the more performance per ¥10k.
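As a worked example of the weighting described above, here is a minimal sketch. The 50/30/20 weights and the division by price in ¥10k units come from the note; the function names and the sub-scores fed into them are hypothetical, since the per-GPU sub-scores are not listed here.

```python
def practical_score(model_size_score, speed_score, context_score):
    """Weighted sum out of 100, using the 50/30/20 weights from the note."""
    return 0.5 * model_size_score + 0.3 * speed_score + 0.2 * context_score

def value_metric(score, price_yen):
    """Practical performance score per ¥10k of GPU price."""
    return score / (price_yen / 10_000)

# Hypothetical sub-scores (0-100) for an imaginary ¥100k card; not real data.
score = practical_score(model_size_score=60, speed_score=40, context_score=50)
print(f"score {score:.1f}, value {value_metric(score, 100_000):.2f} per ¥10k")
```

Because model size carries half the weight, a cheap card with lots of VRAM can outrank a fast but VRAM-poor one on this metric.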

What the chart tells us

The RX 9070 (16GB / ¥80k) is the best value. But AMD’s AI stack (ROCm) is Linux-recommended, and on Windows setup takes effort
Among NVIDIA cards, the RTX 5060 Ti 16GB is the value champion. At about ¥90k–110k for 16GB of VRAM, it’s the cheapest way to run a 14B model practically
The RTX 5080 (¥190k–300k) and RTX 5090 (¥400k–610k) are poor value. Performance is high but so is the price, so the metric comes out low. They’re for people with budget to spare, or who double up with non-AI uses (VR, gaming)
The RTX 5090 in particular is for “the works.” Too expensive to buy for LLMs alone, but it makes sense if you’re combining VR + image gen + LLM

Model value ranking (quality per VRAM)

How to read this chart: a longer bar means “higher quality for less VRAM,” that is, better value. The metric is “quality (5-point scale) ÷ required VRAM (GB) × 10.” The ★ 14B models (Qwen 3 14B, DeepSeek-R1 14B) are the practical line. Models above them have higher quality but need more VRAM, so their value metric drops.

* The 14B models need about 10–11GB of VRAM and score 4.0/5.0 on quality. Models of 8B and under need less VRAM, so their value metric looks high, but their quality is only so-so. Consider the absolute quality level, not just the value metric. Personally, I feel 14B and up is the minimum line for practical quality.
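The model metric is simple enough to compute by hand, but a short sketch makes the trade-off concrete. The formula and the 14B figures (quality 4.0, about 10–11GB) are from the text above; the function name model_value is made up for illustration.

```python
def model_value(quality, vram_gb):
    """Quality (5-point scale) per GB of required VRAM, scaled by 10."""
    return quality / vram_gb * 10

# A 14B model: quality 4.0 at the low and high end of its VRAM range.
for vram_gb in (10, 11):
    print(f"{vram_gb}GB -> {model_value(4.0, vram_gb):.2f}")
```

Since VRAM sits in the denominator, a small model with middling quality can still post a high number, which is why the absolute quality level deserves a separate look.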

Tokens and text length, roughly

Throughout the article I use the unit “tok/s” (tokens per second). A token is a chunk of text: very roughly, on the order of a word or a few characters. Either way, at 126 tok/s the text comes down far faster than anyone can read.
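To get a feel for what these speeds mean in waiting time, here is a minimal sketch. The speeds are ones that appear in this article (8 tok/s on CPU, the 20 and 40 tok/s comfort thresholds, 126 tok/s on the RTX 3090); the 500-token reply length is a hypothetical figure chosen only for illustration.

```python
def seconds_to_generate(num_tokens, tok_per_s):
    """Time to produce num_tokens at a steady generation speed."""
    return num_tokens / tok_per_s

REPLY_TOKENS = 500  # hypothetical reply length, for illustration only

for speed in (8, 20, 40, 126):
    print(f"{speed:>3} tok/s: {seconds_to_generate(REPLY_TOKENS, speed):.1f} s")
```

This ignores the time spent reading the prompt before the first token appears, so real waits run a little longer.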

Value quick-reference

Budget	GPU	Runs	Quality	Recommendation
¥60k–70k	RTX 5060 Ti 8GB	8B	So-so	Try it out
¥90k	RTX 5060 Ti 16GB	14B	Good	Best entry
¥80k	RX 9070 16GB	14B	Good	For Linux users
¥160k	RTX 5070 Ti 16GB	14B (fast)	Good	Best for doubling up
¥120k–150k	RX 7900 XTX 24GB	32B	Very good	For serious AI
¥200k	RTX 5080 16GB	14B (fastest)	Good	Speed-focused
¥400k+	RTX 5090 32GB	32B (with room)	Very good	The works

So which should you actually buy?

If you just want to try it: install LM Studio or Ollama on the PC you already have. It runs on CPU and main memory alone, even without a GPU. Speed drops to roughly a tenth to a twentieth of a GPU’s, but with a small model it’s still fast enough to read along as the text appears. That’s plenty to experience “so this is local AI.” If it makes you want “faster, smarter models,” then look at a GPU. There’s no rush; that order works fine.

* For reference: running an 8B model CPU-only (no GPU) on my PC gave about 8 tok/s (AMD Ryzen 9 3950X / 64GB DDR4 / a CPU released in 2019). That’s about 1/16 of the 126 tok/s with a GPU. In CPU mode the model loads into main memory, so it won’t run if you don’t have enough. An 8B model uses about 5–6GB, so with OS overhead you want at least 16GB of RAM, ideally 32GB+. My PC has a generous 64GB, so there was room to spare, but 8GB PCs may struggle. CPU inference speed also depends on memory bandwidth, so a newer PC with DDR5 should be a bit faster.

Runs even when VRAM is short: partial offload

LLMs have a trait that image-generation AI doesn’t. Image generation (Stable Diffusion, etc.) needs the whole model in VRAM to run, but an LLM can put just part of the model on the GPU and keep the rest in main memory. This is called “partial offload.”

For example, you can run a 27B model (which normally needs 18GB+ of VRAM) on a single 12GB GPU. It’s slower, but “better than not running at all” is an option you have.

Load method	On GPU	VRAM used	Speed
All layers on GPU (2-card split)	All 64 layers	26.2GB	25.5 tok/s
30 GPU layers + main memory	30 of 64 layers	11.8GB	2.9 tok/s
15 GPU layers + main memory	15 of 64 layers	7.0GB	2.1 tok/s
CPU only	0 layers	0GB	1.7 tok/s

* Measured on qwen3.5:27b. GPU: RTX 3090 + RTX 3060 / CPU: Ryzen 9 3950X / 64GB DDR4. Measured April 2026.

Getting even half of the layers onto the GPU makes it faster than CPU-only: 2–3 tok/s versus 1.7 tok/s. But compared with the whole model on the GPU (25.5 tok/s), that is about a tenth of the speed, so it’s hard to call comfortable.

Because of this, even when your VRAM is “just barely short,” you don’t have to give up on a model over its size. If you can tolerate the speed, you can attempt models beyond your VRAM ceiling. Ollama automatically pushes whatever doesn’t fit in VRAM to main memory, so no special configuration is needed.
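If you would rather pick the split yourself than leave it to the automatic behavior, the two partial-offload rows in the table above give enough to estimate how many layers fit in a given amount of VRAM. This is a rough sketch under stated assumptions: it draws a straight line through the two measured points (15 layers at 7.0GB, 30 layers at 11.8GB), ignores how context length changes memory use, and applies only to this one model. The num_gpu key is, to my understanding, the Ollama request option for the number of layers placed on the GPU; treat that and the helper names as assumptions to check against the Ollama documentation.

```python
import math

TOTAL_LAYERS = 64
# Two measured points from the table above: (GPU layers, VRAM used in GB).
POINT_A = (15, 7.0)
POINT_B = (30, 11.8)

def layers_that_fit(vram_budget_gb):
    """Linear estimate of how many layers fit in a VRAM budget."""
    gb_per_layer = (POINT_B[1] - POINT_A[1]) / (POINT_B[0] - POINT_A[0])
    fixed_gb = POINT_A[1] - POINT_A[0] * gb_per_layer  # overhead with 0 layers
    layers = math.floor((vram_budget_gb - fixed_gb) / gb_per_layer)
    return max(0, min(TOTAL_LAYERS, layers))

def offload_options(vram_budget_gb):
    """Options dict for an Ollama request, capping the layers sent to the GPU."""
    return {"num_gpu": layers_that_fit(vram_budget_gb)}

for budget in (8, 12):
    print(f"{budget}GB budget -> {offload_options(budget)}")
```

Leave some headroom below the card’s full capacity, since the desktop and other apps also take a share of VRAM.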

To start with a 14B model: the RTX 5060 Ti 16GB (about ¥90k–110k). It’s the most affordable 16GB NVIDIA in this range. But it’s in short supply and prices are climbing, so grab one while stock lasts.

[kimono_product id="15760"]

To run a 32B model: the RX 7900 XTX (24GB / about ¥120k+) is the most realistic on price. Getting 24GB of VRAM for ¥120k is a steal as of April 2026. Note that AMD’s AI stack (ROCm) is Linux-recommended. If you want Windows, look at a used RTX 3090 (24GB / ¥130k–200k); NVIDIA’s RTX 5080 (about ¥190k+) is the fast alternative, but its 16GB of VRAM tops out at 14B-class models.

Next steps

Once you’re running local AI, here’s what else you can do:

AI image generation: run ComfyUI on the same GPU to make images from text
Voice AI: transcribe with Whisper, read aloud with TTS
Coding help: use a local LLM as a Copilot replacement via the Continue extension in VS Code
Combine with VR: talk to AI avatars, put AI to work inside VR spaces

As a foundation for “bridging the virtual world and reality,” a home GPU is the most versatile investment there is.

The specs and prices in this article are as of April 2026.
