One GPU for VR, AI and Image Generation? Picking by Use-Case Mix (2026)

July 5, 2026

I want one GPU to handle local LLMs, AI image generation, and VR all at once. On a dual-card setup with an RTX 3090 and an RTX 3060, I measured the VRAM use of each task and sorted out which GPU to recommend for each combination of uses.

The bottom line: the RTX 5070 Ti 16GB (about ¥160k) is the most realistic single-card choice for multiple uses.

The background knowledge for each use (how VRAM works, an intro to each task) is collected in separate articles.

Prerequisite reading for this article
・Local LLM basics → Running a local AI chatbot at home
・AI image generation basics → I want to generate AI images
・VR basics → Going full-body tracking in VRChat
・Full GPU spec comparison → The full GPU spec list

Contents

1. How VRAM use differs by task
2. Recommended GPUs by use-case combination
- 2.1. How to read this table
3. “Simultaneous use” vs “alternating use” changes the VRAM requirement
4. Concrete picks by combination
5. If you can’t narrow it to one card: split uses across two cards
- 5.1. Cases where dual cards work well
- 5.2. Cautions for dual cards
6. Summary: if you’re going single-card, these three
- 6.1. GPUs mentioned in this article

How VRAM use differs by task

To judge whether one card can cover everything, you need to know how each task uses VRAM.

Task	VRAM use pattern	Key point
Local LLM (Ollama, etc.)	The whole model stays resident in VRAM. Occupied the entire time you use it	Model size = required VRAM. 10–12GB for 14B, 20–23GB for 32B
AI image generation (ComfyUI, etc.)	Heavy use only during generation. Mostly freed when done	8–12GB for SDXL+ControlNet, 16–24GB for FLUX Dev
VR gaming	Frame buffer + textures. 8–12GB is enough	GPU compute power and encoder quality matter. VRAM rarely becomes the bottleneck
3D modeling (Blender)	Depends on scene complexity. 4–8GB for modeling, 12GB+ for Cycles rendering	OptiX ray tracing is NVIDIA-only
3D scanning / Gaussian Splatting	12GB+ recommended for training. Viewing is possible at 8GB	Many tools require CUDA

Deep dive: VRAM capacity isn’t the only thing that decides LLM inference speed

An LLM’s inference speed (token generation rate) depends more strongly on VRAM bandwidth (Memory Bandwidth) than on VRAM capacity. That’s because every time it generates one token, it has to read the model’s entire weights out of VRAM.

As a rough rule, “model size (GB) ÷ VRAM bandwidth (GB/s) = the minimum time per token (seconds).” For example, running a 14B model (about 8GB in Q4 quantization) on an RTX 5070 Ti (bandwidth 896GB/s) gives, in theory, 8 ÷ 896 ≈ 0.009 s/token, an upper limit of roughly 110 tokens per second. The RTX 3090 (936GB/s) has slightly more bandwidth on paper, but because of the generational gap in Tensor cores (3rd gen vs 5th gen), there are measured cases where the 5070 Ti comes out faster.

When a model “fits in VRAM but is slow,” bandwidth is the bottleneck in most cases.
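The rule of thumb above can be written down as a small calculation. This is only a sketch of the theoretical ceiling described in this section; the function name is made up for illustration, and the 8GB model size and bandwidth figures are the ones quoted in this article. Real throughput will be lower.

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Theoretical ceiling: each generated token needs one full read of the weights."""
    if model_size_gb <= 0 or bandwidth_gb_s <= 0:
        raise ValueError("model size and bandwidth must be positive")
    seconds_per_token = model_size_gb / bandwidth_gb_s
    return 1.0 / seconds_per_token


# 14B model at Q4 (about 8 GB), using the rated bandwidths listed in this article
for name, bandwidth in [("RTX 5070 Ti", 896), ("RTX 3090", 936), ("RTX 3060", 360)]:
    print(f"{name}: up to {max_tokens_per_second(8, bandwidth):.0f} tokens/s")
```

For the RTX 5070 Ti this gives 112 tokens per second, which is the "roughly 110" figure above. The calculation only looks at bandwidth, so it cannot show the Tensor-core generation gap.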

Deep dive: the division of labor between Tensor cores and CUDA cores

A GPU has two kinds of compute units.

CUDA cores are general-purpose compute units that handle all kinds of floating-point math, such as rendering VR games and 3D modeling. They’re the GPU’s “basic fitness.”

Tensor cores are dedicated units specialized for AI processing, accelerating matrix multiplication. Both LLM inference and image-generation diffusion are, at their core, large-scale matrix computation. Tensor cores handle this at tens of times the efficiency of ordinary CUDA cores.

The RTX 50 series carries 5th-gen Tensor cores (with FP4/FP8 support), a big improvement in AI-processing efficiency over the RTX 30 series’ 3rd gen. The reason the RTX 5090 and RTX 3090 differ by more than 2x in LLM inference speed is that, on top of the bandwidth difference, this generational gap in Tensor cores is at work.

VRAM usage by use case (rough guide)

Use case	VRAM
Local LLM 8B	6 GB
Local LLM 14B	10 GB
AI image gen SDXL	10 GB
AI image gen FLUX Dev	20 GB
VR gaming	10 GB
3D modeling (Blender)	8 GB
3D scan / Gaussian Splatting	12 GB

Figures are for one task at a time. If tasks run together, the values must be added up.

VRAM bandwidth by GPU

GPU	VRAM bandwidth
RTX 5090 32GB	1792 GB/s
RTX 5080 16GB	960 GB/s
RTX 5070 Ti 16GB	896 GB/s
RTX 5070 12GB	672 GB/s
RTX 5060 Ti 16GB	448 GB/s
RTX 3090 24GB	936 GB/s
RTX 3060 12GB	360 GB/s

Wider bandwidth means faster LLM inference and AI image generation. Rated values.

Recommended GPUs by use-case combination

Here’s the main event. By the combination of things you want to do, I’ve sorted out the GPU you need to cover them all on a single card.

What you want to do	Min VRAM	Recommended GPU	Why
LLM + image generation	16GB+	RTX 5070 Ti 16GB	Both eat VRAM. If you use a 14B model + SDXL alternately, 16GB is enough
LLM + VR	12GB is OK	RTX 5070 12GB	VR doesn’t eat much VRAM. You also rarely use LLM and VR at the same time
Image generation + VR	12GB+	RTX 5070 12GB	SDXL-centric is comfortable at 12GB. If you go as far as FLUX Dev, 16GB
LLM + image generation + VR	16GB+	RTX 5070 Ti 16GB	The realistic minimum line to cover all three uses on one card
Everything (the above + 3D + scanning)	24GB	RTX 5090 32GB / used RTX 3090 24GB	Including Cycles rendering + 3DGS training, you want 24GB

How to read this table

From the left column, find the combination of things you want to do. The VRAM figures assume “alternating use.” The case of using them at the same time is explained in the next section.

“Simultaneous use” vs “alternating use” changes the VRAM requirement

A GPU’s VRAM is one shared pool. If multiple tasks use VRAM at the same time, they contend for it.

Alternating use (close one before launching the other)

Since an app frees VRAM when it’s done, you only need enough for the single most VRAM-hungry task.

Example: 14B model in Ollama (10–12GB used) → close it → SDXL in ComfyUI (about 10GB used)
VRAM needed: 12GB (only the larger one matters)

Simultaneous use (both launched at once)

The VRAM requirements add up. When VRAM runs short, data spills over into main memory (system RAM). VRAM bandwidth is about 900GB/s versus roughly 50GB/s for system RAM, so speed drops to about 1/18. It “runs, but is useless.”

Example: 14B model in Ollama (10–12GB resident) + SDXL in ComfyUI (about 10GB)
VRAM needed: up to 22GB (summed) → 16GB isn’t enough

Note: Ollama keeps models resident in VRAM by default. When switching to another task, it’s handy to free the VRAM with ollama stop model-name, or set OLLAMA_KEEP_ALIVE=0 to free it automatically.
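If you would rather free the model from a script than from the command line, a request with keep_alive set to 0 should have the same effect. The sketch below assumes Ollama is listening on its default address (http://localhost:11434) and that its /api/generate endpoint accepts keep_alive as described in the Ollama documentation. The helper names and the model name in the usage comment are placeholders for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address; adjust if yours differs


def build_unload_request(model: str, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a POST request asking Ollama to unload `model` right away (keep_alive = 0)."""
    payload = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_model(model: str) -> None:
    """Send the unload request. Raises URLError if no Ollama server is reachable."""
    with urllib.request.urlopen(build_unload_request(model), timeout=10) as response:
        response.read()


# Usage (placeholder model name), e.g. right before launching ComfyUI:
# unload_model("your-14b-model")
```

Building the request separately from sending it makes it easy to inspect what will be sent without a running server.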

VRAM guide assuming simultaneous use

Combination	Alternating	Simultaneous
LLM (14B) + image generation (SDXL)	12GB	22GB
LLM (8B) + VR	8GB	12GB
LLM (14B) + VR	12GB	16GB
Image generation (SDXL) + VR	12GB	16GB
LLM (14B) + image generation + VR	16GB	28GB (unrealistic)

Using three or more at the same time isn’t realistic. Using them alternately, or splitting uses across two cards (below), is the safer bet.
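The alternating-versus-simultaneous rule in this section comes down to a maximum versus a sum. Here is a minimal sketch of that check. The function names are illustrative, and the 12GB and 10GB inputs are assumptions taken from the upper end of the 14B range and the SDXL figure quoted in this article.

```python
def vram_needed_gb(task_vram_gb: dict, simultaneous: bool) -> float:
    """Alternating use needs only the largest task; simultaneous use needs the sum."""
    if not task_vram_gb:
        return 0.0
    values = list(task_vram_gb.values())
    return float(sum(values)) if simultaneous else float(max(values))


def fits(card_vram_gb: float, task_vram_gb: dict, simultaneous: bool) -> bool:
    """True if the card has at least as much VRAM as the workload mix requires."""
    return vram_needed_gb(task_vram_gb, simultaneous) <= card_vram_gb


# Assumed per-task figures (GB): 14B LLM at the upper end of its range, SDXL
tasks = {"LLM 14B": 12, "SDXL": 10}

for mode, simultaneous in [("alternating", False), ("simultaneous", True)]:
    need = vram_needed_gb(tasks, simultaneous)
    print(f"{mode}: {need:.0f} GB needed, fits on 16 GB: {fits(16, tasks, simultaneous)}")
```

With these inputs, alternating use needs 12GB and simultaneous use needs 22GB, so a 16GB card covers only the alternating case. The sketch ignores overhead from the OS and display, so treat a result that only just fits as tight.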

Concrete picks by combination

Pattern A: LLM + image generation (no VR)

VRAM is the top priority. GPU compute power can be middling.

Under ¥100k: RTX 5060 Ti 16GB (about ¥90k). With a 128-bit bus and a modest 448GB/s bandwidth, image generation is about half the speed of the RTX 5070 Ti (896GB/s). But its 16GB of VRAM lets you use a 14B model + SDXL alternately. A capacity-over-speed choice
¥160k: RTX 5070 Ti 16GB. Both speed and VRAM. The practical best balance
Keeping costs down: used RTX 3090 24GB (about ¥130–180k). 24GB of VRAM and 936GB/s of bandwidth are still strong today. But at 350W (50W more than the RTX 5070 Ti’s 300W), its AI-processing power efficiency is about 60% of the RTX 50 generation. You’ll need to factor in electricity cost and heat management

Pattern B: LLM + VR (image generation now and then)

VR needs GPU compute power and an NVENC encoder. LLMs need VRAM. Meeting both requires a mid-range card or better.

¥100k: RTX 5070 12GB. Comfortable VR at 90Hz + an 8B model for everyday use. Image generation is fine too, as long as it’s SDXL
¥160k: RTX 5070 Ti 16GB. VR at 90Hz with room to spare + a 14B model for everyday use. Image generation is comfortable too

Pattern C: wanting to do everything (LLM + image generation + VR + 3D)

To cover it all on a single card, you have to decide where to compromise.

GPU	Price range	What it can and can’t do
RTX 5070 Ti 16GB	~¥160k	14B model, SDXL, VR at 90Hz, mid-size Blender scenes. FLUX Dev and 32B models are rough
RTX 5080 16GB	~¥200k	The above + VR at 120Hz, large Blender scenes. VRAM is the same 16GB as the 5070 Ti, so the LLM ceiling doesn’t change
RTX 5090 32GB	~¥400k and up (official price; street price is spiking to around ¥600k)	32B model, FLUX Dev, VR at 120Hz, large-scale Cycles rendering. Everything on one card, but expensive

Hands-on: I run a dual-card setup with an RTX 3090 (24GB) and an RTX 3060 (12GB). Trying to do everything on one card, you start wanting 24GB or more — and in the current generation, the only 24GB+ option is the RTX 5090 (32GB, about ¥400k). If that’s tough on the budget, consider splitting across two cards.

If you can’t narrow it to one card: split uses across two cards

“Everything on one card” is the ideal, but given the reality of budget and VRAM, there are cases where splitting across two cards is more sensible.

Cases where dual cards work well

A used RTX 3090 (24GB) for LLMs + an RTX 5070 (12GB) for VR/image generation
A large-VRAM card dedicated to LLMs + a single card handling everything else
Set the CUDA_VISIBLE_DEVICES environment variable when launching Ollama to pin it to one GPU and prevent VRAM contention
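Pinning Ollama to one card can be sketched like this. It assumes the ollama binary is on your PATH and that you start the server yourself with ollama serve. If Ollama already runs as a background service, the variable has to be set on that service instead. The helper names are illustrative, and the GPU index in the usage comment is an assumption you should check with nvidia-smi -L.

```python
import os
import subprocess


def env_for_gpu(gpu_index: int, base_env=None) -> dict:
    """Return a copy of the environment that exposes only one GPU to CUDA programs."""
    env = dict(os.environ if base_env is None else base_env)
    # PCI_BUS_ID ordering is intended to make CUDA indices match nvidia-smi's listing
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def start_ollama_on_gpu(gpu_index: int) -> subprocess.Popen:
    """Start `ollama serve` so that it can only see the chosen GPU."""
    return subprocess.Popen(["ollama", "serve"], env=env_for_gpu(gpu_index))


# Usage (assumes index 0 is the large-VRAM card on your machine):
# server = start_ollama_on_gpu(0)
```

The other card then stays free for VR or ComfyUI, since the Ollama process cannot allocate VRAM on a GPU it does not see.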

Deep dive: GPU-to-GPU communication with dual cards — PCIe vs NVLink

When you split-load a model across two GPUs in Ollama, the data-transfer speed between the GPUs affects inference speed.

PCIe 4.0 x16 has a one-way bandwidth of about 32GB/s. On a typical desktop dual-card setup, this is the ceiling. For a 14B-class model the practical impact is small, but split a 70B+ model across two cards and GPU-to-GPU communication becomes the bottleneck, dropping token generation speed by 30–50% in some cases.

NVLink is a dedicated interconnect that directly links GPUs, with bandwidth several to a dozen-plus times that of PCIe. But among consumer GPUs, the RTX 3090 was the last generation to support NVLink; it was dropped in the RTX 40/50 series.

For a consumer dual-card setup, “split GPUs by task” is the most efficient way to use them. Splitting a single model across two cards is practical up to about 14B, but beyond that you have to accept the speed drop.

Cautions for dual cards

Check power-supply capacity (850W+ recommended), the physical PCIe slot layout, and heat dissipation
Ollama can split-load a model across two GPUs (running a large model on the summed VRAM)

For those who want the details
・How to choose a used GPU and what to watch for → Starting local AI with a used GPU
・Concrete dual-card setup steps → Getting the most out of local AI with two GPUs

Summary: if you’re going single-card, these three

Budget	GPU	VRAM bandwidth	TDP	Suited combination
¥100k	RTX 5070 12GB	672 GB/s	250W	LLM (8B) + VR + image generation (SDXL)
¥160k	RTX 5070 Ti 16GB	896 GB/s	300W	LLM (14B) + VR + image generation (SDXL/FLUX Schnell)
¥400k+	RTX 5090 32GB	1,792 GB/s	575W	Covers every use with no compromise

The ¥160k RTX 5070 Ti is the most realistic balance for handling multiple uses on a single card. When that runs short, then consider dual cards — that order is also the most cost-sensible.
