Dual-GPU Local AI: Running RTX 3090 + RTX 3060 Together

2026年7月5日

I run local AI on a PC with two GPUs. I pair an LLM for chat in Ollama with image generation in ComfyUI, or fire up several LLMs at once. In this article I’ve organized, using measured data, whether there’s really any benefit to running two GPUs.

The conclusion: two GPUs are absolutely worth it. But the expectation that “VRAM becomes 24+12 = 36GB" is wrong. The real value of a dual-GPU setup lies in “running tasks simultaneously" and “Ollama’s automatic splitting." And when I measured parallel processing, the results far exceeded what I expected.

All the numbers in this article are measured on my own setup (RTX 3090 + RTX 3060 12GB, Linux, Ollama), as of April 2026.

1. What you can do with two GPUs
2. VRAM isn’t unified, but Ollama splits automatically
3. Measured data: baseline performance
4. The stunning parallel-test results
5. The parallel-processing trade-off: context length
- 5.1. qwen3:8b on the RTX 3090: prompt length vs speed
- 5.2. The multi-turn conversation case
6. What happens when you hit the context limit
- 6.1. The 27B model: concurrency vs context length
7. Chart: concurrency vs total throughput
8. Cost comparison: local AI vs cloud AI
- 8.1. Annual cost comparison (8-person use)
9. Setup steps
10. Summary
- 10.1. GPUs used in this article

What you can do with two GPUs

With two GPUs installed, there are broadly three usable patterns.

Pattern 1: Separating tasks
Run an 8B chat model in Ollama on GPU0 (RTX 3060) while running ComfyUI image generation on GPU1 (RTX 3090) at the same time. This is my everyday usage. Being able to chat while waiting on an image makes work far more efficient.

Pattern 2: Running models simultaneously
Spin up an 8B model on GPU0 (RTX 3060) and a 27B model on GPU1 (RTX 3090) at once. Throw light questions at the 8B and involved consultations at the 27B. Task-appropriate use is self-contained on a single PC.

Pattern 3: Splitting a large model across GPUs
Ollama automatically splits a model that won’t fit in 24GB across the two GPUs. I explain this in detail in the next section.

One important point. VRAM is not unified. The RTX 3090’s 24GB and the RTX 3060’s 12GB do not combine into 36GB. Each GPU has its own independent memory space. There are cases where connecting with NVLink lets you share a VRAM pool, but NVLink isn’t available for the RTX 3090 + RTX 3060 combination.

VRAM isn’t unified, but Ollama splits automatically

If VRAM isn’t unified, does that mean a model over 24GB won’t run? The answer is “with Ollama, it will."

For example, qwen3.5:27b needs about 17.4GB of VRAM for the whole model. It fits in the RTX 3090’s 24GB, but not in the RTX 3060’s 12GB. So what happens when you load it in a dual-GPU environment?

Ollama automatically distributes the model’s layers across multiple GPUs. For qwen3.5:27b, it split-loaded as 17.9GB on the RTX 3090 and 8.4GB on the RTX 3060 — 26.2GB total. You don’t specify this in a config; Ollama decides automatically by looking at the free VRAM.

Thanks to this mechanism, even a huge model that won’t fit on one GPU will run if you split it across two. That said, since data has to move between the two GPUs, there’s overhead compared to fitting everything on one card. Just how much overhead? The next benchmark shows it.

Hands-on: Ollama’s automatic splitting really is “automatic." As long as both GPUs are recognized, you don’t need to configure any split. Just type ollama run qwen3.5:27b and it loads with the optimal distribution on its own.

Measured data: baseline performance

My PC has an RTX 3090 and an RTX 3060 installed.

Here’s the baseline performance data for each GPU and model.

GPU	Model	VRAM used	Generation speed (tok/s)	Cold start	System power draw
RTX 3090	gemma4	21.6GB	133.0	11.0s	~299W
RTX 3090	qwen3.5:27b	18.2GB (split)	25.5	–	~258W
RTX 3090	qwen3:8b	10.3GB	126.4	2.1s	~295W
RTX 3060	qwen3:8b	5.5GB	60.1	2.1s	~170W
RTX 3060	qwen3.5:9b	7.9GB	46.6	7.2s	~337W
Simultaneous	3060:8B + 3090:27B	8.4+18.2GB	119+25.5 tok/s	–	~374W

★ = author-measured (RTX 3090 / RTX 3060, April 2026). For how estimates are calculated, see the full GPU spec list.

A few things stand out.

qwen3:8b runs at 126.4 tok/s on the RTX 3090 and 60.1 tok/s on the RTX 3060. The bandwidth difference (936 vs 360 GB/s) shows up in the speed. Even the RTX 3060’s 60 tok/s is plenty comfortable, but the RTX 3090 is twice as fast. With two GPUs, an efficient setup puts the main work on the RTX 3090 and runs subtasks on the RTX 3060.

gemma4’s 133.0 tok/s feels “instant." On my setup I barely notice any practical response lag. As a rough feel for tok/s: 15 or below is slow (you wait), 30 is comfortable, and 40+ comes back right away. At 133 tok/s, the text starts pouring in the moment you finish typing.

When running simultaneously, the 27B model’s speed (25.5 tok/s) is unchanged from running it alone. The 8B side dips slightly from 126 to 119 tok/s, but you don’t notice it. In other words, running two completely different models at once causes almost no practical performance loss.

The stunning parallel-test results

Method: Using curl, I sent N simultaneous requests to the Ollama API (localhost:11434/api/generate) and computed tok/s from each response’s eval_count / eval_duration. I used the median of three runs. The prompt was a fixed text (about 300 Japanese characters, num_predict=256).

Here’s the heart of the article. I measured “what happens when you throw multiple requests at a single model at once."

Parallel test of qwen3:8b on the RTX 3090

Concurrency	tok/s per request	Total throughput (tok/s)	Slowdown
1	126.4	126	baseline
8	125.8	1,006	-0.5%
16	127.2	2,035	+0.6%
32	125.7	4,021	-0.6%
64	125.5	8,034	-0.7%
128	125.6	16,081	-0.6%

★ Measured on the RTX 3090 (24GB). Baseline is the single measured value, 126.4 tok/s. Measured April 2026.

This result surprised me. Even at 128-way concurrency, the per-request speed barely drops. Total throughput grows from 126 tok/s to 16,081 tok/s — a 127x increase. It’s near-perfect linear scaling.

Parallel test of qwen3.5:27b (split across RTX 3090 + RTX 3060)

Concurrency	tok/s per request	Total throughput (tok/s)	Slowdown
1	25.5	26	baseline
4	26.1	105	none
8	26.2	209	none

★ Measured with a 2-GPU split across the RTX 3090 + RTX 3060. Measured April 2026.

The 27B model shows the same trend. No slowdown even at 8-way concurrency. Even split across two GPUs, the parallel-processing scaling doesn’t break down.

Why does this happen?

This “concurrency goes up but speed doesn’t fall" behavior makes sense once you understand how LLM inference works.

The model weights (parameters) are loaded onto the GPU just once. 10.3GB of model data is the same 10.3GB whether it’s 1 request or 128. What grows is only each request’s KV cache (the memory that holds the conversation context), and for short conversations that stays around a few MB per request.

The bottleneck in LLM inference is not the GPU’s compute power but memory bandwidth. The speed of reading the model weights from VRAM is the bottleneck. Even with multiple requests, the weights only need to be read once, so as concurrency rises, the per-request cost barely increases.

Hands-on: This result means “even if eight family members use local AI at once, it feels the same as one person using it." In my house, family members really do access it simultaneously through Open WebUI, and no one complains about speed.

The parallel-processing trade-off: context length

I wrote that “raising concurrency doesn’t slow things down," but there is a factor that does slow it down. Not concurrency — the length of the prompt (input text).

qwen3:8b on the RTX 3090: prompt length vs speed

Prompt length	Approx. Japanese chars	Generation speed (tok/s)	Change
57 tok	~60 chars	127.8	baseline
381 tok	~400 chars	126.3	-1.2%
1,821 tok	~1,800 chars	119.7	-6.3%
3,621 tok	~3,600 chars	115.8	-9.4%
7,221 tok	~7,200 chars	108.2	-15.3%
18,021 tok	~18,000 chars	91.1	-28.7%

★ Measured on the RTX 3090 (24GB). Measured April 2026.

The longer the prompt, the larger the KV cache and the more data to process, so speed falls. A short ~60-character question runs at 127.8 tok/s, but feeding in 18,000 characters (about 45 manuscript pages) drops it to 91.1 tok/s.

Even so, 91 tok/s with 18,000 characters of input is plenty comfortable. By the feel-based benchmark of 30 tok/s being “comfortable" and 40+ “comes back right away," even a very long context poses no practical problem.

A note on the relationship between tokens and Japanese characters. Measured with qwen3:8b, Japanese came out at roughly 1–1.2 characters per token. In English one token is about 4 characters, but Japanese consumes more tokens per character. The “~60 chars" and “~1,800 chars" figures in this article are based on that conversion.

The multi-turn conversation case

In real chat, it’s not a single long input but short exchanges piling up over and over. In a 15-turn conversation (4,557 tokens), the slowdown was just -0.6%.

This is thanks to Ollama’s KV-cache reuse. Rather than reprocessing the entire past history from scratch each time, it reuses cached results, keeping overhead small even as turns accumulate.

Hands-on: For ordinary chatting, you won’t feel any slowdown even over about 15 turns. Speed only becomes noticeable when you paste an entire long document and ask for a summary.

What happens when you hit the context limit

Ollama has a per-model context limit. What happens when you hit it is surprisingly little-known.

The session doesn’t stop. No error appears. When you exceed the limit, Ollama silently drops old conversation history. Without warning, it deletes from the start of the conversation onward. From the user’s side, it surfaces as “hmm, it doesn’t remember what we just talked about."

The context limit is effectively decided by free VRAM. What’s left after subtracting the model’s own VRAM usage is what’s available for the KV cache. And that’s where the trade-off with parallel processing appears.

The 27B model: concurrency vs context length

For qwen3.5:27b (26.2GB total when loaded, about 10GB free for the KV cache):

Concurrency	Context limit per request	Approx. Japanese chars
1	~24,000 tok	~24,000 chars
4	~6,000 tok	~6,000 chars
8	~3,000 tok	~3,000 chars

As concurrency rises, each request shares the KV cache, so the usable context length per request shrinks. Used by one person, you can hold ~24,000 characters (about 60 manuscript pages) of context; used by eight at once, it drops to about 3,000 characters per person.

3,000 characters is roughly “10 back-and-forths of a quick Q&A." Plenty for everyday chat, but short for summarizing long documents or referencing a lot of prior conversation.

Note: Ollama gives no warning when you exceed the context limit. If you’re in a long conversation and feel “the AI’s replies have gotten strange," it may have hit the context limit and dropped old history. Starting a new session is the reliable fix.

Chart: concurrency vs total throughput

How to read this chart: the longer the bar, the greater the total throughput (the total tokens all requests generate per second). If the bars grow proportionally as concurrency rises, it means near-perfect linear scaling. 16,081 tok/s at 128-way concurrency is striking.

★ Measured with qwen3:8b on the RTX 3090 (24GB). Measured April 2026.

The 27B model (qwen3.5:27b, split across RTX 3090 + 3060) shows the same trend: 209 tok/s total at 8-way concurrency, with zero per-request slowdown.

The figure “16,081 tok/s at 128-way concurrency" is a benchmark maximum; 128 people chatting at once is hard to imagine for home use. But 8–16 people using it simultaneously is realistic, and under 1% slowdown across that range shows it can comfortably handle small-team use.

Cost comparison: local AI vs cloud AI

Finally, a cost comparison. I’ll verify with numbers whether investing in a GPU really makes sense, based on prices as of April 2026.

Configuration	Upfront cost	Monthly cost	Concurrent users	Model quality
ChatGPT Plus × 1 person	¥0	¥3,000	1	GPT-5 class
ChatGPT Plus × 8 people	¥0	¥24,000	8	GPT-5 class
Used RTX 3090 + used RTX 3060	~¥170k	Electricity ~¥5,000	8 (27B) / 128 (8B)	27B or 8B

Annual cost comparison (8-person use)

ChatGPT Plus × 8 people: ¥24,000/month × 12 months = ¥288,000/year
Local AI (dual-GPU): ¥170k upfront + electricity ¥5,000/month × 12 months = ¥230k the first year, ¥60k per year thereafter
Difference: ¥58k saved in year one, ¥228k per year from year two

Of course, there’s a quality gap between GPT-5-class and a local 27B model. If you need cutting-edge reasoning or coding help, ChatGPT Plus or Claude Pro is well worth paying for.

Local AI shows its true value in cases like these.

Privacy matters: you don’t want to send internal documents or personal data outside
Many users: cloud subscriptions grow with each family or team member, while local costs nothing extra no matter how many people use it
Offline use: it works reliably even when the network is flaky
Customization: you’re free to choose or fine-tune models

Hands-on: In my house, I bought the RTX 3090 new at its ¥300,000 list price and added an RTX 3060 12GB used for about ¥40,000. If you were buying the RTX 3090 used today, it’d be around ¥130k–180k. Combined with a used RTX 3060 12GB at about ¥40k, ¥170k–220k upfront is a realistic estimate.

Setup steps

Here’s a concise summary of how to actually build a dual-GPU setup. It assumes Linux (Ubuntu) as the OS, but the basic ideas are the same on Windows.

1. Configure GPU assignment (CUDA_VISIBLE_DEVICES)

Which GPU Ollama uses is controlled by the environment variable CUDA_VISIBLE_DEVICES. Specifying it in the systemd service file is the reliable way.

# Edit Ollama's systemd service file
sudo systemctl edit ollama

# Add the following
[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"

To have Ollama recognize both GPU0 and GPU1, specify 0,1. If you want to pin ComfyUI to a specific GPU, specify it in ComfyUI’s launch script like CUDA_VISIBLE_DEVICES=1.

# Example: dedicate ComfyUI to GPU1 (RTX 3090)
CUDA_VISIBLE_DEVICES=1 python main.py --listen

2. Share on the LAN with Open WebUI

To let family or team members access it from a browser, Open WebUI is handy.

# Launch with Docker (expose port 3000 on the LAN)
docker run -d
  --name open-webui
  -p 3000:8080
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434
  --add-host=host.docker.internal:host-gateway
  --restart always
  ghcr.io/open-webui/open-webui:main

From another PC or phone on the LAN, access http://<server-IP-address>:3000 to use local AI in a ChatGPT-like UI. You can create per-user accounts, so chat history is managed separately for each person.

3. Confirm it’s working

# Check GPU recognition
nvidia-smi

# List Ollama models
ollama list

# Test a model
ollama run qwen3:8b "Hello"

# Currently loaded models and GPU usage
ollama ps

If nvidia-smi shows both GPUs and ollama ps shows the model loaded on the intended GPU, setup is complete.

Summary

Here’s what the measured dual-GPU data reveals.

Three benefits of a dual-GPU setup:

More VRAM: Ollama’s automatic splitting runs large models (27B) that won’t fit on one card
Task separation: run chat and image generation simultaneously on different GPUs without interference
Parallel processing: under 1% slowdown even at 128-way concurrency. Your whole family or team can use it at once

Points to watch:

VRAM is not unified (24+12 does not become 36GB)
Raising concurrency shortens the per-person context length
When the context limit is hit, Ollama drops old conversation without warning

The parallel performance far exceeded my expectations. A throughput of 16,081 tok/s at 128-way concurrency has an overwhelming cost advantage over the metered billing of cloud APIs. The image of “local AI as a single-person tool" is, on this data, completely overturned.

A used RTX 3090 runs ¥130k–180k, a used RTX 3060 12GB about ¥40k. For ¥170k–220k total, you get an environment where eight or more people can chat with AI at once. Compared with a year of cloud billing, it pays for itself within the first year.

GPUs used in this article

[kimono_product id="15761″]

[kimono_product id="15759″]

Related:

Local LLM benchmark comparison (in preparation)
ComfyUI setup guide (in preparation)

▶ Go deeper on local AI (related)

GPUs & Gear,Local AI