Dual-GPU Local AI: Running RTX 3090 + RTX 3060 Together
I run local AI on a PC with two GPUs. I pair an LLM for chat in Ollama with image generation in ComfyUI, or fire up several LLMs at once. In this article I use measured data to work out whether running two GPUs really pays off.
The conclusion: two GPUs are absolutely worth it. But the expectation that “VRAM becomes 24+12 = 36GB” is wrong. The real value of a dual-GPU setup lies in “running tasks simultaneously” and “Ollama’s automatic splitting.” And when I measured parallel processing, the results far exceeded what I expected.
All the numbers in this article are measured on my own setup (RTX 3090 + RTX 3060 12GB, Linux, Ollama), as of April 2026.
- 1. What you can do with two GPUs
- 2. VRAM isn’t unified, but Ollama splits automatically
- 3. Measured data: baseline performance
- 4. The stunning parallel-test results
- 5. The parallel-processing trade-off: context length
- 6. What happens when you hit the context limit
- 7. Chart: concurrency vs total throughput
- 8. Cost comparison: local AI vs cloud AI
- 9. Setup steps
- 10. Summary
What you can do with two GPUs
With two GPUs installed, there are broadly three usable patterns.
Pattern 1: Separating tasks
Run an 8B chat model in Ollama on GPU0 (RTX 3060) while running ComfyUI image generation on GPU1 (RTX 3090) at the same time. This is my everyday usage. Being able to chat while waiting on an image makes work far more efficient.
Pattern 2: Running models simultaneously
Spin up an 8B model on GPU0 (RTX 3060) and a 27B model on GPU1 (RTX 3090) at once. Send quick questions to the 8B and more involved ones to the 27B, so a single PC covers both kinds of task.
Pattern 3: Splitting a large model across GPUs
Ollama automatically splits a model that won’t fit in 24GB across the two GPUs. I explain this in detail in the next section.
One important point: VRAM is not unified. The RTX 3090’s 24GB and the RTX 3060’s 12GB do not combine into 36GB; each GPU has its own independent memory space. NVLink can let some GPU pairs share a VRAM pool, but it isn’t available for the RTX 3090 + RTX 3060 combination.
VRAM isn’t unified, but Ollama splits automatically
If VRAM isn’t unified, does that mean a model over 24GB won’t run? The answer is “with Ollama, it will.”
For example, qwen3.5:27b needs about 17.4GB of VRAM for the whole model. It fits in the RTX 3090’s 24GB, but not in the RTX 3060’s 12GB. So what happens when you load it in a dual-GPU environment?
Ollama automatically distributes the model’s layers across multiple GPUs. For qwen3.5:27b, it split-loaded as 17.9GB on the RTX 3090 and 8.4GB on the RTX 3060, about 26.2GB in total. You don’t specify this in a config; Ollama decides automatically by looking at the free VRAM.
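The idea of dividing layers according to free VRAM can be sketched as follows. This is only an illustration, not Ollama's actual placement logic: `split_layers` is a hypothetical helper, the 64-layer count is a made-up round number, and 24 and 12 are simply the two cards' VRAM sizes in GB.

```python
def split_layers(n_layers: int, free_gb: list[float]) -> list[int]:
    """Assign layers to GPUs in proportion to each GPU's free VRAM."""
    total_free = sum(free_gb)
    # Round down first, then hand the leftover layers to the roomiest GPUs.
    counts = [int(n_layers * free / total_free) for free in free_gb]
    leftover = n_layers - sum(counts)
    by_free = sorted(range(len(free_gb)), key=lambda i: free_gb[i], reverse=True)
    for gpu in by_free[:leftover]:
        counts[gpu] += 1
    return counts


# Hypothetical 64-layer model on a 24GB card and a 12GB card.
print(split_layers(64, [24, 12]))  # [43, 21]
```

A plain 2:1 split of the 26.2GB total would be roughly 17.5GB and 8.7GB, which is in the same neighborhood as the measured 17.9GB and 8.4GB.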
Thanks to this mechanism, even a huge model that won’t fit on one GPU will run if you split it across two. That said, since data has to move between the two GPUs, there’s overhead compared to fitting everything on one card. Just how much overhead? The next benchmark shows it.
Just run ollama run qwen3.5:27b and it loads with the optimal distribution on its own.
Measured data: baseline performance
My PC has an RTX 3090 and an RTX 3060 installed.
Here’s the baseline performance data for each GPU and model.
| GPU | Model | VRAM used | Generation speed (tok/s) | Cold start | System power draw |
|---|---|---|---|---|---|
| RTX 3090 | gemma4 | 21.6GB | 133.0 | 11.0s | ~299W |
| RTX 3090 | qwen3.5:27b | 18.2GB (split) | 25.5 | – | ~258W |
| RTX 3090 | qwen3:8b | 10.3GB | 126.4 | 2.1s | ~295W |
| RTX 3060 | qwen3:8b | 5.5GB | 60.1 | 2.1s | ~170W |
| RTX 3060 | qwen3.5:9b | 7.9GB | 46.6 | 7.2s | ~337W |
| Simultaneous | 3060:8B + 3090:27B | 8.4+18.2GB | 119+25.5 tok/s | – | ~374W |
★ = author-measured (RTX 3090 / RTX 3060, April 2026). For how estimates are calculated, see the full GPU spec list.
A few things stand out.
qwen3:8b runs at 126.4 tok/s on the RTX 3090 and 60.1 tok/s on the RTX 3060. The bandwidth difference (936 vs 360 GB/s) shows up in the speed. Even the RTX 3060’s 60 tok/s is plenty comfortable, but the RTX 3090 is twice as fast. With two GPUs, an efficient setup puts the main work on the RTX 3090 and runs subtasks on the RTX 3060.
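A back-of-the-envelope check of the bandwidth argument: if generating each token means reading all the weights once, tok/s cannot exceed memory bandwidth divided by model size. The sketch below assumes that the 5.5GB qwen3:8b occupied on the RTX 3060 is a reasonable stand-in for the weight size (an assumption, since that figure also includes some cache); `bandwidth_ceiling` is a hypothetical helper.

```python
def bandwidth_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tok/s if every token requires one full read of the weights."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 5.5  # assumption: VRAM used by qwen3:8b on the RTX 3060, from the table

# (memory bandwidth in GB/s, measured tok/s) taken from the article
measured = {"RTX 3090": (936, 126.4), "RTX 3060": (360, 60.1)}

for gpu, (bandwidth, tok_s) in measured.items():
    ceiling = bandwidth_ceiling(bandwidth, MODEL_GB)
    print(f"{gpu}: ceiling {ceiling:.0f} tok/s, measured {tok_s} tok/s "
          f"({tok_s / ceiling:.0%} of ceiling)")
```

Under that assumption the ceilings come out at roughly 170 and 65 tok/s. The RTX 3060 sits closer to its ceiling than the RTX 3090 does, which would fit the measured gap being about 2x rather than the 2.6x bandwidth ratio.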
gemma4’s 133.0 tok/s feels “instant.” On my setup I barely notice any practical response lag. As a rough feel for tok/s: 15 or below is slow (you wait), 30 is comfortable, and 40+ comes back right away. At 133 tok/s, the text starts pouring in the moment you finish typing.
When running simultaneously, the 27B model’s speed (25.5 tok/s) is unchanged from running it alone. The 8B side dips slightly from 126 to 119 tok/s, but you don’t notice it. In other words, running two completely different models at once causes almost no practical performance loss.
The stunning parallel-test results
Here’s the heart of the article. I measured “what happens when you throw multiple requests at a single model at once.”
Parallel test of qwen3:8b on the RTX 3090
| Concurrency | tok/s per request | Total throughput (tok/s) | Slowdown |
|---|---|---|---|
| 1 | 126.4 | 126 | baseline |
| 8 | 125.8 | 1,006 | -0.5% |
| 16 | 127.2 | 2,035 | +0.6% |
| 32 | 125.7 | 4,021 | -0.6% |
| 64 | 125.5 | 8,034 | -0.7% |
| 128 | 125.6 | 16,081 | -0.6% |
★ Measured on the RTX 3090 (24GB). Baseline is the single measured value, 126.4 tok/s. Measured April 2026.
This result surprised me. Even at 128-way concurrency, the per-request speed barely drops. Total throughput grows from 126 tok/s to 16,081 tok/s — a 127x increase. It’s near-perfect linear scaling.
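For anyone who wants to try a similar measurement, here is a minimal sketch against Ollama's HTTP API. It is not necessarily how the numbers above were produced. It relies on the `/api/generate` endpoint at the default local address and on the `eval_count` and `eval_duration` fields of the reply; the helper names and the prompt are made up for illustration. One thing worth checking when reproducing this: as far as I know, Ollama processes at most OLLAMA_NUM_PARALLEL requests per model at the same time and queues the rest, so the sketch also reports a wall-clock total next to the per-request figure.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def tokens_per_second(reply: dict) -> float:
    """Per-request generation speed; eval_duration is reported in nanoseconds."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def wall_clock_throughput(replies: list[dict], elapsed_s: float) -> float:
    """Tokens generated by all requests divided by elapsed wall-clock seconds."""
    return sum(reply["eval_count"] for reply in replies) / elapsed_s


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming request and return the parsed JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_parallel(model: str, prompt: str, concurrency: int) -> None:
    """Fire `concurrency` identical requests at once and print both speed figures."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        replies = list(pool.map(lambda _: generate(model, prompt), range(concurrency)))
    elapsed = time.perf_counter() - start
    per_request = sum(tokens_per_second(r) for r in replies) / len(replies)
    print(f"{concurrency:>3} parallel: {per_request:.1f} tok/s per request, "
          f"{wall_clock_throughput(replies, elapsed):.0f} tok/s wall-clock total")
```

With Ollama running, a call such as `run_parallel("qwen3:8b", "Explain the KV cache in three sentences.", 8)` prints one line per run. The wall-clock figure includes prompt processing and any queueing, so it will normally be lower than per-request speed multiplied by concurrency.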
Parallel test of qwen3.5:27b (split across RTX 3090 + RTX 3060)
| Concurrency | tok/s per request | Total throughput (tok/s) | Slowdown |
|---|---|---|---|
| 1 | 25.5 | 26 | baseline |
| 4 | 26.1 | 105 | none |
| 8 | 26.2 | 209 | none |
★ Measured with a 2-GPU split across the RTX 3090 + RTX 3060. Measured April 2026.
The 27B model shows the same trend. No slowdown even at 8-way concurrency. Even split across two GPUs, the parallel-processing scaling doesn’t break down.
Why does this happen?
This “concurrency goes up but speed doesn’t fall” behavior makes sense once you understand how LLM inference works.
The model weights (parameters) are loaded onto the GPU just once. 10.3GB of model data is the same 10.3GB whether it’s 1 request or 128. What grows is only each request’s KV cache (the memory that holds the conversation context), and for short conversations that stays around a few MB per request.
The bottleneck in LLM inference is not the GPU’s compute power but memory bandwidth. The speed of reading the model weights from VRAM is the bottleneck. Even with multiple requests, the weights only need to be read once, so as concurrency rises, the per-request cost barely increases.
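To put rough numbers on the KV cache, the arithmetic can be sketched like this. The layer count, KV-head count, head size, and 16-bit cache values are assumed figures for an 8B-class model with grouped-query attention, not values read from the model file, and `kv_bytes_per_token` is a hypothetical helper.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed architecture for illustration (not taken from the article's measurements).
PER_TOKEN = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128)

for prompt_tokens in (57, 381, 18_021):
    mib = PER_TOKEN * prompt_tokens / 2**20
    print(f"{prompt_tokens:>6} tokens -> {mib:,.0f} MiB of KV cache")
```

Under these assumptions a short question costs under ten megabytes, while the weights stay fixed at several gigabytes no matter how many requests share them.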
The parallel-processing trade-off: context length
I wrote that “raising concurrency doesn’t slow things down,” but one factor does slow generation: not concurrency, but the length of the prompt (input text).
qwen3:8b on the RTX 3090: prompt length vs speed
| Prompt length | Approx. Japanese chars | Generation speed (tok/s) | Change |
|---|---|---|---|
| 57 tok | ~60 chars | 127.8 | baseline |
| 381 tok | ~400 chars | 126.3 | -1.2% |
| 1,821 tok | ~1,800 chars | 119.7 | -6.3% |
| 3,621 tok | ~3,600 chars | 115.8 | -9.4% |
| 7,221 tok | ~7,200 chars | 108.2 | -15.3% |
| 18,021 tok | ~18,000 chars | 91.1 | -28.7% |
★ Measured on the RTX 3090 (24GB). Measured April 2026.
The longer the prompt, the larger the KV cache and the more data to process, so speed falls. A short ~60-character question runs at 127.8 tok/s, but feeding in 18,000 characters (about 45 manuscript pages) drops it to 91.1 tok/s.
Even so, 91 tok/s with 18,000 characters of input is plenty comfortable. By the feel-based benchmark of 30 tok/s being “comfortable” and 40+ “comes back right away,” even a very long context poses no practical problem.
A note on the relationship between tokens and Japanese characters. Measured with qwen3:8b, Japanese came out at roughly 1–1.2 characters per token. In English one token is about 4 characters, but Japanese consumes more tokens per character. The “~60 chars” and “~1,800 chars” figures in this article are based on that conversion.
The multi-turn conversation case
In real chat, it’s not a single long input but short exchanges piling up over and over. In a 15-turn conversation (4,557 tokens), the slowdown was just -0.6%.
This is thanks to Ollama’s KV-cache reuse. Rather than reprocessing the entire past history from scratch each time, it reuses cached results, keeping overhead small even as turns accumulate.
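The effect of reusing a cached prefix can be illustrated with a toy calculation. This is a simplified picture of prefix reuse in general, not Ollama's implementation; `tokens_to_process` and the token IDs are made up for the example.

```python
def tokens_to_process(cached: list[int], prompt: list[int]) -> int:
    """Number of prompt tokens left to evaluate after reusing the shared prefix."""
    shared = 0
    for old, new in zip(cached, prompt):
        if old != new:
            break
        shared += 1
    return len(prompt) - shared


history = list(range(4_500))          # stand-in token IDs for the earlier turns
next_prompt = history + [9001, 9002]  # same history plus a short new message
print(tokens_to_process(history, next_prompt))  # 2
```

Only the two new tokens need processing here; without the cache, all 4,502 would be evaluated again on every turn.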
What happens when you hit the context limit
Ollama has a per-model context limit. What happens when you hit it is surprisingly little-known.
The session doesn’t stop. No error appears. When you exceed the limit, Ollama silently drops old conversation history: without warning, it discards messages starting from the beginning of the conversation. From the user’s side, it surfaces as “hmm, it doesn’t remember what we just talked about.”
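The behavior described above resembles a sliding window over the conversation. A minimal sketch, assuming whole turns are dropped oldest-first (the real truncation may work at the token level or treat the system prompt differently; `fit_to_context` and the token counts are hypothetical):

```python
def fit_to_context(turns: list[tuple[str, int]], limit: int) -> list[tuple[str, int]]:
    """Drop the oldest turns until the remaining token counts fit the limit."""
    kept = list(turns)
    while kept and sum(tokens for _, tokens in kept) > limit:
        kept.pop(0)  # the start of the conversation goes first
    return kept


# (text, token count) pairs; the counts are invented for the example
turns = [("turn 1", 1500), ("turn 2", 1200), ("turn 3", 900)]
print([text for text, _ in fit_to_context(turns, limit=2500)])  # ['turn 2', 'turn 3']
```

Nothing in this loop raises an error or tells the caller that "turn 1" is gone, which matches how the loss shows up only as a forgetful answer.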
The context limit is effectively decided by free VRAM. What’s left after subtracting the model’s own VRAM usage is what’s available for the KV cache. And that’s where the trade-off with parallel processing appears.
The 27B model: concurrency vs context length
For qwen3.5:27b (26.2GB total when loaded, about 10GB free for the KV cache):
| Concurrency | Context limit per request | Approx. Japanese chars |
|---|---|---|
| 1 | ~24,000 tok | ~24,000 chars |
| 4 | ~6,000 tok | ~6,000 chars |
| 8 | ~3,000 tok | ~3,000 chars |
As concurrency rises, the requests split the same KV-cache budget, so the usable context length per request shrinks. Used by one person, you can hold ~24,000 characters (about 60 manuscript pages) of context; used by eight at once, it drops to about 3,000 characters per person.
3,000 characters is roughly “10 back-and-forths of a quick Q&A.” Plenty for everyday chat, but short for summarizing long documents or referencing a lot of prior conversation.
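The table follows from dividing one fixed token budget among the concurrent requests. A small sketch of that arithmetic, where the 24,000-token budget is the single-user figure from the table and `context_per_request` is a hypothetical helper:

```python
def context_per_request(total_context_tokens: int, concurrency: int) -> int:
    """Tokens of context each request gets when the budget is split evenly."""
    return total_context_tokens // concurrency


TOTAL_TOKENS = 24_000  # single-user context limit from the table above

for concurrency in (1, 4, 8):
    print(f"{concurrency} concurrent -> "
          f"~{context_per_request(TOTAL_TOKENS, concurrency):,} tokens each")
```

The same division gives a quick way to estimate how many simultaneous users a given minimum context length allows.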
Chart: concurrency vs total throughput
How to read this chart: the longer the bar, the greater the total throughput (the total tokens all requests generate per second). If the bars grow proportionally as concurrency rises, it means near-perfect linear scaling. 16,081 tok/s at 128-way concurrency is striking.
★ Measured with qwen3:8b on the RTX 3090 (24GB). Measured April 2026.
The 27B model (qwen3.5:27b, split across RTX 3090 + 3060) shows the same trend: 209 tok/s total at 8-way concurrency, with zero per-request slowdown.
The figure “16,081 tok/s at 128-way concurrency” is a benchmark maximum; 128 people chatting at once is hard to imagine for home use. But 8–16 people using it simultaneously is realistic, and under 1% slowdown across that range shows it can comfortably handle small-team use.
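The same data can be drawn as a quick text chart. The sketch below uses the total-throughput column of the qwen3:8b parallel test; `bar` is a hypothetical helper and the 50-character width is arbitrary.

```python
# Concurrency -> total throughput (tok/s), from the qwen3:8b parallel test table
throughput = {1: 126, 8: 1_006, 16: 2_035, 32: 4_021, 64: 8_034, 128: 16_081}


def bar(value: int, max_value: int, width: int = 50) -> str:
    """Scale a value to a bar of at most `width` characters (minimum one)."""
    return "#" * max(1, round(value / max_value * width))


peak = max(throughput.values())
for concurrency, tok_s in throughput.items():
    print(f"{concurrency:>3} | {bar(tok_s, peak):<50} {tok_s:,} tok/s")
```

Each doubling of concurrency roughly doubles the bar length, which is the linear scaling described above.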
Cost comparison: local AI vs cloud AI
Finally, a cost comparison. I’ll verify with numbers whether investing in a GPU really makes sense, based on prices as of April 2026.
| Configuration | Upfront cost | Monthly cost | Concurrent users | Model quality |
|---|---|---|---|---|
| ChatGPT Plus × 1 person | ¥0 | ¥3,000 | 1 | GPT-5 class |
| ChatGPT Plus × 8 people | ¥0 | ¥24,000 | 8 | GPT-5 class |
| Used RTX 3090 + used RTX 3060 | ~¥170k | Electricity ~¥5,000 | 8 (27B) / 128 (8B) | 27B or 8B |
Annual cost comparison (8-person use)
- ChatGPT Plus × 8 people: ¥24,000/month × 12 months = ¥288,000/year
- Local AI (dual-GPU): ¥170k upfront + electricity ¥5,000/month × 12 months = ¥230k the first year, ¥60k per year thereafter
- Difference: ¥58k saved in year one, ¥228k per year from year two
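The break-even point behind these figures can be worked out with a few lines. This is a sketch using the table's numbers (¥170k upfront, ¥5,000 electricity, ¥3,000 per ChatGPT Plus seat); `break_even_month` is a hypothetical helper.

```python
import math
from typing import Optional


def break_even_month(upfront: int, local_monthly: int,
                     cloud_monthly: int) -> Optional[int]:
    """First month in which cumulative local cost is no higher than cloud cost."""
    monthly_saving = cloud_monthly - local_monthly
    if monthly_saving <= 0:
        return None  # local running costs alone match or exceed the cloud bill
    return math.ceil(upfront / monthly_saving)


print(break_even_month(170_000, 5_000, 8 * 3_000))  # eight seats -> 9
print(break_even_month(170_000, 5_000, 1 * 3_000))  # one seat -> None
```

On these figures, eight seats break even in month 9, while a single ¥3,000 subscription costs less than the electricity alone. The cost argument therefore depends on having several users.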
Of course, there’s a quality gap between GPT-5-class and a local 27B model. If you need cutting-edge reasoning or coding help, ChatGPT Plus or Claude Pro is well worth paying for.
Local AI shows its true value in cases like these.
- Privacy matters: you don’t want to send internal documents or personal data outside
- Many users: cloud subscriptions grow with each family or team member, while local costs nothing extra no matter how many people use it
- Offline use: it works reliably even when the network is flaky
- Customization: you’re free to choose or fine-tune models
Setup steps
Here’s a concise summary of how to actually build a dual-GPU setup. It assumes Linux (Ubuntu) as the OS, but the basic ideas are the same on Windows.
1. Configure GPU assignment (CUDA_VISIBLE_DEVICES)
Which GPU Ollama uses is controlled by the environment variable CUDA_VISIBLE_DEVICES. Specifying it in the systemd service file is the reliable way.
    # Edit Ollama's systemd service file
    sudo systemctl edit ollama

    # Add the following
    [Service]
    Environment="CUDA_VISIBLE_DEVICES=0,1"

    # Restart Ollama so the setting takes effect
    sudo systemctl restart ollama
To have Ollama recognize both GPU0 and GPU1, specify 0,1. If you want to pin ComfyUI to a specific GPU, specify it in ComfyUI’s launch script like CUDA_VISIBLE_DEVICES=1.
    # Example: dedicate ComfyUI to GPU1 (RTX 3090)
    CUDA_VISIBLE_DEVICES=1 python main.py --listen
2. Share on the LAN with Open WebUI
To let family or team members access it from a browser, Open WebUI is handy.
    # Launch with Docker (expose port 3000 on the LAN)
    docker run -d --name open-webui \
      -p 3000:8080 \
      -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
      --add-host=host.docker.internal:host-gateway \
      --restart always \
      ghcr.io/open-webui/open-webui:main
From another PC or phone on the LAN, access http://<server-IP-address>:3000 to use local AI in a ChatGPT-like UI. You can create per-user accounts, so chat history is managed separately for each person.
3. Confirm it’s working
    # Check GPU recognition
    nvidia-smi

    # List Ollama models
    ollama list

    # Test a model
    ollama run qwen3:8b "Hello"

    # Currently loaded models and GPU usage
    ollama ps
If nvidia-smi shows both GPUs, ollama ps reports the model as running on the GPU rather than the CPU, and the memory usage in nvidia-smi appears on the card you intended, setup is complete.
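The per-GPU check can also be scripted. The sketch below calls nvidia-smi with its CSV query options and parses the result; `parse_gpu_csv` and `read_gpus` are hypothetical helpers, and the sample line in the comment uses invented memory figures. Note that nvidia-smi lists GPUs in PCI bus order, while CUDA's default ordering puts the fastest device first, so the indices may not match what CUDA_VISIBLE_DEVICES selects unless CUDA_DEVICE_ORDER=PCI_BUS_ID is also set.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,name,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse lines such as '0, NVIDIA GeForce RTX 3060, 5632, 12288' (values in MiB)."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name,
                     "used_mib": int(used), "total_mib": int(total)})
    return gpus


def read_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per GPU."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(output)
```

Calling `read_gpus()` once before and once after `ollama run` shows which card's `used_mib` grew, which is a quick way to confirm the model landed where intended.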
Summary
Here’s what the measured dual-GPU data reveals.
Three benefits of a dual-GPU setup:
- More VRAM: Ollama’s automatic splitting runs large models (27B) that won’t fit on one card
- Task separation: run chat and image generation simultaneously on different GPUs without interference
- Parallel processing: under 1% slowdown even at 128-way concurrency. Your whole family or team can use it at once
Points to watch:
- VRAM is not unified (24+12 does not become 36GB)
- Raising concurrency shortens the per-person context length
- When the context limit is hit, Ollama drops old conversation without warning
The parallel performance far exceeded my expectations. A throughput of 16,081 tok/s at 128-way concurrency has an overwhelming cost advantage over the metered billing of cloud APIs. The image of “local AI as a single-person tool” is, on this data, completely overturned.
A used RTX 3090 runs ¥130k–180k, a used RTX 3060 12GB about ¥40k. For ¥170k–220k total, you get an environment where eight or more people can chat with AI at once. Against a year of eight ChatGPT Plus subscriptions (¥288,000), it pays for itself within the first year.