Getting Started with Ollama: Run a Local AI Chatbot in 10 Minutes
I use Ollama every day to run local LLMs — AI chat that runs entirely on my own PC. On a dual-GPU machine, I handle chat, proofreading, coding help, and even image descriptions without sending a single byte to the cloud.
This article walks through everything from installing Ollama to actually using it in practice, with measured numbers from my own machine. In the budget guide I covered “what you need to run local AI"; this time we go one step further — actually getting your hands dirty and setting Ollama up.
The short version: setting up Ollama takes only a few steps. In about 10 minutes you can be chatting with a model in your own language.
What is Ollama?
Right now it’s the easiest tool for running local LLMs.
- Windows / Mac / Linux: works the same way on every OS
- One command to download and run a model: a single line in the terminal starts an AI chat
- Free and open source: no usage fees at all
- OpenAI-compatible API: apps and tools built for ChatGPT work as-is
- Simple model management: download, delete, and list models with one command each
Under the hood, Ollama runs a server in the background and exposes an API at localhost:11434. That covers terminal chat, browser UIs, and calls from your own code.
Installation (by OS)
Windows
- Go to ollama.com
- Click “Download for Windows"
- Run the downloaded installer (OllamaSetup.exe)
- Follow the prompts (no special configuration needed)
- When it’s done, open PowerShell or Command Prompt
ollama --version
If a version number appears, the install succeeded.
Mac
With Homebrew:
brew install ollama
With the installer:
- Go to ollama.com
- Click “Download for macOS"
- Drag the downloaded app into your Applications folder
- Launch Ollama.app (an icon appears in the menu bar)
Open a terminal and confirm with ollama --version.
Linux
Just one line in the terminal.
curl -fsSL https://ollama.com/install.sh | sh
If you want to use an NVIDIA GPU, you need the CUDA drivers installed first. As long as the nvidia-smi command works, you’re good.
# Check that the GPU is recognized nvidia-smi
If your GPU name and driver version show up, you’re set.
Running your first model
Once installed, run this in the terminal.
ollama run qwen3:8b
That’s it. The first time, the model downloads automatically.
Download:
- Model size: about 5.2 GB
- Roughly 1–5 minutes depending on your connection
- Only downloads once; subsequent launches are instant
Startup:
- Cold start (loading the model into VRAM): about 2.1 seconds on an RTX 3090 (measured)
- When loading finishes, a
>>>prompt appears and you can chat right away
Try talking to it.
>>> Hello. Please introduce yourself.
If you get a reply, it’s working. To end the chat, type /bye or press Ctrl+D.
Recommended models (measured data)
There are hundreds of models available through Ollama, but only a handful are genuinely practical. Here’s data I measured on my own machine.
Per-model benchmark
| Model | DL size | VRAM used | Speed (tok/s) | Quality | Best for |
|---|---|---|---|---|---|
| ★ qwen3:8b | 5.2GB | 10.3GB | 126.4 | ○ Decent | Everyday chat, simple questions |
| ★ qwen3.5:9b | 6.6GB | 9.8GB | 98.0 | ○ Good | Proofreading, coding help |
| ★ gemma4 (8B) | 9.6GB | 11.2GB | 133.0 | ○ Good | When you want fast responses |
| ★ qwen3.5:27b | 17.4GB | 18.2GB* | 25.5 | ◎ Very good | Serious Q&A, summarization |
Test setup: RTX 3090 (24GB) / Linux / Ollama 0.20.2 / measured April 2026
The 27b model was measured with a two-GPU split load (RTX 3090 + RTX 3060)
★ = author-measured values (RTX 3090 / RTX 3060, April 2026). For how estimates are calculated, see the full GPU spec list.
How to read this data
Speed (tok/s) is “tokens generated per second." Here’s a rough feel for it.
| tok/s | Feel |
|---|---|
| 15 or less | Slow. You wait. |
| 20 | A slight wait, but readable |
| 30 | Comfortable |
| 40+ | Comes back instantly |
qwen3:8b at 126 tok/s is “text pouring down like a waterfall." It far exceeds human reading speed, so there’s zero waiting stress. qwen3.5:27b at 25.5 tok/s (two-GPU split) sits in the “comfortable" range — a natural pace to read even for longer answers.
Recommendation by VRAM
8 GB of VRAM means 8B models, full stop. qwen3:8b uses 10.3 GB, but the standard tag is already 4-bit quantized (Q4_K_M), and much of that usage is context memory. Set a shorter context length and you can run it on 8 GB.
16 GB makes 9B models comfortable. qwen3.5:9b and gemma4 run at their default settings with room to spare.
24 GB and up opens the door to 27B. qwen3.5:27b uses 18.2 GB. An RTX 3090 (24 GB) handles it easily. The quality of a 27B model is clearly a notch above 8B — the “wait, this is running locally?" level.
Essential commands
Everything in Ollama is done from the terminal. There are only six commands to learn.
| Command | What it does | Example |
|---|---|---|
ollama run <model> |
Start chat (auto-downloads if needed) | ollama run qwen3:8b |
ollama pull <model> |
Download a model only | ollama pull gemma4 |
ollama list |
List downloaded models | ollama list |
ollama ps |
Show currently running models | ollama ps |
ollama rm <model> |
Delete a model (free up storage) | ollama rm qwen3:8b |
ollama show <model> |
Show model details | ollama show qwen3:8b |
Common patterns
Try a model:
ollama run qwen3:8b
Delete unused models to free storage:
ollama list # check the list ollama rm gemma4 # delete a model you don't need
Check what’s running right now:
ollama ps
It also shows VRAM usage, which is handy when you’re wondering “wait, why am I out of VRAM?"
Using a ChatGPT-like UI (Open WebUI)
Terminal chat is great for a quick sanity check, but for daily use a browser UI is much nicer. With Open WebUI you can chat with your Ollama models in a ChatGPT-style interface.
Setup (one Docker command)
If you already have Docker installed, just run this.
docker run -d -p 3000:8080 --gpus all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:ollama
Once it’s up, open http://localhost:3000 in your browser and the chat screen appears. On first access it asks you to create an account — this is a local account (nothing is sent externally).
Why Open WebUI is nice
- Switch between models: flip from qwen3:8b to gemma4 with a dropdown
- Saved chat history: every past conversation is kept, and searchable
- File upload: drag and drop text files or PDFs to hand them over
- Accessible from other PCs and phones on your LAN: reach it from every device at home via
http://<server-IP>:3000
Choosing a GPU
The GPU you need to run Ollama comfortably comes down to the size of the model you want to run.
| GPU (VRAM) | Runnable models | Speed feel | Used price (as of April 2026) |
|---|---|---|---|
| GTX 1660 (6GB) | 4B models only | Slow (under 15 tok/s) | Used ¥10–20k |
| ★ RTX 3060 12GB | 8B–12B | Practical (60 tok/s) | Used ¥20–35k |
| RTX 4060 Ti 16GB | up to 14B | Comfortable (23–42 tok/s) | Used ¥45–60k |
| ★ RTX 3090 24GB | 27B–32B | Serious (25.5 tok/s+) | Used ¥130–180k |
| Mac M4 Pro 24GB | 14B–27B | Comfortable (20–40 tok/s) | Price of the Mac |
★ = author-measured values (RTX 3090 / RTX 3060, April 2026). For how estimates are calculated, see the full GPU spec list.
How to read this table
The amount of VRAM decides “how large a model you can run," and the model’s size decides “how smart the AI is." In short, VRAM ≒ the ceiling on how smart your AI can be.
- “Just want to try it": RTX 3060 12GB (used ¥20–30k) with an 8B model. Plenty for everyday questions
- “Want to use it for work": RTX 4060 Ti 16GB (used ~¥50k) with a 14B model. Proofreading and coding help become genuinely useful
- “Want to go all in": RTX 3090 24GB (used ¥130k+) with a 27B–32B model. Quality close to cloud AI
For a detailed GPU comparison, see “Starting local AI with a used GPU."
Common problems and fixes
Ollama is stable software, but there are a few spots where the first setup tends to trip people up.
| Problem | Cause | Fix |
|---|---|---|
| “out of memory" error | Not enough VRAM | Switch to a smaller model. If 8B fails, try 4B |
| Responses are unusually slow | GPU not detected; running on CPU | Check that nvidia-smi sees the GPU. If not, reinstall the driver |
| Output quality is poor in your language | The model’s language ability | Switch to a qwen3 or gemma4 model; the llama family is weaker outside English |
| Cold start is long (10s+) | Loading the model into VRAM | Normal. Later launches are fast (the model stays in memory) |
Connection error on ollama run |
The Ollama server isn’t running | Start it manually with ollama serve. On Linux, systemctl start ollama |
| Model download stops partway | Network issue | Re-run the same command; it resumes from where it left off |
How to confirm the GPU is recognized
# For NVIDIA GPUs nvidia-smi
If the output shows your GPU’s name (e.g., “NVIDIA GeForce RTX 3090"), you’re good. If not, you need to install the NVIDIA driver.
# Check which GPU Ollama is using ollama ps
If the PROCESSOR column of ollama ps shows something like “100% GPU," inference is running on the GPU. If it says “100% CPU," the GPU isn’t being used.
Bonus: it handles concurrent users
Ollama can process multiple requests at once, so a single PC can serve AI to your whole family or team at the same time.
Measured data
I measured concurrent-access performance on my machine (RTX 3090).
qwen3:8b (8B model):
- Single user: 126.4 tok/s
- 128 concurrent users: 125.6 tok/s (just 0.6% slower)
qwen3.5:27b (27B model):
- Single user: 25.5 tok/s
- 8 concurrent users: 25.8 tok/s (essentially no drop)
With an 8B model, even 128 simultaneous users barely dent the speed. If it’s just 3–4 family members using it at home, you have nothing to worry about on the performance side.
How to share it
Sharing at home is easy with Open WebUI.
- Find the IP address of the PC running Open WebUI (e.g.,
192.168.1.100) - From another PC, phone, or tablet, open
http://192.168.1.100:3000in the browser - Have each person create an account and log in
Chat history is separated per account, so privacy is preserved.
Summary
As shown throughout, setting up Ollama takes 10 minutes.
- Install: one command or installer per OS
- First chat:
ollama run qwen3:8bstarts a conversation - Browser UI: add Open WebUI for a ChatGPT-like feel
- Share with family: reachable from every device on your LAN
The first hurdle is choosing a GPU, but a used RTX 3060 12GB (¥20–30k) is plenty practical with an 8B model. If you already have a GPU, try ollama run qwen3:8b right now.
GPUs mentioned in this article
Once you’ve experienced how convenient local AI is, the peace of mind of “I don’t have to send this anywhere" and the ease of “no monthly fee" make it hard to give up. That’s exactly what happened to me.
・"Running a local AI chatbot at home: a budget-by-budget guide" — what you need for local AI, broken down by budget
・"Starting local AI with a used GPU: the value of the RTX 30/40 series" — a detailed GPU comparison
Prices and benchmark figures in this article are as of April 2026. Ollama’s command set and model lineup may change. Check ollama.com for the latest.


Recent Comments