Getting Started with Ollama: Run a Local AI Chatbot in 10 Minutes

2026年7月5日

I use Ollama every day to run local LLMs — AI chat that runs entirely on my own PC. On a dual-GPU machine, I handle chat, proofreading, coding help, and even image descriptions without sending a single byte to the cloud.

This article walks through everything from installing Ollama to actually using it in practice, with measured numbers from my own machine. In the budget guide I covered “what you need to run local AI"; this time we go one step further — actually getting your hands dirty and setting Ollama up.

The short version: setting up Ollama takes only a few steps. In about 10 minutes you can be chatting with a model in your own language.

What is Ollama?

Right now it’s the easiest tool for running local LLMs.

Windows / Mac / Linux: works the same way on every OS
One command to download and run a model: a single line in the terminal starts an AI chat
Free and open source: no usage fees at all
OpenAI-compatible API: apps and tools built for ChatGPT work as-is
Simple model management: download, delete, and list models with one command each

Under the hood, Ollama runs a server in the background and exposes an API at localhost:11434. That covers terminal chat, browser UIs, and calls from your own code.

Installation (by OS)

Windows

Go to ollama.com
Click “Download for Windows"
Run the downloaded installer (OllamaSetup.exe)
Follow the prompts (no special configuration needed)
When it’s done, open PowerShell or Command Prompt

ollama --version

If a version number appears, the install succeeded.

Mac

With Homebrew:

brew install ollama

With the installer:

Go to ollama.com
Click “Download for macOS"
Drag the downloaded app into your Applications folder
Launch Ollama.app (an icon appears in the menu bar)

Open a terminal and confirm with ollama --version.

Linux

Just one line in the terminal.

curl -fsSL https://ollama.com/install.sh | sh

If you want to use an NVIDIA GPU, you need the CUDA drivers installed first. As long as the nvidia-smi command works, you’re good.

# Check that the GPU is recognized
nvidia-smi

If your GPU name and driver version show up, you’re set.

Note: Install commands and the UI may change over time. If something doesn’t work, check the latest steps at ollama.com.

Running your first model

Once installed, run this in the terminal.

ollama run qwen3:8b

That’s it. The first time, the model downloads automatically.

Download:

Model size: about 5.2 GB
Roughly 1–5 minutes depending on your connection
Only downloads once; subsequent launches are instant

Startup:

Cold start (loading the model into VRAM): about 2.1 seconds on an RTX 3090 (measured)
When loading finishes, a >>> prompt appears and you can chat right away

Try talking to it.

>>> Hello. Please introduce yourself.

If you get a reply, it’s working. To end the chat, type /bye or press Ctrl+D.

Hands-on: On my machine (RTX 3090), qwen3:8b went from launch to first response in about 3 seconds. Because text streams in token by token, it feels even faster than the numbers suggest. On a fast connection, you can go from typing the command to “chatting with a local AI" in about 3 minutes.

Recommended models (measured data)

There are hundreds of models available through Ollama, but only a handful are genuinely practical. Here’s data I measured on my own machine.

Per-model benchmark

Model	DL size	VRAM used	Speed (tok/s)	Quality	Best for
★ qwen3:8b	5.2GB	10.3GB	126.4	○ Decent	Everyday chat, simple questions
★ qwen3.5:9b	6.6GB	9.8GB	98.0	○ Good	Proofreading, coding help
★ gemma4 (8B)	9.6GB	11.2GB	133.0	○ Good	When you want fast responses
★ qwen3.5:27b	17.4GB	18.2GB*	25.5	◎ Very good	Serious Q&A, summarization

Test setup: RTX 3090 (24GB) / Linux / Ollama 0.20.2 / measured April 2026
The 27b model was measured with a two-GPU split load (RTX 3090 + RTX 3060)

★ = author-measured values (RTX 3090 / RTX 3060, April 2026). For how estimates are calculated, see the full GPU spec list.

How to read this data

Speed (tok/s) is “tokens generated per second." Here’s a rough feel for it.

tok/s	Feel
15 or less	Slow. You wait.
20	A slight wait, but readable
30	Comfortable
40+	Comes back instantly

qwen3:8b at 126 tok/s is “text pouring down like a waterfall." It far exceeds human reading speed, so there’s zero waiting stress. qwen3.5:27b at 25.5 tok/s (two-GPU split) sits in the “comfortable" range — a natural pace to read even for longer answers.

Recommendation by VRAM

8 GB of VRAM means 8B models, full stop. qwen3:8b uses 10.3 GB, but the standard tag is already 4-bit quantized (Q4_K_M), and much of that usage is context memory. Set a shorter context length and you can run it on 8 GB.

16 GB makes 9B models comfortable. qwen3.5:9b and gemma4 run at their default settings with room to spare.

24 GB and up opens the door to 27B. qwen3.5:27b uses 18.2 GB. An RTX 3090 (24 GB) handles it easily. The quality of a 27B model is clearly a notch above 8B — the “wait, this is running locally?" level.

Essential commands

Everything in Ollama is done from the terminal. There are only six commands to learn.

Command	What it does	Example
`ollama run <model>`	Start chat (auto-downloads if needed)	`ollama run qwen3:8b`
`ollama pull <model>`	Download a model only	`ollama pull gemma4`
`ollama list`	List downloaded models	`ollama list`
`ollama ps`	Show currently running models	`ollama ps`
`ollama rm <model>`	Delete a model (free up storage)	`ollama rm qwen3:8b`
`ollama show <model>`	Show model details	`ollama show qwen3:8b`

Common patterns

Try a model:

ollama run qwen3:8b

Delete unused models to free storage:

ollama list          # check the list
ollama rm gemma4  # delete a model you don't need

Check what’s running right now:

ollama ps

It also shows VRAM usage, which is handy when you’re wondering “wait, why am I out of VRAM?"

Using a ChatGPT-like UI (Open WebUI)

Terminal chat is great for a quick sanity check, but for daily use a browser UI is much nicer. With Open WebUI you can chat with your Ollama models in a ChatGPT-style interface.

Setup (one Docker command)

If you already have Docker installed, just run this.

docker run -d -p 3000:8080 --gpus all
  -v ollama:/root/.ollama
  -v open-webui:/app/backend/data
  --name open-webui
  ghcr.io/open-webui/open-webui:ollama

Once it’s up, open http://localhost:3000 in your browser and the chat screen appears. On first access it asks you to create an account — this is a local account (nothing is sent externally).

Why Open WebUI is nice

Switch between models: flip from qwen3:8b to gemma4 with a dropdown
Saved chat history: every past conversation is kept, and searchable
File upload: drag and drop text files or PDFs to hand them over
Accessible from other PCs and phones on your LAN: reach it from every device at home via http://<server-IP>:3000

Hands-on: At my house, my wife opens Open WebUI from the browser on her iPad to ask for recipes. No installing Ollama, no terminal — it feels exactly like ChatGPT. She just bookmarked the server’s IP address.

Note: If you already have Ollama installed and don’t want to use Docker, there’s a different launch command. Check the latest install methods in the official Open WebUI repo.

Choosing a GPU

The GPU you need to run Ollama comfortably comes down to the size of the model you want to run.

GPU (VRAM)	Runnable models	Speed feel	Used price (as of April 2026)
GTX 1660 (6GB)	4B models only	Slow (under 15 tok/s)	Used ¥10–20k
★ RTX 3060 12GB	8B–12B	Practical (60 tok/s)	Used ¥20–35k
RTX 4060 Ti 16GB	up to 14B	Comfortable (23–42 tok/s)	Used ¥45–60k
★ RTX 3090 24GB	27B–32B	Serious (25.5 tok/s+)	Used ¥130–180k
Mac M4 Pro 24GB	14B–27B	Comfortable (20–40 tok/s)	Price of the Mac

★ = author-measured values (RTX 3090 / RTX 3060, April 2026). For how estimates are calculated, see the full GPU spec list.

How to read this table

The amount of VRAM decides “how large a model you can run," and the model’s size decides “how smart the AI is." In short, VRAM ≒ the ceiling on how smart your AI can be.

“Just want to try it": RTX 3060 12GB (used ¥20–30k) with an 8B model. Plenty for everyday questions
“Want to use it for work": RTX 4060 Ti 16GB (used ~¥50k) with a 14B model. Proofreading and coding help become genuinely useful
“Want to go all in": RTX 3090 24GB (used ¥130k+) with a 27B–32B model. Quality close to cloud AI

For a detailed GPU comparison, see “Starting local AI with a used GPU."

Common problems and fixes

Ollama is stable software, but there are a few spots where the first setup tends to trip people up.

Problem	Cause	Fix
“out of memory" error	Not enough VRAM	Switch to a smaller model. If 8B fails, try 4B
Responses are unusually slow	GPU not detected; running on CPU	Check that `nvidia-smi` sees the GPU. If not, reinstall the driver
Output quality is poor in your language	The model’s language ability	Switch to a qwen3 or gemma4 model; the llama family is weaker outside English
Cold start is long (10s+)	Loading the model into VRAM	Normal. Later launches are fast (the model stays in memory)
Connection error on `ollama run`	The Ollama server isn’t running	Start it manually with `ollama serve`. On Linux, `systemctl start ollama`
Model download stops partway	Network issue	Re-run the same command; it resumes from where it left off

How to confirm the GPU is recognized

# For NVIDIA GPUs
nvidia-smi

If the output shows your GPU’s name (e.g., “NVIDIA GeForce RTX 3090"), you’re good. If not, you need to install the NVIDIA driver.

# Check which GPU Ollama is using
ollama ps

If the PROCESSOR column of ollama ps shows something like “100% GPU," inference is running on the GPU. If it says “100% CPU," the GPU isn’t being used.

Bonus: it handles concurrent users

Ollama can process multiple requests at once, so a single PC can serve AI to your whole family or team at the same time.

Measured data

I measured concurrent-access performance on my machine (RTX 3090).

qwen3:8b (8B model):

Single user: 126.4 tok/s
128 concurrent users: 125.6 tok/s (just 0.6% slower)

qwen3.5:27b (27B model):

Single user: 25.5 tok/s
8 concurrent users: 25.8 tok/s (essentially no drop)

With an 8B model, even 128 simultaneous users barely dent the speed. If it’s just 3–4 family members using it at home, you have nothing to worry about on the performance side.

How to share it

Sharing at home is easy with Open WebUI.

Find the IP address of the PC running Open WebUI (e.g., 192.168.1.100)
From another PC, phone, or tablet, open http://192.168.1.100:3000 in the browser
Have each person create an account and log in

Chat history is separated per account, so privacy is preserved.

Summary

As shown throughout, setting up Ollama takes 10 minutes.

Install: one command or installer per OS
First chat: ollama run qwen3:8b starts a conversation
Browser UI: add Open WebUI for a ChatGPT-like feel
Share with family: reachable from every device on your LAN

The first hurdle is choosing a GPU, but a used RTX 3060 12GB (¥20–30k) is plenty practical with an 8B model. If you already have a GPU, try ollama run qwen3:8b right now.

GPUs mentioned in this article

【中古】ELSA GeForce RTX 3060 12GB

¥52,800 (as of 2026-06-22)

Amazon 楽天市場 Yahoo!

MSI GeForce RTX 5060 Ti 16GB VENTUS 2X OC PLUS

¥102,353 (as of 2026-06-22)

Amazon 楽天市場 Yahoo!

Once you’ve experienced how convenient local AI is, the peace of mind of “I don’t have to send this anywhere" and the ease of “no monthly fee" make it hard to give up. That’s exactly what happened to me.

Related
・"Running a local AI chatbot at home: a budget-by-budget guide" — what you need for local AI, broken down by budget
・"Starting local AI with a used GPU: the value of the RTX 30/40 series" — a detailed GPU comparison

Prices and benchmark figures in this article are as of April 2026. Ollama’s command set and model lineup may change. Check ollama.com for the latest.

▶ Go deeper on local AI (related)