Installing Ollama on Windows: Complete Guide
Ollama is one of the simplest ways to run local AI models on Windows. It handles model downloads and GPU detection, and it exposes a local HTTP API on port 11434. This guide gets you from zero to running a model in under 10 minutes, not counting the model download.
Requirements
- Windows 10 or 11 (64-bit)
- 8GB+ system RAM (16GB recommended)
- GPU with 4GB+ VRAM (NVIDIA recommended, AMD supported, CPU fallback available)
- 20GB+ free disk space for models
Step 1 — Download and Install
Go to ollama.com and click Download for Windows.
Run the installer. It installs Ollama as a background service that starts automatically with Windows. You'll see the Ollama icon in your system tray.
Alternatively, install via winget:
winget install Ollama.Ollama
Step 2 — Verify the Installation
Open PowerShell and run:
ollama --version
You should see a version number. If PowerShell reports that ollama is not recognized as a command, restart your terminal. Ollama adds itself to PATH during installation, and only new sessions pick up the change.
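Check that Ollama is running. Use curl.exe rather than curl, because in Windows PowerShell curl is an alias for Invoke-WebRequest and prints a full response object instead of the plain reply:
curl.exe http://localhost:11434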
# Should return: Ollama is running
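If you would rather run this check from a script, the sketch below does the same thing in Python using only the standard library. It assumes the default address used in this guide; the helper names `looks_like_ollama` and `server_is_up` are made up for illustration.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:11434"  # default address used in this guide


def looks_like_ollama(body: str) -> bool:
    """Return True if the root endpoint's reply is the expected banner."""
    return body.strip() == "Ollama is running"


def server_is_up(base_url: str = BASE_URL, timeout: float = 3.0) -> bool:
    """Fetch the root URL and compare the reply to the banner.

    Returns False on connection errors and timeouts instead of raising.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return looks_like_ollama(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False
```

Calling `server_is_up()` should return True while Ollama is running in the system tray and False once you quit it.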
Step 3 — Run Your First Model
Pull and run Llama 3.1 8B (recommended for 6GB+ VRAM):
ollama run llama3.1:8b
Ollama downloads the model file (~4.7GB) and starts a chat session. Type your message and press Enter.
For lower VRAM (4GB or less):
ollama run phi3:mini
For higher VRAM (12GB+):
ollama run llama3.1:8b-instruct-q8_0 # near-lossless quality
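Once a model is pulled, you can also query it over the local HTTP API instead of the interactive chat. The following is a minimal sketch, assuming the default port and the `/api/generate` endpoint with streaming turned off; the function names are hypothetical, and the timeout is an arbitrary allowance for the first model load.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for a single, non-streamed completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        GENERATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the text from the "response" field."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

For example, `print(generate("llama3.1:8b", "Say hello in five words"))` should print the model's reply once the model has loaded.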
Step 4 — Verify GPU Usage
While a model is running, open a second PowerShell window:
nvidia-smi
On an NVIDIA card, you should see an Ollama process holding GPU memory. If no GPU memory is in use, Ollama is running on CPU; see the Troubleshooting section below.
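To compare memory use before and after loading a model, you can read the same numbers from a script. This sketch assumes an NVIDIA card with nvidia-smi on PATH and uses its CSV query mode; the function names are hypothetical. Note that the figure is total memory in use on each GPU, not memory used by Ollama alone.

```python
import subprocess

# Ask nvidia-smi for used and total memory (MiB) per GPU, as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list:
    """Parse lines like '5123, 8192' into (used_mib, total_mib) pairs."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def read_gpu_memory() -> list:
    """Run nvidia-smi and return one (used_mib, total_mib) pair per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Run `read_gpu_memory()` once with no model loaded and once during a chat session; the used figure should rise by roughly the size of the model if it is on the GPU.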
Step 5 — Configure for Your Setup
Ollama uses environment variables for configuration. Set these in Windows System Environment Variables (System Properties → Environment Variables → System Variables → New):
| Variable | Value | Effect |
|---|---|---|
| OLLAMA_NUM_GPU | 99 | Force all layers to GPU |
| OLLAMA_KEEP_ALIVE | 300 | Keep model loaded 5 min |
| OLLAMA_MAX_LOADED_MODELS | 1 | Only one model in VRAM at once |
| OLLAMA_HOST | 0.0.0.0:11434 | Allow network access |
After changing environment variables, restart the Ollama service:
Stop-Process -Name "ollama" -Force
Start-Process "ollama" -ArgumentList "serve"
Or quit Ollama from the system tray icon and start it again from the Start menu.
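If you want to try a setting before making it system-wide, you can start the server from a script with the variables applied to that one process. This is a sketch under a few assumptions: the ollama executable is on PATH, the tray instance has been quit first (otherwise port 11434 is already taken), and the two variables shown are just examples from the table. `build_env` and `start_server` are hypothetical helper names.

```python
import os
import subprocess

# Example values taken from the table in this guide; adjust as needed.
OLLAMA_SETTINGS = {
    "OLLAMA_KEEP_ALIVE": "300",
    "OLLAMA_MAX_LOADED_MODELS": "1",
}


def build_env(overrides, base=None):
    """Return a copy of the environment with the overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def start_server(overrides=OLLAMA_SETTINGS):
    """Start `ollama serve` with the overrides visible to that process only."""
    return subprocess.Popen(["ollama", "serve"], env=build_env(overrides))
```

`start_server()` returns a `Popen` handle; call its `terminate()` method to stop the server when you are done testing.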
Useful Commands
# List downloaded models
ollama list
# Pull a model without running it
ollama pull mistral:7b
# Remove a model
ollama rm llama3.1:8b
# Show model info
ollama show llama3.1:8b
# Run with a one-shot prompt (non-interactive)
ollama run llama3.1:8b "Explain quantization in one paragraph"
# List running models and VRAM usage
ollama ps
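The information from ollama list is also available over the API, which is handy when another program needs to know what is installed. The sketch below assumes the `/api/tags` endpoint on the default port and that each entry carries a `name` and a `size` in bytes; the function names are hypothetical.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"


def summarize_models(tags_response: dict) -> list:
    """Return (name, size in GB) pairs, largest first, from an /api/tags reply."""
    models = tags_response.get("models", [])
    rows = [(m["name"], round(m["size"] / 1e9, 1)) for m in models]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def list_models(timeout: float = 5.0) -> list:
    """Fetch the installed models from the local server and summarize them."""
    with urllib.request.urlopen(TAGS_URL, timeout=timeout) as resp:
        return summarize_models(json.loads(resp.read()))
```

Sorting by size makes it easy to spot which downloads to remove first when disk space runs low.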
Recommended Models by VRAM
| VRAM | Model | Command |
|---|---|---|
| 4GB | Phi-3 Mini | ollama run phi3:mini |
| 6–8GB | Llama 3.1 8B | ollama run llama3.1:8b |
| 8GB | Mistral 7B | ollama run mistral:7b |
| 12GB | Llama 3.1 8B Q8 | ollama run llama3.1:8b-instruct-q8_0 |
| 24GB | Gemma 2 27B | ollama run gemma2:27b |
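The table can be turned into a small lookup if you want a script to choose a model for you. This is only a sketch: the tier boundaries are a simplification of the table (Mistral 7B is left out as an alternative at 8GB), and the Q8 tag is written the way the Ollama library usually spells it, so check it against the model page before relying on it.

```python
# (minimum VRAM in GB, model tag) pairs, highest tier first.
# Boundaries are a simplification of the table in this guide.
TIERS = [
    (24, "gemma2:27b"),
    (12, "llama3.1:8b-instruct-q8_0"),  # assumed tag spelling; verify before use
    (6, "llama3.1:8b"),
    (0, "phi3:mini"),
]


def pick_model(vram_gb: float) -> str:
    """Return the tag for the highest tier that fits the given VRAM."""
    for minimum, tag in TIERS:
        if vram_gb >= minimum:
            return tag
    raise ValueError("vram_gb must be zero or greater")
```

For example, `pick_model(8)` returns `"llama3.1:8b"`, and anything below 6GB falls back to `"phi3:mini"`.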
Troubleshooting
Model running on CPU instead of GPU
Check that your NVIDIA drivers are up to date (525+ required). Run nvidia-smi to verify the driver version. If drivers are current, set OLLAMA_NUM_GPU=99 in environment variables.
Out of memory error
The model is too large for your VRAM. Try a smaller quantization:
ollama run llama3.1:8b-instruct-q4_0 # smaller than default Q4_K_M
Or use a smaller model.
Ollama not found after install
Restart PowerShell or your terminal. The PATH update requires a new session.
Slow generation speed
Run ollama ps to confirm the model is on GPU. If VRAM is insufficient, Ollama automatically offloads some layers to CPU — this significantly reduces speed.
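To put a number on how much of a model was offloaded, you can read the same data over the API. The sketch below assumes the `/api/ps` endpoint on the default port and that each loaded model reports `size` and `size_vram` in bytes; the function names are hypothetical.

```python
import json
import urllib.request

PS_URL = "http://localhost:11434/api/ps"


def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model's memory held in VRAM (1.0 means fully on GPU)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def report_offload(ps_response: dict) -> dict:
    """Map each loaded model's name to its GPU share, rounded to 2 places."""
    models = ps_response.get("models", [])
    return {m["name"]: round(gpu_fraction(m), 2) for m in models}


def running_models(timeout: float = 5.0) -> dict:
    """Query the local server and report the GPU share of each loaded model."""
    with urllib.request.urlopen(PS_URL, timeout=timeout) as resp:
        return report_offload(json.loads(resp.read()))
```

A value well below 1.0 means part of the model is running on CPU, which is the usual cause of slow generation; a smaller model or quantization should bring it back to 1.0.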
Next Steps
- Your First Model — how to choose the right model for your hardware
- Open WebUI Setup — get a ChatGPT-style interface
- Ollama API Guide — connect apps to your local model
- Finding Abliterated Models — run uncensored variants