Understanding VRAM: Why It's Everything in Local AI

VRAM (video RAM) is the memory on your GPU. It's the single most important factor in local AI. The model weights must fit in VRAM for fast inference. If they don't fit, the runtime offloads part of the model to system RAM, and inference can slow down by 10–50×.

What VRAM Is

Your GPU has its own dedicated memory separate from your system RAM. It's faster than system RAM (typically 400–2000 GB/s vs 50–100 GB/s for DDR5) and physically located on the GPU card itself.

When you load a model, the weights — billions of floating-point numbers — get loaded into VRAM. Every time the model generates a token, it reads through those weights. The speed of that read is determined by VRAM bandwidth. This is why VRAM bandwidth matters as much as VRAM capacity for inference speed.
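
As a rough illustration of why bandwidth matters, the sketch below assumes that generating one token reads every weight exactly once, so bandwidth divided by weight size gives an upper bound on tokens per second. The function name and the input figures are illustrative assumptions, and real throughput will be lower because of compute, KV cache reads, and overhead.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Illustrative inputs: 4.2 GB of weights read at 1000 GB/s (VRAM-class)
# versus 80 GB/s (DDR5-class). These are assumed round numbers.
print(round(max_tokens_per_second(1000, 4.2)))  # 238
print(round(max_tokens_per_second(80, 4.2)))    # 19
```

The ratio between the two results is simply the bandwidth ratio, which is why the same model runs so much slower once its weights live in system RAM.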

Why Models Need So Much VRAM

A 7B model has 7 billion parameters. At 16-bit precision (FP16, 2 bytes per parameter):

7,000,000,000 × 2 bytes = 14,000,000,000 bytes = ~14GB

That's 14GB just for the weights. Add KV cache for context and you're looking at 15–18GB for a 7B model at FP16 with a 16K context.

This is why quantization exists: reducing precision from 16-bit to roughly 4-bit drops the weights to about 4–5GB.
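
The arithmetic above generalizes to any precision. This is a minimal sketch, with a hypothetical helper name, that converts a parameter count and an average number of bits per weight into gigabytes. The 4.8 bits used for the 4-bit case is an assumed average, since quantized formats store scaling data alongside the weights and so land a little above their nominal bit width.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


print(weights_gb(7, 16))             # 14.0 (FP16)
# 4.8 bits/weight is an assumed average for a 4-bit quantization.
print(round(weights_gb(7, 4.8), 1))  # 4.2
```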

VRAM Requirements by Model Size and Quantization

Model	F16	Q8_0	Q5_K_M	Q4_K_M	Q2_K
3B	6GB	3.5GB	2.2GB	1.9GB	1.1GB
7B	14GB	7.5GB	5GB	4.2GB	2.5GB
8B	16GB	8.5GB	5.5GB	4.7GB	2.7GB
13B	26GB	14GB	9GB	7.5GB	4.5GB
27B	54GB	28GB	18GB	15GB	9GB
70B	140GB	70GB	47GB	40GB	22GB

Use the VRAM Calculator to get exact numbers for your setup.
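
For a quick check without the calculator, the sketch below looks up a few rows of the table above and returns the highest-precision quantization that fits a given card. The function name and the 1.5GB headroom for KV cache and runtime overhead are assumptions for illustration, not measured values.

```python
# Weight sizes in GB, copied from the table above (subset of rows).
SIZES_GB = {
    "7B":  {"F16": 14,  "Q8_0": 7.5, "Q5_K_M": 5,  "Q4_K_M": 4.2, "Q2_K": 2.5},
    "13B": {"F16": 26,  "Q8_0": 14,  "Q5_K_M": 9,  "Q4_K_M": 7.5, "Q2_K": 4.5},
    "27B": {"F16": 54,  "Q8_0": 28,  "Q5_K_M": 18, "Q4_K_M": 15,  "Q2_K": 9},
    "70B": {"F16": 140, "Q8_0": 70,  "Q5_K_M": 47, "Q4_K_M": 40,  "Q2_K": 22},
}
# Highest precision first.
QUANT_ORDER = ["F16", "Q8_0", "Q5_K_M", "Q4_K_M", "Q2_K"]


def best_quant(model: str, vram_gb: float, headroom_gb: float = 1.5):
    """Return the highest-precision quant that fits, or None if none do."""
    for quant in QUANT_ORDER:
        if SIZES_GB[model][quant] + headroom_gb <= vram_gb:
            return quant
    return None


print(best_quant("13B", 12))  # Q5_K_M
print(best_quant("27B", 24))  # Q5_K_M
print(best_quant("70B", 16))  # None
```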

KV Cache: The Hidden VRAM Cost

Beyond model weights, context (conversation history) takes VRAM in the form of KV cache. This grows with context length.

A rough rule of thumb: KV cache adds approximately 0.5–1GB per 4K tokens of context for a 7B model, and it grows with model size.

At 4K context, KV cache is small. At 32K context it becomes significant. If you're doing long-document RAG or very long conversations, factor this in.
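
To see where these numbers come from, here is a sketch of the standard KV cache size calculation: one key and one value vector per layer, per KV head, per token. The model shape used below (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption chosen to resemble a 7–8B model with grouped-query attention. Models without grouped-query attention have more KV heads and need proportionally more.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes for a single sequence."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
for ctx in (4096, 32768):
    print(ctx, kv_cache_bytes(32, 8, 128, ctx) / GIB)
# 4096 0.5
# 32768 4.0
```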

What Happens When the Model Doesn't Fit

When the model is too large for your VRAM, Ollama automatically offloads layers to system RAM. This works but is very slow.

Example: Llama 3.1 70B Q4_K_M (40GB) on an RTX 4090 (24GB):

24GB stays on GPU
16GB offloads to system RAM
Speed: ~3–5 tok/s instead of the ~21 tok/s you would get with the whole model in VRAM

It's technically functional but not practical for interactive use.
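
The split in this example can be estimated per layer. The sketch below assumes the model's size is spread evenly over its layers (80 layers is assumed for the 70B model) and lets you reserve some VRAM for KV cache and overhead. The helper name and the reserve value are hypothetical, and a real runtime makes this decision with more information.

```python
import math


def gpu_layer_split(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 0.0) -> tuple[int, int]:
    """Return (layers on GPU, layers in system RAM), assuming equal-size layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(n_layers, math.floor(usable_gb / per_layer_gb))
    return on_gpu, n_layers - on_gpu


# 40GB model, 80 layers (assumed), 24GB card.
print(gpu_layer_split(40, 80, 24))     # (48, 32)
# Same card with 2GB reserved for KV cache and overhead (assumed).
print(gpu_layer_split(40, 80, 24, 2))  # (44, 36)
```

Every layer left in system RAM is read at system RAM bandwidth on each token, which is why even a partial offload costs so much speed.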

Strategies for Running Larger Models

Strategy 1: Lower quantization. Drop from Q8_0 to Q4_K_M or lower. Quality decreases slightly but the model stays in VRAM.

Strategy 2: Reduce context length. KV cache uses VRAM in proportion to context. Lowering from 8K to 2K context frees meaningful VRAM.

In Ollama, set the num_ctx parameter from inside an interactive session:

ollama run llama3.1:8b
>>> /set parameter num_ctx 2048
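
The context length can also be set per request through Ollama's HTTP API, using the num_ctx field under options. The sketch below only builds the request so it can be inspected without a server. The helper name is hypothetical, and the host assumes Ollama's default local address.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str, num_ctx: int,
                           host: str = "http://localhost:11434"):
    """Build a POST request for /api/generate with a custom context length."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("llama3.1:8b", "Hello", num_ctx=2048)
print(req.full_url)

# To send it (requires a running Ollama server):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```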

Strategy 3: Use a smaller model. A well-quantized 13B model (Q5_K_M, ~9GB) often beats an offloaded 70B model on the quality/speed tradeoff.

Strategy 4: Add a GPU. Two RTX 3090s give you 48GB in total, with the model's layers split across both cards. An NVLink bridge speeds up traffic between them. See the Dual GPU NVLink guide.

Strategy 5: Use system RAM deliberately. For batch processing where speed doesn't matter, full CPU inference at Q4_K_M on a 32B model via llama.cpp is viable. It is slow (3–8 tok/s), but it needs no extra hardware and keeps the larger model's quality.

VRAM vs System RAM

Property	VRAM	System RAM
Speed	400–2000 GB/s	50–100 GB/s
Capacity	4–80GB typical	16–256GB typical
Used for	Model weights, KV cache	OS, other apps, CPU offload
Cost	High per GB	Low per GB

You want the model in VRAM. System RAM is a fallback, not a target.

Monitoring VRAM Usage

# NVIDIA — live view
watch -n 1 nvidia-smi

# NVIDIA — just memory
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Ollama's view
ollama ps
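
For scripting, the memory query above is easier to parse with nvidia-smi's csv,noheader,nounits format, which prints one line of plain numbers (in MiB) per GPU. The sketch below parses that text. The sample values are made up for illustration, and the function name is hypothetical.

```python
def parse_gpu_memory(csv_text: str) -> list[dict]:
    """Parse 'used, total' MiB lines, one line per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total,
                     "free_mib": total - used})
    return gpus


# Made-up sample output for two 24GB cards.
sample = "18432, 24576\n512, 24576\n"
for index, gpu in enumerate(parse_gpu_memory(sample)):
    print(index, gpu)

# To read live values instead of the sample:
# import subprocess
# text = subprocess.check_output(
#     ["nvidia-smi", "--query-gpu=memory.used,memory.total",
#      "--format=csv,noheader,nounits"], text=True)
```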

How Much VRAM Should You Buy?

Budget	GPU	Runs
~$180 used	RTX 3060 12GB	7–13B Q4
~$400 used	RTX 3080 Ti 12GB	7B Q8, 13B Q5
~$650 used	RTX 3090 24GB	27B Q5, any 7–13B
~$1,500 new	RTX 4090 24GB	27B Q5, fastest 7–13B
~$1,400 used	2× RTX 3090 NVLink	70B Q4

The RTX 3090 at ~$650 is the current sweet spot — 24GB fits nearly everything except 70B, and NVLink support means you can add a second one later for 70B capability.

Next Steps

VRAM Calculator — calculate requirements for your exact setup
Quant Picker — find the right quantization for your VRAM
Hardware Advisor — get a specific GPU recommendation
What is Quantization — understand the quality tradeoffs