Quantization Explained for Beginners
Quantization is how you fit a large AI model into a small GPU. It compresses the model weights by reducing numerical precision — trading a small amount of quality for dramatically lower memory usage and faster inference. This guide explains what it means in practice.
The Problem Quantization Solves
A 70B model at full precision (FP16) requires 140GB of VRAM. No consumer GPU has that. Quantization compresses it to 40GB (Q4_K_M) — small enough for two RTX 3090s.
The compression works by reducing how precisely each weight is stored:
- FP16 — 16 bits per weight, full precision
- Q8_0 — 8 bits per weight, ~0.5× the size
- Q4_K_M — ~4.8 bits per weight, ~0.25× the size
- Q2_K — ~2.6 bits per weight, ~0.13× the size
What the Letters Mean
GGUF quantization names follow a pattern: Q[bits]_[type][variant]
Q4 — 4-bit quantization (approximately) K — K-quant: uses mixed precision, preserving quality in critical layers M — Medium block size (larger = slightly better quality, more VRAM) S — Small block size (smaller, slightly lower quality)
So Q4_K_M = 4-bit mixed-precision quantization with medium block size.
IQ formats (IQ3_M, IQ4_NL) use importance matrices — they identify which weights matter most and quantize them at higher precision. Generally better quality than plain Q formats at the same size.
Quality and Size at Each Level
For a 70B model:
| Format | Size | Quality | Use When |
|---|---|---|---|
| F16 | 140GB | 100% | You have datacenter hardware |
| Q8_0 | 70GB | 99% | 80GB+ VRAM available |
| Q6_K | 54GB | 98% | 60GB+ VRAM |
| Q5_K_M | 47GB | 96% | 56GB+ VRAM |
| Q4_K_M | 40GB | 92% | 48GB VRAM (dual 3090) |
| IQ3_M | 32GB | 87% | Tight 40GB systems |
| Q2_K | 22GB | 65% | Last resort only |
For a 7B model:
| Format | Size | Quality | Use When |
|---|---|---|---|
| F16 | 14GB | 100% | 16GB+ VRAM available |
| Q8_0 | 7.5GB | 99% | 8GB VRAM |
| Q6_K | 5.5GB | 98% | 6–8GB VRAM, quality priority |
| Q5_K_M | 5GB | 96% | 6GB VRAM, balanced |
| Q4_K_M | 4.2GB | 92% | 4–6GB VRAM |
| Q2_K | 2.5GB | 65% | 4GB only, last resort |
The Quality Difference in Practice
At Q4_K_M the model scores about 92% of F16 on standard benchmarks. In everyday use, the difference is often imperceptible — for coding, writing, and general Q&A you typically cannot tell the difference between Q4_K_M and Q8_0 outputs.
The gap becomes more noticeable at:
- Q3 and below — outputs can drift on complex reasoning
- Very long responses — quality degradation accumulates
- Precise mathematical calculations — rounding errors compound
For most use cases, Q4_K_M is the right default. If you have the VRAM headroom, Q5_K_M or Q6_K is a meaningful quality upgrade.
The Quick Rule
Use the highest quant that comfortably fits in your VRAM.
Leave about 2GB headroom for KV cache and overhead. If your GPU has 12GB, target models that use ~10GB maximum.
How Ollama Handles This
Ollama picks the quantization automatically based on your available VRAM. When you run ollama run llama3.1:8b, it pulls the Q4_K_M variant by default because it fits the widest range of hardware.
To specify a quant manually:
# Near-lossless
ollama run llama3.1:8b:q8_0
# Smaller/faster
ollama run llama3.1:8b:q4_0
# Default (Q4_K_M)
ollama run llama3.1:8b
Downloading Specific Quants from HuggingFace
If you're using ExLlamaV2 or llama.cpp directly, you download GGUF files from HuggingFace. Search for [model name] GGUF — bartowski is the most reliable quantizer with full quant packs.
File naming example:
Llama-3.1-8B-Instruct-Q4_K_M.gguf ← what you want for 6GB GPU
Llama-3.1-8B-Instruct-Q8_0.gguf ← what you want for 8GB GPU
Llama-3.1-8B-Instruct-Q5_K_M.gguf ← middle ground
Common Mistakes
Using Q2_K when Q4 would fit — Q2 quality is significantly degraded. Always use the highest quant that fits.
Not accounting for KV cache — the model file size is not the total VRAM used. A 4.7GB model file might use 5.5GB total with KV cache at 4K context.
Downloading Q8 when you only have 8GB — Q8 of an 8B model (~7.5GB) barely fits an 8GB GPU. Use Q5_K_M or Q6_K instead for headroom.
Next Steps
- VRAM Calculator — calculate exact requirements including KV cache
- Quant Picker — 3-question wizard to find your format
- Quantization Guide — full technical reference with perplexity scores