DefiledAI Research
QUANTIZATION GUIDE
GGUF quantization lets you run large models on consumer hardware by reducing weight precision. This guide covers the main formats, their quality tradeoffs, and how to choose the right one for your hardware.
WHAT IS QUANTIZATION?
Neural network weights are typically stored as 16-bit or 32-bit floating-point numbers. Quantization reduces each weight to fewer bits, trading a small amount of model quality for much lower VRAM usage. Because token generation is usually limited by memory bandwidth, smaller weights also tend to mean faster inference.
The GGUF format (used by llama.cpp and tools built on it, such as Ollama and LM Studio) supports a wide range of quantization levels. K-quants (Q4_K_M, Q5_K_M, etc.) use a mixed-precision approach that keeps more bits for the tensors most sensitive to quantization error.
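To make the idea concrete, here is a minimal sketch of block-wise 8-bit quantization in the style of Q8_0: weights are grouped into blocks of 32, and each block stores one scale plus 32 small integers. The function names are hypothetical, and the scale is kept as a regular Python float for simplicity, whereas llama.cpp stores it as a 16-bit float. Treat it as an illustration of the principle, not as the llama.cpp implementation.

```python
import math

BLOCK_SIZE = 32  # weights per block, matching the Q8_0 block layout


def quantize_block(weights):
    """Quantize one block of floats to (scale, list of ints in [-127, 127])."""
    amax = max(abs(w) for w in weights)
    # One scale per block maps the largest magnitude onto 127.
    scale = amax / 127.0 if amax > 0 else 1.0
    quants = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * q for q in quants]


# Demo on synthetic "weights" (illustrative values, not from a real model).
weights = [math.sin(i) * 0.1 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost: a 2-byte scale plus 32 one-byte integers per block.
bits_per_weight = (2 + BLOCK_SIZE) * 8 / BLOCK_SIZE

print(f"max reconstruction error: {max_error:.6f}")
print(f"effective bits per weight: {bits_per_weight}")
```

The per-block scale is why the effective size of Q8_0 is 8.5 bits per weight rather than exactly 8. Lower-bit formats follow the same pattern with fewer bits per integer and more elaborate scale layouts.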
Format Comparison
| Format | Bits (nominal) | VRAM vs F16 | Quality (relative, 0-100) | Speed (relative, 0-100) | Notes |
|---|---|---|---|---|---|
| F16 | 16 | 1.0× | 100 | 60 | Highest fidelity. Only viable for small models on high-VRAM cards. |
| Q8_0 | 8 | 0.5× | 99 | 75 | Near-lossless. Best quality/size tradeoff for models that fit in VRAM. |
| Q6_K | 6 | 0.38× | 98 | 82 | Excellent quality with meaningful VRAM savings. Recommended for 13B. |
| Q5_K_M | 5 | 0.31× | 96 | 88 | Strong quality. Good default for 7-13B models when VRAM is limited. |
| Q4_K_M | 4 | 0.25× | 92 | 95 | Most popular. Best balance of quality, speed, and VRAM for 70B class. |
| Q3_K_M | 3 | 0.19× | 83 | 100 | Noticeable quality degradation. Use only when VRAM is severely constrained. |
| IQ3_M | ~3.5 | 0.22× | 87 | 92 | I-quant format, typically built with an importance matrix. Better quality than Q3_K_M at similar size. |
| Q2_K | 2 | 0.13× | 65 | 100 | Severe quality loss. Last resort for fitting very large models on limited VRAM. |
| IQ1_M | ~1.5 | 0.09× | 45 | 100 | Extreme compression. Only useful for 405B/671B models on consumer hardware. |
VRAM Requirements by Model Size
| Model | F16 | Q8_0 | Q5_K_M | Q4_K_M | Q2_K |
|---|---|---|---|---|---|
| 7B | 14GB | 7GB | 5GB | 4GB | 2.5GB |
| 13B | 26GB | 13GB | 9GB | 7GB | 4GB |
| 30B | 60GB | 30GB | 22GB | 17GB | 9GB |
| 70B | 140GB | 70GB | 48GB | 40GB | 22GB |
| 405B | 810GB | 405GB | 280GB | 220GB | 110GB |
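The F16 and Q8_0 columns follow directly from parameter count times bits per weight. The sketch below shows that arithmetic; `weight_gb` is a hypothetical helper, and it assumes 1 GB = 10^9 bytes and counts weights only, with no KV cache or runtime buffers.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


for name, params in [("7B", 7), ("13B", 13), ("70B", 70), ("405B", 405)]:
    f16 = weight_gb(params, 16)
    q8 = weight_gb(params, 8)
    q4_nominal = weight_gb(params, 4)
    print(f"{name}: F16={f16}GB  Q8_0={q8}GB  4-bit nominal={q4_nominal}GB")
```

For 70B this gives 35GB at a nominal 4 bits, while the table lists 40GB for Q4_K_M. The gap is expected: K-quants store per-block scales and keep some tensors at higher precision, so their effective bits per weight is above the nominal figure. Budget extra VRAM on top of any of these numbers for context and runtime overhead.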
QUICK RECOMMENDATION
Consumer GPU (≤24GB)
Use Q4_K_M for 7-13B models. A 70B model at Q4_K_M (about 40GB) will not fit on a single card, so you'll need a second GPU. NVLink is optional, not required.
Dual GPU (48GB)
Run 70B at Q4_K_M (about 40GB) comfortably. Q5_K_M gives better quality but needs about 48GB for the weights alone, which leaves no headroom on a 48GB setup.
Quality Priority
Always use the highest-precision quant that fits in VRAM. Prefer Q6_K or Q8_0 for smaller models if VRAM allows.
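The "highest quant that fits" rule can be sketched as a simple lookup over the VRAM table above. The `pick_quant` function and its `headroom_gb` parameter are hypothetical. The 2GB default headroom for KV cache and runtime buffers is an assumption, not a measured value, and should be raised for long contexts.

```python
# GB figures copied from the "VRAM Requirements by Model Size" table.
VRAM_GB = {
    "7B":   {"F16": 14,  "Q8_0": 7,   "Q5_K_M": 5,   "Q4_K_M": 4,   "Q2_K": 2.5},
    "13B":  {"F16": 26,  "Q8_0": 13,  "Q5_K_M": 9,   "Q4_K_M": 7,   "Q2_K": 4},
    "30B":  {"F16": 60,  "Q8_0": 30,  "Q5_K_M": 22,  "Q4_K_M": 17,  "Q2_K": 9},
    "70B":  {"F16": 140, "Q8_0": 70,  "Q5_K_M": 48,  "Q4_K_M": 40,  "Q2_K": 22},
    "405B": {"F16": 810, "Q8_0": 405, "Q5_K_M": 280, "Q4_K_M": 220, "Q2_K": 110},
}

# Formats ordered from highest to lowest precision.
PREFERENCE = ["F16", "Q8_0", "Q5_K_M", "Q4_K_M", "Q2_K"]


def pick_quant(model, vram_gb, headroom_gb=2.0):
    """Return the highest-precision format that fits, or None if nothing does.

    headroom_gb is an assumed allowance for KV cache and runtime buffers.
    """
    budget = vram_gb - headroom_gb
    for fmt in PREFERENCE:
        if VRAM_GB[model][fmt] <= budget:
            return fmt
    return None


print(pick_quant("13B", 24))   # single 24GB card
print(pick_quant("70B", 48))   # dual-GPU setup
print(pick_quant("405B", 24))  # nothing in the table fits
```

Because this rule optimizes for quality alone, it can pick a larger format than the speed-oriented defaults suggested for consumer GPUs. Lower the result by one step if generation speed matters more than fidelity.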