DefiledAI Research

QUANTIZATION GUIDE

GGUF quantization lets you run large models on consumer hardware by reducing weight precision. This guide covers every major format, their quality tradeoffs, and how to choose the right one for your hardware.

WHAT IS QUANTIZATION?

Neural network weights are typically stored as 16-bit or 32-bit floating point numbers. Quantization reduces each weight to fewer bits — trading a small amount of model quality for dramatically lower VRAM usage and faster inference.

The GGUF format (used by llama.cpp, Ollama, LM Studio, and ExLlamaV2) supports a wide range of quantization levels. K-quants (Q4_K_M, Q5_K_M etc.) use a mixed-precision approach that preserves quality in the most important layers.

Format Comparison

Format	Bits	VRAM vs F16	Quality	Speed	Notes
F16	16	1.0×	100	60	Highest fidelity. Only viable for small models on high-VRAM cards.
Q8_0	8	0.5×	99	75	Near-lossless. Best quality/size tradeoff for models that fit in VRAM.
Q6_K	6	0.38×	98	82	Excellent quality with meaningful VRAM savings. Recommended for 13B.
Q5_K_M	5	0.31×	96	88	Strong quality. Good default for 7-13B models when VRAM is limited.
Q4_K_M	4	0.25×	92	95	Most popular. Best balance of quality, speed, and VRAM for 70B class.
Q3_K_M	3	0.19×	83	100	Noticeable quality degradation. Use only when VRAM is severely constrained.
IQ3_M	~3.5	0.22×	87	92	Importance-matrix quantization. Better quality than Q3_K_M at similar size.
Q2_K	2	0.13×	65	100	Severe quality loss. Last resort for fitting very large models on limited VRAM.
IQ1_M	~1.5	0.09×	45	100	Extreme compression. Only useful for 405B/671B models on consumer hardware.

VRAM Requirements by Model Size

Model	F16	Q8_0	Q5_K_M	Q4_K_M	Q2_K
7B	14GB	7GB	5GB	4GB	2.5GB
13B	26GB	13GB	9GB	7GB	4GB
30B	60GB	30GB	22GB	17GB	9GB
70B	140GB	70GB	48GB	40GB	22GB
405B	810GB	405GB	280GB	220GB	110GB

QUICK RECOMMENDATION

Consumer GPU (≤24GB)

Use Q4_K_M for 7-13B models. For 70B you'll need dual GPUs or NVLink.

Dual GPU (48GB)

Run 70B at Q4_K_M comfortably. Q5_K_M if you want better quality at 56GB.

Quality Priority

Always use the highest quant that fits. Q6_K or Q8_0 for smaller models if VRAM allows.