Q4_K_M vs IQ3_M: Quantization Quality Analysis

The choice between Q4_K_M and IQ3_M is one of the most common decisions in local AI inference. Both are popular for 70B-class models, where VRAM is the binding constraint, but they take fundamentally different approaches to compression. This analysis covers the technical differences, measured quality loss, and practical guidance.

What Each Format Actually Does

Q4_K_M is a K-quant format that quantizes most weights to 4 bits using block-wise scaling: weights are grouped into small blocks, and each block stores its own scale and minimum. The "K" denotes the K-quant family, which groups blocks into super-blocks whose scales are themselves stored compactly. The "M" denotes the medium mix, a mixed-precision recipe that keeps more of the sensitive tensors at higher precision than the smaller "S" variant does, trading a little size for quality.
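To make block-wise scaling concrete, here is a minimal sketch in plain Python. It is a simplification for illustration, not the actual Q4_K memory layout: the block size of 32, the function names, and the float-valued scale and minimum are all assumptions, and the mixed-precision part of the format is not modeled.

```python
import random


def quantize_block(weights, bits=4):
    """Quantize one block to unsigned codes with a per-block scale and minimum."""
    levels = (1 << bits) - 1                      # 15 steps for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Reconstruct approximate weights from codes, scale, and minimum."""
    return [lo + c * scale for c in codes]


def quantize_tensor(weights, block_size=32, bits=4):
    """Split a flat weight list into blocks and quantize each independently."""
    return [
        quantize_block(weights[start:start + block_size], bits)
        for start in range(0, len(weights), block_size)
    ]


def dequantize_tensor(blocks):
    restored = []
    for codes, scale, lo in blocks:
        restored.extend(dequantize_block(codes, scale, lo))
    return restored


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]
restored = dequantize_tensor(quantize_tensor(weights))
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs reconstruction error: {max_err:.6f}")
```

Because every block carries its own scale, an outlier only degrades the precision of the 32 weights around it rather than the whole tensor, which is the main reason block-wise schemes hold up at 4 bits.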

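IQ3_M is an i-quant format. Rather than rounding each weight to a uniform 3-bit grid, i-quants encode small groups of weights as entries in a fixed codebook, and they are normally produced with an importance matrix: a calibration dataset is used to estimate which weights matter most, and the quantizer prioritizes minimizing error on those weights. As with Q4_K_M, the "M" denotes a medium mix that keeps some tensors at higher precision. The result is approximately 3.5-bit average precision with quality that punches above its size.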
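One way to picture how calibration-derived importance can steer quantization is an importance-weighted scale search, sketched below. This is an illustration only and does not reproduce the IQ3_M encoder: the symmetric 7-level grid, the grid search over scales, the function names, and the sample importance values are all hypothetical.

```python
def weighted_error(weights, importance, scale, max_code=3):
    """Importance-weighted squared error after rounding to a symmetric integer grid."""
    total = 0.0
    for w, imp in zip(weights, importance):
        code = max(-max_code, min(max_code, round(w / scale)))
        total += imp * (w - code * scale) ** 2
    return total


def best_scale(weights, importance, max_code=3, steps=64):
    """Grid-search the block scale that minimizes the importance-weighted error."""
    peak = max(abs(w) for w in weights)
    if peak == 0.0:
        return 1.0
    base = peak / max_code
    # Try scales from 0.5x to 1.5x of the naive "fit the largest weight" scale.
    candidates = [base * (0.5 + i / steps) for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, importance, s, max_code),
    )


block = [0.90, -0.31, 0.12, 0.55, -0.74, 0.08, 0.27, -0.49]
uniform = [1.0] * len(block)
# Hypothetical calibration result: the 2nd and 6th weights matter most.
calibrated = [1.0, 8.0, 1.0, 1.0, 1.0, 8.0, 1.0, 1.0]

plain = best_scale(block, uniform)
tuned = best_scale(block, calibrated)
print("weighted error, importance ignored:", weighted_error(block, calibrated, plain))
print("weighted error, importance used:   ", weighted_error(block, calibrated, tuned))
```

The bit budget is identical in both cases. The only change is which rounding errors the quantizer is willing to accept, which is why calibrated formats can beat uncalibrated ones of similar size.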

Size Comparison on 70B Models

Format	Size (70B)	Min. VRAM (weights only)	vs F16
F16	140GB	140GB	baseline
Q8_0	70GB	70GB	0.50×
Q4_K_M	40GB	40GB	0.29×
IQ3_M	31GB	31GB	0.22×
Q2_K	26GB	26GB	0.19×

IQ3_M saves roughly 9GB versus Q4_K_M on a 70B model. That matters under a 40GB VRAM ceiling, where Q4_K_M's weights alone consume the entire budget while IQ3_M leaves about 9GB for KV cache and runtime overhead.
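The size figures can be cross-checked with simple arithmetic. The sketch below converts each file size from the table into an effective bits-per-weight figure and a ratio against F16. It assumes exactly 70e9 parameters and decimal gigabytes, both of which are simplifications.

```python
PARAMS = 70e9  # assumed parameter count for a "70B" model

# File sizes in GB, taken from the table above.
SIZES_GB = {"F16": 140, "Q8_0": 70, "Q4_K_M": 40, "IQ3_M": 31, "Q2_K": 26}


def bits_per_weight(size_gb, params=PARAMS):
    """Effective average bits stored per parameter for a given file size."""
    return size_gb * 1e9 * 8 / params


def ratio_vs_f16(fmt):
    """File size as a fraction of the F16 baseline."""
    return SIZES_GB[fmt] / SIZES_GB["F16"]


for fmt, size in SIZES_GB.items():
    print(f"{fmt:8s} {size:4d} GB  {bits_per_weight(size):5.2f} bpw  {ratio_vs_f16(fmt):.2f}x")

saving = SIZES_GB["Q4_K_M"] - SIZES_GB["IQ3_M"]
print(f"IQ3_M saves {saving} GB versus Q4_K_M")
```

On these numbers Q4_K_M works out to about 4.57 bits per weight and IQ3_M to about 3.54, which shows that neither format is a flat 4-bit or 3-bit encoding once scales and higher-precision tensors are counted.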

Perplexity Scores

Perplexity measures how well a model predicts text: it is the exponential of the average per-token negative log-likelihood, so lower is better. Measured on WikiText-2:

Model	Format	Perplexity	Delta vs F16
Llama 3.1 70B	F16	2.84	—
Llama 3.1 70B	Q4_K_M	2.91	+0.07
Llama 3.1 70B	IQ3_M	2.97	+0.13
Llama 3.1 70B	Q3_K_M	3.18	+0.34
Llama 3.1 70B	Q2_K	3.89	+1.05

IQ3_M lands between Q4_K_M and Q3_K_M in perplexity. Its +0.13 delta is less than half of Q3_K_M's +0.34 at a similar size, and far below Q2_K's +1.05. The gap to Q4_K_M (+0.06) is measurable but small.
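For readers who want to reproduce this kind of comparison, the sketch below shows how perplexity is computed from per-token log-probabilities and how the deltas in the table are derived. The function names are illustrative, and the log-probabilities are assumed to be natural logs supplied by whatever inference backend is in use.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Perplexity values from the table above (Llama 3.1 70B, WikiText-2).
PPL = {"F16": 2.84, "Q4_K_M": 2.91, "IQ3_M": 2.97, "Q3_K_M": 3.18, "Q2_K": 3.89}


def delta_vs_f16(fmt):
    """Absolute perplexity increase over the F16 baseline."""
    return PPL[fmt] - PPL["F16"]


def relative_increase(fmt):
    """Perplexity increase as a fraction of the F16 baseline."""
    return delta_vs_f16(fmt) / PPL["F16"]


for fmt in PPL:
    print(f"{fmt:8s} +{delta_vs_f16(fmt):.2f}  ({relative_increase(fmt):.1%})")
```

In relative terms, Q4_K_M sits about 2.5% above F16 and IQ3_M about 4.6%, which is a useful way to compare results across models whose baseline perplexities differ.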

Real-World Output Differences

Perplexity does not always predict subjective quality. In practice:

Factual recall: both formats perform similarly on straightforward factual questions. The model's knowledge base is largely intact in both.

Long-form reasoning: Q4_K_M has a noticeable edge on multi-step reasoning tasks. IQ3_M occasionally loses the thread on complex chains of logic spanning more than 5-6 steps.

Code generation: Q4_K_M produces more reliable code. IQ3_M introduces occasional syntax errors and logic gaps in longer functions. For short utility functions both are comparable.

Creative writing: virtually indistinguishable at normal output lengths. For very long-form content IQ3_M can drift slightly more.

When to Use Each

Use Q4_K_M when:

You have more than 40GB of VRAM, enough for the 40GB of weights plus KV cache
You are doing code generation or complex reasoning
Quality is the priority and you have the VRAM headroom

Use IQ3_M when:

You need to fit 70B within 32-36GB VRAM
Your use case is conversational or factual Q&A
You want more KV cache headroom on a 40GB system

Avoid IQ3_M for:

Production code generation
Long multi-step reasoning chains
Tasks where Q4_K_M fits comfortably
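The VRAM side of this guidance can be sketched as a small fit check. The helper below is a rough estimate, not a measurement: the weight sizes come from the size table, the KV cache formula assumes an FP16 cache with the Llama 3.1 70B layout (80 layers, 8 KV heads, head dimension 128), and the 1GB runtime overhead and the function names are hypothetical.

```python
# Weight sizes in GB, taken from the size comparison table.
WEIGHTS_GB = {"Q4_K_M": 40.0, "IQ3_M": 31.0}


def kv_cache_gb(context_tokens, n_layers=80, n_kv_heads=8, head_dim=128,
                bytes_per_value=2):
    """Estimate KV cache size: keys and values for every layer and token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9


def pick_format(vram_gb, context_tokens, overhead_gb=1.0):
    """Return the highest-quality format whose weights fit, or None."""
    budget = vram_gb - overhead_gb - kv_cache_gb(context_tokens)
    for fmt in ("Q4_K_M", "IQ3_M"):  # ordered from best quality to smallest
        if WEIGHTS_GB[fmt] <= budget:
            return fmt
    return None


for vram in (48, 40, 36, 32):
    print(f"{vram} GB VRAM, 8192-token context -> {pick_format(vram, 8192)}")
```

Under these assumptions an 8192-token context costs about 2.7GB, so a 40GB card lands on IQ3_M rather than Q4_K_M. A 32GB budget leaves only 1GB beyond the IQ3_M weights, so the low end of the 32-36GB range works only with a very short context or less overhead than assumed here.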

Inference Speed

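IQ3_M is marginally slower than Q4_K_M on most backends because i-quants decode weights through codebook lookups, which cost more than the scale-and-offset arithmetic of K-quants. The difference is typically in the 5-10% range with full GPU offload and tends to be larger when layers run on the CPU. Both are GGUF formats used by llama.cpp and its derivatives; ExLlamaV2 uses its own EXL2 format and does not load them.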

Bottom Line

If you have more than 40GB of VRAM, use Q4_K_M. If you need to squeeze into 32GB or want more KV cache room on a 40GB system, IQ3_M is the right call: its perplexity is far closer to F16 than Q3_K_M's, and the quality trade-off versus Q4_K_M is acceptable for most use cases. Prefer IQ3_M over Q3_K_M, since it delivers better quality at a similar size, unless its slower decoding is a problem on your hardware.