DefiledAI Tools

TRUE VRAM CALCULATOR

Most VRAM calculators count only the model weights. This one estimates the full footprint: weights, plus KV cache at your context length, plus runtime overhead. It uses the actual architectural parameters of each model.

Model Preset

Model Architecture

Parameters (B)

Layers

Hidden Size

KV Heads

Head Dim

Quantization

Good (≈1-2% loss)

Inference Parameters

Context Length (tokens)

Batch Size

Runtime Overhead (GB)

Hardware

GPU

Number of GPUs

Total VRAM Required

5.02 GB

✓ Fits with 18.98 GB headroom

Model weights: 4.02 GB

KV cache (4,096 ctx × 1 batch): 0.50 GB

Runtime overhead: 0.50 GB

Available (1× RTX 4090): 24 GB

Max Context on This Hardware

159,553 tokens

After weights + overhead, 19.48 GB remains for KV cache
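The maximum-context figure can be approximated by dividing the leftover VRAM by the KV cache cost of one token. The sketch below is illustrative only and is not the calculator's own code: the function name is hypothetical, the architecture values (32 layers, 8 KV heads, head dimension 128) are assumptions that match the 0.50 GB KV cache shown for 4,096 tokens, and "GB" is assumed to mean 2^30 bytes.

```python
GIB = 1024 ** 3  # assumption: the page's "GB" behaves like GiB (2**30 bytes)


def max_context_tokens(vram_gib, weights_gib, overhead_gib,
                       layers, kv_heads, head_dim,
                       batch=1, bytes_per_elem=2):
    """Estimate how many tokens of KV cache fit in the remaining VRAM."""
    free_bytes = (vram_gib - weights_gib - overhead_gib) * GIB
    if free_bytes <= 0:
        return 0  # weights + overhead already exceed the available VRAM
    # One token stores a key and a value vector (factor 2) in every layer.
    per_token_bytes = 2 * layers * kv_heads * head_dim * batch * bytes_per_elem
    return int(free_bytes // per_token_bytes)


if __name__ == "__main__":
    # Assumed inputs: 24 GB card, 4.02 GB weights, 0.50 GB overhead.
    print(max_context_tokens(24, 4.02, 0.5, layers=32, kv_heads=8, head_dim=128))
```

With the rounded 4.02 GB weight figure this gives about 159,580 tokens, close to the 159,553 shown above; the small gap presumably comes from the page using the unrounded weight size.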

All Quants Comparison

Quant     Weights     KV cache    Total (incl. overhead)
F16       13.41 GB    +0.50 KV    14.41 GB ✓
Q8_0       7.12 GB    +0.50 KV     8.12 GB ✓
Q6_K       5.53 GB    +0.50 KV     6.53 GB ✓
Q5_K_M     4.78 GB    +0.50 KV     5.78 GB ✓
Q4_K_M     4.02 GB    +0.50 KV     5.02 GB ✓
Q4_0       3.77 GB    +0.50 KV     4.77 GB ✓
Q3_K_M     3.27 GB    +0.50 KV     4.27 GB ✓
Q2_K       2.81 GB    +0.50 KV     3.81 GB ✓
IQ2_M      2.26 GB    +0.50 KV     3.26 GB ✓

Formula

weights = params × bits / 8   (bits = average bits per weight for the chosen quantization; result in bytes)

kv_cache = 2 × layers × kv_heads × head_dim × ctx × batch × 2 bytes

(the leading 2 covers keys and values; 2 bytes is the size of one 16-bit cache element)

total = weights + kv_cache + overhead
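The three formulas above can be sketched in Python as follows. This is a minimal illustration, not the calculator's actual source: the function names are hypothetical, and the example inputs (7.2 B parameters, 32 layers, 8 KV heads, head dimension 128) are assumptions chosen because they reproduce the figures displayed on this page. It also assumes the page's "GB" means 2^30 bytes, which is what the displayed numbers suggest.

```python
GIB = 1024 ** 3  # assumption: "GB" on the page behaves like GiB (2**30 bytes)


def weights_bytes(params_billion, bits_per_weight):
    """weights = params × bits / 8, in bytes."""
    return params_billion * 1e9 * bits_per_weight / 8


def kv_cache_bytes(layers, kv_heads, head_dim, ctx, batch=1, bytes_per_elem=2):
    """kv_cache = 2 × layers × kv_heads × head_dim × ctx × batch × bytes per element.

    The leading 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * ctx * batch * bytes_per_elem


def total_gib(params_billion, bits_per_weight, layers, kv_heads, head_dim,
              ctx, batch=1, overhead_gib=0.5):
    """total = weights + kv_cache + overhead, reported in GiB."""
    weights = weights_bytes(params_billion, bits_per_weight) / GIB
    kv = kv_cache_bytes(layers, kv_heads, head_dim, ctx, batch) / GIB
    return weights + kv + overhead_gib


if __name__ == "__main__":
    # Assumed example: a 7.2 B-parameter model at F16 with 4,096 tokens of context.
    total = total_gib(7.2, 16, layers=32, kv_heads=8, head_dim=128, ctx=4096)
    print(f"{total:.2f} GB")
```

Under those assumed inputs the result rounds to 14.41 GB, matching the F16 row in the comparison. Other quantization levels only change `bits_per_weight`; the KV cache term is unaffected by weight quantization in this model of the footprint.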