DefiledAI Research

BENCHMARK MATRIX

Real-world inference benchmarks measured on consumer and prosumer hardware. All results are sustained throughput in tokens per second (tok/s), measured after the first token so that time to first token is excluded, at default sampling settings.

Methodology
Metric: Sustained tok/s, excluding time to first token (TTFT)
Prompt: 512-token fixed input, 256-token output
Runs: 5 iterations, median reported
Driver: CUDA 12.4 / ROCm 6.1
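
The metric above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the harness used for these results: `generate` is a hypothetical callable that returns an iterator yielding one item per generated token, and `sustained_tok_per_s` and `median_throughput` are names introduced here for illustration. The clock starts when the first token arrives, so prompt processing and TTFT do not count toward the rate.

```python
import statistics
import time
from typing import Callable, Iterable, List


def sustained_tok_per_s(token_times: List[float]) -> float:
    """Tokens per second over everything after the first token.

    token_times holds the arrival time (in seconds) of each generated token.
    Measuring from the first arrival excludes time to first token.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure sustained throughput")
    elapsed = token_times[-1] - token_times[0]
    if elapsed <= 0:
        raise ValueError("token timestamps must be increasing")
    # The first token only marks the start, so it is not counted.
    return (len(token_times) - 1) / elapsed


def median_throughput(
    generate: Callable[[], Iterable[object]],  # hypothetical token stream factory
    runs: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run the generation `runs` times and report the median tok/s."""
    rates = []
    for _ in range(runs):
        # Record one timestamp as each token is yielded.
        stamps = [clock() for _ in generate()]
        rates.append(sustained_tok_per_s(stamps))
    return statistics.median(rates)
```

Reporting the median rather than the mean keeps a single slow run (for example, one affected by thermal throttling) from skewing the figure.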
Inference Results
Last updated: 2026-05-28
| Model | Quant | VRAM | Backend | GPU | Tok/s | Date |
|---|---|---|---|---|---|---|
| Llama 3.1 70B | Q4_K_M | 48GB | ExLlamaV2 | 2× RTX 3090 | 21 | 2026-05-28 |
| Llama 3.1 70B | Q5_K_M | 56GB | ExLlamaV2 | 2× RTX 3090 | 16 | 2026-05-28 |
| Qwen 3 72B | Q5_K_M | 64GB | llama.cpp | 2× RTX 3090 | 18 | 2026-05-27 |
| DeepSeek V3 | MoE Q4 | Multi-GPU | TensorRT-LLM | 4× A100 | 39 | 2026-05-26 |
| Mixtral 8x22B | Q4 | 48GB | ExLlamaV2 | 2× RTX 3090 | 27 | 2026-05-25 |
| Phi-3 Medium | Q6_K | 14GB | llama.cpp | RTX 4090 | 68 | 2026-05-24 |
| Gemma 2 27B | Q4_K_M | 18GB | llama.cpp | RTX 4090 | 44 | 2026-05-23 |
| Mistral 7B | Q8_0 | 8GB | llama.cpp | RTX 3080 | 112 | 2026-05-22 |
GPU Comparison — Llama Family (tok/s)
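| GPU | VRAM | 7B Q4 | 13B Q4 | 70B Q4 | Street Price |
|---|---|---|---|---|---|
| RTX 4090 | 24GB | 112 | 72 | OOM | $1,600 |
| RTX 3090 | 24GB | 89 | 58 | OOM | $700 |
| 2× RTX 3090 | 48GB | 94 | 61 | 21 | $1,400 |
| RTX 4080 | 16GB | 98 | 61 | OOM | $1,000 |
| RX 7900 XTX | 24GB | 71 | 44 | OOM | $800 |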
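
Because the comparison lists both throughput and street price, one way to read it is as throughput per dollar. The sketch below is an illustration only: the `GPUS` dictionary copies the figures from the table above, and the helper names (`tok_per_s_per_100_usd`, `rank`) and the "per $100" normalization are assumptions introduced here, not a metric the site reports.

```python
from typing import List, Optional, Tuple

# Figures copied from the GPU comparison table; None marks an OOM result.
GPUS = {
    "RTX 4090":    {"7B Q4": 112, "13B Q4": 72, "70B Q4": None, "price": 1600},
    "RTX 3090":    {"7B Q4": 89,  "13B Q4": 58, "70B Q4": None, "price": 700},
    "2× RTX 3090": {"7B Q4": 94,  "13B Q4": 61, "70B Q4": 21,   "price": 1400},
    "RTX 4080":    {"7B Q4": 98,  "13B Q4": 61, "70B Q4": None, "price": 1000},
    "RX 7900 XTX": {"7B Q4": 71,  "13B Q4": 44, "70B Q4": None, "price": 800},
}


def tok_per_s_per_100_usd(gpu: str, model: str) -> Optional[float]:
    """Throughput per $100 of street price, or None if the model did not fit."""
    rate = GPUS[gpu][model]
    if rate is None:
        return None
    return round(rate / GPUS[gpu]["price"] * 100, 1)


def rank(model: str) -> List[Tuple[str, float]]:
    """GPUs that can run `model`, sorted by best value first."""
    scored = [(gpu, tok_per_s_per_100_usd(gpu, model)) for gpu in GPUS]
    runnable = [(gpu, score) for gpu, score in scored if score is not None]
    return sorted(runnable, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for column in ("7B Q4", "13B Q4", "70B Q4"):
        print(column, rank(column))
```

With the table's figures, `rank("70B Q4")` returns only the 2× RTX 3090 entry, since every other configuration is OOM at that size. Street prices move quickly, so any ranking from this sketch is only as current as the price column.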
Submit your benchmark results to the forum.