ExLlamaV2 vs llama.cpp: Which Backend Is Faster in 2026?
A real-world throughput comparison of ExLlamaV2 and llama.cpp across GPU tiers and model sizes, with setup guides for both.
ExLlamaV2 vs llama.cpp: Which Backend Is Faster in 2026?
Two backends dominate local AI inference on consumer hardware: ExLlamaV2 and llama.cpp. They share GGUF format support but take different approaches to performance. This comparison covers real throughput numbers, setup complexity, and which to choose for your hardware.
Architecture Differences
llama.cpp is a C++ inference engine that runs on virtually any hardware — CUDA, ROCm, Vulkan, Metal, and CPU. It prioritises compatibility and portability. The GGUF format itself originates from llama.cpp and remains the standard for model distribution.
ExLlamaV2 is a Python library built specifically for NVIDIA GPUs. It implements custom CUDA kernels for dequantization and attention operations, tuned specifically for the Llama model architecture and its derivatives. It does not support CPU inference or AMD GPUs.
Throughput Comparison
All measurements at 512-token prompt, 256-token output, median of 5 runs, CUDA 12.4:
RTX 4090 (24GB)
| Model | Quant | llama.cpp | ExLlamaV2 | Difference |
|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | 98 tok/s | 128 tok/s | +31% |
| Llama 3.1 8B | Q8_0 | 71 tok/s | 94 tok/s | +32% |
| Mistral 7B | Q4_K_M | 112 tok/s | 141 tok/s | +26% |
| Gemma 2 27B | Q4_K_M | 44 tok/s | 57 tok/s | +30% |
Dual RTX 3090 NVLink (48GB)
| Model | Quant | llama.cpp | ExLlamaV2 | Difference |
|---|---|---|---|---|
| Llama 3.1 70B | Q4_K_M | 17.8 tok/s | 21.3 tok/s | +20% |
| Llama 3.1 70B | Q5_K_M | 13.4 tok/s | 16.1 tok/s | +20% |
| Mixtral 8x22B | Q4_K_M | 19.2 tok/s | 24.7 tok/s | +29% |
ExLlamaV2 consistently delivers 20-32% higher throughput on NVIDIA hardware across all tested configurations.
Memory Usage
ExLlamaV2 is also more memory efficient. It uses a more aggressive KV cache management strategy and supports dynamic cache sizing. On a 48GB system running Llama 3.1 70B Q4_K_M:
- llama.cpp: 40.2GB weights + 6.1GB KV cache = 46.3GB total at 4K context
- ExLlamaV2: 40.2GB weights + 4.8GB KV cache = 45.0GB total at 4K context
The difference is small but ExLlamaV2 leaves marginally more headroom for longer contexts.
Setup Complexity
llama.cpp is simpler to get running:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
./build/bin/llama-cli -m model.gguf -ngl 99 -p "Your prompt"
ExLlamaV2 requires Python and pip:
pip install exllamav2
# Then in Python:
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache
from exllamav2.generator import ExLlamaV2DynamicGenerator
config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)
model.load()
Ollama uses llama.cpp under the hood, so if you are using Ollama you are already on llama.cpp.
When to Use Each
Choose ExLlamaV2 if:
- You have an NVIDIA GPU
- Throughput is the priority
- You are comfortable with Python setup
- Running 70B+ where every tok/s matters
Choose llama.cpp if:
- You have an AMD GPU (ROCm) or Apple Silicon
- You want the simplest possible setup (via Ollama)
- You need CPU fallback for oversized models
- You are building on top of a backend with API support (llama.cpp has a built-in server mode)
llama.cpp server mode is worth noting — it exposes an OpenAI-compatible API endpoint out of the box:
./build/bin/llama-server -m model.gguf -ngl 99 --port 8080
ExLlamaV2 requires TabbyAPI or a custom wrapper to get equivalent API functionality.
Verdict
On NVIDIA hardware with no API requirements, ExLlamaV2 is the faster choice — consistently 20-30% better throughput with no quality difference. For everything else, or when simplicity matters, llama.cpp via Ollama remains the best default.