TensorRT-LLM: Maximum Throughput on NVIDIA
TensorRT-LLM is NVIDIA's production inference library. It compiles models into optimised TensorRT engines, applies kernel fusion, and uses paged KV cache for continuous batching. On an RTX 4090, it delivers 160–200 tok/s on 7B models vs ~128 tok/s for ExLlamaV2. The tradeoff is setup complexity — this is not a beginner tool.
When to Use TensorRT-LLM
Use it when:
- You need maximum throughput for production API serving
- You're serving multiple concurrent users
- You have an NVIDIA GPU (Ada or Ampere) and want every bit of performance
Don't use it for:
- Single-user interactive chat (ExLlamaV2 is simpler and close enough)
- AMD or Apple Silicon (CUDA-only)
- Quick experimentation (compile times are 10–30 minutes)
Requirements
- NVIDIA GPU (RTX 30-series or newer, Ampere/Ada recommended)
- CUDA 12.2+
- Docker (strongly recommended — dependency management is complex)
- 32GB+ system RAM
- Python 3.10+
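The version floors in this list can be checked before pulling any containers. A minimal sketch, assuming dotted version strings like the ones nvidia-smi and Python report; the helper names are illustrative and not part of TensorRT-LLM:

```python
import sys

def parse_version(text: str) -> tuple:
    """Turn '12.4' or '3.10.12' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.strip().split("."))

def meets_minimum(found: str, required: str) -> bool:
    """True if the found version is at least the required one."""
    return parse_version(found) >= parse_version(required)

# Python floor from the list above, checked against the running interpreter
python_ok = sys.version_info[:2] >= (3, 10)

# CUDA floor from the list above; "12.4" is a placeholder for what your driver reports
cuda_ok = meets_minimum("12.4", "12.2")
```

Tuple comparison avoids the classic string-comparison trap where "12.10" sorts before "12.2".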
Install via Docker (Recommended)
# Pull the TensorRT-LLM container
docker pull nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
# Run with GPU access
docker run --gpus all -it --rm \
-v /your/model/path:/models \
-p 8000:8000 \
nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 \
bash
Install via pip
If you prefer a local install without Docker:
pip install tensorrt-llm --extra-index-url https://pypi.nvidia.com
Verify:
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
Step 1 — Download a HuggingFace Model
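TensorRT-LLM works with HuggingFace-format models, not GGUF. Download the original weights (Meta's Llama repos are gated, so accept the license on the model page and run huggingface-cli login first):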
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='meta-llama/Meta-Llama-3.1-8B-Instruct',
local_dir='./hf-models/llama3.1-8b',
ignore_patterns=['*.gguf'],
)
"
For abliterated models, download the HuggingFace-format version (not GGUF):
from huggingface_hub import snapshot_download
# FailSpy's abliterated Llama 3 8B (this repo is based on Llama 3, not 3.1)
snapshot_download(
    repo_id='failspy/Llama-3-8B-Instruct-abliterated',
    local_dir='./hf-models/llama3-8b-abliterated',
)
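Before spending 10–30 minutes on a build, it can help to confirm the download really is a HuggingFace checkpoint. A heuristic sketch, assuming the usual layout of a config.json next to .safetensors or .bin weight files; the function name is hypothetical:

```python
from pathlib import Path

def looks_like_hf_checkpoint(model_dir: str) -> bool:
    """Heuristic: a HuggingFace checkpoint has config.json plus weight files."""
    root = Path(model_dir)
    has_config = (root / "config.json").is_file()
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    return has_config and has_weights
```

For example, looks_like_hf_checkpoint("./hf-models/llama3.1-8b") should be true after the download above completes, and false for a directory that only holds a .gguf file.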
Step 2 — Convert to TensorRT Engine
This is the compile step. It takes 10–30 minutes and must be repeated for each GPU/precision combination.
# Clone TensorRT-LLM examples
git clone https://github.com/NVIDIA/TensorRT-LLM
cd TensorRT-LLM/examples/llama
# Convert HF weights to TensorRT format
python convert_checkpoint.py \
--model_dir /models/hf-models/llama3.1-8b \
--output_dir /models/trt-ckpts/llama3.1-8b-fp16 \
--dtype float16
# Build the TensorRT engine
trtllm-build \
--checkpoint_dir /models/trt-ckpts/llama3.1-8b-fp16 \
--output_dir /models/trt-engines/llama3.1-8b-fp16 \
--gemm_plugin float16 \
--max_batch_size 8 \
--max_input_len 2048 \
--max_seq_len 4096
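The --max_batch_size and --max_seq_len values bound how much KV cache the engine may need at runtime. A back-of-the-envelope sketch, assuming the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128) and an FP16 cache; the runtime's paged allocation will differ in detail:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    # Factor of 2: one key vector and one value vector per layer
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def worst_case_kv_gib(max_batch_size: int, max_seq_len: int,
                      per_token_bytes: int) -> float:
    # Every sequence in the batch filled to the maximum length
    return max_batch_size * max_seq_len * per_token_bytes / 2**30

per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
worst_case = worst_case_kv_gib(8, 4096, per_token)
print(f"{per_token / 1024:.0f} KiB per token, {worst_case:.1f} GiB worst case")
# prints "128 KiB per token, 4.0 GiB worst case"
```

Halving either build limit halves the worst case, which is why these two flags are the first thing to lower when VRAM is tight.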
Step 3 — INT8/INT4 Quantization (Optional)
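For a smaller VRAM footprint with near-FP16 quality, quantize the weights only. The commands below use INT8; passing int4 to --weight_only_precision selects INT4 instead.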
# INT8 weight-only quantization
python convert_checkpoint.py \
--model_dir /models/hf-models/llama3.1-8b \
--output_dir /models/trt-ckpts/llama3.1-8b-int8 \
--dtype float16 \
--use_weight_only \
--weight_only_precision int8
trtllm-build \
--checkpoint_dir /models/trt-ckpts/llama3.1-8b-int8 \
--output_dir /models/trt-engines/llama3.1-8b-int8 \
--gemm_plugin float16 \
--max_batch_size 8 \
--max_input_len 2048 \
--max_seq_len 4096
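The VRAM saving from weight-only quantization follows directly from bytes per weight. A rough sketch, assuming a round 8 billion parameters and ignoring quantization scales, activations, and KV cache:

```python
# Storage cost per weight for each precision discussed in this step
BYTES_PER_WEIGHT = {"float16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_footprint_gb(params_billion: float, precision: str) -> float:
    """Approximate weight storage in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_WEIGHT[precision]

for precision in BYTES_PER_WEIGHT:
    print(f"{precision}: ~{weight_footprint_gb(8.0, precision):.0f} GB of weights")
```

Real usage is higher than these figures because the KV cache and runtime buffers sit on top of the weights.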
Step 4 — Run Inference
Direct Python inference:
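import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer
ENGINE_DIR = "/models/trt-engines/llama3.1-8b-fp16"
TOKENIZER = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
# Load the serialized engine built in Step 2 onto GPU 0
runner = ModelRunner.from_dir(ENGINE_DIR, rank=0)
def generate(prompt: str, max_tokens: int = 200) -> str:
    # The prompt already starts with <|begin_of_text|>, so skip the tokenizer's own BOS
    input_ids = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False)[0].int()
    outputs = runner.generate(
        batch_input_ids=[input_ids],
        max_new_tokens=max_tokens,
        # Assumption: the tokenizer's EOS id doubles as the pad id, a common convention
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,
        temperature=0.7,
        top_p=0.9,
    )
    # outputs is indexed [batch][beam] and includes the prompt tokens, so slice them off
    return tokenizer.decode(outputs[0][0][len(input_ids):], skip_special_tokens=True)
result = generate("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nExplain TensorRT-LLM in one paragraph<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")
print(result)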
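Hand-writing the Llama 3 chat markers is error-prone once a conversation has more than one turn. A small sketch that assembles the same prompt string from a message list; build_llama3_prompt is a hypothetical helper, and the tokenizer's own apply_chat_template method in transformers is the more robust route when it is available:

```python
def build_llama3_prompt(messages: list) -> str:
    """Assemble a Llama 3 chat prompt from [{'role': ..., 'content': ...}, ...]."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Open an assistant turn so the model continues from here
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = build_llama3_prompt(
    [{"role": "user", "content": "Explain TensorRT-LLM in one paragraph"}]
)
```

The resulting prompt is identical to the literal string passed to generate() above.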
Step 5 — Serve with Triton (OpenAI-Compatible API)
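For production serving, expose the engine through Triton Inference Server. Note that the generate endpoint tested below is Triton's own HTTP API; clients that expect OpenAI-style /v1/chat/completions routes need an OpenAI-compatible frontend or proxy in front of it.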
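# Prepare Triton model repository
# (prepare_triton.py may not exist in your TensorRT-LLM checkout; if it is missing, the
# tensorrtllm_backend repo's all_models/inflight_batcher_llm templates and
# tools/fill_template.py are the documented way to build the repository)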
python TensorRT-LLM/examples/llama/prepare_triton.py \
--engine_dir /models/trt-engines/llama3.1-8b-fp16 \
--model_repo /models/triton-repo \
--tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
# Start Triton server
tritonserver \
--model-repository /models/triton-repo \
--grpc-port 8001 \
--http-port 8000 \
--log-verbose 0
Test the API:
curl http://localhost:8000/v2/models/ensemble/generate \
-d '{
"text_input": "What is TensorRT-LLM?",
"max_tokens": 200,
"bad_words": "",
"stop_words": ""
}'
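The same request can be issued from Python with only the standard library. A minimal sketch mirroring the curl call; the text_output response field is an assumption about the ensemble's output name, so check your model's config if it comes back empty:

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:8000/v2/models/ensemble/generate"

def build_generate_request(prompt: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
    }
    return urllib.request.Request(
        GENERATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt: str, max_tokens: int = 200) -> str:
    """Send the request to a running Triton server and return the generated text."""
    request = build_generate_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read())
    # Assumption: generated text is returned under "text_output"
    return body.get("text_output", "")
```

Splitting request construction from sending keeps the payload easy to inspect or unit-test without a live server.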
Performance Comparison
On RTX 4090, Llama 3.1 8B Q4/INT8:
| Backend | Tok/s | Setup Time | VRAM |
|---|---|---|---|
| TensorRT-LLM INT8 | 165 | 30 min | 8GB |
| ExLlamaV2 Q4_K_M | 128 | 5 min | 5.5GB |
| Ollama Q4_K_M | 98 | 1 min | 5.5GB |
| llama.cpp CUDA | 91 | 5 min | 5.5GB |
TensorRT-LLM wins on throughput but loses on setup simplicity. For production serving of multiple simultaneous users, the throughput advantage compounds: if the single-stream rates above held under load, 4 concurrent users at 165 tok/s each versus 4 at 91 tok/s would be 660 versus 364 tok/s in aggregate.
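To put the table in relative terms, the ratios can be computed directly. A short sketch using only the tok/s figures from the table above; the 200-token response length is an arbitrary example:

```python
# Tok/s figures copied from the comparison table
tok_per_s = {
    "TensorRT-LLM INT8": 165,
    "ExLlamaV2 Q4_K_M": 128,
    "Ollama Q4_K_M": 98,
    "llama.cpp CUDA": 91,
}

baseline = tok_per_s["TensorRT-LLM INT8"]

# How many times faster TensorRT-LLM is than each backend
speedup = {name: round(baseline / rate, 2) for name, rate in tok_per_s.items()}

# Wall-clock seconds to stream a 200-token reply at each rate
seconds_per_reply = {name: round(200 / rate, 2) for name, rate in tok_per_s.items()}

for name in tok_per_s:
    print(f"{name}: {speedup[name]}x slower than baseline rate, {seconds_per_reply[name]} s per reply")
```

Against llama.cpp the gap is about 1.8x, while against ExLlamaV2 it is closer to 1.3x, which is why the simpler backend remains reasonable for single-user chat.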
Multi-GPU with Tensor Parallelism
# Build engine for 2 GPUs with tensor parallelism
python convert_checkpoint.py \
--model_dir /models/hf-models/llama3.1-70b \
--output_dir /models/trt-ckpts/llama3.1-70b-tp2 \
--dtype float16 \
--tp_size 2 # Tensor parallel across 2 GPUs
trtllm-build \
--checkpoint_dir /models/trt-ckpts/llama3.1-70b-tp2 \
--output_dir /models/trt-engines/llama3.1-70b-tp2 \
--gemm_plugin float16 \
--max_batch_size 4 \
--max_input_len 2048 \
--max_seq_len 4096 \
--workers 2
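The --tp_size flag splits each layer's weight matrices across GPUs. A toy sketch of the column-parallel case in plain Python, with lists standing in for GPU tensors; it illustrates the general technique and is not TensorRT-LLM's actual sharding code:

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, tp_size):
    """Give each rank a contiguous slice of w's columns."""
    cols = len(w[0])
    assert cols % tp_size == 0, "columns must divide evenly across ranks"
    width = cols // tp_size
    return [[row[r * width:(r + 1) * width] for row in w] for r in range(tp_size)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, tp_size=2)           # one weight shard per "GPU"
partials = [matmul(x, shard) for shard in shards]
gathered = [value for part in partials for value in part]  # all-gather step

print(gathered == matmul(x, w))  # prints True
```

Each rank holds half the weights and does half the arithmetic; the price is the gather step, which on real hardware is inter-GPU communication.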
Rebuilding After Changes
Any change to the model, quantization, batch size, or sequence length requires a full rebuild. Keep a script:
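#!/bin/bash
# rebuild.sh: run from TensorRT-LLM/examples/llama
# usage: ./rebuild.sh <hf_model_dir> <engine_output_dir>
set -euo pipefail
if [ "$#" -ne 2 ]; then
    echo "usage: $0 <hf_model_dir> <engine_output_dir>" >&2
    exit 1
fi
MODEL_DIR="$1"
ENGINE_DIR="$2"
# Fresh scratch directory so stale checkpoints never leak into a new build
CKPT_DIR="$(mktemp -d)"
echo "Converting checkpoint..."
python convert_checkpoint.py \
    --model_dir "$MODEL_DIR" \
    --output_dir "$CKPT_DIR" \
    --dtype float16
echo "Building engine..."
trtllm-build \
    --checkpoint_dir "$CKPT_DIR" \
    --output_dir "$ENGINE_DIR" \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 2048 \
    --max_seq_len 4096
echo "Done: $ENGINE_DIR"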
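Because every build parameter is baked into the engine, one way to know whether a rebuild is needed is to fingerprint the configuration and store it in the engine directory name. A sketch under that assumption; build_fingerprint and the naming scheme are hypothetical, not a TensorRT-LLM feature:

```python
import hashlib
import json

def build_fingerprint(model_dir: str, dtype: str, max_batch_size: int,
                      max_input_len: int, max_seq_len: int, gpu_name: str) -> str:
    """Short, stable hash of everything that forces a rebuild when it changes."""
    config = {
        "model_dir": model_dir,
        "dtype": dtype,
        "max_batch_size": max_batch_size,
        "max_input_len": max_input_len,
        "max_seq_len": max_seq_len,
        "gpu_name": gpu_name,
    }
    # sort_keys makes the JSON text, and therefore the hash, order-independent
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

fingerprint = build_fingerprint(
    "/models/hf-models/llama3.1-8b", "float16", 8, 2048, 4096, "RTX 4090"
)
engine_dir = f"/models/trt-engines/llama3.1-8b-{fingerprint}"
```

If a directory with the current fingerprint already exists, the engine matches the requested configuration and the build can be skipped.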
Troubleshooting
Out of memory during build
Reduce --max_batch_size. Build uses more memory than inference.
Slow build times
Normal — 10–30 minutes on first build per model/config. Subsequent builds of the same model are faster due to caching.
CUDA version mismatch
TensorRT-LLM is tightly coupled to CUDA version. Use the Docker container to avoid version conflicts.
Engine not loading
Engines are GPU-architecture specific. An engine built on Ada (RTX 40-series) will not run on Ampere (RTX 30-series) and vice versa.
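A guard that compares the build GPU with the current GPU can turn this failure into a clear message. A sketch assuming the safe rule is an exact compute-capability match; the table lists a few well-known values, and the function names are illustrative:

```python
# Compute capability (major, minor) for some common NVIDIA GPUs
ARCH_BY_COMPUTE_CAPABILITY = {
    (8, 0): "Ampere",   # A100
    (8, 6): "Ampere",   # RTX 30-series
    (8, 9): "Ada",      # RTX 40-series
    (9, 0): "Hopper",   # H100
}

def arch_name(capability: tuple) -> str:
    return ARCH_BY_COMPUTE_CAPABILITY.get(capability, "unknown")

def engine_is_compatible(built_on: tuple, running_on: tuple) -> bool:
    """Conservative check: only trust an engine on the capability it was built for."""
    return built_on == running_on

built_on, running_on = (8, 9), (8, 6)
if not engine_is_compatible(built_on, running_on):
    print(f"Engine built for {arch_name(built_on)} {built_on}, "
          f"but this GPU is {arch_name(running_on)} {running_on}: rebuild required")
```

With PyTorch installed, torch.cuda.get_device_capability() returns the current GPU's tuple; recording that value next to the engine at build time gives the built_on side of the comparison.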
Next Steps
- CUDA Optimization — maximize throughput on your specific GPU
- Inference Profiler — benchmark TensorRT vs other backends
- Dual GPU NVLink Setup — tensor parallel across consumer GPUs