Dual GPU NVLink Setup for 70B Local Inference
A single consumer GPU tops out at 24GB VRAM — not enough for 70B Q4_K_M (40GB). Dual RTX 3090 with NVLink bridges the gap: two 24GB cards become a single 48GB pool, enough for 70B Q4_K_M with 8GB headroom for KV cache.
What NVLink Does
Without NVLink, splitting a model across two GPUs routes data through PCIe and system RAM. This creates a bottleneck — inference slows to ~5 tok/s on 70B.
With NVLink, the two GPUs share a 600 GB/s bidirectional interconnect and the OS sees them as one 48GB device. Inference runs at ~21 tok/s — usable for interactive chat.
Hardware Requirements
Two RTX 3090s — must be the same brand (both ASUS, both EVGA, both Founders Edition). Mismatched cooling can cause thermal issues. Founders Edition cards are ideal.
NVLink Bridge — the physical bridge connecting the two cards. RTX 3090 uses a 3-slot bridge for most AIB cards, 4-slot for Founders Edition. Check your specific card's spacing.
Motherboard — needs two PCIe x16 slots running at full x16 electrical bandwidth. Many boards have x16/x4 — the x4 slot kills bandwidth. Check your board's spec sheet carefully.
PSU — two RTX 3090s draw up to 700W combined. Add CPU, drives, overhead: 1200W minimum, 1600W recommended. Use a single PSU, not two.
Case — full tower. The NVLink bridge occupies the space between the cards, which must be in adjacent slots. You need 7–8 expansion slots of space.
Building the System
Step 1 — Install cards
Install both RTX 3090s in the two x16 slots. They should be adjacent (no gap between them). Tighten the screws — NVLink bridges add lateral force.
Step 2 — Install the NVLink bridge
With the PC powered off, press the NVLink bridge onto the gold connectors on top of both cards. It clicks into place. Don't force it — if it doesn't seat smoothly, check you have the right bridge for your card spacing.
Step 3 — Connect power
Each RTX 3090 needs two 8-pin (or 16-pin on newer cards) PCIe power connectors. Use separate cables from the PSU to each card — don't daisy chain.
Driver and Software Setup
Verify NVLink detection:
# Linux
nvidia-smi nvlink --status
# Should show: Link 0: Active for both GPUs
# Windows PowerShell
nvidia-smi nvlink --status
Check unified memory pool:
nvidia-smi topo -m
# Should show NV2 or NV4 connection between GPU 0 and GPU 1
# NV2 = 2 NVLink lanes, NV4 = 4 NVLink lanes (RTX 3090 = NV2)
Verify VRAM:
nvidia-smi --query-gpu=name,memory.total --format=csv
# Should show two entries, each 24576 MiB
Running 70B Models
Ollama — detects NVLink automatically, no configuration needed:
ollama pull llama3.1:70b
ollama run llama3.1:70b
# Ollama uses the full 48GB pool automatically
ExLlamaV2:
python test_inference.py \
-m /path/to/Meta-Llama-3.1-70B-Q4_K_M.gguf \
-gs 24,24 \
-t 300
llama.cpp:
./build/bin/llama-server \
-m /path/to/Meta-Llama-3.1-70B-Q4_K_M.gguf \
-ngl 83 \
--tensor-split 1,1 \
--ctx-size 4096 \
--port 8080
Expected Performance
Benchmarks on dual RTX 3090 NVLink, ExLlamaV2, CUDA 12.4:
| Model | Quant | Tok/s |
|---|---|---|
| Llama 3.1 70B | Q4_K_M | 21.3 |
| Llama 3.1 70B | Q5_K_M | 16.1 |
| Qwen 2.5 72B | Q4_K_M | 19.8 |
| DeepSeek R1 70B | Q4_K_M | 19.2 |
| Mixtral 8x22B | Q4_K_M | 24.7 |
21 tok/s is fast enough for interactive chat. It's not instant like a 7B model on a 4090, but it's usable.
Thermal Management
The primary challenge is heat. The NVLink bridge blocks airflow between cards and the bottom card runs 15–20°C hotter than the top.
Solutions in order of effectiveness:
- Remove the side panel during inference — immediate, free, works well
- Undervolt both GPUs — MSI Afterburner, 900mV core at 1800MHz reduces heat by ~30W per card
- Replace thermal pads — RTX 3090 VRAM throttles above 105°C, common on used cards
- Dedicated intake fan — pointing directly at the gap between cards
Target temperatures: GPU core below 83°C, VRAM below 100°C under sustained load.
Troubleshooting
NVLink not detected
- Reseat the bridge with the PC off
- Update NVIDIA drivers to 525+
- Check BIOS PCIe settings — enable "PCIe 4.0" and "Above 4G Decoding"
Only one GPU being used
Set the correct flags: -gs 24,24 in ExLlamaV2, --tensor-split 1,1 in llama.cpp. Ollama handles this automatically.
Speed slower than expected
Check that both GPUs are running at x16 bandwidth:
nvidia-smi --query-gpu=pcie.link.width.current --format=csv
# Both should show 16
If one shows 4, your motherboard is running the second slot at x4. Check your board's manual — some boards only run x16/x4 unless a specific BIOS setting is enabled.
Temperatures too high
Undervolting is the most effective fix. In MSI Afterburner: Ctrl+F to open voltage/frequency curve editor, lock the 900mV point at 1800MHz, apply. Reduces power draw by 50–80W per card.
Next Steps
- CUDA Optimization — squeeze more performance from your setup
- Inference Profiler — benchmark your exact configuration
- Speed Estimator — predict throughput before buying hardware