DefiledAI Research
HARDWARE CONFIGS
Curated build recommendations for local AI inference at every budget tier, plus a GPU reference matrix sorted by inference performance.
Entry — 7B Workhorse
~$800
Max model: 13B Q4_K_M
GPU: RTX 3080 10GB
CPU: Ryzen 5 5600X
RAM: 32GB DDR4
Storage: 1TB NVMe
Speed: ~55 tok/s (7B Q4)
Best entry point. 10GB VRAM handles most 7B models in Q8 and 13B in Q4.
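The fit claim above can be sanity-checked with a rough size estimate: parameter count times bits per weight, plus some headroom for the KV cache and runtime buffers. The sketch below is illustrative only. The bits-per-weight figures are approximate averages for GGUF quant types, the 1.5 GB overhead is an assumed placeholder, and the names `weights_gb` and `fits_in_vram` are hypothetical helpers, not part of any library.

```python
# Approximate bits per weight for common GGUF quant types.
# Assumed rough averages, not exact figures for any specific model.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}


def weights_gb(params_billions: float, quant: str) -> float:
    """Estimated size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    # billions of params * bits per weight / 8 bits per byte = gigabytes
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def fits_in_vram(params_billions: float, quant: str, vram_gb: float,
                 overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed KV-cache/runtime overhead fit in VRAM."""
    return weights_gb(params_billions, quant) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for size, quant in [(7, "Q8_0"), (13, "Q4_K_M"), (13, "Q8_0")]:
        print(f"{size}B {quant}: ~{weights_gb(size, quant):.1f} GB weights, "
              f"fits in 10 GB: {fits_in_vram(size, quant, 10)}")
```

Under these assumptions, 7B at Q8 and 13B at Q4_K_M both land under 10 GB with limited context, while 13B at Q8 does not. Longer contexts need more headroom than the placeholder used here.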
Mid — 30B Sweet Spot
~$1,400
Max model: 30B Q4_K_M
GPU: RTX 4090 24GB
CPU: Ryzen 7 7700X
RAM: 64GB DDR5
Storage: 2TB NVMe
Speed: ~112 tok/s (7B Q4)
The top-scoring single card in the GPU reference below. Handles 30B comfortably, and 34B with some quant compromise.
High — 70B Capable
~$2,200
Max model: 70B Q4_K_M
GPU: 2× RTX 3090 24GB (NVLink)
CPU: Ryzen 9 7950X
RAM: 128GB DDR5
Storage: 4TB NVMe
Speed: ~21 tok/s (70B Q4)
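NVLink is optional for using the full 48GB. Inference runtimes such as llama.cpp split the model's layers across both cards over PCIe, so nothing is offloaded to the CPU as long as the model fits in combined VRAM. NVLink mainly speeds up GPU-to-GPU transfers.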
Workstation — 70B+ Fast
~$6,000
Max model: 70B Q5_K_M
GPU: 2× RTX 4090 24GB
CPU: Threadripper 7960X
RAM: 256GB DDR5 ECC
Storage: 8TB NVMe RAID
Speed: ~35 tok/s (70B Q4)
No NVLink on 40-series consumer cards, so inter-GPU traffic goes over PCIe. Still the fastest consumer 70B setup.
Server — MoE & 405B
~$15,000+
Max model: 405B Q4 / DeepSeek V3
GPU: 4× A100 80GB SXM
CPU: Dual EPYC 9354
RAM: 512GB DDR5 ECC
Storage: 16TB NVMe
Speed: ~39 tok/s (DeepSeek V3)
NVLink/NVSwitch fabric. Required for 405B and large MoE models at usable speeds.
GPU Reference
| GPU | VRAM | Bandwidth | TFLOPs | TDP | PCIe | Inference Score |
|---|---|---|---|---|---|---|
| RTX 4090 | 24GB | 1.0 TB/s | 82.6 | 450W | x16 4.0 | 100 |
| A100 80GB | 80GB | 2.0 TB/s | 77.9 | 400W | SXM4 | 95 |
| RTX 4080 | 16GB | 0.72 TB/s | 48.7 | 320W | x16 4.0 | 78 |
| RTX 3090 | 24GB | 0.94 TB/s | 35.6 | 350W | x16 4.0 | 72 |
| RTX 3080 Ti | 12GB | 0.91 TB/s | 34.1 | 350W | x16 4.0 | 68 |
| RX 7900 XTX | 24GB | 0.96 TB/s | 61.4 | 355W | x16 4.0 | 65 |
* Inference score weighted toward memory bandwidth (primary bottleneck for LLM token generation).
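To illustrate why memory bandwidth dominates token generation: at batch size 1 on a dense model, each generated token requires reading roughly all of the weights once, so bandwidth divided by model size gives a theoretical ceiling on tokens per second. The sketch below is a back-of-the-envelope bound, not a benchmark. The 4.2 GB figure for a 7B Q4 model is an assumption, and `decode_ceiling_tok_s` is a hypothetical helper name.

```python
def decode_ceiling_tok_s(bandwidth_tb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/s if every token reads all weights once.

    Assumes a dense model, batch size 1, and weights fully resident in VRAM.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    # Convert TB/s to GB/s, then divide by GB read per token.
    return bandwidth_tb_s * 1000 / model_gb


if __name__ == "__main__":
    MODEL_GB = 4.2  # assumed size of a 7B Q4 model
    # Bandwidth values taken from the GPU reference table.
    for name, bw in [("RTX 4090", 1.0), ("RTX 3090", 0.94), ("A100 80GB", 2.0)]:
        print(f"{name}: ceiling ~{decode_ceiling_tok_s(bw, MODEL_GB):.0f} tok/s")
```

Under these assumptions the RTX 4090 ceiling is about 238 tok/s, roughly double the ~112 tok/s listed for 7B Q4 in the mid-tier build. Real throughput sits below the ceiling because of kernel overhead, KV-cache reads, and sampling.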