DeepSeek V3: Running a 671B MoE Model Locally

DeepSeek V3 is one of the most significant open-weight model releases in recent years — a 671B parameter Mixture-of-Experts architecture under MIT license that benchmarks competitively with frontier closed models. Running it locally is not trivial, but the hardware requirements are more accessible than the parameter count suggests.

What Makes MoE Different

A standard dense model like Llama 3.1 70B activates all 70 billion parameters for every token it generates. A Mixture-of-Experts model activates only a subset of its parameters — the "experts" selected by a router for each token. DeepSeek V3 activates approximately 37B parameters per token despite having 671B total parameters.

This has two practical implications:

Memory must still hold every weight: you need enough VRAM to load the full 671B parameter set even though only about 37B are active for any given token
Inference is faster than the parameter count implies: compute per token is closer to that of a 37B dense model than a 671B one
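
To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. The expert count, the value of k, and the softmax gating are toy assumptions chosen for illustration; they are not DeepSeek V3's actual router configuration. Only the 671B and 37B figures come from the text above.

```python
import math
import random

TOTAL_PARAMS_B = 671   # total parameters in billions (from the article)
ACTIVE_PARAMS_B = 37   # parameters active per token in billions (from the article)

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and return (index, gate_weight) pairs.

    Gate weights are a softmax over the selected experts only, so they sum to 1.
    This is a generic top-k gate, assumed here for illustration.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Share of the weights that does any work for a single token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

random.seed(0)
NUM_EXPERTS, TOP_K = 16, 2  # toy sizes, not the real model's
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = top_k_route(logits, TOP_K)

print(f"Active fraction per token: {active_fraction:.1%}")
for expert, weight in selected:
    print(f"expert {expert}: gate weight {weight:.3f}")
```

Every expert still has to sit in memory because the router may pick any of them for the next token, which is why the memory cost tracks the total parameter count while the compute cost tracks the active one.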

Hardware Requirements

At Q4_K_M quantization, DeepSeek V3 requires approximately 380GB of VRAM. This puts it firmly in multi-GPU server territory:

Setup	VRAM	Can Fit	Speed
4× RTX 3090	96GB	No	—
8× RTX 3090	192GB	No (IQ1_M only, see below)	—
4× A100 80GB	320GB	Q2_K only	~20 tok/s
8× A100 80GB	640GB	Q4_K_M	~39 tok/s
4× H100 80GB	320GB	Q2_K only	~35 tok/s

For most users, DeepSeek V3 is a cloud inference model. The local story only makes sense if you have a server-grade multi-GPU setup.
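
The fit column above follows from simple arithmetic, sketched below. The 380GB figure and the GPU configurations are taken from the article; the helper names are hypothetical, 1GB is treated as 10^9 bytes, and KV cache and activation memory are ignored, so real headroom will be smaller than these numbers suggest.

```python
PARAMS_B = 671     # total parameters in billions (from the article)
Q4_K_M_GB = 380    # article's approximate Q4_K_M footprint

def implied_bits_per_weight(model_gb, params_billion):
    """Average bits per weight implied by a given file size."""
    return model_gb * 8 / params_billion

def shortfall_gb(model_gb, total_vram_gb):
    """How many GB of weights do not fit in VRAM (0 means the weights fit)."""
    return max(0, model_gb - total_vram_gb)

# Total VRAM per setup, as listed in the table.
SETUPS = {
    "4x RTX 3090": 4 * 24,
    "8x RTX 3090": 8 * 24,
    "4x A100 80GB": 4 * 80,
    "8x A100 80GB": 8 * 80,
    "4x H100 80GB": 4 * 80,
}

report = {name: shortfall_gb(Q4_K_M_GB, vram) for name, vram in SETUPS.items()}

print(f"Implied bits per weight: {implied_bits_per_weight(Q4_K_M_GB, PARAMS_B):.2f}")
for name, missing in report.items():
    status = "fits" if missing == 0 else f"short by {missing}GB"
    print(f"{name}: {status}")
```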

IQ1_M: The Consumer Hardware Option

At IQ1_M quantization (~1.5 bits), DeepSeek V3 compresses to approximately 120GB. This theoretically fits across:

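5× RTX 3090 (120GB total, which leaves no headroom for the KV cache or activations, so expect some system RAM offload)
3× RTX 4090 (72GB total, so roughly 48GB of the weights must be offloaded to system RAM)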

The quality trade-off at IQ1_M is severe. Perplexity scores increase substantially and outputs become less coherent on complex reasoning tasks. That said, for conversational use and factual Q&A on a model of this capability, even heavily quantized DeepSeek V3 produces impressive results.
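
As a rough planning aid, the sketch below estimates how much of the 120GB IQ1_M file lands on the GPUs and how much spills to system RAM. The function name and the 2GB per-GPU reserve for KV cache and runtime overhead are assumptions made for illustration, not measured values; tune the reserve to your context length and backend.

```python
def plan_offload(model_gb, num_gpus, vram_per_gpu_gb, reserve_per_gpu_gb=2.0):
    """Return (gb_on_gpu, gb_in_system_ram) for a model split across identical GPUs.

    reserve_per_gpu_gb is VRAM held back on each card for KV cache and
    runtime overhead (an assumed figure, not a measurement).
    """
    usable = num_gpus * max(0.0, vram_per_gpu_gb - reserve_per_gpu_gb)
    on_gpu = min(model_gb, usable)
    return on_gpu, model_gb - on_gpu

IQ1_M_GB = 120  # article's approximate IQ1_M footprint

for label, gpus in (("5x RTX 3090", 5), ("3x RTX 4090", 3)):
    on_gpu, in_ram = plan_offload(IQ1_M_GB, gpus, 24)
    print(f"{label}: {on_gpu:.0f}GB on GPU, {in_ram:.0f}GB in system RAM")
```

Whatever ends up in system RAM is read over a much slower path than VRAM, so throughput drops as the offloaded share grows.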

DeepSeek R1 vs V3

If you are choosing between the two DeepSeek models for local use:

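DeepSeek R1 is the reasoning-focused variant. It uses chain-of-thought internally and excels at math, logic, and structured problem solving. The full R1 model is a 671B MoE like V3; the versions practical for local use are the distilled models, built on Qwen and Llama base models, in sizes including 7B, 14B, 32B, and 70B. The 70B distill is the most capable and, being Llama-based, fits in the same hardware as Llama 3.1 70B.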

DeepSeek V3 is the general-purpose large model. Better for coding, creative tasks, and broad knowledge. Only relevant if you have 300GB+ VRAM.

For most users, DeepSeek R1 70B is the practical choice. It fits on dual RTX 3090 NVLink and delivers reasoning performance that rivals much larger models on structured tasks.

Running DeepSeek R1 70B Locally

# Via Ollama
ollama pull deepseek-r1:70b
ollama run deepseek-r1:70b

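# Via ExLlamaV2 (loads an EXL2 quant directory, not a GGUF file; the path below is a placeholder)
python test_inference.py -m /path/to/deepseek-r1-70b-exl2 -gs 24,24 -p "Why is the sky blue?"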
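
R1-style models typically emit their chain-of-thought between <think> and </think> tags before the final answer. If you post-process raw output yourself, a small helper like the one below can separate the two. This is a sketch under that assumption; the function name is hypothetical, and some front ends may already strip or separate the reasoning for you.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Split raw model output into (reasoning, answer).

    If no <think> block is present, reasoning is an empty string and the
    whole text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Keeping the reasoning separate also makes it easier to count how many generated tokens went to thinking rather than to the answer.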

Benchmark Results: DeepSeek R1 70B

Measured on dual RTX 3090 NVLink, ExLlamaV2:

Task	DeepSeek R1 70B	Llama 3.1 70B
MATH-500	94.1%	68.3%
HumanEval (code)	82.3%	80.1%
GPQA (science)	71.2%	62.4%
Throughput	19.2 tok/s	21.3 tok/s

For reasoning and math tasks, DeepSeek R1 70B is the stronger choice. Llama 3.1 70B edges it slightly on throughput and general text generation.
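
To put the table in perspective, the sketch below computes the accuracy gaps in percentage points and the wall-clock cost of the throughput difference. All inputs are the figures from the table above; the 1,000-token response length is an arbitrary assumption for illustration.

```python
# Accuracy figures (percent) from the benchmark table: (DeepSeek R1 70B, Llama 3.1 70B).
SCORES = {
    "MATH-500": (94.1, 68.3),
    "HumanEval": (82.3, 80.1),
    "GPQA": (71.2, 62.4),
}
R1_TOK_S, LLAMA_TOK_S = 19.2, 21.3  # throughput from the table

# Percentage-point lead of R1 over Llama on each task.
deltas = {task: r1 - llama for task, (r1, llama) in SCORES.items()}

def seconds_for(tokens, tok_per_s):
    """Generation time for a response of the given length."""
    return tokens / tok_per_s

RESPONSE_TOKENS = 1000  # assumed response length
r1_seconds = seconds_for(RESPONSE_TOKENS, R1_TOK_S)
llama_seconds = seconds_for(RESPONSE_TOKENS, LLAMA_TOK_S)

for task, delta in deltas.items():
    print(f"{task}: R1 leads by {delta:.1f} points")
print(f"{RESPONSE_TOKENS} tokens: R1 {r1_seconds:.1f}s vs Llama {llama_seconds:.1f}s")
print(f"R1 throughput is {R1_TOK_S / LLAMA_TOK_S:.0%} of Llama's")
```

Because a reasoning model usually generates extra thinking tokens before its answer, the real wall-clock gap per question can be larger than the raw tok/s figures suggest.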

Conclusion

DeepSeek V3 at full scale is server hardware only. DeepSeek R1 70B is the practical local inference option and one of the strongest models at its size class — particularly for reasoning-heavy workloads. If your dual 3090 NVLink system is running Llama 3.1 70B, DeepSeek R1 70B is worth adding to your rotation.