DefiledAI
RESOURCES
Guides, references, and tools for running AI locally. From first setup to multi-GPU optimization.
Getting Started
Beginner · Ollama · llama.cpp
Local Inference Setup Guide
Install Ollama or llama.cpp, download your first model, and run inference on any consumer GPU.
Beginner · Models · Quantization
Choosing Your First Model
A decision tree for picking the right model family, size, and quantization based on your hardware.
Beginner · VRAM · Hardware
VRAM Planning Guide
Estimate how much VRAM you need for weights, KV cache, and runtime overhead before downloading multi-gigabyte model files.
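As a rough illustration of the kind of estimate the VRAM Planning Guide is about, here is a minimal sketch in Python. It assumes VRAM use is roughly quantized weights plus an fp16 KV cache plus a fixed runtime overhead. The function names, the 1 GiB overhead default, and the example model shape are all hypothetical, not taken from the guide, and real usage varies by backend and settings.

```python
GIB = 2**30  # bytes per GiB


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Size of the quantized weights: parameters * bits per weight, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB. The factor 2 covers keys and values;
    bytes_per_elem=2 assumes an fp16 cache."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem) / GIB


def estimate_vram_gib(n_params: float, bits_per_weight: float,
                      n_layers: int, n_kv_heads: int, head_dim: int,
                      context_len: int, overhead_gib: float = 1.0) -> float:
    """Rough total: weights + KV cache + an assumed fixed overhead
    (buffers, CUDA context). The overhead default is a guess, not a measurement."""
    return (weights_gib(n_params, bits_per_weight)
            + kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len)
            + overhead_gib)


if __name__ == "__main__":
    # Hypothetical model: 8 * 2**30 parameters at 4 bits per weight,
    # 32 layers, 8 KV heads of dimension 128, 8192-token context.
    total = estimate_vram_gib(8 * GIB, 4, 32, 8, 128, 8192)
    print(f"Estimated VRAM: {total:.1f} GiB")
```

Real quantization formats rarely land on a round number of bits per weight, so pass the effective bits per weight for the file you plan to download rather than the nominal bit width.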
Quantization
Intermediate · GGUF · Quality
GGUF Quantization Explained
Deep dive into K-quants, importance matrix quants, and how to choose between Q4_K_M, IQ3_M, and others.
Intermediate · Benchmarks
Q4_K_M vs IQ3_M Quality Analysis
Side-by-side perplexity scores and real-world output comparisons across 7B, 13B, and 70B models.
Performance
Advanced · CUDA · Performance
CUDA Optimization for Inference
Flash attention, KV cache tuning, batch size, and context length settings that actually move the needle.
Advanced · Multi-GPU · NVLink
Multi-GPU Scaling Guide
Tensor parallelism, NVLink vs PCIe P2P, and when to use pipeline vs tensor parallelism.
Intermediate · Backends · Benchmarks
ExLlamaV2 vs llama.cpp — Which is Faster?
Backend comparison with real throughput numbers across GPU tiers and model sizes.
Tools & References
Tool · VRAM
Model VRAM Calculator
Enter model parameters and quantization to instantly calculate VRAM requirements.
Reference · GPU
GPU Inference Comparison Matrix
Every major consumer and prosumer GPU ranked by inference throughput and VRAM capacity.
Reference · GGUF
Quantization Format Reference
Quick reference table for all GGUF quantization formats with bits per weight, quality scores, and use cases.
External Tools
Hugging Face↗
Model hub — download GGUF files directly
llama.cpp↗
CPU/GPU inference backend, GGUF format origin
Ollama↗
Easiest local model runner for beginners
ExLlamaV2↗
Fast inference backend for NVIDIA GPUs; uses EXL2 and GPTQ formats, not GGUF
LM Studio↗
GUI for local model management and inference
Open WebUI↗
Web interface for Ollama — ChatGPT-style UI