TUTORIALS
From first install to production MoE pipelines. Every tutorial is specific, tested, and written for people who actually run local AI.
No prior experience needed. Get from zero to running your first local AI model.
What is Local AI and Why Run It Yourself?
5 min read · The case for local models: privacy, cost, censorship, and control.
Installing Ollama on Windows: Complete Guide
10 min read · Download, install, and run your first model in under 10 minutes.
Installing Ollama on Linux (Ubuntu/Debian)
8 min read · One-line install, systemd service, and first model.
Downloading and Running Your First Model
8 min read · Choose the right model for your GPU and run it with Ollama.
Understanding VRAM: Why It's Everything in Local AI
6 min read · What VRAM is, why it matters, and how to work within your limits.
Quantization Explained for Beginners
8 min read · Why Q4_K_M exists, what it costs you, and which format to pick.
You have Ollama running. Now optimize it, extend it, and connect it to real workflows.
Open WebUI: ChatGPT-Style Interface for Ollama
10 min read · Install Open WebUI with Docker and get a full chat interface.
Using the Ollama API: Build Your First Integration
15 min read · REST API, Python client, and OpenAI SDK compatibility.
Ollama Modelfiles: System Prompts, Parameters, Presets
12 min read · Create custom model personalities with persistent configuration.
ExLlamaV2 Setup: 20-30% Faster Than Ollama on NVIDIA
15 min read · Install, load an EXL2-quantized model (ExLlamaV2 does not read GGUF files), and benchmark against Ollama.
llama.cpp Server Mode: Local OpenAI-Compatible API
12 min read · Run llama.cpp as a persistent API server.
Finding and Running Abliterated Models
10 min read · Where to find abliterated GGUFs, how to load them, and what to expect.
Advanced inference, multi-GPU, production serving, and building AI pipelines.
Dual GPU NVLink Setup for 70B Inference
20 min read · Hardware, driver config, and running 70B models on dual 3090s.
Building a Local MoE Pipeline from Independent Models
25 min read · Router + expert + synthesizer architecture from scratch with Python.
TensorRT-LLM: Maximum Throughput on NVIDIA
30 min read · Compile, quantize, and serve with TensorRT-LLM.
Building a Local RAG Pipeline with Ollama
20 min read · Document ingestion, vector search, and retrieval-augmented generation.
CUDA Optimization for LLM Inference
20 min read · Flash attention, KV cache tuning, batch sizing, and profiling.
Run Abliteration Yourself: Step-by-Step Guide
25 min read · Apply representation engineering to any open-weight model locally.