DefiledAI

TUTORIALS

From first install to production MoE pipelines. Every tutorial is specific, tested, and written for people who actually run local AI.

BEGINNER

No prior experience needed. Get from zero to running your first local AI model.

What is Local AI and Why Run It Yourself?

5 min read

The case for local models: privacy, cost, freedom from censorship, and control.

Installing Ollama on Windows: Complete Guide

10 min read

Download, install, and run your first model in under 10 minutes.

Installing Ollama on Linux (Ubuntu/Debian)

8 min read

One-line install, systemd service, and first model.

Downloading and Running Your First Model

8 min read

Choose the right model for your GPU and run it with Ollama.

Understanding VRAM: Why It's Everything in Local AI

6 min read

What VRAM is, why it matters, and how to work within your limits.

Quantization Explained for Beginners

8 min read

Why Q4_K_M exists, what it costs you, and which format to pick.

INTERMEDIATE

You have Ollama running. Now optimise, extend, and connect it to real workflows.

Open WebUI: ChatGPT-Style Interface for Ollama

10 min read

Install Open WebUI with Docker and get a full chat interface.

Using the Ollama API: Build Your First Integration

15 min read

REST API, Python client, and OpenAI SDK compatibility.

Ollama Modelfiles: System Prompts, Parameters, Presets

12 min read

Create custom model personalities with persistent configuration.

ExLlamaV2 Setup: 20-30% Faster Than Ollama on NVIDIA

15 min read

Install, load an EXL2-quantized model, and benchmark against Ollama.

llama.cpp Server Mode: Local OpenAI-Compatible API

12 min read

Run llama.cpp as a persistent API server.

Finding and Running Abliterated Models

10 min read

Where to find abliterated GGUFs, how to load them, what to expect.

EXPERT

Advanced inference, multi-GPU, production serving, and building AI pipelines.

Dual GPU NVLink Setup for 70B Inference

20 min read

Hardware, driver config, and running 70B models on dual 3090s.

Building a Local MoE Pipeline from Independent Models

25 min read

Router, expert, and synthesizer architecture, built from scratch in Python.

TensorRT-LLM: Maximum Throughput on NVIDIA

30 min read

Compile, quantize, and serve with TensorRT-LLM.

Building a Local RAG Pipeline with Ollama

20 min read

Document ingestion, vector search, and retrieval-augmented generation.

CUDA Optimization for LLM Inference

20 min read

Flash attention, KV cache tuning, batch sizing, and profiling.

Run Abliteration Yourself: Step-by-Step Guide

25 min read

Apply representation engineering to any open-weight model locally.