
Building a Local MoE Pipeline from Independent Models

Build a macro-scale Mixture-of-Experts pipeline using independent local models — a router, domain experts, and optional synthesizer — entirely on consumer hardware.

2026-05-30 · 6 min read
Tags: moe, pipeline, router, ollama, python, architecture

A Mixture-of-Experts architecture routes each input to a specialist rather than running every parameter. Traditional MoE (Mixtral, DeepSeek V3) does this inside a single model at the layer level. This guide builds a macro-scale MoE from independent local models — each an expert in its domain — orchestrated by a router and optionally refined by a synthesizer.

Why This Matters

Instead of loading one massive 70B dense model, you load:

  • A router (7B) — always in VRAM, classifies intent
  • A coding expert (32B) — loaded on demand for code tasks
  • A reasoning expert (14B) — loaded for math and logic
  • A writing expert (70B) — loaded for long-form creative tasks
  • A synthesizer (7B) — optional, integrates multi-domain outputs

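Peak VRAM usage is the router plus one expert, not all models simultaneously; the synthesizer reuses the router model in this guide, so it adds no extra footprint. On a 24GB GPU this gives you access to a pool of specialists whose combined footprint is far larger than the card, as long as each expert fits on its own. The 70B writing expert (about 40GB) is the exception: it needs a second GPU or partial CPU offload.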
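
The peak-versus-total arithmetic can be sketched in a few lines. The expert footprints are the approximate figures quoted for the larger variants in this guide; the 5GB router footprint is an assumption (the guide does not state it), and the helper name peak_vram_gb is hypothetical.

```python
# Approximate VRAM footprints in GB. Expert figures are the ones quoted in
# this guide; ROUTER_GB is an assumed value used only for illustration.
ROUTER_GB = 5  # assumption: Mistral 7B, shared by router and synthesizer

EXPERT_GB = {
    "coding": 20,     # Qwen 2.5 Coder 32B
    "reasoning": 10,  # DeepSeek R1 14B
    "writing": 40,    # Llama 3.1 70B
    "general": 5,     # Dolphin 2.9 8B
}

def peak_vram_gb(expert: str) -> int:
    """Router stays resident; only one expert is loaded at a time."""
    return ROUTER_GB + EXPERT_GB[expert]

# Footprint if every model were loaded at once, for comparison.
total_if_all_loaded = ROUTER_GB + sum(EXPERT_GB.values())

# Which experts fit next to the router on a 24GB card?
fits_24gb = {name: peak_vram_gb(name) <= 24 for name in EXPERT_GB}

print(total_if_all_loaded)
print(fits_24gb)
```

Under these assumed numbers the 14B reasoning expert fits comfortably next to the router, the 32B coder lands slightly over 24GB unless the router is unloaded first, and the 70B writing expert does not fit in VRAM at all. Treat the result as a budget check, not a guarantee.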

Architecture

User Prompt
     ↓
  Router (Mistral 7B)
  └─ classifies domain
  └─ decides if synthesis needed
     ↓
Expert Dispatch
  ├─ coding    → Qwen 2.5 Coder 32B
  ├─ reasoning → DeepSeek R1 14B
  ├─ writing   → Llama 3.1 70B Abliterated
  └─ general   → Dolphin 2.9 8B
     ↓
[Optional] Synthesizer (Mistral 7B)
  └─ triggered only for multi-domain or report outputs
     ↓
Final Response

Prerequisites

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull models (adjust to your VRAM)
ollama pull mistral:7b          # router + synthesizer
ollama pull qwen2.5-coder:7b   # coding expert (7B fits 8GB GPU)
ollama pull deepseek-r1:7b     # reasoning expert
ollama pull llama3.1:8b        # writing expert
ollama pull dolphin-llama3:8b  # general expert (Dolphin 2.9), used by the script below

# Python
pip install ollama

For 24GB+ VRAM you can use larger experts:

ollama pull qwen2.5-coder:32b  # 20GB
ollama pull deepseek-r1:14b    # 10GB

The Router

The router's job is to classify intent and decide if synthesis is needed. Key design decisions:

Use a small, fast model. The router runs on every single prompt. Mistral 7B at 138 tok/s adds ~0.5s latency. A 70B router would add 5-10s — unacceptable.

Force JSON output. The router must return structured data, not prose.

System prompt:

You are a routing assistant. Analyze the user's request and output ONLY valid JSON.
No preamble, no explanation — JSON only.

{"expert": "<domain>", "reasoning": "<one sentence>", "needs_synthesis": <true|false>}

Domains: coding, reasoning, writing, general

Rules:
- coding: any programming, debugging, code review, architecture
- reasoning: math, logic puzzles, multi-step analysis, STEM
- writing: creative fiction, essays, long-form content, copywriting
- general: everything else

Set needs_synthesis to true ONLY when:
- The answer requires integrating outputs from multiple domains
- A structured report or document is needed
- The question explicitly spans multiple fields

Set needs_synthesis to false for:
- Pure code output (always false — code needs no synthesis)
- Single-domain factual questions
- Direct mathematical answers
- Simple lookups or conversions
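
Because these rules are enforced only by the prompt, it can help to check the router against a few labeled prompts whenever the prompt changes. The following is a minimal sketch: the labeled prompts and the name router_accuracy are hypothetical, and classify is any callable that maps a prompt to a domain string.

```python
from typing import Callable

# Hypothetical labeled prompts; replace with examples from your own workload.
LABELED_PROMPTS = [
    ("Why does this Python loop skip the last element?", "coding"),
    ("Prove that the sum of two even numbers is even.", "reasoning"),
    ("Write a short story about a lighthouse keeper.", "writing"),
    ("What is the capital of Australia?", "general"),
]

def router_accuracy(
    classify: Callable[[str], str],
    cases: list[tuple[str, str]] = LABELED_PROMPTS,
) -> float:
    """Fraction of cases where classify(prompt) returns the expected domain."""
    if not cases:
        return 0.0
    hits = sum(1 for prompt, expected in cases if classify(prompt) == expected)
    return hits / len(cases)
```

With the pipeline from the full implementation, the classifier could be passed as `lambda p: route(p)["expert"]`; a falling score after a prompt edit is a signal to revisit the domain rules.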

The Expert Models

Each expert receives the original prompt without modification. The router's classification is invisible to the expert — it just sees a user message.

Domain     Model                          VRAM     Tok/s    Strength
Coding     Qwen 2.5 Coder 7B/32B          5/20GB   132/44   HumanEval 88%
Reasoning  DeepSeek R1 7B/14B             5/10GB   129/86   MATH-500 92%
Writing    Llama 3.1 8B/70B Abliterated   5/40GB   128/21   Long-form, creative
General    Dolphin 2.9 8B                 5GB      128      Uncensored, versatile
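
One way to use the VRAM column is to pick the largest variant that fits the memory currently free. This is a sketch under stated assumptions: the footprints are the approximate figures from the table, only the two domains whose model tags appear in this guide are listed, and pick_variant is a hypothetical helper.

```python
# (model tag, approximate VRAM in GB) per domain, smallest first.
# Footprints are the approximate figures from the table above.
VARIANTS = {
    "coding":    [("qwen2.5-coder:7b", 5), ("qwen2.5-coder:32b", 20)],
    "reasoning": [("deepseek-r1:7b", 5), ("deepseek-r1:14b", 10)],
}

def pick_variant(domain: str, free_gb: float) -> str:
    """Return the largest variant of a domain whose footprint fits in free_gb."""
    fitting = [tag for tag, gb in VARIANTS[domain] if gb <= free_gb]
    if not fitting:
        raise ValueError(f"no {domain} variant fits in {free_gb} GB")
    return fitting[-1]  # list is ordered smallest to largest

print(pick_variant("coding", 19))
print(pick_variant("reasoning", 12))
```

Subtract the router's footprint from free_gb first if the router is kept resident.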

When to Use the Synthesizer

The synthesizer is not always needed and adds latency when it runs. Skip it for:

  • Pure code output — synthesizing code introduces errors
  • Single-domain answers — adds nothing
  • Simple factual responses

Trigger it for:

  • "Explain both the code and the mathematics behind this algorithm"
  • "Write a technical report on X covering implementation and theory"
  • Multi-part questions where different parts need different experts
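
The skip and trigger rules above can also be enforced in code instead of trusting the router's flag alone. A minimal sketch, assuming the pipeline knows which domains were dispatched; should_synthesize is a hypothetical name.

```python
def should_synthesize(domains: list[str], router_flag: bool) -> bool:
    """Decide whether to run the synthesizer for this request."""
    if len(domains) > 1:
        return True   # several experts answered: their outputs need merging
    if domains == ["coding"]:
        return False  # pure code: never pass it through another model
    return router_flag  # single non-code domain: defer to the router

print(should_synthesize(["coding"], True))
print(should_synthesize(["coding", "reasoning"], False))
```

The coding check overrides the router on purpose, so a misclassified flag cannot cause working code to be rewritten.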

Full Python Implementation

#!/usr/bin/env python3
"""
Local MoE Pipeline
Usage: python moe.py "Your prompt here"
"""

import json
import sys
import ollama

ROUTER_MODEL = "mistral:7b"
SYNTHESIZER_MODEL = "mistral:7b"

EXPERTS = {
    "coding":    "qwen2.5-coder:7b",
    "reasoning": "deepseek-r1:7b",
    "writing":   "llama3.1:8b",
    "general":   "dolphin-llama3:8b",
}

ROUTER_SYSTEM = """You are a routing assistant. Output ONLY valid JSON.
{"expert": "<domain>", "reasoning": "<one sentence>", "needs_synthesis": <true|false>}
Domains: coding, reasoning, writing, general
Set needs_synthesis false for code, math answers, and single-domain questions."""

SYNTH_SYSTEM = """You are a synthesis assistant. Integrate specialist AI output into 
a single coherent response. Preserve technical accuracy. Remove redundancy. 
Format clearly with appropriate sections."""

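def route(prompt: str) -> dict:
    print("\n[Router] Analyzing prompt...")
    response = ollama.chat(
        model=ROUTER_MODEL,
        messages=[
            {"role": "system", "content": ROUTER_SYSTEM},
            {"role": "user", "content": prompt}
        ],
        format="json",  # ask Ollama to constrain the reply to valid JSON
    )
    raw = response["message"]["content"].strip()
    # Take the outermost {...} span in case the model adds stray text
    start, end = raw.find("{"), raw.rfind("}") + 1
    try:
        result = json.loads(raw[start:end])
    except json.JSONDecodeError:
        result = {}
    # Fall back to the general expert if the reply is unusable
    if not isinstance(result, dict) or result.get("expert") not in list(EXPERTS):
        result = {"expert": "general", "needs_synthesis": False,
                  "reasoning": "router output unusable; defaulting to general"}
    print(f"[Router] Domain: {result['expert']} | Synthesis: {result.get('needs_synthesis', False)}")
    print(f"[Router] Reason: {result.get('reasoning', '')}")
    return result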

def call_expert(domain: str, prompt: str) -> str:
    model = EXPERTS.get(domain, EXPERTS["general"])
    print(f"\n[Expert:{domain}] Calling {model}...")
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response["message"]["content"]

def synthesize(prompt: str, expert_output: str) -> str:
    print(f"\n[Synthesizer] Refining output...")
    response = ollama.chat(
        model=SYNTHESIZER_MODEL,
        messages=[
            {"role": "system", "content": SYNTH_SYSTEM},
            {"role": "user", "content": f"Question: {prompt}\n\nExpert output:\n{expert_output}"}
        ]
    )
    return response["message"]["content"]

def run(prompt: str) -> str:
    routing = route(prompt)
    domain = routing.get("expert", "general")
    needs_synth = routing.get("needs_synthesis", False)

    expert_output = call_expert(domain, prompt)

    if needs_synth:
        return synthesize(prompt, expert_output)
    return expert_output

if __name__ == "__main__":
    prompt = " ".join(sys.argv[1:]) if len(sys.argv) > 1 else input("Prompt: ")
    result = run(prompt)
    print(f"\n{'─'*60}\n{result}")

Extending the Pipeline

Multi-expert dispatch — for questions spanning multiple domains, route to two experts and synthesize:

def route_multi(prompt: str) -> list[str]:
    # Modify router prompt to return list of domains
    # {"experts": ["coding", "reasoning"], "needs_synthesis": true}
    ...
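
A possible shape for the dispatch side, once the router returns a list of domains, is sketched below. It assumes call_expert(domain, prompt) and synthesize(prompt, expert_output) behave like the functions in the full implementation; they are passed in as parameters so the sketch runs on its own. The name run_multi and the "[domain expert]" labels are hypothetical.

```python
from typing import Callable

def run_multi(
    prompt: str,
    domains: list[str],
    call_expert: Callable[[str, str], str],
    synthesize: Callable[[str, str], str],
) -> str:
    """Query each requested expert in turn, then merge the labeled outputs."""
    # De-duplicate while keeping order; default to "general" if empty.
    unique = list(dict.fromkeys(domains)) or ["general"]
    # Experts run sequentially because they share the same VRAM.
    outputs = [(d, call_expert(d, prompt)) for d in unique]
    if len(outputs) == 1:
        return outputs[0][1]  # single expert: nothing to merge
    combined = "\n\n".join(f"[{d} expert]\n{text}" for d, text in outputs)
    return synthesize(prompt, combined)
```

Labeling each section lets the synthesizer see which expert produced which part.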

Streaming output — for interactive use, stream the expert's response:

for chunk in ollama.chat(model=model, messages=msgs, stream=True):
    print(chunk["message"]["content"], end="", flush=True)
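
If the streamed text is also needed afterwards (for the synthesizer or for conversation history), the chunks can be collected while they are printed. A minimal sketch, assuming each chunk supports the same chunk["message"]["content"] access used above; stream_and_collect is a hypothetical helper.

```python
from typing import Any, Iterable

def stream_and_collect(chunks: Iterable[Any]) -> str:
    """Print each streamed piece as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # finish the line once the stream ends
    return "".join(parts)

# Stand-in chunks for demonstration; in the pipeline, pass the iterator
# returned by ollama.chat(..., stream=True) instead.
demo = [{"message": {"content": "Hel"}}, {"message": {"content": "lo"}}]
full_text = stream_and_collect(demo)
```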

Memory across turns — maintain conversation history per expert:

HISTORIES = {domain: [] for domain in EXPERTS}

def call_expert_with_memory(domain: str, prompt: str) -> str:
    HISTORIES[domain].append({"role": "user", "content": prompt})
    response = ollama.chat(model=EXPERTS[domain], messages=HISTORIES[domain])
    output = response["message"]["content"]
    HISTORIES[domain].append({"role": "assistant", "content": output})
    return output
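
Per-expert histories grow without bound, so a long session can eventually exceed a model's context window. One hedged option is to keep only the most recent messages before each call. The helper name trim_history and the default of 8 messages are assumptions, not values from this guide.

```python
def trim_history(history: list[dict], max_messages: int = 8) -> list[dict]:
    """Keep the most recent messages, starting on a user turn."""
    if max_messages <= 0:
        return []
    recent = history[-max_messages:]
    # Drop leading assistant messages so the window opens with a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent
```

It could be applied as `messages=trim_history(HISTORIES[domain])` in the call above, leaving the stored history untouched.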

Performance Notes

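Ollama's OLLAMA_KEEP_ALIVE setting controls how long a model stays loaded after a request. With OLLAMA_KEEP_ALIVE=0, each model unloads immediately after use, freeing VRAM for the next. With OLLAMA_KEEP_ALIVE=300 (5 minutes, which is also Ollama's default), recently used models stay loaded, which helps when the same expert is called repeatedly.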
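
Keep-alive can also be set per request, which allows the router to stay resident while experts unload after each call. The sketch below only computes the value; keep_alive_for is a hypothetical helper, and the policy (router pinned, experts unloaded) is one assumption among several reasonable ones.

```python
ROUTER_MODEL = "mistral:7b"

def keep_alive_for(model: str) -> int:
    """Seconds to keep a model loaded after a request.

    A negative value keeps the model loaded indefinitely;
    0 unloads it as soon as the request finishes.
    """
    return -1 if model == ROUTER_MODEL else 0

# Intended use with the ollama client (not executed here):
#   ollama.chat(model=model, messages=msgs, keep_alive=keep_alive_for(model))
print(keep_alive_for("mistral:7b"), keep_alive_for("qwen2.5-coder:7b"))
```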

On a 24GB GPU running 7B experts: total latency per query is typically 2–8 seconds including routing. On dual 3090 NVLink with 32B experts: 5–15 seconds depending on output length.
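
These figures depend heavily on hardware and on whether a model has to be loaded first, so it is worth measuring each stage on your own machine. A minimal timing sketch using only the standard library; the timed helper and the TIMINGS dictionary are hypothetical names.

```python
import time
from contextlib import contextmanager

TIMINGS: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Accumulate wall-clock seconds spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        TIMINGS[stage] = TIMINGS.get(stage, 0.0) + elapsed

# Stand-in workload; in the pipeline, wrap route(), call_expert() and
# synthesize() instead of time.sleep().
with timed("router"):
    time.sleep(0.01)

for stage, seconds in TIMINGS.items():
    print(f"{stage}: {seconds:.3f}s")
```

Comparing the first call to an expert with a repeat call shows how much of the latency is model loading as opposed to generation.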

Verdict

The macro-scale MoE pattern gives you access to specialist capability without keeping a massive dense model in VRAM permanently. The architecture is simple enough to build in an afternoon and complex enough to tune significantly — router prompt quality, expert selection, and synthesis triggers all have major impact on output quality.

Use the MoE Builder tool to generate the config and boilerplate code automatically.