Using the Ollama API: Build Your First Integration

Ollama exposes a REST API on localhost:11434 that lets any application talk to your local models. Alongside its native /api/ endpoints it serves OpenAI-compatible /v1/ endpoints, so most tools built for the OpenAI API can run against local models once you change the base URL and model name.

The API Endpoints

Endpoint	Method	What it does
`/api/generate`	POST	Single prompt, returns completion
`/api/chat`	POST	Chat with history
`/api/tags`	GET	List downloaded models
`/api/pull`	POST	Download a model
`/api/delete`	DELETE	Remove a model
`/v1/chat/completions`	POST	OpenAI-compatible chat

Basic curl Test

# Test the API is responding
curl http://localhost:11434/api/tags

# Single prompt
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "What is Q4_K_M quantization?",
  "stream": false
}'

# Chat format
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "user", "content": "Explain VRAM in one paragraph"}
  ],
  "stream": false
}'
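
The requests above set "stream": false to get a single JSON object back. When that field is left out, Ollama streams the reply as newline-delimited JSON, one object per line, each carrying a fragment of text and a done flag. Below is a minimal sketch of reassembling such a stream from /api/generate, assuming each line carries the "response" and "done" fields. The helper name join_generate_stream and the sample lines are illustrative, not part of Ollama.

```python
import json
from typing import Iterable


def join_generate_stream(lines: Iterable[str]) -> str:
    """Concatenate the text fragments of a streamed /api/generate reply."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines between objects
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the stream
    return "".join(parts)


# Hand-written sample lines in the shape the API streams (not real model output)
sample = [
    '{"model": "llama3.1:8b", "response": "Hello", "done": false}',
    '{"model": "llama3.1:8b", "response": " world", "done": false}',
    '{"model": "llama3.1:8b", "response": "", "done": true}',
]
print(join_generate_stream(sample))
```

With the requests library, the lines could come from calling iter_lines() on the response of requests.post(..., stream=True).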

Python — Direct API

import requests

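def ask(prompt: str, model: str = "llama3.1:8b") -> str:
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # return one JSON object instead of a stream
        },
        timeout=120,  # local generation can be slow; adjust as needed
    )
    response.raise_for_status()  # surface HTTP errors such as an unknown model
    return response.json()["message"]["content"]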

print(ask("What GPU do I need for 70B models?"))
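
If adding the requests dependency is not an option, the same call can be made with Python's standard library. The sketch below splits building the request from sending it, so the payload can be inspected without a running server. The names OLLAMA_URL, build_chat_request, and send are illustrative choices, and the 120-second timeout is an arbitrary assumption.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this guide


def build_chat_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/chat."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]
```

With Ollama running, send(build_chat_request("What GPU do I need for 70B models?")) should return the same kind of reply as ask() above.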

Python — Ollama Library (recommended)

pip install ollama

import ollama

# Simple chat
response = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain quantization"}]
)
print(response["message"]["content"])

# Streaming
for chunk in ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Write a haiku about GPU VRAM"}],
    stream=True
):
    print(chunk["message"]["content"], end="", flush=True)

# With system prompt
response = ollama.chat(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a local AI hardware expert."},
        {"role": "user", "content": "What's the best GPU for 70B inference?"}
    ]
)
print(response["message"]["content"])

Multi-Turn Conversation

import ollama

history = []

def chat(user_message: str, model: str = "llama3.1:8b") -> str:
    history.append({"role": "user", "content": user_message})
    
    response = ollama.chat(model=model, messages=history)
    assistant_message = response["message"]["content"]
    
    history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Maintains context across turns
print(chat("What GPU should I buy for local AI?"))
print(chat("What if my budget is $500?"))
print(chat("Will that run 70B models?"))
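
The history list grows with every turn and is resent in full on each call, so a long conversation can eventually outgrow the model's context window. One simple way to bound it is to keep any system messages plus only the most recent turns. This is a rough sketch: trim_history is a hypothetical helper, the limit of 20 messages is arbitrary, and counting messages is only a crude stand-in for counting tokens.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(history: List[Message], max_messages: int = 20) -> List[Message]:
    """Keep system messages plus the most recent max_messages other messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent
```

In the chat() function above, this would mean passing messages=trim_history(history) instead of messages=history.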

OpenAI SDK Compatibility

Ollama's /v1/ endpoints follow the OpenAI API format. Most code using the OpenAI SDK works locally once you change the base URL and the model name:

pip install openai

from openai import OpenAI

# Point to local Ollama instead of OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is ExLlamaV2?"}
    ]
)

print(response.choices[0].message.content)

This means you can swap between GPT-4o and your local model by changing only the client settings (base URL and API key) and the model name; the rest of the code stays the same.
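
One way to make that swap explicit is to keep the settings for each backend in one place and pick between them at startup. The sketch below is an assumption about how you might organize this: client_settings and the backend labels "local" and "openai" are hypothetical names, and the hosted API key is read from the OPENAI_API_KEY environment variable.

```python
import os
from typing import Dict


def client_settings(backend: str) -> Dict[str, str]:
    """Return OpenAI-SDK settings for the chosen backend ("local" or "openai")."""
    if backend == "local":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # placeholder; Ollama ignores it
            "model": "llama3.1:8b",
        }
    if backend == "openai":
        return {
            "base_url": "https://api.openai.com/v1",
            "api_key": os.environ.get("OPENAI_API_KEY", ""),
            "model": "gpt-4o",
        }
    raise ValueError(f"unknown backend: {backend!r}")


print(client_settings("local")["base_url"])
```

The result can be fed to the SDK as OpenAI(base_url=s["base_url"], api_key=s["api_key"]), with s["model"] passed as the model argument to client.chat.completions.create().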

Practical Example: Document Q&A

import ollama

def ask_about_document(document: str, question: str) -> str:
    return ollama.chat(
        model="llama3.1:8b",
        messages=[
            {
                "role": "system",
                "content": "Answer questions about the provided document accurately and concisely."
            },
            {
                "role": "user",
                "content": f"Document:\n\n{document}\n\nQuestion: {question}"
            }
        ]
    )["message"]["content"]

with open("my_document.txt") as f:
    doc = f.read()

print(ask_about_document(doc, "What are the main points?"))
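
The function above sends the whole file in a single message, so a document longer than the model's context window will not fit in one request. A crude guard is to cut the text to a fixed length before building the prompt. This is only a sketch: build_qa_prompt is a hypothetical helper, the 8000-character limit is an arbitrary assumption, and character counts only approximate token counts.

```python
def build_qa_prompt(document: str, question: str, max_chars: int = 8000) -> str:
    """Build the user message, cutting the document to at most max_chars characters."""
    if len(document) > max_chars:
        # Mark the cut so the model knows the text is incomplete
        document = document[:max_chars] + "\n[document truncated]"
    return f"Document:\n\n{document}\n\nQuestion: {question}"
```

The returned string can replace the f-string used for the user message in ask_about_document().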

Practical Example: Code Review

import ollama

def review_code(code: str, language: str = "Python") -> str:
    return ollama.chat(
        model="qwen2.5-coder:7b",  # Coding-specific model
        messages=[
            {
                "role": "system",
                "content": f"You are a senior {language} developer. Review code for bugs, performance issues, and style. Be specific and direct."
            },
            {
                "role": "user",
                "content": f"Review this code:\n\n```{language.lower()}\n{code}\n```"
            }
        ]
    )["message"]["content"]

# Example: review a short snippet
print(review_code("def add(a, b):\n    return a - b"))

Listing and Switching Models

import ollama

# List all available models
models = ollama.list()
for model in models["models"]:
    # Recent versions of the library expose the model tag under "model"
    print(f"{model['model']} - {model['size'] / 1e9:.1f}GB")

# Pull a model
ollama.pull("mistral:7b")

# Check what's currently loaded in memory (same as `ollama ps` in a terminal)
print(ollama.ps())
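
The same listing is available without the library through the /api/tags endpoint from the table at the top. The sketch below formats the JSON body that endpoint returns, assuming each entry carries "name" and "size" (in bytes) fields. summarize_tags is a hypothetical helper, and the sample body is hand-written with made-up sizes rather than captured from a real server.

```python
import json
from typing import List


def summarize_tags(body: str) -> List[str]:
    """Turn the JSON body of GET /api/tags into 'name - size' lines."""
    data = json.loads(body)
    lines = []
    for entry in data.get("models", []):
        size_gb = entry["size"] / 1e9  # bytes to gigabytes
        lines.append(f"{entry['name']} - {size_gb:.1f}GB")
    return lines


# Hand-written sample in the shape /api/tags returns; the sizes are made up
sample_body = json.dumps({
    "models": [
        {"name": "llama3.1:8b", "size": 4700000000},
        {"name": "mistral:7b", "size": 4100000000},
    ]
})
for line in summarize_tags(sample_body):
    print(line)
```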

Environment Variables for API Access

To access the API from other machines on your network:

# Allow all network interfaces
OLLAMA_HOST=0.0.0.0:11434 ollama serve

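# Or set it permanently: on Linux with systemd, run `systemctl edit ollama.service`,
# add Environment="OLLAMA_HOST=0.0.0.0:11434" under [Service], then restart the service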

Then access from other devices at http://YOUR_IP:11434.
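
Note that 0.0.0.0 is only the address the server binds to; clients connect using the machine's actual IP or hostname. A small sketch for turning whatever a user types into a usable base URL is shown below. ollama_base_url is a hypothetical helper, the addresses are placeholders, and IPv6 literals and URLs with paths are not handled.

```python
def ollama_base_url(host: str, default_port: int = 11434) -> str:
    """Turn 'ip', 'ip:port', or a full URL into a base URL for the API."""
    host = host.strip().rstrip("/")
    if not host.startswith(("http://", "https://")):
        host = "http://" + host  # assume plain HTTP when no scheme is given
    scheme, rest = host.split("://", 1)
    if ":" not in rest:
        rest = f"{rest}:{default_port}"  # fall back to Ollama's default port
    return f"{scheme}://{rest}"


print(ollama_base_url("192.168.1.50"))
```

The resulting URL can be passed to the Python library as ollama.Client(host=...), after which the client's chat() and list() methods talk to the remote machine.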

Next Steps

Modelfile Guide — create custom model configs with system prompts
llama.cpp Server Mode — alternative API server
Build a Local MoE Pipeline — multi-model orchestration