Building a Local RAG Pipeline with Ollama
RAG (Retrieval-Augmented Generation) lets a model answer questions about documents it wasn't trained on. Instead of stuffing a document into the context window, you store it in a vector database, retrieve only the relevant chunks at query time, and pass those to the model. Everything here runs locally — no cloud APIs.
How RAG Works
Document → Chunk → Embed → Vector DB
↓
Query → Embed → Similarity Search → Top-K Chunks
↓
[System Prompt + Chunks + Query] → LLM → Answer
- Chunk: split documents into overlapping segments (500–1000 tokens each is a common range; the code below measures chunk size in characters)
- Embed: convert each chunk to a vector using an embedding model
- Store: save the vectors in a vector database
- Query: embed the user's question and find the most similar chunks
- Generate: pass the retrieved chunks and the question to the LLM
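To make the similarity-search step concrete, here is a minimal sketch using toy 3-dimensional vectors in place of real embeddings. It assumes cosine similarity as the metric; a vector database may use a different distance by default, and the function names `cosine_similarity` and `top_k` are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" (hypothetical values; real models produce hundreds of dimensions)
toy_chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.05, 0.0]
best = top_k(toy_query, toy_chunks, k=2)  # indices of the two closest chunks
```

The first two toy chunks point in almost the same direction as the query, so they rank ahead of the third, which is orthogonal to it.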
Install Dependencies
pip install ollama chromadb langchain-text-splitters pypdf
Pull an Embedding Model
Ollama's model library includes embedding models. Pull one:
ollama pull nomic-embed-text # Fast, good quality, 274MB
ollama pull mxbai-embed-large # Higher quality, 670MB
Complete RAG Implementation
import ollama
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os
# ── Config ──────────────────────────────────
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.1:8b"
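CHUNK_SIZE = 500  # characters, not tokens (the splitter counts with len() by default)
CHUNK_OVERLAP = 50  # characters shared between consecutive chunks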
TOP_K = 5
# ── Setup ────────────────────────────────────
client = chromadb.Client()
collection = client.get_or_create_collection("documents")
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)
# ── Ingest a document ────────────────────────
def ingest_text(text: str, source: str = "document"):
    chunks = splitter.split_text(text)
    embeddings = []
    for chunk in chunks:
        response = ollama.embeddings(model=EMBED_MODEL, prompt=chunk)
        embeddings.append(response["embedding"])
    collection.add(
        documents=chunks,
        embeddings=embeddings,
        ids=[f"{source}_{i}" for i in range(len(chunks))],
        metadatas=[{"source": source} for _ in chunks],
    )
    print(f"Ingested {len(chunks)} chunks from {source}")
def ingest_file(filepath: str):
    ext = os.path.splitext(filepath)[1].lower()
    if ext == ".pdf":
        from pypdf import PdfReader
        reader = PdfReader(filepath)
        text = "\n".join(page.extract_text() for page in reader.pages)
    else:
        with open(filepath, "r", encoding="utf-8") as f:
            text = f.read()
    ingest_text(text, source=os.path.basename(filepath))
# ── Query ─────────────────────────────────────
def query(question: str) -> str:
    # Embed the question
    q_embedding = ollama.embeddings(model=EMBED_MODEL, prompt=question)["embedding"]
    # Retrieve top-K similar chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=TOP_K,
    )
    chunks = results["documents"][0]
    sources = [m["source"] for m in results["metadatas"][0]]
    if not chunks:
        return "No relevant documents found."
    # Build context
    context = "\n\n---\n\n".join(
        f"[Source: {src}]\n{chunk}"
        for chunk, src in zip(chunks, sources)
    )
    # Generate answer
    response = ollama.chat(
        model=CHAT_MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant that answers questions based on "
                    "provided context. Answer based only on the context given. "
                    "If the context doesn't contain the answer, say so clearly. "
                    "Cite the source when relevant."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return response["message"]["content"]
# ── Main ─────────────────────────────────────
if __name__ == "__main__":
    # Ingest documents
    ingest_file("my_document.pdf")
    ingest_file("notes.txt")
    # Query
    while True:
        q = input("\nQuestion (q to quit): ").strip()
        if q.lower() == "q":
            break
        print("\n" + query(q))
Persistent Vector Database
The above uses an in-memory ChromaDB instance. For persistence across sessions:
# Replace the client line with:
client = chromadb.PersistentClient(path="./chroma_db")
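Data is saved to disk and reloaded on the next run, so each document only needs to be ingested once. The main block above calls ingest_file on every run; skip those calls after the first run to avoid re-embedding the same files.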
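One way to enforce single ingestion is to check whether a source is already stored before embedding it again. The sketch below assumes the `source` metadata key used by the ingest code above and relies on the collection's `get` method with a `where` filter; the helper name `already_ingested` is hypothetical.

```python
def already_ingested(collection, source: str) -> bool:
    """Return True if at least one stored chunk carries this source in its metadata."""
    # limit=1 because a single match is enough to answer the question
    existing = collection.get(where={"source": source}, limit=1)
    return len(existing["ids"]) > 0
```

A caller could then guard each ingest, for example `if not already_ingested(collection, "notes.txt"): ingest_file("notes.txt")`. This only checks the file name, so an edited file with the same name would be skipped as well.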
Improving Retrieval Quality
Larger chunk overlap — increase CHUNK_OVERLAP to 100–150 for documents where context spans chunk boundaries (e.g. legal documents, technical specs).
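To see what overlap does at a chunk boundary, here is a deliberately simplified fixed-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers natural separators such as paragraphs and sentences); the function name `sliding_chunks` and the sample text are assumptions for illustration.

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into character windows of `size` that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, size=10, overlap=4)
# The last 4 characters of each piece reappear at the start of the next one
```

A sentence that straddles a boundary is more likely to appear whole in at least one chunk as the overlap grows, at the cost of storing and embedding more text.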
Reranking — after retrieving TOP_K chunks, rerank them by relevance before passing to the LLM:
# Simple reranking by keyword overlap
def rerank(chunks: list[str], question: str, top_n: int = 3) -> list[str]:
    question_words = set(question.lower().split())
    scored = [
        (chunk, len(set(chunk.lower().split()) & question_words))
        for chunk in chunks
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in scored[:top_n]]
Hybrid search — combine semantic (vector) search with keyword (BM25) search:
pip install rank-bm25
from rank_bm25 import BM25Okapi
# Build BM25 index alongside ChromaDB
all_chunks = []  # Append every chunk here at ingest time
# Build the index after ingestion, once all_chunks is populated
bm25 = BM25Okapi([chunk.split() for chunk in all_chunks])
def hybrid_query(question: str, alpha: float = 0.7) -> list[str]:
    # Semantic results
    q_embed = ollama.embeddings(model=EMBED_MODEL, prompt=question)["embedding"]
    semantic = collection.query(query_embeddings=[q_embed], n_results=10)
    # BM25 results
    bm25_scores = bm25.get_scores(question.split())
    # Combine scores (alpha = weight for semantic)
    # ... merge and rerank
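The merge step is left open above. One possible way to fill it in is a weighted sum of min-max normalized scores, sketched below. The normalization scheme and the helper names `min_max`, `fuse_scores`, and `rank_chunks` are assumptions, not something the original specifies. The sketch also assumes both score lists cover the same chunks in the same order and that higher means better, so distances returned by a vector database would need converting to similarities first (for example by negating them).

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; an all-equal list maps to 0.0 everywhere."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(semantic: list[float], keyword: list[float], alpha: float = 0.7) -> list[float]:
    """Weighted sum of normalized scores; alpha is the weight on the semantic side."""
    if len(semantic) != len(keyword):
        raise ValueError("both score lists must cover the same chunks in the same order")
    sem_n, kw_n = min_max(semantic), min_max(keyword)
    return [alpha * s + (1 - alpha) * k for s, k in zip(sem_n, kw_n)]


def rank_chunks(chunks: list[str], semantic: list[float], keyword: list[float],
                alpha: float = 0.7, top_n: int = 3) -> list[str]:
    """Return the top_n chunks ordered by fused score, best first."""
    fused = fuse_scores(semantic, keyword, alpha)
    order = sorted(range(len(chunks)), key=lambda i: fused[i], reverse=True)
    return [chunks[i] for i in order[:top_n]]
```

Normalizing first matters because BM25 scores and embedding similarities live on different scales; without it, one side would dominate regardless of alpha.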
Streaming Responses
def query_stream(question: str):
    # ... (same retrieval as above)
    for chunk in ollama.chat(
        model=CHAT_MODEL,
        messages=[...],
        stream=True,
    ):
        print(chunk["message"]["content"], end="", flush=True)
    print()
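If the caller also needs the full answer afterwards (for logging or chat history, say), the streamed pieces can be collected while they are printed. This is a small sketch that assumes each streamed item supports `chunk["message"]["content"]`, as in the loop above; the helper name `print_and_collect` is hypothetical.

```python
from typing import Iterable


def print_and_collect(stream: Iterable) -> str:
    """Print each streamed piece as it arrives and return the full answer text."""
    parts = []
    for chunk in stream:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()  # final newline once the stream ends
    return "".join(parts)
```

It would be called with the iterator returned by the streaming chat call, in place of the bare for loop.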
Performance Notes
Embedding speed with nomic-embed-text on an RTX 4090: ~500–1000 chunks per minute. For large document sets (hundreds of PDFs), run ingestion overnight.
Query latency: embedding the question takes ~50ms and ChromaDB retrieval ~10ms; LLM generation dominates, at whatever tokens per second your model sustains.
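As a rough worked example of that breakdown, the sketch below adds the fixed retrieval costs to the generation time. The 50 ms and 10 ms defaults come from the figures above; the answer length and generation speed in the example call are illustrative assumptions, and the estimate ignores the time the model spends processing the prompt before it emits the first token.

```python
def estimate_query_seconds(answer_tokens: int, tokens_per_second: float,
                           embed_ms: float = 50.0, retrieve_ms: float = 10.0) -> float:
    """Approximate end-to-end latency: embedding + retrieval + token generation."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    fixed_seconds = (embed_ms + retrieve_ms) / 1000.0
    generation_seconds = answer_tokens / tokens_per_second
    return fixed_seconds + generation_seconds


# Hypothetical numbers: a 300-token answer generated at 40 tokens per second
total = estimate_query_seconds(answer_tokens=300, tokens_per_second=40.0)
```

With those numbers, generation accounts for 7.5 s of roughly 7.56 s in total, which is why speeding up retrieval rarely changes perceived latency.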
Next Steps
- Ollama API Guide — integrate the RAG pipeline into applications
- Token Budget Calculator — plan context window usage for large documents
- Context Length Calculator — understand KV cache limits for long contexts