This page walks through building a Retrieval-Augmented Generation (RAG) pipeline in Python from first principles. We cover chunking, embeddings, vector stores, similarity search, prompt construction, and an end-to-end example using sentence-transformers, FAISS, and Anthropic Claude. The goal is a working prototype you can adapt to production.
A minimal RAG system has two phases: indexing (offline) and querying (online).
Every component is swappable. The choice of chunker, embedding model, and vector store affects retrieval quality far more than the choice of LLM.
Chunks are the unit of retrieval. Too large and you waste context tokens; too small and you lose semantic completeness. Aim for 200–800 tokens per chunk depending on the source document.
The simplest strategy: split text every N characters with optional overlap. Fast and predictable, but blind to sentence boundaries.
def fixed_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # avoid a redundant tail chunk consisting only of overlap
        start = end - overlap
    return chunks
Recursive chunking tries a list of separators in order (paragraph, sentence, word) and falls back when a chunk exceeds the size limit. This is what LangChain's RecursiveCharacterTextSplitter does and is the safe default for most prose.
def recursive_chunks(text: str, size: int = 1000,
                     separators: list[str] | None = None) -> list[str]:
    separators = separators or ["\n\n", "\n", ". ", " ", ""]
    if len(text) <= size:
        return [text]
    sep = separators[0]
    parts = text.split(sep) if sep else list(text)
    chunks, buf = [], ""
    for part in parts:
        candidate = buf + sep + part if buf else part
        if len(candidate) <= size:
            buf = candidate
        else:
            if buf:
                chunks.append(buf)
            if len(part) > size:
                # Part still too big: retry with the next, finer separator.
                chunks.extend(recursive_chunks(part, size, separators[1:]))
                buf = ""
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks
Embed each sentence, then start a new chunk when the cosine similarity between adjacent sentences falls below a threshold. Slower at index time but produces topically coherent chunks — useful for technical or legal documents.
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(text: str, model: SentenceTransformer,
                    threshold: float = 0.5) -> list[str]:
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if not sentences:
        return []
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = np.dot(embeddings[i - 1], embeddings[i])
        if sim < threshold:
            chunks.append(". ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    if current:
        chunks.append(". ".join(current))
    return chunks
Rule of thumb: start with recursive chunking at 500 tokens / 50 token overlap. Move to semantic only if retrieval quality plateaus.
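The chunkers above count characters, while the rule of thumb is stated in tokens. A minimal token-budget variant, using whitespace splitting as a stand-in tokenizer (swap in tiktoken or the embedding model's own tokenizer for real counts — the function name and defaults here are illustrative):

```python
def token_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Whitespace words as a crude token proxy; replace with a real
    # tokenizer (e.g. tiktoken) when the budget matters.
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
        start += size - overlap
    return chunks

parts = token_chunks("word " * 1200, size=500, overlap=50)
print(len(parts))  # 3 windows: tokens 0-500, 450-950, 900-1200
```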
The embedding model maps text to a fixed-length vector. Two practical options:
- Local (sentence-transformers): all-MiniLM-L6-v2 is the workhorse (384-dim, ~80MB). For higher quality use BAAI/bge-large-en-v1.5 (1024-dim).
- Hosted (OpenAI): text-embedding-3-small / -large support Matryoshka truncation and rank near the top of MTEB. text-embedding-3-small at 1536-dim is the cost/quality sweet spot; ada-002 is legacy and should not be used for new projects.

# Local embeddings
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(
    ["the quick brown fox", "a fast auburn vulpine"],
    normalize_embeddings=True,  # critical for cosine via dot product
    batch_size=32,
)
print(vectors.shape)  # (2, 384)

# OpenAI embeddings
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["the quick brown fox", "a fast auburn vulpine"],
)
vectors = [d.embedding for d in resp.data]
Always normalize embeddings if you intend to use dot product as a cosine proxy — it removes magnitude effects and makes FAISS IndexFlatIP equivalent to cosine similarity.
| Store | Best For | Avoid When |
|---|---|---|
| FAISS | Local prototypes, single-machine workloads, <10M vectors | You need filtering by metadata or multi-process writes |
| Chroma | Prototyping with metadata filters, embedded use, notebooks | Production scale — persistence layer is fragile under load |
| pgvector | Production. You already run Postgres; want SQL filters and ACID | Vectors >100M and you need sub-10ms p99 — consider Qdrant or Vespa |
| Qdrant / Weaviate | Dedicated vector DB with rich filtering and hybrid search | Operating overhead for a small project — pgvector is simpler |
FAISS example:
import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatIP(dim)  # inner product == cosine if normalized
vectors = np.array(model.encode(chunks, normalize_embeddings=True),
                   dtype=np.float32)
index.add(vectors)
faiss.write_index(index, "corpus.faiss")
pgvector example:
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    chunk_text TEXT NOT NULL,
    source TEXT,
    embedding VECTOR(1536)
);

CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
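A retrieval query against that table might look like this, sketched with a named placeholder standing in for the normalized query vector (`<=>` is pgvector's cosine-distance operator, so similarity is one minus the distance):

```sql
SELECT id, source, chunk_text,
       1 - (embedding <=> :query_embedding) AS similarity
FROM documents
ORDER BY embedding <=> :query_embedding
LIMIT 5;
```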
Cosine similarity measures the angle between vectors; dot product measures both angle and magnitude. For normalized embeddings the two are equivalent and dot product is a single multiply-add per dimension — faster on every hardware backend.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))
Practical guidance: normalize once at index time, then use dot product everywhere. The only time to use raw dot product on un-normalized vectors is when magnitude carries signal — rare in modern transformer embeddings.
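A quick sanity check of that equivalence, using random vectors as illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

# Full cosine similarity on the raw vectors.
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize once; afterwards a plain dot product reproduces cosine.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

print(np.isclose(cos, np.dot(a_n, b_n)))  # True
```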
The retrieved chunks become a context block the LLM is instructed to ground on. A robust template:
PROMPT_TEMPLATE = """You are a precise technical assistant. Answer the user's
question using ONLY the context below. If the context does not contain the
answer, say "I don't know based on the provided context."

<context>
{context}
</context>

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(
        f"[chunk {i + 1}]\n{c}" for i, c in enumerate(chunks)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)
Three things matter here: the instruction to answer ONLY from the context, which suppresses the model's prior knowledge; an explicit fallback phrase, which gives the model a safe exit instead of forcing a guess; and per-chunk labels, which let the model (and you) trace an answer back to its source.
A complete script: load documents, chunk, embed with sentence-transformers, store in FAISS, then query with Anthropic Claude.
pip install sentence-transformers faiss-cpu anthropic numpy
export ANTHROPIC_API_KEY="sk-ant-..."
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from anthropic import Anthropic

# ---------- 1. Load and chunk ----------
def load_docs(paths: list[str]) -> list[dict]:
    docs = []
    for p in paths:
        with open(p, "r", encoding="utf-8") as f:
            docs.append({"source": p, "text": f.read()})
    return docs

def recursive_chunks(text: str, size: int = 800, overlap: int = 100):
    # Fixed-size sliding window for brevity; swap in the recursive
    # splitter from the chunking section for better boundaries.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# ---------- 2. Build the index ----------
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(docs: list[dict]):
    records = []
    for d in docs:
        for i, c in enumerate(recursive_chunks(d["text"])):
            records.append({"source": d["source"], "chunk_id": i, "text": c})
    vectors = embed_model.encode(
        [r["text"] for r in records],
        normalize_embeddings=True,
        show_progress_bar=True,
    ).astype(np.float32)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index, records

# ---------- 3. Retrieve ----------
def retrieve(query: str, index, records, k: int = 4):
    q_vec = embed_model.encode([query], normalize_embeddings=True).astype(np.float32)
    scores, ids = index.search(q_vec, k)
    return [
        {**records[i], "score": float(s)}
        for s, i in zip(scores[0], ids[0])
        if i != -1
    ]

# ---------- 4. Generate ----------
client = Anthropic()

PROMPT = """Answer the question using ONLY the context. If the context is
insufficient, say so.

<context>
{context}
</context>

Question: {question}"""

def answer(question: str, hits: list[dict]) -> str:
    context = "\n\n---\n\n".join(
        f"[{h['source']} #{h['chunk_id']}]\n{h['text']}" for h in hits
    )
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": PROMPT.format(context=context, question=question),
        }],
    )
    return msg.content[0].text

# ---------- 5. Run ----------
if __name__ == "__main__":
    docs = load_docs(["docs/handbook.md", "docs/api.md"])
    index, records = build_index(docs)
    question = "How do I rotate the API signing key?"
    hits = retrieve(question, index, records, k=4)
    print(answer(question, hits))
That is a complete RAG system in roughly 80 lines. Persist the FAISS index with faiss.write_index and the records list with pickle or JSON so you only re-embed when the corpus changes.
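A sketch of the records half of that persistence, adding a corpus content hash so re-embedding can be skipped when nothing changed (function and file names here are illustrative, not part of the script above):

```python
import hashlib
import json

def corpus_hash(docs: list[dict]) -> str:
    # Hash every document's text so any edit invalidates the saved index.
    h = hashlib.sha256()
    for d in sorted(docs, key=lambda d: d["source"]):
        h.update(d["text"].encode("utf-8"))
    return h.hexdigest()

def save_records(records: list[dict], docs: list[dict],
                 path: str = "records.json") -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"hash": corpus_hash(docs), "records": records}, f)

def records_stale(docs: list[dict], path: str = "records.json") -> bool:
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)["hash"] != corpus_hash(docs)
    except FileNotFoundError:
        return True  # nothing saved yet, so rebuild
```

At startup, call `records_stale(docs)`; only when it returns True do you re-chunk, re-embed, and rewrite both the FAISS file and the records file.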
Two quick refinements: count chunk tokens with tiktoken when budget matters, and re-rank retrieved chunks with a cross-encoder (BAAI/bge-reranker-base), keeping the top 4. Once the prototype works, the next levers are hybrid search (BM25 + dense), query rewriting, metadata filters, and a re-ranker. None of those matter until the basics are solid.
Start with the conventional 512-token chunks at ~10% overlap because that's what most embedding models were tuned for. Then move based on data shape: short FAQ-style content goes smaller (200–300 tokens) so each chunk is one self-contained answer; long technical docs go larger (800–1000) so a clause or section stays intact. Always chunk on semantic boundaries first (paragraphs, headings, sentences) and only fall back to fixed-size when the unit exceeds budget. Validate with a recall metric on a held-out question set — if recall@5 is below 80%, the chunks are wrong before anything downstream matters.
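The recall check is a few lines. Given a gold set mapping each question to the id of the chunk that contains its answer, measure how often that id appears in the retriever's top k (names and data here are illustrative):

```python
def recall_at_k(gold: dict[str, str],
                retrieved: dict[str, list[str]], k: int = 5) -> float:
    # gold: question -> id of the chunk holding the answer
    # retrieved: question -> ranked chunk ids from the retriever
    hits = sum(1 for q, gold_id in gold.items() if gold_id in retrieved[q][:k])
    return hits / len(gold)

gold = {"q1": "c7", "q2": "c2", "q3": "c9"}
retrieved = {"q1": ["c7", "c1"], "q2": ["c5", "c2"], "q3": ["c4", "c8"]}
print(recall_at_k(gold, retrieved, k=2))  # 2 of 3 answers in the top 2
```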
Default to the top open-source model on the MTEB leaderboard for your size budget — bge-large-en, e5-mistral, or the latest Voyage model if a hosted API is acceptable. For domain-heavy corpora (biomedical, legal, code) check whether a domain-tuned model exists; the 5–15 point gain over a generic model usually beats fine-tuning your own. Three practical constraints: the max sequence length must accommodate your chunk size (or chunks get truncated silently), embedding dimensionality drives vector store cost (768 vs 1536 vs 3072 isn't free at billions of vectors), and query-time embeddings must come from the same model that built the index — mixing models is a silent quality killer.
For prototypes and small production (<10M vectors), pgvector wins on operational simplicity — you already have Postgres, you get transactions and joins, and HNSW is built in. For mid-scale (10–100M) with multi-tenancy, Weaviate or Qdrant give you native tenancy isolation and hybrid search out of the box. For huge scale (>100M, low-latency requirement), managed services like Pinecone or self-hosted Milvus are designed for it. Don't pick a vector DB based on benchmarks — pick on operational fit (does my team know it? can I back it up?), then verify performance in your environment.
Top-k is a tradeoff: larger k improves recall (the right chunk is more likely to be included) but degrades precision and burns tokens. The principled approach: pick the smallest k where recall@k on your gold set plateaus, typically 5–20. If you're using a re-ranker, retrieve generously (k=50–100) at the dense stage and let the re-ranker compress to the top 3–5 for the LLM. If you're not, k=5 is the sane default; adjust based on faithfulness scores. If the model misses information that sits in chunks 6–10, raise k or add re-ranking; if faithfulness drops because of distractor chunks, lower k.
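Picking k can then be mechanical: sweep k over your gold set and take the smallest value whose recall is within a tolerance of the best observed (the tolerance and the sweep values below are assumptions, not prescriptions):

```python
def smallest_plateau_k(recall_by_k: dict[int, float], tol: float = 0.01) -> int:
    # Smallest k whose recall is within `tol` of the best recall seen.
    best = max(recall_by_k.values())
    return min(k for k, r in recall_by_k.items() if r >= best - tol)

recalls = {1: 0.52, 3: 0.71, 5: 0.83, 10: 0.84, 20: 0.84}
print(smallest_plateau_k(recalls))  # 5: within 0.01 of the best recall (0.84)
```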
In rough order of frequency: (1) the chunker split the answer across boundaries; (2) the embedding model is generic but the corpus is domain-specific; (3) BM25 is missing — queries with rare exact tokens (SKUs, names, statute numbers) fail dense retrieval; (4) no metadata filters — every query searches the whole corpus when it should be scoped to a tenant/document/date range; (5) the system prompt doesn't instruct "answer only from the provided context", so the model pads with prior knowledge that contradicts the source. Fix in that order; each one is cheaper than the next.
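When you do add BM25, reciprocal rank fusion is a standard way to merge the keyword and dense rankings without tuning score scales (k=60 is the conventional constant; the doc ids below are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of doc ids, best first. A doc's fused score
    # is the sum of 1 / (k + rank) over every list it appears in.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d4"]  # dense retrieval order
bm25 = ["d2", "d3", "d5"]   # keyword retrieval order
print(rrf([dense, bm25])[0])  # d3: ranked high by both lists
```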
The signals are operational: you can't tell which prompt or model version produced which answer (need versioning + tracing); a customer reports a bad answer and you can't replay it (need request logging); the eval is a notebook and you've shipped two regressions you didn't catch (need CI eval); ingestion takes hours and you're afraid to re-run it (need incremental ingest with content-hash dedup); two people are afraid to touch the prompt because they don't know what it'll break (need a gold set and metric). Each of those is a refactor cue, not a "rewrite from scratch" cue.