This page walks through building a Retrieval-Augmented Generation (RAG) pipeline in Python from first principles. We cover chunking, embeddings, vector stores, similarity search, prompt construction, and an end-to-end example using sentence-transformers, FAISS, and Anthropic Claude. The goal is a working prototype you can adapt to production.
A minimal RAG system has two phases: indexing (offline) and querying (online).
Every component is swappable. The choice of chunker, embedding model, and vector store affects retrieval quality far more than the choice of LLM.
Chunks are the unit of retrieval. Too large and you waste context tokens; too small and you lose semantic completeness. Aim for 200–800 tokens per chunk depending on the source document.
The simplest strategy: split text every N characters with optional overlap. Fast and predictable, but blind to sentence boundaries.
def fixed_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # avoid a redundant tail chunk consisting only of overlap
        start = end - overlap
    return chunks
Recursive chunking tries a list of separators in order (paragraph, sentence, word) and falls back when a chunk exceeds the size limit. This is what LangChain's RecursiveCharacterTextSplitter does and is the safe default for most prose.
def recursive_chunks(text: str, size: int = 1000,
                     separators: list[str] | None = None) -> list[str]:
    separators = separators or ["\n\n", "\n", ". ", " ", ""]
    if len(text) <= size:
        return [text]
    sep = separators[0]
    parts = text.split(sep) if sep else list(text)
    chunks, buf = [], ""
    for part in parts:
        candidate = buf + sep + part if buf else part
        if len(candidate) <= size:
            buf = candidate
        else:
            if buf:
                chunks.append(buf)
            if len(part) > size:
                # Part still too big: retry with the next, finer separator.
                chunks.extend(recursive_chunks(part, size, separators[1:]))
                buf = ""
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks
Embed each sentence, then start a new chunk when the cosine similarity between adjacent sentences falls below a threshold. Slower at index time but produces topically coherent chunks — useful for technical or legal documents.
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(text: str, model: SentenceTransformer,
                    threshold: float = 0.5) -> list[str]:
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if not sentences:
        return []
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = np.dot(embeddings[i - 1], embeddings[i])
        if sim < threshold:
            chunks.append(". ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    if current:
        chunks.append(". ".join(current))
    return chunks
Rule of thumb: start with recursive chunking at 500 tokens / 50 token overlap. Move to semantic only if retrieval quality plateaus.
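The chunkers above count characters, while the rule of thumb is stated in tokens. A minimal token-budget variant, using whitespace splitting as a stand-in tokenizer (swap in tiktoken or the embedding model's own tokenizer for real counts — the function name and defaults here are illustrative):

```python
def token_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Whitespace words as a crude token proxy; replace with a real
    # tokenizer (e.g. tiktoken) when the budget matters.
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
        start += size - overlap
    return chunks

parts = token_chunks("word " * 1200, size=500, overlap=50)
print(len(parts))  # 3 windows: tokens 0-500, 450-950, 900-1200
```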
The embedding model maps text to a fixed-length vector. Two practical options:
- Local (sentence-transformers): all-MiniLM-L6-v2 is the workhorse (384-dim, ~80MB). For higher quality use BAAI/bge-large-en-v1.5 (1024-dim).
- Hosted (OpenAI): text-embedding-3-small / -large support Matryoshka truncation and rank near the top of MTEB. text-embedding-3-small at 1536-dim is the cost/quality sweet spot; ada-002 is legacy and should not be used for new projects.

# Local embeddings
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(
    ["the quick brown fox", "a fast auburn vulpine"],
    normalize_embeddings=True,  # critical for cosine via dot product
    batch_size=32,
)
print(vectors.shape)  # (2, 384)

# OpenAI embeddings
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["the quick brown fox", "a fast auburn vulpine"],
)
vectors = [d.embedding for d in resp.data]
Always normalize embeddings if you intend to use dot product as a cosine proxy — it removes magnitude effects and makes FAISS IndexFlatIP equivalent to cosine similarity.
| Store | Best For | Avoid When |
|---|---|---|
| FAISS | Local prototypes, single-machine workloads, <10M vectors | You need filtering by metadata or multi-process writes |
| Chroma | Prototyping with metadata filters, embedded use, notebooks | Production scale — persistence layer is fragile under load |
| pgvector | Production. You already run Postgres; want SQL filters and ACID | Vectors >100M and you need sub-10ms p99 — consider Qdrant or Vespa |
| Qdrant / Weaviate | Dedicated vector DB with rich filtering and hybrid search | Operating overhead for a small project — pgvector is simpler |
FAISS example:
import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatIP(dim)  # inner product == cosine if normalized
vectors = np.array(model.encode(chunks, normalize_embeddings=True),
                   dtype=np.float32)
index.add(vectors)
faiss.write_index(index, "corpus.faiss")
pgvector example:
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    chunk_text TEXT NOT NULL,
    source TEXT,
    embedding VECTOR(1536)
);

CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
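A retrieval query against that table might look like this, sketched with a named placeholder standing in for the normalized query vector (`<=>` is pgvector's cosine-distance operator, so similarity is one minus the distance):

```sql
SELECT id, source, chunk_text,
       1 - (embedding <=> :query_embedding) AS similarity
FROM documents
ORDER BY embedding <=> :query_embedding
LIMIT 5;
```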
Cosine similarity measures the angle between vectors; dot product measures both angle and magnitude. For normalized embeddings the two are equivalent and dot product is a single multiply-add per dimension — faster on every hardware backend.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))
Practical guidance: normalize once at index time, then use dot product everywhere. The only time to use raw dot product on un-normalized vectors is when magnitude carries signal — rare in modern transformer embeddings.
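A quick sanity check of that equivalence, using random vectors as illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

# Full cosine similarity on the raw vectors.
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize once; afterwards a plain dot product reproduces cosine.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

print(np.isclose(cos, np.dot(a_n, b_n)))  # True
```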
The retrieved chunks become a context block the LLM is instructed to ground on. A robust template:
PROMPT_TEMPLATE = """You are a precise technical assistant. Answer the user's
question using ONLY the context below. If the context does not contain the
answer, say "I don't know based on the provided context."

<context>
{context}
</context>

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(
        f"[chunk {i + 1}]\n{c}" for i, c in enumerate(chunks)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)
Three things matter here: the instruction to answer ONLY from the context, which suppresses the model's prior knowledge; an explicit fallback phrase, which gives the model a safe exit instead of forcing a guess; and per-chunk labels, which let the model (and you) trace an answer back to its source.
A complete script: load documents, chunk, embed with sentence-transformers, store in FAISS, then query with Anthropic Claude.
pip install sentence-transformers faiss-cpu anthropic numpy
export ANTHROPIC_API_KEY="sk-ant-..."
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from anthropic import Anthropic

# ---------- 1. Load and chunk ----------
def load_docs(paths: list[str]) -> list[dict]:
    docs = []
    for p in paths:
        with open(p, "r", encoding="utf-8") as f:
            docs.append({"source": p, "text": f.read()})
    return docs

def recursive_chunks(text: str, size: int = 800, overlap: int = 100):
    # Fixed-size sliding window for brevity; swap in the recursive
    # splitter from the chunking section for better boundaries.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# ---------- 2. Build the index ----------
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(docs: list[dict]):
    records = []
    for d in docs:
        for i, c in enumerate(recursive_chunks(d["text"])):
            records.append({"source": d["source"], "chunk_id": i, "text": c})
    vectors = embed_model.encode(
        [r["text"] for r in records],
        normalize_embeddings=True,
        show_progress_bar=True,
    ).astype(np.float32)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index, records

# ---------- 3. Retrieve ----------
def retrieve(query: str, index, records, k: int = 4):
    q_vec = embed_model.encode([query], normalize_embeddings=True).astype(np.float32)
    scores, ids = index.search(q_vec, k)
    return [
        {**records[i], "score": float(s)}
        for s, i in zip(scores[0], ids[0])
        if i != -1
    ]

# ---------- 4. Generate ----------
client = Anthropic()

PROMPT = """Answer the question using ONLY the context. If the context is
insufficient, say so.

<context>
{context}
</context>

Question: {question}"""

def answer(question: str, hits: list[dict]) -> str:
    context = "\n\n---\n\n".join(
        f"[{h['source']} #{h['chunk_id']}]\n{h['text']}" for h in hits
    )
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": PROMPT.format(context=context, question=question),
        }],
    )
    return msg.content[0].text

# ---------- 5. Run ----------
if __name__ == "__main__":
    docs = load_docs(["docs/handbook.md", "docs/api.md"])
    index, records = build_index(docs)
    question = "How do I rotate the API signing key?"
    hits = retrieve(question, index, records, k=4)
    print(answer(question, hits))
That is a complete RAG system in roughly 80 lines. Persist the FAISS index with faiss.write_index and the records list with pickle or JSON so you only re-embed when the corpus changes.
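A sketch of the records half of that persistence, adding a corpus content hash so re-embedding can be skipped when nothing changed (function and file names here are illustrative, not part of the script above):

```python
import hashlib
import json

def corpus_hash(docs: list[dict]) -> str:
    # Hash every document's text so any edit invalidates the saved index.
    h = hashlib.sha256()
    for d in sorted(docs, key=lambda d: d["source"]):
        h.update(d["text"].encode("utf-8"))
    return h.hexdigest()

def save_records(records: list[dict], docs: list[dict],
                 path: str = "records.json") -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"hash": corpus_hash(docs), "records": records}, f)

def records_stale(docs: list[dict], path: str = "records.json") -> bool:
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)["hash"] != corpus_hash(docs)
    except FileNotFoundError:
        return True  # nothing saved yet, so rebuild
```

At startup, call `records_stale(docs)`; only when it returns True do you re-chunk, re-embed, and rewrite both the FAISS file and the records file.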
Two quick refinements: count chunk tokens with tiktoken when budget matters, and re-rank retrieved chunks with a cross-encoder (BAAI/bge-reranker-base), keeping the top 4. Once the prototype works, the next levers are hybrid search (BM25 + dense), query rewriting, metadata filters, and a re-ranker. None of those matter until the basics are solid.
Start with the conventional 512-token chunks at ~10% overlap because that's what most embedding models were tuned for. Then move based on data shape: short FAQ-style content goes smaller (200–300 tokens) so each chunk is one self-contained answer; long technical docs go larger (800–1000) so a clause or section stays intact. Always chunk on semantic boundaries first (paragraphs, headings, sentences) and only fall back to fixed-size when the unit exceeds budget. Validate with a recall metric on a held-out question set — if recall@5 is below 80%, the chunks are wrong before anything downstream matters.
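The recall check is a few lines. Given a gold set mapping each question to the id of the chunk that contains its answer, measure how often that id appears in the retriever's top k (names and data here are illustrative):

```python
def recall_at_k(gold: dict[str, str],
                retrieved: dict[str, list[str]], k: int = 5) -> float:
    # gold: question -> id of the chunk holding the answer
    # retrieved: question -> ranked chunk ids from the retriever
    hits = sum(1 for q, gold_id in gold.items() if gold_id in retrieved[q][:k])
    return hits / len(gold)

gold = {"q1": "c7", "q2": "c2", "q3": "c9"}
retrieved = {"q1": ["c7", "c1"], "q2": ["c5", "c2"], "q3": ["c4", "c8"]}
print(recall_at_k(gold, retrieved, k=2))  # 2 of 3 answers in the top 2
```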
Default to the top open-source model on the MTEB leaderboard for your size budget — bge-large-en, e5-mistral, or the latest Voyage model if a hosted API is acceptable. For domain-heavy corpora (biomedical, legal, code) check whether a domain-tuned model exists; the 5–15 point gain over a generic model usually beats fine-tuning your own. Three practical constraints: the max sequence length must accommodate your chunk size (or chunks get truncated silently), embedding dimensionality drives vector store cost (768 vs 1536 vs 3072 isn't free at billions of vectors), and query-time embeddings must come from the same model that built the index — mixing models is a silent quality killer.
For prototypes and small production (<10M vectors), pgvector wins on operational simplicity — you already have Postgres, you get transactions and joins, and HNSW is built in. For mid-scale (10–100M) with multi-tenancy, Weaviate or Qdrant give you native tenancy isolation and hybrid search out of the box. For huge scale (>100M, low-latency requirement), managed services like Pinecone or self-hosted Milvus are designed for it. Don't pick a vector DB based on benchmarks — pick on operational fit (does my team know it? can I back it up?), then verify performance in your environment.
Top-k is a tradeoff: larger k improves recall (the right chunk is more likely to be included) but degrades precision and burns tokens. The principled approach: pick the smallest k where recall@k on your gold set plateaus, typically 5–20. If you're using a re-ranker, retrieve generously (k=50–100) at the dense stage and let the re-ranker compress to the top 3–5 for the LLM. If you're not, k=5 is the sane default; adjust based on faithfulness scores. If the model misses information that sits in chunks 6–10, raise k or add re-ranking; if faithfulness drops because of distractor chunks, lower k.
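Picking k can then be mechanical: sweep k over your gold set and take the smallest value whose recall is within a tolerance of the best observed (the tolerance and the sweep values below are assumptions, not prescriptions):

```python
def smallest_plateau_k(recall_by_k: dict[int, float], tol: float = 0.01) -> int:
    # Smallest k whose recall is within `tol` of the best recall seen.
    best = max(recall_by_k.values())
    return min(k for k, r in recall_by_k.items() if r >= best - tol)

recalls = {1: 0.52, 3: 0.71, 5: 0.83, 10: 0.84, 20: 0.84}
print(smallest_plateau_k(recalls))  # 5: within 0.01 of the best recall (0.84)
```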
In rough order of frequency: (1) the chunker split the answer across boundaries; (2) the embedding model is generic but the corpus is domain-specific; (3) BM25 is missing — queries with rare exact tokens (SKUs, names, statute numbers) fail dense retrieval; (4) no metadata filters — every query searches the whole corpus when it should be scoped to a tenant/document/date range; (5) the system prompt doesn't instruct "answer only from the provided context", so the model pads with prior knowledge that contradicts the source. Fix in that order; each one is cheaper than the next.
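When you do add BM25, reciprocal rank fusion is a standard way to merge the keyword and dense rankings without tuning score scales (k=60 is the conventional constant; the doc ids below are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of doc ids, best first. A doc's fused score
    # is the sum of 1 / (k + rank) over every list it appears in.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d4"]  # dense retrieval order
bm25 = ["d2", "d3", "d5"]   # keyword retrieval order
print(rrf([dense, bm25])[0])  # d3: ranked high by both lists
```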
The signals are operational: you can't tell which prompt or model version produced which answer (need versioning + tracing); a customer reports a bad answer and you can't replay it (need request logging); the eval is a notebook and you've shipped two regressions you didn't catch (need CI eval); ingestion takes hours and you're afraid to re-run it (need incremental ingest with content-hash dedup); two people are afraid to touch the prompt because they don't know what it'll break (need a gold set and metric). Each of those is a refactor cue, not a "rewrite from scratch" cue.