MyDocumentIntelligence.com is a document-intelligence platform I built for legal and healthcare teams — the kind of customers who cannot paste a contract or a chart note into a public chatbot. The product answers natural-language questions over a customer's own document corpus and returns answers with page-level citations, optional bounding-box highlights, and a verifiable content hash. This page describes the production architecture: what is in the box, why each piece is there, and the tradeoffs that shaped the design.
The framing I keep coming back to is frontier + local hybrid. Frontier models (Claude on Bedrock, GPT-4o on Azure) handle the hard reasoning when the customer's data classification allows it. A local stack (Llama 3.x or Mistral served on vLLM, with bge embeddings) handles everything else — either because the customer is on a self-hosted tier, or because the document was flagged as containing protected information that should never leave the tenant boundary. The same retrieval, prompt, and evaluation layers serve both sides; only the generation endpoint changes.
The two pilot customer profiles are: (a) a mid-size litigation firm whose paralegals spend hours-per-matter pulling clauses out of agreements, motions, and discovery PDFs, and (b) a healthcare operations team that needs to answer "what does this 200-page payor contract say about prior-authorization timelines for procedure X?" without uploading the contract to a public LLM. In both cases the question is not really "summarize this document" — it is "find the specific clause, quote it verbatim, and tell me which page it is on so I can defend the answer."
I designed against the failure modes I have personally watched generic chat tools commit on real legal documents: confident answers with no citation, citations pointing at the wrong page, and paraphrases presented as if they were quotes.
Off-the-shelf RAG (drop documents into a vector DB, top-k cosine, stuff the prompt) does not survive contact with this domain. Contracts have nested numbered sections, defined terms that bind across the document, scanned amendments stapled onto digital originals, and signatures on the last page that need to be retrieved when the question is "who signed this?". The architecture below is everything I learned trying to make those questions work end to end.
The pipeline is built so that every stage is independently swappable — this is what lets the same platform run a SaaS multi-tenant deployment and a self-hosted single-tenant Docker stack from the same codebase.
+-----------------------------+
|      Document Sources       |
|  S3 / SharePoint / Upload   |
+--------------+--------------+
               |
               v
+-----------------------------+
|      Ingestion Worker       |
|  (SQS-driven, idempotent)   |
+--------------+--------------+
               |
               v
+-----------------------------+
|   OCR / Layout Extraction   |
| PyMuPDF (native) or         |
| Textract / Azure DI /       |
| Tesseract (scanned)         |
+--------------+--------------+
               |
               v
+-----------------------------+
|   Section-Aware Chunking    |
| (clause/heading boundaries) |
+--------------+--------------+
               |
               v
+-----------------------------+
|          Embedding          |
|     bge-large (local) /     |
|   text-embedding-3-large    |
+--------------+--------------+
               |
               v
+-----------------------------+
|   Vector + Metadata Store   |
| Postgres + pgvector (HNSW)  |
+--------------+--------------+
               |
  query path   |
               v
+-----------------------------+
|      Hybrid Retrieval       |
|  BM25 + dense, RRF fusion   |
+--------------+--------------+
               |
               v
+-----------------------------+
|   Cross-Encoder Reranker    |
|     bge-reranker-large      |
+--------------+--------------+
               |
               v
+-----------------------------+
|         LLM Router          |
| Frontier (Claude / GPT) or  |
|   Local (Llama / Mistral)   |
+--------------+--------------+
               |
               v
+-----------------------------+
| Cited Response + Audit Log  |
|    JSON schema enforced     |
+-----------------------------+
The control plane (FastAPI + Postgres) holds tenants, users, document metadata, and audit records. The data plane (S3 + pgvector + the model endpoints) holds the actual content. The two are kept separate so I can hand a customer the data plane to run inside their VPC without giving them the multi-tenant control plane.
Ingestion is the unglamorous half of the system that determines whether everything downstream works. The corpus is heterogeneous: text-native PDFs from Word exports, scanned PDFs (some with skewed pages and stamped signatures), Word documents with tracked changes, and the occasional image attachment. The router decides per-page whether to use a fast text extractor or fall back to OCR.
The decision rule is simple: if PyMuPDF returns less than 100 characters of extractable text on a page, treat the page as scanned and route to OCR. Per-page granularity matters because one document may have native text for the body and a scanned amendment glued onto the back.
from dataclasses import dataclass
from pathlib import Path

import fitz  # PyMuPDF


@dataclass
class PageContent:
    page_number: int              # 1-indexed
    text: str
    spans: list[dict]             # [{"bbox": (x0, y0, x1, y1), "text": "..."}, ...]
    extraction_method: str        # "native" | "textract" | "azure-di" | "tesseract"
    ocr_confidence: float | None  # average per-token confidence; None for native


def ingest(path: Path, ocr_backend: str = "textract") -> list[PageContent]:
    """Per-page extraction. Falls back to OCR when native text is sparse."""
    doc = fitz.open(path)
    pages: list[PageContent] = []
    for i, page in enumerate(doc, start=1):
        native_text = page.get_text("text").strip()
        if len(native_text) >= 100:
            spans = [
                {"bbox": s["bbox"], "text": s["text"]}
                for block in page.get_text("dict")["blocks"]
                for line in block.get("lines", [])
                for s in line.get("spans", [])
            ]
            pages.append(PageContent(i, native_text, spans, "native", None))
        else:
            # Scanned page — render to PNG and OCR it
            pix = page.get_pixmap(dpi=300)
            pages.append(_ocr_page(pix.tobytes("png"), i, backend=ocr_backend))
    return pages
Every ingested page is written back to Postgres with a content SHA-256, the extraction method, and the OCR confidence. That metadata travels with the chunk all the way to the answer, which is what makes the citation chain auditable.
Fixed-size token chunking is the single biggest reason naive RAG fails on contracts. A clause that begins "Section 8.3(b)(ii) — Indemnification" and runs across a page break gets sliced in half by a 512-token splitter, the embedding for each half is mediocre, and neither half retrieves cleanly when the user asks about indemnification. The fix is to chunk by document structure first and only fall back to token-windowed splitting inside oversized sections.
The splitter walks the extracted text and produces chunks whose boundaries respect clause and heading markers — numbered headings matching a pattern like `^\s*\d+(\.\d+)*\s+[A-Z]`. Each chunk carries the metadata needed to rebuild the citation: document id, page numbers it spans, and the bounding-box hull of the source spans on each page.
import re
from dataclasses import dataclass, field

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
SECTION_RE = re.compile(r"^\s*(?:ARTICLE|SECTION|\d+(?:\.\d+)*)\s+", re.MULTILINE)


@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    text: str
    pages: list[int]
    bboxes: dict[int, tuple]  # page -> bbox hull (x0, y0, x1, y1)
    section_path: list[str] = field(default_factory=list)
    token_count: int = 0


def chunk_document(pages: list, doc_id: str,
                   max_tokens: int = 800, overlap: int = 120) -> list[Chunk]:
    """Section-aware splitter with a token-window fallback."""
    full_text = "\n".join(p.text for p in pages)
    sections = _split_on_headings(full_text)  # uses SECTION_RE
    chunks: list[Chunk] = []
    for section in sections:
        toks = ENC.encode(section.text)
        if len(toks) <= max_tokens:
            chunks.append(_make_chunk(doc_id, section, pages))
            continue
        # Section is too long — slide a window over it
        for start in range(0, len(toks), max_tokens - overlap):
            window_text = ENC.decode(toks[start:start + max_tokens])
            chunks.append(_make_chunk(doc_id, section.with_text(window_text), pages))
    return chunks
The single most important property of this chunker is that chunk.pages and chunk.bboxes are filled in correctly — they are what the UI later uses to draw the yellow highlight on the source PDF when the user clicks a citation.
The platform supports two embedding modes, picked per tenant at provisioning time:
| Mode | Model | Dim | Where it runs | Cost / 1M tokens |
|---|---|---|---|---|
| Frontier | OpenAI text-embedding-3-large | 3072 | OpenAI API | $0.13 |
| Frontier (alt) | Cohere embed-english-v3 | 1024 | Bedrock / Cohere | $0.10 |
| Local / private | BAAI/bge-large-en-v1.5 | 1024 | vLLM on g5.xlarge | ~$0.02 amortized |
| Local / fast | BAAI/bge-small-en-v1.5 | 384 | CPU on the API box | negligible |
For most legal corpora the bge-large model lands within ~2 MTEB points of the frontier alternatives and runs entirely inside the customer's network. That is the single biggest reason the local mode is viable — the embedding gap is small, and the privacy gain is total.
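Per-tenant selection can be as plain as a lookup table mirroring the matrix above — the `EmbeddingConfig` shape and the mode names here are illustrative, not the production identifiers:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    model: str
    dim: int
    local: bool  # True => runs inside the tenant boundary


EMBEDDING_MODES = {
    "frontier":     EmbeddingConfig("text-embedding-3-large", 3072, local=False),
    "frontier-alt": EmbeddingConfig("embed-english-v3", 1024, local=False),
    "local":        EmbeddingConfig("BAAI/bge-large-en-v1.5", 1024, local=True),
    "local-fast":   EmbeddingConfig("BAAI/bge-small-en-v1.5", 384, local=True),
}


def embedding_config(tenant_mode: str) -> EmbeddingConfig:
    # Chosen once at provisioning: the dimension must match the VECTOR(n)
    # column, so switching modes later means re-embedding the corpus.
    return EMBEDDING_MODES[tenant_mode]
```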
The vector store is Postgres with the pgvector extension. I evaluated FAISS (in-memory; great for prototypes, awkward for multi-tenant updates and ACL filtering), Chroma (fine for development; not production-ready for the scale I needed), Pinecone and Weaviate (excellent products, but adding a managed dependency for a feature I could get from Postgres did not justify the bill or the data-residency conversation). Postgres gives me transactional inserts, row-level security per tenant, and metadata filtering in the same query as the vector search.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunks (
chunk_id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
doc_id UUID NOT NULL,
text TEXT NOT NULL,
pages INT[] NOT NULL,
section_path TEXT[],
doc_type TEXT, -- 'contract' | 'policy' | 'medical_record' | ...
jurisdiction TEXT,
effective_date DATE,
embedding VECTOR(1024) NOT NULL,
content_sha CHAR(64) NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
-- HNSW index tuned for ~5M chunks per tenant
CREATE INDEX chunks_embedding_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Row-level security so a tenant can never see another tenant's chunks
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON chunks
USING (tenant_id = current_setting('app.tenant_id')::UUID);
HNSW parameters: m = 16 and ef_construction = 200 at index build, ef_search = 80 at query time. Those numbers came out of a sweep against my evaluation set — pushing ef_search higher gave me single-digit recall improvements at noticeable latency cost; pushing it lower started to drop the cross-encoder's input candidates below what I needed.
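At query time both knobs — ef_search and the RLS tenant id — are transaction-scoped, so they have to be issued on the same connection as the search itself. A sketch of the statement sequence, returned as (sql, params) data so the plan is inspectable without a database; the helper is illustrative, while the SQL is standard pgvector/Postgres syntax:

```python
def dense_search_plan(tenant_id: str, query_embedding: list[float],
                      limit: int = 50) -> list[tuple[str, tuple]]:
    """The statements the query path runs, in order, inside one transaction."""
    # pgvector accepts the '[x,y,...]' text form cast to vector.
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return [
        # Transaction-local: higher ef_search = better recall, more latency.
        ("SET LOCAL hnsw.ef_search = 80", ()),
        # The RLS policy reads app.tenant_id; set_config(.., true) is txn-scoped.
        ("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,)),
        # <=> is pgvector's cosine-distance operator, matching the HNSW index.
        ("SELECT chunk_id FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
         (vec, limit)),
    ]
```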
Dense retrieval alone misses queries that hinge on a specific term — "MFN clause", a docket number, a defined term like "Effective Date" used as a proper noun. BM25 nails those. Conversely, BM25 fails when the user's wording does not match the document ("when can I get out of this contract?" vs. "termination for convenience"). The two are complementary, so I run both and fuse them with Reciprocal Rank Fusion.
Reciprocal Rank Fusion is the right merge function here because it is rank-based, so I do not need to calibrate the BM25 score and the cosine similarity into the same units. I take the top 50 from each retriever, fuse, then send the top 25 through a cross-encoder reranker. The reranker is what turns "the relevant chunk is somewhere in the top 25" into "the relevant chunk is in the top 3" — which is what actually matters because the LLM context window is finite.
from collections import defaultdict
from typing import Sequence

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large", max_length=512)


def reciprocal_rank_fusion(rankings: Sequence[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Standard RRF: score(d) = sum_i 1 / (k + rank_i(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


def hybrid_retrieve(query: str, tenant_id: str, filters: dict,
                    top_k_each: int = 50, top_n_final: int = 6) -> list[dict]:
    bm25_ids = bm25_search(query, tenant_id, filters, limit=top_k_each)
    dense_ids = pgvector_search(query, tenant_id, filters, limit=top_k_each)
    fused = reciprocal_rank_fusion([bm25_ids, dense_ids])
    candidate_ids = [doc_id for doc_id, _ in fused[:25]]
    candidates = load_chunks(candidate_ids)
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n_final]]
Metadata filters (jurisdiction, doc_type, effective_date ranges) are pushed into both the BM25 query and the pgvector WHERE clause. A "find the termination clause in our California vendor MSAs signed after 2023" query narrows the candidate pool by metadata first, then runs hybrid retrieval over the narrowed set. This is the single biggest precision win after section-aware chunking.
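A hedged sketch of how a filters dict might compile into the shared WHERE fragment both retrievers append — column names come from the schema above, but the helper itself is illustrative:

```python
def compile_filters(filters: dict) -> tuple[str, list]:
    """Compile {'doc_type': ..., 'jurisdiction': ..., 'effective_after': ...}
    into a parameterized SQL fragment shared by BM25 and pgvector queries."""
    clauses: list[str] = []
    params: list = []
    if "doc_type" in filters:
        clauses.append("doc_type = %s")
        params.append(filters["doc_type"])
    if "jurisdiction" in filters:
        clauses.append("jurisdiction = %s")
        params.append(filters["jurisdiction"])
    if "effective_after" in filters:
        clauses.append("effective_date > %s")
        params.append(filters["effective_after"])
    # No filters => no-op predicate, so callers can always AND it in.
    where = " AND ".join(clauses) or "TRUE"
    return where, params
```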
The LLM router is what makes the "frontier + local hybrid" promise real. Every request carries a routing context with the tenant's tier, the document's classification, and a per-request flag indicating whether the retrieved chunks tripped any PII / PHI detector. The router resolves those into one of two endpoints:
The frontier side defaults to Claude on Bedrock (anthropic.claude-opus-4-7) for the hardest reasoning — multi-clause questions, contradiction-finding, comparative questions across many documents. GPT-4o on Azure is the secondary frontier endpoint for customers who already have Azure.

| Endpoint | p50 latency | p95 latency | $ / 1M out | Used for |
|---|---|---|---|---|
| Claude Opus 4.x (Bedrock) | 2.1s | 4.6s | $15.00 | Hard reasoning, default frontier |
| Claude Sonnet 4.x (Bedrock) | 1.1s | 2.4s | $3.00 | Most cited-answer queries |
| GPT-4o (Azure) | 1.3s | 2.9s | $10.00 | Azure-tier customers |
| Llama 3.1 70B (vLLM, g5.12xl) | 1.6s | 3.8s | ~$0.80 | Self-hosted, PHI, privileged |
| Mistral Large (vLLM) | 1.4s | 3.2s | ~$0.70 | Local fallback |
from dataclasses import dataclass
from enum import Enum


class Endpoint(Enum):
    CLAUDE_OPUS = "anthropic.claude-opus-4-7"
    CLAUDE_SONNET = "anthropic.claude-sonnet-4-7"
    GPT_4O = "azure.gpt-4o"
    LLAMA_LOCAL = "vllm.llama3.1-70b-instruct"


@dataclass
class RouteContext:
    tenant_tier: str          # "saas" | "self-hosted"
    doc_classification: str   # "public" | "confidential" | "phi" | "privileged"
    pii_in_context: bool      # any retrieved chunk flagged by detector
    question_complexity: str  # "simple" | "comparative" | "multi-doc"
    allow_cross_region: bool


def route(ctx: RouteContext) -> Endpoint:
    must_be_local = (
        ctx.tenant_tier == "self-hosted"
        or ctx.doc_classification in ("phi", "privileged")
        or ctx.pii_in_context
        or not ctx.allow_cross_region
    )
    if must_be_local:
        return Endpoint.LLAMA_LOCAL
    if ctx.question_complexity == "multi-doc":
        return Endpoint.CLAUDE_OPUS
    return Endpoint.CLAUDE_SONNET
The router is a hard gate, not a recommendation. A request that resolves to LLAMA_LOCAL never instantiates an outbound HTTP client to the frontier providers — the network calls are not even reachable from that code path. That is what lets me put "your data never leaves your VPC" in the contract and mean it.
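The question_complexity signal comes from a lightweight classifier. Purely for illustration, a keyword heuristic in the same spirit — the production signal is a trained model, and these cue lists are mine, not the system's:

```python
COMPARATIVE_CUES = ("compare", "difference between", "versus", " vs ")
MULTI_DOC_CUES = ("across all", "every agreement", "all contracts", "each of")


def classify_complexity(question: str) -> str:
    """Map a question to "simple" | "comparative" | "multi-doc"."""
    q = question.lower()
    if any(cue in q for cue in MULTI_DOC_CUES):
        return "multi-doc"
    if any(cue in q for cue in COMPARATIVE_CUES):
        return "comparative"
    return "simple"
```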
The system prompt does three things and only three things: it sets the role, it specifies the refusal policy when the retrieved context does not support an answer, and it enforces the citation contract. Every other behavior I want is in the user prompt or, more importantly, in the JSON schema the response has to match.
SYSTEM_PROMPT = """You are a document-intelligence assistant for legal and
healthcare professionals. You answer questions ONLY using the provided
document excerpts. You do not use prior knowledge of any specific contract,
case, statute, or patient record.
Rules, in order of precedence:
1. If the provided excerpts do not contain enough information to answer the
question, return answer = null and explanation describing exactly what is
missing. Do NOT guess.
2. Every factual claim in your answer must be supported by a citation that
names the chunk_id, page number, and a verbatim supporting_quote of less
than 240 characters from that chunk.
3. Never reproduce more than 240 characters of source text in any single
field. If the user asks for the full clause, instruct them to view the
original document via the citation.
4. If the question asks for legal advice or a clinical recommendation,
answer the underlying factual question and add a one-sentence note that
the user should confirm with a licensed professional.
"""
The output is constrained by a Pydantic schema that I pass to the model as a tool definition (Anthropic tool-use) or a structured-output schema (OpenAI). The model cannot produce free-form prose — it has to produce a JSON object that matches the schema, every time. That is the single highest-leverage change I made for production reliability.
from typing import Literal

from pydantic import BaseModel, Field


class Citation(BaseModel):
    chunk_id: str
    doc_id: str
    page: int = Field(ge=1)
    supporting_quote: str = Field(max_length=240)
    confidence: float = Field(ge=0.0, le=1.0)


class StructuredAnswer(BaseModel):
    answer: str | None
    citations: list[Citation]
    explanation: str
    refusal_reason: Literal["insufficient_context", "out_of_scope"] | None = None


ANSWER_TOOL = {
    "name": "submit_answer",
    "description": "Submit the final answer in the required structured form.",
    "input_schema": StructuredAnswer.model_json_schema(),
}


def call_claude_with_schema(client, model_id, system, user, retrieved):
    return client.messages.create(
        model=model_id,
        system=system,
        max_tokens=1500,
        temperature=0,
        tools=[ANSWER_TOOL],
        tool_choice={"type": "tool", "name": "submit_answer"},
        messages=[{
            "role": "user",
            "content": _format_context(user, retrieved),
        }],
    )
Forcing tool_choice to the answer tool means the model has no path to produce free text. Combined with temperature=0 and the Pydantic validator on the way out, the failure mode "model returns prose I cannot parse" went from a measurable slice of production traffic to zero.
Every citation in the response carries a chunk_id that resolves through the chunks table to a doc_id, a list of pages, and the bounding-box hull on each page. The web UI uses those bounding boxes to draw a yellow highlight on the source PDF rendered alongside the answer. From the user's perspective: they read the answer, click "page 14", the PDF panel scrolls to page 14 and highlights the cited region. That single interaction was what convinced the legal pilot customer to sign.
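Because supporting_quote must be verbatim, the server can verify every citation mechanically before rendering: the quote either appears in the cited chunk's text or the citation is rejected. A sketch — the helper name and return shape are mine:

```python
def verify_citations(citations: list[dict],
                     chunks_by_id: dict[str, str]) -> list[str]:
    """Return the chunk_ids of citations that fail verification:
    unknown chunk, over-length quote, or quote not found verbatim."""
    failures: list[str] = []
    for c in citations:
        source = chunks_by_id.get(c["chunk_id"])
        quote = c["supporting_quote"]
        if source is None or len(quote) > 240 or quote not in source:
            failures.append(c["chunk_id"])
    return failures
```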
For the audit story I add a per-response provenance record:
import hashlib
import json
import time


def write_audit_record(tenant_id: str, request_id: str, question: str,
                       retrieved_chunks: list[dict], structured_answer: dict,
                       model_id: str) -> dict:
    """One immutable record per answered question."""
    payload = {
        "request_id": request_id,
        "tenant_id": tenant_id,
        "ts": int(time.time()),
        "model_id": model_id,
        "question": question,
        "retrieved": [
            {"chunk_id": c["chunk_id"], "doc_id": c["doc_id"],
             "content_sha": c["content_sha"]}
            for c in retrieved_chunks
        ],
        "answer": structured_answer,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    payload["content_hash"] = hashlib.sha256(canonical).hexdigest()
    s3_put_object_lock(payload, bucket="mdi-audit", retention_days=2555)
    return payload
The audit record goes to an S3 bucket with Object Lock in compliance mode and a seven-year retention. A customer who later needs to prove "on this date, given these source documents, the system returned this answer" can produce the record, recompute the SHA, and verify nothing was edited. That is what "verifiable" means in this product, and it is most of why the regulated-industry conversations actually go anywhere.
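Verification is the mirror image of write_audit_record: pop the hash, re-canonicalize, recompute, compare. A sketch of what the customer-side check looks like:

```python
import hashlib
import json


def verify_audit_record(record: dict) -> bool:
    """True iff the record's content_hash matches its canonical payload."""
    payload = dict(record)  # don't mutate the caller's copy
    claimed = payload.pop("content_hash", None)
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return claimed == hashlib.sha256(canonical).hexdigest()
```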
There are two evaluation regimes that run continuously: an offline gold dataset and a nightly RAGAS sweep against a held-out portion of each tenant's corpus.
The gold dataset is human-labeled: a question, the documents from which the answer must come, the expected page number, and the expected supporting quote. Roughly 250 questions across legal and healthcare, expanded as the customers send me the queries that broke. For each question I check whether the expected chunk is retrieved, whether the cited page matches, whether the supporting quote is reproduced verbatim, and — for questions the corpus cannot answer — whether the system correctly returns answer = null.
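A sketch of the per-question scoring, assuming each gold row carries the expected page and quote — the field names here are illustrative, not the production schema:

```python
def score_gold_row(expected: dict, got: dict) -> dict:
    """Per-question checks: page accuracy, quote fidelity, and correct
    refusal on questions the corpus cannot answer."""
    if expected.get("unanswerable"):
        return {"correct_refusal": got.get("answer") is None}
    cited_pages = {c["page"] for c in got.get("citations", [])}
    quotes = [c["supporting_quote"] for c in got.get("citations", [])]
    return {
        "page_hit": expected["page"] in cited_pages,
        # Accept either direction of containment so a longer gold quote
        # still matches a trimmed (<=240 char) citation quote.
        "quote_verbatim": any(expected["quote"] in q or q in expected["quote"]
                              for q in quotes),
        "answered": got.get("answer") is not None,
    }
```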
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)


def run_ragas(eval_rows: list[dict]) -> dict:
    """Each row: {question, answer, contexts: list[str], ground_truth}."""
    ds = Dataset.from_list(eval_rows)
    result = evaluate(
        ds,
        metrics=[faithfulness, answer_relevancy,
                 context_precision, context_recall],
    )
    return result.to_pandas().mean(numeric_only=True).to_dict()
Run output is shipped to LangSmith for the SaaS deployment and to Phoenix for self-hosted (Phoenix runs in-cluster, no outbound calls). The threshold I gate releases on: faithfulness must not drop more than 0.02 from the previous release on the gold set; if it does, the deployment does not promote.
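The gate itself is deliberately tiny — the 0.02 budget is the one from the text, while the function shape is mine:

```python
def should_promote(baseline: dict, current: dict,
                   max_faithfulness_drop: float = 0.02) -> bool:
    """Block promotion if faithfulness regressed beyond the budget."""
    return baseline["faithfulness"] - current["faithfulness"] <= max_faithfulness_drop
```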
The SaaS deployment is FastAPI on AWS Fargate, fronted by CloudFront with WAF rules for rate limiting and basic prompt-injection patterns. Postgres + pgvector runs on RDS (db.r6g.2xlarge for the current load). Documents live in S3 with bucket-level KMS, server-side encryption, and a per-tenant prefix. Secrets are in AWS Secrets Manager, retrieved at boot via the task's IAM role — no credentials in environment files.
The local-LLM endpoint (Llama 3.1 70B) is vLLM on a single g5.12xlarge for SaaS customers who opted into the local-only tier. The same Docker image runs on a customer-provided GPU box for fully self-hosted deployments — that is the entire point of building local-first.
# docker-compose.yml — self-hosted single-tenant deployment
services:
  api:
    image: ghcr.io/mydocintel/api:1.14.0
    environment:
      - DB_URL=postgresql://app:${DB_PASSWORD}@postgres:5432/mdi
      - LLM_BACKEND=vllm
      - VLLM_ENDPOINT=http://vllm:8000/v1
      - EMBEDDING_BACKEND=bge-large
      - DEPLOYMENT_MODE=self-hosted
      - ALLOW_FRONTIER=false
    depends_on: [postgres, vllm]
    ports: ["8443:8443"]
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - pg-data:/var/lib/postgresql/data
  vllm:
    image: vllm/vllm-openai:v0.6.3
    command: >
      --model meta-llama/Meta-Llama-3.1-70B-Instruct
      --tensor-parallel-size 4
      --max-model-len 16384
      --gpu-memory-utilization 0.92
    deploy:
      resources:
        reservations:
          devices: [{ driver: nvidia, count: 4, capabilities: [gpu] }]
volumes:
  pg-data:
The short command sequence below is what the customer runs to bring up the self-hosted stack on a fresh EC2 g5.12xlarge with the NVIDIA container toolkit installed. That deliberate shortness is itself a feature — legal IT teams will not adopt anything that takes a week to install.
# On a fresh g5.12xlarge with Docker + nvidia-container-toolkit
git clone https://github.com/mydocintel/self-hosted.git
cd self-hosted
cp .env.sample .env && vi .env # set DB_PASSWORD, license key, KMS key id
docker compose pull
docker compose up -d
./scripts/healthcheck.sh # verifies api, db, vllm are all green
Structured logging is non-negotiable for a system that has to produce an audit trail. Every request emits a single JSON log line with the request id, tenant id, model id, retrieved chunk ids, token counts (input / output / total), latency, and the boolean refusal flag. I use those records to build a per-tenant cost dashboard (input tokens × price + output tokens × price, summed by day) and a latency dashboard with p50 / p95 / p99 by endpoint.
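One request, one line. A sketch of the emitter with the fields named above — the exact shape is illustrative:

```python
import json
import time


def request_log_line(request_id: str, tenant_id: str, model_id: str,
                     chunk_ids: list[str], tokens_in: int, tokens_out: int,
                     latency_ms: int, refused: bool) -> str:
    """One JSON log line per answered request, ready for a log pipeline."""
    return json.dumps({
        "ts": int(time.time()),
        "request_id": request_id,
        "tenant_id": tenant_id,
        "model_id": model_id,
        "retrieved_chunk_ids": chunk_ids,
        "tokens": {"input": tokens_in, "output": tokens_out,
                   "total": tokens_in + tokens_out},
        "latency_ms": latency_ms,
        "refused": refused,
    }, sort_keys=True)
```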
PII / PHI detection runs on both sides of the LLM call. On the input side I run Microsoft Presidio across the retrieved chunks; if any high-confidence detection lands the request is rerouted to the local model regardless of the original routing decision. On the output side, the same detector runs on the structured answer and any leaked identifiers cause a hard refusal — the user sees a "this answer was suppressed because it contained protected information" message rather than the leak.
from presidio_analyzer import AnalyzerEngine

ANALYZER = AnalyzerEngine()
PHI_ENTITIES = {"US_SSN", "MEDICAL_LICENSE", "US_DRIVER_LICENSE",
                "PERSON", "DATE_TIME", "PHONE_NUMBER", "EMAIL_ADDRESS"}


def scan_for_pii(text: str, threshold: float = 0.6) -> list[dict]:
    results = ANALYZER.analyze(text=text, language="en",
                               entities=list(PHI_ENTITIES))
    return [
        {"type": r.entity_type, "score": r.score,
         "start": r.start, "end": r.end}
        for r in results if r.score >= threshold
    ]


def enforce_pii_routing(ctx, retrieved_chunks):
    for chunk in retrieved_chunks:
        if scan_for_pii(chunk["text"]):
            ctx.pii_in_context = True
            return ctx
    return ctx
For the SaaS tier I also attach Bedrock Guardrails to the Claude calls as a defense in depth — denied topics, profanity, and prompt-injection detection. Rate limiting is per-tenant token-bucket in Redis (60 questions / minute soft cap, configurable). The audit log goes to S3 Object Lock in compliance mode so that even an account compromise cannot delete prior records before the retention period expires.
To close: the design decisions I get asked about most, how each one was iterated into its current form, and what I would change next.
I started with the conventional 512-token chunks at 64-token overlap and ran a small RAGAS eval on a held-out set of contract questions. Legal text has long, defined-term-heavy sentences, so smaller chunks fragmented clauses across boundaries and dropped context_recall. I ended up at ~800 tokens with 100-token overlap and a hard rule never to split inside a numbered clause — I parse the document into clause-level units first and only re-chunk if a single clause exceeds the budget. The eval moved faithfulness up about 6 points after that change.
Legal queries contain a lot of exact tokens that embeddings under-weight — party names, statute citations like "12 U.S.C. § 1841", section numbers, and defined terms in quotes. BM25 nails those; dense vectors handle the paraphrase cases ("can the lessee assign?" vs "assignment by tenant"). I fuse the two ranked lists with Reciprocal Rank Fusion (k=60), which avoids having to calibrate score scales per corpus, then a cross-encoder reranks the fused candidates down to the handful we send to the LLM. Recall@50 from retrieval and nDCG@5 after rerank are tracked separately because they answer different questions.
Routing is on three signals: document sensitivity (customer-flagged "do not send to third party" forces a local Llama-3.1 70B on a self-hosted vLLM endpoint), query complexity (a small classifier flags multi-document synthesis vs simple lookup), and cost budget per tenant. The default is Claude Sonnet for everyday Q&A because the price/quality is hard to beat; Opus is reserved for synthesis across >5 chunks. Every routed call logs the chosen model, latency, and token counts so I can run a monthly review and re-tune the thresholds.
I use pgvector with a tenant_id column on the chunks table and a btree index on (tenant_id, document_id). Every retrieval query has WHERE tenant_id = $1 enforced at the application layer and again as a Postgres row-level security policy — defense in depth, because a missing filter is a data-leak bug. For the few large customers I move them to their own schema so their HNSW index isn't competing with smaller tenants for shared_buffers. Weaviate has native multi-tenancy that's nicer ergonomically but pgvector wins on operational simplicity at my scale.
I keep a gold set of about 200 (question, document, expected_answer, expected_chunks) tuples that I built by hand with a paralegal. CI runs RAGAS on every PR that touches retrieval, prompts, or model versions — faithfulness, answer_relevance, context_precision, context_recall — and compares against a baseline_scores.json checked into the repo. Regressions block the merge unless I explicitly bump the baseline with a justification. I also run a weekly LLM-as-judge pairwise comparison on production traffic samples to catch slow drift the gold set wouldn't see.
Three things. First, I'd add a small fine-tuned clause classifier at ingest so the head of the question distribution (termination, indemnification, governing law) becomes a database lookup instead of a retrieval round-trip — faster, cheaper, more deterministic. Second, I'd build a knowledge-graph layer alongside pgvector for cross-document entity questions ("every agreement where Acme granted exclusivity") because pure vector retrieval can't aggregate. Third, I'd adopt MCP for customer-system integrations so I'm not writing a custom adapter every time a customer wants their case-management system in scope.