RAG systems fail in interesting ways: the retriever returns the wrong chunks, the model ignores the chunks it did get, the model invents citations, the chunks are right but split mid-sentence, the user's question is ambiguous. A single end-to-end "is the answer good?" score collapses all of these into one number that tells you nothing about which dial to turn. RAGAS gives you four numbers instead, each isolating one part of the pipeline.
This page walks through the four core RAGAS metrics, how to build a usable eval dataset, how to run LLM-as-judge without fooling yourself, and how to wire eval into CI.
RAGAS (Es et al., 2023) decomposes RAG quality into four orthogonal axes. Two are reference-free (faithfulness, answer_relevancy); the other two involve a reference answer: context_recall requires a ground_truth, and context_precision can use one when available.
| Metric | Question it answers | Inputs | What a low score means |
|---|---|---|---|
| faithfulness | Is every claim in the answer supported by the retrieved context? | question, answer, contexts | The model is hallucinating or going beyond the docs. |
| answer_relevancy | Does the answer actually address the question (not just say something true)? | question, answer | The answer is on-topic-ish but doesn't answer what was asked. |
| context_precision | Of the retrieved chunks, how many are actually relevant, and are the relevant ones ranked high? | question, contexts (and ground_truth if available) | Retriever is dragging in noise; reranker is mis-ordering. |
| context_recall | Did the retriever fetch all the chunks needed to answer? | question, contexts, ground_truth | The right chunks are not in your top-K; tune K, embeddings, or chunking. |
Mental model: faithfulness + answer_relevancy grade the generator. context_precision + context_recall grade the retriever. If faithfulness is low but context_recall is high, the model is ignoring good chunks — fix the prompt. If both context metrics are low, fix the retriever first.
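The arithmetic behind two of these is easy to internalize: faithfulness is the fraction of extracted answer claims the judge marks as supported, and context_precision is a rank-weighted precision (average of precision@k at each relevant position). A dependency-free sketch of just the scoring math — in RAGAS itself, the claim extraction and relevance verdicts come from LLM calls:

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims the judge marked as grounded in the context."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0

def context_precision(chunk_relevant: list[bool]) -> float:
    """Average precision@k over the positions of relevant chunks (rank-aware,
    so a relevant chunk buried at rank 5 counts less than one at rank 1)."""
    hits, total = 0, 0.0
    for k, rel in enumerate(chunk_relevant, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0

# 3 of 4 claims grounded; relevant chunks at ranks 1 and 3 of the retrieved list.
print(faithfulness_score([True, True, True, False]))  # 0.75
print(context_precision([True, False, True]))
```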
Eval cases come from two complementary sources: hand-curated questions written by people who know the corpus, and synthetic cases generated by an LLM from your documents.
Aim for ~50 hand-curated cases at minimum — enough to see metric movement, small enough to label well. Tag each case with a category (factual lookup, multi-hop, summarization, edge case) so you can break out scores per category.
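For the hand-curated half, a flat JSONL file is enough. An illustrative record — the field names here are an assumption, not a RAGAS requirement, so match whatever your eval runner expects:

```json
{"question": "How many vacation days do new hires get?", "ground_truth": "15 days in year 1.", "category": "factual_lookup", "source_doc": "handbook/pto.md"}
```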
```python
# Synthetic eval generation with Claude.
import anthropic, json

client = anthropic.Anthropic()

PROMPT = """Read the document below and generate {n} diverse evaluation questions.
For each, return a JSON object with:
  question     - a question a user might actually ask
  ground_truth - the answer, taken verbatim or paraphrased from the document
  source_span  - the exact substring of the document that supports the answer
  difficulty   - "easy" (single sentence), "medium" (multi-sentence), "hard" (multi-section synthesis)
Return a JSON array. Document:
---
{doc}
---"""

def gen_eval_cases(doc: str, n: int = 10) -> list[dict]:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=[{"role": "user", "content": PROMPT.format(n=n, doc=doc)}],
    )
    return json.loads(resp.content[0].text)
```
Always have a human review the synthetic set before treating its scores as ground truth. The model that wrote the questions cannot grade itself — use a different judge model where possible.
Almost every RAGAS metric is implemented as an LLM call. The judge is the most overlooked failure point in any eval pipeline.
```python
# Minimal pairwise judge with position randomization.
import random, json, anthropic

client = anthropic.Anthropic()

JUDGE_PROMPT = """You are an evaluator. Given a question and two answers (A and B), decide which
is more faithful to the provided context. Respond with JSON:
{{"winner": "A" | "B" | "tie", "reason": "one sentence"}}
Question: {q}
Context: {ctx}
Answer A: {a}
Answer B: {b}"""

def judge(question, context, ans1, ans2):
    # Randomize which answer is shown first to neutralize position bias.
    if random.random() < 0.5:
        a, b, swap = ans1, ans2, False
    else:
        a, b, swap = ans2, ans1, True
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=300,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=question, ctx=context, a=a, b=b)}],
    )
    out = json.loads(resp.content[0].text)
    # Map the verdict back to the caller's original ordering.
    if swap and out["winner"] in ("A", "B"):
        out["winner"] = "B" if out["winner"] == "A" else "A"
    return out
```
Offline eval (a fixed golden set, scored on demand) and online eval (scoring sampled production traffic) are both necessary. Offline alone tells you "we didn't break the cases we already had." Online alone is too slow and noisy to gate deployments. Run offline as a CI gate, online as a continuous monitor.
| Tool | Strengths | Best for |
|---|---|---|
| RAGAS | Pure-Python metrics library, framework-agnostic, plays well with HF datasets. | Offline eval scripts and CI. |
| TruLens | Trace recording + metric "feedback functions" you write or pick from a library. | Local dev iteration with rich traces. |
| DeepEval | pytest-style API, large built-in metric set, GitHub Actions integration. | Engineers who want unit-test-style eval ergonomics. |
| LangSmith | Hosted traces + datasets + automated eval; tightly integrated with LangChain/LangGraph. | Teams already on the LangChain stack. |
| Phoenix / Arize | OpenTelemetry-based traces, in-notebook eval UI, online monitoring. | Production observability with eval overlay. |
```shell
pip install ragas datasets langchain-openai
```
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Each row: question, the answer your RAG system produced, the contexts it retrieved,
# and the human-written ground_truth answer.
data = {
    "question": [
        "What is our 2026 parental leave policy?",
        "How many vacation days do new hires get?",
    ],
    "answer": [
        "16 weeks of paid leave for primary caregivers and 8 weeks for secondary caregivers.",
        "New hires get 15 days of PTO in their first year.",
    ],
    "contexts": [
        ["Effective Jan 2026, primary caregivers receive 16 weeks paid leave; secondary caregivers receive 8 weeks."],
        ["All new hires accrue 15 PTO days during year 1, increasing to 20 in year 2."],
    ],
    "ground_truth": [
        "Primary caregivers receive 16 weeks paid leave; secondary caregivers receive 8 weeks.",
        "15 days in year 1.",
    ],
}
dataset = Dataset.from_dict(data)

# Strong judge model + same-team embeddings.
judge = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-large"))

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=judge,
    embeddings=emb,
)
print(result)
# {'faithfulness': 1.000, 'answer_relevancy': 0.962,
#  'context_precision': 1.000, 'context_recall': 1.000}

df = result.to_pandas()
df.to_csv("eval_results.csv", index=False)
```
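To get the per-category breakdown suggested earlier, join your category tags onto the per-sample results and aggregate. With pandas this is a one-line `groupby`; here is a dependency-free sketch (the `category` tag is something you supply with each eval case, not a RAGAS output):

```python
from collections import defaultdict

# Hypothetical per-sample rows: (category tag, per-sample faithfulness score).
rows = [
    ("factual_lookup", 1.0), ("factual_lookup", 0.9),
    ("multi_hop", 0.6), ("multi_hop", 0.7),
]

by_category = defaultdict(list)
for category, score in rows:
    by_category[category].append(score)

means = {cat: sum(s) / len(s) for cat, s in by_category.items()}
print(means)  # a weak multi_hop slice is invisible in the overall average
```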
Run RAGAS on every PR that touches the RAG pipeline and fail the build on regressions.
```yaml
# .github/workflows/rag-eval.yml
name: rag-eval
on:
  pull_request:
    paths: ["rag/**", "prompts/**", "eval/**"]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -r requirements.txt
      - name: Run RAG over eval set
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python eval/run_rag.py --in eval/golden.jsonl --out eval/preds.jsonl
      - name: Score with RAGAS
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python eval/score.py --preds eval/preds.jsonl --out eval/scores.json
      - name: Compare to baseline
        run: |
          python eval/compare.py \
            --baseline eval/baseline_scores.json \
            --current eval/scores.json \
            --threshold 0.02  # fail if any metric drops > 2 points
      - uses: actions/upload-artifact@v4
        with: { name: eval-results, path: eval/scores.json }
```
Keep eval/baseline_scores.json in the repo and update it deliberately when you accept a regression (e.g., trading 1 point of faithfulness for 5 points of context_recall). The diff in the PR is the conversation.
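The core of a compare script can be as small as a dict diff. A sketch of the check, assuming the score files are flat metric-name-to-score JSON; the function name is illustrative, and you would wire in your own argument parsing around it:

```python
def regressions(baseline: dict[str, float], current: dict[str, float],
                threshold: float) -> list[str]:
    """Names of metrics whose score dropped by more than `threshold`
    relative to the committed baseline. A missing metric counts as a drop to 0."""
    return [
        metric for metric, base in baseline.items()
        if base - current.get(metric, 0.0) > threshold
    ]

old = {"faithfulness": 0.92, "context_recall": 0.80}
new = {"faithfulness": 0.88, "context_recall": 0.81}
print(regressions(old, new, threshold=0.02))  # ['faithfulness']
```

Exiting nonzero when the returned list is non-empty is what makes the CI step fail.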
Faithfulness measures whether claims in the answer are grounded in the retrieved context — catches hallucinations. Answer relevance measures whether the answer addresses the question (uses reverse-question generation: ask the LLM to write the question implied by the answer, then cosine-compare to the original). Context precision measures whether retrieved chunks are actually relevant (signal-to-noise of retrieval). Context recall measures whether the chunks covered everything needed to construct the ground-truth answer. Faithfulness + answer relevance evaluate the generator; precision + recall evaluate the retriever.
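The reverse-question trick behind answer relevance reduces to a mean cosine similarity. A sketch of the scoring step, with toy 2-D vectors standing in for real embeddings (in RAGAS the generated questions come from an LLM and the vectors from your embedding model):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def answer_relevancy(original_q: list[float], generated_qs: list[list[float]]) -> float:
    """Mean similarity between the question actually asked and the questions
    an LLM infers *from the answer*. Evasive answers imply different questions."""
    return sum(cosine(original_q, g) for g in generated_qs) / len(generated_qs)

# One implied question matches the original, one is orthogonal -> middling score.
print(answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```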
Three techniques. First, use a different model family as judge than as generator — a Claude-judge on a GPT-generated answer, or vice versa — because models prefer their own style. Second, randomize position when the metric is pairwise (LLMs have a strong "first answer wins" bias). Third, calibrate periodically: have a human label 50–100 examples and check the judge's agreement with humans (Cohen's kappa); if it drops below ~0.6, re-prompt or switch judge. For high-stakes deployments, ensemble multiple judge models and take majority vote.
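The calibration check is cheap to implement once you have matched human/judge labels. A sketch of Cohen's kappa (the "pass"/"fail" label values are illustrative):

```python
def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Human-judge agreement, corrected for the agreement expected by chance."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(human, judge))  # 0.5 -- below the ~0.6 bar, time to re-prompt
```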
Bootstrap with a synthesizer: feed RAGAS or a custom prompt your corpus and have an LLM generate (question, answer, source_chunks) tuples, then hand-review and edit. 200–500 high-quality examples beats 5,000 noisy ones. Stratify across question types (factoid, multi-hop, comparison, "no answer") and document types so a single bad slice can't dominate the score. Keep a separate "regression set" of every real-world bad answer a customer reported — those are the cases that actually matter.
I run RAGAS as a GitHub Actions job on every PR that touches the prompts/, retrieval/, or models config directory — gated by path filters so docs PRs don't trigger it. The job loads the gold set, runs the new pipeline, scores all four metrics, and diffs against eval/baseline_scores.json in the repo. Any metric dropping more than 1.5 points fails the check; the dev either fixes the regression or commits a new baseline with a justification in the PR description. Results are uploaded as an artifact so the diff is reviewable.
Usually one of: the prompt doesn't tell the model "answer only from the provided context", so it pads with prior knowledge; the chunks contain conflicting information and the model picks the more popular fact rather than the cited one; or the answer paraphrases so loosely the claim-extractor can't match it back to the source. Fixes in order of effort: tighten the system prompt with explicit "if not in context, say I don't know"; add a re-ranker so the most-supporting chunk is at the top; switch to a larger model — small models hallucinate more even with good context.
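The lowest-effort fix is usually prompt wording. An illustrative grounding instruction — the exact phrasing and refusal string are assumptions to adapt to your product voice:

```python
GROUNDED_SYSTEM_PROMPT = """Answer using ONLY the provided context.
Rules:
- If the context does not contain the answer, reply: "I don't know based on the available documents."
- After each claim, cite the chunk it came from, e.g. [chunk 2].
- Never add facts from your background knowledge, even when you are confident."""
```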
Each RAGAS run is N samples × ~5 LLM calls per sample, so 500 samples on GPT-4o is real money per CI run. Mitigations: use a cheaper judge (GPT-4o-mini or Claude Haiku) for nightly runs, reserve the strong judge for release gates. Sample stratified subsets for PR-level checks (~50 examples) and run the full set on main. Cache judge responses by (sample_id, pipeline_hash) so re-runs on the same code are free. Track $/run as its own CI metric so cost regressions get caught.
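The caching idea is a few lines wrapped around the judge call. A sketch keyed on `(sample_id, pipeline_hash)`; the on-disk layout and names are assumptions, not a standard:

```python
import hashlib, json
from pathlib import Path

CACHE_DIR = Path(".judge_cache")  # hypothetical cache location

def cached_judge(sample_id: str, pipeline_hash: str, judge_fn):
    """Skip the LLM call when this (sample, pipeline) pair was already scored.
    pipeline_hash should change whenever prompts, models, or retrieval change."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{sample_id}:{pipeline_hash}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    verdict = judge_fn()  # the expensive LLM call
    path.write_text(json.dumps(verdict))
    return verdict
```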