RAG systems fail in interesting ways: the retriever returns the wrong chunks, the model ignores the chunks it did get, the model invents citations, the chunks are right but split mid-sentence, the user's question is ambiguous. A single end-to-end "is the answer good?" score collapses all of these into one number that tells you nothing about which dial to turn. RAGAS gives you four numbers instead, each isolating one part of the pipeline.
This page walks through the four core RAGAS metrics, how to build a usable eval dataset, how to run LLM-as-judge without fooling yourself, and how to wire eval into CI.
RAGAS (Es et al., 2023) decomposes RAG quality into four orthogonal axes. Two are reference-free (faithfulness, answer_relevancy); the other two involve a reference answer: context_recall requires a ground_truth, and context_precision can use one when available.
| Metric | Question it answers | Inputs | What a low score means |
|---|---|---|---|
| faithfulness | Is every claim in the answer supported by the retrieved context? | question, answer, contexts | The model is hallucinating or going beyond the docs. |
| answer_relevancy | Does the answer actually address the question (not just say something true)? | question, answer | The answer is on-topic-ish but doesn't answer what was asked. |
| context_precision | Of the retrieved chunks, how many are actually relevant, and are the relevant ones ranked high? | question, contexts (and ground_truth if available) | Retriever is dragging in noise; reranker is mis-ordering. |
| context_recall | Did the retriever fetch all the chunks needed to answer? | question, contexts, ground_truth | The right chunks are not in your top-K; tune K, embeddings, or chunking. |
Mental model: faithfulness + answer_relevancy grade the generator. context_precision + context_recall grade the retriever. If faithfulness is low but context_recall is high, the model is ignoring good chunks — fix the prompt. If both context metrics are low, fix the retriever first.
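The arithmetic behind two of these is easy to internalize: faithfulness is the fraction of extracted answer claims the judge marks as supported, and context_precision is a rank-weighted precision (average of precision@k at each relevant position). A dependency-free sketch of just the scoring math — in RAGAS itself, the claim extraction and relevance verdicts come from LLM calls:

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims the judge marked as grounded in the context."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0

def context_precision(chunk_relevant: list[bool]) -> float:
    """Average precision@k over the positions of relevant chunks (rank-aware,
    so a relevant chunk buried at rank 5 counts less than one at rank 1)."""
    hits, total = 0, 0.0
    for k, rel in enumerate(chunk_relevant, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0

# 3 of 4 claims grounded; relevant chunks at ranks 1 and 3 of the retrieved list.
print(faithfulness_score([True, True, True, False]))  # 0.75
print(context_precision([True, False, True]))
```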
Eval cases come from two complementary sources: hand-curated questions written by people who know the corpus, and synthetic cases generated by an LLM from your documents.
Aim for ~50 hand-curated cases at minimum — enough to see metric movement, small enough to label well. Tag each case with a category (factual lookup, multi-hop, summarization, edge case) so you can break out scores per category.
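For the hand-curated half, a flat JSONL file is enough. An illustrative record — the field names here are an assumption, not a RAGAS requirement, so match whatever your eval runner expects:

```json
{"question": "How many vacation days do new hires get?", "ground_truth": "15 days in year 1.", "category": "factual_lookup", "source_doc": "handbook/pto.md"}
```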
```python
# Synthetic eval generation with Claude.
import anthropic, json

client = anthropic.Anthropic()

PROMPT = """Read the document below and generate {n} diverse evaluation questions.
For each, return a JSON object with:
  question     - a question a user might actually ask
  ground_truth - the answer, taken verbatim or paraphrased from the document
  source_span  - the exact substring of the document that supports the answer
  difficulty   - "easy" (single sentence), "medium" (multi-sentence), "hard" (multi-section synthesis)
Return a JSON array. Document:
---
{doc}
---"""

def gen_eval_cases(doc: str, n: int = 10) -> list[dict]:
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=[{"role": "user", "content": PROMPT.format(n=n, doc=doc)}],
    )
    return json.loads(resp.content[0].text)
```
Always have a human review the synthetic set before treating its scores as ground truth. The model that wrote the questions cannot grade itself — use a different judge model where possible.
Almost every RAGAS metric is implemented as an LLM call. The judge is the most overlooked failure point in any eval pipeline.
```python
# Minimal pairwise judge with position randomization.
import random, json, anthropic

client = anthropic.Anthropic()

JUDGE_PROMPT = """You are an evaluator. Given a question and two answers (A and B), decide which
is more faithful to the provided context. Respond with JSON:
{{"winner": "A" | "B" | "tie", "reason": "one sentence"}}
Question: {q}
Context: {ctx}
Answer A: {a}
Answer B: {b}"""

def judge(question, context, ans1, ans2):
    # Randomize which answer is shown first to neutralize position bias.
    if random.random() < 0.5:
        a, b, swap = ans1, ans2, False
    else:
        a, b, swap = ans2, ans1, True
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=300,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=question, ctx=context, a=a, b=b)}],
    )
    out = json.loads(resp.content[0].text)
    # Map the verdict back to the caller's original ordering.
    if swap and out["winner"] in ("A", "B"):
        out["winner"] = "B" if out["winner"] == "A" else "A"
    return out
```
Offline eval (a fixed golden set, scored on demand) and online eval (scoring sampled production traffic) are both necessary. Offline alone tells you "we didn't break the cases we already had." Online alone is too slow and noisy to gate deployments. Run offline as a CI gate, online as a continuous monitor.
| Tool | Strengths | Best for |
|---|---|---|
| RAGAS | Pure-Python metrics library, framework-agnostic, plays well with HF datasets. | Offline eval scripts and CI. |
| TruLens | Trace recording + metric "feedback functions" you write or pick from a library. | Local dev iteration with rich traces. |
| DeepEval | pytest-style API, large built-in metric set, GitHub Actions integration. | Engineers who want unit-test-style eval ergonomics. |
| LangSmith | Hosted traces + datasets + automated eval; tightly integrated with LangChain/LangGraph. | Teams already on the LangChain stack. |
| Phoenix / Arize | OpenTelemetry-based traces, in-notebook eval UI, online monitoring. | Production observability with eval overlay. |
```shell
pip install ragas datasets langchain-openai
```
```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Each row: question, the answer your RAG system produced, the contexts it retrieved,
# and the human-written ground_truth answer.
data = {
    "question": [
        "What is our 2026 parental leave policy?",
        "How many vacation days do new hires get?",
    ],
    "answer": [
        "16 weeks of paid leave for primary caregivers and 8 weeks for secondary caregivers.",
        "New hires get 15 days of PTO in their first year.",
    ],
    "contexts": [
        ["Effective Jan 2026, primary caregivers receive 16 weeks paid leave; secondary caregivers receive 8 weeks."],
        ["All new hires accrue 15 PTO days during year 1, increasing to 20 in year 2."],
    ],
    "ground_truth": [
        "Primary caregivers receive 16 weeks paid leave; secondary caregivers receive 8 weeks.",
        "15 days in year 1.",
    ],
}
dataset = Dataset.from_dict(data)

# Strong judge model + same-team embeddings.
judge = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))
emb = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-large"))

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=judge,
    embeddings=emb,
)
print(result)
# {'faithfulness': 1.000, 'answer_relevancy': 0.962,
#  'context_precision': 1.000, 'context_recall': 1.000}

df = result.to_pandas()
df.to_csv("eval_results.csv", index=False)
```
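To get the per-category breakdown suggested earlier, join your category tags onto the per-sample results and aggregate. With pandas this is a one-line `groupby`; here is a dependency-free sketch (the `category` tag is something you supply with each eval case, not a RAGAS output):

```python
from collections import defaultdict

# Hypothetical per-sample rows: (category tag, per-sample faithfulness score).
rows = [
    ("factual_lookup", 1.0), ("factual_lookup", 0.9),
    ("multi_hop", 0.6), ("multi_hop", 0.7),
]

by_category = defaultdict(list)
for category, score in rows:
    by_category[category].append(score)

means = {cat: sum(s) / len(s) for cat, s in by_category.items()}
print(means)  # a weak multi_hop slice is invisible in the overall average
```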
Run RAGAS on every PR that touches the RAG pipeline and fail the build on regressions.
```yaml
# .github/workflows/rag-eval.yml
name: rag-eval
on:
  pull_request:
    paths: ["rag/**", "prompts/**", "eval/**"]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -r requirements.txt
      - name: Run RAG over eval set
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python eval/run_rag.py --in eval/golden.jsonl --out eval/preds.jsonl
      - name: Score with RAGAS
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python eval/score.py --preds eval/preds.jsonl --out eval/scores.json
      - name: Compare to baseline
        run: |
          python eval/compare.py \
            --baseline eval/baseline_scores.json \
            --current eval/scores.json \
            --threshold 0.02  # fail if any metric drops > 2 points
      - uses: actions/upload-artifact@v4
        with: { name: eval-results, path: eval/scores.json }
```
Keep eval/baseline_scores.json in the repo and update it deliberately when you accept a regression (e.g., trading 1 point of faithfulness for 5 points of context_recall). The diff in the PR is the conversation.
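The core of a compare script can be as small as a dict diff. A sketch of the check, assuming the score files are flat metric-name-to-score JSON; the function name is illustrative, and you would wire in your own argument parsing around it:

```python
def regressions(baseline: dict[str, float], current: dict[str, float],
                threshold: float) -> list[str]:
    """Names of metrics whose score dropped by more than `threshold`
    relative to the committed baseline. A missing metric counts as a drop to 0."""
    return [
        metric for metric, base in baseline.items()
        if base - current.get(metric, 0.0) > threshold
    ]

old = {"faithfulness": 0.92, "context_recall": 0.80}
new = {"faithfulness": 0.88, "context_recall": 0.81}
print(regressions(old, new, threshold=0.02))  # ['faithfulness']
```

Exiting nonzero when the returned list is non-empty is what makes the CI step fail.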
Faithfulness measures whether claims in the answer are grounded in the retrieved context — catches hallucinations. Answer relevance measures whether the answer addresses the question (uses reverse-question generation: ask the LLM to write the question implied by the answer, then cosine-compare to the original). Context precision measures whether retrieved chunks are actually relevant (signal-to-noise of retrieval). Context recall measures whether the chunks covered everything needed to construct the ground-truth answer. Faithfulness + answer relevance evaluate the generator; precision + recall evaluate the retriever.
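The reverse-question trick behind answer relevance reduces to a mean cosine similarity. A sketch of the scoring step, with toy 2-D vectors standing in for real embeddings (in RAGAS the generated questions come from an LLM and the vectors from your embedding model):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def answer_relevancy(original_q: list[float], generated_qs: list[list[float]]) -> float:
    """Mean similarity between the question actually asked and the questions
    an LLM infers *from the answer*. Evasive answers imply different questions."""
    return sum(cosine(original_q, g) for g in generated_qs) / len(generated_qs)

# One implied question matches the original, one is orthogonal -> middling score.
print(answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```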
Three techniques. First, use a different model family as judge than as generator — a Claude-judge on a GPT-generated answer, or vice versa — because models prefer their own style. Second, randomize position when the metric is pairwise (LLMs have a strong "first answer wins" bias). Third, calibrate periodically: have a human label 50–100 examples and check the judge's agreement with humans (Cohen's kappa); if it drops below ~0.6, re-prompt or switch judge. For high-stakes deployments, ensemble multiple judge models and take majority vote.
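The calibration check is cheap to implement once you have matched human/judge labels. A sketch of Cohen's kappa (the "pass"/"fail" label values are illustrative):

```python
def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Human-judge agreement, corrected for the agreement expected by chance."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(human, judge))  # 0.5 -- below the ~0.6 bar, time to re-prompt
```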
Bootstrap with a synthesizer: feed RAGAS or a custom prompt your corpus and have an LLM generate (question, answer, source_chunks) tuples, then hand-review and edit. 200–500 high-quality examples beats 5,000 noisy ones. Stratify across question types (factoid, multi-hop, comparison, "no answer") and document types so a single bad slice can't dominate the score. Keep a separate "regression set" of every real-world bad answer a customer reported — those are the cases that actually matter.
I run RAGAS as a GitHub Actions job on every PR that touches the prompts/, retrieval/, or models config directory — gated by path filters so docs PRs don't trigger it. The job loads the gold set, runs the new pipeline, scores all four metrics, and diffs against eval/baseline_scores.json in the repo. Any metric dropping more than 1.5 points fails the check; the dev either fixes the regression or commits a new baseline with a justification in the PR description. Results are uploaded as an artifact so the diff is reviewable.
Usually one of: the prompt doesn't tell the model "answer only from the provided context", so it pads with prior knowledge; the chunks contain conflicting information and the model picks the more popular fact rather than the cited one; or the answer paraphrases so loosely the claim-extractor can't match it back to the source. Fixes in order of effort: tighten the system prompt with explicit "if not in context, say I don't know"; add a re-ranker so the most-supporting chunk is at the top; switch to a larger model — small models hallucinate more even with good context.
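The lowest-effort fix is usually prompt wording. An illustrative grounding instruction — the exact phrasing and refusal string are assumptions to adapt to your product voice:

```python
GROUNDED_SYSTEM_PROMPT = """Answer using ONLY the provided context.
Rules:
- If the context does not contain the answer, reply: "I don't know based on the available documents."
- After each claim, cite the chunk it came from, e.g. [chunk 2].
- Never add facts from your background knowledge, even when you are confident."""
```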
Each RAGAS run is N samples × ~5 LLM calls per sample, so 500 samples on GPT-4o is real money per CI run. Mitigations: use a cheaper judge (GPT-4o-mini or Claude Haiku) for nightly runs, reserve the strong judge for release gates. Sample stratified subsets for PR-level checks (~50 examples) and run the full set on main. Cache judge responses by (sample_id, pipeline_hash) so re-runs on the same code are free. Track $/run as its own CI metric so cost regressions get caught.
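The caching idea is a few lines wrapped around the judge call. A sketch keyed on `(sample_id, pipeline_hash)`; the on-disk layout and names are assumptions, not a standard:

```python
import hashlib, json
from pathlib import Path

CACHE_DIR = Path(".judge_cache")  # hypothetical cache location

def cached_judge(sample_id: str, pipeline_hash: str, judge_fn):
    """Skip the LLM call when this (sample, pipeline) pair was already scored.
    pipeline_hash should change whenever prompts, models, or retrieval change."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{sample_id}:{pipeline_hash}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    verdict = judge_fn()  # the expensive LLM call
    path.write_text(json.dumps(verdict))
    return verdict
```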