MAR 6, 2026 LEARNING 8 MIN READ

How We Think About RAG Evaluation (After Getting It Wrong Several Times)

RAG evaluation seems straightforward until you try to do it well. We document three phases of our evaluation approach, each prompted by a failure the previous method missed.

By Wenable Labs

RAG evaluation seems straightforward until you actually try to do it well. Retrieve some documents, generate an answer, check if the answer is correct. In theory, this is a solved problem. In practice, we have gone through three distinct phases of how we evaluate our retrieval-augmented generation systems, and each phase was prompted by a production failure that the previous approach completely missed.

This is not a post about how to build a RAG system. We have written about that elsewhere. This is a post about the harder, less discussed problem: how do you know whether your RAG system is actually working? Not in a demo. Not on a curated test set. In production, at scale, across domains where wrong answers carry real consequences.

We are documenting this evolution honestly because the progression from naive to rigorous evaluation is a path that most teams walk, and the failures along the way are predictable. If this spares another team a production incident, it will have been worth writing.

Phase 1: Vibes-Based Evaluation (and Why It Failed)

Our first RAG evaluation approach, if it can be called an approach at all, was manual inspection. Run a handful of representative queries, read the generated outputs, and decide whether they “look right.” The team would gather around a screen, skim through responses, and make a collective judgment. If the answers seemed reasonable and the sources looked plausible, the system passed.

This worked for demos. It impressed stakeholders. It got us through initial reviews and pilot approvals. It did not work in production.

The problems with vibes-based evaluation are structural, not incidental. Subjective quality assessment varies by reviewer: what one engineer considers a good answer, another might flag as incomplete. There is no consistency, no baseline, and no way to track quality over time. Beyond subjectivity, scale makes manual review impossible. You cannot read 10,000 queries. You cannot even read 500 with the attention they deserve. And the most dangerous failure mode in RAG is exactly the one that manual review is worst at catching: the plausible but wrong answer. When a response has the right format, the right tone, and cites sources that exist, it takes genuine domain expertise and careful reading to notice that the content is subtly incorrect.

We learned this lesson the hard way. We deployed a fleet compliance RAG system designed to answer questions about FMCSA and DOT regulations. It passed every round of manual review. The answers were well-structured, cited specific CFR sections, and read like authoritative regulatory guidance. The problem was that the retrieval layer was pulling regulations from the wrong jurisdiction. State-level rules were being served in response to federal compliance queries. The format was correct. The citations were real. The content was wrong. A compliance officer caught it three weeks into production. No amount of vibes-based evaluation would have detected this because the failure was in the retrieval, not the generation, and the generation was skilled enough to make bad retrieval look good.

Phase 2: Automated Metrics (and Their Limitations)

After the fleet compliance incident, we built a proper automated evaluation pipeline. This was a significant investment, and it represented a genuine step forward. We implemented evaluation across two dimensions: retrieval quality and generation quality.

On the retrieval side, we tracked precision@k (of the k documents retrieved, how many are actually relevant to the query?), recall@k (of all relevant documents in the corpus, how many did we retrieve?), and mean reciprocal rank (is the most relevant document ranked first, or buried in position four?). These metrics require a labeled evaluation set: query-document pairs whose relevance has been judged by a human, which is itself a substantial effort to create and maintain.
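The three retrieval metrics above can be sketched in a few lines. This is a minimal illustration against a labeled evaluation set, not our production code; the function names and example document IDs are invented for the sketch.

```python
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Of the top-k retrieved docs, what fraction are relevant?"""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Of all judged-relevant docs, what fraction appear in the top k?"""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1/rank of the first relevant doc; 0 if none was retrieved.
    Averaging this over all queries gives mean reciprocal rank (MRR)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Example: retrieval returned d3, d1, d9; the judged-relevant set is {d1, d4}.
retrieved = ["d3", "d1", "d9"]
relevant = {"d1", "d4"}
print(precision_at_k(retrieved, relevant, 3))  # 1 of 3 retrieved are relevant
print(recall_at_k(retrieved, relevant, 3))     # 1 of 2 relevant were retrieved
print(reciprocal_rank(retrieved, relevant))    # first hit at rank 2 -> 0.5
```

Note that precision and recall deliberately pull in opposite directions: retrieving more documents can only help recall but usually hurts precision, which is why we track both at a fixed k.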

On the generation side, we measured faithfulness (does the generated answer actually use the retrieved context, or does the model hallucinate beyond it?), relevance (does the response address the question that was asked?), and groundedness (can each factual claim in the response be traced back to a specific passage in the retrieved documents?). We used RAGAS as a starting framework and supplemented it with custom evaluation scripts tailored to our domains.
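To make the faithfulness metric concrete, here is a stripped-down sketch of the idea: split the answer into claims and score the fraction that are supported by the retrieved context. In a real pipeline (and in frameworks like RAGAS) the support check is an NLI model or an LLM call; the substring check below is a stand-in so the sketch runs on its own, and the regulatory example text is invented.

```python
def split_claims(answer: str) -> list[str]:
    """Naive claim extraction: one claim per sentence."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(claim: str, context: str) -> bool:
    """Placeholder entailment check; swap in an NLI model or judge LLM."""
    return claim.lower() in context.lower()

def faithfulness(answer: str, context: str) -> float:
    """Fraction of claims in the answer grounded in the retrieved context."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    return sum(is_supported(c, context) for c in claims) / len(claims)

context = ("Drivers must complete a pre-trip inspection. "
           "Inspection logs are kept for six months.")
answer = ("Drivers must complete a pre-trip inspection. "
          "Inspection logs are kept for one year.")
print(faithfulness(answer, context))  # 1 of 2 claims grounded -> 0.5
```

The example also shows why faithfulness alone is not enough: an answer can score well on this metric while still omitting context an expert would consider essential, which is exactly the gap the pharmaceutical incident exposed.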

This approach caught many issues that manual review never would have. Retrieval precision regressions after re-indexing became immediately visible. Faithfulness scores flagged cases where the model was generating plausible content not grounded in the retrieved context. We could run evaluations on hundreds of queries in minutes, track metrics over time, and set threshold alerts for quality degradation.

But automated metrics have their own blind spots, and those blind spots are systematic. The metrics told us quality was high on a pharmaceutical quality system we were building. Faithfulness scores were above 0.9. Retrieval precision was strong. The system was, by every automated measure, performing well. Then we put the outputs in front of domain experts: quality assurance professionals with decades of experience in pharmaceutical manufacturing. They flagged nearly a third of the responses as misleading.

The responses were technically correct. They cited real documents. They did not hallucinate. But they missed critical context. A question about deviation investigation would receive an answer that accurately described one aspect of the investigation framework while omitting regulatory requirements that any experienced QA professional would consider essential. The answers were not wrong. They were dangerously incomplete. And no automated metric we had built could distinguish between a complete answer and a technically-correct-but-missing-critical-context answer because that distinction requires domain expertise that no metric encodes.

Automated metrics measure surface quality. They do not measure whether the answer would lead a domain expert to the right decision. That gap is where the most consequential failures hide.

Phase 3: Hybrid Evaluation (Where We Are Now)

Our current evaluation framework combines three methods, each compensating for the blind spots of the others. It is not elegant. It is not fully automated. But it catches failures that neither vibes-based review nor automated metrics alone can detect.

Automated metrics on every production query. Every query that passes through a production RAG system is evaluated automatically for retrieval precision, faithfulness, and groundedness. This layer provides broad coverage and fast feedback. It catches regressions immediately: if a re-indexing job corrupts chunk boundaries or an embedding model update shifts the retrieval distribution, the metrics surface it within hours. This is the early warning system.

LLM-as-judge evaluation on a sampled subset. We use a separate language model to evaluate a random sample of production queries on dimensions that simple metrics cannot capture: answer completeness, appropriate caveats and qualifications, logical coherence, and whether the response addresses the actual intent behind the query rather than just its literal text. This layer catches nuance that retrieval and faithfulness scores miss. It is not perfect (the judge model has its own biases and limitations), but it bridges the gap between raw metrics and human judgment at a fraction of the cost of full human review.
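The shape of an LLM-as-judge layer is mostly a rubric prompt plus a parser for the judge's structured verdict. The sketch below assumes a `call_judge` callable standing in for whatever model client you use; the rubric dimensions mirror the ones listed above, and all names are illustrative rather than our actual implementation.

```python
import json

JUDGE_PROMPT = """You are evaluating a RAG system's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Score each dimension from 1 (poor) to 5 (excellent) and reply as JSON:
{{"completeness": int, "caveats": int, "coherence": int, "intent": int}}"""

def judge_response(question: str, context: str, answer: str, call_judge) -> dict:
    """Format the rubric prompt, call the judge model, parse its JSON verdict."""
    raw = call_judge(JUDGE_PROMPT.format(
        question=question, context=context, answer=answer))
    scores = json.loads(raw)
    dims = ("completeness", "caveats", "coherence", "intent")
    return {d: int(scores[d]) for d in dims}

# Usage with a stubbed judge model in place of a real API client:
stub = lambda prompt: '{"completeness": 3, "caveats": 2, "coherence": 5, "intent": 4}'
print(judge_response("q", "ctx", "ans", stub))
```

Keeping the judge behind a callable like this also makes it cheap to swap judge models, which matters because judge biases are themselves something you want to audit over time.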

Domain expert review on a smaller sample. For every domain we serve, a qualified subject matter expert reviews a rotating sample of production outputs. The sample size scales with the stakes: pharmaceutical and regulatory compliance systems get a larger expert review window than internal knowledge management tools. This is the most expensive layer and the slowest, but it is the only layer that catches domain-specific failures missing context, inappropriate generalizations, or technically correct answers that would lead a practitioner to the wrong conclusion.

Drift monitoring across all layers. We track every metric over time and alert on sustained degradation. A single bad query is noise. A week-over-week decline in faithfulness scores, or a drop in domain expert agreement rate, is a signal that something in the pipeline has shifted (new documents, changed query patterns, model updates) and requires investigation.
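The core of the drift check is simple: compare a trailing window of a metric against the prior window and alert only on a sustained drop, never on a single bad day. The window size and threshold below are illustrative, not our production values.

```python
from statistics import mean

def drift_alert(daily_scores: list[float], window: int = 7,
                min_drop: float = 0.05) -> bool:
    """Alert when the average of the last `window` days falls at least
    `min_drop` below the average of the `window` days before that.
    `daily_scores` is ordered oldest to newest."""
    if len(daily_scores) < 2 * window:
        return False  # not enough history to compare two full windows
    recent = mean(daily_scores[-window:])
    baseline = mean(daily_scores[-2 * window:-window])
    return (baseline - recent) >= min_drop

# Two flat weeks of faithfulness scores: no alert.
print(drift_alert([0.92] * 14))              # False
# A week-over-week decline from 0.92 to 0.85: alert.
print(drift_alert([0.92] * 7 + [0.85] * 7))  # True
```

The same function works for any of the tracked metrics, including the trailing hallucination rate, by inverting the sign convention (a sustained rise in hallucination rate is the degradation).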

The weighting across these layers is not uniform. For high-stakes domains like pharmaceutical quality and fleet compliance, the expert review sample is larger and the thresholds for automated metrics are tighter. For lower-stakes internal tools, automated metrics carry more weight and expert review is less frequent. The cost of evaluation should be proportional to the cost of being wrong.

The Evaluation Metrics We Track

For teams building their own RAG evaluation framework, here is the practical list of metrics we have converged on after considerable iteration:

  • Retrieval precision@5 and recall@5. Measured against a maintained evaluation set of query-document relevance judgments. Precision tells us whether we are retrieving noise. Recall tells us whether we are missing signal.
  • Response faithfulness score (0-1). The proportion of claims in the generated response that are grounded in the retrieved context. Anything below 0.85 triggers review.
  • Citation accuracy. Does the cited source actually support the specific claim it is attached to? This is distinct from faithfulness: an answer can be faithful to the retrieved context in general while attaching a specific citation to the wrong claim.
  • Answer completeness. Did the response address all parts of the question? Evaluated by the LLM-as-judge layer, scored on a 1-5 scale.
  • Hallucination rate. The percentage of responses in a given period that contain at least one claim not supported by the retrieved context. We track this as a trailing seven-day average.
  • Domain expert agreement rate. On the expert-reviewed sample, what percentage of responses does the domain expert rate as accurate and complete? This is our ground truth signal. When automated metrics say quality is high but expert agreement drops, we trust the experts.

No single metric tells the full story. The value is in the relationships between them: when faithfulness is high but expert agreement is low, the problem is completeness or context, not hallucination. When precision is high but faithfulness drops, the model is ignoring good retrieval results. The diagnostic value comes from reading the metrics together.
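The diagnostic relationships above can be sketched as a few rules over a metrics snapshot. The thresholds and failure-mode labels here are illustrative, chosen to match the examples in the text rather than taken from our production configuration.

```python
def diagnose(m: dict) -> str:
    """Map a metrics snapshot to the most likely failure mode."""
    if m["faithfulness"] >= 0.85 and m["expert_agreement"] < 0.70:
        return "completeness gap: answers grounded but missing expert-critical context"
    if m["precision_at_5"] >= 0.80 and m["faithfulness"] < 0.85:
        return "generation ignoring good retrieval results"
    if m["precision_at_5"] < 0.80:
        return "retrieval regression: check index and embeddings"
    return "no dominant failure signal"

# High faithfulness, low expert agreement: the pharmaceutical-style failure.
snapshot = {"precision_at_5": 0.88, "faithfulness": 0.91, "expert_agreement": 0.62}
print(diagnose(snapshot))
```

Encoding the rules explicitly, even in this crude form, has a side benefit: it forces the team to agree in writing on what each metric combination actually means before an incident, rather than arguing about it during one.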

What We Still Get Wrong

RAG evaluation is not a solved problem for us. We are still learning. Our evaluation sets drift out of date as corpora change. Our LLM-as-judge layer occasionally disagrees with domain experts in ways we cannot fully explain. The cost of expert review limits the sample sizes we can practically maintain. And there are entire failure categories (temporal reasoning errors, multi-hop retrieval failures, subtle entity confusion) that none of our current methods catch reliably.

RAG evaluation is not a one-time activity. It is a continuous discipline that evolves alongside the systems it measures. The approach that works today will miss failures that emerge tomorrow as data changes, queries evolve, and models are updated. Build evaluation into the pipeline, not around it. Treat it as infrastructure, not as a checkpoint. And above all, never trust vibes.