Retrieval-Augmented Generation is the most commonly attempted enterprise AI pattern. It is also the most commonly underestimated.
The basic concept is simple enough. Chunk your documents, embed them into vectors, store them in a vector database, retrieve the most relevant chunks at query time, and feed them to a language model as context. Every tutorial makes this look like an afternoon project. In controlled demos with clean, homogeneous documents and predictable queries, it works remarkably well.
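Stripped to its essentials, the loop fits in a few lines. The sketch below uses a toy bag-of-words similarity in place of a trained embedding model, purely to make the flow concrete; every name in it is illustrative, not a real library API.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words term count. A real system
    would call a trained embedding model here."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank stored chunks by similarity to the query; keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Feed retrieved chunks to the language model as context."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
```

Everything that follows in this article is about why this twenty-line version is not what ships.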
Production is a different matter entirely.
The gap between a working RAG demo and a system that enterprise teams actually trust is enormous. Documents are messy, heterogeneous, and constantly changing. Queries are ambiguous. Retrieval quality degrades silently. Failure modes are subtle: the system does not crash, it simply returns confident answers grounded in the wrong context.
We have built RAG pipelines across device management (WeGuard), fleet operations under FMCSA and DOT regulations, and pharmaceutical quality systems involving handwritten deviation reports and batch records. These domains share almost nothing in terms of content, but the engineering challenges are strikingly similar. The lessons that follow come directly from production.
The Hard Problems Tutorials Skip
Most RAG tutorials assume a corpus of clean text documents. Production corpora look nothing like this. Here are the five problems that consistently surface once a RAG system leaves the prototype stage.
Heterogeneous document types. A real enterprise knowledge base includes PDFs with complex layouts, scanned handwritten documents, structured database records, API responses, spreadsheets, and configuration files. In our pharmaceutical quality work, a single investigation might reference a handwritten deviation report, a structured batch record from a LIMS database, and a regulatory guidance document from the FDA. Each document type requires a different parsing strategy, and a universal chunking approach produces poor results across all of them.
Documents that change frequently. Enterprise knowledge is not static. Device management policies update weekly. FMCSA regulations receive amendments and interpretive guidance on a rolling basis. Quality SOPs are revised after every significant deviation. A RAG system that does not handle document versioning, incremental re-indexing, and stale-chunk eviction will silently serve outdated information, which, in regulated industries, creates compliance risk.
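One way to approach versioning and stale-chunk eviction is to hash document content and re-chunk only on change, evicting the previous version's chunks first. A minimal sketch with illustrative names (a production index would also re-embed and update the vector store):

```python
import hashlib

class IncrementalIndex:
    """Sketch: re-index a document only when its content hash changes,
    evicting the stale chunks of the old version before inserting."""

    def __init__(self):
        self.doc_hashes = {}  # doc_id -> content hash of indexed version
        self.chunks = {}      # chunk_id -> (doc_id, chunk text)

    def upsert(self, doc_id: str, text: str, chunker) -> bool:
        """Returns True if the document was (re)indexed, False if unchanged."""
        digest = hashlib.sha256(text.encode()).hexdigest()
        if self.doc_hashes.get(doc_id) == digest:
            return False  # unchanged: skip re-chunking and re-embedding
        # evict stale chunks belonging to the previous version
        self.chunks = {cid: v for cid, v in self.chunks.items() if v[0] != doc_id}
        for i, chunk in enumerate(chunker(text)):
            self.chunks[f"{doc_id}:{i}"] = (doc_id, chunk)
        self.doc_hashes[doc_id] = digest
        return True
```

The hash comparison also makes full-corpus refresh runs cheap: unchanged documents are skipped entirely.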
Chunking strategy matters more than embedding model choice. This is perhaps the most underappreciated lesson. We have seen teams spend weeks evaluating embedding models while using naive fixed-size chunking that splits sentences mid-clause, separates headers from their content, and destroys the structural relationships within documents. Switching from fixed-size to semantic chunking has consistently produced larger retrieval quality gains than any embedding model swap.
Retrieval quality degrades silently. Unlike a traditional search engine where users immediately notice bad results, RAG systems obscure retrieval failures behind fluent language model outputs. The model generates a plausible-sounding answer even when the retrieved context is irrelevant. Without continuous retrieval evaluation, quality can erode for weeks before anyone notices.
Domain-specific terminology requires specialized treatment. General-purpose embedding models struggle with technical vocabulary. Terms like “OOS” (out-of-specification), “CAPA” (corrective and preventive action), or device model identifiers like “SM-G998B” do not embed well in models trained primarily on web text. In fleet compliance, a query about “HOS violations” needs to retrieve documents about Hours of Service, not documents that happen to contain the substring “hos” in other contexts.
Our RAG Architecture
After iterating across multiple domains, we have converged on a pipeline architecture that addresses these challenges systematically. The design is not novel in any single component; the value is in how the components interact and where we invest engineering effort.
Document Processing
We do not use a single parser. Each document type gets a specialized processing pipeline.
- Scanned and handwritten documents pass through an OCR pipeline with layout analysis. For pharmaceutical deviation reports, we use vision-language models to extract structured fields from handwritten forms, then validate extracted values against expected ranges and formats.
- Structured data sources (databases, APIs, configuration systems) are converted into natural language summaries with preserved metadata. A device configuration record becomes a paragraph describing the device state, annotated with the device ID, group membership, and last-modified timestamp.
- Regulatory and policy documents receive section-aware parsing that preserves hierarchical structure: title, part, subpart, section, subsection. This hierarchy becomes metadata attached to every chunk, enabling filtered retrieval (for example, “retrieve only from 49 CFR Part 395”).
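The dispatch itself can be a simple registry keyed by document type. The parsers below are placeholders standing in for the OCR, summarization, and section-aware pipelines described above; every function and field name here is hypothetical:

```python
# Hypothetical per-type dispatch: route each document to its own
# parser rather than forcing one universal pipeline.

def parse_scanned(doc):
    # OCR with layout analysis would run here
    return {"text": f"[ocr] {doc['raw']}", "type": "scanned"}

def parse_structured(doc):
    # DB/API record -> natural-language summary with preserved metadata
    return {"text": f"Device {doc['id']} is {doc['state']}.", "type": "structured"}

def parse_regulatory(doc):
    # section-aware parse; hierarchy becomes chunk metadata
    return {"text": doc["raw"], "section_path": doc["section"], "type": "regulatory"}

PARSERS = {
    "scanned": parse_scanned,
    "structured": parse_structured,
    "regulatory": parse_regulatory,
}

def process(doc: dict) -> dict:
    return PARSERS[doc["type"]](doc)
```

The point of the registry is that adding a new document type means adding one parser, not reworking the shared pipeline.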
Semantic Chunking with Metadata
We chunk at semantic boundaries rather than fixed token counts. Paragraph breaks, section headers, and topic shifts define chunk boundaries. Each chunk carries metadata: source document, document type, section path, creation date, last-modified date, and any structured identifiers (regulation numbers, batch IDs, device model numbers).
Overlap between adjacent chunks preserves context at boundaries, typically 10-15% of the chunk size. This overlap is not arbitrary; we calibrate it per document type based on retrieval evaluation metrics.
Average chunk sizes vary by domain. Regulatory text, which is dense and self-referential, performs best at 300-500 tokens. Support documentation, which is more conversational, works well at 500-800 tokens. Structured records are kept as complete units regardless of length.
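A minimal sketch of boundary-based chunking with a carried-forward token overlap, assuming paragraphs are already split on blank lines (production boundary detection also uses headers and topic shifts, and real token counting uses a tokenizer rather than whitespace; names and defaults are illustrative):

```python
def semantic_chunks(text, metadata, max_tokens=400, overlap_ratio=0.12):
    """Pack paragraphs into chunks up to max_tokens, carrying a
    ~10-15% token tail forward into the next chunk as overlap.
    Every chunk inherits the document-level metadata."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        if current and len(" ".join(current).split()) + len(para.split()) > max_tokens:
            tokens = " ".join(current).split()
            chunks.append({"text": " ".join(tokens), **metadata})
            n = max(1, int(len(tokens) * overlap_ratio))
            current = [" ".join(tokens[-n:])]  # overlap carried forward
        current.append(para)
    if current:
        chunks.append({"text": " ".join(current), **metadata})
    return chunks
```

Because `metadata` is spread into every chunk, filtered retrieval (by section path, document type, or date) needs no extra bookkeeping downstream.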
Hybrid Retrieval
Every query runs through both dense and sparse retrieval paths in parallel.
- Dense retrieval uses embedding models to find semantically similar chunks. This handles paraphrased queries, conceptual questions, and natural language that does not match the exact terminology in the corpus.
- Sparse retrieval uses BM25 to find chunks containing exact keyword matches. This handles regulation codes, device identifiers, batch numbers, and technical abbreviations that embedding models often misrepresent.
Results from both paths are merged using reciprocal rank fusion, which combines rankings without requiring score normalization between the two systems.
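Reciprocal rank fusion itself is only a few lines. Each chunk's fused score is the sum of 1/(k + rank) over every ranked list it appears in; k = 60 is the constant from the original RRF paper, and only ranks matter, so dense and sparse scores never need to be normalized against each other:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one fused ranking.
    rankings: list of lists, each ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk that appears high in both lists outranks one that tops only a single list, which is exactly the behavior hybrid retrieval wants.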
Cross-Encoder Re-ranking
The merged candidate set (typically 20-40 chunks from hybrid retrieval) passes through a cross-encoder re-ranker. Unlike the bi-encoder used for initial retrieval (which embeds queries and documents independently), the cross-encoder processes the query and each candidate chunk together, producing a more accurate relevance score.
Re-ranking is computationally expensive relative to initial retrieval, which is why we apply it only to the merged candidate set rather than the full index. The top 5-8 re-ranked chunks are passed to the language model as context.
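The flow can be sketched as follows. The `overlap_score` function is a deliberately crude stand-in for a real cross-encoder, which would score the query and chunk jointly through one transformer; only the shape of the pipeline is the point here:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Apply an expensive pairwise scorer only to the merged candidate
    set, then keep the top_n chunks for the LLM context window."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def overlap_score(query, chunk):
    """Stand-in for a cross-encoder: fraction of query tokens present
    in the chunk. A real scorer would be a trained model."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)
```

Swapping `overlap_score` for a trained cross-encoder changes nothing about the surrounding pipeline, which is why the scorer is injected as a parameter.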
Citation Grounding
Every generated response includes explicit source citations. Each claim is annotated with the specific chunk or chunks that support it, including document title, section, and page number where applicable. Users can click through to the source material to verify any statement.
This is not optional. In regulated industries, an AI-generated answer without a verifiable source is worse than no answer at all. Auditors do not accept “the AI said so” as a basis for compliance decisions.
Hybrid Search: Why We Always Use Both
The case for hybrid search is best illustrated with examples from production.
In fleet compliance, a dispatcher asks: “What are the HOS rules for team drivers under 49 CFR 395.8?” Dense search alone returns chunks about Hours of Service regulations in general: useful, but not specific. It often misses the exact section because “49 CFR 395.8” is a string that embedding models do not represent with precision. Sparse BM25 search finds the exact regulatory section by keyword matching “49 CFR 395.8” and “team drivers.” The hybrid result returns both the specific regulation and semantically related guidance documents about team driver exceptions.
In pharmaceutical quality, an investigator asks about deviations related to batch “PX-2024-1187.” Dense search returns chunks about deviation investigations in general: the right topic, but not the right batch. Sparse search retrieves every document mentioning that specific batch identifier. The combined result gives the investigator both the specific batch history and relevant procedural context from similar investigations.
In device management through WeGuard, an IT administrator asks: “What is the compliance status of Galaxy S24 devices in the APAC engineering group?” Dense search understands the conceptual query about compliance and device groups. Sparse search ensures the specific device model and group name are matched exactly. Neither search modality alone reliably returns the complete answer.
The performance difference is measurable. Across our production systems, hybrid retrieval with reciprocal rank fusion consistently achieves 15-25% higher recall@10 compared to dense-only retrieval, with the largest gains on queries containing technical identifiers or regulatory codes. The latency overhead of running both retrieval paths in parallel is negligible: typically under 20 additional milliseconds per query.
We have not encountered a production domain where dense-only search was sufficient. If the corpus contains any structured identifiers, codes, or technical terminology (and enterprise corpora always do), hybrid search is not an optimization. It is a requirement.
Evaluation: The Part Nobody Talks About
Building a RAG pipeline is the first half of the problem. Knowing whether it works, and continuing to know over time, is the second half, and it receives far less attention than it deserves.
We evaluate RAG systems across three dimensions: retrieval quality, generation quality, and operational stability.
Retrieval Metrics
For every query, we measure:
- Precision@k: Of the top k retrieved chunks, how many are actually relevant? This tells us whether the retrieval pipeline is surfacing noise.
- Recall@k: Of all relevant chunks in the corpus, how many appear in the top k results? This tells us whether the pipeline is missing important context.
- Mean Reciprocal Rank (MRR): How high in the ranked list does the first relevant chunk appear? This matters because language models attend more strongly to context that appears earlier in the prompt.
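Once relevance judgments exist, all three metrics are short functions:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top k."""
    return sum(1 for c in retrieved[:k] if c in relevant) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank of the first relevant chunk.
    queries: list of (retrieved_ids, relevant_id_set) pairs."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, c in enumerate(retrieved, start=1):
            if c in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

The hard part is not the arithmetic; it is maintaining the relevance judgments these functions consume.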
These metrics require relevance judgments, which we generate through a combination of automated methods (using a separate LLM as a judge on a held-out evaluation set) and periodic human annotation.
Generation Metrics
Retrieval quality alone does not guarantee good answers. We evaluate the generation stage on three criteria:
- Faithfulness: Does the generated answer use only information present in the retrieved context? Hallucinated claims that sound plausible but do not appear in any retrieved chunk are the highest-risk failure mode.
- Relevance: Does the answer actually address the user’s question? A faithful answer that summarizes retrieved content but misses the point of the query is still a failure.
- Groundedness: Can every factual claim in the response be traced to a specific source chunk? This is the citation verification step, automated through claim decomposition and source matching.
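The matching half of that check can be sketched with simple token coverage. A real pipeline would first decompose the generated answer into atomic claims with an LLM, and the coverage threshold below is illustrative rather than a production value:

```python
def groundedness(claims, chunks, threshold=0.6):
    """Simplified groundedness check: a claim counts as grounded when
    some retrieved chunk covers at least `threshold` of its tokens.
    Returns {claim: bool} for each claim."""
    def coverage(claim, chunk):
        ct = set(claim.lower().split())
        return len(ct & set(chunk.lower().split())) / max(len(ct), 1)
    return {
        claim: max(coverage(claim, ch) for ch in chunks) >= threshold
        for claim in claims
    }
```

Claims that fail the check are the ones routed to citation review, since they are the candidates for hallucination.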
Human-in-the-Loop Evaluation
In high-stakes domains (pharmaceutical quality and fleet compliance, specifically), automated metrics are necessary but not sufficient. We run periodic human evaluation cycles where domain experts review a random sample of production queries and rate the responses for accuracy, completeness, and safety.
For pharmaceutical quality, every response related to regulatory interpretation or deviation classification is flagged for expert review during an initial supervised deployment period. The system earns autonomy incrementally as error rates fall below defined thresholds.
Drift Detection
RAG quality is not static. Corpus updates, query distribution shifts, and embedding model behavior changes can all cause gradual degradation. We run automated monitoring that tracks retrieval and generation metrics on a rolling basis and triggers alerts when any metric drops below a defined threshold.
Common drift patterns we have observed include: new document types that the chunking pipeline handles poorly, seasonal shifts in query patterns (fleet compliance queries spike before DOT audit periods), and embedding quality degradation on newly added technical terminology.
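A minimal drift alert is a rolling window over a per-run metric with a floor; the window size and floor below are illustrative, not our production thresholds:

```python
from collections import deque

class DriftMonitor:
    """Rolling-window alert: track a retrieval metric (e.g. recall@10)
    per evaluation run and flag when the window mean drops below a floor."""

    def __init__(self, window=7, floor=0.80):
        self.values = deque(maxlen=window)
        self.floor = floor

    def record(self, value: float) -> bool:
        """Record one evaluation run; returns True when an alert should fire."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return False  # not enough history to judge yet
        return sum(self.values) / len(self.values) < self.floor
```

Averaging over a window rather than alerting on single runs is what distinguishes gradual drift from ordinary run-to-run noise.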
Building for Production
RAG is not a solved problem. It is an engineering discipline, one that requires the same rigor applied to any production data system. The difference between a demo and a production RAG pipeline is not the embedding model or the vector database. It is the investment in document processing, retrieval evaluation, hybrid search, and continuous monitoring.
Every domain we have worked across (device management, fleet operations, pharmaceutical quality) has reinforced the same lesson. The retrieval layer is the foundation. If retrieval is unreliable, no amount of prompt engineering or model selection will compensate. Build the retrieval pipeline first, instrument it thoroughly, evaluate it continuously, and treat it as the core engineering challenge that it is.
The organizations that will succeed with enterprise RAG are not those with the most sophisticated models. They are those that treat knowledge retrieval as a first-class engineering problem and invest accordingly.