Basic RAG tutorials make it look simple: chunk documents, embed them, retrieve the top-k results, generate an answer. A working prototype can be built in an afternoon. The mental model is clean: documents go in, answers come out.
Production RAG for enterprise data is a different problem entirely. The documents are messy, the queries are ambiguous, the failure modes are silent, and the cost of wrong answers ranges from wasted time to regulatory violations. We learned this the hard way across multiple engagements spanning device management, fleet operations, and pharmaceutical quality systems.
This post walks through the specific architecture decisions we made building RAG pipelines for these domains and, more importantly, what we tried first that did not work. Every layer of our current pipeline exists because we observed a concrete failure that forced its introduction.
Document Processing: The First Three Attempts
We did not arrive at our current document processing strategy on the first try. We arrived at it on the third.
Attempt 1: Fixed-Size Chunking
Our first pipeline used fixed-size chunking at 512 tokens per chunk. This is the default in most tutorials and frameworks, and for good reason: it is simple to implement, predictable in output, and fast to run. We had an end-to-end prototype working within hours.
The results were poor. Fixed-size chunking is content-agnostic. It does not know or care where sentences end, where paragraphs break, or where one topic transitions to another. Chunks split mid-sentence, mid-paragraph, and mid-concept. A regulatory document explaining a compliance requirement would be severed at an arbitrary token boundary, leaving the first chunk with the context and the second chunk with the actual requirement. Retrieval returned fragments that technically contained relevant keywords but lacked the surrounding context necessary to produce a useful answer.
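A minimal sketch makes the failure mode concrete. The function below (illustrative only; it approximates tokens with whitespace-separated words, and the window size is arbitrary) slices a document into fixed windows with no regard for sentence boundaries:

```python
def fixed_size_chunks(text: str, chunk_size: int = 8) -> list[str]:
    """Split text into fixed-size chunks of `chunk_size` words,
    ignoring sentence and paragraph boundaries entirely."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

doc = ("Section 4.2 requires quarterly calibration. "
       "Devices that fail calibration must be quarantined "
       "and reported within 48 hours.")
chunks = fixed_size_chunks(doc)
```

Run on this two-sentence example, the first chunk ends mid-sentence at "Devices that fail": the context lands in one chunk and the actual requirement in the next, which is exactly the severing behavior described above.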
For structured documents with nested section headers, a description that fits nearly every regulatory document we encountered, the results were especially bad. A chunk might contain the text of a subsection without any reference to the parent section or the regulation it belonged to.
Attempt 2: Recursive Character Splitting with Overlap
Our second approach used recursive character splitting with configurable overlap between adjacent chunks. This is a step above fixed-size chunking: the splitter attempts to break at paragraph boundaries first, then sentence boundaries, then word boundaries, falling back to character-level splits only as a last resort. We configured 15% overlap between chunks to preserve context at boundaries.
This was meaningfully better. Chunks respected sentence boundaries most of the time, and the overlap reduced the frequency of context loss at chunk edges. For simple, flat documents (blog posts, support articles, straightforward technical documentation), this approach produced adequate retrieval results.
It still failed on documents with complex structure. Regulatory documents with nested headers, pharmaceutical SOPs with numbered sections and sub-sections, and fleet compliance manuals with hierarchical reference structures all lost their organization. The splitter had no awareness of document structure beyond plain text formatting. A chunk from section 3.2.1 carried no indication that it belonged to section 3.2, which belonged to section 3, which belonged to a specific regulatory framework.
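The core mechanism is easy to sketch. The splitter below is a simplified stand-in for the real thing (separator list, length limit, and the omission of overlap are all simplifications for brevity): it tries the coarsest separator first and falls back to finer ones, with a hard character cut as the last resort.

```python
def recursive_split(text: str, max_len: int = 200,
                    seps: tuple = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator that yields pieces under
    max_len characters; fall back to finer separators, then to
    hard character cuts as a last resort."""
    if len(text) <= max_len:
        return [text]
    for sep in seps:
        if sep in text:
            parts = text.split(sep)
            chunks, buf = [], ""
            for part in parts:
                candidate = (buf + sep + part) if buf else part
                if len(candidate) <= max_len:
                    buf = candidate          # greedily merge small pieces
                else:
                    if buf:
                        chunks.append(buf)
                    buf = part
            if buf:
                chunks.append(buf)
            # recurse on any piece still over the limit
            out = []
            for chunk in chunks:
                out.extend(recursive_split(chunk, max_len, seps))
            return out
    # no separator present at all: hard character cut
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Note what is missing: nothing in this logic knows that a piece of text came from section 3.2.1 of a particular regulation. The splitter sees only characters, which is the limitation that drove the third attempt.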
Attempt 3: Semantic Chunking with Metadata Enrichment
Our current approach chunks at natural section boundaries rather than arbitrary positions. Section headers, topic transitions, and structural markers in the source document define where chunks begin and end. Chunk sizes vary: a short regulatory clause might produce a 150-token chunk, while a detailed procedural description might produce an 800-token chunk. We constrain the range to avoid extremes, but we let document structure drive the segmentation rather than imposing a fixed window.
The real improvement came from metadata enrichment. Every chunk carries structured metadata: source document identifier, document type, the full section path from root to the chunk location, publication or revision date, and a brief summary of the parent section. This metadata enables filtered retrieval at query time. A query about compliance regulations can be scoped to only search documents of type “regulation” published after a specific date. A query about a specific device model can be filtered to documentation tagged with that model identifier.
Metadata enrichment added processing complexity, but the retrieval quality gains were substantial. In our evaluations, semantic chunking with metadata consistently outperformed recursive splitting by 20-30% on recall metrics, with the largest improvements on queries requiring context from specific sections of structured documents.
Embedding Model Selection
Choosing an embedding model involves a three-way trade-off between quality, cost, and operational complexity. We evaluated several options across our production domains before settling on a strategy.
General-purpose embeddings from providers like OpenAI (text-embedding-3-large) established our quality baseline. Retrieval performance was strong across domains, and integration was straightforward. The drawback is cost at scale. When processing millions of document chunks across multiple client deployments, API-based embedding costs become a meaningful line item. There is also the operational consideration of depending on an external API for a core pipeline component.
Open-source alternatives such as BGE and E5 variants offered competitive quality with self-hosting flexibility. In our benchmarks, the best open-source models reached within 2-5% of the commercial options on general retrieval tasks. Self-hosting eliminates per-token costs and external dependencies, though it introduces infrastructure management overhead.
Domain-adapted embeddings produced the strongest results in established verticals. We fine-tuned embedding models on domain-specific query-document pairs gathered from production usage and expert annotation. In our pharmaceutical quality pipeline, domain-adapted embeddings improved retrieval precision by 12% over general-purpose models. In fleet compliance, the improvement was approximately 8%. The gains were concentrated on queries involving specialized terminology and domain-specific concepts that general models underrepresent.
The trade-off is clear: domain-adapted embeddings require labeled data, training infrastructure, and per-domain maintenance. We settled on a staged approach. New engagements begin with general-purpose embeddings to minimize setup time and validate the use case. Once the pipeline is in production and we have accumulated sufficient query-document relevance pairs, we evaluate whether domain adaptation is justified by the retrieval quality gaps we observe. Not every domain warrants it.
Vector Database Choice
We evaluated four vector database options: Qdrant, Weaviate, pgvector, and Pinecone. Each has legitimate strengths, and the “right” choice depends more on operational context than on raw performance benchmarks.
Qdrant is our default for dedicated deployments. Its filtering capabilities are strong (critical for our metadata-rich retrieval strategy), and performance remains stable at the scale of our current deployments (tens of millions of vectors). It is self-hostable, which matters for clients in regulated industries where data residency requirements prohibit external services. The API is well designed and the documentation is thorough.
pgvector is our recommendation when the client already operates PostgreSQL infrastructure. It eliminates the operational burden of running a separate vector database service. For smaller datasets (under a few million vectors) and moderate query throughput, pgvector performs adequately and simplifies the deployment architecture considerably. The trade-off is that its filtering and indexing capabilities are less mature than purpose-built vector databases.
Weaviate showed strong multi-tenancy support, which makes it a practical choice for SaaS-style deployments where multiple clients share infrastructure with data isolation requirements. We have used it in scenarios requiring tenant-level separation within a single cluster.
Pinecone is a managed service that reduces operational overhead to near zero. For teams without dedicated infrastructure engineering capacity, this matters. The trade-off is the loss of self-hosting control and the recurring cost structure.
Our general principle: the vector database choice matters less than the quality of data going into it. A well-chunked, metadata-rich corpus with strong embeddings will perform well in any competent vector database. A poorly processed corpus will perform poorly regardless of the database.
Hybrid Retrieval: The Architecture
Our production retrieval pipeline runs four stages in sequence. Each stage addresses a specific class of retrieval failure that we observed in earlier, simpler designs.
Stage 1: Parallel Dense and Sparse Search. Every query triggers two searches simultaneously. Dense retrieval uses embedding similarity to find semantically related chunks, handling paraphrased queries, conceptual questions, and natural language that does not match the exact terminology in the corpus. Sparse retrieval uses BM25 keyword matching to find chunks containing exact terms, handling regulation codes, device identifiers, batch numbers, and technical abbreviations.
Neither modality alone is sufficient. In fleet compliance, a query about “49 CFR 395.5” fails with dense-only search because regulation codes do not embed with precision. In pharmaceutical quality, a conceptual query about root cause analysis methodologies fails with sparse-only search because the relevant documents use varied terminology that keyword matching cannot unify. We run both in parallel, with negligible latency overhead.
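To make the sparse side concrete, here is a minimal BM25 ranker, a sketch rather than what production libraries do (no stemming, whitespace tokenization, standard k1 and b defaults). It shows why an exact identifier like a regulation code wins under keyword scoring:

```python
import math
from collections import Counter

def bm25_rank(query: str, docs: list[str],
              k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Score docs against the query with BM25; return indices best-first."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n   # average document length
    df = Counter()                                # document frequency per term
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    scores = []
    for toks in tokenized:
        tf, dl, s = Counter(toks), len(toks), 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = ["49 CFR 395.5 hours of service limits",
        "general fleet maintenance guidance",
        "driver rest requirements overview"]
ranking = bm25_rank("49 CFR 395.5", docs)
```

The token "395.5" either matches or it does not; there is no notion of a "nearby" regulation code, which is exactly the precision that dense embeddings lack for this query type.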
Stage 2: Reciprocal Rank Fusion. The two result sets from dense and sparse retrieval are merged using reciprocal rank fusion (RRF). RRF combines rankings without requiring score normalization between the two retrieval systems, which is important because dense similarity scores and BM25 scores exist on incompatible scales. The fused ranking reliably surfaces chunks that are relevant by both semantic and lexical criteria, while preserving chunks that score highly on only one dimension.
Stage 3: Cross-Encoder Re-ranking. The fused candidate set (typically 20 to 40 chunks) passes through a cross-encoder model. Unlike the bi-encoder used in Stage 1, which embeds queries and documents independently, the cross-encoder processes the query and each candidate together, producing a more precise relevance judgment. Re-ranking is computationally expensive, which is why we apply it only to the already-filtered candidate set rather than the full corpus. The top 5 to 8 re-ranked chunks proceed to generation.
Stage 4: Metadata Filtering. The final retrieval stage applies metadata constraints: date ranges, document types, source authority levels, and any domain-specific filters specified in the query or inferred from context. A query about current compliance requirements should not retrieve superseded regulations. A query about a specific manufacturing batch should prioritize documents tagged with that batch identifier.
In our evaluations, removing any single stage degrades retrieval quality in predictable, measurable ways. The four-stage pipeline is not complexity for its own sake. It is the minimum architecture that handles the full range of query types we encounter in production.
Response Generation and Citation
Retrieved chunks are passed to the language model with explicit source labels. Each chunk is annotated with its document title, section path, and date. The generation prompt instructs the model to ground every claim in the provided context and to cite the specific source for each factual statement.
The critical design constraint: if the retrieved context does not contain information sufficient to answer the query, the system must say so rather than speculate. In regulated domains, a confident wrong answer is far more dangerous than an honest “we do not have enough information to answer this.” We enforce this through prompt design, output validation, and periodic human review of edge cases where the system declined to answer.
Citations follow an inline reference format. Each factual claim in the response maps to a numbered source, and users can click through to view the original document chunk in context. This is not a convenience feature; it is a trust mechanism. Enterprise users, particularly in compliance and quality roles, will not adopt a system whose answers they cannot verify. Citation grounding transforms the system from a black box into an auditable tool.
We also run automated citation verification as a post-generation check. A separate validation step confirms that each cited source actually supports the claim it is attached to. Claims that fail verification are flagged or removed before the response reaches the user.
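As a toy illustration of the idea, the check below uses a crude lexical-overlap heuristic; the function, the stopword list, and the threshold are all invented for this sketch, and a production validation step would more plausibly use an entailment model or an LLM judge rather than word overlap:

```python
def citation_supported(claim: str, source_text: str,
                       threshold: float = 0.6) -> bool:
    """Crude lexical check: what fraction of the claim's content
    words appear in the cited source? Below threshold, flag it."""
    stop = {"the", "a", "an", "of", "to", "is", "are", "and", "in", "for"}
    claim_words = {w.strip(".,").lower() for w in claim.split()} - stop
    source_words = {w.strip(".,").lower() for w in source_text.split()}
    if not claim_words:
        return True  # nothing substantive to verify
    overlap = len(claim_words & source_words) / len(claim_words)
    return overlap >= threshold

source = "All devices must be calibrated quarterly per section 4.2."
ok = citation_supported("Devices must be calibrated quarterly", source)
bad = citation_supported("Drivers require eight hours of rest", source)
```

Even this naive version catches the worst failure: a citation attached to a claim that shares essentially no vocabulary with its source.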
What We Learned
The difference between a RAG demo and a RAG system is in the decisions documented above. Each stage of the pipeline addresses a specific failure mode that we observed, diagnosed, and resolved through iterative engineering. Semantic chunking exists because fixed-size chunking destroyed document structure. Hybrid retrieval exists because dense-only search fails on exact identifiers. Cross-encoder re-ranking exists because initial retrieval rankings are noisy. Citation grounding exists because enterprise users require verifiable answers.
Remove any stage and quality degrades in predictable ways. The architecture is not complex for the sake of complexity. It is complex because enterprise data is complex, enterprise queries are diverse, and the cost of wrong answers is real. Every component earns its place by solving a problem that simpler approaches could not.