RAG (Retrieval-Augmented Generation) is now the default pattern for building LLM applications that need up-to-date or private knowledge. I've built RAG systems for clients ranging from a 10,000-document legal database to a real-time product catalog Q&A with 500k+ SKUs. Every time I start a new project, I see the same mistakes made in tutorials — mistakes that don't matter for demos but destroy accuracy in production.
This post skips the hello-world. It covers the architecture decisions, chunking strategies, retrieval tuning, and FastAPI wiring that actually matter when users depend on correct answers.
RAG vs Fine-tuning: Make the Right Call First
Before writing a line of code, choose the right approach. Teams often default to RAG when fine-tuning would serve them better, and vice versa.
| Use Case | Approach | Why |
|---|---|---|
| Private documents, knowledge base | RAG | Data changes; no retraining needed |
| Domain-specific tone or style | Fine-tuning | Behavior, not knowledge |
| Real-time or frequently updated data | RAG | Can't retrain hourly |
| Reasoning on structured company data | RAG + SQL agent | Structured data needs structured retrieval |
| Consistent output format | Fine-tuning | Format is a behavior pattern |
| Factual Q&A over static corpus | RAG | Classic RAG use case |
If your goal is factual Q&A over a document set that changes monthly or faster, RAG wins. If you need the model to behave differently (not just know different things), fine-tune.
Architecture: What Production RAG Actually Looks Like
User Query
│
▼
Query Rewriting (optional but recommended)
│
▼
Hybrid Retrieval (dense + sparse)
├── ChromaDB (vector similarity)
└── BM25 (keyword matching)
│
▼
Reranking (cross-encoder)
│
▼
Context Assembly + Prompt
│
▼
LLM (Claude / GPT-4o)
│
▼
Response + Source Citations

Most tutorials only show the dense retrieval step (vector similarity). Production systems use hybrid retrieval plus a reranker. The difference in answer quality is substantial: typically a 15-25% improvement in answer relevance on internal benchmarks.
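As a rough illustration of the fusion and reranking stages in the diagram, the sketch below merges a dense and a sparse ranking with weighted reciprocal rank fusion (one common fusion method), then reorders the fused candidates with a scoring function. Everything in it is an assumption made for illustration: the function names, the constant `c=60`, the sample corpus, and especially `toy_score`, a word-overlap stand-in for a real cross-encoder model.

```python
from typing import Callable


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids; each list contributes weight / (c + rank)."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def rerank(query: str, candidates: dict[str, str],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Score every (query, text) pair and keep the top_n doc ids."""
    ranked = sorted(candidates, key=lambda d: score_fn(query, candidates[d]), reverse=True)
    return ranked[:top_n]


def toy_score(query: str, text: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): fraction of query words found in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / max(len(q_words), 1)


# Hypothetical corpus and retriever outputs
corpus = {
    "a": "refund policy for damaged items",
    "b": "shipping times by region",
    "c": "how to request a refund online",
}
dense_hits = ["c", "a", "b"]   # order a vector search might return
sparse_hits = ["a", "c"]       # order a keyword search might return

fused = weighted_rrf([dense_hits, sparse_hits], weights=[0.6, 0.4])
top = rerank("refund policy", {d: corpus[d] for d in fused}, toy_score, top_n=2)
print(fused, top)
```

In a real pipeline, `score_fn` would wrap a cross-encoder that reads the query and chunk together, which is slower than vector search but only has to run on the handful of fused candidates.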
Step 1: Document Ingestion Pipeline
Build a proper ingestion pipeline that handles multiple formats and tracks what's been indexed:
import hashlib
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_community.vectorstores import Chroma


def get_doc_hash(content: str) -> str:
    """Short, stable fingerprint of a document's text."""
    return hashlib.sha256(content.encode()).hexdigest()[:16]


def load_document(path: str):
    """Pick a loader by file extension and return the loaded documents."""
    ext = os.path.splitext(path)[1].lower()
    loaders = {
        ".pdf": PyPDFLoader,
        ".txt": TextLoader,
    }
    loader_cls = loaders.get(ext)
    if not loader_cls:
        raise ValueError(f"Unsupported format: {ext}")
    return loader_cls(path).load()


def ingest(paths: list[str], vectorstore: Chroma, indexed_hashes: set):
    # Note: chunk_size and chunk_overlap count characters by default;
    # pass a token-counting length_function to size chunks in tokens.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=150,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    for path in paths:
        docs = load_document(path)
        full_text = " ".join(d.page_content for d in docs)
        doc_hash = get_doc_hash(full_text)
        if doc_hash in indexed_hashes:
            continue  # skip already-indexed docs
        chunks = splitter.split_documents(docs)
        for chunk in chunks:
            chunk.metadata["source"] = path
            chunk.metadata["doc_hash"] = doc_hash
        vectorstore.add_documents(chunks)
        indexed_hashes.add(doc_hash)
        print(f"Indexed {len(chunks)} chunks from {path}")

Step 2: Chunking Strategy (Where Most RAG Systems Fail)
Chunk size is the single biggest factor in RAG accuracy, and almost every tutorial gets it wrong by using 1000-token chunks with no overlap. Here's what actually works:
| Document Type | Chunk Size | Overlap | Reasoning |
|---|---|---|---|
| Dense technical docs (APIs, legal) | 600-900 tokens | 150-200 tokens | Preserves context without diluting retrieval |
| Conversational content (FAQs, chat logs) | 200-400 tokens | 50 tokens | Shorter context = more precise match |
| Long-form articles / books | 800-1200 tokens | 200 tokens | Needs context for coherence |
| Structured data (tables, code) | Per-row or per-function | 0-50 tokens | Semantic boundaries are structural |
The overlap is critical. Without it, a sentence that spans a chunk boundary can be split mid-thought — making both chunks less retrievable for queries that would have matched the complete sentence.
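A minimal sketch of that boundary effect, assuming whitespace-separated words as a stand-in for tokens (a real pipeline would count tokens with the model's tokenizer). `chunk_words` is a hypothetical helper written for this example, not part of LangChain.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


text = "Refunds are issued within 30 days. Opened items incur a restocking fee."
no_overlap = chunk_words(text, size=8, overlap=0)
with_overlap = chunk_words(text, size=8, overlap=4)

# This phrase straddles the boundary between word 8 and word 9.
phrase = "Opened items incur a"
print(any(phrase in c for c in no_overlap))    # False: the phrase is cut in two
print(any(phrase in c for c in with_overlap))  # True: the second window contains it whole
```

The cost of overlap is index size: with 50% overlap, as in this toy case, every word is stored roughly twice, which is why the table above keeps overlap well below the chunk size.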
Step 3: Hybrid Retrieval with ChromaDB and BM25
Vector search alone misses exact keyword matches. BM25 alone misses semantic similarity. Combine them:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embedding_fn = OpenAIEmbeddings(model="text-embedding-3-small")

# Dense retriever (semantic)
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embedding_fn,
)
dense_retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance: reduces redundancy
    search_kwargs={"k": 8, "fetch_k": 20},
)

# Sparse retriever (keyword/BM25)
# Build from the same documents you indexed, keeping their metadata so
# BM25 hits can still be cited by source
stored = vectorstore.get()
sparse_retriever = BM25Retriever.from_texts(
    stored["documents"],
    metadatas=stored["metadatas"],
    k=8,
)

# Ensemble: 60% dense, 40% sparse
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.6, 0.4],
)

MMR (Maximal Marginal Relevance) in the dense retriever is another underused setting. It trades a small amount of relevance for diversity, preventing the retriever from returning 8 nearly identical chunks when you only need 3.
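To show what that trade-off looks like, here is a small pure-Python sketch of greedy MMR selection. It is an illustration under stated assumptions, not the vector store's implementation: the two-dimensional vectors, the document ids, and the `lambda_mult` weighting used here are all made up for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec: list[float], doc_vecs: dict[str, list[float]],
        k: int, lambda_mult: float = 0.5) -> list[str]:
    """Greedy MMR: each pick maximizes
    lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, already picked)."""
    selected: list[str] = []
    remaining = dict(doc_vecs)
    while remaining and len(selected) < k:
        def mmr_score(doc_id: str) -> float:
            relevance = cosine(query_vec, remaining[doc_id])
            redundancy = max(
                (cosine(remaining[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy embeddings (hypothetical): two near-duplicate chunks and one distinct chunk
query = [1.0, 0.0]
docs = {
    "pricing_v1": [0.9, 0.1],
    "pricing_v1_copy": [0.9, 0.1],
    "pricing_faq": [0.6, -0.8],
}
print(mmr(query, docs, k=2, lambda_mult=0.5))  # balances relevance and diversity
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance
```

With `lambda_mult=1.0` the duplicate chunk takes the second slot. At `0.5` the redundancy penalty pushes it below the less similar but distinct FAQ chunk.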
Step 4: The Retrieval Chain with Query Rewriting
User queries are messy. A short follow-up like "what about the pricing?" has no context for the retriever. Rewrite it before retrieval:
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0)

# Query rewriter: expands ambiguous queries using chat history
rewrite_prompt = ChatPromptTemplate.from_messages([
    ("system", "Rewrite the user query to be self-contained and specific, "
               "incorporating relevant context from the chat history. "
               "Output ONLY the rewritten query, nothing else."),
    ("human", "Chat history:\n{history}\n\nCurrent query: {query}"),
])
rewriter = rewrite_prompt | llm | StrOutputParser()

# Answer chain
answer_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a precise assistant. Answer the question using ONLY "
     "the provided context. If the answer is not in the context, say so. "
     "Always cite the source document name.\n\nContext:\n{context}"),
    ("human", "{question}"),
])


def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )


rag_chain = (
    {"context": hybrid_retriever | format_docs, "question": RunnablePassthrough()}
    | answer_prompt
    | llm
    | StrOutputParser()
)

Step 5: FastAPI Service with Streaming
Wrap the chain in a FastAPI endpoint with streaming so users see the response as it generates:
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class QueryRequest(BaseModel):
    query: str
    history: list[dict] = []


@app.post("/query")
async def query_docs(req: QueryRequest):
    history_text = "\n".join(
        f"{m['role'].upper()}: {m['content']}"
        for m in req.history[-6:]  # last 3 turns
    )
    # Rewrite query if there's history
    if history_text:
        rewritten = await rewriter.ainvoke({
            "history": history_text,
            "query": req.query,
        })
    else:
        rewritten = req.query

    async def stream():
        async for chunk in rag_chain.astream(rewritten):
            # JSON-encode each chunk: a raw newline inside the text would
            # otherwise break the SSE "data:" framing
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(stream(), media_type="text/event-stream")


@app.get("/health")
def health():
    return {"status": "ok", "docs_indexed": vectorstore._collection.count()}

Evaluation: How to Know If It's Actually Working
The biggest mistake after building a RAG system is shipping it without evaluation. I use three metrics:
- Context Precision: of the retrieved chunks, how many were actually relevant to the answer? Low precision = hallucination risk.
- Context Recall: did the retriever find all the chunks needed to answer completely? Low recall = incomplete answers.
- Answer Faithfulness: does the final answer only use information from the retrieved context? Unfaithful answers = hallucinations.
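To make the first two definitions concrete, here is a simplified, label-based sketch. It assumes you already know which chunk ids are relevant for a question (hand labels), which is not how an LLM-judged metric works, so treat the numbers as an intuition aid rather than a replacement for a proper evaluation run. The function names and chunk ids are hypothetical.

```python
def simple_context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)


def simple_context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunks that were actually retrieved."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(relevant & set(retrieved)) / len(relevant)


retrieved = ["c1", "c2", "c3", "c4"]   # what the retriever returned
relevant = {"c1", "c3", "c9"}          # what a human marked as needed

precision = simple_context_precision(retrieved, relevant)
recall = simple_context_recall(retrieved, relevant)
print(precision, round(recall, 2))  # 0.5 0.67
```

Here two of the four retrieved chunks are useful (precision 0.5), and one needed chunk, `c9`, was never retrieved (recall about 0.67).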
The RAGAS library automates all three using an LLM judge:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision, context_recall, faithfulness
)

# Prepare evaluation set: question + ground truth + retrieved context + answer
eval_data = {
    "question": ["What is the refund policy?", ...],
    "ground_truth": ["Items can be returned within 30 days...", ...],
    "contexts": [[chunk1, chunk2], ...],  # retrieved chunks
    "answer": ["Based on the docs, refunds are...", ...],
}
result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[context_precision, context_recall, faithfulness],
)
print(result)
# {"context_precision": 0.87, "context_recall": 0.91, "faithfulness": 0.94}

On a production legal Q&A system I built, baseline retrieval scored 0.61 faithfulness. After adding hybrid retrieval + reranking, it hit 0.94. That difference is the line between a demo and a system lawyers will actually trust.
Production Checklist
- Chunk metadata: Always store source, page number, and doc hash. Without it you can't cite sources or debug bad answers.
- Embedding model consistency: Never change embedding models mid-project without re-indexing everything. Mixing embedding spaces destroys retrieval.
- Streaming from the start: Even 2-second waits feel broken to users. Stream every response.
- Rate limiting on the API: LLM costs scale fast under load. Add per-user or per-IP rate limiting from day one.
- Persist your vector store: ChromaDB's in-memory mode is for testing only. Use persist_directory always.
- Test with adversarial queries: Queries that have no answer in the corpus. A well-tuned RAG should say "I don't know", not hallucinate.
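For the rate-limiting item, a minimal in-memory sketch of a per-key sliding-window limiter follows. It assumes a single process. With several API workers, the counters would have to live in shared storage instead. `SlidingWindowLimiter` and its parameters are hypothetical names chosen for this example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        recent = self.hits[key]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


# Simulated clock: three quick requests, then one after the window has passed
fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=2, window=60.0, clock=lambda: fake_now[0])
results = [limiter.allow("user-1") for _ in range(3)]
fake_now[0] = 61.0
results.append(limiter.allow("user-1"))
print(results)  # [True, True, False, True]
```

In the query endpoint, the key would be a user id or client IP, and a `False` result would translate into an HTTP 429 response before any LLM call is made.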
Frequently Asked Questions
Which embedding model should I use?
text-embedding-3-small from OpenAI offers the best cost-to-quality ratio for English documents: a 62.3% MTEB score at a fraction of the cost of larger models. For multilingual content, use multilingual-e5-large. Avoid ada-002; it's obsolete.
ChromaDB vs Pinecone vs pgvector?
ChromaDB for local/small-scale (under 1M vectors). Pinecone for managed production with high QPS. pgvector if you're already on PostgreSQL and want to avoid another service. For most projects under 500k documents, ChromaDB with persistence is perfectly sufficient.
How do you handle documents that update frequently?
Track document hashes at ingestion time (shown in Step 1). On update, delete the old chunks by doc_hash metadata filter and re-ingest. ChromaDB supports metadata filtering for this: vectorstore.delete(where={'doc_hash': old_hash}).
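The update flow can be sketched without any vector store at all. The toy `InMemoryIndex` below is a hypothetical stand-in (plain Python lists, naive paragraph chunking) that keeps a path-to-hash map, so the previous version's hash is known when a file changes. That map is an assumption added for this sketch, since a bare set of hashes cannot say which hash belonged to which file.

```python
import hashlib


def doc_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]


class InMemoryIndex:
    """Toy stand-in for a vector store: a list of (chunk, metadata) pairs."""

    def __init__(self):
        self.chunks: list[tuple[str, dict]] = []
        self.hash_by_path: dict[str, str] = {}  # path -> hash of the indexed version

    def delete_by_hash(self, old_hash: str) -> None:
        self.chunks = [(c, m) for c, m in self.chunks if m["doc_hash"] != old_hash]

    def upsert(self, path: str, text: str) -> str:
        """Index a document; returns 'skipped', 'added', or 'replaced'."""
        new_hash = doc_hash(text)
        old_hash = self.hash_by_path.get(path)
        if old_hash == new_hash:
            return "skipped"  # content unchanged since last ingestion
        if old_hash is not None:
            self.delete_by_hash(old_hash)  # remove the stale version first
        for chunk in text.split("\n\n"):  # naive paragraph chunking for the sketch
            self.chunks.append((chunk, {"source": path, "doc_hash": new_hash}))
        self.hash_by_path[path] = new_hash
        return "added" if old_hash is None else "replaced"


index = InMemoryIndex()
statuses = [
    index.upsert("policy.txt", "Refunds within 30 days.\n\nNo refunds on gift cards."),
    index.upsert("policy.txt", "Refunds within 30 days.\n\nNo refunds on gift cards."),
    index.upsert("policy.txt", "Refunds within 14 days."),
]
print(statuses, len(index.chunks))  # ['added', 'skipped', 'replaced'] 1
```

One limitation to keep in mind: two files with identical content share a hash, so deleting by hash alone would remove both. Filtering on source and hash together avoids that.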
What's the biggest cause of bad RAG answers in production?
Chunking — nearly every time. Too-large chunks dilute the signal so the retriever can't find the right passage. Too-small chunks lose the context needed to understand the passage. Start with 700 tokens / 150 overlap and tune from there using your RAGAS evaluation scores.
