RAG / LLM — Theory (interview deep-dive)

“Embed PDFs, top-k cosine, stuff in prompt” — works on demo, breaks on:

Production-grade RAG = hybrid retrieval + reranking + structured prompting + evaluation.

Don’t fine-tune for facts that change. Fine-tuning is brittle for knowledge updates.

	Pinecone	Weaviate	Qdrant	Milvus	pgvector
Hosting	managed	both	both	both	self/PG
Hybrid search	yes	yes	yes	yes	recent
Filtering	yes	yes	strong	yes	yes
Scale	huge	large	large	huge	medium
Maturity	high	high	high	high	medium

For starting: pgvector (if you are already on Postgres) or Qdrant. For scale: Milvus or Pinecone.

Reciprocal Rank Fusion combines multiple ranked lists:

score(d) = Σ over lists  1 / (k + rank_i(d))

k = 60 is a typical default. Fusing the vector top-N with the BM25 top-N → usually better than either list alone.
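
A minimal sketch of the fusion formula above, assuming each input list holds doc IDs ordered best first and ranks start at 1. The function name and the sample doc IDs are illustrative, not taken from any library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs with score(d) = sum of 1 / (k + rank_i(d))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based; a doc absent from a list contributes nothing for it.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the two retrievers.
vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Docs that appear in both lists (here `d1` and `d3`) rise above docs found by only one retriever, which is the point of fusing.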

Bi-encoder: encode query and doc separately → cosine similarity. Fast (vectors precomputed).
Cross-encoder: encode (query, doc) pair jointly → score. Slow (one inference per pair).

Use bi-encoder for retrieval (fast top-k), cross-encoder for reranking the top-k (precise but few pairs).
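
The two-stage split can be sketched as below. This is a toy, not a real model: the doc vectors are hand-written stand-ins for precomputed bi-encoder embeddings, and `overlap_score` is a hypothetical stand-in for a cross-encoder (a real one would run one model inference per pair).

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlap_score(query, doc_text):
    # Stand-in for a cross-encoder: scores the (query, doc) pair jointly.
    return len(set(query.lower().split()) & set(doc_text.lower().split()))


def retrieve_then_rerank(query, query_vec, doc_vecs, docs, pair_score, top_k=3, top_n=2):
    # Stage 1 (bi-encoder): doc vectors are precomputed, so only cosine runs here.
    candidates = sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )[:top_k]
    # Stage 2 (cross-encoder): one scoring call per pair, for the top_k only.
    reranked = sorted(candidates, key=lambda d: pair_score(query, docs[d]), reverse=True)
    return reranked[:top_n]


docs = {
    "a": "reset your password in settings",
    "b": "password policy requires 12 characters",
    "c": "billing invoices are sent monthly",
    "d": "holiday schedule",
}
doc_vecs = {"a": [0.9, 0.1], "b": [1.0, 0.2], "c": [0.1, 0.9], "d": [0.0, 1.0]}
result = retrieve_then_rerank(
    "how to reset password", [1.0, 0.2], doc_vecs, docs, overlap_score
)
print(result)
```

Stage 1 ranks `b` ahead of `a` on vector similarity; the pairwise scorer then moves `a` to the top, while `d` is never scored by the expensive stage at all.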

LLMs have growing context (200k, 1M) but:

Best practice: 5-10 well-chosen chunks, reranked, with most relevant at the end.
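
A small sketch of the ordering step, assuming the reranker returns chunks best first. `build_context` and the `[id]` prefix format are illustrative choices, not a standard.

```python
def build_context(reranked_chunks, max_chunks=8):
    """reranked_chunks: list of (chunk_id, text) pairs, best first."""
    selected = reranked_chunks[:max_chunks]
    # Reverse so the highest-ranked chunk sits last, closest to the question.
    ordered = list(reversed(selected))
    return "\n\n".join(f"[{chunk_id}] {text}" for chunk_id, text in ordered)


chunks = [("c1", "best"), ("c2", "second"), ("c3", "third")]
context = build_context(chunks, max_chunks=2)
print(context)
```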

For complex questions, single retrieval often misses. Patterns:

These help quality, hurt cost + latency. Use selectively for hard queries.

Exact match cache — same query → same answer. Useful only for popular questions.
Semantic cache — embed query, look up similar past queries. Higher hit rate, accuracy risks.
Retrieval cache — same query → same retrieved chunks. Cheaper than recomputing.
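
The first two cache types can be sketched together as below, assuming query embeddings are supplied by the caller. `QueryCache`, the 0.95 threshold, and the toy vectors are assumptions for illustration; the threshold in particular needs tuning, since a loose value is where the accuracy risk comes from.

```python
import math


def normalize(query):
    # Lowercase and collapse whitespace so trivial variants share a key.
    return " ".join(query.lower().split())


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class QueryCache:
    def __init__(self, threshold=0.95):
        self.exact = {}       # normalized query text -> answer
        self.semantic = []    # (query vector, answer) pairs
        self.threshold = threshold

    def put(self, query, query_vec, answer):
        self.exact[normalize(query)] = answer
        self.semantic.append((list(query_vec), answer))

    def get(self, query, query_vec):
        key = normalize(query)
        if key in self.exact:              # exact match: cheapest and safest
            return self.exact[key]
        best_sim, best_answer = self.threshold, None
        for vec, answer in self.semantic:  # linear scan; fine for a sketch
            sim = cosine(query_vec, vec)
            if sim >= best_sim:
                best_sim, best_answer = sim, answer
        return best_answer                 # None means cache miss


cache = QueryCache()
cache.put("How do I reset my password?", [1.0, 0.0], "Use Settings > Security.")
print(cache.get("password reset steps", [0.99, 0.05]))
```

A retrieval cache would follow the same shape, storing retrieved chunk IDs instead of final answers.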

Per request: ~ embedding (cheap) + retrieval (cheap) + reranker (mid) + LLM (most cost).

Levers:

Token math: 4 chars ~ 1 token. 5k context + 500 output @ $0.01/1k = ~$0.055/req.
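
The arithmetic above as a quick calculator. The flat $0.01/1k rate comes from the example; the separate input and output prices are an added assumption, since many providers price the two differently.

```python
def estimate_tokens(text):
    # Rough heuristic from the note above: about 4 characters per token.
    return len(text) // 4


def request_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Cost in dollars for one LLM call."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


# 5k context + 500 output at a flat $0.01 per 1k tokens.
cost = request_cost(5000, 500, 0.01, 0.01)
print(f"${cost:.3f} per request")
```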

Walk through your RAG pipeline. See above: hybrid retrieval → reranking → structured prompting → evaluation.
N+1 queries growing as documents grow — fix? It’s not about queries; it’s about retrieval quality. Improve chunking, hybrid, rerank, filters.
Citing sources? Include source metadata in chunks; LLM instructed to cite by ID; UI maps ID → URL.
PII in documents? PII detection + redaction before embedding; or a self-hosted LLM; or a hosted vendor under a strict DPA (data processing agreement).
Eval framework? Golden set + RAGAS (faithfulness, answer relevancy, context precision/recall) + manual review per release.
Embedding choice? Tradeoffs: hosted vs self, dimension, cost, multilingual. Trial top-3 against your golden set; pick on retrieval recall.
Why might cosine similarity miss the right chunk? Acronyms, exact codes, numbers, unusual proper nouns. BM25 catches these.
Stale embeddings? CDC (change data capture) from doc store → re-embed pipeline. Store the source doc version with each vector so stale entries can be found and replaced.
Multi-tenant isolation in vector DB? Namespace per tenant + metadata filter; verify zero leakage in tests.
You see 60% recall but 30% precision on your eval. What now? Tighten retrieval (more filters, better embed model); add reranker; compress context.
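
For the recall and precision questions above, a minimal sketch of how those numbers could be computed per query on a golden set. This is plain set-based precision and recall at k, not the RAGAS metric definitions; the doc IDs are made up.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """retrieved: ranked doc IDs; relevant: set of IDs labelled relevant."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0          # share of retrieved that is relevant
    recall = hits / len(relevant) if relevant else 0.0   # share of relevant that was retrieved
    return precision, recall


# One hypothetical golden-set entry.
retrieved = ["d1", "d4", "d2", "d9", "d7"]
relevant = {"d1", "d2", "d3"}
precision, recall = precision_recall_at_k(retrieved, relevant, k=5)
print(f"precision@5={precision:.2f} recall@5={recall:.2f}")
```

Averaging both values over the golden set gives the headline figures; low precision with acceptable recall points at the reranking and filtering levers rather than at the retriever's coverage.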

Agentic RAG — agent decides what to retrieve, in what order.
GraphRAG — knowledge graph + RAG (Microsoft technique).
Long-context LLM — sometimes obviates RAG for small corpora.
Multi-modal RAG — embed images / tables / PDFs as multi-modal vectors.
Tool use — let LLM call functions (search, DB) instead of (or in addition to) retrieval.
Function calling / structured output — JSON mode for reliable downstream parsing.
Distillation — train a smaller model from a stronger one for cost.
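
Structured output pairs naturally with the cite-by-ID pattern from the Q&A above. A minimal sketch, assuming the model was instructed to return JSON with `answer` and `citations` keys; that schema, `parse_answer`, and the example.com URLs are illustrative assumptions, not any provider's format.

```python
import json


def parse_answer(raw, sources):
    """Validate model output and map cited chunk IDs to URLs.

    raw: JSON text, expected shape {"answer": str, "citations": [chunk IDs]}.
    sources: dict of chunk ID -> URL for the chunks that were in the prompt.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("missing or non-string 'answer'")
    cited = data.get("citations", [])
    unknown = [c for c in cited if c not in sources]
    if unknown:
        # The model cited an ID that was never provided: treat as a failure.
        raise ValueError(f"unknown citation IDs: {unknown}")
    return data["answer"], [sources[c] for c in cited]


sources = {"c1": "https://example.com/a", "c2": "https://example.com/b"}
raw = '{"answer": "Reset it in Settings.", "citations": ["c2"]}'
answer, urls = parse_answer(raw, sources)
print(answer, urls)
```

Rejecting unknown IDs instead of dropping them silently makes fabricated citations visible in logs and evals.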