RAG / LLM — Theory
RAG / LLM — Theory (interview deep-dive)
Why naive RAG fails in production
“Embed PDFs, top-k cosine, stuff in prompt” works in a demo but breaks on:
- Acronyms / SKUs / proper nouns (dense vectors are weak at exact-term matching).
- Multi-document reasoning.
- Numerical / tabular data.
- Disambiguation between similar documents.
- Long-tail queries that need filtering, not similarity.
Production-grade RAG = hybrid retrieval + reranking + structured prompting + evaluation.
When RAG vs fine-tune vs both
- RAG: knowledge that changes, is large, or is private; also when answers need citations.
- Fine-tune: changes to style, format, or behavior; small, specific behavior shifts.
- Both: fine-tune the base model on domain language, then use RAG for facts.
Don’t fine-tune for facts that change. Fine-tuning is brittle for knowledge updates.
Vector DB choice
| | Pinecone | Weaviate | Qdrant | Milvus | pgvector |
|---|---|---|---|---|---|
| Hosting | managed | both | both | both | self/PG |
| Hybrid search | yes | yes | yes | yes | recent (paired with Postgres full-text search) |
| Filtering | yes | yes | strong | yes | yes |
| Scale | huge | large | large | huge | medium |
| Maturity | high | high | high | high | medium |
For starting: pgvector (already on Postgres) or Qdrant. For scale: Milvus, Pinecone.
Hybrid retrieval — RRF
Reciprocal Rank Fusion combines multiple ranked lists:
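score(d) = Σ_i 1 / (k + rank_i(d))
Here rank_i(d) is the rank of document d in list i, and a list that does not contain d contributes nothing. k = 60 is typical. Fusing vector top-N with BM25 top-N usually beats either alone.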
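The fusion formula can be sketched in a few lines. This is a minimal illustration, assuming each retriever returns a best-first list of document IDs with 1-based ranks; the function name `rrf_fuse` and the sample IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse several best-first lists of doc IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based; a doc missing from a list adds nothing for that list.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical top-3 results from a vector index and from BM25.
vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]

fused = rrf_fuse([vector_hits, bm25_hits])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Documents that appear in both lists (`d1`, `d3`) rise above documents that appear in only one, which is the behavior the fusion is meant to produce. Only ranks are used, so the two retrievers' raw scores never need to be on the same scale.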
Cross-encoder vs bi-encoder
- Bi-encoder: encode query and doc separately → cosine similarity. Fast (vectors precomputed).
- Cross-encoder: encode (query, doc) pair jointly → score. Slow (one inference per pair).
Use bi-encoder for retrieval (fast top-k), cross-encoder for reranking the top-k (precise but few pairs).
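The two-stage pattern can be sketched as below. To keep the example runnable without model downloads, `toy_embed` (bag-of-words counts) stands in for a bi-encoder and `toy_pair_score` (query-term coverage) stands in for a cross-encoder; both are hypothetical placeholders, not real models. What matters is the shape: document vectors are computed once, while the pair scorer runs only on the shortlist.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in bi-encoder: bag-of-words counts. A real system uses a neural encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_pair_score(query, doc):
    """Stand-in cross-encoder: looks at query and doc together (fraction of query terms in doc)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def retrieve_then_rerank(query, docs, doc_vectors, k_retrieve=3, k_final=2):
    q_vec = toy_embed(query)
    # Stage 1: cheap similarity against precomputed doc vectors, keep a shortlist.
    shortlist = sorted(
        range(len(docs)), key=lambda i: cosine(q_vec, doc_vectors[i]), reverse=True
    )[:k_retrieve]
    # Stage 2: one pair-scoring call per shortlisted doc only.
    reranked = sorted(shortlist, key=lambda i: toy_pair_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:k_final]]


docs = [
    "reset your password from the login page",
    "password password password policy",
    "billing and invoices",
    "how to reset a router",
]
doc_vectors = [toy_embed(d) for d in docs]  # computed once, offline

top = retrieve_then_rerank("reset password", docs, doc_vectors)
print(top)
```

In this toy data the first stage ranks the keyword-stuffed policy document highest, and the second stage moves the document that covers both query terms to the top. That is the kind of correction a reranker is there for.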
Context window strategy
LLM context windows keep growing (200k, 1M tokens), but:
- “Lost in the middle”: models attend best to start and end, miss middle.
- Larger context = more cost + latency.
- More context ≠ better answer; signal-to-noise matters.
Best practice: 5-10 well-chosen chunks, reranked, with the most relevant placed at the end of the context.
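One way to apply this when assembling the prompt is sketched below. The function name `pack_context`, the default limits, and the 4-characters-per-token estimate are assumptions for illustration; a real pipeline would count tokens with the model's tokenizer.

```python
def pack_context(scored_chunks, max_chunks=8, max_tokens=3000):
    """scored_chunks: list of (relevance_score, text) pairs.

    Keeps the highest-scoring chunks that fit the budget, then orders them
    weakest-first so the most relevant chunk ends up last in the context.
    """
    best_first = sorted(scored_chunks, key=lambda chunk: chunk[0], reverse=True)
    kept, used_tokens = [], 0
    for score, text in best_first:
        cost = max(1, len(text) // 4)  # rough estimate: ~4 characters per token
        if len(kept) == max_chunks or used_tokens + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append((score, text))
        used_tokens += cost
    kept.reverse()  # strongest chunk last
    return [text for _, text in kept]


chunks = [(0.2, "c"), (0.9, "a"), (0.5, "b")]
ordered = pack_context(chunks, max_chunks=2)
print(ordered)
```

The lowest-scoring chunk is dropped by the cap, and the best chunk lands in the final position.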
Hallucination mitigation
- System prompt: “answer only from context”.
- Add citation requirement.
- Low temperature.
- Confidence threshold — abstain if no good chunk.
- Fact-check with second pass (LLM-as-judge).
- Eval with RAGAS faithfulness metric in CI.
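The first few items in this list can be combined in a small prompt builder, sketched below. The `min_score` threshold of 0.5, the chunk-ID format, and the prompt wording are illustrative assumptions; the right threshold depends on the retriever or reranker and should be tuned on an eval set.

```python
ABSTAIN_MESSAGE = "I don't have enough information in the provided documents to answer that."


def build_grounded_prompt(question, scored_chunks, min_score=0.5):
    """scored_chunks: list of (score, chunk_id, text).

    Returns a prompt string, or None when no chunk clears the threshold,
    in which case the caller should abstain instead of calling the LLM.
    """
    usable = [(cid, text) for score, cid, text in scored_chunks if score >= min_score]
    if not usable:
        return None
    context = "\n".join(f"[{cid}] {text}" for cid, text in usable)
    return (
        "Answer only from the context below. "
        "Cite the chunk ID in square brackets after each claim. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


weak = build_grounded_prompt("What is the refund window?", [(0.2, "c1", "Shipping takes 3 days.")])
reply = ABSTAIN_MESSAGE if weak is None else weak

strong = build_grounded_prompt(
    "What is the refund window?",
    [(0.8, "c2", "Refunds are accepted within 30 days."), (0.1, "c3", "Unrelated text.")],
)
print(strong)
```

Abstaining before the LLM call also saves the cost of a generation that would most likely be unsupported by the context.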
Multi-step retrieval (agentic)
For complex questions, a single retrieval pass often misses. Patterns:
- Decompose — LLM breaks question into sub-questions, retrieve per sub-q.
- HyDE — LLM writes a hypothetical answer, embed that, retrieve neighbors.
- Self-query — LLM extracts metadata filter + semantic query.
- ReAct — LLM iteratively decides which tool to use until enough info.
These improve quality but add cost and latency, so use them selectively for hard queries.
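As an illustration of one of these patterns, here is a minimal HyDE sketch. The LLM and the embedding model are passed in as callables; `fake_llm` and `toy_embed` below are hypothetical stand-ins so the example runs on its own, and a real system would call an actual model and a vector index instead of sorting a list.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: bag-of-words counts (assumption, not a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_retrieve(question, generate, embed, docs, k=1):
    # 1. Ask the LLM for a hypothetical answer (it may be wrong; only its wording matters).
    hypothetical = generate(f"Write a short passage that answers: {question}")
    # 2. Embed the hypothetical answer instead of the question.
    h_vec = embed(hypothetical)
    # 3. Return the documents nearest to that embedding.
    return sorted(docs, key=lambda d: cosine(h_vec, embed(d)), reverse=True)[:k]


def fake_llm(prompt):
    """Stand-in for an LLM call; returns a canned hypothetical answer."""
    return "restart the router and wait for the status light to turn green"


docs = [
    "the status light turns green after you restart the router",
    "invoices are sent monthly",
    "our office hours are nine to five",
]
question = "my internet is down, what do I do?"

hits = hyde_retrieve(question, fake_llm, toy_embed, docs)
direct_score = cosine(toy_embed(question), toy_embed(docs[0]))
print(hits, direct_score)
```

The question shares no terms with the relevant document, so embedding it directly scores zero here, while the hypothetical answer lands next to the right passage. The cost is one extra LLM call per query.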
Caching
- Exact-match cache: same query → same answer. Only pays off for popular, identically worded questions.
- Semantic cache: embed the query and look up similar past queries. Higher hit rate, but a near-miss can return the wrong answer.
- Retrieval cache: same query → same retrieved chunks. Cheaper than re-running retrieval.
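A semantic cache can be sketched as a similarity lookup with a threshold. The class name `SemanticCache`, the 0.9 threshold, and the fixed-vocabulary `toy_embed` are assumptions for illustration; a production cache would use a real embedding model, a vector index rather than a linear scan, and an eviction or expiry policy.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold  # higher = fewer hits, fewer wrong answers
        self.entries = []  # list of (query_vector, answer)

    def get(self, query):
        """Return the cached answer of the most similar past query, or None on a miss."""
        q_vec = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(q_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))


VOCAB = ["reset", "password", "refund", "order", "how", "do", "i", "my"]


def toy_embed(text):
    """Stand-in embedding: word counts over a tiny fixed vocabulary."""
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in VOCAB]


cache = SemanticCache(toy_embed)
cache.put("how do i reset my password", "Use the link on the login page.")

hit = cache.get("How do I reset my password?")
miss = cache.get("refund my order")
print(hit, miss)
```

The threshold is where the accuracy risk lives: set it too low and a different question gets someone else's answer, so it is worth checking false hits on real query logs before enabling the cache.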
Cost model
Per request ≈ embedding (cheap) + retrieval (cheap) + reranker (mid) + LLM (most of the cost).
Levers:
- Smaller LLM via routing — easy queries to mini, hard to flagship.
- Cap context tokens.
- Cache responses.
- Batch where latency permits.
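Token math: ~4 characters ≈ 1 token for English text. 5k context + 500 output tokens at a flat $0.01 per 1k tokens = 5.5 × $0.01 ≈ $0.055 per request. Real pricing usually charges more for output tokens than for input tokens, so compute the two separately.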
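The arithmetic can be wrapped in a small helper. The prices below are placeholders rather than any provider's actual rates, and `estimate_tokens` uses the rough 4-characters-per-token rule instead of a real tokenizer.

```python
def estimate_tokens(text):
    """Rough token estimate for English text: ~4 characters per token."""
    return max(1, round(len(text) / 4))


def request_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """LLM cost of one request in dollars, with separate input and output prices."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


# Flat $0.01 per 1k tokens: 5k context + 500 output tokens.
flat = request_cost(5000, 500, 0.01, 0.01)

# Hypothetical split pricing: output tokens cost 3x input tokens.
split = request_cost(5000, 500, 0.01, 0.03)

print(f"flat: ${flat:.3f}/req, split: ${split:.3f}/req")
print(f"per 1M requests at flat pricing: ${flat * 1_000_000:,.0f}")
```

Multiplying by expected request volume turns the per-request figure into a budget line, which makes it easier to see how much each lever (routing, context caps, caching) is worth.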
Common interview Qs
- Walk through your RAG pipeline. See the sections above.
- N+1 queries growing as documents grow — fix? It’s not about queries; it’s about retrieval quality. Improve chunking, hybrid, rerank, filters.
- Citing sources? Include source metadata in chunks; LLM instructed to cite by ID; UI maps ID → URL.
- PII in documents? PII detection + redaction before embedding; or self-hosted LLM with strict DPA.
- Eval framework? Golden set + RAGAS (faithfulness, answer relevancy, context precision/recall) + manual review per release.
- Embedding choice? Tradeoffs: hosted vs self, dimension, cost, multilingual. Trial top-3 against your golden set; pick on retrieval recall.
- Why might cosine similarity miss the right chunk? Acronyms, exact codes, numbers, unusual proper nouns. BM25 catches these.
- Stale embeddings? CDC from doc store → re-embed pipeline. Track vector DB doc version.
- Multi-tenant isolation in vector DB? Namespace per tenant + metadata filter; verify zero leakage in tests.
- You see 60% recall but 30% precision on your eval. What now? Tighten retrieval (more filters, better embed model); add reranker; compress context.
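For the retrieval side of a golden-set eval, precision and recall at k can be computed directly from document IDs, as sketched below. This is a plain ID-based metric, not the RAGAS implementation (RAGAS metrics are LLM-assisted); the function names and the assumption that retrieved IDs are unique are illustrative.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """retrieved_ids: best-first list of unique IDs; relevant_ids: set of gold IDs."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


def evaluate(golden_set, retrieve, k=5):
    """golden_set: list of (query, set_of_relevant_ids); retrieve: query -> ranked IDs.

    Returns (mean precision@k, mean recall@k) over the golden set.
    """
    if not golden_set:
        return 0.0, 0.0
    pairs = [precision_recall_at_k(retrieve(q), relevant, k) for q, relevant in golden_set]
    mean_precision = sum(p for p, _ in pairs) / len(pairs)
    mean_recall = sum(r for _, r in pairs) / len(pairs)
    return mean_precision, mean_recall


# Hypothetical single query: 4 retrieved chunks, 3 relevant chunks in the gold set.
p, r = precision_recall_at_k(["a", "b", "c", "d"], {"a", "d", "z"}, k=4)
print(p, r)
```

Low precision with acceptable recall means the right chunks are being found but are buried in noise, which is why a reranker or tighter filters is the usual next step.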
Frontier topics (likely asked)
- Agentic RAG: an agent decides what to retrieve and in what order.
- GraphRAG: knowledge graph + RAG (a Microsoft technique).
- Long-context LLMs: sometimes remove the need for RAG on small corpora.
- Multi-modal RAG: embed images, tables, and PDF pages as multi-modal vectors.
- Tool use: let the LLM call functions (search, DB) instead of, or in addition to, retrieval.
- Function calling / structured output: JSON mode for reliable downstream parsing.
- Distillation: train a smaller model from a stronger one to cut cost.
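Even with JSON mode, structured output is worth validating before downstream code relies on it. The sketch below assumes a hypothetical response schema of `{"text": ..., "citations": [...]}` and checks that every cited ID refers to a chunk that was actually in the prompt; the schema and names are illustrative, not any provider's format.

```python
import json
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    citations: list


def parse_answer(raw, known_chunk_ids):
    """Parse and validate model output; raises ValueError on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    text = data.get("text")
    citations = data.get("citations")
    if not isinstance(text, str) or not isinstance(citations, list):
        raise ValueError("expected keys 'text' (str) and 'citations' (list)")
    # Reject citations that are not strings or that point at chunks never shown to the model.
    unknown = [c for c in citations if not isinstance(c, str) or c not in known_chunk_ids]
    if unknown:
        raise ValueError(f"cited unknown chunk IDs: {unknown}")
    return Answer(text=text, citations=citations)


known = {"c1", "c2"}
answer = parse_answer('{"text": "Refunds take 30 days [c1].", "citations": ["c1"]}', known)
print(answer)
```

A `ValueError` here is a natural trigger for a retry or a fallback response, and the citation check catches one common failure: a well-formed answer that cites a source the model never saw.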