RAG / LLM — Theory (interview deep-dive)

The naive pipeline (“embed PDFs, take top-k by cosine similarity, stuff into the prompt”) works in a demo but breaks on:

  • Acronyms / SKUs / proper nouns (vector search is weak at exact-term matching).
  • Multi-document reasoning.
  • Numerical / tabular data.
  • Disambiguation between similar documents.
  • Long-tail queries that need filtering, not similarity.

Production-grade RAG = hybrid retrieval + reranking + structured prompting + evaluation.

  • RAG — knowledge that changes / is large / is private; need citations.
  • Fine-tune — style, format, behavior change; small specific behavior shifts.
  • Both — fine-tune base for domain language, then RAG for facts.

Don’t fine-tune for facts that change. Fine-tuning is brittle for knowledge updates.

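|               | Pinecone | Weaviate | Qdrant | Milvus | pgvector |
|---------------|----------|----------|--------|--------|----------|
| Hosting       | managed  | both     | both   | both   | self/PG  |
| Hybrid search | yes      | yes      | yes    | yes    | recent   |
| Filtering     | yes      | yes      | strong | yes    | yes      |
| Scale         | huge     | large    | large  | huge   | medium   |
| Maturity      | high     | high     | high   | high   | medium   |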

To get started: pgvector (if you already run Postgres) or Qdrant. At scale: Milvus or Pinecone.

Reciprocal Rank Fusion combines multiple ranked lists:

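score(d) = Σ_i 1 / (k + rank_i(d))

Here rank_i(d) is the 1-based rank of document d in list i; a list that does not contain d contributes nothing. k = 60 is a typical default. Fusing the vector top-N with the BM25 top-N usually ranks better than either list alone.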
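
A minimal sketch of the fusion formula, assuming each retriever returns an ordered list of document IDs. The function name and the sample IDs are made up for illustration; they do not come from any particular library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs using 1-based ranks."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector top-3
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # hypothetical BM25 top-3
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])
```

Documents that appear in both lists (doc_a, doc_c) rise above documents that only one retriever found, which is the behavior hybrid retrieval relies on. RRF uses ranks only, so the raw cosine and BM25 scores never need to be normalized against each other.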

  • Bi-encoder: encode query and doc separately → cosine similarity. Fast (vectors precomputed).
  • Cross-encoder: encode (query, doc) pair jointly → score. Slow (one inference per pair).

Use bi-encoder for retrieval (fast top-k), cross-encoder for reranking the top-k (precise but few pairs).
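
The two-stage pattern can be sketched without any model dependency. In the sketch below the document vectors are hand-written stand-ins for bi-encoder output, and `toy_pair_scorer` is a hypothetical placeholder for a real cross-encoder; only the control flow is the point.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, query_text, index, pair_scorer, top_k=3, top_n=2):
    """index: list of (doc_id, text, precomputed_vector) tuples."""
    # Stage 1 (bi-encoder style): cheap similarity against precomputed vectors.
    candidates = sorted(index, key=lambda d: cosine(query_vec, d[2]), reverse=True)[:top_k]
    # Stage 2 (cross-encoder style): one scorer call per (query, doc) pair.
    reranked = sorted(candidates, key=lambda d: pair_scorer(query_text, d[1]), reverse=True)
    return [doc_id for doc_id, _, _ in reranked[:top_n]]


def toy_pair_scorer(query, doc):
    # Placeholder for a cross-encoder: fraction of query words present in the doc.
    q_words = set(query.lower().split())
    return len(q_words & set(doc.lower().split())) / len(q_words)


index = [
    ("d1", "refund policy for annual plans", [0.9, 0.1]),
    ("d2", "refund window is 30 days", [0.8, 0.3]),
    ("d3", "office opening hours", [0.1, 0.9]),
    ("d4", "annual plan pricing", [0.7, 0.2]),
]
result = retrieve_then_rerank([1.0, 0.0], "refund window days", index, toy_pair_scorer)
print(result)
```

Stage 1 ranks d2 only third by vector similarity, but the pairwise scorer promotes it to first. The expensive scorer runs on `top_k` pairs, not on the whole index.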

LLM context windows keep growing (200k, 1M tokens), but:

  • “Lost in the middle”: models attend best to the start and end of the context and tend to miss content in the middle.
  • Larger context = more cost + latency.
  • More context ≠ better answer; signal-to-noise matters.

Best practice: 5-10 well-chosen, reranked chunks, with the most relevant placed at the end, closest to the question.
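
One way to apply this ordering, as a small sketch. It assumes each chunk already carries a reranker score; the function name and the `max_chunks` default are illustrative choices.

```python
def order_for_prompt(scored_chunks, max_chunks=8):
    """scored_chunks: list of (chunk_text, rerank_score). Returns texts, best last."""
    top = sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:max_chunks]
    # Reverse so the highest-scoring chunk sits last, next to the question.
    return [text for text, _ in reversed(top)]


chunks = [("pricing table", 0.2), ("refund policy", 0.9), ("refund FAQ", 0.5)]
ordered = order_for_prompt(chunks, max_chunks=2)
print(ordered)
```

The lowest-scoring chunk is dropped entirely, and the survivors are emitted weakest first so the strongest evidence lands at the end of the context.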

To reduce hallucinations:

  • System prompt: “answer only from context”.
  • Add citation requirement.
  • Low temperature.
  • Confidence threshold — abstain if no good chunk.
  • Fact-check with second pass (LLM-as-judge).
  • Eval with RAGAS faithfulness metric in CI.
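
Several of these guards can live in the prompt builder itself. The sketch below assumes each retrieved chunk is a dict with `id`, `text`, and `score` keys; the threshold value and the abstention wording are illustrative, not recommendations.

```python
ABSTAIN = "I don't know based on the provided documents."


def build_grounded_prompt(question, chunks, min_score=0.5):
    """Return a grounded prompt, or None when no chunk clears the threshold."""
    usable = [c for c in chunks if c["score"] >= min_score]
    if not usable:
        return None  # caller should return ABSTAIN without calling the LLM
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in usable)
    return (
        "Answer only from the context below. "
        "Cite the chunk ID in brackets after each claim. "
        f"If the context is insufficient, reply exactly: {ABSTAIN}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


chunks = [
    {"id": "c1", "text": "Refunds are accepted within 30 days.", "score": 0.82},
    {"id": "c2", "text": "Offices close at 6 pm.", "score": 0.12},
]
prompt = build_grounded_prompt("What is the refund window?", chunks)
```

Low-scoring chunks are filtered before they reach the model, and a fully empty context short-circuits to abstention. Temperature, the second-pass judge, and the CI eval sit outside this function.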

For complex questions, single retrieval often misses. Patterns:

  • Decompose — LLM breaks question into sub-questions, retrieve per sub-q.
  • HyDE — LLM writes a hypothetical answer, embed that, retrieve neighbors.
  • Self-query — LLM extracts metadata filter + semantic query.
  • ReAct — LLM iteratively decides which tool to use until enough info.

These help quality, hurt cost + latency. Use selectively for hard queries.
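
As an illustration of the decompose pattern, the sketch below takes the LLM call and the retriever as injected functions. `fake_split`, `fake_retrieve`, and `FAKE_INDEX` are hypothetical stand-ins so the example runs without a model or a vector database.

```python
def decompose_and_retrieve(question, split_fn, retrieve_fn, per_query_k=3):
    """split_fn: question -> list of sub-questions (an LLM call in practice).
    retrieve_fn: (query, k) -> list of chunk IDs."""
    seen, merged = set(), []
    for sub_q in split_fn(question):
        for chunk_id in retrieve_fn(sub_q, per_query_k):
            if chunk_id not in seen:  # de-duplicate across sub-questions
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


def fake_split(question):
    # Stand-in for an LLM: split on " and " and restore the question mark.
    return [part.strip() + "?" for part in question.rstrip("?").split(" and ")]


FAKE_INDEX = {
    "What is the refund window?": ["c1", "c2"],
    "who approves exceptions?": ["c2", "c3"],
}


def fake_retrieve(query, k):
    return FAKE_INDEX.get(query, [])[:k]


merged = decompose_and_retrieve(
    "What is the refund window and who approves exceptions?", fake_split, fake_retrieve
)
print(merged)
```

Each sub-question costs one extra retrieval (and one extra LLM call for the split), which is where the added cost and latency come from.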

  • Exact match cache — same query → same answer. Useful only for popular questions.
  • Semantic cache — embed query, look up similar past queries. Higher hit rate, accuracy risks.
  • Retrieval cache — same query → same retrieved chunks. Cheaper than recomputing.
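
A semantic cache can be sketched as a linear scan over stored query vectors. This version assumes queries are already embedded; the class name and the 0.95 threshold are illustrative, and a real deployment would likely back the lookup with a vector index.

```python
import math


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, answer)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query_vec):
        """Return the cached answer of the most similar past query, or None."""
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = self._cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query_vec, answer):
        self.entries.append((query_vec, answer))


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "Refunds are accepted within 30 days.")
near_hit = cache.get([0.99, 0.05])  # close paraphrase of the stored query
miss = cache.get([0.0, 1.0])        # unrelated query
```

The threshold is the accuracy lever: lowering it raises the hit rate but increases the chance of returning an answer to a question that only looks similar.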

Cost per request ≈ embedding (cheap) + retrieval (cheap) + reranker (moderate) + LLM (dominant cost).

Levers:

  • Smaller LLM via routing — easy queries to mini, hard to flagship.
  • Cap context tokens.
  • Cache responses.
  • Batch where latency permits.

Token math: roughly 4 characters ≈ 1 token for English text. 5k context tokens + 500 output tokens at a flat $0.01 per 1k tokens ≈ $0.055 per request.
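
The same arithmetic as a small helper, assuming separate input and output prices so it also covers models that charge the two differently. The prices are placeholders, not real vendor rates.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from character count (rule of thumb only)."""
    return math.ceil(len(text) / chars_per_token)


def request_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Dollar cost of one LLM call given per-1k-token prices."""
    return (input_tokens / 1000) * input_price_per_1k + (
        output_tokens / 1000
    ) * output_price_per_1k


flat = request_cost(5000, 500, 0.01, 0.01)          # the flat-rate example
capped = request_cost(2000, 500, 0.01, 0.01)        # same call with context capped at 2k
print(f"${flat:.3f} vs ${capped:.3f} per request")
```

Capping context from 5k to 2k tokens cuts the per-request cost from about $0.055 to about $0.025 at these placeholder prices, which is why the context cap is listed as a lever.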

Interview questions:

  1. Walk through your RAG pipeline. See the sections above.
  2. N+1 queries growing as documents grow — fix? It’s not about queries; it’s about retrieval quality. Improve chunking, hybrid, rerank, filters.
  3. Citing sources? Include source metadata in chunks; LLM instructed to cite by ID; UI maps ID → URL.
  4. PII in documents? PII detection + redaction before embedding; or self-hosted LLM with strict DPA.
  5. Eval framework? Golden set + RAGAS (faithfulness, answer relevancy, context precision/recall) + manual review per release.
  6. Embedding choice? Tradeoffs: hosted vs self, dimension, cost, multilingual. Trial top-3 against your golden set; pick on retrieval recall.
  7. Why might cosine similarity miss the right chunk? Acronyms, exact codes, numbers, unusual proper nouns. BM25 catches these.
  8. Stale embeddings? CDC from doc store → re-embed pipeline. Track vector DB doc version.
  9. Multi-tenant isolation in vector DB? Namespace per tenant + metadata filter; verify zero leakage in tests.
  10. You see 60% recall but 30% precision on your eval. What now? Tighten retrieval (more filters, better embed model); add reranker; compress context.
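
For question 10, it helps to be precise about how the two numbers are computed. The sketch below assumes a golden set that maps each query to its relevant chunk IDs; the IDs are invented so that the result matches the 60% recall / 30% precision case.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """retrieved: ranked list of chunk IDs; relevant: set of gold chunk IDs."""
    top = retrieved[:k]
    hits = sum(1 for chunk_id in top if chunk_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = [f"d{i}" for i in range(1, 11)]   # d1 .. d10, in ranked order
relevant = {"d1", "d4", "d7", "x1", "x2"}     # two gold chunks were never retrieved
precision, recall = precision_recall_at_k(retrieved, relevant, k=10)
print(precision, recall)
```

Three of ten retrieved chunks are relevant (precision 0.3) and three of five gold chunks were found (recall 0.6). A reranker or tighter filters attack the seven irrelevant chunks; the two missing gold chunks need better recall from the first-stage retriever.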

Related topics to know:

  • Agentic RAG — agent decides what to retrieve, in what order.
  • GraphRAG — knowledge graph + RAG (Microsoft technique).
  • Long-context LLM — sometimes obviates RAG for small corpora.
  • Multi-modal RAG — embed images / tables / PDFs as multi-modal vectors.
  • Tool use — let LLM call functions (search, DB) instead of (or in addition to) retrieval.
  • Function calling / structured output — JSON mode for reliable downstream parsing.
  • Distillation — train a smaller model from a stronger one for cost.