RAG / LLM — Basics

Retrieval-Augmented Generation (RAG): instead of relying only on an LLM’s training data, fetch relevant context from your own knowledge base, inject it into the prompt, then generate. Benefits:

  • Works around the LLM’s stale knowledge cutoff.
  • Reduces hallucinations on niche domains.
  • Lets answers cite their source (“according to doc X”).
  • Needs no retraining or fine-tuning.

Pipeline:

docs → chunk → embed → store in vector DB
query → embed → similarity search → rerank → assemble prompt → LLM → answer

Components:

  • Document loader — parse PDFs/HTML/markdown.
  • Chunker — split into ~500-1500 token pieces with overlap.
  • Embedder — model that turns text into vectors (OpenAI text-embedding-3-large, Cohere embed-v3, Voyage, BGE).
  • Vector DB — Pinecone, Weaviate, Qdrant, Milvus, Chroma, pgvector, OpenSearch, Elasticsearch dense_vector.
  • Retriever — given a query, fetch top-k similar chunks. Hybrid = vector + keyword (BM25).
  • Reranker — cross-encoder re-scores top-k. BGE reranker, Cohere rerank, Voyage rerank.
  • LLM — GPT-4/4o/4.5, Claude, Gemini, Llama 3, Mistral.
  • Prompt template — assemble retrieved context + user question + system instructions.
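
The two pipeline lines above can be sketched end to end in a few lines. This is a toy under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the "vector DB" is a Python list, and the sample chunks are invented for illustration. Reranking and the LLM call are left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing side: chunk -> embed -> store (here, an in-memory list)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Query side: embed the query, rank stored chunks by similarity, keep top-k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical knowledge-base chunks, already split.
chunks = [
    "Refunds are processed within 14 days of the return request.",
    "The API rate limit is 100 requests per minute per key.",
    "Support is available by email on weekdays.",
]
index = build_index(chunks)
top = retrieve(index, "What is the API rate limit?", k=1)
print(top)
```

In a real system each function is replaced by a component from the list above; the shape of the data flow stays the same.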

Embedding: a vector that captures the meaning of a text. Semantically similar sentences map to nearby vectors (small cosine distance).

Dimensions: 384 (small), 768, 1024, 1536 (OpenAI ada-002 and text-embedding-3-small), 3072 (OpenAI text-embedding-3-large). Higher = more nuance, but more memory and slower search.
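
To make the memory cost concrete, here is a rough back-of-the-envelope calculation. It assumes vectors are stored as float32 (4 bytes per value) and counts raw vector data only; index structures, metadata, and replicas come on top. The function name is made up for this sketch.

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone, assuming float32 by default."""
    return num_vectors * dims * bytes_per_value


for dims in (384, 768, 1536, 3072):
    gib = index_size_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>5} dims: {gib:.2f} GiB per million vectors")
```

Going from 384 to 3072 dimensions multiplies raw storage by 8, which is why smaller models or quantized vectors are worth considering for large corpora.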

Embedding model trade-offs:

  • OpenAI ada/3-small/3-large — easy, paid per token, hosted.
  • Cohere embed-v3 — strong English + multilingual.
  • BGE (BAAI) — open source, runs on your hardware.
  • Voyage — high quality, focused on retrieval.

Chunking strategies:

  • Fixed-size — N tokens, often 500-1500 with 10-20% overlap.
  • Semantic — split at sentence/paragraph boundaries; merge until size limit.
  • Document structure — split by headings, sections.
  • Recursive — try a big delimiter first, fall back to smaller ones.
  • Embedding- or LLM-based — place boundaries where embedding similarity between adjacent sentences drops, or ask an LLM to propose them.
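
As an illustration of the fixed-size strategy, here is a minimal sketch of chunking with overlap. It assumes the text is already tokenized into a list; whitespace-split words stand in for real model tokens, and `chunk_tokens` is a name chosen for this example.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(tokens, size=4, overlap=1)
for c in chunks:
    print(" ".join(c))
```

With `size=4, overlap=1` each window starts 3 tokens after the previous one, so the last token of one chunk reappears as the first token of the next.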

Trade-offs:

  • Smaller chunks = precise retrieval, may lack context.
  • Larger chunks = more context but more noise; the embedding averages over several topics and gets diluted.

Add metadata: source URL, headings, timestamp, version. Use for filtering + citations.
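
One possible way to carry that metadata is a small record per chunk. The field names, the `Chunk` class, and the example URLs below are assumptions for illustration, not a schema required by any vector DB.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    text: str
    source_url: str
    heading: str
    updated: date
    version: str


def filter_chunks(chunks: list[Chunk], updated_after: Optional[date] = None,
                  version: Optional[str] = None) -> list[Chunk]:
    """Keep only chunks that pass the metadata filters."""
    kept = []
    for c in chunks:
        if updated_after is not None and c.updated <= updated_after:
            continue
        if version is not None and c.version != version:
            continue
        kept.append(c)
    return kept


def cite(chunk: Chunk) -> str:
    """Render a human-readable citation from the chunk's metadata."""
    return f"{chunk.heading} ({chunk.source_url}, v{chunk.version}, updated {chunk.updated.isoformat()})"


chunks = [
    Chunk("Install with pip.", "https://example.com/docs/install", "Installation", date(2023, 5, 1), "1.0"),
    Chunk("Install with uv or pip.", "https://example.com/docs/install", "Installation", date(2024, 9, 12), "2.0"),
]
fresh = filter_chunks(chunks, updated_after=date(2024, 1, 1))
print(cite(fresh[0]))
```

Most vector DBs accept such fields as a payload next to the vector and can apply the filter server-side.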

Retrieval strategies:

  • Pure vector: top-k by cosine similarity. Fast, semantic.
  • BM25: lexical / keyword match. Catches exact terms (model numbers, acronyms).
  • Hybrid: combine via Reciprocal Rank Fusion (RRF) or weighted score. Production default.
  • Filtering: by metadata (date range, source type) before similarity.
  • Multi-query: rewrite the question into several variants, embed each, union the results.
  • HyDE: have the LLM write a hypothetical answer, embed that instead of the question, and search with it.
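
Reciprocal Rank Fusion only needs the rank of each document in each result list, so it sidesteps the problem that vector and BM25 scores live on different scales. A minimal sketch follows; `k=60` is the commonly used constant, and the document ids are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["d3", "d1", "d7"]   # hypothetical top-3 from vector search
bm25_hits = ["d1", "d9", "d3"]     # hypothetical top-3 from BM25
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Documents that appear in both lists (`d1`, `d3`) rise above those found by only one retriever, which is the behavior hybrid search relies on.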

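Reranking: after retrieving 20-50 candidates, run a cross-encoder that takes each (query, chunk) pair and outputs a relevance score. This is usually more accurate than bi-encoder cosine similarity because the model reads query and chunk together, but it is slower (one transformer pass per pair).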

Pattern: retrieve top-50 → rerank to top-5 → put in prompt.
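
The pattern can be sketched with the scorer passed in as a function. The `overlap_score` below is only a word-overlap stand-in so the example runs without a model; in practice it would be replaced by a call to a cross-encoder. All names and sample texts are hypothetical.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    scored = [(score(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep retrieval order
    return [chunk for _, chunk in scored[:top_n]]


def overlap_score(query: str, chunk: str) -> float:
    """Stand-in scorer: fraction of query words that occur in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


candidates = [
    "billing runs on the first of the month",
    "password rules: at least 12 characters",
    "reset your password from the settings page",
]
best = rerank("how do I reset my password", candidates, overlap_score, top_n=2)
print(best)
```

Because the scorer runs once per candidate, the candidate count (50 in the pattern above) directly bounds reranking latency.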

Tools: BGE reranker, Cohere Rerank, Voyage rerank-2, in-house cross-encoder.

Prompt template:

You are an assistant answering only from the provided context.
If the context doesn't contain the answer, say "I don't know".
CONTEXT:
[chunk 1]
[chunk 2]
QUESTION: {user question}
ANSWER:

Add a citation instruction such as “Cite source by [doc_id]” so each claim can be traced back to a chunk.
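
A small helper can assemble this template from retrieved chunks. This is a sketch: the `build_prompt` name, the `(doc_id, text)` tuple shape, and the sample ids are assumptions made for the example.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Fill the template with (doc_id, text) pairs and a citation instruction."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in chunks)
    return (
        "You are an assistant answering only from the provided context.\n"
        'If the context doesn\'t contain the answer, say "I don\'t know".\n'
        "Cite source by [doc_id].\n"
        f"CONTEXT:\n{context}\n"
        f"QUESTION: {question}\n"
        "ANSWER:"
    )


prompt = build_prompt(
    "What is the API rate limit?",
    [("kb-12", "The API rate limit is 100 requests per minute per key."),
     ("kb-31", "Limits reset at the start of each minute.")],
)
print(prompt)
```

Prefixing each chunk with its id gives the model something concrete to cite and lets you check afterwards that every cited id was actually in the context.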

LLM choice:

  • Quality: GPT-4-class, Claude Opus, Gemini Pro.
  • Cost-effective: GPT-4o mini, Claude Haiku, Gemini Flash, Llama 3.
  • Self-hosted: Llama 3, Mistral, Qwen, DeepSeek, served via vLLM, llama.cpp, or TGI.

For RAG, strong instruction-following and a long context window matter more than raw knowledge.

Evaluation:

  • RAGAS: faithfulness, answer relevancy, context precision, context recall.
  • DeepEval for test-style evals; Phoenix and Langfuse for production tracing.
  • Manual: golden questions + expected answers; track regressions per release.
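
The manual golden-set approach can start very small. The sketch below tracks retrieval quality only, using plain recall@k over hand-labeled chunk ids; it is not the LLM-judged context recall that RAGAS computes. The golden questions, ids, and the fake retriever are invented for illustration.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the labeled relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def evaluate(golden: list[dict], retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """Mean recall@k over the golden set; compare this number across releases."""
    scores = [recall_at_k(retrieve(g["question"]), g["relevant"], k) for g in golden]
    return sum(scores) / len(scores) if scores else 0.0


golden = [
    {"question": "q1", "relevant": {"d1", "d4"}},
    {"question": "q2", "relevant": {"d3", "d5"}},
]
# Stand-in for the real retriever: canned results per question.
fake_results = {"q1": ["d1", "d2", "d4"], "q2": ["d9", "d3"]}
mean_recall = evaluate(golden, lambda q: fake_results[q], k=3)
print(f"mean recall@3 = {mean_recall:.2f}")
```

Running this in CI with a fixed threshold turns the golden set into a simple regression gate.
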

Frameworks:

  • LangChain — popular, sprawling. Useful primitives, easy to lose control.
  • LlamaIndex — focused on retrieval/indexing.
  • Haystack — production-grade, modular.
  • DSPy — programmatic, optimizes prompts/pipelines.
  • DIY — often best long-term: thin wrappers around vector DB + LLM API. Less magic.

Common pitfalls:

  • Hallucinations — the LLM ignores the context and invents an answer. Mitigations: explicit prompt instructions, low temperature, a citation requirement, eval gates, and a smaller model that proves more reliable on grounded tasks.
  • Cost — embeddings + LLM tokens add up. Cache, batch, smaller models, hybrid retrieval.
  • Latency — embed query + vector search + rerank + LLM. p99 often dominated by LLM. Stream output.
  • Stale data — schedule re-embedding on doc updates.
  • Multi-tenant data isolation — namespace per tenant, RBAC on metadata filters.
  • Privacy — don’t send PII to a hosted LLM without a data processing agreement (DPA); consider self-hosting.
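
For the cost and stale-data points, one common approach is to store a content hash next to each embedding and re-embed only what changed. A minimal sketch, assuming documents are keyed by a stable id; `plan_reembedding` and the sample data are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reembedding(docs: dict[str, str],
                     stored_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (ids to embed or re-embed, ids to delete from the vector DB)."""
    to_embed = [doc_id for doc_id, text in docs.items()
                if stored_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in docs]
    return to_embed, to_delete


# Current state of the knowledge base vs. what was indexed last time.
docs = {"a": "hello", "b": "changed text", "c": "new doc"}
stored = {
    "a": content_hash("hello"),
    "b": content_hash("old text"),
    "d": content_hash("gone"),
}
to_embed, to_delete = plan_reembedding(docs, stored)
print(to_embed, to_delete)
```

Unchanged documents are skipped entirely, so a scheduled job pays embedding costs only for new or edited content.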