RAG / LLM Integration — Basics

What RAG is

Retrieval-Augmented Generation: instead of relying only on an LLM’s training data, fetch relevant context from your own knowledge base, inject it into the prompt, then generate. Without retraining the model, it addresses:

Stale knowledge past the LLM’s training cutoff.
Hallucinations on niche domains.
Citing sources (“according to doc X”).

Pipeline:

docs → chunk → embed → store in vector DB
query → embed → similarity search → rerank → assemble prompt → LLM → answer
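
The two pipeline lines above can be sketched end to end in a few lines. This is a toy under stated assumptions, not a production design: the `embed` function is a bag-of-words stand-in for a real embedding model, the "vector DB" is a Python list, each document is a single chunk, there is no rerank step, and the document texts and ids are made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Indexing side: docs -> chunk -> embed -> store (one chunk per doc here).
docs = {
    "doc-1": "Refunds are processed within 14 days of the return request.",
    "doc-2": "The API rate limit is 100 requests per minute per key.",
}
index = [(doc_id, text, embed(text)) for doc_id, text in docs.items()]

def retrieve(query: str, k: int = 1) -> list[tuple[str, str]]:
    """Query side: embed -> similarity search -> top-k (doc_id, text) pairs."""
    q = embed(query)
    ranked = sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)
    return [(doc_id, text) for doc_id, text, _ in ranked[:k]]

def assemble_prompt(query: str, hits: list[tuple[str, str]]) -> str:
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return f"CONTEXT:\n{context}\n\nQUESTION: {query}\n\nANSWER:"

question = "What is the API rate limit?"
hits = retrieve(question)
prompt = assemble_prompt(question, hits)
print(prompt)  # this string is what would be sent to the LLM
```

Swapping `embed` for a real model and `index` for a vector DB changes the plumbing, not the shape of the flow.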

Components

Document loader — parse PDFs/HTML/markdown.
Chunker — split into ~500-1500 token pieces with overlap.
Embedder — model that turns text into vectors (OpenAI text-embedding-3-large, Cohere embed-v3, Voyage, BGE).
Vector DB — Pinecone, Weaviate, Qdrant, Milvus, Chroma, pgvector, OpenSearch, Elasticsearch dense_vector.
Retriever — given a query, fetch top-k similar chunks. Hybrid = vector + keyword (BM25).
Reranker — cross-encoder re-scores top-k. BGE reranker, Cohere rerank, Voyage rerank.
LLM — GPT-4 family (e.g. GPT-4o, GPT-4.1), Claude, Gemini, Llama 3, Mistral.
Prompt template — assemble retrieved context + user question + system instructions.

Embeddings

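A vector that captures meaning. Similar sentences → close vectors (small cosine distance, i.e. high cosine similarity).

Dimensions: 384 (small), 768, 1024, 1536 (OpenAI text-embedding-3-small), 3072 (OpenAI text-embedding-3-large). Higher = more nuance, but more memory and slower search.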
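
A small sketch of both points: cosine similarity between vectors, and how dimension count drives memory. The three-dimensional vectors are hypothetical values chosen for illustration (real embeddings come from a model), and the size estimate assumes raw float32 storage with no index overhead.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", made up for this example.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related concepts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True

def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw float32 storage only; ignores index structures and metadata."""
    return num_vectors * dims * bytes_per_value

# One million chunks at 384 vs 3072 dimensions.
print(index_size_bytes(1_000_000, 384) / 1e9)   # 1.536 (GB)
print(index_size_bytes(1_000_000, 3072) / 1e9)  # 12.288 (GB)
```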

Embedding model trade-offs:

OpenAI text-embedding-ada-002 (legacy), text-embedding-3-small, text-embedding-3-large — hosted, easy to adopt, paid per token.
Cohere embed-v3 — strong English + multilingual.
BGE (BAAI) — open source, runs on your hardware.
Voyage — high quality, focused on retrieval.

Chunking strategies

Fixed-size — N tokens, often 500-1500 with 10-20% overlap.
Semantic — split at sentence/paragraph boundaries; merge until size limit.
Document structure — split by headings, sections.
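Recursive — split on the largest delimiter first (e.g. blank lines), then fall back to smaller ones (newlines, sentences, words) for pieces that are still too big.
Embedding- or LLM-based — place boundaries where the embedding similarity between adjacent sentences drops, or ask an LLM to propose split points.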
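
Two of the strategies above, sketched with the standard library only. Assumptions: the fixed-size version takes a pre-tokenized list (use your embedder's tokenizer in practice), the recursive version measures size in characters rather than tokens, and the separator list is an illustrative choice, not a standard.

```python
def chunk_fixed(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Fixed-size windows; consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks

def chunk_recursive(text: str, max_chars: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the biggest separator present, merge pieces up to max_chars,
    and recurse with smaller separators on pieces that are still too long.
    A separator that falls on a chunk boundary is dropped."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= max_chars:
                current = candidate
                continue
            if current:
                chunks.append(current)
            if len(piece) > max_chars:
                chunks.extend(chunk_recursive(piece, max_chars, separators[i + 1:]))
                current = ""
            else:
                current = piece
        if current:
            chunks.append(current)
        return chunks
    # No separator left: fall back to a hard cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

tokens = [f"t{i}" for i in range(10)]
fixed = chunk_fixed(tokens, size=4, overlap=1)
print(fixed)  # 3 chunks; t3 and t6 each appear in two chunks

text = "Intro paragraph.\n\nSecond paragraph is a bit longer than the limit."
recursive = chunk_recursive(text, max_chars=30)
print(recursive)
```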

Trade-offs:

Smaller chunks = precise retrieval, may lack context.
Larger chunks = more context, but more noise, and the embedding is diluted across several topics.

Add metadata: source URL, headings, timestamp, version. Use for filtering + citations.
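
A sketch of carrying metadata alongside each chunk and using it for both filtering and citations. The `Chunk` fields mirror the list above; the class, function names, URLs, and sample texts are hypothetical. In a real vector DB the filter would be expressed in that store's own query syntax.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Chunk:
    text: str
    source_url: str
    heading: str
    updated: date
    version: str

def filter_chunks(chunks: list[Chunk], *,
                  updated_since: Optional[date] = None,
                  version: Optional[str] = None) -> list[Chunk]:
    """Keep only chunks that satisfy every filter that was supplied."""
    kept = []
    for c in chunks:
        if updated_since is not None and c.updated < updated_since:
            continue
        if version is not None and c.version != version:
            continue
        kept.append(c)
    return kept

def citation(c: Chunk) -> str:
    return f"{c.heading} ({c.source_url}, version {c.version}, updated {c.updated.isoformat()})"

chunks = [
    Chunk("Refunds take 14 days.", "https://example.com/docs/refunds",
          "Refund policy", date(2024, 5, 1), "2.0"),
    Chunk("Refunds take 30 days.", "https://example.com/docs/refunds-old",
          "Refund policy", date(2022, 1, 10), "1.0"),
]

fresh = filter_chunks(chunks, updated_since=date(2024, 1, 1))
print([citation(c) for c in fresh])
```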

Retrieval

Pure vector: top-k by cosine similarity. Fast, semantic.
BM25: lexical / keyword match. Catches exact terms (model numbers, acronyms).
Hybrid: combine via Reciprocal Rank Fusion (RRF) or weighted score. Production default.
Filtering: by metadata (date range, source type) before similarity.
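Multi-query: rewrite the question into several variants, embed each, union the results.
HyDE: have an LLM draft a hypothetical answer, embed that instead of the raw question, and search with it.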
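
Reciprocal Rank Fusion needs only the rank positions from each retriever, so it sidesteps the problem that vector scores and BM25 scores are on different scales. A minimal sketch: each item scores the sum of 1 / (k + rank) over the lists it appears in. The chunk ids are made up, and k = 60 is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

vector_hits = ["c3", "c1", "c7"]  # hypothetical top-3 from vector search
bm25_hits = ["c1", "c9", "c3"]    # hypothetical top-3 from BM25

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks found by both retrievers (c1, c3) rise above chunks found by only one, which is the behaviour hybrid search is after.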

Reranking

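After retrieving 20-50 candidates, run a cross-encoder that takes each (query, chunk) pair and outputs a relevance score. Usually more accurate than bi-encoder cosine similarity, because the model reads query and chunk together, but slower (one transformer pass per pair), so it is applied only to the shortlist.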

Pattern: retrieve top-50 → rerank to top-5 → put in prompt.
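
The retrieve-then-rerank pattern can be sketched with a pluggable pair scorer. Assumption to note loudly: `toy_pair_score` is a word-overlap placeholder so the example runs without a model; it is not a cross-encoder. In practice you would pass a function that calls your reranker of choice with the same (query, chunk) -> float shape.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep the top_n chunks."""
    scored = [(score(query, chunk), chunk) for chunk in candidates]  # one call per pair
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def toy_pair_score(query: str, chunk: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0

candidates = [  # imagine these are the top-50 from hybrid retrieval
    "our office is closed on public holidays",
    "password rules: at least 12 characters",
    "reset your password from the account settings page",
]

top = rerank("how do I reset my password", candidates, toy_pair_score, top_n=2)
print(top)
```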

Tools: BGE reranker, Cohere Rerank, Voyage rerank-2, in-house cross-encoder.

Prompting

You are an assistant answering only from the provided context.
If the context doesn't contain the answer, say "I don't know".

CONTEXT:
[chunk 1]
[chunk 2]

QUESTION: {user question}

ANSWER:

Add citations: prefix each chunk with its [doc_id] and instruct the model to “cite sources by [doc_id]”. This makes answers traceable to the retrieved chunks.
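
A sketch of filling the template above and checking citations afterwards. The function names, doc ids, chunk texts, and the sample answer string are hypothetical; no LLM is called here. The check only confirms that every cited id was actually in the context, not that the claim is supported by it.

```python
import re

def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """chunks is a list of (doc_id, text) pairs from retrieval."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in chunks)
    return (
        "You are an assistant answering only from the provided context.\n"
        "If the context doesn't contain the answer, say \"I don't know\".\n"
        "Cite the source of each claim by its [doc_id].\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\n\n"
        "ANSWER:"
    )

def unknown_citations(answer: str, chunks: list[tuple[str, str]]) -> set[str]:
    """Return cited ids that were never in the context (a red flag)."""
    cited = set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))
    provided = {doc_id for doc_id, _ in chunks}
    return cited - provided

chunks = [
    ("doc-1", "Refunds are processed within 14 days."),
    ("doc-2", "Refunds are issued to the original payment method."),
]
prompt = build_prompt("How long do refunds take?", chunks)
print(prompt)

# Hypothetical model output containing one citation that was not provided.
answer = "Refunds take 14 days [doc-1]. Fees may apply [doc-9]."
print(unknown_citations(answer, chunks))  # {'doc-9'}
```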

LLM choice

Quality: GPT-4-class, Claude Opus, Gemini Pro.
Cost-effective: GPT-4o mini, Claude Haiku, Gemini Flash, Llama 3.
Self-hosted: Llama 3, Mistral, Qwen, DeepSeek via vLLM, llama.cpp, TGI.

For RAG, strong instruction-following and a long context window matter more than raw knowledge, since the facts come from the retrieved context.

Evaluation

RAGAS: faithfulness, answer relevancy, context precision, context recall.
DeepEval for test-style evaluation; Phoenix and Langfuse for tracing and evaluating production traffic.
Manual: golden questions + expected answers; track regressions per release.
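
A golden-set regression check for the retrieval step can be as small as the sketch below. Assumptions: the metrics are plain precision@k and recall@k over labelled chunk ids, a simplification and not the RAGAS definitions; the questions, ids, retrieval results, and the 0.7 threshold are made up for illustration.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Golden set: question -> ids of the chunks a correct answer needs.
golden = {
    "What is the refund window?": {"doc-1"},
    "What is the API rate limit?": {"doc-2", "doc-5"},
}

# Stand-in for calling the real retriever on each golden question.
retrieved_by_question = {
    "What is the refund window?": ["doc-1", "doc-4", "doc-3"],
    "What is the API rate limit?": ["doc-2", "doc-3", "doc-1"],
}

recalls = []
for question, relevant in golden.items():
    precision, recall = precision_recall_at_k(retrieved_by_question[question], relevant, k=3)
    recalls.append(recall)
    print(f"{question!r}: precision@3={precision:.2f} recall@3={recall:.2f}")

mean_recall = sum(recalls) / len(recalls)
print(f"mean recall@3 = {mean_recall:.2f}")  # 0.75

# Release gate: fail the build if retrieval quality drops below the threshold.
RECALL_THRESHOLD = 0.7  # hypothetical value; set it from your own baseline
gate_passed = mean_recall >= RECALL_THRESHOLD
```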

Frameworks

LangChain — popular, sprawling. Useful primitives, easy to lose control.
LlamaIndex — focused on retrieval/indexing.
Haystack — production-grade, modular.
DSPy — programmatic, optimizes prompts/pipelines.
DIY — often best long-term: thin wrappers around vector DB + LLM API. Less magic.

Common production concerns

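Hallucinations — the LLM ignores the context and invents an answer. Mitigations: explicit grounding instructions in the prompt, low temperature, a citation requirement, eval gates before release, and, for grounded tasks, a smaller model that follows instructions reliably.
Cost — embedding and LLM tokens add up. Mitigations: cache embeddings and frequent answers, batch embedding calls, use smaller models, and send fewer, better chunks to the LLM (hybrid retrieval plus reranking).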
Latency — embed query + vector search + rerank + LLM. p99 often dominated by LLM. Stream output.
Stale data — schedule re-embedding on doc updates.
Multi-tenant data isolation — namespace per tenant, RBAC on metadata filters.
Privacy — don’t send PII to a hosted LLM without a DPA (data processing agreement); consider self-hosting.
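
For the cost and stale-data items, one common approach is to store a content hash next to each vector and re-embed only what changed. A minimal sketch, assuming the vector store can return a doc_id -> hash mapping; the function names and sample documents are hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reembedding(stored_hashes: dict[str, str],
                     current_docs: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare stored hashes with current content.

    Returns (ids to embed or re-embed, ids whose vectors should be deleted).
    """
    to_embed = [
        doc_id for doc_id, text in current_docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in current_docs]
    return to_embed, to_delete

# Hashes recorded at the last indexing run.
stored = {
    "doc-1": content_hash("old text"),
    "doc-2": content_hash("same text"),
    "doc-3": content_hash("removed doc"),
}
# What the knowledge base contains now.
current = {
    "doc-1": "new text",      # changed: re-embed
    "doc-2": "same text",     # unchanged: skip, no embedding cost
    "doc-4": "brand new doc", # new: embed
}

to_embed, to_delete = plan_reembedding(stored, current)
print(to_embed)   # ['doc-1', 'doc-4']
print(to_delete)  # ['doc-3']
```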