RAG / LLM Integration — Basics
What RAG is
Retrieval-Augmented Generation: instead of relying only on an LLM’s training data, fetch relevant context from your own knowledge base, inject it into the prompt, then generate. It addresses the following without retraining the model:
- Stale knowledge (training cutoffs).
- Hallucinations on niche domains.
- Source attribution (“according to doc X”).
Pipeline:
docs → chunk → embed → store in vector DB
query → embed → similarity search → rerank → assemble prompt → LLM → answer
Components
- Document loader — parse PDFs/HTML/Markdown.
- Chunker — split into ~500-1500 token pieces with overlap.
- Embedder — model that turns text into vectors (OpenAI text-embedding-3-large, Cohere embed-v3, Voyage, BGE).
- Vector DB — Pinecone, Weaviate, Qdrant, Milvus, Chroma, pgvector, OpenSearch, Elasticsearch dense_vector.
- Retriever — given a query, fetch top-k similar chunks. Hybrid = vector + keyword (BM25).
- Reranker — cross-encoder re-scores top-k. BGE reranker, Cohere rerank, Voyage rerank.
- LLM — GPT-4 family, Claude, Gemini, Llama 3, Mistral.
- Prompt template — assemble retrieved context + user question + system instructions.
Embeddings
A vector that captures meaning. Similar sentences → close vectors (small cosine distance).
Dimensions: 384 (small), 768, 1024, 1536 (OpenAI text-embedding-3-small), 3072 (OpenAI text-embedding-3-large). Higher = more nuance, but more memory and slower search.
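To make "close vectors" and "more memory" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the size estimate assumes float32 storage with no index overhead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage (float32 by default), ignoring index overhead."""
    return num_vectors * dims * bytes_per_value


# Hypothetical toy "embeddings", not output from any real model.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
print(index_size_bytes(1_000_000, 1536))  # one million 1536-dim float32 vectors
```

Under these assumptions, one million 1536-dimensional vectors need about 6.1 GB of raw storage, and doubling the dimension to 3072 doubles that.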
Embedding model trade-offs:
- OpenAI text-embedding-ada-002 / text-embedding-3-small / text-embedding-3-large — easy, paid per token, hosted.
- Cohere embed-v3 — strong English + multilingual.
- BGE (BAAI) — open source, runs on your hardware.
- Voyage — high quality, focused on retrieval.
Chunking strategies
- Fixed-size — N tokens, often 500-1500 with 10-20% overlap.
- Semantic — split at sentence/paragraph boundaries; merge until size limit.
- Document structure — split by headings, sections.
- Recursive — try big delimiter, fall back to smaller.
- Embedding-based (sometimes LLM-assisted) — split where the embedding similarity between adjacent sentences drops.
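The fixed-size strategy from the list above can be sketched as a sliding window. This is an illustration, not a reference implementation: `chunk_fixed` is a hypothetical helper, and the token list is assumed to come from whatever tokenizer your embedding model uses (the demo just fabricates twelve placeholder tokens).

```python
def chunk_fixed(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Placeholder tokens; a real pipeline would tokenize document text here.
tokens = [f"t{i}" for i in range(12)]
chunks = chunk_fixed(tokens, size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each window starts `size - overlap` tokens after the previous one, so the tail of one chunk is repeated at the head of the next. The final chunk may be shorter than `size`.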
Trade-offs:
- Smaller chunks = precise retrieval, may lack context.
- Larger chunks = more context, but more noise; the embedding is diluted across several topics.
Add metadata: source URL, headings, timestamp, version. Use for filtering + citations.
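One way to carry that metadata is to store it next to each chunk's text. The sketch below is an assumption about shape only: the `Chunk` fields, `filter_chunks`, and `cite` are hypothetical names, and the URLs are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    text: str
    source_url: str
    heading: str
    updated: date
    version: str


def filter_chunks(chunks: list[Chunk], *, updated_since: Optional[date] = None,
                  version: Optional[str] = None) -> list[Chunk]:
    """Keep only chunks that satisfy every metadata constraint that was given."""
    kept = []
    for chunk in chunks:
        if updated_since is not None and chunk.updated < updated_since:
            continue
        if version is not None and chunk.version != version:
            continue
        kept.append(chunk)
    return kept


def cite(chunk: Chunk) -> str:
    """Human-readable citation built purely from metadata."""
    return f"{chunk.source_url} ({chunk.heading}, v{chunk.version}, {chunk.updated.isoformat()})"


# Placeholder data for illustration.
current = Chunk("Refunds within 30 days.", "https://example.com/policy", "Refunds", date(2024, 5, 1), "2")
old = Chunk("Refunds within 14 days.", "https://example.com/policy", "Refunds", date(2022, 1, 10), "1")

recent = filter_chunks([current, old], updated_since=date(2023, 1, 1))
print([cite(c) for c in recent])
```

In a real vector DB the same fields would usually live in the record's payload so the filter runs inside the database rather than in application code.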
Retrieval
- Pure vector: top-k by cosine similarity. Fast, semantic.
- BM25: lexical / keyword match. Catches exact terms (model numbers, acronyms).
- Hybrid: combine via Reciprocal Rank Fusion (RRF) or weighted score. Production default.
- Filtering: by metadata (date range, source type) before similarity.
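- Multi-query: rewrite the question into several variants, embed each, union the results. HyDE: have the LLM draft a hypothetical answer and embed that instead of the raw question.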
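For the hybrid option, Reciprocal Rank Fusion needs only the ranked ID lists, not the raw scores, which is why it works across vector and BM25 results whose scores are not comparable. A minimal sketch, assuming each retriever returns document IDs best-first; `k = 60` is a commonly used constant, and the IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores the sum of 1 / (k + rank) over the lists containing it."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a vector search and a BM25 search.
vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Documents that appear in both lists (`d1`, `d3`) rise above documents that only one retriever found, even when neither retriever ranked them first.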
Reranking
After retrieving 20-50 candidates, run a cross-encoder that takes a (query, chunk) pair → relevance score. Usually more accurate than bi-encoder cosine similarity, but slower (one transformer pass per pair).
Pattern: retrieve top-50 → rerank to top-5 → put in prompt.
Tools: BGE reranker, Cohere Rerank, Voyage rerank-2, in-house cross-encoder.
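The retrieve-then-rerank pattern can be sketched with the scorer passed in as a function. Note the assumption: `toy_overlap_score` is a word-overlap placeholder so the example runs without a model. It is not a cross-encoder; in practice you would swap in a call to one of the tools above.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_n: int = 5) -> list[tuple[str, float]]:
    """Score every (query, chunk) pair and keep the top_n highest-scoring chunks."""
    scored = [(chunk, score_pair(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Placeholder scorer: Jaccard overlap of lowercase words (stand-in for a real model)."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    union = q | c
    return len(q & c) / len(union) if union else 0.0


# Pretend these came back from the first-stage retriever.
candidates = [
    "Reset the router by holding the power button",
    "Billing cycles start on the first of the month",
    "To reset your password open account settings",
]

top = rerank("how to reset password", candidates, toy_overlap_score, top_n=2)
for chunk, score in top:
    print(f"{score:.3f}  {chunk}")
```

Because the scorer runs once per candidate, cost grows linearly with the candidate count, which is why the first stage is kept to a few dozen chunks.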
Prompting
You are an assistant answering only from the provided context.
If the context doesn't contain the answer, say "I don't know".

CONTEXT:
[chunk 1]
[chunk 2]

QUESTION: {user question}

ANSWER:

Add citations: “Cite source by [doc_id]”. Adds traceability.
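Assembling that template in code is mostly string formatting. The sketch below assumes retrieved chunks arrive as `(doc_id, text)` pairs, best first; `build_prompt` and the sample document ID are hypothetical.

```python
SYSTEM = (
    "You are an assistant answering only from the provided context.\n"
    "If the context doesn't contain the answer, say \"I don't know\".\n"
    "Cite source by [doc_id]."
)


def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Build the full prompt; each chunk is prefixed with its doc_id so the model can cite it."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in chunks)
    return f"{SYSTEM}\n\nCONTEXT:\n{context}\n\nQUESTION: {question}\n\nANSWER:"


prompt = build_prompt(
    "What is the refund window?",
    [("policy-7", "Refunds are accepted within 30 days.")],
)
print(prompt)
```

Prefixing every chunk with its ID gives the model something concrete to cite and lets you check afterwards that each cited ID was actually in the context.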
LLM choice
- Quality: GPT-4-class, Claude Opus, Gemini Pro.
- Cost-effective: GPT-4o mini, Claude Haiku, Gemini Flash, Llama 3.
- Self-hosted: Llama 3, Mistral, Qwen, DeepSeek via vLLM, llama.cpp, TGI.
For RAG, strong instruction-following and long-context handling matter more than raw knowledge.
Evaluation
- RAGAS: faithfulness, answer relevancy, context precision, context recall.
- DeepEval for test-style evals; Phoenix and Langfuse for production tracing.
- Manual: golden questions + expected answers; track regressions per release.
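A golden-question check can start very small. The sketch below measures only retrieval (mean recall@k against expected document IDs); it is a simplified stand-in, not the RAGAS context-recall metric, and `GoldenCase`, `fake_retrieve`, and the IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    question: str
    expected_doc_ids: frozenset  # documents a good retriever should return


def recall_at_k(retrieved: list[str], expected: frozenset, k: int) -> float:
    """Fraction of expected documents found in the first k retrieved IDs."""
    if not expected:
        return 1.0
    return len(set(retrieved[:k]) & expected) / len(expected)


def evaluate(cases: list[GoldenCase], retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """Mean recall@k over the golden set."""
    if not cases:
        raise ValueError("golden set is empty")
    return sum(recall_at_k(retrieve(c.question), c.expected_doc_ids, k) for c in cases) / len(cases)


# Stand-in retriever with canned results so the sketch runs offline.
canned = {"q1": ["a", "x", "b"], "q2": ["c"]}


def fake_retrieve(question: str) -> list[str]:
    return canned[question]


golden = [GoldenCase("q1", frozenset({"a", "b"})), GoldenCase("q2", frozenset({"c"}))]
score = evaluate(golden, fake_retrieve, k=2)
print(score)
```

Running this per release and failing the build when the score drops below the previous value is one simple way to track regressions.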
Frameworks
- LangChain — popular, sprawling. Useful primitives, easy to lose control.
- LlamaIndex — focused on retrieval/indexing.
- Haystack — production-grade, modular.
- DSPy — programmatic, optimizes prompts/pipelines.
- DIY — often best long-term: thin wrappers around vector DB + LLM API. Less magic.
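To show how thin the DIY option can be, here is a sketch in which the embedder, vector search, and LLM are injected as plain functions. All names are hypothetical, and the `fake_*` functions are stand-ins so the example runs without any external service; real versions would wrap your embedding API, vector DB client, and LLM API.

```python
from typing import Callable, Sequence

Embed = Callable[[str], Sequence[float]]
Search = Callable[[Sequence[float], int], list]  # (query vector, k) -> chunk texts
Generate = Callable[[str], str]                  # prompt -> answer


def answer(question: str, embed: Embed, search: Search, generate: Generate, k: int = 5) -> str:
    """query → embed → similarity search → assemble prompt → LLM → answer."""
    query_vector = embed(question)
    chunks = search(query_vector, k)
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    prompt = (
        "Answer only from the context.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {question}\n\nANSWER:"
    )
    return generate(prompt)


# Stand-ins for illustration only.
def fake_embed(text: str) -> Sequence[float]:
    return [float(len(text))]


def fake_search(vector: Sequence[float], k: int) -> list:
    return ["Refunds are accepted within 30 days.", "Shipping takes 3 days."][:k]


def fake_generate(prompt: str) -> str:
    return "30 days [1]" if "[1] Refunds" in prompt else "I don't know"


result = answer("What is the refund window?", fake_embed, fake_search, fake_generate, k=1)
print(result)
```

Keeping each stage behind a function boundary makes it easy to swap vendors or insert a reranker later without adopting a framework.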
Common production concerns
- Hallucinations — the LLM ignores the context and invents an answer. Mitigations: explicit prompts, low temperature, a citation requirement, eval gates, and a smaller but more reliable model for grounded tasks.
- Cost — embeddings + LLM tokens add up. Cache, batch, smaller models, hybrid retrieval.
- Latency — embed query + vector search + rerank + LLM. p99 often dominated by LLM. Stream output.
- Stale data — schedule re-embedding on doc updates.
- Multi-tenant data isolation — namespace per tenant, RBAC on metadata filters.
- Privacy — don’t send PII to a hosted LLM without a DPA (data processing agreement); consider self-hosting.
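As one illustration of the cost and stale-data points, an embedding cache keyed by a content hash avoids paying to re-embed unchanged text, while edited text gets a new key and is re-embedded automatically. This is a sketch under assumptions: `EmbeddingCache` is a hypothetical in-memory class, and `fake_embed` stands in for a real embedding call.

```python
import hashlib
from typing import Callable, Sequence


class EmbeddingCache:
    """In-memory cache: identical text is embedded once, changed text is re-embedded."""

    def __init__(self, embed: Callable[[str], Sequence[float]]) -> None:
        self._embed = embed
        self._store: dict = {}
        self.misses = 0  # number of real embedding calls made

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> Sequence[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]


def fake_embed(text: str) -> Sequence[float]:
    # Stand-in for a paid embedding API call.
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
cache.get("Refunds within 30 days.")
cache.get("Refunds within 30 days.")   # served from the cache
cache.get("Refunds within 14 days!")   # edited text, so it is embedded again
print(cache.misses)
```

A production version would persist the store (for example next to the vectors) and include the embedding model name in the key, since vectors from different models are not interchangeable.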