Your AI doesn't know about your company, your docs, or last Tuesday.
RAG fixes that. In 30 minutes, you'll build a system that lets AI answer questions about any document.
For absolute beginners. No AI PhD required.
AI models are trained on data from the past. They know about Shakespeare, but not your company's policies. They know about Python, but not your codebase. They know about cats, but not what happened in your meeting last Tuesday.
Ask an AI about something it hasn't seen before, and it will **hallucinate** — make stuff up confidently.
Ask it about Acme Corp's vacation policy, for example, and you'll get a detailed, plausible-sounding answer that is completely made up. The AI has never seen Acme Corp's actual policy.
The AI sounds confident, but it's lying. It doesn't know it's lying — it's just predicting what a reasonable policy might look like.
// What the AI knows:
✅ General knowledge (Wikipedia, books, web)
✅ Programming concepts (GitHub, Stack Overflow)
✅ Common facts and patterns
// What the AI doesn't know:
❌ Your company's internal documents
❌ Recent events (after training cutoff)
❌ Personal information
❌ Proprietary data
❌ Real-time information
LLMs don't know your data. You have to give it to them. That's where RAG comes in.
**RAG = Retrieval-Augmented Generation.** Instead of just asking the AI, you retrieve the pieces of your documents that are relevant to the question, add them to the prompt as context, and let the model generate an answer grounded in that context.
You can't paste a whole book into the AI prompt. Context windows are limited (usually 4K-128K tokens), and longer prompts cost more and run slower.
The solution: **break documents into small pieces called "chunks."** Each chunk contains one coherent idea or topic.
Chunks usually **overlap** a little. If a sentence falls near a chunk boundary, the overlap ensures both chunks keep enough surrounding context to make sense on their own.
// Too small (< 100 tokens):
"Our company's vacation policy"
// Problem: No context, meaningless fragment
// Too large (> 1000 tokens):
[Entire 5-page policy document]
// Problem: Too much noise, key info gets lost
// Just right (200-500 tokens):
"Our company's vacation policy is designed to promote
work-life balance. All full-time employees earn 15 days
of paid vacation annually, starting from their hire date.
Vacation days must be requested at least two weeks in
advance through our HR system..."
// Perfect: Complete thought with context
The right chunk size is the difference between finding the answer and missing it.
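To make this concrete, here's a minimal sketch of a sentence-based chunker with overlap. The function name, the character budget (roughly 300-400 tokens' worth), and the one-sentence overlap are illustrative choices, not a standard recipe:

// Minimal sentence-based chunker; sizes are illustrative and measured in characters
function chunkText(text, maxChars = 1500, overlapSentences = 1) {
  // Naive sentence split; real chunkers often split on headings/paragraphs first
  const sentences = text.match(/[^.!?]+[.!?]+(\s|$)/g) || [text];
  const chunks = [];
  let current = [];

  for (const sentence of sentences) {
    const candidate = [...current, sentence.trim()];
    if (candidate.join(' ').length > maxChars && current.length > 0) {
      chunks.push(current.join(' '));
      // Carry the last sentence(s) forward so ideas aren't cut off mid-thought
      current = current.slice(-overlapSentences);
    }
    current.push(sentence.trim());
  }
  if (current.length > 0) chunks.push(current.join(' '));
  return chunks;
}

Production systems usually split on document structure (headings, paragraphs) before falling back to sentences, but the overlap idea is the same.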
Evaluating Ideal Chunk Size for RAG — Research on optimal chunking strategies
Each chunk needs to become a number so you can search it. But not just any number — a special kind of number that captures **meaning**.
**Embeddings** are like GPS coordinates, but for meaning instead of location. Similar ideas get similar coordinates.
Similar meanings cluster together. "Dog" and "puppy" are close. "Dog" and "database" are far apart.
The magic: An AI model reads each chunk and converts it into a list of numbers (typically a few hundred to a few thousand of them) that represents its meaning.
// Sample embedding vectors (simplified to 3 dimensions)
const embeddings = {
"dog": [0.2, 0.8, 0.1],
"puppy": [0.3, 0.7, 0.2], // Close to "dog"
"cat": [0.1, 0.6, 0.3], // Pet-related, but different
"vacation": [-0.5, 0.2, 0.8], // Completely different topic
"policy": [-0.3, 0.1, 0.9] // Work-related
};
function similarity(vec1, vec2) {
// Cosine similarity: how "parallel" are two vectors?
let dotProduct = 0;
let mag1 = 0, mag2 = 0;
for (let i = 0; i < vec1.length; i++) {
dotProduct += vec1[i] * vec2[i];
mag1 += vec1[i] * vec1[i];
mag2 += vec2[i] * vec2[i];
}
return dotProduct / (Math.sqrt(mag1) * Math.sqrt(mag2));
}
// "dog" vs "puppy" = 0.97 (very similar!)
// "dog" vs "vacation" = -0.23 (very different)
In reality, embeddings have 1000+ dimensions (1536 for the OpenAI model used below). No single dimension means one thing on its own, but together they capture aspects of meaning: topic, tone, formality, tense, and so on.
// Using OpenAI's embedding API:
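// API_KEY is assumed to hold your OpenAI key (in Node you might load it from an environment variable):
const API_KEY = process.env.OPENAI_API_KEY;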
async function getEmbedding(text) {
const response = await fetch('https://api.openai.com/v1/embeddings', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${API_KEY}`
},
body: JSON.stringify({
model: 'text-embedding-3-small',
input: text
})
});
const data = await response.json();
return data.data[0].embedding; // Array of 1536 numbers
}
// Convert all chunks to embeddings
const chunkEmbeddings = [];
for (const chunk of chunks) {
const embedding = await getEmbedding(chunk.text);
chunkEmbeddings.push({
text: chunk.text,
embedding: embedding
});
}
Now every chunk is a point in 1536-dimensional space. Chunks with similar meanings are close together. Time to search!
3Blue1Brown: "How LLMs store facts" — Visual explanation of how embeddings capture knowledge
Now comes the magic moment: User asks a question → embed the question → find the closest chunks.
This is **semantic search** — searching by meaning, not just keywords.
Semantic search finds relevant chunks even when your question uses different words: **"carry over" matches "unused vacation days" because they have similar embeddings.**
async function semanticSearch(query, chunkEmbeddings, topK = 3) {
// 1. Embed the user's question
const queryEmbedding = await getEmbedding(query);
// 2. Calculate similarity to every chunk
const scores = chunkEmbeddings.map(chunk => ({
...chunk,
similarity: cosineSimilarity(queryEmbedding, chunk.embedding)
}));
// 3. Sort by similarity (highest first)
scores.sort((a, b) => b.similarity - a.similarity);
// 4. Return top-K most relevant chunks
return scores.slice(0, topK);
}
function cosineSimilarity(vec1, vec2) {
const dotProduct = vec1.reduce((sum, a, i) => sum + a * vec2[i], 0);
const mag1 = Math.sqrt(vec1.reduce((sum, a) => sum + a * a, 0));
const mag2 = Math.sqrt(vec2.reduce((sum, a) => sum + a * a, 0));
return dotProduct / (mag1 * mag2);
}
In practice you'll also want a similarity threshold (often around 0.7): chunks scoring below it probably aren't relevant, even if they make the top-K.
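If you want that cutoff in code, one simple option is a thin wrapper around the semanticSearch function above. The 0.7 default is illustrative; the right value depends on your embedding model and data:

// Drop weak matches below a minimum similarity (the default threshold is illustrative)
async function searchWithThreshold(query, chunkEmbeddings, topK = 3, minSimilarity = 0.7) {
  const results = await semanticSearch(query, chunkEmbeddings, topK);
  return results.filter(result => result.similarity >= minSimilarity);
}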
For real applications, you'd store embeddings in a **vector database** like Pinecone, Chroma, or Weaviate. They can search millions of vectors in milliseconds.
// Vector database pseudocode:
const db = new VectorDB();
// Store embeddings (do this once)
await db.upsert(chunkEmbeddings);
// Search (do this for every query)
const results = await db.query({
vector: queryEmbedding,
topK: 5,
filter: { source: "company-policies" } // Optional filtering
});
// Results come back ranked by similarity
You now have a semantic search engine! But RAG adds one more step — using those search results to answer questions.
The final step: Take the retrieved chunks, inject them into the prompt, ask the LLM.
This is where **Retrieval-Augmented Generation** gets its name — we retrieve relevant information, then use it to augment the AI's generation.
With RAG: The AI answers confidently with accurate information. Without RAG: it either hallucinates a policy or gives a generic, unhelpful response.
The AI sees the relevant context right in the prompt. It doesn't need to guess — the answer is right there!
"How many vacation days do I get?"
Convert question to vector: [0.1, -0.3, 0.8, ...]
Find top-3 chunks with highest similarity scores
System message + retrieved chunks + user question
LLM uses the context to provide accurate response
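Step 4 is easiest to see with a chat-style API. Here's a minimal sketch of that assembly; the system-message wording and the message layout are illustrative choices, not a required format:

// Minimal sketch of chat-style prompt assembly (wording and layout are illustrative)
function buildRagMessages(relevantChunks, question) {
  const context = relevantChunks.map(chunk => chunk.text).join('\n\n');
  return [
    {
      role: 'system',
      content: 'Answer using ONLY the provided context. If the context does not contain the answer, say so.'
    },
    {
      role: 'user',
      content: `Context:\n${context}\n\nQuestion: ${question}`
    }
  ];
}

The full pipeline below uses a single prompt string instead; either shape works, as long as the retrieved context rides along with the question.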
async function ragQuery(question, documents) {
// 1. Chunk the documents
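// (chunkDocuments is a placeholder; any chunker works here, e.g. the chunkText sketch above)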
const chunks = chunkDocuments(documents);
// 2. Create embeddings for all chunks
const chunkEmbeddings = [];
for (const chunk of chunks) {
const embedding = await getEmbedding(chunk);
chunkEmbeddings.push({ text: chunk, embedding });
}
// 3. Find the most relevant chunks
// (semanticSearch embeds the question internally, so we just pass the text)
const relevantChunks = await semanticSearch(question, chunkEmbeddings, 3);
// 4. Build the prompt
const context = relevantChunks.map(chunk => chunk.text).join('\n\n');
const prompt = `
Context: ${context}
Question: ${question}
Answer based only on the context above:`;
// 5. Get the answer from the LLM
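// (llm.complete is a placeholder; swap in whatever completion client or API you actually use)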
const response = await llm.complete(prompt);
return response;
}
// Usage:
const answer = await ragQuery(
"How many vacation days do I get?",
[vacationPolicyDoc, hrHandbookDoc]
);
You just built what every enterprise AI product does. Slack AI, Notion AI, Microsoft Copilot — they're all variations of this exact pattern.
You now understand RAG inside and out!
You learned how to chunk documents into focused pieces, turn those chunks into embeddings, search them by meaning, and hand the best matches to an LLM so it answers from your data instead of guessing.
Next steps: Try building your own RAG system with your company's docs. The hardest part isn't the code — it's getting clean, well-chunked data.
RAG Paper (Lewis et al.) — The original Retrieval-Augmented Generation research
LangChain RAG Tutorial — Build production RAG systems
Test yourself. No peeking. These questions cover everything you just learned.
1. What is the fundamental problem that RAG solves?
2. Why do you need to chunk documents for RAG instead of using the entire document?
3. What makes embeddings useful for semantic search?
4. In semantic search, how do you find the most relevant chunks for a user's query?
5. In the RAG prompt assembly, what components are typically included?