
Build Your Own RAG System — From Scratch

Your AI doesn't know about your company, your docs, or last Tuesday.

RAG fixes that. In 30 minutes, you'll build a system that lets AI answer questions about any document.

For absolute beginners. No AI PhD required.

Chapter 1: The Problem

AI models are trained on data from the past. They know about Shakespeare, but not your company's policies. They know about Python, but not your codebase. They know about cats, but not what happened in your meeting last Tuesday.

Ask an AI about something it hasn't seen before, and it will **hallucinate** — make stuff up confidently.

Try It: Watch It Hallucinate

Question: "What is the vacation policy at Acme Corp?"

AI: "Acme Corp offers 15 days of paid vacation per year, plus 10 sick days. Employees can carry over up to 5 unused vacation days to the next year. The company also provides 2 weeks of paternity/maternity leave."

This is completely made up! The AI has never seen Acme Corp's actual policy.

The AI sounds confident, but it's lying. It doesn't know it's lying — it's just predicting what a reasonable policy might look like.

The Core Problem

// What the AI knows:
✅ General knowledge (Wikipedia, books, web)
✅ Programming concepts (GitHub, Stack Overflow) 
✅ Common facts and patterns

// What the AI doesn't know:
❌ Your company's internal documents
❌ Recent events (after training cutoff)
❌ Personal information
❌ Proprietary data
❌ Real-time information

LLMs don't know your data. You have to give it to them. That's where RAG comes in.

**RAG = Retrieval-Augmented Generation.** Instead of just asking the AI, you:

  1. Find relevant documents
  2. Give them to the AI as context
  3. Ask your question
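
In code, the whole pattern boils down to three calls. Here's a preview sketch (search, buildPrompt, and llm.complete are placeholders you'll build out in the chapters below):

// The entire RAG pattern at a glance:
const relevantChunks = await search(question, documentIndex); // 1. find relevant documents
const prompt = buildPrompt(relevantChunks, question);         // 2. give them to the AI as context
const answer = await llm.complete(prompt);                    // 3. ask your question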

Chapter 2: Chunks

You can't paste a whole book into the AI prompt. Context windows are limited (usually 4K-128K tokens), and longer prompts cost more and run slower.
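
To get a rough sense of how big a prompt is, a common rule of thumb for English text is that one token is about 0.75 words. Here's a quick estimator sketch; for exact counts you'd use the model's actual tokenizer:

// Rough token estimate: ~0.75 English words per token (an approximation, not exact).
function estimateTokens(text) {
  const words = text.split(/\s+/).filter(Boolean).length;
  return Math.ceil(words / 0.75);
}

// estimateTokens("Our company's vacation policy is designed to promote work-life balance.")
// returns 14 (10 words / 0.75, rounded up)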

The solution: **break documents into small pieces called "chunks."** Each chunk contains one coherent idea or topic.

Try It: Live Document Chunker


Notice how overlap helps? If a sentence is split across chunks, the overlap ensures both chunks have enough context.
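
Here's what a minimal chunker with overlap might look like. This is a sketch that splits on words for simplicity; production systems usually split on tokens or sentence boundaries, and chunkText is a hypothetical helper, not a library function:

// Split text into overlapping chunks (sizes measured in words for simplicity).
function chunkText(text, chunkSize = 200, overlap = 20) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break; // last chunk already covers the end
  }
  return chunks;
}

// With chunkSize = 200 and overlap = 20 (10%), each chunk repeats the last ~20 words
// of the previous one, so a sentence split at a boundary appears in both chunks.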

Chunk Size Matters

// Too small (< 100 tokens):
"Our company's vacation policy"
// Problem: No context, meaningless fragment

// Too large (> 1000 tokens):  
[Entire 5-page policy document]
// Problem: Too much noise, key info gets lost

// Just right (200-500 tokens):
"Our company's vacation policy is designed to promote 
work-life balance. All full-time employees earn 15 days 
of paid vacation annually, starting from their hire date.
Vacation days must be requested at least two weeks in 
advance through our HR system..."
// Perfect: Complete thought with context

The right chunk size is the difference between finding the answer and missing it.

📚 Go Deeper

Evaluating Ideal Chunk Size for RAG — Research on optimal chunking strategies

Chapter 3: Embeddings

Each chunk needs to become a number so you can search it. But not just any number — a special kind of number that captures **meaning**.

**Embeddings** are like GPS coordinates, but for meaning instead of location. Similar ideas get similar coordinates.

🔗 Connection: You'll see embeddings again in Build Your Own Vision (where CLIP puts images and text in the same vector space) and Build Your Own Fine-Tune (where the model adjusts its internal embeddings during training). This is one of the most important concepts in all of AI.

Try It: Vector Space Explorer

Drag the points below to explore how embeddings cluster by meaning:

Similar meanings cluster together. "Dog" and "puppy" are close. "Dog" and "database" are far apart.

The magic: An AI model reads each chunk and converts it into a list of numbers (typically a few hundred to a few thousand dimensions) that represents its meaning.

How Embeddings Work

// Sample embedding vectors (simplified to 3 dimensions)
const embeddings = {
  "dog": [0.2, 0.8, 0.1],
  "puppy": [0.3, 0.7, 0.2],      // Close to "dog"
  "cat": [0.1, 0.6, 0.3],        // Pet-related, but different
  "vacation": [-0.5, 0.2, 0.8],  // Completely different topic
  "policy": [-0.3, 0.1, 0.9]     // Work-related
};

function similarity(vec1, vec2) {
  // Cosine similarity: how "parallel" are two vectors?
  let dotProduct = 0;
  let mag1 = 0, mag2 = 0;
  
  for (let i = 0; i < vec1.length; i++) {
    dotProduct += vec1[i] * vec2[i];
    mag1 += vec1[i] * vec1[i];
    mag2 += vec2[i] * vec2[i];
  }
  
  return dotProduct / (Math.sqrt(mag1) * Math.sqrt(mag2));
}

// "dog" vs "puppy" = 0.97 (very similar!)
// "dog" vs "vacation" = -0.23 (very different)

In reality, embeddings have 1,000+ dimensions. Together, those dimensions capture aspects of meaning (topic, tone, tense, formality, and so on), though no single dimension maps cleanly to one human-readable concept.

Creating Embeddings

// Using OpenAI's embedding API:
async function getEmbedding(text) {
  const response = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${API_KEY}`
    },
    body: JSON.stringify({
      model: 'text-embedding-3-small',
      input: text
    })
  });
  
  const data = await response.json();
  return data.data[0].embedding; // Array of 1536 numbers
}

// Convert all chunks to embeddings
const chunkEmbeddings = [];
for (const chunk of chunks) {
  const embedding = await getEmbedding(chunk.text);
  chunkEmbeddings.push({
    text: chunk.text,
    embedding: embedding
  });
}

Now every chunk is a point in 1536-dimensional space. Chunks with similar meanings are close together. Time to search!

📚 Go Deeper

3Blue1Brown: "How LLMs store facts" — Visual explanation of how embeddings capture knowledge

Chapter 4: Search

Now comes the magic moment: User asks a question → embed the question → find the closest chunks.

This is **semantic search** — searching by meaning, not just keywords.

Try It: Semantic Search

Notice how it finds relevant chunks even when your question uses different words? **"carry over" matches "unused vacation days" because they have similar embeddings.**

The Search Algorithm

async function semanticSearch(query, chunkEmbeddings, topK = 3) {
  // 1. Embed the user's question
  const queryEmbedding = await getEmbedding(query);
  
  // 2. Calculate similarity to every chunk
  const scores = chunkEmbeddings.map(chunk => ({
    ...chunk,
    similarity: cosineSimilarity(queryEmbedding, chunk.embedding)
  }));
  
  // 3. Sort by similarity (highest first)
  scores.sort((a, b) => b.similarity - a.similarity);
  
  // 4. Return top-K most relevant chunks
  return scores.slice(0, topK);
}

function cosineSimilarity(vec1, vec2) {
  const dotProduct = vec1.reduce((sum, a, i) => sum + a * vec2[i], 0);
  const mag1 = Math.sqrt(vec1.reduce((sum, a) => sum + a * a, 0));
  const mag2 = Math.sqrt(vec2.reduce((sum, a) => sum + a * a, 0));
  return dotProduct / (mag1 * mag2);
}

The search finds chunks with similarity scores above a threshold (usually 0.7+). Lower scores mean the chunk probably isn't relevant.
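
In code, that filtering is a one-liner on top of semanticSearch. The 0.7 cutoff here is illustrative; the right threshold depends on your embedding model and data:

// Keep only chunks above a similarity cutoff before building the prompt.
const results = await semanticSearch(query, chunkEmbeddings, 5);
const relevant = results.filter(r => r.similarity >= 0.7);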

Vector Databases

For real applications, you'd store embeddings in a **vector database** like Pinecone, Chroma, or Weaviate. They can search millions of vectors in milliseconds.

// Vector database pseudocode:
const db = new VectorDB();

// Store embeddings (do this once)
await db.upsert(chunkEmbeddings);

// Search (do this for every query)
const results = await db.query({
  vector: queryEmbedding,
  topK: 5,
  filter: { source: "company-policies" } // Optional filtering
});

// Results come back ranked by similarity

You now have a semantic search engine! But RAG adds one more step — using those search results to answer questions.

Chapter 5: Generate

The final step: Take the retrieved chunks, inject them into the prompt, ask the LLM.

This is where **Retrieval-Augmented Generation** gets its name — we retrieve relevant information, then use it to augment the AI's generation.

Try It: RAG vs No-RAG

Question: "How many vacation days do I get per year?"
AI (without RAG): "I don't have specific information about your company's vacation policy. Typically, companies offer between 10-25 days per year depending on seniority and location. You should check with your HR department for the exact policy."

With RAG: The AI answers confidently with accurate information. Without RAG: It gives a generic, unhelpful response.

Behind the Scenes: Prompt Assembly

The RAG Prompt (with retrieved chunks):

System: You are a helpful assistant. Answer the user's question using only the provided context. If the context doesn't contain the answer, say so.
Context: Our company's vacation policy is designed to promote work-life balance. All full-time employees earn 15 days of paid vacation annually, starting from their hire date. Vacation days must be requested at least two weeks in advance through our HR system. Managers will approve requests based on operational needs and team coverage.
User: How many vacation days do I get per year?

The AI sees the relevant context right in the prompt. It doesn't need to guess — the answer is right there!
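
Here's one way you might assemble that prompt in code. This sketch assumes an OpenAI-style chat messages format, and buildRagPrompt is a hypothetical helper rather than a library function:

// Assemble the system message, retrieved chunks, and user question into chat messages.
function buildRagPrompt(retrievedChunks, question) {
  const context = retrievedChunks.map(chunk => chunk.text).join('\n\n');
  return [
    {
      role: 'system',
      content: "You are a helpful assistant. Answer the user's question using only " +
               "the provided context. If the context doesn't contain the answer, say so."
    },
    { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` }
  ];
}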

The Complete RAG Pipeline

  1. User asks a question: "How many vacation days do I get?"
  2. Embed the question: convert it to a vector [0.1, -0.3, 0.8, ...]
  3. Search for relevant chunks: find the top-3 chunks with the highest similarity scores
  4. Build the prompt: system message + retrieved chunks + user question
  5. Generate the answer: the LLM uses the context to provide an accurate response

The Complete Code

async function ragQuery(question, documents) {
  // 1. Chunk the documents (chunkDocuments is a splitter like the chunker sketched in Chapter 2)
  const chunks = chunkDocuments(documents);
  
  // 2. Create embeddings for all chunks
  const chunkEmbeddings = [];
  for (const chunk of chunks) {
    const embedding = await getEmbedding(chunk);
    chunkEmbeddings.push({ text: chunk, embedding });
  }
  
  // 3. Find the most relevant chunks
  //    (semanticSearch embeds the question internally, then ranks chunks by cosine similarity)
  const relevantChunks = await semanticSearch(question, chunkEmbeddings, 3);
  
  // 4. Build the prompt
  const context = relevantChunks.map(chunk => chunk.text).join('\n\n');
  const prompt = `
    Context: ${context}
    
    Question: ${question}
    
    Answer based only on the context above:`;
  
  // 5. Get the answer from the LLM (llm.complete stands in for your chat/completion client)
  const response = await llm.complete(prompt);
  return response;
}

// Usage:
const answer = await ragQuery(
  "How many vacation days do I get?", 
  [vacationPolicyDoc, hrHandbookDoc]
);

You just built what every enterprise AI product does. Slack AI, Notion AI, Microsoft Copilot — they're all variations of this exact pattern.

🎉 Congratulations! 🎉

You now understand RAG inside and out!

You learned how to:

  • Identify the problem: AI models don't know your data
  • Chunk documents: Break text into searchable pieces
  • Create embeddings: Convert text to meaning vectors
  • Semantic search: Find relevant chunks by similarity
  • Generate answers: Use retrieved context to ground AI responses

Next steps: Try building your own RAG system with your company's docs. The hardest part isn't the code — it's getting clean, well-chunked data.

📚 Go Deeper

RAG Paper (Lewis et al.) — The original Retrieval-Augmented Generation research

LangChain RAG Tutorial — Build production RAG systems

🧠 Final Recall

Test yourself. No peeking. These questions cover everything you just learned.

1. What is the fundamental problem that RAG solves?





2. Why do you need to chunk documents for RAG instead of using the entire document?





3. What makes embeddings useful for semantic search?





4. In semantic search, how do you find the most relevant chunks for a user's query?





5. In the RAG prompt assembly, what components are typically included?





← Previous: Build Your Own LLM | Next: Fine-Tune →