Course 1 of 14

Build Your Own LLM — From Scratch

You use AI every day. But do you know how it actually works?

In 30 minutes, you'll build a language model from scratch — and finally understand the magic.

For absolute beginners. No math degree required.

Chapter 1: Letters to Numbers

Computers don't understand words. They only understand numbers.

Before any AI can think about language, it has to convert text into numbers. This process is called tokenization — breaking text into small pieces (tokens) and assigning each piece a unique number.

Try It: Live Tokenizer

Tokens (words/pieces):
Token IDs (numbers):

Notice how "understanding" becomes ["under", "stand", "ing"]? The AI learns that words with similar parts have similar meanings.

Try typing "misunderstanding" — see how it reuses the same pieces?
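Once the vocabulary exists, the splitting itself is easy to sketch: greedily take the longest vocabulary entry that matches the front of the text. The tiny vocab below is hypothetical, just enough to reproduce the example:

```javascript
// Toy greedy tokenizer with a hypothetical, hand-picked vocabulary.
const vocab = ["under", "stand", "ing", "mis",
               "u", "n", "d", "e", "r", "s", "t", "a", "i", "g", "m"];

function tokenize(text) {
  const tokens = [];
  while (text.length > 0) {
    // Longest matching vocab entry wins
    const match = vocab
      .filter(tok => text.startsWith(tok))
      .sort((a, b) => b.length - a.length)[0];
    if (!match) throw new Error("unknown character: " + text[0]);
    tokens.push(match);
    text = text.slice(match.length);
  }
  return tokens;
}

tokenize("understanding");    // → ["under", "stand", "ing"]
tokenize("misunderstanding"); // → ["mis", "under", "stand", "ing"]
```

Real tokenizers apply their learned merges in order rather than greedy longest-match, but the effect is the same: common fragments get reused across words.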

BPE: Building the Vocabulary

How does the AI know to split "understanding" that way? It uses **Byte Pair Encoding (BPE)** — it scans a large amount of text, finds the most common pair of adjacent tokens, merges that pair into a new token, and repeats.

// Simple BPE example:
// 1. Start with characters: u-n-d-e-r-s-t-a-n-d-i-n-g
// 2. Find most common pairs: "nd" appears twice
// 3. Merge: u-(nd)-e-r-s-t-a-(nd)-i-n-g  
// 4. Repeat until you have ~50,000 tokens

function buildVocab(texts) {
  // Start with single characters; each text becomes an array of tokens
  let vocab = [...'abcdefghijklmnopqrstuvwxyz '];
  let tokenized = texts.map(text => [...text]);

  for (let merges = 0; merges < 1000; merges++) {
    let pairCounts = {};

    // Count all adjacent token pairs
    tokenized.forEach(tokens => {
      for (let i = 0; i < tokens.length - 1; i++) {
        let pair = tokens[i] + tokens[i + 1];
        pairCounts[pair] = (pairCounts[pair] || 0) + 1;
      }
    });

    let pairs = Object.keys(pairCounts);
    if (pairs.length === 0) break; // nothing left to merge

    // Find the most common pair...
    let bestPair = pairs.reduce((a, b) => pairCounts[a] > pairCounts[b] ? a : b);

    // ...add it to the vocabulary, and merge it everywhere it occurs
    vocab.push(bestPair);
    tokenized = tokenized.map(tokens => {
      let merged = [];
      for (let i = 0; i < tokens.length; i++) {
        if (i < tokens.length - 1 && tokens[i] + tokens[i + 1] === bestPair) {
          merged.push(bestPair);
          i++; // skip the second token of the merged pair
        } else {
          merged.push(tokens[i]);
        }
      }
      return merged;
    });
  }

  return vocab;
}

Every word becomes a list of numbers. "Hello world" → [15496, 1917]. Now the AI can work with it.

📚 Go Deeper

Andrej Karpathy: "Tokenization Explained" — The clearest explanation of how tokenizers work

Chapter 2: The Prediction Game

Language modeling is just predicting the next word. Given some text, what word comes next?

Let's start simple: with a **frequency table**. If we see "I like" in our training data, what usually comes after?

Try It: Bigram Prediction

Text so far: "The cat sat on the"

What comes next?

You just did language modeling! You used your knowledge to predict the next word. A **bigram model** does the same thing with statistics.

Building a Bigram Table

| Previous Word | Next Word | Count | Probability |
|---------------|-----------|-------|-------------|
| the           | cat       | 5     | 0.25        |
| the           | dog       | 3     | 0.15        |
| the           | mat       | 8     | 0.40        |
| the           | floor     | 4     | 0.20        |

**The AI picks the most likely next word based on what it saw in training.** But there's a problem...

// Bigram problem: only looks at ONE previous word
"The cat sat on the ___"
// Only sees "the", ignores "cat sat on"
// Might predict "cat" again!

Bigrams are dumb. They only look at one word back. Neural networks can look at ALL previous words at once.
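The bigram table above is simple enough to build yourself. This sketch trains on a made-up three-sentence corpus and predicts by picking the highest-count follower:

```javascript
// Minimal bigram model: count word pairs, then predict the most
// frequent next word. The corpus here is illustrative.
function trainBigram(corpus) {
  const counts = {};
  corpus.forEach(sentence => {
    const words = sentence.toLowerCase().split(/\s+/);
    for (let i = 0; i < words.length - 1; i++) {
      const prev = words[i], next = words[i + 1];
      counts[prev] = counts[prev] || {};
      counts[prev][next] = (counts[prev][next] || 0) + 1;
    }
  });
  return counts;
}

function predictNext(counts, word) {
  const options = counts[word] || {};
  const candidates = Object.keys(options);
  if (candidates.length === 0) return null; // never seen this word
  // Pick the follower with the highest count
  return candidates.reduce((a, b) => options[a] > options[b] ? a : b);
}

const model = trainBigram([
  "the cat sat on the mat",
  "the dog sat on the mat",
  "the bird sat on the mat",
]);

predictNext(model, "the"); // → "mat" (most frequent follower of "the")
predictNext(model, "sat"); // → "on"
```

Notice the flaw in action: after any "the", this model says "mat" — it has no idea whether the sentence is about cats, dogs, or birds.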

Neural Network Magic

Instead of counting words, a **neural network learns weights** — numbers that capture patterns in language.

// Simplified neural network prediction
function predict(tokens, weights, vocabSize) {
  // One score for every possible next word
  let scores = new Array(vocabSize).fill(0);

  // Each token, at each position, contributes to every candidate's score
  for (let i = 0; i < tokens.length; i++) {
    let tokenId = tokens[i];
    let position = i;

    for (let next = 0; next < vocabSize; next++) {
      // Learned weight: how strongly this token at this position
      // predicts that candidate next word
      scores[next] += weights[tokenId][position][next];
    }
  }

  // Softmax converts the scores into probabilities over the vocabulary
  return softmax(scores);
}

// The network learns: what patterns predict what words?
// "The cat sat" → high score for "on"
// "I love" → high score for "you"

Training adjusts the weights until the predictions match real text: show the network a billion examples, and it absorbs a billion patterns.
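The adjustment rule can be sketched in one dimension. Real training uses backpropagation over millions of weights, but the core idea is just: nudge a weight up when the true word was under-predicted, down when it was over-predicted. Everything below is an illustrative toy, not a real optimizer:

```javascript
// One-dimensional sketch of a training step.
const sigmoid = x => 1 / (1 + Math.exp(-x));

function trainStep(weight, predictedProb, wasCorrect, learningRate) {
  // Target: 1 if the word actually occurred, 0 if it didn't
  const target = wasCorrect ? 1 : 0;
  // Move the weight in the direction that shrinks the error
  return weight + learningRate * (target - predictedProb);
}

let w = 0.0;
for (let step = 0; step < 100; step++) {
  // The word keeps occurring, so the weight keeps drifting upward
  w = trainStep(w, sigmoid(w), true, 0.5);
}
// After training, sigmoid(w) is well above 0.5: the model now
// predicts this word much more strongly than when it started.
```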

📚 Go Deeper

Karpathy: "Deep Dive into LLMs" (start at 23:04) — How neural networks learn to predict

Chapter 3: The Attention Trick

Even smart neural networks have a problem: they treat all words equally. But some words are more important than others.

"The cat that my neighbor owns sat on the mat" — to predict what comes after "sat", which words matter most? "cat" and "sat", not "neighbor" or "owns".

Attention lets every word look at every other word and decide what's important.

Try It: Attention Heatmap

Each cell shows how much one word "attends to" another. Darker = more attention.

Notice how "sat" pays attention to "cat" (the thing doing the sitting) and "on" pays attention to "mat" (the location)? The network learns these relationships automatically.

The Math (With Real Numbers)

**Attention uses three matrices: Query (Q), Key (K), and Value (V).** Think of it like a database lookup:

// Example with tiny numbers:
// Word: "cat"
let query = [0.1, 0.8];    // "What am I looking for?"
let key   = [0.2, 0.9];    // "What do I contain?"  
let value = [0.5, 0.3];    // "What information do I have?"

// Attention score = query · key (dot product)
let score = query[0]*key[0] + query[1]*key[1];
//        = 0.1*0.2 + 0.8*0.9
//        = 0.02 + 0.72
//        = 0.74

// High score = pay attention to this word

**The scaled dot-product attention equation:**

function attention(Q, K, V) {
  // 1. Calculate all attention scores
  let scores = matmul(Q, transpose(K));
  
  // 2. Scale by sqrt(dimension) to prevent explosion
  scores = scale(scores, 1 / Math.sqrt(K[0].length));
  
  // 3. Softmax: turn scores into probabilities
  let weights = softmax(scores);
  
  // 4. Mix the values according to attention weights
  let output = matmul(weights, V);
  
  return output;
}

// This happens in parallel for every word in the sentence

Every word gets to vote on what every other word should represent. This is the breakthrough that made modern AI possible.
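The helpers used in `attention` above (`matmul`, `transpose`, `scale`, `softmax`) aren't magic either. Here's a minimal sketch of each, small enough to run on a two-word, two-dimensional example (the Q, K, V numbers below are made up for illustration):

```javascript
function transpose(M) {
  return M[0].map((_, j) => M.map(row => row[j]));
}

function matmul(A, B) {
  return A.map(row =>
    B[0].map((_, j) => row.reduce((sum, a, k) => sum + a * B[k][j], 0))
  );
}

function scale(M, factor) {
  return M.map(row => row.map(x => x * factor));
}

function softmax(scores) {
  // Row-wise softmax: each row becomes a probability distribution
  return scores.map(row => {
    const max = Math.max(...row);             // subtract max for stability
    const exps = row.map(x => Math.exp(x - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / sum);
  });
}

function attention(Q, K, V) {
  let scores = matmul(Q, transpose(K));
  scores = scale(scores, 1 / Math.sqrt(K[0].length));
  const weights = softmax(scores);
  return matmul(weights, V);
}

// Two words, each represented by a 2-number query/key/value:
const Q = [[0.1, 0.8], [0.9, 0.2]];
const K = [[0.2, 0.9], [0.7, 0.1]];
const V = [[0.5, 0.3], [0.1, 0.6]];
attention(Q, K, V); // 2×2 output: each row is a weighted mix of V's rows
```

Because softmax weights sum to 1, every output row sits somewhere between the value vectors — each word's new representation is literally a blend of everyone's information.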

📚 Go Deeper

3Blue1Brown: "Attention in Transformers" — Beautiful visual explanation

"Attention Is All You Need" paper (§3.2) — The original attention paper

Chapter 4: Stack and Scale

One attention layer is weak. It can only learn simple patterns. But stack many layers together, and magic happens.

Each layer learns different things:

Try It: Layer by Layer

Input: "The capital of France is"
Output: "The capital of France is the"

Watch how the quality improves as you add layers. Real models like GPT-4 have 100+ layers!
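The stacking itself is trivially simple: each layer is a function, and the output of one becomes the input of the next. The "layers" below are stand-in number transformations, not real attention blocks — in a real transformer, each would be attention plus a feed-forward network:

```javascript
// Sketch: a model is just a stack of layers applied in sequence.
function runModel(layers, input) {
  let x = input;
  for (const layer of layers) {
    x = layer(x); // each layer refines the previous layer's output
  }
  return x;
}

// Toy stand-in layers operating on an array of numbers:
const layers = [
  x => x.map(v => v * 2),  // "layer 1"
  x => x.map(v => v + 1),  // "layer 2"
  x => x.map(v => v * v),  // "layer 3"
];

runModel(layers, [1, 2]); // → [9, 25]
```

The depth is where the power comes from: composing many simple transformations produces behavior no single layer could.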

More layers = more complex thinking. But there's a pattern to how much better they get...

The Scaling Laws

Here's the most important discovery in AI: quality improves predictably with scale.

// The power law of language models:
Loss = A × (Compute)^(-α)

Where:
- Loss = how wrong the model is (lower = better)
- Compute = training budget (GPUs × time)  
- α ≈ 0.05 (the scaling exponent)

Translation: 10x more compute → predictably better AI
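You can check that arithmetic directly. The constants here are illustrative, not fitted values from any paper — only the shape of the curve matters:

```javascript
// Loss = A × Compute^(-α), with illustrative constants
const A = 10, alpha = 0.05;
const loss = compute => A * Math.pow(compute, -alpha);

// 10x more compute shrinks loss by a factor of 10^0.05 ≈ 1.12
loss(1e21) / loss(1e22); // ≈ 1.12
```

A 12% improvement per 10x of compute sounds small, but it compounds: it keeps holding across many orders of magnitude, which is why the bet on scale has paid off so consistently.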

This is why everyone is building bigger models:

  • GPT-1 (2018): 117M parameters, $600 training cost
  • GPT-2 (2019): 1.5B parameters, $40K training cost
  • GPT-3 (2020): 175B parameters, $4.6M training cost
  • GPT-4 (2023): ~1.8T parameters, ~$100M training cost

The Chinchilla insight: Don't just make models bigger — train them on more data too. The optimal ratio is ~20 tokens per parameter.

GPT-4 is this exact architecture, just bigger. You now understand how it thinks.

📚 Go Deeper

Chinchilla Paper — How to scale compute and data optimally

Karpathy: "Let's reproduce GPT-2" — Building a transformer from scratch

Chapter 5: Make It Chat

A raw language model just predicts text. Feed it "The capital of France is" and it might continue with "located in the heart of Europe" — technically correct but not conversational.

How do you turn a text predictor into a helpful assistant? You teach it the format of conversation.

Try It: Base vs Chat Model

Your input: "What is Python?"
Base model output:
What is Python? Python is a high-level programming language that was first released in 1991. It was created by Guido van Rossum. Python is often used for web development, data analysis, artificial intelligence, and scientific computing. The language emphasizes code readability...

Same model, different behavior! The chat version learned conversational format during training.

The Training Process

Making a model conversational requires three steps:

// Step 1: Instruction Tuning
// Show the model conversation examples:
[
  {
    "messages": [
      {"role": "user", "content": "What is Python?"},
      {"role": "assistant", "content": "Python is a programming language..."}
    ]
  },
  // ... thousands more examples
]

// Step 2: Human Feedback (RLHF)  
// Humans rank model responses:
// Response A: "Python is a programming language..." ⭐⭐⭐⭐⭐
// Response B: "Python is a snake that..." ⭐⭐

// Step 3: System Prompts
// Hidden instructions that shape behavior:
"You are a helpful, harmless, and honest assistant. 
Answer questions clearly and concisely."

The model learns: "When I see a conversation format, I should be helpful, not just continue the text."
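Under the hood, "conversation format" is just text with special markers the model learns to respect. The markers below (`<|system|>`, `<|user|>`, `<|assistant|>`, `<|end|>`) are illustrative — every real model defines its own chat template:

```javascript
// Sketch: flatten a chat into one training string with made-up markers.
function formatChat(messages, systemPrompt) {
  let text = `<|system|>${systemPrompt}<|end|>`;
  for (const msg of messages) {
    text += `<|${msg.role}|>${msg.content}<|end|>`;
  }
  // End with the assistant marker: the model continues from here
  return text + "<|assistant|>";
}

formatChat(
  [{ role: "user", content: "What is Python?" }],
  "You are a helpful assistant."
);
```

During instruction tuning, the model sees thousands of strings in this shape, so "predict the next token" after `<|assistant|>` naturally becomes "write a helpful reply."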

You Built an LLM!

🎉 Congratulations! 🎉

You now understand every major component of modern language models:

  • Tokenization: Converting text to numbers
  • Prediction: Using patterns to guess next words
  • Attention: Letting words talk to each other
  • Scaling: More layers = more intelligence
  • Chat Training: Teaching conversation format

GPT, Claude, Gemini — they're all variations of what you just learned. Different training data, different fine-tuning, but the same core architecture.

The next time someone asks "How does ChatGPT work?" — you can actually explain it.

📚 Go Deeper

InstructGPT Paper — How to train models to follow instructions

"Training language models to follow instructions" — The RLHF process

🧠 Final Recall

Test yourself. No peeking. These questions cover everything you just learned.

1. A BPE tokenizer splits "understanding" into ["under", "stand", "ing"]. Why does it do this instead of character-by-character?





2. Why are bigram models insufficient for modern language modeling?





3. In the attention equation Attention(Q,K,V), what does the dot product Q·K represent?





4. According to scaling laws, if training compute increases 10x, how much does model loss typically improve?





5. What's the key difference between a base language model and a chat model?




