You use AI every day. But do you know how it actually works?
In 30 minutes, you'll build a language model from scratch — and finally understand the magic.
For absolute beginners. No math degree required.
Computers don't understand words. They only understand numbers.
Before any AI can think about language, it has to convert text into numbers. This process is called tokenization — breaking text into small pieces (tokens) and assigning each piece a unique number.
Notice how "understanding" becomes ["under", "stand", "ing"]? The AI learns that words with similar parts have similar meanings.
Try typing "misunderstanding" — see how it reuses the same pieces?
How does the AI know to split "understanding" that way? It uses **Byte Pair Encoding (BPE)**: it scans lots of text, finds the most common adjacent pair of characters, merges that pair into a new token, and repeats.
// Simple BPE example:
// 1. Start with characters: u-n-d-e-r-s-t-a-n-d-i-n-g
// 2. Find most common pairs: "nd" appears twice
// 3. Merge: u-(nd)-e-r-s-t-a-(nd)-i-n-g
// 4. Repeat until you have ~50,000 tokens
function buildVocab(texts) {
  // Start with single characters; each text becomes an array of tokens
  let vocab = [...'abcdefghijklmnopqrstuvwxyz '];
  let tokenized = texts.map(text => [...text]);
  for (let merges = 0; merges < 1000; merges++) {
    // Count all adjacent token pairs
    let pairCounts = {};
    tokenized.forEach(tokens => {
      for (let i = 0; i < tokens.length - 1; i++) {
        let pair = tokens[i] + tokens[i + 1];
        pairCounts[pair] = (pairCounts[pair] || 0) + 1;
      }
    });
    let pairs = Object.keys(pairCounts);
    if (pairs.length === 0) break; // nothing left to merge
    // Find the most common pair and add it to the vocabulary
    let bestPair = pairs.reduce((a, b) => pairCounts[a] > pairCounts[b] ? a : b);
    vocab.push(bestPair);
    // Merge every occurrence of the best pair into a single token
    tokenized = tokenized.map(tokens => {
      let merged = [];
      for (let i = 0; i < tokens.length; i++) {
        if (i < tokens.length - 1 && tokens[i] + tokens[i + 1] === bestPair) {
          merged.push(bestPair);
          i++; // skip the second half of the merged pair
        } else {
          merged.push(tokens[i]);
        }
      }
      return merged;
    });
  }
  return vocab;
}
Every word becomes a list of numbers. "Hello world" → [15496, 1917]. Now the AI can work with it.
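To see how text becomes those numbers, here is a minimal sketch of an encoder built on the vocabulary above. It uses greedy longest-match for simplicity (real BPE tokenizers apply the learned merges in order and handle capitalization, punctuation, and unknown characters), but the core idea is the same: pieces of text map to ids in the vocabulary.

// A minimal sketch: at each position, take the longest piece that exists
// in the vocabulary, then record that piece's index as the token id
function encode(text, vocab) {
  let ids = [];
  let i = 0;
  while (i < text.length) {
    let piece = text[i]; // fall back to a single character
    for (let len = text.length - i; len > 1; len--) {
      let candidate = text.slice(i, i + len);
      if (vocab.includes(candidate)) { piece = candidate; break; }
    }
    ids.push(vocab.indexOf(piece));
    i += piece.length;
  }
  return ids;
}
// encode("understanding", vocab) → one id per learned piece, e.g. under / stand / ing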
Andrej Karpathy: "Tokenization Explained" — The clearest explanation of how tokenizers work
Language modeling is just predicting the next word. Given some text, what word comes next?
Let's start simple: with a **frequency table**. If we see "I like" in our training data, what usually comes after?
Text so far: "The cat sat on the"
What comes next?
You just did language modeling! You used your knowledge to predict the next word. A **bigram model** does the same thing with statistics.
| Previous Word | Next Word | Count | Probability |
|---|---|---|---|
| the | cat | 5 | 0.25 |
| the | dog | 3 | 0.15 |
| the | mat | 8 | 0.40 |
| the | floor | 4 | 0.20 |
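A table like this is built just by counting. Here is a minimal sketch: count every adjacent word pair in the training text, then pick the most common follower.

// Count how often each word follows each other word
function buildBigramTable(trainingText) {
  let words = trainingText.toLowerCase().split(/\s+/);
  let counts = {}; // counts["the"]["mat"] = how often "mat" follows "the"
  for (let i = 0; i < words.length - 1; i++) {
    let prev = words[i];
    let next = words[i + 1];
    counts[prev] = counts[prev] || {};
    counts[prev][next] = (counts[prev][next] || 0) + 1;
  }
  return counts;
}

// Predict by picking the follower with the highest count
function mostLikelyNext(counts, prevWord) {
  let options = counts[prevWord] || {};
  let followers = Object.keys(options);
  if (followers.length === 0) return null;
  return followers.reduce((a, b) => options[a] > options[b] ? a : b);
}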
**The AI picks the most likely next word based on what it saw in training.** But there's a problem...
// Bigram problem: only looks at ONE previous word
"The cat sat on the ___"
// Only sees "the", ignores "cat sat on"
// Might predict "cat" again!
Bigrams are dumb. They only look at one word back. Neural networks can look at ALL previous words at once.
Instead of counting words, a **neural network learns weights** — numbers that capture patterns in language.
// Simplified neural network prediction
function predict(tokens, weights, vocabSize) {
  // One score for every possible next token in the vocabulary
  let scores = new Array(vocabSize).fill(0);
  // Each input token contributes to the prediction
  for (let i = 0; i < tokens.length; i++) {
    let tokenId = tokens[i];
    let position = i;
    // weights[tokenId][position] is a learned vector: how strongly this
    // token, at this position, votes for each possible next token
    for (let next = 0; next < vocabSize; next++) {
      scores[next] += weights[tokenId][position][next];
    }
  }
  // Softmax: exponentiate and normalize so the scores become probabilities
  let exps = scores.map(s => Math.exp(s));
  let total = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / total);
}
// The network learns: what patterns predict what words?
// "The cat sat" → high score for "on"
// "I love" → high score for "you"
Training adjusts the weights until predictions match real text. See 1 billion examples, learn 1 billion patterns.
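To make "adjusts the weights" concrete, here is a conceptual sketch of a single training step for the simplified predict() model above. Real systems compute these updates with automatic differentiation across billions of examples, but the direction of the nudge is the same idea.

// One conceptual training step: nudge every weight so the correct next
// token gets a little more probability next time
function trainStep(tokens, correctNext, weights, vocabSize, learningRate) {
  let probs = predict(tokens, weights, vocabSize);
  for (let i = 0; i < tokens.length; i++) {
    for (let next = 0; next < vocabSize; next++) {
      let target = next === correctNext ? 1 : 0;
      // Push the weight down where the model was overconfident,
      // up where it under-predicted the correct token
      weights[tokens[i]][i][next] -= learningRate * (probs[next] - target);
    }
  }
}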
Karpathy: "Deep Dive into LLMs" (start at 23:04) — How neural networks learn to predict
Even smart neural networks have a problem: they treat all words equally. But some words are more important than others.
"The cat that my neighbor owns sat on the mat" — to predict what comes after "sat", which words matter most? "cat" and "sat", not "neighbor" or "owns".
Attention lets every word look at every other word and decide what's important.
Each cell shows how much one word "attends to" another. Darker = more attention.
Notice how "sat" pays attention to "cat" (the thing doing the sitting) and "on" pays attention to "mat" (the location)? The network learns these relationships automatically.
**Attention uses three matrices: Query (Q), Key (K), and Value (V).** Think of it like a database lookup:
// Example with tiny numbers:
// Word: "cat"
let query = [0.1, 0.8]; // "What am I looking for?"
let key = [0.2, 0.9]; // "What do I contain?"
let value = [0.5, 0.3]; // "What information do I have?"
// Attention score = query · key (dot product)
let score = query[0]*key[0] + query[1]*key[1];
//        = 0.1*0.2 + 0.8*0.9
//        = 0.02 + 0.72
//        = 0.74
// High score = pay attention to this word
**The scaled dot-product attention equation** is Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V, where dₖ is the length of each key vector. In code:
function attention(Q, K, V) {
  // 1. Calculate all attention scores (every word against every other word)
  let scores = matmul(Q, transpose(K));
  // 2. Scale by sqrt of the key dimension so the scores don't explode
  scores = scale(scores, 1 / Math.sqrt(K[0].length));
  // 3. Softmax each row: turn scores into probabilities
  let weights = softmax(scores);
  // 4. Mix the values according to the attention weights
  let output = matmul(weights, V);
  return output;
}
// This happens in parallel for every word in the sentence
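For completeness, here is a minimal sketch of the helper functions the snippet above assumes (matmul, transpose, scale, and a row-wise softmax). Real libraries provide heavily optimized versions of all of these.

// Minimal helpers assumed by attention() above (illustrative, not optimized)
function matmul(A, B) {
  // Multiply an n-by-k matrix A with a k-by-m matrix B
  return A.map(row =>
    B[0].map((_, j) => row.reduce((sum, a, k) => sum + a * B[k][j], 0))
  );
}

function transpose(M) {
  return M[0].map((_, j) => M.map(row => row[j]));
}

function scale(M, factor) {
  return M.map(row => row.map(x => x * factor));
}

function softmax(M) {
  // Applied to each row: exponentiate and normalize so the row sums to 1
  return M.map(row => {
    let max = Math.max(...row); // subtract the max for numerical stability
    let exps = row.map(s => Math.exp(s - max));
    let total = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / total);
  });
}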
Every word gets to vote on what every other word should represent. This is the breakthrough that made modern AI possible.
3Blue1Brown: "Attention in Transformers" — Beautiful visual explanation
"Attention Is All You Need" paper (§3.2) — The original attention paper
One attention layer is weak. It can only learn simple patterns. But stack many layers together, and magic happens.
Each layer learns different things: early layers tend to pick up simple, local patterns, while deeper layers capture longer-range structure and meaning.
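To make "stacking" concrete, here is a minimal sketch of how layers compose, reusing the attention() and matmul() functions from earlier. The embed and feedForward helpers are assumed here, and real transformer blocks also add residual connections and normalization, which this sketch leaves out.

// A minimal sketch of a stack of layers (residual connections and
// normalization, which real transformers use, are omitted)
function runModel(tokens, layers) {
  let x = embed(tokens); // assumed helper: token ids → one vector per token
  for (let layer of layers) {
    // Attention lets the words exchange information...
    x = attention(matmul(x, layer.Wq), matmul(x, layer.Wk), matmul(x, layer.Wv));
    // ...then a small feed-forward network processes each word on its own
    x = feedForward(x, layer.ffWeights); // assumed helper
  }
  return x; // the final vectors are used to score the next token
}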
Watch how the quality improves as you add layers. Real models like GPT-4 have 100+ layers!
More layers = more complex thinking. But there's a pattern to how much better they get...
Here's the most important discovery in AI: quality improves predictably with scale.
// The power law of language models:
Loss = A × (Compute)^(-α)
Where:
- Loss = how wrong the model is (lower = better)
- Compute = training budget (GPUs × time)
- α ≈ 0.05 (the scaling exponent)
Translation: 10x more compute → predictably better AI
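As a rough worked example using the illustrative exponent above:

// With α ≈ 0.05, multiplying compute by 10 multiplies loss by 10^(-0.05)
let alpha = 0.05;
let lossRatio = Math.pow(10, -alpha);
console.log(lossRatio.toFixed(3)); // ≈ 0.891 → loss drops by roughly 11%

Each 10x of compute shaves off roughly the same fraction of the loss, which is why the curve is so predictable.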
This is why everyone is building bigger models.
The Chinchilla insight: Don't just make models bigger — train them on more data too. The optimal ratio is ~20 tokens per parameter.
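As a quick illustration of that rule of thumb:

// ~20 training tokens per parameter (the Chinchilla rule of thumb)
let parameters = 70e9;               // a 70-billion-parameter model
let optimalTokens = 20 * parameters; // ≈ 1.4 trillion training tokens
// (Chinchilla itself: 70B parameters trained on ~1.4T tokens)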
GPT-4 is this exact architecture, just bigger. You now understand how it thinks.
Chinchilla Paper — How to scale compute and data optimally
Karpathy: "Let's reproduce GPT-2" — Building a transformer from scratch
A raw language model just predicts text. Feed it "The capital of France is" and it might continue with "located in the heart of Europe" — technically correct but not conversational.
How do you turn a text predictor into a helpful assistant? You teach it the format of conversation.
Same model, different behavior! The chat version learned conversational format during training.
Making a model conversational requires three steps:
// Step 1: Instruction Tuning
// Show the model conversation examples:
[
  {
    "messages": [
      {"role": "user", "content": "What is Python?"},
      {"role": "assistant", "content": "Python is a programming language..."}
    ]
  },
  // ... thousands more examples
]
// Step 2: Human Feedback (RLHF)
// Humans rank model responses:
// Response A: "Python is a programming language..." ⭐⭐⭐⭐⭐
// Response B: "Python is a snake that..." ⭐⭐
// Step 3: System Prompts
// Hidden instructions that shape behavior:
"You are a helpful, harmless, and honest assistant.
Answer questions clearly and concisely."
The model learns: "When I see a conversation format, I should be helpful, not just continue the text."
🎉 Congratulations! 🎉
You now understand every major component of modern language models: tokenization, next-word prediction, attention, stacked transformer layers, scaling laws, and chat fine-tuning.
GPT, Claude, Gemini — they're all variations of what you just learned. Different training data, different fine-tuning, but the same core architecture.
The next time someone asks "How does ChatGPT work?" — you can actually explain it.
InstructGPT Paper — How to train models to follow instructions with RLHF ("Training language models to follow instructions with human feedback")
Test yourself. No peeking. These questions cover everything you just learned.
1. A BPE tokenizer splits "understanding" into ["under", "stand", "ing"]. Why does it do this instead of character-by-character?
2. Why are bigram models insufficient for modern language modeling?
3. In the attention equation Attention(Q,K,V), what does the dot product Q·K represent?
4. According to scaling laws, if training compute increases 10x, how much does model loss typically improve?
5. What's the key difference between a base language model and a chat model?