🍸 Founder's Briefing

10 AI concepts you need to explain at a cocktail party.

🎯 Next-Token Prediction
30-second version

Large language models work by predicting the next word, over and over. That's the entire trick: there's no reasoning engine, no knowledge database. Just "given everything so far, what word comes next?" Repeat a billion times and you get ChatGPT.

⚠️ Common misconception: "The model understands what it's saying." It doesn't β€” it's doing statistical pattern matching at enormous scale. The appearance of understanding is emergent, not designed.
Go deeper
The training objective is called "causal language modeling": predict token t+1 given tokens 1…t. The loss function is cross-entropy over a vocabulary of ~100K tokens. What's remarkable is that optimizing this simple objective on enough data forces the model to learn grammar, facts, reasoning patterns, and even some world models. This is why researchers call next-token prediction "the master algorithm" of modern AI.
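To make the loop concrete, here's a minimal sketch in Python. The "model" is a toy lookup table standing in for a real neural network (every name and probability below is illustrative, not a real API), but the predict-one-token-then-repeat loop is the same shape real systems use.

```python
# Toy next-word "model": a lookup table standing in for the neural network.
BIGRAMS = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on":  {"the": 1.0},
}

def next_token_probs(tokens):
    # A real LLM conditions on the entire context; this toy version only
    # looks at the last word, but the surrounding loop is the same idea.
    return BIGRAMS.get(tokens[-1], {".": 1.0})

def generate(prompt, max_new_tokens=6):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)           # "given everything so far..."
        tokens.append(max(probs, key=probs.get))   # "...what word comes next?" (greedy pick)
    return " ".join(tokens)

print(generate("the"))   # -> "the cat sat on the cat sat"
```

Real models sample from the probabilities instead of always taking the single most likely word; that randomness is what the "temperature" setting controls.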
🏗️ Transformers
30-second version

The Transformer is the architecture behind every modern LLM, introduced in 2017's "Attention Is All You Need" paper. Its key innovation is self-attention: every word can look at every other word in the input simultaneously. That parallelism lets training spread across huge GPU clusters, making Transformers far easier to train at scale than the sequential architectures (like RNNs) that came before.

⚠️ Common misconception: "Transformers were built for chatbots." They were actually designed for machine translation. Chat, code generation, and image synthesis came later as people realized the architecture generalizes to almost anything.
Go deeper
Self-attention computes a weighted sum of all positions, where the weights are learned relevance scores (Q·K^T / √d). Multi-head attention runs this in parallel across different "subspaces," letting the model attend to syntax in one head and semantics in another. The architecture also includes residual connections, layer normalization, and feed-forward networks, but attention is the breakthrough that made scaling possible. GPT-style models use decoder-only transformers with causal masking.
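Here's the core computation as a small NumPy sketch: a single attention head with made-up dimensions (4 tokens, 8-dimensional vectors) and random weights. It's a schematic of the Q·K^T / √d formula above, not a production implementation; there's no multi-head split, no causal mask, and the weights aren't learned.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv        # project every token into query/key/value
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)           # Q.K^T / sqrt(d): each token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: scores become attention weights
    return weights @ V                      # weighted sum over all positions

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                 # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8): one updated vector per token
```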
⚑Training vs Inference
30-second version

Training is when the model learns: it costs millions of dollars, runs on thousands of GPUs for months, and happens once per model version. Inference is when the model answers your question: it's relatively cheap and happens billions of times. Think of training as writing a textbook vs. inference as reading from it.

⚠️ Common misconception: "The model learns from my conversations." Standard inference doesn't update the model's weights. Your messages are forgotten the moment the session ends (unless the provider explicitly stores them).
Go deeper
Training involves forward passes to compute loss, then backpropagation to update billions of parameters via gradient descent. GPT-4-scale training runs are estimated to cost $50–100M+ in compute alone. Inference is just the forward pass: no gradients, no weight updates. The economics of AI companies hinge on this split: amortize huge training costs across billions of cheap inference calls. "Inference-time compute" (like chain-of-thought) is a growing middle ground, spending more at inference time to get better answers.
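The split is easiest to see in code. Below is a schematic PyTorch sketch where a single linear layer stands in for the LLM; the shapes and data are placeholders, but the contrast (backpropagation plus a weight update vs. a gradient-free forward pass) is the real distinction.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 100)                  # stand-in "LLM": 16-dim context in, 100-token vocab out
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 16)                      # a toy batch of 8 contexts
y = torch.randint(0, 100, (8,))             # the "next tokens" we want predicted

# Training step: forward pass, loss, backpropagation, weight update.
# Real models repeat this on thousands of GPUs for months.
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                             # compute gradients for every parameter
optimizer.step()                            # nudge the weights

# Inference: forward pass only. No gradients, no weight updates, comparatively cheap.
with torch.no_grad():
    predictions = model(x).argmax(dim=-1)   # pick the most likely "next token"
```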
📦 Tokens & Context Windows
30-second version

Models don't see words; they see "tokens," which are word pieces. "Unbelievable" might be three tokens: "un," "believ," "able." The context window is the maximum number of tokens the model can process at once, typically 4K to 200K tokens. It's the model's working memory.

⚠️ Common misconception: "Bigger context window = the model remembers more." Context windows aren't memory β€” they're more like a desk. You can pile more papers on a bigger desk, but the model may still struggle to find the relevant needle in a haystack of text.
Go deeper
Tokenization uses algorithms like BPE (Byte-Pair Encoding) that merge frequent character pairs into single tokens. This is why models are worse at character-level tasks: they literally can't "see" individual letters. Attention cost scales quadratically with context length (O(n²)), though techniques like FlashAttention and sparse attention reduce this. A rough rule: 1 token ≈ ¾ of a word in English. Pricing is per-token, so understanding tokenization directly impacts your API costs.
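If you just need order-of-magnitude numbers, the ¾-of-a-word rule is enough to sanity-check a bill before reaching for a real tokenizer. A rough sketch follows; the per-token price in it is a placeholder, not any provider's actual rate.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic for English prose: 1 token is about 0.75 words.
    # Real tokenizers (BPE) will differ, especially for code or non-English text.
    return round(len(text.split()) / 0.75)

def estimate_cost(text: str, price_per_1k_tokens: float = 0.01) -> float:
    # price_per_1k_tokens is a placeholder; check your provider's pricing page.
    return estimate_tokens(text) / 1000 * price_per_1k_tokens

doc = "word " * 3000                        # a ~3,000-word document
print(estimate_tokens(doc))                 # ~4,000 tokens
print(f"${estimate_cost(doc):.3f}")         # input cost at the placeholder rate
```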
🔧 Fine-tuning vs RAG
30-second version

Fine-tuning retrains the model on your specific data: it changes the model's weights to bake in new knowledge or behavior. RAG (Retrieval-Augmented Generation) leaves the model alone and just fetches relevant documents at query time, stuffing them into the prompt. For most business use cases, RAG is cheaper, faster to set up, and easier to keep current.

⚠️ Common misconception: "We need to fine-tune a model on our data." Usually you don't. RAG handles 90% of "use our company's knowledge" use cases. Fine-tuning is for changing how the model behaves, not what it knows.
Go deeper
Fine-tuning updates model weights via continued training on a curated dataset. LoRA (Low-Rank Adaptation) makes this cheaper by only training small adapter matrices. RAG works by embedding your documents into vectors, storing them in a vector database, retrieving the top-k relevant chunks at query time, and prepending them to the prompt. The trade-off: fine-tuning is better for style, tone, and specialized reasoning; RAG is better for factual knowledge that changes. Many production systems use both.
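Here's the skeleton of that retrieval step in NumPy. The embed function is a crude stand-in for a real embedding model and the documents are toy strings, so treat this as a shape-of-the-pipeline sketch rather than something to deploy.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model: hash characters into a unit vector.
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

DOCS = [
    "Our refund policy allows returns within 30 days.",
    "The Q3 revenue target is $2M.",
    "Support hours are 9am to 5pm, Monday to Friday.",
]
DOC_VECTORS = np.stack([embed(d) for d in DOCS])       # index the documents once, up front

def build_rag_prompt(question: str, top_k: int = 1) -> str:
    scores = DOC_VECTORS @ embed(question)             # cosine similarity (vectors are unit length)
    top = np.argsort(scores)[::-1][:top_k]             # retrieve the top-k chunks
    context = "\n".join(DOCS[i] for i in top)
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

# Prints the prompt the LLM would actually see, with whichever chunk
# the toy embedding ranks highest prepended as context.
print(build_rag_prompt("What is the refund window?"))
```

In production the hard parts are chunking documents sensibly, picking a good embedding model, and keeping the index fresh; the retrieval loop itself stays about this simple.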
🌀 Hallucination
30-second version

Models don't retrieve facts; they generate plausible-sounding text. Sometimes plausible ≠ true. A model will confidently cite a paper that doesn't exist or invent a legal precedent. This isn't a bug to be fixed; it's a fundamental property of how generation works.

⚠️ Common misconception: "Hallucination will be solved in the next model version." It's inherent to probabilistic text generation. You can reduce it (RAG, grounding, citations) but never fully eliminate it. Any production system must design for it.
Go deeper
Hallucination stems from the training objective: the model is optimized to produce likely text, not true text. During training, the model learns statistical co-occurrences; "The capital of France is Paris" gets high probability because it appeared often. But the same mechanism happily produces "The capital of Australia is Sydney," because "Australia" and "Sydney" also co-occur constantly, even though the real capital is Canberra. Mitigation strategies include retrieval grounding (RAG), chain-of-thought verification, confidence calibration, and output validation against trusted sources.
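As one tiny illustration of output validation, the sketch below flags quoted spans in a model's answer that never appear in the source document it was given. It's deliberately naive (real systems lean on entailment checks, citation verification, and human review), but it shows the principle: trust the source, not the model's confidence.

```python
import re

def unsupported_quotes(answer: str, source: str) -> list[str]:
    # Find spans the model presented as quotes, and keep the ones
    # that do not appear anywhere in the trusted source text.
    quotes = re.findall(r'"([^"]+)"', answer)
    return [q for q in quotes if q not in source]

source = 'Policy: "Returns are accepted within 30 days of purchase."'
answer = 'Per policy, "Returns are accepted within 30 days of purchase" and "refunds are instant".'
print(unsupported_quotes(answer, source))   # ['refunds are instant']
```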
🤖 Agents
30-second version

An agent is an LLM in a loop: it thinks about what to do, takes an action (search the web, run code, call an API), observes the result, and repeats. This is the ReAct pattern: reasoning + acting. It turns a chatbot into something that can actually do things in the world.

⚠️ Common misconception: "Agents are autonomous AI." Today's agents are more like interns with a checklist β€” they follow patterns, use tools they're given, and fail in predictable ways. The "autonomy" is a loop, not consciousness.
Go deeper
The ReAct pattern (Yao et al., 2022) interleaves reasoning traces with tool calls. The model generates a thought ("I need to look up Q3 revenue"), an action (call a search tool), and then incorporates the observation into its next reasoning step. More complex frameworks add planning, memory, and multi-agent collaboration. Key challenges: agents can get stuck in loops, compound errors across steps, and rack up costs quickly. Tool design and guardrails matter more than model choice.
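A skeletal version of that loop looks like the sketch below. The call_llm function and the lone tool are stubs (in a real agent the LLM decides the thought and action at every step), but the think/act/observe cycle and the hard cap on steps are the essential structure.

```python
TOOLS = {
    # Stub tool; real agents get web search, code execution, internal APIs, etc.
    "search": lambda query: f"(top search result for: {query})",
}

def call_llm(history: list[str]) -> dict:
    # Stand-in for the model: a real agent asks the LLM for the next
    # thought and action given everything in `history` so far.
    return {"thought": "I should look up Q3 revenue.",
            "action": "search", "input": "Q3 revenue", "done": False}

def run_agent(task: str, max_steps: int = 5) -> list[str]:
    history = [f"Task: {task}"]
    for _ in range(max_steps):                            # the "autonomy" is literally this loop
        step = call_llm(history)
        history.append(f"Thought: {step['thought']}")
        if step["done"]:
            break
        observation = TOOLS[step["action"]](step["input"])   # act
        history.append(f"Observation: {observation}")        # observe, then repeat
    return history

print("\n".join(run_agent("Summarize Q3 revenue")))
```

The max_steps cap is the simplest guardrail against the stuck-in-a-loop and runaway-cost failure modes mentioned above.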
📈 Scaling Laws
30-second version

There's a remarkably predictable relationship: spend 10× more compute, get a measurably better model. The Chinchilla paper (2022) showed you should scale model size and training data together, not just make models bigger. This predictability is why companies pour billions into training.

⚠️ Common misconception: "We're hitting a wall β€” models can't get better." Scaling laws haven't broken yet, though the returns are shifting. Inference-time scaling (thinking longer, not training bigger) is opening a new axis of improvement.
Go deeper
Kaplan et al. (2020) at OpenAI discovered power-law relationships: loss decreases as a smooth function of compute, data, and parameters. Chinchilla (Hoffmann et al., 2022) refined this, showing many models were over-parameterized and under-trained; the optimal ratio is roughly 20 tokens per parameter. This insight shifted the industry: LLaMA 13B, trained on far more data per parameter, outperformed GPT-3 (175B) on most benchmarks. Current research explores "inference-time compute" scaling, spending more tokens thinking (chain-of-thought, search, verification) rather than just training bigger models.
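A back-of-the-envelope version of the Chinchilla rule, using the common approximation that training cost is about 6 × parameters × tokens in FLOPs. The printed numbers are illustrative sizing estimates, not a training plan.

```python
def compute_optimal(flops_budget: float) -> tuple[float, float]:
    # C ~ 6 * N * D, and the Chinchilla rule D ~ 20 * N  =>  C ~ 120 * N^2
    n_params = (flops_budget / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

for budget in (1e21, 1e23, 1e25):
    n, d = compute_optimal(budget)
    print(f"{budget:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e12:.2f}T tokens")
```

Plugging in roughly 6 × 10²³ FLOPs recovers Chinchilla's own configuration of about 70B parameters trained on about 1.4T tokens.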
🛡️ RLHF / Alignment
30-second version

Raw LLMs are erratic: they'll happily generate toxic content or ignore your question. RLHF (Reinforcement Learning from Human Feedback) is how we tame them. Humans rank model outputs, a reward model learns their preferences, and the LLM is fine-tuned to maximize that reward. InstructGPT proved this turns a wild model into a helpful assistant.

⚠️ Common misconception: "Alignment means censorship." Alignment is about making models follow instructions reliably and refuse genuinely harmful requests. The line between "safe" and "over-cautious" is a real design challenge, not a conspiracy.
Go deeper
The RLHF pipeline has three stages: (1) supervised fine-tuning on human-written examples, (2) training a reward model on human preference rankings, (3) optimizing the LLM against the reward model using PPO (Proximal Policy Optimization). Newer approaches like DPO (Direct Preference Optimization) skip the explicit reward model entirely. Constitutional AI (Anthropic) has the model critique its own outputs against principles. The field is evolving rapidly: RLHF is effective but imperfect, and "alignment" at a deeper level (ensuring AI goals match human values) remains an open research problem.
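The heart of stage 2 is a simple pairwise objective: the reward model should score the answer humans preferred above the one they rejected. Here's a minimal sketch with toy scores; it shows the Bradley-Terry-style loss used for reward modeling, stripped of the actual neural network.

```python
import numpy as np

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # -log(sigmoid(r_chosen - r_rejected)): low when the preferred answer wins.
    margin = reward_chosen - reward_rejected
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

print(preference_loss(2.0, -1.0))   # ~0.05: reward model already agrees with the human ranking
print(preference_loss(-1.0, 2.0))   # ~3.05: big loss, training pushes the model to flip its ranking
```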
🔓 Open vs Closed Models
30-second version

Closed models (GPT-4, Claude) are accessed via API; you can't see the weights or run them yourself. Open-weight models (LLaMA, Mistral, DeepSeek) let you download and run the model on your own hardware. The trade-off: closed models are generally more capable, but open models give you control, privacy, and no per-token costs.

⚠️ Common misconception: "Open-source models are free." The weights may be free, but running a 70B-parameter model requires serious GPU infrastructure. And "open weights" β‰  "open source" β€” most don't release training data or code.
Go deeper
The spectrum runs from fully closed (GPT-4: no weights, no architecture details) to fully open (OLMo: weights, data, code, training logs). Most "open" models like LLaMA are open-weight: you get the trained parameters but not the training data or full recipe. For startups, the choice depends on: volume (high volume favors self-hosting), data sensitivity (regulated industries may need on-premises), customization needs (open models can be fine-tuned freely), and capability requirements (frontier closed models still lead on complex reasoning). Many companies use a hybrid: closed models for hard tasks, open models for high-volume simple ones.
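The volume argument is easy to sanity-check with arithmetic. The sketch below compares per-token API pricing against a flat self-hosting bill; every number in it is a placeholder assumption, so swap in your own quotes before deciding anything.

```python
def monthly_api_cost(tokens: float, price_per_1m_tokens: float = 5.0) -> float:
    # Placeholder rate; real API pricing varies by model and by input vs output tokens.
    return tokens / 1e6 * price_per_1m_tokens

def monthly_selfhost_cost(gpu_hourly_rate: float = 4.0, gpus: int = 2) -> float:
    # Placeholder: two always-on GPUs at a flat hourly rate, ignoring staff and setup.
    return gpu_hourly_rate * gpus * 24 * 30

for tokens in (1e8, 1e9, 1e10):              # 100M, 1B, 10B tokens per month
    api, hosted = monthly_api_cost(tokens), monthly_selfhost_cost()
    winner = "API" if api < hosted else "self-host"
    print(f"{tokens:.0e} tokens/mo: API ${api:,.0f} vs self-host ${hosted:,.0f} -> {winner}")
```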

10 concepts · Built for founders who'd rather understand AI than just fund it.