Course 8 of 14

Build Your Own Eval System

Everyone ships AI. Nobody measures it.

In 30 minutes you'll build an eval system — the skill that separates AI hobbyists from professionals.

It starts with ranking five responses.

Chapter 1: Vibes Are Not Metrics

You built an AI feature. Your boss asks "how good is it?" You say "it seems pretty good."

That's not an answer.

🔗 Connection: You got a taste of evaluation in Build Your Own Fine-Tuned Model (Chapter 5) — comparing your model's outputs against a baseline. This course goes deeper: systematic metrics, automated judges, and eval pipelines that run themselves. → Build Your Own Fine-Tuned Model

The Ranking Exercise

Below are five AI responses to the question: "How do I make pasta?"

Click to rank them from best (1) to worst (5):

Response A: Boil water, add pasta, cook for 8-12 minutes depending on type, drain, and serve with sauce.
Response B: To make pasta: 1) Fill large pot with water, add salt, bring to rolling boil 2) Add pasta, stir occasionally 3) Cook according to package directions (usually 8-12 min) 4) Drain, reserve pasta water 5) Toss with sauce, adding pasta water if needed for consistency.
Response C: Pasta is made from wheat flour and water. The ancient Romans ate a form of pasta. You should boil it in water until it's done.
Response D: Put the pasta in cold water and wait. It will eventually get soft. Then eat it.
Response E: For perfect pasta, use 4-6 quarts of water per pound of pasta. Salt generously (1 tablespoon per quart). Bring to vigorous boil before adding pasta. Stir within first minute to prevent sticking. Test for doneness 1-2 minutes before package time suggests. Reserve 1 cup pasta water before draining. Finish cooking in sauce for 1-2 minutes for better flavor integration.
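Hand rankings like these become real data the moment you write them down. A minimal sketch of turning rankings into an average rank per response (the three reviewers' orderings here are illustrative, not the "right" answer to the exercise):

```python
# Each reviewer orders responses A-E from best (position 1) to worst (position 5).
# These rankings are hypothetical examples.
rankings = [
    ["E", "B", "A", "C", "D"],   # reviewer 1
    ["B", "E", "A", "C", "D"],   # reviewer 2
    ["E", "B", "A", "D", "C"],   # reviewer 3
]

def average_rank(rankings):
    """Average position of each response across reviewers (lower = better)."""
    totals = {}
    for order in rankings:
        for position, response in enumerate(order, start=1):
            totals[response] = totals.get(response, 0) + position
    return {r: total / len(rankings) for r, total in sorted(totals.items())}

print(average_rank(rankings))
# The detailed responses (E, B) land near the top; cold-water D near the bottom.
```

Once rankings are numbers, disagreements between reviewers become visible and discussable instead of vibes.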

Chapter 2: What to Measure

Different tasks need different metrics. The metric you choose shapes the product you build.

Use accuracy, precision, recall, and F1 for classification; BLEU or ROUGE for comparing generated text against references; human preference for open-ended tasks.

Build a Confusion Matrix

You're building a spam classifier. Drag the predictions into the right boxes:

                       Predicted: Not Spam      Predicted: Spam
Actually: Not Spam     True Negative (0)        False Positive (0)
Actually: Spam         False Negative (0)       True Positive (0)
Drag these predictions:

Email about Nigerian prince → Spam
Work meeting invite → Not Spam
Mom's birthday reminder → Spam
Crypto scam → Not Spam
Fake lottery winner → Spam
Friend's vacation photos → Not Spam
Precision: 0% (of emails marked spam, how many actually were?)
Recall: 0% (of all spam emails, how many did we catch?)
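The two percentages above fall straight out of the confusion-matrix counts. A minimal sketch using the six emails from the exercise (the arrow in each drag item is the classifier's prediction; the actual labels follow common sense):

```python
# (actual, predicted) pairs for the six emails in the exercise.
emails = [
    ("spam",     "spam"),      # Nigerian prince
    ("not_spam", "not_spam"),  # work meeting invite
    ("not_spam", "spam"),      # mom's birthday reminder -> false positive
    ("spam",     "not_spam"),  # crypto scam -> false negative
    ("spam",     "spam"),      # fake lottery winner
    ("not_spam", "not_spam"),  # friend's vacation photos
]

tp = sum(1 for a, p in emails if a == "spam" and p == "spam")
fp = sum(1 for a, p in emails if a == "not_spam" and p == "spam")
fn = sum(1 for a, p in emails if a == "spam" and p == "not_spam")

precision = tp / (tp + fp)  # of emails marked spam, how many were spam?
recall    = tp / (tp + fn)  # of all spam emails, how many did we catch?
print(f"precision={precision:.0%} recall={recall:.0%}")  # both 67% here
```

Note the trade-off: flagging more aggressively raises recall but drags precision down, and vice versa.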

Chapter 3: Test Sets

Your eval is only as good as your test data. A good test set is the most valuable thing in your AI project.

Build a test set: input → expected output pairs. Include edge cases, adversarial examples, and typical queries.
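A test set can be as simple as a list of input/expected pairs plus a check function. A sketch under stated assumptions: the cases, the keyword check, and the stand-in `toy_bot` are all illustrative, and a real support bot would need a richer check than substring matching:

```python
# Each case pairs an input with something we can verify in the output.
test_set = [
    {"input": "How do I reset my password?",
     "must_contain": "reset"},               # typical query
    {"input": "REFUND NOW!!!",
     "must_contain": "refund"},              # adversarial tone
    {"input": "",
     "must_contain": "help"},                # edge case: empty input
]

def run_eval(bot, test_set):
    """Return the fraction of cases whose output contains the expected keyword."""
    passed = 0
    for case in test_set:
        output = bot(case["input"])
        if case["must_contain"].lower() in output.lower():
            passed += 1
    return passed / len(test_set)

# A canned bot so the sketch runs end to end.
def toy_bot(message):
    if not message:
        return "How can I help you today?"
    return f"Here is how to handle: {message}"

print(f"pass rate: {run_eval(toy_bot, test_set):.0%}")
```

The structure is the point: once cases live in data rather than in your head, every change to the bot can be scored against the same fixed set.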

Build a Customer Support Test Set

Create 10 test cases for a customer support bot:

Chapter 4: Automated Eval

Running evals by hand doesn't scale. Use an LLM to judge another LLM's outputs (LLM-as-judge).

In practice, LLM judges agree with human raters roughly 80% of the time — not perfect, but enough to catch regressions.
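A judge is just a prompt plus a parser. A minimal sketch: `call_llm` is a placeholder for whatever client you use (hypothetical signature: prompt in, text out), and the rubric wording is an assumption you should tune for your task:

```python
# A judge prompt that asks one LLM to grade another's output on a rubric.
JUDGE_PROMPT = """You are grading a customer support reply.

Question: {question}
Reply: {reply}

Score the reply from 1 (unusable) to 5 (excellent) for accuracy,
helpfulness, and tone. Respond with only the number."""

def judge(question, reply, call_llm):
    """call_llm: any function that takes a prompt string and returns text."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, reply=reply))
    try:
        return max(1, min(5, int(raw.strip())))  # clamp to the 1-5 scale
    except ValueError:
        return None  # judge went off-script; flag for manual review

# A canned "LLM" so the sketch runs without an API key.
score = judge("How do I reset my password?",
              "Click 'Forgot password' on the login page.",
              lambda prompt: "4")
print(score)  # 4
```

Two details matter in production: constrain the judge's output format (here, "only the number"), and handle the cases where it ignores you anyway.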

Build an LLM Judge

Write an evaluation prompt:

Chapter 5: The Eval Loop

Change something (prompt, model, temperature) → run eval → compare.

Every improvement to your AI should start and end with an eval.
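The loop itself is tiny. A sketch comparing two prompt versions by average judge score — the per-case scores here are hypothetical stand-ins for what your judge would return:

```python
# The loop: change something, run the eval, compare averages.
def average_score(scores):
    return sum(scores) / len(scores)

# Hypothetical 1-5 judge scores for the same five test cases
# under two system-prompt versions.
version_a = [3, 4, 3, 2, 4]   # "Basic" prompt
version_b = [4, 4, 5, 4, 5]   # "Detailed" prompt

a, b = average_score(version_a), average_score(version_b)
print(f"A: {a:.1f}  B: {b:.1f}")  # A: 3.2  B: 4.4
if b > a:
    print("Version B wins; record both scores before the next change.")
```

Because both versions run against the same test cases, the comparison is apples to apples — that is the entire value of the loop.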

A/B Test: Prompt Variations

Test two versions of your system prompt:

Version A: Basic — Average Score: 0.0

Version B: Detailed — Average Score: 0.0

Eval Dashboard

Track your metrics over time:

Performance Over Time

Week 1: 3.2/5
Week 2: 3.5/5
Week 3: 4.2/5
Week 4: 4.5/5

🎉 You Built an Eval System!

You can now measure, compare, and improve any AI system.

This is the skill that separates professionals from hobbyists.

🧠 Final Recall

Test yourself. No peeking. These questions cover everything you just learned.

1. Why are "vibes" insufficient for evaluating AI systems?





2. In a spam classifier, what does "precision" measure?





3. What makes a good test set for AI evaluation?





4. What is "LLM-as-judge" and why is it useful?





5. What is the purpose of A/B testing in AI evaluation?





← Previous: AI Agent | Next: Guardrails →