📖 Lessons
Why Testing AI Systems Is Different
Understand the unique challenges of testing AI and LLM applications
Unit Testing LLM Applications
Learn practical techniques for unit testing individual components of LLM applications
Evaluation Metrics for LLMs
Learn how to measure LLM output quality with appropriate metrics
Creating Test Datasets
Learn how to design comprehensive test datasets for LLM evaluation
LLM-as-Judge Evaluation
Use LLMs to evaluate other LLM outputs at scale
Human Evaluation
Design and implement effective human evaluation processes
Integration Testing
Test complete LLM pipelines, RAG systems, and agents end-to-end
Workshop: Building Test Suites
Hands-on workshop building a complete testing framework for LLM applications
Regression Testing
Prevent quality degradation and maintain consistent performance
Production Monitoring
Monitor LLM quality and performance in real-time production environments
AI Safety & Guardrails
Build input validation, output filtering, PII detection, and content safety systems for production AI
🎯 Missions
M-061 Break the Division Function
This division function works for most inputs. Find the inputs that break it.
M-069 Build a Quality Regression Detector
Nebula Corp just updated their chatbot's prompt and needs to verify the change didn't break anything. Build a regression detection system that compares evaluation results from two versions (baseline vs candidate), identifies statistically significant regressions per category, and produces a go/no-go deployment recommendation.
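As a starting point, the version comparison at the heart of this mission can be sketched with a simple per-category mean-difference check (a production system would use a proper statistical test such as a t-test). The function name, data shapes, and threshold below are illustrative assumptions, not a prescribed interface:

```python
# Hedged sketch: flag per-category regressions between two evaluation
# runs using a mean-score drop threshold. A real detector would apply a
# significance test rather than a raw threshold.

def detect_regressions(baseline: dict, candidate: dict, threshold: float = 0.05) -> dict:
    """Return categories whose mean score dropped by more than `threshold`.

    `baseline` and `candidate` map category name -> list of per-example scores.
    """
    regressions = {}
    for category, base_scores in baseline.items():
        cand_scores = candidate.get(category, [])
        if not base_scores or not cand_scores:
            continue  # can't compare a category missing from either run
        drop = (sum(base_scores) / len(base_scores)
                - sum(cand_scores) / len(cand_scores))
        if drop > threshold:
            regressions[category] = round(drop, 4)
    return regressions
```

An empty result maps naturally to a "go" recommendation; any flagged category is a "no-go" candidate for review.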
M-062 Build a Test Dataset Generator
Nebula Corp needs to systematically test their AI chatbot, but creating test cases manually is slow. Build a test dataset generator that takes a set of documents and automatically generates question-answer pairs for evaluation. The generator should create questions at different difficulty levels, pair them with expected answers from the source documents, and output a structured test dataset.
M-067 Build an AI Evaluation Metrics Calculator
Nebula Corp needs to measure their chatbot's quality with standard metrics. Build a metrics calculator that computes precision, recall, F1 score, and semantic similarity for AI-generated responses compared to ground truth answers. The calculator should handle edge cases and produce a comprehensive evaluation report.
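The core of this mission, token-level precision, recall, and F1 against a ground-truth answer, can be sketched in a few lines. The function name and return shape here are illustrative assumptions:

```python
from collections import Counter

# Hedged sketch: token-overlap precision, recall, and F1 between a
# generated answer and a reference answer, with empty-input edge cases.

def token_f1(predicted: str, reference: str) -> dict:
    """Compute token-overlap precision, recall, and F1."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Semantic similarity (the fourth metric in the mission) needs embeddings and is out of scope for a sketch this small.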
M-068 Build an AI Safety Guardrails Pipeline
Nebula Corp's chatbot has no safety filters — it happily leaks PII, follows injected instructions, and generates harmful content. Build a complete safety pipeline with input validation (prompt injection detection, topic boundaries), output filtering (PII redaction, harmful content blocking), and safety event logging.
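The output-filtering stage can be previewed with regex-based PII redaction. The patterns below (emails and US-style phone numbers) and the placeholder format are assumptions for illustration; real pipelines need broader pattern coverage and validation:

```python
import re

# Hedged sketch of one guardrail: replace detected PII spans with
# labeled placeholders before the response reaches the user.

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each detected PII span with a [REDACTED:<TYPE>] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```

The same structure extends to the input side: run the user message through injection-detection and topic-boundary checks before it ever reaches the model, and log every redaction as a safety event.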
M-066 Build an LLM-as-Judge Evaluator
Nebula Corp's AI team needs to evaluate chatbot responses at scale. Manual review doesn't scale, so they want an LLM-as-Judge system. Build an evaluator that scores responses on relevance, helpfulness, and safety using a simulated judge LLM. The evaluator should produce structured scores, detect low-quality responses, and generate an aggregate quality report.
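The evaluator's skeleton, score each dimension, then aggregate, can be sketched as below. `simulated_judge` is a toy stand-in for a real LLM call, and its keyword heuristic is purely an assumption for the demo:

```python
# Hedged sketch of the judge loop: score a response on three dimensions
# and aggregate into an overall score.

DIMENSIONS = ("relevance", "helpfulness", "safety")

def simulated_judge(question: str, response: str, dimension: str) -> int:
    """Toy stand-in for an LLM judge: returns a 1-5 score.

    A real judge would send `question`, `response`, and a per-dimension
    rubric to an LLM and parse a structured score from its reply.
    """
    if not response.strip():
        return 1
    # Crude heuristic: responses sharing vocabulary with the question score higher.
    on_topic = any(w in response.lower() for w in question.lower().split())
    return 4 if on_topic else 2

def evaluate(question: str, response: str) -> dict:
    scores = {d: simulated_judge(question, response, d) for d in DIMENSIONS}
    scores["overall"] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return scores
```

Aggregating `evaluate` results over a full test set gives the quality report the mission asks for.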
M-063 Build Your First LLM Eval Scorer
Nebula Corp's AI team has no way to measure whether their LLM outputs are any good. Build a simple evaluation scorer that checks LLM responses against expected answers using multiple metrics: exact match, keyword containment, and a basic similarity score based on word overlap. This is the foundation of every eval pipeline.
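The three checks this mission describes fit in a few small functions. The names and signatures below are illustrative, not a required interface; the similarity score here is Jaccard overlap over word sets:

```python
# Hedged sketch of a minimal eval scorer: exact match, keyword
# containment, and word-overlap similarity.

def exact_match(response: str, expected: str) -> bool:
    """Case-insensitive exact match after trimming whitespace."""
    return response.strip().lower() == expected.strip().lower()

def keyword_containment(response: str, keywords: list[str]) -> float:
    """Fraction of required keywords present in the response."""
    if not keywords:
        return 1.0
    lowered = response.lower()
    return sum(k.lower() in lowered for k in keywords) / len(keywords)

def word_overlap(response: str, expected: str) -> float:
    """Jaccard similarity over the sets of words in each text."""
    a, b = set(response.lower().split()), set(expected.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Running all three per example and averaging across a dataset is the simplest possible eval pipeline, and every later lesson builds on that loop.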
M-065 Fix the Broken Sort
The array sorting function at Nebula Corp is producing wrong results for certain inputs. Engineers have reported that numeric arrays come back in the wrong order. Dig into the code, find the bug, and fix it.
M-070 Red Team the Support Agent
Nebula Corp's customer support agent has a system prompt that restricts it to only answering product questions. Your mission: first, find a prompt injection that bypasses the guardrails. Then, patch the system prompt to defend against the attack.
M-064 Test the Email Validator
Write test cases for Nebula Corp's email validation function.
🔧 Workshops
W-019 AI Regression Test Suite Builder
Build a complete regression testing pipeline for AI systems. Create test datasets, run evaluations, compare versions, and generate go/no-go deployment reports.
W-018 Mini Evaluation Harness
Create an eval harness with 5 metrics: semantic similarity, factuality, relevance, coherence, and groundedness.