
Testing and Evaluation

📖 11 lessons · 🎯 10 missions · 🔧 2 workshops · 🚀 1 project · ⏱️ ~16 hours

📖Lessons

1 · beginner · 📖 14 min · lesson

Why Testing AI Systems is Different

Understand the unique challenges of testing AI and LLM applications

testing · fundamentals · challenges · philosophy
2 · beginner · 📖 15 min · lesson

Unit Testing LLM Applications

Learn practical techniques for unit testing individual components of LLM applications

unit-testing · mocking · components · testing
🔒 intermediate · 📖 16 min · lesson · PRO

Evaluation Metrics for LLMs

Learn how to measure LLM output quality with appropriate metrics

metrics · evaluation · quality · measurement
🔒 intermediate · 📖 15 min · lesson · PRO

Creating Test Datasets

Learn how to design comprehensive test datasets for LLM evaluation

datasets · test-cases · edge-cases · evaluation
🔒 intermediate · 📖 14 min · lesson · PRO

LLM-as-Judge Evaluation

Use LLMs to evaluate other LLM outputs at scale

llm-judge · automated-eval · scaling · metrics
🔒 intermediate · 📖 16 min · lesson · PRO

Human Evaluation

Design and implement effective human evaluation processes

human-eval · rating-scales · rubrics · quality
🔒 advanced · 📖 15 min · lesson · PRO

Integration Testing

Test complete LLM pipelines, RAG systems, and agents end-to-end

integration · pipelines · rag · agents · e2e
🔒 advanced · 📖 25 min · lesson · PRO

Workshop: Building Test Suites

Hands-on workshop building a complete testing framework for LLM applications

workshop · hands-on · testing-framework · automation · ci-cd
🔒 advanced · 📖 16 min · lesson · PRO

Regression Testing

Prevent quality degradation and maintain consistent performance

regression · versioning · baselines · continuous-eval
🔒 advanced · 📖 16 min · lesson · PRO

Production Monitoring

Monitor LLM quality and performance in real-time production environments

monitoring · production · observability · alerting · metrics
🔒 intermediate · 📖 20 min · lesson · PRO

AI Safety & Guardrails

Build input validation, output filtering, PII detection, and content safety systems for production AI

safety · guardrails · pii · content-filtering · validation · production

🎯Missions

🔒 beginner · 🎯 15–30 min · mission · Rank 01 · PRO

M-061: Break the Division Function

This division function works for most inputs. Find the inputs that break it.

🔒 advanced · 🎯 35–50 min · mission · Rank 09 · PRO

M-069: Build a Quality Regression Detector

Nebula Corp just updated their chatbot's prompt and needs to verify the change didn't break anything. Build a regression detection system that compares evaluation results from two versions (baseline vs candidate), identifies statistically significant regressions per category, and produces a go/no-go deployment recommendation.
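A minimal sketch of the comparison at the heart of this mission. Here a fixed drop margin stands in for a real significance test, and the function name and report shape are illustrative, not part of the mission spec:

```python
def detect_regressions(baseline: dict[str, list[float]],
                       candidate: dict[str, list[float]],
                       margin: float = 0.05) -> dict:
    """Flag categories where the candidate's mean score drops by more than
    `margin` versus the baseline, and derive a go/no-go recommendation."""
    regressions = {}
    for category, base_scores in baseline.items():
        cand_scores = candidate.get(category, [])
        if not base_scores or not cand_scores:
            continue  # skip categories missing from either run
        delta = (sum(cand_scores) / len(cand_scores)
                 - sum(base_scores) / len(base_scores))
        if delta < -margin:
            regressions[category] = delta
    return {"regressions": regressions, "deploy": not regressions}
```

A real detector would replace the fixed margin with a per-category statistical test (e.g. a bootstrap or t-test over example-level scores).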

🔒 beginner · 🎯 25–40 min · mission · Rank 08 · PRO

M-062: Build a Test Dataset Generator

Nebula Corp needs to systematically test their AI chatbot, but creating test cases manually is slow. Build a test dataset generator that takes a set of documents and automatically generates question-answer pairs for evaluation. The generator should create questions at different difficulty levels, pair them with expected answers from the source documents, and output a structured test dataset.

🔒 intermediate · 🎯 30–45 min · mission · Rank 08 · PRO

M-067: Build an AI Evaluation Metrics Calculator

Nebula Corp needs to measure their chatbot's quality with standard metrics. Build a metrics calculator that computes precision, recall, F1 score, and semantic similarity for AI-generated responses compared to ground truth answers. The calculator should handle edge cases and produce a comprehensive evaluation report.
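One common way to compute the precision/recall/F1 part of this mission is token overlap between response and ground truth; the sketch below assumes set semantics (duplicate tokens counted once) and an illustrative function name:

```python
def token_prf(response: str, truth: str) -> dict:
    """Token-level precision, recall, and F1 of a response against a
    ground-truth answer, with the all-empty edge case handled explicitly."""
    pred, gold = set(response.lower().split()), set(truth.lower().split())
    if not pred and not gold:
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}  # both empty: perfect
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Semantic similarity would come from an embedding model rather than token overlap; this sketch covers only the lexical metrics.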

🔒 intermediate · 🎯 35–55 min · mission · Rank 08 · PRO

M-068: Build an AI Safety Guardrails Pipeline

Nebula Corp's chatbot has no safety filters — it happily leaks PII, follows injected instructions, and generates harmful content. Build a complete safety pipeline with input validation (prompt injection detection, topic boundaries), output filtering (PII redaction, harmful content blocking), and safety event logging.
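A toy sketch of two of the pipeline stages, assuming regex-based PII redaction and keyword heuristics for injection detection; the patterns and names are illustrative, and real coverage needs far broader pattern sets:

```python
import re

# Illustrative patterns only -- production PII detection needs many more.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
INJECTION_MARKERS = ("ignore previous instructions", "disregard your system prompt")

def redact_pii(text: str) -> str:
    """Replace each matched PII span with a [REDACTED:<type>] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text

def looks_like_injection(user_input: str) -> bool:
    """Naive case-insensitive keyword check for prompt-injection attempts."""
    lowered = user_input.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)
```

A full pipeline would chain these with topic-boundary checks, output filtering, and a logger that records every triggered rule as a safety event.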

🔒 intermediate · 🎯 30–45 min · mission · Rank 08 · PRO

M-066: Build an LLM-as-Judge Evaluator

Nebula Corp's AI team needs to evaluate chatbot responses at scale. Manual review doesn't scale, so they want an LLM-as-Judge system. Build an evaluator that scores responses on relevance, helpfulness, and safety using a simulated judge LLM. The evaluator should produce structured scores, detect low-quality responses, and generate an aggregate quality report.

7 · beginner · 🎯 15–30 min · mission · Rank 08

M-063: Build Your First LLM Eval Scorer

Nebula Corp's AI team has no way to measure whether their LLM outputs are any good. Build a simple evaluation scorer that checks LLM responses against expected answers using multiple metrics: exact match, keyword containment, and a basic similarity score based on word overlap. This is the foundation of every eval pipeline.
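The three metrics named here can be sketched in a few lines each; the function names are illustrative, and normalization (lowercasing, whitespace stripping) is an assumption rather than part of the mission spec:

```python
def exact_match(response: str, expected: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0

def keyword_containment(response: str, keywords: list[str]) -> float:
    """Fraction of required keywords that appear in the response."""
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 0.0

def word_overlap(response: str, expected: str) -> float:
    """Jaccard similarity over word sets -- a crude lexical proxy for similarity."""
    a, b = set(response.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0
```

Each scorer returns a value in [0, 1], which makes them easy to aggregate into a single eval report later.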

🔒 intermediate · 🎯 20–30 min · mission · Rank 01 · PRO

M-065: Fix the Broken Sort

The array sorting function at Nebula Corp is producing wrong results for certain inputs. Engineers have reported that numeric arrays come back in the wrong order. Dig into the code, find the bug, and fix it.

🔒 advanced · 🎯 30–40 min · mission · Rank 08 · PRO

M-070: Red Team the Support Agent

Nebula Corp's customer support agent has a system prompt that restricts it to only answering product questions. Your mission: first, find a prompt injection that bypasses the guardrails. Then, patch the system prompt to defend against the attack.

🔒 beginner · 🎯 20–35 min · mission · Rank 01 · PRO

M-064: Test the Email Validator

Write test cases for Nebula Corp's email validation function.