📖 Lessons
Why Testing AI Systems Is Different
Understand the unique challenges of testing AI and LLM applications
Unit Testing LLM Applications
Learn practical techniques for unit testing individual components of LLM applications
Evaluation Metrics for LLMs
Learn how to measure LLM output quality with appropriate metrics
Creating Test Datasets
Learn how to design comprehensive test datasets for LLM evaluation
LLM-as-Judge Evaluation
Use LLMs to evaluate other LLM outputs at scale
Human Evaluation
Design and implement effective human evaluation processes
Integration Testing
Test complete LLM pipelines, RAG systems, and agents end-to-end
Workshop: Building Test Suites
Hands-on workshop building a complete testing framework for LLM applications
Regression Testing
Prevent quality degradation and maintain consistent performance
Production Monitoring
Monitor LLM quality and performance in real-time production environments
AI Safety & Guardrails
Build input validation, output filtering, PII detection, and content safety systems for production AI
🎯 Missions
M-061 Break the Division Function
This division function works for most inputs. Find the inputs that break it.
M-069 Build a Quality Regression Detector
Nebula Corp just updated their chatbot's prompt and needs to verify the change didn't break anything. Build a regression detection system that compares evaluation results from two versions (baseline vs candidate), identifies statistically significant regressions per category, and produces a go/no-go deployment recommendation.
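As a starting point, the version comparison at the heart of this mission can be sketched with a simple per-category mean-difference check (a production system would use a proper statistical test such as a t-test). The function name, data shapes, and threshold below are illustrative assumptions, not a prescribed interface:

```python
# Hedged sketch: flag per-category regressions between two evaluation
# runs using a mean-score drop threshold. A real detector would apply a
# significance test rather than a raw threshold.

def detect_regressions(baseline: dict, candidate: dict, threshold: float = 0.05) -> dict:
    """Return categories whose mean score dropped by more than `threshold`.

    `baseline` and `candidate` map category name -> list of per-example scores.
    """
    regressions = {}
    for category, base_scores in baseline.items():
        cand_scores = candidate.get(category, [])
        if not base_scores or not cand_scores:
            continue  # can't compare a category missing from either run
        drop = (sum(base_scores) / len(base_scores)
                - sum(cand_scores) / len(cand_scores))
        if drop > threshold:
            regressions[category] = round(drop, 4)
    return regressions
```

An empty result maps naturally to a "go" recommendation; any flagged category is a "no-go" candidate for review.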
M-062 Build a Test Dataset Generator
Nebula Corp needs to systematically test their AI chatbot, but creating test cases manually is slow. Build a test dataset generator that takes a set of documents and automatically generates question-answer pairs for evaluation. The generator should create questions at different difficulty levels, pair them with expected answers from the source documents, and output a structured test dataset.
M-067 Build an AI Evaluation Metrics Calculator
Nebula Corp needs to measure their chatbot's quality with standard metrics. Build a metrics calculator that computes precision, recall, F1 score, and semantic similarity for AI-generated responses compared to ground truth answers. The calculator should handle edge cases and produce a comprehensive evaluation report.
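The core of this mission, token-level precision, recall, and F1 against a ground-truth answer, can be sketched in a few lines. The function name and return shape here are illustrative assumptions:

```python
from collections import Counter

# Hedged sketch: token-overlap precision, recall, and F1 between a
# generated answer and a reference answer, with empty-input edge cases.

def token_f1(predicted: str, reference: str) -> dict:
    """Compute token-overlap precision, recall, and F1."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Semantic similarity (the fourth metric in the mission) needs embeddings and is out of scope for a sketch this small.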
M-068 Build an AI Safety Guardrails Pipeline
Nebula Corp's chatbot has no safety filters — it happily leaks PII, follows injected instructions, and generates harmful content. Build a complete safety pipeline with input validation (prompt injection detection, topic boundaries), output filtering (PII redaction, harmful content blocking), and safety event logging.
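The output-filtering stage can be previewed with regex-based PII redaction. The patterns below (emails and US-style phone numbers) and the placeholder format are assumptions for illustration; real pipelines need broader pattern coverage and validation:

```python
import re

# Hedged sketch of one guardrail: replace detected PII spans with
# labeled placeholders before the response reaches the user.

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each detected PII span with a [REDACTED:<TYPE>] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```

The same structure extends to the input side: run the user message through injection-detection and topic-boundary checks before it ever reaches the model, and log every redaction as a safety event.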
M-066 Build an LLM-as-Judge Evaluator
Nebula Corp's AI team needs to evaluate chatbot responses at scale. Manual review doesn't scale, so they want an LLM-as-Judge system. Build an evaluator that scores responses on relevance, helpfulness, and safety using a simulated judge LLM. The evaluator should produce structured scores, detect low-quality responses, and generate an aggregate quality report.
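The evaluator's skeleton, score each dimension, then aggregate, can be sketched as below. `simulated_judge` is a toy stand-in for a real LLM call, and its keyword heuristic is purely an assumption for the demo:

```python
# Hedged sketch of the judge loop: score a response on three dimensions
# and aggregate into an overall score.

DIMENSIONS = ("relevance", "helpfulness", "safety")

def simulated_judge(question: str, response: str, dimension: str) -> int:
    """Toy stand-in for an LLM judge: returns a 1-5 score.

    A real judge would send `question`, `response`, and a per-dimension
    rubric to an LLM and parse a structured score from its reply.
    """
    if not response.strip():
        return 1
    # Crude heuristic: responses sharing vocabulary with the question score higher.
    on_topic = any(w in response.lower() for w in question.lower().split())
    return 4 if on_topic else 2

def evaluate(question: str, response: str) -> dict:
    scores = {d: simulated_judge(question, response, d) for d in DIMENSIONS}
    scores["overall"] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return scores
```

Aggregating `evaluate` results over a full test set gives the quality report the mission asks for.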
M-063 Build Your First LLM Eval Scorer
Nebula Corp's AI team has no way to measure whether their LLM outputs are any good. Build a simple evaluation scorer that checks LLM responses against expected answers using multiple metrics: exact match, keyword containment, and a basic similarity score based on word overlap. This is the foundation of every eval pipeline.
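The three checks this mission describes fit in a few small functions. The names and signatures below are illustrative, not a required interface; the similarity score here is Jaccard overlap over word sets:

```python
# Hedged sketch of a minimal eval scorer: exact match, keyword
# containment, and word-overlap similarity.

def exact_match(response: str, expected: str) -> bool:
    """Case-insensitive exact match after trimming whitespace."""
    return response.strip().lower() == expected.strip().lower()

def keyword_containment(response: str, keywords: list[str]) -> float:
    """Fraction of required keywords present in the response."""
    if not keywords:
        return 1.0
    lowered = response.lower()
    return sum(k.lower() in lowered for k in keywords) / len(keywords)

def word_overlap(response: str, expected: str) -> float:
    """Jaccard similarity over the sets of words in each text."""
    a, b = set(response.lower().split()), set(expected.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Running all three per example and averaging across a dataset is the simplest possible eval pipeline, and every later lesson builds on that loop.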
M-065 Fix the Broken Sort
The array sorting function at Nebula Corp is producing wrong results for certain inputs. Engineers have reported that numeric arrays come back in the wrong order. Dig into the code, find the bug, and fix it.
M-070 Red Team the Support Agent
Nebula Corp's customer support agent has a system prompt that restricts it to only answering product questions. Your mission: first, find a prompt injection that bypasses the guardrails. Then, patch the system prompt to defend against the attack.
M-064 Test the Email Validator
Write test cases for Nebula Corp's email validation function.
🔧 Workshops
W-019 AI Regression Test Suite Builder
Build a complete regression testing pipeline for AI systems. Create test datasets, run evaluations, compare versions, and generate go/no-go deployment reports.
W-018 Mini Evaluation Harness
Create an eval harness with 5 metrics: semantic similarity, factuality, relevance, coherence, and groundedness.