🧪 TDD Challenge · advanced · ⏱️ 30–40m · ⭐ 200 XP
M-075 Build an Online Evaluation Pipeline
Description
Nebula Corp needs to continuously monitor the quality of its AI responses in production. Build an evaluation pipeline that scores each response on multiple criteria (relevance, groundedness, safety), samples production traffic at a configurable rate, detects quality regressions by comparing recent scores against a baseline, and generates alerts when quality drops below configured thresholds.
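The alerting step in the description is not exercised by the listed test cases, so here is a minimal sketch of how it might look. The function name `generateAlerts`, the threshold values, and the alert shape are all assumptions for illustration and are not part of the challenge specification.

```javascript
// Assumed minimum acceptable average per metric (illustrative values only).
const DEFAULT_THRESHOLDS = { relevance: 0.7, groundedness: 0.7, safety: 0.95 };

// Hypothetical helper: returns one alert per metric whose average score
// falls below its configured threshold.
function generateAlerts(averageScores, thresholds = DEFAULT_THRESHOLDS) {
  const alerts = [];
  for (const [metric, minimum] of Object.entries(thresholds)) {
    const value = averageScores[metric];
    // Skip metrics that were not measured rather than alerting on them.
    if (typeof value === "number" && value < minimum) {
      alerts.push({
        metric,
        value,
        threshold: minimum,
        // Assumption: safety failures are treated as more severe.
        severity: metric === "safety" ? "critical" : "warning",
        message: `${metric} average ${value.toFixed(2)} is below threshold ${minimum}`,
      });
    }
  }
  return alerts;
}
```

Calling `generateAlerts({ relevance: 0.3, groundedness: 0.2, safety: 1.0 })` with these assumed thresholds would yield alerts for relevance and groundedness but not for safety.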
Test Cases (3)
Evaluates a trace
Should return scores for a trace
Input: evaluateTrace({id:'t1',question:'What is the refund policy?',context:'Refunds available within 30 days.',response:'You can get a full refund within 30 days.',timestamp:1000})
Expected: CONTAINS:0.
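One possible starting point for this case is a token-overlap heuristic. The sketch below is an assumption, not the reference solution: the scoring formulas, the `BLOCKED_TERMS` list, and the `traceId` and `evaluatedAt` fields are hypothetical. Only the `scores` object with `relevance`, `groundedness`, and `safety` keys mirrors the shape used in the regression test input. The expected value `CONTAINS:0.` presumably checks that the serialized output includes a fractional score.

```javascript
// Illustrative blocklist; a production safety check would be far more robust.
const BLOCKED_TERMS = ["password", "ssn"];

// Lowercase the text and split it into a set of alphanumeric tokens.
function tokenize(text) {
  return new Set(String(text).toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Fraction of `source` tokens that also appear in `target`.
function overlap(source, target) {
  if (source.size === 0) return 0;
  let hits = 0;
  for (const token of source) {
    if (target.has(token)) hits++;
  }
  return hits / source.size;
}

// Round to two decimal places to keep the output readable.
const round2 = (x) => Math.round(x * 100) / 100;

function evaluateTrace(trace) {
  const questionTokens = tokenize(trace.question);
  const contextTokens = tokenize(trace.context);
  const responseTokens = tokenize(trace.response);
  const unsafe = BLOCKED_TERMS.some((term) => responseTokens.has(term));
  return {
    traceId: trace.id,
    scores: {
      // How much of the question's vocabulary the response addresses.
      relevance: round2(overlap(questionTokens, responseTokens)),
      // How much of the response's vocabulary is supported by the context.
      groundedness: round2(overlap(responseTokens, contextTokens)),
      safety: unsafe ? 0 : 1,
    },
    evaluatedAt: trace.timestamp,
  };
}
```

Exact-token overlap is crude (for example, "refund" and "refunds" do not match), so a real pipeline would more likely use an LLM judge or embedding similarity. The heuristic is used here only because it is deterministic and dependency-free.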
Samples correctly
Should evaluate ~50% of traces
Input: sampleAndEvaluate([
  {id:'t1',question:'Q1',context:'C1',response:'R1',timestamp:1},
  {id:'t2',question:'Q2',context:'C2',response:'R2',timestamp:2},
  {id:'t3',question:'Q3',context:'C3',response:'R3',timestamp:3},
  {id:'t4',question:'Q4',context:'C4',response:'R4',timestamp:4},
  {id:'t5',question:'Q5',context:'C5',response:'R5',timestamp:5},
  {id:'t6',question:'Q6',context:'C6',response:'R6',timestamp:6}
], 0.5)
Expected: CONTAINS:3
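Random sampling with `Math.random()` would hit the target rate only on average, so a check for exactly 3 of 6 traces could fail intermittently. The sketch below uses a deterministic quota instead: trace `i` is selected whenever `floor((i + 1) * rate)` exceeds `floor(i * rate)`. The returned shape (`total`, `sampled`, `results`), the optional `evaluate` parameter, and the placeholder scorer are assumptions made so the block runs on its own.

```javascript
// Placeholder scorer so this block is self-contained; swap in a real evaluator.
function evaluateTrace(trace) {
  return { traceId: trace.id, scores: { relevance: 1, groundedness: 1, safety: 1 } };
}

function sampleAndEvaluate(traces, sampleRate, evaluate = evaluateTrace) {
  // Clamp the rate into [0, 1] so out-of-range inputs behave predictably.
  const rate = Math.min(1, Math.max(0, sampleRate));
  const results = [];
  traces.forEach((trace, i) => {
    // Select this trace when the running quota floor(n * rate) advances.
    if (Math.floor((i + 1) * rate) > Math.floor(i * rate)) {
      results.push(evaluate(trace));
    }
  });
  return { total: traces.length, sampled: results.length, results };
}
```

With a rate of 0.5 this picks every second trace, so six traces produce three evaluations. A hash of the trace id would be a reasonable alternative when the same trace must get the same sampling decision across processes.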
Detects regression
Should detect quality drop
Input: detectRegression([{scores:{relevance:0.3,groundedness:0.2,safety:1.0}}], [{scores:{relevance:0.9,groundedness:0.8,safety:1.0}}])
Expected: CONTAINS:true
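A minimal sketch of regression detection follows. It assumes the first argument holds the recent evaluations and the second holds the baseline, which matches the lower scores appearing first in the test input. The `maxDrop` tolerance of 0.1, the per-metric averaging, and the returned `{ regressed, details }` shape are illustrative assumptions, not the reference solution.

```javascript
const METRICS = ["relevance", "groundedness", "safety"];

// Arithmetic mean; callers guard against empty arrays.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Flags a regression when any metric's recent average falls more than
// `maxDrop` below its baseline average.
function detectRegression(recent, baseline, maxDrop = 0.1) {
  // Without data on both sides there is nothing to compare.
  if (recent.length === 0 || baseline.length === 0) {
    return { regressed: false, details: {} };
  }
  const details = {};
  let regressed = false;
  for (const metric of METRICS) {
    const recentAvg = mean(recent.map((e) => e.scores[metric]));
    const baselineAvg = mean(baseline.map((e) => e.scores[metric]));
    const drop = baselineAvg - recentAvg;
    const metricRegressed = drop > maxDrop;
    details[metric] = { recentAvg, baselineAvg, drop, regressed: metricRegressed };
    if (metricRegressed) regressed = true;
  }
  return { regressed, details };
}
```

For the test input above, relevance and groundedness each drop by about 0.6 while safety is unchanged, so `regressed` is `true`. With small sample sizes a fixed tolerance can be noisy, and a statistical test over a larger window may be preferable.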