🧪 TDD Challenge · advanced · ⏱️ 30–40m · ⭐ 200 XP

M-075 Build an Online Evaluation Pipeline

Description

Nebula Corp needs to continuously monitor the quality of its AI responses in production. Build an evaluation pipeline that scores each response on three criteria (relevance, groundedness, safety), samples production traffic at a configurable rate, detects quality regressions by comparing recent scores against a baseline, and generates alerts when a score drops below its threshold.
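The alerting step has no test case on this page, so its shape is left open. A minimal sketch, assuming a hypothetical `generateAlerts` function, illustrative threshold values, and evaluations shaped like `{scores: {relevance, groundedness, safety}}`, might compare each criterion's mean score against a per-criterion minimum:

```javascript
// Assumed minimum acceptable mean score per criterion (illustrative values only).
const THRESHOLDS = { relevance: 0.7, groundedness: 0.7, safety: 0.95 };

// Hypothetical helper: returns one alert per criterion whose mean score
// across the given evaluations falls below its threshold.
function generateAlerts(evaluations, thresholds = THRESHOLDS) {
  const alerts = [];
  if (evaluations.length === 0) return alerts; // nothing to judge

  for (const [criterion, threshold] of Object.entries(thresholds)) {
    const total = evaluations.reduce((sum, e) => sum + e.scores[criterion], 0);
    const meanScore = total / evaluations.length;
    if (meanScore < threshold) {
      alerts.push({
        criterion,
        mean: meanScore,
        threshold,
        message: `${criterion} mean ${meanScore.toFixed(2)} is below threshold ${threshold}`,
      });
    }
  }
  return alerts;
}
```

Keeping thresholds per criterion lets safety be held to a stricter bar than relevance or groundedness.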

Test Cases (3)

Evaluates a trace

Should return scores for a trace

Input: evaluateTrace({id:'t1',question:'What is the refund policy?',context:'Refunds available within 30 days.',response:'You can get a full refund within 30 days.',timestamp:1000})

Expected: CONTAINS:0.
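The challenge does not prescribe how the scores are computed. One possible sketch, assuming simple token-overlap heuristics for relevance and groundedness and a hypothetical blocklist for safety (none of which come from the challenge itself), is:

```javascript
// Split text into lowercase alphanumeric tokens.
const tokenize = (text) => text.toLowerCase().match(/[a-z0-9]+/g) || [];

// Fraction of tokens in `a` that also appear in `b` (0 when `a` has no tokens).
function overlap(a, b) {
  const aTokens = tokenize(a);
  if (aTokens.length === 0) return 0;
  const bSet = new Set(tokenize(b));
  return aTokens.filter((t) => bSet.has(t)).length / aTokens.length;
}

// Assumed blocklist, for illustration only.
const UNSAFE_TERMS = ["password", "ssn"];

function evaluateTrace(trace) {
  // Relevance (assumption): share of question tokens echoed in the response.
  const relevance = overlap(trace.question, trace.response);
  // Groundedness (assumption): share of response tokens found in the context.
  const groundedness = overlap(trace.response, trace.context);
  // Safety (assumption): 0 if any blocklisted term appears in the response, else 1.
  const responseTokens = new Set(tokenize(trace.response));
  const safety = UNSAFE_TERMS.some((t) => responseTokens.has(t)) ? 0 : 1;

  return {
    id: trace.id,
    timestamp: trace.timestamp,
    scores: { relevance, groundedness, safety },
  };
}
```

Because the heuristic scores are fractions between 0 and 1, the serialized result for a trace like the one in the test contains decimal values, which is what the `CONTAINS:0.` expectation appears to check. A production pipeline would more likely use an LLM judge or a trained classifier in place of token overlap.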

Samples correctly

Should evaluate ~50% of traces

Input: sampleAndEvaluate([{id:'t1',question:'Q1',context:'C1',response:'R1',timestamp:1},{id:'t2',question:'Q2',context:'C2',response:'R2',timestamp:2},{id:'t3',question:'Q3',context:'C3',response:'R3',timestamp:3},{id:'t4',question:'Q4',context:'C4',response:'R4',timestamp:4},{id:'t5',question:'Q5',context:'C5',response:'R5',timestamp:5},{id:'t6',question:'Q6',context:'C6',response:'R6',timestamp:6}], 0.5)

Expected: CONTAINS:3
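The test expects a count of 3 for six traces at a rate of 0.5, which suggests deterministic rather than random sampling. A minimal sketch, assuming a quota-based selection rule and an optional `evaluate` parameter with a placeholder default (both are assumptions, not part of the challenge), could look like this:

```javascript
// Placeholder evaluator (assumption): stands in for the pipeline's real scorer.
const defaultEvaluate = (trace) => ({
  id: trace.id,
  scores: { relevance: 1, groundedness: 1, safety: 1 },
});

function sampleAndEvaluate(traces, sampleRate, evaluate = defaultEvaluate) {
  if (sampleRate < 0 || sampleRate > 1) {
    throw new RangeError("sampleRate must be between 0 and 1");
  }
  const results = [];
  traces.forEach((trace, i) => {
    // Select trace i whenever the running quota floor((i + 1) * rate) ticks up.
    // This is deterministic and yields floor(n * rate) samples for n traces.
    if (Math.floor((i + 1) * sampleRate) > Math.floor(i * sampleRate)) {
      results.push(evaluate(trace));
    }
  });
  return { total: traces.length, sampled: results.length, results };
}
```

With six traces and a rate of 0.5, this rule selects the second, fourth, and sixth traces. Random sampling with `Math.random()` is closer to real production sampling but would make the count vary between runs.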

Detects regression

Should detect quality drop

Input: detectRegression([{scores:{relevance:0.3,groundedness:0.2,safety:1.0}}],[{scores:{relevance:0.9,groundedness:0.8,safety:1.0}}])

Expected: CONTAINS:true
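Judging by the test input, the first argument holds the recent evaluations and the second the baseline, though the challenge does not state the order explicitly. A sketch under that assumption, comparing per-criterion means and using a hypothetical `maxDrop` tolerance of 0.1, might be:

```javascript
const CRITERIA = ["relevance", "groundedness", "safety"];

// Arithmetic mean of a non-empty array of numbers.
const mean = (values) => values.reduce((sum, v) => sum + v, 0) / values.length;

// Flags a regression when any criterion's mean falls by more than `maxDrop`
// relative to the baseline. The 0.1 default is an assumed tolerance.
function detectRegression(recent, baseline, maxDrop = 0.1) {
  // Without data on both sides there is nothing to compare.
  if (recent.length === 0 || baseline.length === 0) {
    return { regression: false, regressed: [], drops: {} };
  }
  const drops = {};
  for (const criterion of CRITERIA) {
    const recentMean = mean(recent.map((e) => e.scores[criterion]));
    const baselineMean = mean(baseline.map((e) => e.scores[criterion]));
    drops[criterion] = baselineMean - recentMean; // positive means quality fell
  }
  const regressed = CRITERIA.filter((c) => drops[c] > maxDrop);
  return { regression: regressed.length > 0, regressed, drops };
}
```

For the test input, relevance and groundedness each fall by about 0.6 while safety is unchanged, so the result reports a regression on the first two criteria. Returning the per-criterion drops alongside the boolean makes it easier to see which criterion triggered the flag.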
