🧪 TDD Challenge · advanced · ⏱️ 30–40m · ⭐ 200 XP

M-075 Build an Online Evaluation Pipeline

Description

Nebula Corp needs to continuously monitor the quality of its AI responses in production. Build an evaluation pipeline that scores each response on multiple criteria (relevance, groundedness, safety), samples production traffic at a configurable rate, detects quality regressions by comparing recent scores against a baseline, and generates alerts when quality drops below thresholds.
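The scoring and sampling steps described above could be sketched roughly as follows. This is a minimal illustration, not the reference solution: the token-overlap heuristics, the `BLOCKLIST` terms, the `Evaluation` return shape, and the deterministic stride sampler are all assumptions chosen to keep the example small. Only the function names `evaluateTrace` and `sampleAndEvaluate` and the trace fields come from the test cases.

```typescript
interface Trace {
  id: string;
  question: string;
  context: string;
  response: string;
  timestamp: number;
}

interface Scores {
  relevance: number;
  groundedness: number;
  safety: number;
}

// Hypothetical result shape; the challenge does not specify one.
interface Evaluation {
  traceId: string;
  scores: Scores;
}

// Hypothetical blocklist, used only to illustrate a safety check.
const BLOCKLIST: string[] = ["password", "ssn"];

// Lowercase the text and split it into a set of alphanumeric tokens.
function tokens(text: string): Set<string> {
  return new Set<string>(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Fraction of the tokens in `source` that also appear in `target`.
function overlap(source: Set<string>, target: Set<string>): number {
  if (source.size === 0) return 0;
  let hits = 0;
  source.forEach((t) => {
    if (target.has(t)) hits += 1;
  });
  return hits / source.size;
}

function evaluateTrace(trace: Trace): Evaluation {
  const question = tokens(trace.question);
  const context = tokens(trace.context);
  const response = tokens(trace.response);
  const unsafe = BLOCKLIST.some((term) => response.has(term));
  return {
    traceId: trace.id,
    scores: {
      relevance: overlap(question, response), // question terms echoed in the response
      groundedness: overlap(response, context), // response terms supported by the context
      safety: unsafe ? 0 : 1,
    },
  };
}

// Deterministic stride sampling: keeps floor(n * rate) of n traces,
// so a test can check the count without depending on randomness.
function sampleAndEvaluate(traces: Trace[], rate: number): Evaluation[] {
  const r = Math.min(1, Math.max(0, rate));
  return traces
    .filter((_, i) => Math.floor((i + 1) * r) > Math.floor(i * r))
    .map(evaluateTrace);
}
```

With a rate of 0.5, this sampler keeps every second trace, so six input traces yield three evaluations. A production pipeline would more likely sample randomly or by hashing the trace id; the stride approach is used here only because it makes the sampled count reproducible.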

Test Cases (3)

Evaluates a trace
Should return scores for a trace
Input: evaluateTrace({id:'t1',question:'What is the refund policy?',context:'Refunds available within 30 days.',response:'You can get a full refund within 30 days.',timestamp:1000})
Expected: CONTAINS:0.
Samples correctly
Should evaluate ~50% of traces
Input: sampleAndEvaluate([{id:'t1',question:'Q1',context:'C1',response:'R1',timestamp:1},{id:'t2',question:'Q2',context:'C2',response:'R2',timestamp:2},{id:'t3',question:'Q3',context:'C3',response:'R3',timestamp:3},{id:'t4',question:'Q4',context:'C4',response:'R4',timestamp:4},{id:'t5',question:'Q5',context:'C5',response:'R5',timestamp:5},{id:'t6',question:'Q6',context:'C6',response:'R6',timestamp:6}], 0.5)
Expected: CONTAINS:3
Detects regression
Should detect quality drop
Input: detectRegression([{scores:{relevance:0.3,groundedness:0.2,safety:1.0}}],[{scores:{relevance:0.9,groundedness:0.8,safety:1.0}}])
Expected: CONTAINS:true
