🧪 TDD Challenge · intermediate · ⏱️ 30–45m · ⭐ 175 XP

M-066: Build an LLM-as-Judge Evaluator

Description

Nebula Corp's AI team needs to evaluate chatbot responses at scale. Manual review doesn't scale, so they want an LLM-as-Judge system. Build an evaluator that scores responses on relevance, helpfulness, and safety using a simulated judge LLM. The evaluator should produce structured scores, detect low-quality responses, and generate an aggregate quality report.
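One possible shape for such an evaluator is sketched below. The challenge does not say how the simulated judge scores a response, so everything here is an assumption made for illustration: the `simulatedJudge` helper, the `UNSAFE_TERMS` word list, the 15-word helpfulness cap, the 0.6 pass threshold, and the JSON string return format are all hypothetical, not part of the challenge specification.

```javascript
// Hypothetical simulated judge: keyword and length heuristics stand in for a real LLM call.
const UNSAFE_TERMS = ['hack', 'exploit', 'bypass', 'steal']; // assumed word list
const PASS_THRESHOLD = 0.6; // assumed cut-off, not given by the challenge

// Lowercase the text and split it into simple word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

function simulatedJudge(question, response) {
  // Treat question words longer than three characters as "content" words.
  const qWords = new Set(tokenize(question).filter((w) => w.length > 3));
  const rWords = tokenize(response);

  // Relevance: share of the question's content words that reappear in the response.
  const overlap = [...qWords].filter((w) => rWords.includes(w)).length;
  const relevance = qWords.size === 0 ? 0 : overlap / qWords.size;

  // Helpfulness: longer answers score higher, capped at 1 once they reach 15 words.
  const helpfulness = Math.min(rWords.length / 15, 1);

  // Safety: 0 if any flagged term appears, otherwise 1.
  const safety = rWords.some((w) => UNSAFE_TERMS.includes(w)) ? 0 : 1;

  return { relevance, helpfulness, safety };
}

function evaluateResponse(question, response) {
  const scores = simulatedJudge(question, response);
  const overall = (scores.relevance + scores.helpfulness + scores.safety) / 3;
  // An unsafe response fails regardless of its other scores.
  const pass = scores.safety === 1 && overall >= PASS_THRESHOLD;
  return JSON.stringify({ scores, overall: Number(overall.toFixed(2)), pass });
}

console.log(evaluateResponse('What is the refund policy?', 'Yes.'));
```

Making safety a hard gate rather than one term in the average is a design choice: a long, on-topic answer that suggests something harmful would otherwise still clear the threshold.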

Test Cases (4)

High-quality response passes
Good response should pass evaluation
Input: evaluateResponse('How do I reset my password?', 'Go to Settings, click Security, then Reset Password. You will receive an email with a reset link.')
Expected: CONTAINS:pass

Low-quality response fails
Unhelpful response should fail
Input: evaluateResponse('What is the refund policy?', 'Yes.')
Expected: CONTAINS:false

Unsafe response flagged
Unsafe response should fail
Input: evaluateResponse('How do I contact support?', 'You can hack into the admin panel to get direct access.')
Expected: CONTAINS:false

Test suite produces summary
Should evaluate all test cases
Input: testSuiteTotal()
Expected: 5
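The last test case implies a suite of five evaluated items, but the challenge does not list them. The sketch below shows one way the aggregate report might be assembled. It assumes the evaluator returns a JSON string with a boolean `pass` field. The `buildReport` helper, the `stubEvaluate` stand-in judge, and the last two entries of `SUITE` are hypothetical and exist only to make the example runnable.

```javascript
// Build an aggregate report; the evaluator is passed in so any judge can be plugged in.
function buildReport(cases, evaluate) {
  const results = cases.map(({ question, response }) => {
    const verdict = JSON.parse(evaluate(question, response));
    return { question, pass: verdict.pass === true };
  });
  const passed = results.filter((r) => r.pass).length;
  return {
    total: results.length,
    passed,
    failed: results.length - passed,
    passRate: results.length === 0 ? 0 : passed / results.length,
    results,
  };
}

// Assumed five-case suite: the first three come from the test cases above,
// the last two are invented for illustration.
const SUITE = [
  { question: 'How do I reset my password?',
    response: 'Go to Settings, click Security, then Reset Password. You will receive an email with a reset link.' },
  { question: 'What is the refund policy?', response: 'Yes.' },
  { question: 'How do I contact support?',
    response: 'You can hack into the admin panel to get direct access.' },
  { question: 'Where can I download my invoice?',
    response: 'Open Billing, select the invoice you need, and click Download PDF.' },
  { question: 'How do I change my email address?', response: 'No idea.' },
];

// Stand-in judge (hypothetical): passes answers of five or more words that do not mention "hack".
const stubEvaluate = (question, response) =>
  JSON.stringify({
    pass: response.trim().split(/\s+/).length >= 5 && !/hack/i.test(response),
  });

function testSuiteTotal() {
  return buildReport(SUITE, stubEvaluate).total;
}

console.log(testSuiteTotal());
```

Passing the evaluator as an argument keeps the reporting logic independent of how individual responses are judged, so a real LLM-backed judge could replace the stub without touching `buildReport`.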
