🧪 TDD Challenge · intermediate · ⏱️ 30–45m · ⭐ 175 XP
M-066 Build an LLM-as-Judge Evaluator
Description
Nebula Corp's AI team needs to evaluate chatbot responses at scale. Manual review doesn't scale, so they want an LLM-as-Judge system. Build an evaluator that scores responses on relevance, helpfulness, and safety using a simulated judge LLM. The evaluator should produce structured scores, detect low-quality responses, and generate an aggregate quality report.
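One way the evaluator described above might look is sketched below. It is only a sketch under stated assumptions: the "judge" is a set of deterministic heuristics standing in for a real LLM call, and the names `simulatedJudge`, `tokenize`, `STOPWORDS`, `UNSAFE_TERMS`, and `PASS_THRESHOLD`, as well as the JSON output shape, are hypothetical and not defined by the challenge. Only `evaluateResponse` and the three scoring dimensions come from the description.

```javascript
// Hypothetical stand-in for a judge LLM: deterministic heuristics, not a real model call.
const STOPWORDS = new Set(['how', 'do', 'i', 'my', 'what', 'is', 'the', 'a', 'an', 'to', 'you', 'can']);
const UNSAFE_TERMS = ['hack', 'exploit', 'bypass']; // assumed block-list
const PASS_THRESHOLD = 0.5; // assumed minimum score for every dimension

// Lowercase the text and split it into simple word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Score a response on relevance, helpfulness, and safety, each in [0, 1].
function simulatedJudge(question, response) {
  const questionTerms = tokenize(question).filter((w) => !STOPWORDS.has(w));
  const responseWords = tokenize(response);
  const responseSet = new Set(responseWords);

  // Relevance: share of the question's content words that the response repeats.
  const matched = questionTerms.filter((w) => responseSet.has(w)).length;
  const relevance = questionTerms.length === 0 ? 0 : matched / questionTerms.length;

  // Helpfulness: crude length proxy that saturates at ten words.
  const helpfulness = Math.min(responseWords.length / 10, 1);

  // Safety: zero as soon as any block-listed term appears.
  const safety = UNSAFE_TERMS.some((term) => responseSet.has(term)) ? 0 : 1;

  return { relevance, helpfulness, safety };
}

// Return a JSON string so that string-based checks such as CONTAINS can inspect it.
function evaluateResponse(question, response) {
  const scores = simulatedJudge(question, response);
  const pass = Object.values(scores).every((score) => score >= PASS_THRESHOLD);
  return JSON.stringify({ pass, scores });
}
```

Requiring every dimension to clear the threshold, rather than averaging the scores, means a fluent but unsafe answer cannot pass on the strength of its other scores. Returning a JSON string is a guess at what the CONTAINS checks expect; the actual grader may accept a different shape.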
Test Cases (4)
High-quality response passes
Good response should pass evaluation
Input: evaluateResponse('How do I reset my password?', 'Go to Settings, click Security, then Reset Password. You will receive an email with a reset link.')
Expected: CONTAINS:pass
Low-quality response fails
Unhelpful response should fail
Input: evaluateResponse('What is the refund policy?', 'Yes.')
Expected: CONTAINS:false
Unsafe response flagged
Unsafe response should fail
Input: evaluateResponse('How do I contact support?', 'You can hack into the admin panel to get direct access.')
Expected: CONTAINS:false
Test suite produces summary
Should evaluate all test cases
Input: testSuiteTotal()
Expected: 5
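The aggregate report and `testSuiteTotal()` could be sketched roughly as follows. Several things here are assumptions: the challenge shows only three question/response pairs, so the last two entries in `TEST_SUITE` are invented to reach the expected total of five, and `stubEvaluate`, `runTestSuite`, and the report fields are hypothetical names. The stub evaluator exists only so the block runs on its own.

```javascript
// Stand-in evaluator so this block runs on its own (assumption): any function
// that returns a JSON string with a boolean `pass` field could be swapped in.
function stubEvaluate(question, response) {
  const wordCount = response.trim().split(/\s+/).length;
  const pass = wordCount >= 5 && !/\bhack\b/i.test(response);
  return JSON.stringify({ pass });
}

// Hypothetical suite: the first three pairs come from the test cases above;
// the last two are invented here to reach five.
const TEST_SUITE = [
  { question: 'How do I reset my password?', response: 'Go to Settings, click Security, then Reset Password. You will receive an email with a reset link.' },
  { question: 'What is the refund policy?', response: 'Yes.' },
  { question: 'How do I contact support?', response: 'You can hack into the admin panel to get direct access.' },
  { question: 'Where can I update my email?', response: 'Open Settings, choose Profile, and edit the Email field.' },
  { question: 'How do I cancel my plan?', response: 'No idea.' },
];

// Evaluate every case and fold the results into an aggregate quality report.
function runTestSuite(evaluate = stubEvaluate) {
  const results = TEST_SUITE.map(({ question, response }) => JSON.parse(evaluate(question, response)));
  const passed = results.filter((result) => result.pass).length;
  return {
    total: results.length,
    passed,
    failed: results.length - passed,
    passRate: passed / results.length,
  };
}

// Number of cases the suite evaluated.
function testSuiteTotal() {
  return runTestSuite().total;
}
```

Passing the evaluator in as a parameter keeps the report logic independent of how each response is judged, so a real judge-backed evaluator could replace the stub without touching `runTestSuite`.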