🧪 TDD Challenge · intermediate · ⏱️ 30–45m · ⭐ 175 XP

M-066: Build an LLM-as-Judge Evaluator

Description

Nebula Corp's AI team needs to evaluate chatbot responses at scale. Manual review doesn't scale, so they want an LLM-as-Judge system. Build an evaluator that scores responses on relevance, helpfulness, and safety using a simulated judge LLM. The evaluator should produce structured scores, detect low-quality responses, and generate an aggregate quality report.
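One possible shape for the evaluator is sketched below. It is a minimal illustration, not the reference solution: the `Verdict` fields, the `simulatedJudge` helper, the `UNSAFE_TERMS` blocklist, the 0.6 threshold, and the choice to return a JSON string are all assumptions made for this sketch. Only the `evaluateResponse` name and its two arguments come from the test cases. The "judge" here is a set of deterministic heuristics standing in for a real LLM call.

```typescript
// Hypothetical verdict shape; the field names are assumptions for this sketch.
interface Verdict {
  relevance: number;   // 0..1, share of question keywords echoed in the response
  helpfulness: number; // 0..1, crude length-based proxy
  safety: number;      // 1 = no blocklisted term found, 0 = flagged
  pass: boolean;
}

// Illustrative values only; a real judge would use a prompt and a rubric.
const UNSAFE_TERMS: string[] = ["hack", "exploit", "bypass"];
const PASS_THRESHOLD = 0.6;

// Lowercase the text and split it into alphanumeric words.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Simulated judge: deterministic heuristics in place of an LLM call.
function simulatedJudge(question: string, response: string): Verdict {
  const keywords = tokenize(question).filter((w) => w.length > 3);
  const words = tokenize(response);

  const hits = keywords.filter((w) => words.indexOf(w) !== -1).length;
  const relevance = keywords.length === 0 ? 0 : hits / keywords.length;
  const helpfulness = Math.min(1, words.length / 10);
  const safety = UNSAFE_TERMS.some((t) => words.indexOf(t) !== -1) ? 0 : 1;

  const pass =
    relevance >= PASS_THRESHOLD &&
    helpfulness >= PASS_THRESHOLD &&
    safety === 1;
  return { relevance, helpfulness, safety, pass };
}

// Returns the verdict as a JSON string so that substring checks such as
// CONTAINS:pass or CONTAINS:false can be applied to it (an assumption).
function evaluateResponse(question: string, response: string): string {
  return JSON.stringify(simulatedJudge(question, response));
}
```

Under these assumptions, a one-word answer such as 'Yes.' fails on both relevance and helpfulness, while a response containing "hack" fails on safety regardless of its other scores.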

Test Cases (4)

High-quality response passes

Good response should pass evaluation

Input: evaluateResponse('How do I reset my password?', 'Go to Settings, click Security, then Reset Password. You will receive an email with a reset link.')

Expected: CONTAINS:pass

Low-quality response fails

Unhelpful response should fail

Input: evaluateResponse('What is the refund policy?', 'Yes.')

Expected: CONTAINS:false

Unsafe response flagged

Unsafe response should fail

Input: evaluateResponse('How do I contact support?', 'You can hack into the admin panel to get direct access.')

Expected: CONTAINS:false

Test suite produces summary

Should evaluate all test cases

Input: testSuiteTotal()

Expected: 5
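The aggregate report could be sketched as follows. The challenge text does not list the five suite cases, so the last two entries in `SUITE` are hypothetical placeholders, and `stubJudge`, `runSuite`, and the `Report` fields are likewise assumed names. The stub judge is a deliberately simple stand-in (word count plus a one-word blocklist) so that the block runs on its own.

```typescript
interface SuiteCase {
  question: string;
  response: string;
}

// Hypothetical report shape; the field names are assumptions.
interface Report {
  total: number;
  passed: number;
  failed: number;
  passRate: number; // passed / total, or 0 for an empty suite
}

type Judge = (question: string, response: string) => boolean;

const SUITE: SuiteCase[] = [
  // The first three cases are taken from the test cases above.
  { question: "How do I reset my password?",
    response: "Go to Settings, click Security, then Reset Password. You will receive an email with a reset link." },
  { question: "What is the refund policy?", response: "Yes." },
  { question: "How do I contact support?",
    response: "You can hack into the admin panel to get direct access." },
  // The next two cases are hypothetical placeholders.
  { question: "What are your business hours?",
    response: "We are open Monday to Friday, 9am to 5pm." },
  { question: "Can I change my email address?", response: "No idea." },
];

// Stand-in judge: passes responses of at least five words that do not
// contain the word "hack". A real judge would score several dimensions.
const stubJudge: Judge = (_question, response) =>
  response.trim().split(/\s+/).length >= 5 && !/\bhack\b/i.test(response);

// Evaluate every case and fold the verdicts into a summary.
function runSuite(cases: SuiteCase[], judge: Judge): Report {
  const passed = cases.filter((c) => judge(c.question, c.response)).length;
  const total = cases.length;
  return {
    total,
    passed,
    failed: total - passed,
    passRate: total === 0 ? 0 : passed / total,
  };
}

// Number of cases evaluated in the suite.
function testSuiteTotal(): number {
  return runSuite(SUITE, stubJudge).total;
}
```

Passing the judge in as a parameter keeps the aggregation logic independent of how individual responses are scored, so the stub can later be swapped for a fuller evaluator.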

Related Lessons

📖 LLM as Judge (testing-and-evaluation)
