🧪 TDD Challenge (61)
M-038: Build Your First Tool-Calling Agent
Nebula Corp needs a simple agent that can look up customer information using tools. The skeleton is there (a tool registry and an LLM loop), but the agent doesn't actually call any tools yet. Wire up the tool execution so the agent can receive a tool call from the LLM, execute it, and feed the result back into the conversation.
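The receive-execute-feed-back step can be sketched in a few lines. This is a minimal illustration, not the skeleton's actual API: the registry, the `lookup_customer` stub, and the message shapes are all made up here.

```python
import json

# Hypothetical tool registry; the real challenge skeleton defines its own.
TOOLS = {
    "lookup_customer": lambda args: {"id": args["id"], "name": "Alice", "plan": "Pro"},
}

def run_agent_turn(llm_message):
    """If the LLM asked for a tool, execute it and return a 'tool' message
    to append to the conversation; otherwise pass the reply through."""
    if llm_message.get("tool_call"):
        call = llm_message["tool_call"]
        result = TOOLS[call["name"]](call["arguments"])
        return {"role": "tool", "name": call["name"], "content": json.dumps(result)}
    return llm_message
```

The key idea is that the tool result goes back into the conversation as a message, so the next LLM call can see it and compose a final answer.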
M-042: Build a ReAct Web Research Agent
Nebula Corp wants a research agent that can answer complex questions by searching the web, reading pages, and synthesizing information. The agent should follow the ReAct pattern: Think about what to do, Act by calling a tool, Observe the result, and repeat until it has enough information to answer. The skeleton has tools for searching and reading, but the ReAct loop isn't implemented yet.
M-084: Build a SCOPE Prompt Optimizer
Nebula Corp's developers are writing vague prompts for their agentic coding tools, leading to poor results. Build a prompt optimizer that takes a vague prompt and enhances it using the SCOPE framework: Specific (what exactly?), Context (which files/systems?), Outcome (what should the result look like?), Patterns (what conventions?), Edge cases (what to handle?). The optimizer should analyze the input prompt and return an enhanced version with all SCOPE elements.
M-017: Chain-of-Thought Math Solver
Nebula Corp's educational platform needs a math tutoring system that doesn't just give answers; it shows the reasoning process. Students learn better when they see each step. The current prompt just asks for the answer, and the model often makes arithmetic errors on multi-step problems. Build a Chain-of-Thought prompt that forces the model to show its work step-by-step, verify the answer, and catch its own mistakes before presenting the final result.
M-079: Build a CrewAI-Style Agent Team
Nebula Corp wants to automate their content pipeline using a team of specialized AI agents, similar to CrewAI. Build a crew system where each agent has a role, goal, and backstory. Agents work on tasks sequentially, passing context between them. Implement the crew runner that assigns tasks to agents, collects outputs, and produces a final combined result.
M-077: Build a LangChain-Style Processing Chain
Nebula Corp wants to adopt a chain-based architecture for their AI pipelines. Build a simplified LangChain-style chain system where each step transforms the input and passes it to the next. Implement a chain builder that supports sequential steps, conditional branching, and error handling: the core patterns used in LangChain's LCEL.
M-080: Build a LangGraph-Style State Machine
Nebula Corp wants to model their customer support workflow as a graph-based state machine, similar to LangGraph. Build a workflow engine where nodes are processing functions, edges define transitions (including conditional edges), and state flows through the graph. The engine should detect cycles, support conditional routing, and track execution history.
M-081: Build a RAGAS-Style RAG Evaluator
Nebula Corp needs to evaluate their RAG pipeline's quality using metrics inspired by the RAGAS framework. Build an evaluator that computes faithfulness (is the answer grounded in context?), answer relevancy (does it address the question?), and context precision (is the retrieved context relevant?). Produce a comprehensive evaluation report with per-question and aggregate scores.
M-031: Compare Embedding Models for Domain-Specific RAG
MedTech AI's RAG system uses a general-purpose embedding model (MiniLM) but struggles with medical terminology. 'myocardial infarction' and 'heart attack' aren't recognized as similar. Test different embedding models and measure which performs best on medical queries.
M-066: Build an LLM-as-Judge Evaluator
Nebula Corp's AI team needs to evaluate chatbot responses at scale. Manual review doesn't scale, so they want an LLM-as-Judge system. Build an evaluator that scores responses on relevance, helpfulness, and safety using a simulated judge LLM. The evaluator should produce structured scores, detect low-quality responses, and generate an aggregate quality report.
M-067: Build an AI Evaluation Metrics Calculator
Nebula Corp needs to measure their chatbot's quality with standard metrics. Build a metrics calculator that computes precision, recall, F1 score, and semantic similarity for AI-generated responses compared to ground truth answers. The calculator should handle edge cases and produce a comprehensive evaluation report.
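The precision/recall/F1 piece can be sketched with token overlap, SQuAD-style. This is a toy stand-in, assuming whitespace tokenization and lowercase normalization; a real calculator would add punctuation stripping and per-report aggregation.

```python
def token_f1(prediction, truth):
    """Token-overlap precision, recall, and F1 between a prediction and
    a ground-truth answer. Overlap counts common tokens with multiplicity."""
    pred, gold = prediction.lower().split(), truth.lower().split()
    if not pred or not gold:  # edge case: empty prediction or reference
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    overlap = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```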
M-069: Build a Quality Regression Detector
Nebula Corp just updated their chatbot's prompt and needs to verify the change didn't break anything. Build a regression detection system that compares evaluation results from two versions (baseline vs candidate), identifies statistically significant regressions per category, and produces a go/no-go deployment recommendation.
M-062: Build a Test Dataset Generator
Nebula Corp needs to systematically test their AI chatbot but creating test cases manually is slow. Build a test dataset generator that takes a set of documents and automatically generates question-answer pairs for evaluation. The generator should create questions at different difficulty levels, pair them with expected answers from the source documents, and output a structured test dataset.
M-014: Few-Shot Product Classifier
Nebula Corp's e-commerce platform receives thousands of product listings daily, but they're uncategorized. The current zero-shot classifier is inconsistent: sometimes 'wireless headphones' goes to Electronics, sometimes to Audio, sometimes to Accessories. Build a few-shot prompt constructor that uses 3 diverse examples to teach the model the exact categorization rules. The examples must cover edge cases and demonstrate the distinction between similar categories.
M-009: Build a Fine-Tuning Dataset Validator
Nebula Corp is preparing training data for fine-tuning their customer support model. Before spending money on training, they need to validate the dataset quality. Build a validator that checks training examples for format compliance, detects contradictions, measures diversity, and produces a readiness report with a go/no-go recommendation.
M-039: Build Your First Agent Router
Nebula Corp's support system needs an agent that can decide which tool to use based on the user's message. Build a simple intent router that analyzes the user message, picks the right tool from a registry, and returns the tool's response. The router should match keywords to tools and handle cases where no tool matches.
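Keyword-to-tool matching can be as simple as the sketch below. The registry, tool names, and keywords here are invented for illustration; the challenge supplies its own registry.

```python
# Illustrative registry: each tool name maps to trigger keywords.
ROUTES = {
    "billing": ["invoice", "refund", "charge"],
    "weather": ["weather", "rain", "forecast"],
}

def route(message, registry=ROUTES):
    """Return the first tool whose keywords appear in the message,
    or None when nothing matches (caller falls back to a default reply)."""
    text = message.lower()
    for tool, keywords in registry.items():
        if any(kw in text for kw in keywords):
            return tool
    return None
```

Substring matching is crude (it fires on "charged" for "charge"), which is often what you want for a first router; a production version would tokenize and weight matches.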
M-082: Build Your First Agentic Code Reviewer
Nebula Corp wants an automated code review assistant that analyzes code snippets for common issues. Build a simple rule-based code reviewer that checks for common problems: console.log statements left in production code, TODO comments, functions that are too long, and missing error handling. The reviewer should return a structured list of findings with severity levels.
M-053: Build a Conversation Memory Store
Nebula Corp's chatbot has amnesia: it forgets everything after each message. Build a simple conversation memory that stores messages, retrieves recent history, and can summarize the conversation. The memory should support adding messages, getting the last N messages, and searching for messages containing a keyword.
M-063: Build Your First LLM Eval Scorer
Nebula Corp's AI team has no way to measure whether their LLM outputs are any good. Build a simple evaluation scorer that checks LLM responses against expected answers using multiple metrics: exact match, keyword containment, and a basic similarity score based on word overlap. This is the foundation of every eval pipeline.
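The three metrics named above fit in one small function. A minimal sketch, assuming lowercase normalization and Jaccard word overlap as the "basic similarity score"; the exact metric names are illustrative.

```python
def score_response(response, expected, keywords=()):
    """Score an LLM response against an expected answer on three axes:
    exact match, keyword containment, and Jaccard word-set overlap."""
    resp, exp = response.strip().lower(), expected.strip().lower()
    resp_words, exp_words = set(resp.split()), set(exp.split())
    union = resp_words | exp_words
    return {
        "exact_match": resp == exp,
        "keywords_present": all(kw.lower() in resp for kw in keywords),
        "overlap": len(resp_words & exp_words) / len(union) if union else 1.0,
    }
```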
M-057: Build Your First Model Router
Nebula Corp uses multiple LLM providers but has no way to route requests intelligently. Build a simple model router that selects the best model based on the request type (fast, cheap, or quality), handles provider fallbacks when a model is unavailable, and tracks usage costs across providers.
M-078: Build Your First Processing Chain
Nebula Corp wants to build composable AI pipelines where each step transforms the data and passes it to the next. Build a simple chain system inspired by LangChain: create individual processing steps, chain them together, and run data through the pipeline. Each step is a function that takes input and returns output.
M-003: Build Your First LLM Prompt
Nebula Corp's new intern needs to send their first request to an LLM API, but the prompt builder function is incomplete. It should take a user question and a system persona, then return a properly structured messages array that any LLM API can consume. The function currently returns an empty array. Wire it up so it produces the correct chat-completion message format with a system message and a user message.
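The target shape is the chat-completion messages array shared by most LLM APIs: an ordered list of role/content pairs with the system message first. A minimal sketch of the builder:

```python
def build_messages(persona, question):
    """Return a chat-completion messages array: system persona first,
    then the user's question."""
    return [
        {"role": "system", "content": persona},
        {"role": "user", "content": question},
    ]
```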
M-047: Build Your First MCP Message Handler
Nebula Corp is adopting the Model Context Protocol to let AI assistants access internal tools. Before building a full server, they need a message handler that can parse MCP JSON-RPC requests, validate them, and route to the correct handler. Build the protocol layer that processes initialize, tools/list, and tools/call messages.
M-015: Build a Reusable Prompt Template
Nebula Corp's team keeps copy-pasting prompts and manually swapping out variables. They need a simple prompt template engine that takes a template string with {{variable}} placeholders and fills them in from a data object. The current function just returns the raw template without any substitution. Make it work so the team can reuse prompts across their app.
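The {{variable}} substitution is one regex. This sketch raises on missing values rather than silently leaving holes, which is one reasonable policy among several:

```python
import re

def render(template, data):
    """Fill {{variable}} placeholders from a data dict.
    Raises KeyError if a placeholder has no matching value."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(data[m.group(1)]), template)
```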
M-026: Build Your First Similarity Search
Nebula Corp has a knowledge base of product descriptions stored as embedding vectors, but no way to search them. Build a similarity search function that takes a query vector, compares it against all stored document vectors using cosine similarity, and returns the top-K most relevant results ranked by score.
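Cosine similarity plus a sort is the whole algorithm. A minimal sketch using plain lists as vectors; a real system would use NumPy or a vector database for speed:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, docs, k=3):
    """docs: list of (doc_id, vector). Return the k best (doc_id, score) pairs,
    highest score first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]
```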
M-018: Function Calling Weather Bot
Nebula Corp is building a weather assistant that needs to call external APIs based on user queries. When a user asks 'What's the weather in Seattle?', the system should extract the location and call get_weather(location). When they ask 'Will it rain tomorrow in Portland?', it should call get_forecast(location, days=1). The current implementation doesn't structure the function calls properly; it returns free text instead of structured function call requests. Build a prompt that instructs the model to respond with valid function call JSON when weather information is requested.
M-059: Implement a Gateway Fallback Chain
Nebula Corp's AI service went down for 2 hours when their primary LLM provider had an outage. Build a fallback chain that tries multiple providers in order. If the primary model fails, automatically try the next one. Track which model actually served the request and whether a fallback was used. The function should try each model in the chain until one succeeds or all fail.
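The try-each-until-one-succeeds loop looks roughly like this. A sketch under simple assumptions: `call_fn` stands in for a real provider call, and catching bare `Exception` is a placeholder for provider-specific error types.

```python
def call_with_fallback(chain, call_fn):
    """Try each model in order. call_fn(model) returns a response or raises.
    Returns the response plus which model served it and whether we fell back."""
    errors = []
    for i, model in enumerate(chain):
        try:
            return {"response": call_fn(model), "model": model, "fallback_used": i > 0}
        except Exception as exc:  # production code would catch narrower errors
            errors.append((model, str(exc)))
    raise RuntimeError(f"all models failed: {errors}")
```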
M-058: Build an Intelligent Model Router
Nebula Corp's AI platform sends every request to GPT-4o, costing a fortune. Most requests are simple FAQ lookups that a cheaper model could handle. Build an intelligent router that classifies request complexity and routes to the appropriate model tier: 'fast' (cheap model for simple tasks), 'balanced' (mid-tier for moderate tasks), or 'powerful' (expensive model for complex reasoning). The router should analyze the user message and return the correct model name.
M-060: Build a Semantic Cache for LLM Requests
Nebula Corp is spending too much on LLM API calls. Many requests are semantically similar: 'What is Python?' and 'Explain Python' should return the same cached response. Build a semantic cache that uses cosine similarity between embeddings to match similar prompts. If a new prompt is similar enough to a cached one (above a threshold), return the cached response instead of calling the LLM.
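The cache logic reduces to "find the nearest cached embedding and compare against the threshold". A sketch assuming an injected `embed` function (here it would be a real embedding model; the test below uses a fake):

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # embedding function: str -> vector
        self.threshold = threshold
        self.entries = []           # list of (vector, cached_response)

    def get(self, prompt):
        """Return the cached response for the most similar prompt, or None."""
        vec = self.embed(prompt)
        best = max(self.entries, key=lambda e: _cosine(vec, e[0]), default=None)
        if best and _cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))
```

The linear scan is fine for a teaching sketch; at scale you would back this with an approximate-nearest-neighbor index.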
M-083: Design a Spec-Driven Feature Plan
Nebula Corp wants to add a 'favorites' feature to their app. Instead of jumping straight to code, practice the spec-driven approach: write structured requirements with user stories and acceptance criteria, then break them into ordered implementation tasks. The function should take a feature description and return a structured spec object with requirements and tasks.
M-013: Build a Context Window Manager
Nebula Corp's chatbot keeps crashing when conversations get too long; it exceeds the model's context window. Build a context window manager that tracks token usage, implements smart truncation strategies (keep system prompt + recent messages + important messages), and warns when approaching the limit. The manager should support multiple truncation strategies: 'sliding-window', 'summarize-old', and 'priority-based'.
M-049: Add Authentication to an MCP Server
Nebula Corp's MCP server is wide open: any client can call any tool. Build an authentication middleware that validates API keys, checks permissions per tool, and rejects unauthorized requests. The middleware should sit between incoming requests and the handler, enforcing access control before any tool executes.
M-048: Build Your First MCP Server
Nebula Corp wants to expose their internal customer database to AI assistants via the Model Context Protocol. The skeleton MCP server is set up with the transport layer, but the tool registration and request handling are missing. Wire up the server so it registers a 'lookup_customer' tool, handles incoming tool calls, and returns properly formatted MCP responses.
M-050: Build an MCP Prompt Template Server
Nebula Corp wants to share reusable prompt templates across all their AI assistants via MCP. Build a prompt template server that registers templates with argument placeholders, lists available prompts, and renders them with provided arguments. The server should support variable interpolation like {{customerName}} and validate that all required arguments are provided.
M-051: Implement MCP Resource Endpoints
Nebula Corp's MCP server can handle tool calls, but it doesn't expose any resources yet. AI assistants need to browse available data before calling tools. Implement the resource provider so the server can list available resources (customer profiles, order history) and return their contents when requested via resource URIs like 'customer://C001/profile'.
M-052: Build an MCP Tool Composition Pipeline
Nebula Corp's AI assistants need to chain multiple MCP tools together to answer complex queries. Build a tool composition engine that takes a plan (a sequence of tool calls where later steps can reference earlier results), executes them in order, and returns the combined result. Handle errors gracefully: if one step fails, the pipeline should report which step failed and why.
M-055: Wire Up Mem0 Memory for a Personal Assistant
Nebula Corp is building a personal assistant that should remember user preferences across sessions. The assistant skeleton exists but it's stateless; every conversation starts fresh. Wire up Mem0 to add, search, and use memories so the assistant can recall past interactions. The memory search should be called before each response, and new memories should be stored after each exchange.
M-054: Build a Sliding Window Memory Buffer
Nebula Corp's chatbot forgets everything after each message. Users are frustrated because they have to repeat context constantly. Implement a sliding window memory buffer that stores the last N messages and includes them in every LLM call. The buffer should handle adding messages, retrieving context, and automatically trimming when it exceeds the maximum size.
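A deque with a maximum length gives you the whole sliding window for free, trimming included. A minimal sketch:

```python
from collections import deque

class SlidingWindowMemory:
    """Keep only the last max_messages turns; older ones drop off automatically."""

    def __init__(self, max_messages=6):
        self.buffer = deque(maxlen=max_messages)

    def add(self, role, content):
        self.buffer.append({"role": role, "content": content})

    def context(self):
        """Messages to prepend to the next LLM call, oldest first."""
        return list(self.buffer)
```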
M-056: Build a Memory Conflict Resolver
Nebula Corp's AI assistant has a memory problem: when a user says 'I use React' in January and 'I switched to Vue' in March, both facts are stored and the assistant gets confused. Build a conflict resolver that detects when new memories contradict existing ones, archives the old memory, and keeps only the current fact active. The resolver should compare new facts against existing ones in the same category and handle the conflict appropriately.
M-045: Build a Multi-Agent Code Review System
Nebula Corp wants to automate code reviews using multiple specialized agents. The system should have three agents: a Security Reviewer that checks for vulnerabilities, a Style Reviewer that checks coding standards, and a Coordinator that collects reviews and produces a final summary. The agent definitions exist but the coordination logic is missing; wire up the multi-agent pipeline so the coordinator delegates to reviewers and aggregates their findings.
M-019: Multi-Shot Data Extractor
Nebula Corp's sales team receives hundreds of inquiry emails daily. They need to extract key information: company name, contact person, budget range, and urgency level. The current zero-shot extractor misses fields and formats data inconsistently. Build a 4-shot prompt that demonstrates how to extract structured data from messy emails, handle missing fields gracefully, and classify urgency based on keywords. The examples must cover: complete data, missing fields, urgent request, and ambiguous budget.
M-072: Trace a Multi-Step Agent
Nebula Corp's customer support agent makes multiple LLM calls and tool executions per request, but there's no visibility into what happens between the user's question and the final answer. Build a tracing wrapper for the agent loop that captures each LLM call, tool execution, and the overall trace with cumulative metrics. Include loop detection to flag when the agent calls the same tool with the same arguments more than twice.
M-073: Build a Per-Request Cost Tracker
Nebula Corp's LLM spending is out of control: they have no idea which features or users are driving costs. Build a cost tracking system that calculates per-request costs, aggregates by dimension (model, feature, user), and flags requests that exceed budget thresholds. The pricing table and trace data are provided, but the cost calculation and aggregation logic is missing.
M-075: Build an Online Evaluation Pipeline
Nebula Corp needs to continuously monitor the quality of their AI responses in production. Build an evaluation pipeline that scores responses using multiple criteria (relevance, groundedness, safety), samples production traffic at a configurable rate, detects quality regressions by comparing recent scores against a baseline, and generates alerts when quality drops below thresholds.
M-071: Build Your First LLM Trace
Nebula Corp's chatbot has no observability: when users report wrong answers, the team has no way to see what the model received or returned. Implement a basic tracing system that captures LLM calls with their inputs, outputs, token counts, and latency. The skeleton has a trace store and an LLM wrapper, but the actual trace capture logic is missing.
M-074: Normalize LLM Latency Metrics
Nebula Corp's monitoring dashboard shows raw latency for LLM calls, but the numbers are misleading: a 5-second response generating 800 tokens looks 'slow' while a 500ms response generating 10 tokens looks 'fast'. Build a latency normalization system that calculates tokens-per-second throughput, Time to First Token (TTFT), and identifies actual performance bottlenecks by comparing normalized metrics.
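The normalization is simple arithmetic: throughput is output tokens divided by generation time (total latency minus TTFT). A sketch, with field names invented for illustration:

```python
def normalize_latency(total_ms, ttft_ms, output_tokens):
    """Compute tokens/sec over the generation phase (after the first token)."""
    gen_ms = max(total_ms - ttft_ms, 1)  # clamp to avoid divide-by-zero
    return {
        "ttft_ms": ttft_ms,
        "tokens_per_sec": output_tokens / (gen_ms / 1000),
    }
```

On the example from the brief, the "slow" 5-second call actually has far higher throughput than the "fast" 500ms one, which is exactly the misleading comparison normalization fixes.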
M-076: Build an OpenTelemetry LLM Exporter
Nebula Corp wants vendor-neutral observability. Build a lightweight OpenTelemetry-compatible span exporter for LLM calls. The exporter should capture LLM-specific semantic conventions (gen_ai.* attributes), batch spans for efficient export, and format them as OTLP-compatible JSON. This lets them send traces to any backend (Langfuse, Jaeger, Grafana Tempo) without changing application code.
M-005: One-Shot Email Categorizer
Nebula Corp's support inbox is overflowing. They need an automated triage system that categorizes incoming emails into Billing, Technical, or General using a one-shot prompt. The current function doesn't use any examples and the model keeps returning inconsistent labels. Build a prompt constructor that uses exactly one well-chosen example to teach the model the expected format, and handles any email topic.
M-023: Multi-Stage Prompt Pipeline
Nebula Corp's content generation system needs to produce high-quality blog posts through a multi-stage pipeline. Stage 1: Research and outline generation. Stage 2: Write the first draft. Stage 3: Critique and identify improvements. Stage 4: Produce the final polished version. The current system tries to do everything in one prompt and produces inconsistent quality. Build a prompt chaining system where each stage's output feeds into the next, and each stage has a specific, focused responsibility.
M-016: Structured JSON Output
Nebula Corp's API team needs prompts that reliably produce valid, structured JSON output from an LLM. The current prompts return free-form text that breaks downstream parsers. Write a prompt-generating function that instructs the model to return data in a specific JSON schema, and make sure every test case passes.
M-011: Defend Against Prompt Injection
Nebula Corp's customer support chatbot has been exploited three times this week. Attackers are using prompt injection to make the bot reveal its system prompt, ignore its restrictions, and pretend to be a different AI. The security team needs you to build a defense layer: a function that detects common injection patterns in user input and a hardened system prompt that resists override attempts.
M-046: Build a RAG-Powered Knowledge Agent
Nebula Corp wants an intelligent knowledge assistant that combines RAG retrieval with agent capabilities. The agent should: receive a user question, search a document collection for relevant context, and if the retrieved context is insufficient, reformulate the query and search again. This is the 'agentic RAG' pattern: the agent decides when retrieval is good enough and when to retry. The retrieval and LLM pieces exist, but the agent loop that ties them together is missing.
M-029: Build Your First Document Q&A System
Nebula Corp has a collection of product FAQ documents but no way to search them intelligently. Users type questions and get nothing useful back. Build a basic RAG pipeline: embed the documents, find the most relevant ones for a user query using cosine similarity, and construct a prompt that includes the retrieved context so the LLM can generate a grounded answer.
M-020: Role-Based Email Rewriter
Nebula Corp's communication platform needs to rewrite emails in different tones depending on the recipient. The same message to a CEO should be formal and concise, to a technical team should be detailed and precise, and to a casual colleague can be friendly and relaxed. The current system uses the same prompt for all scenarios and produces inconsistent tone. Build a role-based prompt constructor that takes an email and a target persona (executive, technical, casual) and generates a system prompt that shapes the rewriting style appropriately.
M-068: Build an AI Safety Guardrails Pipeline
Nebula Corp's chatbot has no safety filters: it happily leaks PII, follows injected instructions, and generates harmful content. Build a complete safety pipeline with input validation (prompt injection detection, topic boundaries), output filtering (PII redaction, harmful content blocking), and safety event logging.
M-025: Self-Critique Content Improver
Nebula Corp's content team needs a system that doesn't just generate blog posts; it critiques and improves them iteratively. The current workflow generates content once and ships it, but quality is inconsistent. Build a three-stage prompt system: Stage 1 generates initial content, Stage 2 critiques it (identifying weaknesses, missing elements, and improvements), and Stage 3 produces an improved version addressing the critique. The critique must evaluate clarity, completeness, engagement, and structure.
M-006: Build a Sentiment Classifier
Nebula Corp's product team wants to automatically classify customer reviews from their app store listing. They need a function that builds a few-shot prompt to classify reviews as Positive, Negative, or Neutral. The current implementation just returns a hardcoded value. Write a prompt-building function that uses few-shot examples to reliably classify sentiment, and make all the test cases pass.
M-012: Build a Streaming Token Renderer
Nebula Corp's chatbot waits for the full LLM response before showing anything; users think it's broken. Build a streaming renderer that processes Server-Sent Events (SSE), displays tokens as they arrive, tracks time-to-first-token, and handles cancellation. The renderer should buffer partial chunks, detect the [DONE] signal, and report streaming metrics.
M-007: Extract Structured Data from Text
Nebula Corp's finance team receives hundreds of invoices as plain text emails. They need a function that builds a prompt to extract structured JSON data (vendor, amount, date, invoice number) from unstructured invoice text. The current implementation produces a vague prompt that returns inconsistent formats. Build a robust prompt constructor that reliably extracts the right fields as valid JSON.
M-021: Build a Schema-Validated Data Extractor
Nebula Corp's data pipeline receives unstructured text (emails, support tickets, reviews) and needs to extract structured data reliably. Build a schema-validated extractor that defines output schemas, validates LLM responses against them, retries on validation failure with error feedback, and handles nested objects and arrays.
M-022: Build Prompt Guardrails
Your team's chatbot at Nebula Corp is responding to off-topic queries and leaking internal information. Write a guardrail function that filters user input and blocks anything unrelated to the product domain.
⚡ Optimization (8)
M-044: Optimize the Multi-Step Agent Plan
Nebula Corp's AI agent planner is generating bloated, inefficient plans for user requests. A simple task like 'Book a flight and hotel for next Friday' produces 12 steps when 5 would do: redundant lookups, unnecessary confirmations, and repeated tool calls are burning through tokens and latency. Refactor the planning function to produce leaner plans that eliminate redundant steps while still completing every required action.
M-030: Optimize Chunking Strategy for Better Retrieval
DataFlow Inc's RAG system has poor retrieval accuracy (precision@5 = 0.45). The current fixed-size chunking splits documents mid-sentence, breaking context. Implement and test different chunking strategies to improve retrieval quality above the target threshold.
M-032: Implement Hybrid Search for Better Accuracy
TechDocs Inc's RAG system misses exact keyword matches. A query for 'ERR_CONNECTION_REFUSED' returns generic networking docs instead of the specific error code documentation. Implement hybrid search combining semantic and keyword search to improve precision.
M-027: Implement Metadata Filtering for Multi-Tenant RAG
SecureDoc's RAG system has a critical security bug: users can see documents from other organizations! The vector database returns results from all tenants. Implement metadata filtering to ensure users only retrieve documents they have access to.
M-035: Optimize RAG Pipeline Costs
Nebula Corp's RAG pipeline is burning through API credits. The current implementation sends full documents to the LLM for every query. Refactor the pipeline to reduce cost while maintaining answer quality above the threshold.
M-036: Implement Query Decomposition for Complex Questions
AnalyticsPro's RAG system fails on multi-part questions. A query like 'Compare pricing between Pro and Enterprise plans and explain which includes API access' returns incomplete answers. Implement query decomposition to break complex questions into focused sub-queries.
M-033: Optimize Context Window Packing for RAG
Nebula Corp's RAG system retrieves 10 chunks but naively concatenates them all, often exceeding the LLM's context window and getting truncated. Important information at the end gets cut off. Implement a smart context packer that: estimates token counts, prioritizes the most relevant chunks, and fits as much high-quality context as possible within the token budget, without exceeding it.
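The packer described above is essentially a greedy knapsack over relevance scores. A sketch, assuming a rough 4-characters-per-token estimate (a common heuristic, not a tokenizer):

```python
def pack_context(chunks, budget_tokens, estimate=lambda t: len(t) // 4):
    """chunks: list of (text, relevance_score). Greedily take the
    highest-scoring chunks that still fit within budget_tokens."""
    packed, used = [], 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = estimate(text)
        if used + cost <= budget_tokens:
            packed.append(text)
            used += cost
    return packed
```

Greedy packing can skip a large high-scoring chunk in favor of smaller ones; for RAG context that trade-off is usually acceptable, and a real implementation would swap in a proper tokenizer for `estimate`.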
M-037: Build Two-Stage Retrieval with Reranking
LegalTech AI's RAG system retrieves 10 documents but only 3 are relevant (precision@10 = 0.30). The bi-encoder is fast but imprecise. Implement a two-stage pipeline: fast bi-encoder retrieval followed by cross-encoder reranking to improve precision.
PR Review (2)
M-028: Fix the Embedding Service
A junior developer at Nebula Corp submitted a PR for the embedding service, but it has several bugs. Review the code, identify the issues, and fix them before this goes to production.
M-024: Real-World API Integration
Nebula Corp needs a complete prompt + API integration pipeline. The system should: 1) Build a prompt with proper parameters (temperature, max_tokens, system message), 2) Make an actual API call to an LLM provider, 3) Handle errors gracefully (rate limits, invalid responses, timeouts), 4) Parse and validate the response, 5) Implement retry logic with exponential backoff. The current implementation has no error handling and fails silently when the API returns errors. Build a robust integration that handles real-world API challenges.
Debugging (8)
M-040: Implement Graceful Error Recovery for Agent Tools
Nebula Corp's agent crashes whenever a tool call fails: a network timeout, an invalid argument, or a missing API key brings the whole agent down. Instead of crashing, the agent should catch tool errors, inform the LLM what went wrong, and let it retry with corrected arguments or choose an alternative approach. Implement error handling that makes the agent resilient.
M-041: Add Conversation Memory to the Support Agent
Nebula Corp's support agent answers questions correctly but forgets everything between turns. A user asks 'What plan is Alice on?' and gets the right answer, but when they follow up with 'How much does that plan cost?' the agent has no idea what 'that plan' refers to. Implement a conversation memory system that stores past exchanges so the agent can resolve references and maintain context across turns.
M-043: Debug the Broken Tool-Calling Loop
Nebula Corp's AI agent is stuck in an infinite loop. The agent is supposed to call a weather tool, parse the response, and return a final answer to the user, but it never stops looping. The tool gets called over and over, and the agent never produces a result. Find the bug in the tool-response parsing logic and fix the loop so the agent terminates correctly.
M-001: Build an API Response Router
Nebula Corp is building an AI-powered support system. When a customer message comes in, it needs to: (1) call the LLM API to classify the message intent, (2) parse the structured response, and (3) route to the correct handler. The current implementation has broken parsing, missing error handling, and routes everything to the wrong handler. Fix the router so it correctly classifies, parses, and routes messages.
M-002: Fix the Context Window Overflow
Nebula Corp's chatbot keeps crashing with 'context length exceeded' errors in production. The conversation manager is supposed to trim old messages when the token count approaches the model's limit, but the trimming logic has bugs. Some conversations never get trimmed, others lose the system prompt entirely. Debug the conversation manager and make it handle long conversations gracefully.
M-065: Fix the Broken Sort
The array sorting function at Nebula Corp is producing wrong results for certain inputs. Engineers have reported that numeric arrays come back in the wrong order. Dig into the code, find the bug, and fix it.
M-004: Fix the Token Cost Calculator
Nebula Corp's billing dashboard has a broken token cost calculator. The function is supposed to estimate API costs based on input tokens, output tokens, and the selected model, but customers are being shown wildly wrong numbers. Some bills show $0 when they should be $5, and others are off by orders of magnitude. Find the bugs in the pricing logic and fix them before finance notices.
M-008: Tune the Temperature Settings
Nebula Corp's AI platform lets users run different tasks: data extraction, creative writing, code generation, and chatbot conversations. But the temperature configuration is all wrong: creative tasks use temperature 0, extraction uses temperature 1.5, and the chatbot is set to 2.0. Users are complaining about boring marketing copy and broken JSON outputs. Fix the configuration function so each task type uses an appropriate temperature and top-p setting.
Edge Case Discovery (1)
⏱️ Perf Optimization (1)
Test Writing (2)
M-034: Build RAG Evaluation Suite
CloudDocs Inc deployed a RAG system but has no way to measure quality. Users complain about irrelevant answers, but there's no data to guide improvements. Build an evaluation suite with test queries, ground truth answers, and automated metrics (precision, recall, MRR).
M-064: Test the Email Validator
Write test cases for Nebula Corp's email validation function.