Task Criteria

Our criteria are similar to, but not the same as, the terminal-bench requirements:

Basic requirements:

  1. Realism. Tasks should represent work that real developers do on a daily basis. They should not be toy problems or contrived examples.
  2. Verifiability. Tasks should be strongly and deterministically verifiable. This usually means unit or integration tests that determine whether the task is complete. We need tests that are neither too precise nor too loose: good tests should fail on reward hacking (cheating), but should also pass when the model comes up with a new but valid solution (see the sketch after this list).
  3. Solvability. Tasks should contain everything needed to solve the problem (no missing dependencies, context, etc.). An expert human, or a human + agent combination with all the tools at their disposal, should be able to solve the problem given sufficient time.
    1. Tasks should not require more context than is provided. It must be possible to complete the task with the codebase and tools supplied, without the agent having to ask the user for more information.
  4. Difficulty. We’re not looking for simple LeetCode problems. Tasks should be difficult for the models to solve, which usually involves:
    1. Diving into a large, complex codebase and making several tool calls
    2. Making edits to several files
    3. Performing work that requires visual understanding (e.g. frontend development)
    4. Tasks that require recent knowledge of the world (e.g. new SDKs)
    5. Using rare or new languages or tools
    6. Tasks that require reasoning through time or debugging long-running processes
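
To make the verifiability point concrete, here is a minimal run-tests.sh-style sketch contrasting a brittle check with a behavioral one. The tool name, fixture paths, and expected value are all hypothetical, not taken from any real task.

```bash
#!/usr/bin/env bash
# Hypothetical run-tests.sh fragment; names, paths, and values are illustrative.
set -e

# Too loose: a reward hack that simply creates the file would pass.
#   test -f /app/report.json

# Too precise: rejects valid solutions that organize the code differently.
#   grep -q "def compute_total" /app/report.py

# Behavioral: run the program on a held-out fixture and check its actual output.
/app/report.py --input /tests/fixtures/orders.csv --out /tmp/report.json
python3 -c 'import json; assert json.load(open("/tmp/report.json"))["total"] == 1470'
```

A cheat that hardcodes /app/report.json fails the behavioral check because the output is recomputed from the fixture, while any correct implementation passes regardless of how the code is organized.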

Other requirements:

  1. No Train-Test Leakage.
    1. We do not want to have copies (or near-copies) of problems that already exist in benchmarks such as Terminal Bench or SWE-Bench.
  2. Diversity.
    1. We want each task to be unique if possible, with its own prompt/instructions. It’s okay to make variants of a single task, but each variant should be sufficiently distinct, especially in its instructions, solution, and tests. Variants don’t need to differ on many lines; they need to differ in purpose and outcome.
    2. Be especially careful about making the structure of the instructions too similar. As noted below, copy-paste is perfectly fine, but multiple instructions should not share the same structure when their content differs.
  3. Fast to run.
    1. We don’t want tests that take a long time to run unless the slowness is intentional. Currently we’re aiming for installation under 2 minutes, total agent time under 6 minutes, and running tests under 3 minutes. This means avoiding very large packages and tasks that accidentally take a long time to run or test.

Terminal Bench Basics

CI pipeline (NOP and Oracle checks)

⚠️ Why does this matter? These two checks are required because they determine whether our task is valid and solvable. The NOP check makes sure our tests are valid: they should fail when no solution has been applied, otherwise an agent could score 100% without doing anything. The Oracle check makes sure the problem is actually solvable; otherwise no agent would ever pass the task.

NOP should fail (if it fails as expected, you get a green check), because the NOP agent makes no code changes. Your tests in run-tests.sh should depend on a code change that has not yet been applied (it lives in solution.sh), so we expect the tests to fail out of the box, which is what makes the task valid.

Oracle should pass, since we apply solution.sh. If Oracle does not pass, either our task is misconfigured or the solution is not actually correct.
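
A minimal sketch of how the two checks interact with the task files, assuming a hypothetical task whose solution.sh adds a --json flag to a small CLI called mytool:

```bash
# Everything here is a sketch; mytool and the --json flag are hypothetical.

# solution.sh applies the change the task asks for, e.g.:
#   patch -p1 -d /app < /solution/add-json-flag.patch

# run-tests.sh depends on behavior that only exists once solution.sh has run:
set -e
/app/mytool report --json > /tmp/out.json
python3 -c 'import json; json.load(open("/tmp/out.json"))'
```

Under NOP, solution.sh never runs, the flag is missing, the test fails, and the check shows green. Under Oracle, solution.sh runs first, the flag exists, the test passes, and the check shows green.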

If either of these fails, we recommend testing locally (use uv run tb run to avoid dependency issues) and then passing the results to Claude Code or a similar agent so it can fix the problem. There are three common problems:

  1. NOP is unexpectedly passing because you did not properly call your tests, or the tests do not fail the way they should (see the sketch after this list).
  2. Oracle is failing, but not because the tests caught anything: you broke the harness (broken or missing dependencies, for example).
  3. Oracle is failing because solution.sh is broken or wrong. This is the most difficult one to fix.
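
For problem 1, the usual culprit is a check that is already true in the unmodified repository, so it passes even though no solution was applied. A hypothetical sketch (mytool, the summarize subcommand, and the fixture path are all illustrative):

```bash
# Hypothetical example: suppose the task asks for a new `summarize` subcommand.
set -e

# Passes under NOP: the binary already exists in the base image, so this check
# never depends on solution.sh and the NOP run looks "solved".
#   command -v /app/mytool

# Fails under NOP, passes under Oracle: the subcommand only exists once
# solution.sh has been applied.
/app/mytool summarize /tests/fixtures/log.txt | grep -q "total errors:"
```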