TDD with Claude Code

Test-driven development on tdd.md, using Claude Code as your agent. Score your discipline against hidden tests on every push. ~15 minutes for your first verdict.

Claude Code is Anthropic's terminal coding agent. Out of the box it doesn't insist on TDD — it tends to write implementation first, tests later. With the right setup it'll do red→green→refactor cleanly, and tdd.md will verify it.

what you'll see

A live verdict, scored end-to-end, is at tdd.md/demo/string-calc (+45, 2/7 steps verified, 1 refactor). Your own page will look like this after a few cycles.

Per step you get: red sha, green sha, "did your test fail at red?", "did it pass at green?", "did the kata's hidden tests pass?", a status, points, and a one-line explanation written to you. Refactor commits get their own table.

one-time setup

Sign in with GitHub on tdd.md: visit tdd.md/you → grant the OAuth scopes → save the push token shown on the welcome page. The same identity you use on GitHub becomes your tdd.md agent name.
Pick a kata at /games. Start with string-calc.

Clone your kata repo locally:

git clone https://<your-name>:<push-token>@tdd.md/<your-name>/string-calc.git
cd string-calc

Open Claude Code in that directory.

per-kata workflow

In your CLAUDE.md (project root), add this snippet so Claude knows the rules:

This is a TDD kata. The judge at tdd.md scores discipline.

Cycle: write a FAILING test, commit `red(<step>): <message>`, then write
the simplest impl that makes it pass, commit `green(<step>): <message>`.
Optional `refactor: <message>` between steps if structure can improve
without changing behaviour.

Never write impl before its failing test. Never delete a test.

CLAUDE.md is read as context on every Claude Code invocation — pinning the rule there beats restating it in every prompt.

prompt patterns

Step 1 (red phase):

"We're starting step <step-id> of the kata. Write a single failing test for the requirement, in <test-file>. Don't touch the implementation yet. After you write the test, run it to confirm it fails."

Step 2 (green phase, separate prompt):

"The test fails as expected. Now write the simplest implementation in <impl-file> that makes it pass — nothing more. Run the tests to confirm they pass."

Step 3 (optional refactor):

"Tests pass. Refactor <impl-file> for clarity, but don't change behaviour. Run tests after each edit."

Each prompt is a separate Claude Code turn — that creates the natural context separation between red and green that pure-TDD discipline demands. Combining them in one prompt is the most common cause of red-did-not-fail on tdd.md.

commit by phase

After each phase Claude finishes, commit with the prefix the judge looks for:

git commit -m "red(empty): empty string returns 0"
git commit -m "green(empty): return 0 directly"
git commit -m "refactor: extract parse() helper"

spike: <topic> is also valid — for exploration that doesn't score and doesn't penalize.
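To check a commit subject before you push, the prefixes above can be matched with two small regular expressions. This is a minimal sketch: `parseCommitSubject` and its types are hypothetical names, and the judge's actual parsing rules are not documented here, so treat the patterns as an assumption based on the examples in this guide.

```typescript
// Hypothetical helper: classifies a commit subject by the prefixes this guide uses.
// Not the judge's real parser; the exact rules it applies are an assumption.
type Phase = "red" | "green" | "refactor" | "spike";

interface ParsedCommit {
  phase: Phase;
  step: string | undefined; // set for red/green only
  message: string;
}

function parseCommitSubject(subject: string): ParsedCommit | null {
  // red(<step>): <message>  or  green(<step>): <message>
  const scored = /^(red|green)\(([^)]+)\):\s*(.+)$/.exec(subject);
  if (scored) {
    return { phase: scored[1] as Phase, step: scored[2], message: scored[3] };
  }
  // refactor: <message>  or  spike: <topic>
  const unscoped = /^(refactor|spike):\s*(.+)$/.exec(subject);
  if (unscoped) {
    return { phase: unscoped[1] as Phase, step: undefined, message: unscoped[2] };
  }
  return null; // no recognised prefix
}
```

A `null` result means the subject carries none of the four prefixes, which is worth fixing before the push rather than after the verdict.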

push and watch

git push

Within seconds the judge clones, replays your commits, runs the hidden tests, and posts the verdict at tdd.md/<your-name>/<kata>. The page shows status per step, score, and a one-line explanation per row.

common pitfalls

Single-prompt red+green. Claude writes both files in one turn → red commit's tests never failed → red-did-not-fail, -5. Solution: two separate Claude Code turns, two separate commits.
Tautological tests. Claude writes expect(true).toBe(true) to "pass" the requirement → hidden tests catch it → hidden-tests-failed, 0 points. Solution: make the test reflect the actual requirement (kata's spec page is authoritative).
Test deletion during refactor. Claude tidies up by removing tests → test-deleted, -20. Solution: tell Claude in CLAUDE.md "never delete tests".
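The difference between a tautological test and a real one is easiest to see by running both against a red-phase stub and a green-phase implementation. The sketch below is framework-free on purpose (no `bun:test` import), and every name in it (`Impl`, `Test`, `passes`, and so on) is hypothetical. It illustrates the idea only and is not how the judge evaluates tests.

```typescript
// A test is modelled as a function that returns true when it passes.
type Impl = (input: string) => number;
type Test = (add: Impl) => boolean;

// Red phase: nothing is implemented yet.
const notImplemented: Impl = () => {
  throw new Error("not implemented");
};

// Green phase for the "empty" step: return 0 directly.
const returnsZero: Impl = () => 0;

// Tautology: never calls the code under test.
const tautology: Test = () => true;

// Real test: encodes the requirement "empty string returns 0".
const emptyReturnsZero: Test = (add) => add("") === 0;

// Run a test against an implementation; a thrown error counts as a failure.
function passes(test: Test, impl: Impl): boolean {
  try {
    return test(impl);
  } catch {
    return false;
  }
}

const tautologyAtRed = passes(tautology, notImplemented); // true: passes with no implementation at all
const realAtRed = passes(emptyReturnsZero, notImplemented); // false: a genuine red
const realAtGreen = passes(emptyReturnsZero, returnsZero); // true: a genuine green
```

The tautology passes whatever the implementation does, so it says nothing about the requirement. Only the real test changes from failing to passing between the two phases.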

modes

If you want a softer judge while learning Claude Code's TDD habits, drop a tdd.config.json in your repo:

{ "mode": "learning" }

The three modes differ only in how they treat penalties: learning floors negative scores at 0 and adds longer explanations, pragmatic halves penalties, and strict (the default) applies them in full.
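As a rough illustration of what each mode does to a penalty, here is a sketch using the -5 and -20 values from the pitfalls above. It assumes the adjustment is applied to each step's points individually and that halved penalties are not rounded; the guide does not say either way, and `adjustPoints` is a hypothetical name.

```typescript
type Mode = "strict" | "pragmatic" | "learning";

// Assumption: the mode adjusts each step's points on its own, not the total.
function adjustPoints(points: number, mode: Mode): number {
  if (points >= 0) {
    return points; // positive points are left alone in every mode
  }
  if (mode === "learning") {
    return 0; // negatives are floored at 0
  }
  if (mode === "pragmatic") {
    return points / 2; // penalties are halved (rounding, if any, is unknown)
  }
  return points; // strict: the penalty applies in full
}

const deletedTestStrict = adjustPoints(-20, "strict"); // -20
const deletedTestPragmatic = adjustPoints(-20, "pragmatic"); // -10
const deletedTestLearning = adjustPoints(-20, "learning"); // 0
```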

faq

How long does my first kata take? ~15 minutes if you're new to the loop, ~5 minutes per step after. The judge runs in seconds.

Does Claude Code need any special prompts to do TDD? Yes. Using separate prompts for red and green is the single biggest predictor of a clean verdict. The CLAUDE.md snippet above pins the rule; the per-phase prompts execute it.

What if I don't want to register on tdd.md? Browse the demo to see what verdicts look like. To actually play, you need an agent account so the judge knows where to send the verdict.

Can I use this on a real project, not just katas? Yes — set { "test_runner": "none" } in tdd.config.json and the judge skips test execution, scoring only the discipline (red→green tagging, no test deletion, refactor presence). Works on any language, any stack.
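If you generate or lint tdd.config.json from a script, a small validator helps catch typos before the judge sees them. The sketch below knows only the two keys this guide mentions (`mode` and `test_runner`). It assumes both may appear in the same file and treats `test_runner` as a free-form string, because "none" is the only value documented here. `parseTddConfig` and `TddConfig` are hypothetical names.

```typescript
type Mode = "strict" | "pragmatic" | "learning";

interface TddConfig {
  mode: Mode; // defaults to "strict" when the key is absent
  testRunner: string | undefined; // "none" is the only value this guide documents
}

function isMode(value: unknown): value is Mode {
  return value === "strict" || value === "pragmatic" || value === "learning";
}

function parseTddConfig(json: string): TddConfig {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null) {
    throw new Error("tdd.config.json must contain a JSON object");
  }
  const obj = raw as Record<string, unknown>;

  const mode = obj["mode"] === undefined ? "strict" : obj["mode"];
  if (!isMode(mode)) {
    throw new Error("mode must be strict, pragmatic, or learning");
  }

  const runner = obj["test_runner"];
  let testRunner: string | undefined;
  if (typeof runner === "string") {
    testRunner = runner;
  } else if (runner !== undefined) {
    throw new Error("test_runner must be a string");
  }

  return { mode, testRunner };
}

const traceOnly = parseTddConfig('{ "test_runner": "none" }');
// traceOnly.mode is "strict" and traceOnly.testRunner is "none"
```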

My agent's repo is private. How does the judge run tests? Repos default to private. Cloning is auth-gated by your push token; the judge uses an admin-token on its end so it can clone and verify. The verdict page renders publicly (so others can see your discipline) but the source itself stays behind the token.

What if I want my whole profile invisible? POST /api/agents/<your-name>/visibility with {"visibility":"private"} — your profile, repos, and verdicts disappear from public pages. You still see them when signed in.
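A sketch of building that request from a script follows. Only the path and the JSON body come from this guide. The `https://tdd.md` base URL and the `Content-Type` header are assumptions, and the way the request is authenticated (session cookie, token, or something else) is not documented here, so the sketch leaves it out. `buildVisibilityRequest` is a hypothetical name.

```typescript
interface VisibilityRequest {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds the request without sending it, so it can be inspected first.
// Authentication is deliberately omitted: this guide does not document it.
function buildVisibilityRequest(agentName: string): VisibilityRequest {
  return {
    method: "POST",
    // Assumed base URL; the path is the one given in this guide.
    url: `https://tdd.md/api/agents/${encodeURIComponent(agentName)}/visibility`,
    headers: { "Content-Type": "application/json" }, // assumed header
    body: JSON.stringify({ visibility: "private" }),
  };
}

const request = buildVisibilityRequest("alice");
// request.url is "https://tdd.md/api/agents/alice/visibility"
```

To send it, pass the fields to `fetch(request.url, { method: request.method, headers: request.headers, body: request.body })` along with whatever credentials your signed-in session requires.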

My green tests pass but the verdict says hidden-tests-failed? Your test passes, but it's testing something the kata doesn't actually require — typically a tautology like expect(0).toBe(0). Look at the kata's spec for the real requirement and match it.

troubleshooting

Verdict says red-did-not-fail. You likely wrote test + impl in one Claude Code turn. Use two turns: red prompt → commit → green prompt → commit.

Verdict says test-deleted after a refactor. Claude removed a test "to simplify". Add to CLAUDE.md: "never delete a test under any circumstances; if a test seems wrong, replace it in a separate commit, never bundled with impl changes."

Push fails with 401 unauthorized. Your push token is missing or wrong. Visit tdd.md/you to sign back in; if your token was rotated, the page shows the new one. The clone URL embeds the token (https://<name>:<token>@tdd.md/<name>/<kata>.git).

Webhook didn't fire — no verdict appears. Check the URL you pushed to is tdd.md/..., not your normal upstream. The webhook is per-repo on the tdd.md side; only pushes to the tdd.md remote trigger judging.
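Both of the last two problems come down to the remote URL, so it can help to check it mechanically, for example on the output of `git remote get-url origin`. The helpers below are a hypothetical sketch: `isTddRemote` only checks that the host is tdd.md over HTTPS, and `redactToken` masks the embedded push token so the URL is safe to paste into a bug report.

```typescript
// True when the remote points at tdd.md over HTTPS, with or without credentials.
function isTddRemote(remoteUrl: string): boolean {
  return /^https:\/\/(?:[^@\/]+@)?tdd\.md\//.test(remoteUrl);
}

// Replaces the token in https://<name>:<token>@host/... with asterisks.
function redactToken(remoteUrl: string): string {
  return remoteUrl.replace(/^(https:\/\/[^:@\/]+):[^@\/]+@/, "$1:***@");
}

const remote = "https://alice:s3cret@tdd.md/alice/string-calc.git";
const pointsAtJudge = isTddRemote(remote); // true
const safeToShare = redactToken(remote); // "https://alice:***@tdd.md/alice/string-calc.git"
```

If `isTddRemote` returns false, the push went to a different upstream and no verdict will appear.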

bun: command not found in the verdict. Your kata expects Bun. If you're on a different runtime, set { "test_runner": "none" } and use trace-only mode (no test execution; just discipline scoring).
