Claude Code does not do TDD by default — here's how to make it
If you ask Claude Code to "add a test for X and implement it", you get implementation that the test trivially passes. That isn't TDD. Here's what does work, why, and how to detect when it doesn't.
When you prompt Claude Code with a feature description, the model holds the entire intended solution in its head while writing the test. The test is shaped around what the impl will do — not the other way around. By the time the impl is written, the test was always going to pass.
This isn't a bug. It's the model doing what it was trained to do: produce the most plausible code that satisfies the request as a whole. But it's the opposite of TDD.
The classic TDD loop:
- Red — write a failing test that captures one new requirement
- Green — write the simplest impl that makes it pass
- Refactor — improve structure, tests stay green
Each phase is a context boundary. The test writer doesn't know how the impl will look. The impl writer's job is constrained by exactly what the test says.
When everything happens in one prompt, that boundary collapses. The test stops being a contract and becomes prose.
#Why this matters more with agents, not less
Bare correctness — "does the code work?" — is checked by tests. But TDD never claimed it was about correctness; it's about design pressure. Writing the test first forces design from the caller's perspective: what's the API? What inputs are realistic? What should the error look like?
When agents skip that, you don't notice immediately. Code passes review. Tests are present. CI is green. But six months later, when the requirements change, the design that was implicitly chosen by the agent doesn't bend. There's no design to bend; there's just code that happened to satisfy the test that happened to get written.
Agentic coding compounds this effect. You ship more code, faster, with less direct review. The discipline gap doesn't show up as bugs — it shows up as architecture you don't recognize three months later.
#The fix
Two changes get Claude Code into actual TDD.
#1. CLAUDE.md rules
Pin the discipline at project level so it doesn't leak from every prompt. Drop this in CLAUDE.md:
This project follows TDD strictly.
Cycle: write a FAILING test, commit `red(<step>): <message>`, then write
the simplest impl that makes it pass, commit `green(<step>): <message>`.
Optional `refactor: <message>` between cycles.
Never write impl before its failing test. Never write test + impl in the
same response. Never delete a test under any circumstances.
If a test seems wrong, replace it in a separate commit, never bundled
with impl changes.
CLAUDE.md is read as context on every Claude Code invocation. Pinning the rule there beats restating it in every prompt — and the negative space ("never write test + impl in the same response") is what most prompts forget.
#2. Phase-separated prompts
Each phase is its own prompt, in a fresh Claude Code session. Don't continue the conversation from one phase to the next. The model shouldn't see what the other phase produced until it's commit history.
Red phase. Open a Claude Code session. Prompt:
"Write a single failing test for:
<requirement>. The test should be in<test-file>. Don't touch the implementation. Run the test to confirm it fails for the right reason."
Run the test, confirm it fails, commit:
git commit -m "red(<step>): <one-line summary of the requirement>"
Green phase. Close that session, open a fresh one:
"The test in
<test-file>is failing. Write the simplest implementation in<impl-file>that makes it pass — no more. Run tests to confirm green."
Commit:
git commit -m "green(<step>): <one-line summary of the impl>"
Refactor (optional). Fresh session:
"Tests pass. Refactor
<impl-file>for clarity. Don't change behaviour. Run tests after each edit."
git commit -m "refactor: <what changed structurally>"
The fresh-context boundary between red and green is what creates the design pressure. Claude doesn't get to anticipate the impl while writing the test.
#A concrete walk-through
Take the canonical String Calculator kata. Step 1: add("") returns 0.
Wrong way (single prompt, one commit):
"Add a test that
add("")returns 0 and implement it."
Result: an add.test.ts and add.ts materialize together, both "correct", single commit. Tests pass on first run. But the test was written by a model that already knew the impl would handle the empty string.
Right way, two sessions:
$ claude "Write a failing test for: add('') returns 0. Don't write add() yet."
# Claude writes add.test.ts only
$ bun test
# fails: "Cannot find module './add'" or "add is not a function"
$ git commit -m "red(empty): empty string returns 0"
Fresh session:
$ claude "The test in add.test.ts fails. Write the simplest add() in add.ts that makes it pass."
# Claude writes add.ts with `export const add = () => 0`
$ bun test
# passes
$ git commit -m "green(empty): hardcoded 0"
The "hardcoded 0" is intentional. TDD's "fake it 'til you make it" — the simplest thing that works, even if it's obviously incomplete. The next step ("a single number returns its value") will force a real implementation, because returning 0 won't satisfy add("42") === 42. The discipline emerges step-by-step; you can't shortcut it.
#How to know you got it right
Self-grading is hard. Even with the best intentions, models slip into combined-phase patterns. Was that test really failing for the right reason? Did the green commit's impl actually need to be written, or was it always going to pass?
That's the gap tdd.md closes. It clones your kata repo on push, walks each commit, and runs the tests in a sandbox at every checkout. For the red commit, it asserts the tests fail. For the green commit, it asserts they pass. It also runs hidden tests it owns (stops the tautology trap), counts test functions across commits (catches deletion), and re-runs tests on each refactor commit (catches regression).
The verdict is a public URL with per-step status, score, and a one-line explanation per row. /demo/string-calc is what a clean run looks like — +45, two steps verified, one refactor with tests staying green.
#Common pitfalls and what they cost
red-did-not-fail is the most common failure mode. Combined red+green in one prompt → test passes at red commit → -5. Fix: two separate Claude Code invocations.
hidden-tests-failed is the tautology trap. Your test passes but it's testing something the kata doesn't actually require — typically a literal-truth assertion. Hidden tests catch this; 0 points. Fix: write the test against the real requirement, not against a token gesture toward it.
test-deleted is Claude tidying up. -20. Fix: the CLAUDE.md rule "never delete a test" stops this in most cases. If Claude proposes test removal, push back: "fix the impl, never the test."
Broken refactor — the refactor commit's tests fail. -5. Fix: re-run tests after every edit; revert if anything breaks.
#Why the fresh context matters
You can resist combined prompts and still write the test in the same Claude Code session as the upcoming impl. Claude has access to its own conversation history; even if you write only a "red" prompt, the model knows it's about to write the impl. The context bias persists across turns within a session.
Fresh sessions break that. Your red prompt enters a new conversation with no future context. The model writes the test as the spec it would want to satisfy, not as a step toward the impl it already has in mind.
This is the single highest-leverage change. CLAUDE.md helps; commit prefixes help; but session boundaries are the structural thing that makes TDD discipline actually possible with Claude Code.
#Try it
Sign in at tdd.md/you, pick the string-calc kata, and run a few cycles. The verdict updates within seconds of each push. The phase log shows you exactly what the judge saw, the score column tells you what each commit earned, and the explanation column tells you why.
Six steps in, you'll have an opinion about whether your Claude Code workflow does real TDD or theater. That opinion is now backed by data, not vibes.
← all guides · Claude Code reference guide · the kata catalog