syntaxai/tdd.md · commit 532b094

Add /blog with first post: Claude Code does not do TDD by default

A long-form companion to /guides/claude-code, written for HN/Reddit
sharing rather than reference. The argument: combined-prompt TDD with
agents looks like TDD but skips the design-pressure step that makes
it valuable. Two structural fixes — CLAUDE.md rules + fresh sessions
per phase — restore the discipline; tdd.md verifies it.

Infrastructure mirrors /guides:
- ALL_POSTS list with title / description / date drives the index
- /blog renders a date-sorted table
- /blog/:slug looks up the entry, reads content/blog/<slug>.md
- BlogPosting JSON-LD per post (datePublished, headline, etc.)
- sitemap entries with priority 0.8

Homepage CTA strip now ends with the post link so first-time visitors
get the narrative version one click away from the reference guide.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
author
syntaxai <[email protected]>
date
2026-05-07 12:33:26 +01:00
parent
497cea9
commit
532b0942ee413ec64c249c6b784a5db9ef726c3c

3 files changed · +234 −1

added content/blog/claude-code-tdd.md +153 −0
@@ -0,0 +1,153 @@
1+# Claude Code does not do TDD by default — here's how to make it
2+
3+> If you ask Claude Code to "add a test for X and implement it", you get implementation that the test trivially passes. That isn't TDD. Here's what does work, why, and how to detect when it doesn't.
4+
5+When you prompt Claude Code with a feature description, the model holds the entire intended solution in its head while writing the test. The test is shaped around what the impl will do — not the other way around. By the time the impl is written, the test was always going to pass.
6+
7+This isn't a bug. It's the model doing what it was trained to do: produce the most plausible code that satisfies the request as a whole. But it's the opposite of TDD.
8+
9+The classic TDD loop:
10+
11+1. **Red** — write a failing test that captures one new requirement
12+2. **Green** — write the simplest impl that makes it pass
13+3. **Refactor** — improve structure, tests stay green
14+
15+Each phase is a context boundary. The test writer doesn't know how the impl will look. The impl writer's job is constrained by exactly what the test says.
16+
17+When everything happens in one prompt, that boundary collapses. The test stops being a contract and becomes prose.
18+
19+## Why this matters more with agents, not less
20+
21+Bare correctness — "does the code work?" — is checked by tests. But TDD never claimed it was about correctness; it's about *design pressure*. Writing the test first forces design from the caller's perspective: what's the API? What inputs are realistic? What should the error look like?
22+
23+When agents skip that, you don't notice immediately. Code passes review. Tests are present. CI is green. But six months later, when the requirements change, the design that was implicitly chosen by the agent doesn't bend. There's no design *to* bend; there's just code that happened to satisfy the test that happened to get written.
24+
25+Agentic coding compounds this effect. You ship more code, faster, with less direct review. The discipline gap doesn't show up as bugs — it shows up as architecture you don't recognize three months later.
26+
27+## The fix
28+
29+Two changes get Claude Code into actual TDD.
30+
31+### 1. CLAUDE.md rules
32+
33+Pin the discipline at project level so it doesn't leak from every prompt. Drop this in `CLAUDE.md`:
34+
35+```md
36+This project follows TDD strictly.
37+
38+Cycle: write a FAILING test, commit `red(<step>): <message>`, then write
39+the simplest impl that makes it pass, commit `green(<step>): <message>`.
40+Optional `refactor: <message>` between cycles.
41+
42+Never write impl before its failing test. Never write test + impl in the
43+same response. Never delete a test under any circumstances.
44+
45+If a test seems wrong, replace it in a separate commit, never bundled
46+with impl changes.
47+```
48+
49+CLAUDE.md is read as context on every Claude Code invocation. Pinning the rule there beats restating it in every prompt — and the negative space ("never write test + impl in the same response") is what most prompts forget.
50+
51+### 2. Phase-separated prompts
52+
53+Each phase is its own prompt, in a fresh Claude Code session. Don't continue the conversation from one phase to the next. The model shouldn't see what the other phase produced until it's commit history.
54+
55+**Red phase.** Open a Claude Code session. Prompt:
56+
57+> "Write a single failing test for: `<requirement>`. The test should be in `<test-file>`. Don't touch the implementation. Run the test to confirm it fails for the right reason."
58+
59+Run the test, confirm it fails, commit:
60+
61+```
62+git commit -m "red(<step>): <one-line summary of the requirement>"
63+```
64+
65+**Green phase.** Close that session, open a fresh one:
66+
67+> "The test in `<test-file>` is failing. Write the simplest implementation in `<impl-file>` that makes it pass — no more. Run tests to confirm green."
68+
69+Commit:
70+
71+```
72+git commit -m "green(<step>): <one-line summary of the impl>"
73+```
74+
75+**Refactor (optional).** Fresh session:
76+
77+> "Tests pass. Refactor `<impl-file>` for clarity. Don't change behaviour. Run tests after each edit."
78+
79+```
80+git commit -m "refactor: <what changed structurally>"
81+```
82+
83+The fresh-context boundary between red and green is what creates the design pressure. Claude doesn't get to anticipate the impl while writing the test.
84+
85+## A concrete walk-through
86+
87+Take the canonical String Calculator kata. Step 1: `add("")` returns `0`.
88+
89+**Wrong way** (single prompt, one commit):
90+
91+> "Add a test that `add("")` returns 0 and implement it."
92+
93+Result: an `add.test.ts` and `add.ts` materialize together, both "correct", single commit. Tests pass on first run. But the test was written by a model that already knew the impl would handle the empty string.
94+
95+**Right way**, two sessions:
96+
97+```
98+$ claude "Write a failing test for: add('') returns 0. Don't write add() yet."
99+# Claude writes add.test.ts only
100+
101+$ bun test
102+# fails: "Cannot find module './add'" or "add is not a function"
103+
104+$ git commit -m "red(empty): empty string returns 0"
105+```
106+
107+Fresh session:
108+
109+```
110+$ claude "The test in add.test.ts fails. Write the simplest add() in add.ts that makes it pass."
111+# Claude writes add.ts with `export const add = () => 0`
112+
113+$ bun test
114+# passes
115+
116+$ git commit -m "green(empty): hardcoded 0"
117+```
118+
119+The "hardcoded 0" is intentional. TDD's "fake it 'til you make it" — the simplest thing that works, even if it's obviously incomplete. The *next* step ("a single number returns its value") will force a real implementation, because returning 0 won't satisfy `add("42") === 42`. The discipline emerges step-by-step; you can't shortcut it.
120+
121+## How to know you got it right
122+
123+Self-grading is hard. Even with the best intentions, models slip into combined-phase patterns. Was that test really failing for the right reason? Did the green commit's impl actually need to be written, or was it always going to pass?
124+
125+That's the gap [tdd.md](/) closes. It clones your kata repo on push, walks each commit, and runs the tests in a sandbox at every checkout. For the red commit, it asserts the tests *fail*. For the green commit, it asserts they pass. It also runs hidden tests it owns (stops the tautology trap), counts test functions across commits (catches deletion), and re-runs tests on each refactor commit (catches regression).
126+
127+The verdict is a public URL with per-step status, score, and a one-line explanation per row. [/demo/string-calc](/demo/string-calc) is what a clean run looks like — `+45`, two steps verified, one refactor with tests staying green.
128+
129+## Common pitfalls and what they cost
130+
131+**`red-did-not-fail`** is the most common failure mode. Combined red+green in one prompt → test passes at red commit → -5. Fix: two separate Claude Code invocations.
132+
133+**`hidden-tests-failed`** is the tautology trap. Your test passes but it's testing something the kata doesn't actually require — typically a literal-truth assertion. Hidden tests catch this; 0 points. Fix: write the test against the real requirement, not against a token gesture toward it.
134+
135+**`test-deleted`** is Claude tidying up. -20. Fix: the CLAUDE.md rule "never delete a test" stops this in most cases. If Claude proposes test removal, push back: "fix the impl, never the test."
136+
137+**Broken refactor** — the refactor commit's tests fail. -5. Fix: re-run tests after every edit; revert if anything breaks.
138+
139+## Why the fresh context matters
140+
141+You can resist combined prompts and still write the test in the same Claude Code session as the upcoming impl. Claude has access to its own conversation history; even if you write only a "red" prompt, the model knows it's about to write the impl. The context bias persists across turns within a session.
142+
143+Fresh sessions break that. Your red prompt enters a new conversation with no future context. The model writes the test as the spec it would *want* to satisfy, not as a step toward the impl it already has in mind.
144+
145+This is the single highest-leverage change. CLAUDE.md helps; commit prefixes help; but session boundaries are the structural thing that makes TDD discipline actually possible with Claude Code.
146+
147+## Try it
148+
149+Sign in at [tdd.md/you](/you), pick the [string-calc kata](/games/string-calc), and run a few cycles. The verdict updates within seconds of each push. The phase log shows you exactly what the judge saw, the score column tells you what each commit earned, and the explanation column tells you why.
150+
151+Six steps in, you'll have an opinion about whether your Claude Code workflow does real TDD or theater. That opinion is now backed by data, not vibes.
152+
153+[← all guides](/guides) · [Claude Code reference guide](/guides/claude-code) · [the kata catalog](/games)
modified content/home.md +1 −1
@@ -8,7 +8,7 @@
88 - [TDD with **Cursor** →](/guides/cursor) · Composer-per-phase, project rules, agent-mode caveats
99 - [TDD with **Aider** →](/guides/aider) · auto-commit phase tags, --auto-test gotchas
1010
11-See what a real verdict looks like: [tdd.md/demo/string-calc →](/demo/string-calc).
11+See what a real verdict looks like: [tdd.md/demo/string-calc →](/demo/string-calc). Read the post: [Claude Code does not do TDD by default →](/blog/claude-code-tdd).
1212
1313 ---
1414
modified src/server.ts +80 −0
@@ -41,6 +41,23 @@ interface GuideEntry {
4141 description: string;
4242 }
4343
44+interface BlogEntry {
45+ slug: string;
46+ title: string;
47+ description: string;
48+ // ISO date for the listing + sitemap lastmod.
49+ date: string;
50+}
51+
52+const ALL_POSTS: BlogEntry[] = [
53+ {
54+ slug: "claude-code-tdd",
55+ title: "Claude Code does not do TDD by default — here's how to make it",
56+ description: "Claude Code writes the test and impl in one breath, so the test never fails for the right reason. Two structural changes — CLAUDE.md rules + phase-separated sessions — get the discipline back, and tdd.md can verify it.",
57+ date: "2026-05-04",
58+ },
59+];
60+
4461 const ALL_GUIDES: GuideEntry[] = [
4562 {
4663 slug: "claude-code",
@@ -660,6 +677,9 @@ const server = Bun.serve({
660677 const guideUrls = ALL_GUIDES.map((g) =>
661678 url(`https://tdd.md/guides/${g.slug}`, "0.8"),
662679 ).join("\n");
680+ const blogUrls = ALL_POSTS.map((p) =>
681+ url(`https://tdd.md/blog/${p.slug}`, "0.8"),
682+ ).join("\n");
663683 const xml = `<?xml version="1.0" encoding="UTF-8"?>
664684 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
665685 ${url("https://tdd.md/", "1.0")}
@@ -667,6 +687,8 @@ ${url("https://tdd.md/games", "0.9")}
667687 ${kataUrls}
668688 ${url("https://tdd.md/guides", "0.9")}
669689 ${guideUrls}
690+${url("https://tdd.md/blog", "0.7")}
691+${blogUrls}
670692 ${url("https://tdd.md/agents", "0.7")}
671693 ${url("https://tdd.md/leaderboard", "0.7")}
672694 </urlset>`;
@@ -684,6 +706,64 @@ ${url("https://tdd.md/leaderboard", "0.7")}
684706
685707 "/games": htmlResponse(GAMES_INDEX_HTML),
686708
709+ "/blog": async () => {
710+ const rows = ALL_POSTS
711+ .map((p) => `| ${p.date} | [${p.title}](/blog/${p.slug}) |`)
712+ .join("\n");
713+ const body = `# blog
714+
715+Notes on TDD, agentic coding, and the discipline that ties them together.
716+
717+| date | post |
718+|---|---|
719+${rows}
720+
721+> RSS feed coming when there's a second post.
722+
723+[← back to tdd.md](/) · [the guides](/guides) · [the katas](/games)
724+`;
725+ const html = await renderPage({
726+ title: "Blog — tdd.md",
727+ description: "Posts on test-driven development for AI coding agents — how to apply TDD with Claude Code, Cursor, and Aider, what we learn from the verdicts.",
728+ bodyMarkdown: body,
729+ ogPath: "https://tdd.md/blog",
730+ active: "guides",
731+ });
732+ return htmlResponse(html);
733+ },
734+
735+ "/blog/:slug": async (req) => {
736+ const slug = req.params.slug;
737+ const entry = ALL_POSTS.find((p) => p.slug === slug);
738+ if (!entry) {
739+ const html = await renderNotFound(`/blog/${slug}`);
740+ return htmlResponse(html, 404);
741+ }
742+ const file = Bun.file(`./content/blog/${slug}.md`);
743+ if (!(await file.exists())) {
744+ const html = await renderNotFound(`/blog/${slug}`);
745+ return htmlResponse(html, 404);
746+ }
747+ const md = await file.text();
748+ const html = await renderPage({
749+ title: `${entry.title} — tdd.md`,
750+ description: entry.description,
751+ bodyMarkdown: md,
752+ ogPath: `https://tdd.md/blog/${slug}`,
753+ active: "guides",
754+ jsonLd: {
755+ "@context": "https://schema.org",
756+ "@type": "BlogPosting",
757+ headline: entry.title,
758+ description: entry.description,
759+ datePublished: entry.date,
760+ url: `https://tdd.md/blog/${slug}`,
761+ author: { "@type": "Organization", name: "tdd.md" },
762+ },
763+ });
764+ return htmlResponse(html);
765+ },
766+
687767 "/guides": async () => {
688768 const rows = ALL_GUIDES
689769 .map((g) => `| [${g.title}](/guides/${g.slug}) | ${g.description} |`)