syntaxai/tdd.md · commit 532b094

Add /blog with first post: Claude Code does not do TDD by default

A long-form companion to /guides/claude-code, written for HN/Reddit
sharing rather than reference. The argument: combined-prompt TDD with
agents looks like TDD but skips the design-pressure step that makes
it valuable. Two structural fixes — CLAUDE.md rules + fresh sessions
per phase — restore the discipline; tdd.md verifies it.

Infrastructure mirrors /guides:
- ALL_POSTS list with title / description / date drives the index
- /blog renders a table of date and post title, in ALL_POSTS order
- /blog/:slug looks up the entry, reads content/blog/<slug>.md
- BlogPosting JSON-LD per post (datePublished, headline, etc.)
- sitemap entries: /blog index at priority 0.7, each post at 0.8

Homepage CTA strip now ends with the post link so first-time visitors
get the narrative version one click away from the reference guide.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

author: syntaxai <[email protected]>
date: 2026-05-07 12:33:26 +01:00
parent: 497cea9
commit: 532b0942ee413ec64c249c6b784a5db9ef726c3c

3 files changed · +234 −1

added content/blog/claude-code-tdd.md +153 −0

@@ -0,0 +1,153 @@
	1	+# Claude Code does not do TDD by default — here's how to make it
	2	+
	3	+> If you ask Claude Code to "add a test for X and implement it", you get implementation that the test trivially passes. That isn't TDD. Here's what does work, why, and how to detect when it doesn't.
	4	+
	5	+When you prompt Claude Code with a feature description, the model holds the entire intended solution in its head while writing the test. The test is shaped around what the impl will do — not the other way around. By the time the impl is written, the test was always going to pass.
	6	+
	7	+This isn't a bug. It's the model doing what it was trained to do: produce the most plausible code that satisfies the request as a whole. But it's the opposite of TDD.
	8	+
	9	+The classic TDD loop:
	10	+
	11	+1. Red — write a failing test that captures one new requirement
	12	+2. Green — write the simplest impl that makes it pass
	13	+3. Refactor — improve structure, tests stay green
	14	+
	15	+Each phase is a context boundary. The test writer doesn't know how the impl will look. The impl writer's job is constrained by exactly what the test says.
	16	+
	17	+When everything happens in one prompt, that boundary collapses. The test stops being a contract and becomes prose.
	18	+
	19	+## Why this matters more with agents, not less
	20	+
	21	+Bare correctness — "does the code work?" — is checked by tests. But TDD never claimed it was about correctness; it's about design pressure. Writing the test first forces design from the caller's perspective: what's the API? What inputs are realistic? What should the error look like?
	22	+
	23	+When agents skip that, you don't notice immediately. Code passes review. Tests are present. CI is green. But six months later, when the requirements change, the design that was implicitly chosen by the agent doesn't bend. There's no design to bend; there's just code that happened to satisfy the test that happened to get written.
	24	+
	25	+Agentic coding compounds this effect. You ship more code, faster, with less direct review. The discipline gap doesn't show up as bugs — it shows up as architecture you don't recognize three months later.
	26	+
	27	+## The fix
	28	+
	29	+Two changes get Claude Code into actual TDD.
	30	+
	31	+### 1. CLAUDE.md rules
	32	+
	33	+Pin the discipline at project level so it doesn't depend on how each prompt is worded. Drop this in `CLAUDE.md`:
	34	+
	35	+```md
	36	+This project follows TDD strictly.
	37	+
	38	+Cycle: write a FAILING test, commit `red(<step>): <message>`, then write
	39	+the simplest impl that makes it pass, commit `green(<step>): <message>`.
	40	+Optional `refactor: <message>` between cycles.
	41	+
	42	+Never write impl before its failing test. Never write test + impl in the
	43	+same response. Never delete a test under any circumstances.
	44	+
	45	+If a test seems wrong, replace it in a separate commit, never bundled
	46	+with impl changes.
	47	+```
	48	+
	49	+CLAUDE.md is read as context on every Claude Code invocation. Pinning the rule there beats restating it in every prompt — and the negative space ("never write test + impl in the same response") is what most prompts forget.
	50	+
	51	+### 2. Phase-separated prompts
	52	+
	53	+Each phase is its own prompt, in a fresh Claude Code session. Don't continue the conversation from one phase to the next. The model shouldn't see what the other phase produced until it's commit history.
	54	+
	55	+Red phase. Open a Claude Code session. Prompt:
	56	+
	57	+> "Write a single failing test for: `<requirement>`. The test should be in `<test-file>`. Don't touch the implementation. Run the test to confirm it fails for the right reason."
	58	+
	59	+Run the test, confirm it fails, commit:
	60	+
	61	+```
	62	+git commit -m "red(<step>): <one-line summary of the requirement>"
	63	+```
	64	+
	65	+Green phase. Close that session, open a fresh one:
	66	+
	67	+> "The test in `<test-file>` is failing. Write the simplest implementation in `<impl-file>` that makes it pass — no more. Run tests to confirm green."
	68	+
	69	+Commit:
	70	+
	71	+```
	72	+git commit -m "green(<step>): <one-line summary of the impl>"
	73	+```
	74	+
	75	+Refactor (optional). Fresh session:
	76	+
	77	+> "Tests pass. Refactor `<impl-file>` for clarity. Don't change behaviour. Run tests after each edit."
	78	+
	79	+```
	80	+git commit -m "refactor: <what changed structurally>"
	81	+```
	82	+
	83	+The fresh-context boundary between red and green is what creates the design pressure. Claude doesn't get to anticipate the impl while writing the test.
	84	+
	85	+## A concrete walk-through
	86	+
	87	+Take the canonical String Calculator kata. Step 1: `add("")` returns `0`.
	88	+
	89	+Wrong way (single prompt, one commit):
	90	+
	91	+> "Add a test that `add("")` returns 0 and implement it."
	92	+
	93	+Result: an `add.test.ts` and `add.ts` materialize together, both "correct", single commit. Tests pass on first run. But the test was written by a model that already knew the impl would handle the empty string.
	94	+
	95	+Right way, two sessions:
	96	+
	97	+```
	98	+$ claude "Write a failing test for: add('') returns 0. Don't write add() yet."
	99	+# Claude writes add.test.ts only
	100	+
	101	+$ bun test
	102	+# fails: "Cannot find module './add'" or "add is not a function"
	103	+
	104	+$ git commit -m "red(empty): empty string returns 0"
	105	+```
	106	+
	107	+Fresh session:
	108	+
	109	+```
	110	+$ claude "The test in add.test.ts fails. Write the simplest add() in add.ts that makes it pass."
	111	+# Claude writes add.ts with `export const add = () => 0`
	112	+
	113	+$ bun test
	114	+# passes
	115	+
	116	+$ git commit -m "green(empty): hardcoded 0"
	117	+```
	118	+
	119	+The "hardcoded 0" is intentional. This is TDD's "fake it 'til you make it": write the simplest thing that works, even if it's obviously incomplete. The next step ("a single number returns its value") will force a real implementation, because returning 0 won't satisfy `add("42") === 42`. The discipline emerges step-by-step; you can't shortcut it.
	120	+
	121	+## How to know you got it right
	122	+
	123	+Self-grading is hard. Even with the best intentions, models slip into combined-phase patterns. Was that test really failing for the right reason? Did the green commit's impl actually need to be written, or was it always going to pass?
	124	+
	125	+That's the gap [tdd.md](/) closes. It clones your kata repo on push, walks each commit, and runs the tests in a sandbox at every checkout. For the red commit, it asserts the tests fail. For the green commit, it asserts they pass. It also runs hidden tests it owns (stops the tautology trap), counts test functions across commits (catches deletion), and re-runs tests on each refactor commit (catches regression).
	126	+
	127	+The verdict is a public URL with per-step status, score, and a one-line explanation per row. [/demo/string-calc](/demo/string-calc) is what a clean run looks like — `+45`, two steps verified, one refactor with tests staying green.
	128	+
	129	+## Common pitfalls and what they cost
	130	+
	131	+`red-did-not-fail` is the most common failure mode. Combined red+green in one prompt → test passes at red commit → -5. Fix: two separate Claude Code invocations.
	132	+
	133	+`hidden-tests-failed` is the tautology trap. Your test passes but it's testing something the kata doesn't actually require — typically a literal-truth assertion. Hidden tests catch this; 0 points. Fix: write the test against the real requirement, not against a token gesture toward it.
	134	+
	135	+`test-deleted` is Claude tidying up. -20. Fix: the CLAUDE.md rule "never delete a test" stops this in most cases. If Claude proposes test removal, push back: "fix the impl, never the test."
	136	+
	137	+Broken refactor — the refactor commit's tests fail. -5. Fix: re-run tests after every edit; revert if anything breaks.
	138	+
	139	+## Why the fresh context matters
	140	+
	141	+You can resist combined prompts and still write the test in the same Claude Code session as the upcoming impl. Claude has access to its own conversation history; even if you write only a "red" prompt, the model knows it's about to write the impl. The context bias persists across turns within a session.
	142	+
	143	+Fresh sessions break that. Your red prompt enters a new conversation with no future context. The model writes the test as the spec it would want to satisfy, not as a step toward the impl it already has in mind.
	144	+
	145	+This is the single highest-leverage change. CLAUDE.md helps; commit prefixes help; but session boundaries are the structural thing that makes TDD discipline actually possible with Claude Code.
	146	+
	147	+## Try it
	148	+
	149	+Sign in at [tdd.md/you](/you), pick the [string-calc kata](/games/string-calc), and run a few cycles. The verdict updates within seconds of each push. The phase log shows you exactly what the judge saw, the score column tells you what each commit earned, and the explanation column tells you why.
	150	+
	151	+Six steps in, you'll have an opinion about whether your Claude Code workflow does real TDD or theater. That opinion is now backed by data, not vibes.
	152	+
	153	+[← all guides](/guides) · [Claude Code reference guide](/guides/claude-code) · [the kata catalog](/games)
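
Note on the post's commit convention: the `red(<step>): <message>`, `green(<step>): <message>`, and `refactor: <message>` prefixes are what allow a checker to decide, per commit, whether the tests should fail or pass. A minimal sketch of that classification follows. The names `parseCommitSubject` and `testsShouldPass`, the regex, and the rule that only red commits are expected to fail are illustrative assumptions, not tdd.md's actual judge.

```typescript
// Hypothetical sketch: classify a commit subject by TDD phase.
type Phase = "red" | "green" | "refactor";

interface ParsedCommit {
  phase: Phase;
  step: string | null; // refactor commits carry no step name
  summary: string;
}

// Matches `red(<step>): <message>`, `green(<step>): <message>`,
// or `refactor: <message>`.
const PHASE_RE = /^(?:(red|green)\(([^)]+)\)|(refactor)): (.+)$/;

function parseCommitSubject(subject: string): ParsedCommit | null {
  const m = PHASE_RE.exec(subject.trim());
  if (!m) return null; // not a phase-tagged commit
  if (m[3]) return { phase: "refactor", step: null, summary: m[4] };
  return { phase: m[1] as Phase, step: m[2], summary: m[4] };
}

// Assumption: tests must fail at a red commit and pass at green and
// refactor commits.
function testsShouldPass(phase: Phase): boolean {
  return phase !== "red";
}
```

A checker built on this would run the test suite at each commit and compare the outcome against `testsShouldPass(parsed.phase)`. Subjects that return `null` would need their own policy.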

modified content/home.md +1 −1

@@ -8,7 +8,7 @@
8	8	- [TDD with Cursor →](/guides/cursor) · Composer-per-phase, project rules, agent-mode caveats
9	9	- [TDD with Aider →](/guides/aider) · auto-commit phase tags, --auto-test gotchas
10	10
11		-See what a real verdict looks like: [tdd.md/demo/string-calc →](/demo/string-calc).
	11	+See what a real verdict looks like: [tdd.md/demo/string-calc →](/demo/string-calc). Read the post: [Claude Code does not do TDD by default →](/blog/claude-code-tdd).
12	12
13	13	---
14	14

modified src/server.ts +80 −0

@@ -41,6 +41,23 @@ interface GuideEntry {
41	41	description: string;
42	42	}
43	43
	44	+interface BlogEntry {
	45	+ slug: string;
	46	+ title: string;
	47	+ description: string;
	48	+ // ISO date for the listing + sitemap lastmod.
	49	+ date: string;
	50	+}
	51	+
	52	+const ALL_POSTS: BlogEntry[] = [
	53	+ {
	54	+ slug: "claude-code-tdd",
	55	+ title: "Claude Code does not do TDD by default — here's how to make it",
	56	+ description: "Claude Code writes the test and impl in one breath, so the test never fails for the right reason. Two structural changes — CLAUDE.md rules + phase-separated sessions — get the discipline back, and tdd.md can verify it.",
	57	+ date: "2026-05-04",
	58	+ },
	59	+];
	60	+
44	61	const ALL_GUIDES: GuideEntry[] = [
45	62	{
46	63	slug: "claude-code",
@@ -660,6 +677,9 @@ const server = Bun.serve({
660	677	const guideUrls = ALL_GUIDES.map((g) =>
661	678	url(`https://tdd.md/guides/${g.slug}`, "0.8"),
662	679	).join("\n");
	680	+ const blogUrls = ALL_POSTS.map((p) =>
	681	+ url(`https://tdd.md/blog/${p.slug}`, "0.8"),
	682	+ ).join("\n");
663	683	const xml = `<?xml version="1.0" encoding="UTF-8"?>
664	684	<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
665	685	${url("https://tdd.md/", "1.0")}
@@ -667,6 +687,8 @@ ${url("https://tdd.md/games", "0.9")}
667	687	${kataUrls}
668	688	${url("https://tdd.md/guides", "0.9")}
669	689	${guideUrls}
	690	+${url("https://tdd.md/blog", "0.7")}
	691	+${blogUrls}
670	692	${url("https://tdd.md/agents", "0.7")}
671	693	${url("https://tdd.md/leaderboard", "0.7")}
672	694	</urlset>`;
@@ -684,6 +706,64 @@ ${url("https://tdd.md/leaderboard", "0.7")}
684	706
685	707	"/games": htmlResponse(GAMES_INDEX_HTML),
686	708
	709	+ "/blog": async () => {
	710	+ const rows = ALL_POSTS
	711	+ .map((p) => `\| ${p.date} \| [${p.title}](/blog/${p.slug}) \|`)
	712	+ .join("\n");
	713	+ const body = `# blog
	714	+
	715	+Notes on TDD, agentic coding, and the discipline that ties them together.
	716	+
	717	+\| date \| post \|
	718	+\|---\|---\|
	719	+${rows}
	720	+
	721	+> RSS feed coming when there's a second post.
	722	+
	723	+[← back to tdd.md](/) · [the guides](/guides) · [the katas](/games)
	724	+`;
	725	+ const html = await renderPage({
	726	+ title: "Blog — tdd.md",
	727	+ description: "Posts on test-driven development for AI coding agents — how to apply TDD with Claude Code, Cursor, and Aider, what we learn from the verdicts.",
	728	+ bodyMarkdown: body,
	729	+ ogPath: "https://tdd.md/blog",
	730	+ active: "guides",
	731	+ });
	732	+ return htmlResponse(html);
	733	+ },
	734	+
	735	+ "/blog/:slug": async (req) => {
	736	+ const slug = req.params.slug;
	737	+ const entry = ALL_POSTS.find((p) => p.slug === slug);
	738	+ if (!entry) {
	739	+ const html = await renderNotFound(`/blog/${slug}`);
	740	+ return htmlResponse(html, 404);
	741	+ }
	742	+ const file = Bun.file(`./content/blog/${slug}.md`);
	743	+ if (!(await file.exists())) {
	744	+ const html = await renderNotFound(`/blog/${slug}`);
	745	+ return htmlResponse(html, 404);
	746	+ }
	747	+ const md = await file.text();
	748	+ const html = await renderPage({
	749	+ title: `${entry.title} — tdd.md`,
	750	+ description: entry.description,
	751	+ bodyMarkdown: md,
	752	+ ogPath: `https://tdd.md/blog/${slug}`,
	753	+ active: "guides",
	754	+ jsonLd: {
	755	+ "@context": "https://schema.org",
	756	+ "@type": "BlogPosting",
	757	+ headline: entry.title,
	758	+ description: entry.description,
	759	+ datePublished: entry.date,
	760	+ url: `https://tdd.md/blog/${slug}`,
	761	+ author: { "@type": "Organization", name: "tdd.md" },
	762	+ },
	763	+ });
	764	+ return htmlResponse(html);
	765	+ },
	766	+
687	767	"/guides": async () => {
688	768	const rows = ALL_GUIDES
689	769	.map((g) => `\| [${g.title}](/guides/${g.slug}) \| ${g.description} \|`)
