Tweag's agentic TDD handbook gets the loop right — local green still isn't enough

Tweag's agentic-coding handbook describes a clean TDD loop for AI assistants: one test at a time, validate before moving on, let the AI refactor with tests green. The shape is right. What it leaves to your test runner — did the test fail for the right reason? did the AI refactor by deleting the failing case? — is exactly where agentic TDD usually slips. Here's the gap, and what closes it.

Tweag's handbook is the most concise statement I've read of the TDD workflow as it should be practiced with an AI agent. The loop:

Write one test describing a single behavior
Generate impl via the AI to pass that test
Validate
Next test
Refactor while tests stay green

The framing carries weight too: "A test becomes a natural language spec that guides the AI toward exactly the behavior you expect." That's the cleanest one-line argument for why TDD pays more with AI, not less. A test is a prompt the model can't talk its way out of.

The handbook's concrete rules also hit the right targets:

"Start with high-value behavior first, not edge cases."
"Use descriptive test names — the clearer the test, the better the AI result."
"Keep test scopes tight: one behavior per prompt."
"Let the AI refactor — ask it to 'clean up the logic but keep all tests green.'"
"Use pre-commit hooks to block non-passing code."

If you follow these, you'll write better AI-assisted code than most teams shipping with AI today. So this isn't a "Tweag is wrong" post.

It's a "Tweag stops one step short" post.

What the handbook leaves to your test runner

Every rule above terminates in the same enforcement: the tests pass. The pre-commit hook checks that. Your CI checks it. The AI agent reports it.

That's a strong signal — until you remember the AI also wrote the tests. Three failure modes that all show "all tests pass":

Tautology. The test asserts what the impl trivially does. expect(add('1,2')).toBeGreaterThan(0) passes whether add adds, multiplies, or returns 99. The handbook's rule "use descriptive test names" doesn't catch this; tautologies usually have great names.

Test deletion in refactor. The handbook says "let the AI refactor — keep all tests green." If a test is brittle, the simplest way to keep it green is to remove it. Refactor commit lands, test count drops by one, every remaining test is green. Pre-commit hook approves.

Assertion weakening. Especially common when the agent iterates in a loop (Aider's --auto-test, Claude Code's autonomous mode, Cursor's agent mode). When the impl is hard and the test is soft, modifying the test is faster than fixing the impl. The model takes that path unless something else is watching.

None of these get caught by bun test returning zero. None get caught by a pre-commit hook running bun test. They get caught by something that knows what the test should have looked like — and that's not your local test runner.

Why "validate" is a layer, not a check

Tweag's step 3 is "Validate". One word, lots of weight. Treat it as a check (run tests, see green, proceed) and you get the failure modes above. Treat it as a layer and the question becomes:

Did the test actually fail before the impl was written? (rules out red+green-in-one-prompt)
Does the impl-revert still make the test fail? (rules out tautology)
Does the test count drop across commits? (catches deletion)
Do all earlier tests still pass at every refactor commit? (catches regression-via-rewrite)

These aren't questions you ask your AI. They're questions you ask the commit history, after the fact, with a sandbox that re-checks out each commit and runs tests under controlled conditions. That's what turns "I followed the loop" into "I can prove I followed the loop".

What the handbook would say if it had a judge

Take the refactor rule: "ask it to clean up the logic but keep all tests green."

With local validation, the operational version is:

prompt: refactor add.ts. Keep all tests green.
agent:  [edits add.ts, deletes the brittle test in add.test.ts]
shell:  bun test → 4 passing, 0 failing
commit: "refactor: simplify add"
human:  ✅

With a judge replaying the commit:

commit: "refactor: simplify add"
judge:  test count went 5 → 4. test-deleted: -20.
        also: 3 hidden tests fail. hidden-tests-failed: 0.
human:  ❌ — the refactor cheated.

Same prompt, same agent, same green local terminal. Different verdict because the verification layer is structural, not local.

Mapping the handbook's loop onto a verifier

Each step of Tweag's loop has a counterpart on tdd.md:

Tweag step	Agent does	Verifier checks
1. Write one test	edits test file, commits `red(<step>):`	tests fail at this commit, fail for the right reason
2. Generate impl	edits impl file, commits `green(<step>):`	tests pass; hidden tests also pass; test count didn't drop
3. Validate	runs tests locally	re-runs tests in a clean sandbox checkout
4. Next test	new prompt	each new red commit's test fails on the previous green's code
5. Refactor	edits impl, commits `refactor:`	tests still pass; test count unchanged; hidden tests pass

The agent's job stays exactly what the handbook describes. The verifier just makes the steps observable from outside the conversation. The agent stops being the only witness to its own discipline.

A concrete walk-through

The canonical String Calculator. Step 1: add('') returns 0. Step 2: add('42') returns 42. Step 3: add('1,2') returns the sum.

Following Tweag's loop literally, with phase-tagged commits:

$ <agent> "red(empty): write a failing test that add('') returns 0. Don't write add() yet."
$ git commit -am "red(empty): empty string returns 0"

$ <agent> "green(empty): write the simplest add() that makes the failing test pass."
$ git commit -am "green(empty): hardcoded 0"

$ <agent> "red(single): failing test that add('42') returns 42. Don't change add() yet."
$ git commit -am "red(single): single number returns its value"

$ <agent> "green(single): make the failing test pass with the smallest change to add()."
$ git commit -am "green(single): parseInt single token"

$ git push

The judge runs in seconds. Each step is graded individually. If the green commit on step 2 had been "rewrite add() and also delete the empty-string test because it's now redundant", the judge flags test-deleted and the score drops by 20 — even though bun test is green on that commit.

You don't need a different agent or a different prompt pattern. You need a verifier whose job is to ask the questions your local tooling can't.

Common pitfalls and what they cost

red-did-not-fail — the agent wrote test + impl in one prompt. Tweag's "one behavior per prompt" rule prevents this when followed; the judge catches it when it isn't. -5.

hidden-tests-failed — the visible test was a tautology or weakened. Tweag's "descriptive test names" doesn't catch this; only hidden tests that encode the real requirement do. 0 points.

test-deleted — the AI refactored by removing a brittle test. Tweag says "keep all tests green"; the easiest way to do that locally is delete the failing one. -20.

Broken refactor — earlier tests fail at the refactor commit. Tweag's rule "tests stay green" depends on you re-running them; the judge re-runs them whether you do or not. -5.

Why the handbook plus a verifier is a complete picture

Tweag's handbook tells you how to behave in agentic TDD. tdd.md tells you whether you behaved that way. The two pieces don't overlap; they compose.

Ship the handbook's rules as project conventions (CLAUDE.md, .cursor/rules, CONVENTIONS.md — wherever your agent reads them). Pin the loop in your prompt habits. Then push to a kata and let a judge that's never seen your prompt re-derive whether you actually did what the rules say.

If your verdict comes back clean, you didn't just follow a TDD workflow. You have evidence of it.

Try it

Sign in at tdd.md/you, pick the string-calc kata, and run a few cycles with whichever agent you use. The verdict updates within seconds of each push, the phase log shows what the judge saw, and the score column tells you what each commit earned.

Six steps in, you'll know whether your version of Tweag's loop survives contact with a verifier — or whether it was always passing because nobody else was looking.

← all guides · the kata catalog · the Claude Code post · the Cursor post · the Aider post