Cursor knows how to do TDD. Most users skip the parts that matter.

Cursor's own agent best practices document a clean TDD workflow: plan first, write tests, confirm they fail, then implement. The sequence is right. The features that make it work — Plan Mode, fresh conversations, project rules — are exactly what most users skip. Here's how to put the pieces together, and how to verify you didn't skip any.

If you spend an afternoon reading Cursor's docs, you find an unusually disciplined TDD recommendation. Their five-step pattern, paraphrased:

1. Ask the agent to write tests based on input/output pairs. Say explicitly that you're doing TDD, so the agent doesn't write mock implementations for functionality that doesn't exist yet.
2. Have the agent run the tests and confirm they fail, without writing any implementation code.
3. Commit the failing tests.
4. Request an implementation that passes the tests without modifying them.
5. Commit the implementation.

That sequence is correct. It's the same red-green-refactor loop Kent Beck described in Test-Driven Development: By Example (2002). The gap between Cursor's documented practice and what most users actually do isn't in the loop itself. It's in the supporting features that let the loop survive contact with a real workflow.

The three features most users skip

Plan Mode (`Shift+Tab`)

Cursor's docs are direct about this: "Planning forces clear thinking about what you're building and gives the agent concrete goals to work toward." Plan Mode is where the agent researches your codebase, asks clarifying questions, and writes a plan before touching any file.

Most users skip it because chat-first is faster for small tweaks. For TDD that's a mistake. Plan Mode is where the kata's requirements turn into a concrete sequence of red→green→refactor cycles. Without it, the agent improvises — and improvisation is what collapses the test-first/impl-second boundary.

When a session goes sideways, Cursor recommends reverting and refining the plan rather than iterating through failed attempts. That advice applies just as cleanly to TDD: when the agent starts writing impl during the red phase, the right move is "stop, rewind to the plan", not "let me prompt my way out".

Fresh conversations between phases

Cursor's own warning: "Long conversations can cause the agent to lose focus. After many turns and summarizations, the context accumulates noise and the agent can get distracted or switch to unrelated tasks."

This is the structural fix for combined-phase TDD. If you write the red test in the same chat that's about to write the impl, the model has the impl plan in its working memory. The test ends up shaped to the impl rather than to the requirement.

Fresh chat per phase breaks that. Your "write the failing test" prompt enters a conversation with no future. The model writes the test as a contract for someone else to satisfy — not as a step in its own plan.

Cursor's docs recommend fresh conversations "when you're moving to a different task or feature" or "when the agent seems confused or keeps making the same mistakes." For TDD: treat each phase as a different task. Red is one task. Green is another. Refactor is a third. Three tasks, three conversations.

`.cursor/rules/` (minimalist, pinned)

Project rules are persistent context — read on every agent invocation. Cursor's docs are emphatic about keeping them minimal: "Start simple. Add rules only when you notice the agent making the same mistake repeatedly."

For TDD, one rule covers it. Drop this in `.cursor/rules/tdd.md`:

This project follows TDD strictly.

Cycle: write a FAILING test, commit `red(<step>): <message>`, then write
the simplest impl that makes it pass, commit `green(<step>): <message>`.
Optional `refactor: <message>` between cycles.

Never write impl before its failing test. Never write test + impl in the
same response. Never delete a test under any circumstances.

If a test seems wrong, replace it in a separate commit, never bundled
with impl changes.

That's it. Cursor's docs warn against copying entire style guides into rules — the lint should catch style; rules should encode workflow conventions the linter can't enforce. TDD discipline is exactly that kind of convention.
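The commit convention in that rule is regular enough to check mechanically. The following is a minimal sketch, not part of Cursor or of any existing tool: the names `parseCommitSubject` and `greensWithoutRed` and both regular expressions are assumptions made up for illustration. It classifies commit subjects and flags any `green` commit that has no preceding `red` commit for the same step.

```typescript
// Hypothetical helpers for the red/green/refactor commit convention.
// All names and patterns here are illustrative assumptions.
type Phase = "red" | "green" | "refactor";

interface ParsedCommit {
  phase: Phase;
  step: string | null; // refactor commits carry no step
  message: string;
}

const CYCLE = /^(red|green)\(([^)]+)\): (.+)$/;
const REFACTOR = /^refactor: (.+)$/;

// Returns the parsed subject, or null if it does not follow the convention.
function parseCommitSubject(subject: string): ParsedCommit | null {
  const cycle = CYCLE.exec(subject);
  if (cycle) {
    return { phase: cycle[1] as Phase, step: cycle[2], message: cycle[3] };
  }
  const refactor = REFACTOR.exec(subject);
  if (refactor) {
    return { phase: "refactor", step: null, message: refactor[1] };
  }
  return null;
}

// Returns the subjects of green commits whose most recent red commit is
// missing or belongs to a different step. Subjects are given oldest first;
// refactor commits and non-conforming subjects in between are ignored.
function greensWithoutRed(subjects: string[]): string[] {
  const violations: string[] = [];
  let pendingRed: string | null = null;
  for (const subject of subjects) {
    const parsed = parseCommitSubject(subject);
    if (!parsed) continue;
    if (parsed.phase === "red") {
      pendingRed = parsed.step;
    } else if (parsed.phase === "green") {
      if (pendingRed !== parsed.step) violations.push(subject);
      pendingRed = null;
    }
  }
  return violations;
}

console.log(
  greensWithoutRed([
    "red(empty): empty string returns 0",
    "green(empty): hardcoded 0",
    "green(single): parse one number",
  ]),
); // [ "green(single): parse one number" ]
```

A check like this only confirms the commit messages are in order. It says nothing about whether the tests actually failed at the red commit, which requires running them.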

The full workflow

Putting the pieces together, a single TDD cycle in Cursor looks like this:

Setup, once per kata. Open the kata folder, add the `.cursor/rules/tdd.md` rule from above, then press `Shift+Tab` to enter Plan Mode and prompt:

"We're starting the <kata-name> kata. The requirements are at <spec-url>. Plan the cycles: one red→green pair per requirement, one optional refactor at the end. Don't write code yet."

The plan is a markdown artifact. You read it, push back where you disagree (Cursor explicitly recommends pushback), commit it as a planning note if you like.

Red phase, fresh chat. Exit Plan Mode (or open a new chat with Cmd+L if you've already drifted). Prompt:

"Implement the first cycle of the plan: a failing test for <requirement> in <test-file>. Don't touch the implementation file. Run the test and confirm it fails for the right reason."

Cursor edits, runs, reports the failure. Commit:

git commit -m "red(<step>): <one-line summary>"

Green phase, fresh chat. New conversation. Prompt:

"The test in <test-file> is failing. Write the simplest impl in <impl-file> that makes it pass — no more. Run tests to confirm green."

Commit:

git commit -m "green(<step>): <one-line summary>"

Refactor (optional), fresh chat. Prompt:

"Tests pass. Refactor <impl-file> for clarity without changing behaviour. Run tests after each edit."

Commit:

git commit -m "refactor: <what changed>"

A concrete walk-through

Take the canonical String Calculator kata. Step 1: add("") returns 0.

Plan Mode drafts a plan that says "seven cycles, starting with empty-string returns zero". You commit it.

Red phase, fresh chat. Cursor writes `add.test.ts` with one failing test, runs `bun test`, and reports the failure. You commit `red(empty): empty string returns 0`.

Green phase, fresh chat. Cursor opens `add.ts`, writes `export const add = (_numbers: string): number => 0` (the unused parameter keeps the call `add("")` type-correct), runs the tests, and reports green. You commit `green(empty): hardcoded 0`.

The "hardcoded 0" is intentional: Beck's "fake it 'til you make it". The next step (add("42") returns 42) will force a real implementation, because returning 0 will fail the new test. The discipline emerges step-by-step; you can't shortcut it without inheriting design debt you didn't choose.
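The first two cycles can be sketched in one self-contained file. This is an illustration under stated assumptions rather than the kata's reference solution: the names `addAfterCycle1`, `addAfterCycle2`, `cycle1Test`, and `cycle2Test` are hypothetical, and plain boolean checks stand in for `bun test` so the snippet runs without a test runner.

```typescript
// Signature shared by every version of add() in this sketch.
type Add = (numbers: string) => number;

// Cycle 1, green: the simplest thing that makes add("") === 0 pass.
const addAfterCycle1: Add = (_numbers) => 0;

// Cycle 2, green: add("42") === 42 forces a real implementation.
// Only the empty string and a single number are handled at this point.
const addAfterCycle2: Add = (numbers) =>
  numbers === "" ? 0 : Number(numbers);

// Each "test" states a requirement and knows nothing about the implementation.
const cycle1Test = (add: Add): boolean => add("") === 0;
const cycle2Test = (add: Add): boolean => add("42") === 42;

console.log(cycle1Test(addAfterCycle1)); // true  (cycle 1 is green)
console.log(cycle2Test(addAfterCycle1)); // false (cycle 2 starts red)
console.log(cycle2Test(addAfterCycle2)); // true  (cycle 2 is green)
console.log(cycle1Test(addAfterCycle2)); // true  (cycle 1 still passes)
```

The second line of output is the point: the hardcoded version fails the new test, so the red commit for cycle 2 is a real failure and not a formality.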

Common pitfalls and what they cost

red-did-not-fail: a combined Composer turn writes test and impl in one apply, so the test passes immediately. Score: -5 points. Fix: fresh chat per phase.

hidden-tests-failed: your test passes, but the kata's hidden tests don't. The test was tautological or shaped around the wrong impl. Score: 0 points. Fix: anchor the test to the requirement, not to the implementation you're about to write.

test-deleted: Cursor offers to "fix" a failing green by removing the test instead of fixing the impl. Score: -20 points. Fix: the rule above ("never delete a test under any circumstances"), plus pushing back when Cursor proposes it. Cursor's docs explicitly recommend treating the agent as a "capable collaborator", and pushback is part of that relationship.

Broken refactor: the tests fail at the refactor commit. Score: -5 points. Fix: re-run the tests after every edit; revert if anything breaks.

How tdd.md verifies you actually followed the pattern

The tdd.md service (not to be confused with the `.cursor/rules/tdd.md` rule file) clones your kata repo on push, walks each commit, and runs the tests in a sandbox at every checkout. For the red commit, it asserts the tests fail. For the green commit, it asserts they pass. It also runs hidden tests it owns (catches tautologies), counts test functions across commits (catches deletion), and re-runs tests on each refactor (catches regression).
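The per-commit checks described above can be sketched as a pure function over already-collected results. This is a guess at the shape of the logic, not the service's actual code: the `CheckedCommit` type, the `judge` function, and the labels `green-did-not-pass` and `broken-refactor` are assumptions for illustration (`red-did-not-fail` and `test-deleted` are the names this post uses). Hidden tests and scoring are left out.

```typescript
type Phase = "red" | "green" | "refactor";

// One entry per commit, oldest first. Both measurements are assumed to
// come from running the repo's tests in a sandbox at that checkout.
interface CheckedCommit {
  subject: string;
  phase: Phase;
  testsPass: boolean; // did the full test run succeed at this commit?
  testCount: number;  // number of test functions found at this commit
}

type Problem =
  | "red-did-not-fail"
  | "green-did-not-pass" // hypothetical label
  | "broken-refactor"    // hypothetical label
  | "test-deleted";

interface Finding {
  subject: string;
  problem: Problem;
}

function judge(commits: CheckedCommit[]): Finding[] {
  const findings: Finding[] = [];
  let previousTestCount = 0;
  for (const commit of commits) {
    const { subject, phase, testsPass, testCount } = commit;
    // A red commit must fail; green and refactor commits must pass.
    if (phase === "red" && testsPass) {
      findings.push({ subject, problem: "red-did-not-fail" });
    }
    if (phase === "green" && !testsPass) {
      findings.push({ subject, problem: "green-did-not-pass" });
    }
    if (phase === "refactor" && !testsPass) {
      findings.push({ subject, problem: "broken-refactor" });
    }
    // A shrinking test count between consecutive commits suggests deletion.
    if (testCount < previousTestCount) {
      findings.push({ subject, problem: "test-deleted" });
    }
    previousTestCount = testCount;
  }
  return findings;
}

console.log(
  judge([
    { subject: "red(empty): empty string returns 0", phase: "red", testsPass: false, testCount: 1 },
    { subject: "green(empty): hardcoded 0", phase: "green", testsPass: true, testCount: 1 },
  ]),
); // []
```

Counting test functions is a coarse signal: it would miss a test that was replaced by a weaker one, which is presumably part of why the service also runs hidden tests of its own.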

The verdict is a public URL with per-step status, score, and a one-line explanation per row. /demo/string-calc is what a clean run looks like — +45, two steps verified, one refactor with tests staying green.

If you ran the kata in Cursor with everything documented above, your verdict matches the demo's pattern. If you slipped — combined Composer, long-running session that bled context, agent mode that auto-applied across the boundary — the verdict shows exactly which step it caught you on, and why.

Why Plan Mode is the secret weapon

The structural insight Cursor's docs nudge at but don't quite spell out: Plan Mode is where the agent's plan and the kata's requirements line up before any code touches disk. The plan is a markdown artifact. It survives across session resets. It's what makes fresh-conversation-per-phase coherent — each phase enters a fresh context, but with the same plan visible.

Without a plan, fresh conversations are amnesia. With a plan, they're focus. That's the whole shift between "TDD-shaped output" and "actual TDD discipline".

Try it

Sign in at tdd.md/you, pick the string-calc kata, and run it through Cursor with the rules and the workflow above. The verdict updates within seconds of each push. The phase log shows what the judge saw, the score column shows what each commit earned, and the explanation column tells you why.

Six steps in, you'll have an evidence-backed answer to: "Is my Cursor workflow doing real TDD, or just looking like it?"
