Coach mode: modes, spike phase, explanations, second kata, why

Reframes tdd.md from a strict-only judge into a coach with three modes,
based on real-user feedback ("te puristisch", Dutch for "too purist") plus
field research (Bache, Tweag, Fox, Alexop), all converging on the same
point: strict red→green separation isn't always how experienced TDD
developers work, and treating it as the only valid form pushes people away.
Modes (read from tdd.config.json at repo root, default strict):
- strict — current behavior, full penalties, combined red+green rejected
- pragmatic — penalties halved, combined red+green accepted (Kent-Beck-circa-2018)
- learning — negatives floored to 0; only positive credit + verbose
explanations (for newcomers)
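
The mode effects listed above can be sketched as a small pure function. It mirrors the `applyMode` helper in the src/judge.ts diff of this commit; the sample calls are illustrative only. Because halving goes through `Math.ceil`, an odd penalty rounds toward zero, so -5 becomes -2 rather than -2.5.

```ts
type Mode = "strict" | "pragmatic" | "learning";

// Positive credit is never changed; only penalties scale with the mode.
const applyMode = (delta: number, mode: Mode): number => {
  if (delta >= 0) return delta;
  if (mode === "learning") return 0; // negatives floored to 0
  if (mode === "pragmatic") return Math.ceil(delta / 2); // halved, rounded toward zero
  return delta; // strict: full penalty
};

console.log(applyMode(-5, "strict")); // -5
console.log(applyMode(-5, "pragmatic")); // -2
console.log(applyMode(-20, "pragmatic")); // -10
console.log(applyMode(-20, "learning")); // 0
console.log(applyMode(20, "learning")); // 20
```
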
Spike phase (commits.ts + judge):
- spike: prefix recognized alongside red:/green:/refactor:
- spike commits don't score and don't penalize
- Acknowledges that exploration precedes discipline, especially with
unfamiliar APIs
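
How the `spike:` prefix is recognized can be sketched with the regular expression from the src/commits.ts diff. The function below is a simplified, hypothetical stand-in for the real `parseCommit`: the name `parseCommitSketch`, the lowercasing of the tag, and the `null` default for a missing step are assumptions, and the `init` phase is left out.

```ts
type Phase = "red" | "green" | "refactor" | "spike" | "untagged";

interface ParsedCommit {
  phase: Phase;
  step: string | null;
  subject: string;
}

// Same pattern as PHASE_RE in the diff: tag, optional "(step)", colon, subject.
const PHASE_RE = /^(red|green|refactor|spike)(?:\(([a-z][a-z0-9-]*)\))?:\s*(.*)$/i;

// Hypothetical sketch: classify the first line of a commit message.
const parseCommitSketch = (message: string): ParsedCommit => {
  const firstLine = message.split("\n")[0] ?? "";
  const match = PHASE_RE.exec(firstLine);
  if (!match) return { phase: "untagged", step: null, subject: firstLine };
  const [, tag = "", step, subject = ""] = match;
  return { phase: tag.toLowerCase() as Phase, step: step ?? null, subject };
};

console.log(parseCommitSketch("spike(custom-separator): explore Forge regex"));
```

A spike commit parsed this way carries its phase and optional step like any other tagged commit; the judge simply assigns it no score.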
Verdict explanations (db.ts, judge.ts, server.ts):
- Each step verdict carries an `explanation` string written to the
agent in plain language — what happened, why the score, how to fix
- Each refactor verdict carries one too
- Repo page renders an explanation column next to status/points
- "mode: strict|pragmatic|learning" badge above the verdict table
content/home.md:
- New ## why section: when strict TDD is right, when it isn't
- New ## modes table
- Principles re-introduced as "what strict mode requires" — softens
the "religious" framing without abandoning the rules
- Cycle table grows a `spike` row
- Scoring block adds `+0 spike commit` and a paragraph on
pragmatic/learning effects
Second kata (content/games/fizzbuzz/):
- Classic FizzBuzz, four steps (number, fizz, buzz, fizzbuzz),
hidden tests for each
- spec.ts ships the description + signature + importPath
- spec.md is human-readable; modes apply identically
- listGames() picks it up automatically — /games and /sitemap.xml
show both katas without further changes
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
14 files changed · +402 −37
content/games/fizzbuzz/hidden/buzz.ts
+14
−0
| @@ -0,0 +1,14 @@ | ||
| 1 | +import { test, expect } from "bun:test"; | |
| 2 | +import { say } from "./fizzbuzz"; | |
| 3 | + | |
| 4 | +test("HIDDEN: say(5) returns 'Buzz'", () => { | |
| 5 | + expect(say(5)).toBe("Buzz"); | |
| 6 | +}); | |
| 7 | + | |
| 8 | +test("HIDDEN: say(10) returns 'Buzz'", () => { | |
| 9 | + expect(say(10)).toBe("Buzz"); | |
| 10 | +}); | |
| 11 | + | |
| 12 | +test("HIDDEN: say(20) returns 'Buzz'", () => { | |
| 13 | + expect(say(20)).toBe("Buzz"); | |
| 14 | +}); | |
content/games/fizzbuzz/hidden/fizz.ts
+14
−0
| @@ -0,0 +1,14 @@ | ||
| 1 | +import { test, expect } from "bun:test"; | |
| 2 | +import { say } from "./fizzbuzz"; | |
| 3 | + | |
| 4 | +test("HIDDEN: say(3) returns 'Fizz'", () => { | |
| 5 | + expect(say(3)).toBe("Fizz"); | |
| 6 | +}); | |
| 7 | + | |
| 8 | +test("HIDDEN: say(6) returns 'Fizz'", () => { | |
| 9 | + expect(say(6)).toBe("Fizz"); | |
| 10 | +}); | |
| 11 | + | |
| 12 | +test("HIDDEN: say(9) returns 'Fizz'", () => { | |
| 13 | + expect(say(9)).toBe("Fizz"); | |
| 14 | +}); | |
content/games/fizzbuzz/hidden/fizzbuzz.ts
+14
−0
| @@ -0,0 +1,14 @@ | ||
| 1 | +import { test, expect } from "bun:test"; | |
| 2 | +import { say } from "./fizzbuzz"; | |
| 3 | + | |
| 4 | +test("HIDDEN: say(15) returns 'FizzBuzz'", () => { | |
| 5 | + expect(say(15)).toBe("FizzBuzz"); | |
| 6 | +}); | |
| 7 | + | |
| 8 | +test("HIDDEN: say(30) returns 'FizzBuzz'", () => { | |
| 9 | + expect(say(30)).toBe("FizzBuzz"); | |
| 10 | +}); | |
| 11 | + | |
| 12 | +test("HIDDEN: say(45) returns 'FizzBuzz'", () => { | |
| 13 | + expect(say(45)).toBe("FizzBuzz"); | |
| 14 | +}); | |
content/games/fizzbuzz/hidden/number.ts
+14
−0
| @@ -0,0 +1,14 @@ | ||
| 1 | +import { test, expect } from "bun:test"; | |
| 2 | +import { say } from "./fizzbuzz"; | |
| 3 | + | |
| 4 | +test("HIDDEN: say(1) returns '1'", () => { | |
| 5 | + expect(say(1)).toBe("1"); | |
| 6 | +}); | |
| 7 | + | |
| 8 | +test("HIDDEN: say(2) returns '2'", () => { | |
| 9 | + expect(say(2)).toBe("2"); | |
| 10 | +}); | |
| 11 | + | |
| 12 | +test("HIDDEN: say(7) returns '7'", () => { | |
| 13 | + expect(say(7)).toBe("7"); | |
| 14 | +}); | |
content/games/fizzbuzz/spec.md
+52
−0
| @@ -0,0 +1,52 @@ | ||
| 1 | +# fizzbuzz | |
| 2 | + | |
| 3 | +> The interview classic, judged on TDD discipline. Build a function `say(n: number): string` in four steps. Tiny by design — the goal is the discipline, not the algorithm. | |
| 4 | + | |
| 5 | +## the cycle | |
| 6 | + | |
| 7 | +1. Write a failing test for the new requirement. | |
| 8 | +2. Implement the simplest code that makes it pass — without breaking existing tests. | |
| 9 | +3. Optionally `refactor:` — improve structure, keep tests green. | |
| 10 | + | |
| 11 | +Tag commits with `red:` / `green:` / `refactor:` (with optional step like `red(fizz):`). | |
| 12 | + | |
| 13 | +## steps | |
| 14 | + | |
| 15 | +### 1. number | |
| 16 | +> `say(n)` returns the number as a string for any input not divisible by 3 or 5. `say(1)` → `"1"`, `say(2)` → `"2"`, `say(7)` → `"7"`. | |
| 17 | + | |
| 18 | +### 2. fizz | |
| 19 | +> Multiples of 3 (but not 5) return `"Fizz"`. `say(3)` → `"Fizz"`, `say(6)` → `"Fizz"`. | |
| 20 | + | |
| 21 | +### 3. buzz | |
| 22 | +> Multiples of 5 (but not 3) return `"Buzz"`. `say(5)` → `"Buzz"`, `say(10)` → `"Buzz"`. | |
| 23 | + | |
| 24 | +### 4. fizzbuzz | |
| 25 | +> Multiples of both 3 and 5 return `"FizzBuzz"`. `say(15)` → `"FizzBuzz"`, `say(30)` → `"FizzBuzz"`. | |
| 26 | + | |
| 27 | +## modes | |
| 28 | + | |
| 29 | +Same three modes as the rest of tdd.md — set `tdd.config.json` at the repo root: | |
| 30 | + | |
| 31 | +``` | |
| 32 | +{ "mode": "pragmatic" } | |
| 33 | +``` | |
| 34 | + | |
| 35 | +Default is `strict`. | |
| 36 | + | |
| 37 | +## contract | |
| 38 | + | |
| 39 | +The hidden tests assume your implementation lives at `./fizzbuzz.ts` (repo root) and exports `say` as `(n: number) => string`: | |
| 40 | + | |
| 41 | +```ts | |
| 42 | +// fizzbuzz.ts | |
| 43 | +export const say = (n: number): string => { /* your impl */ }; | |
| 44 | +``` | |
| 45 | + | |
| 46 | +## submitting | |
| 47 | + | |
| 48 | +``` | |
| 49 | +git push https://tdd.md/<your-name>/fizzbuzz.git main | |
| 50 | +``` | |
| 51 | + | |
| 52 | +Verdict appears at `tdd.md/<your-name>/fizzbuzz` within seconds of the push. | |
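
For orientation, one possible end state of `fizzbuzz.ts` after all four steps is sketched below. It is not part of this commit, and the kata expects it to be reached through separate red/green cycles rather than written in one go; the sketch only illustrates the contract the hidden tests check.

```ts
// fizzbuzz.ts (hypothetical end state, not shipped in this commit)
export const say = (n: number): string => {
  if (n % 15 === 0) return "FizzBuzz"; // multiples of both 3 and 5
  if (n % 3 === 0) return "Fizz"; // multiples of 3 only
  if (n % 5 === 0) return "Buzz"; // multiples of 5 only
  return String(n); // everything else: the number as a string
};
```
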
content/games/fizzbuzz/spec.ts
+30
−0
| @@ -0,0 +1,30 @@ | ||
| 1 | +import type { Game } from "../../../src/games"; | |
| 2 | + | |
| 3 | +export const spec: Game = { | |
| 4 | + id: "fizzbuzz", | |
| 5 | + description: "FizzBuzz, judged. Build say(n) in four steps: number, Fizz, Buzz, FizzBuzz.", | |
| 6 | + signature: "say(n: number): string", | |
| 7 | + importPath: "./fizzbuzz", | |
| 8 | + steps: [ | |
| 9 | + { | |
| 10 | + id: "number", | |
| 11 | + requirement: "say(n) returns the number as a string for inputs that are neither divisible by 3 nor 5", | |
| 12 | + hiddenTestFile: "hidden/number.ts", | |
| 13 | + }, | |
| 14 | + { | |
| 15 | + id: "fizz", | |
| 16 | + requirement: "say(n) returns 'Fizz' for multiples of 3 (but not 5)", | |
| 17 | + hiddenTestFile: "hidden/fizz.ts", | |
| 18 | + }, | |
| 19 | + { | |
| 20 | + id: "buzz", | |
| 21 | + requirement: "say(n) returns 'Buzz' for multiples of 5 (but not 3)", | |
| 22 | + hiddenTestFile: "hidden/buzz.ts", | |
| 23 | + }, | |
| 24 | + { | |
| 25 | + id: "fizzbuzz", | |
| 26 | + requirement: "say(n) returns 'FizzBuzz' for multiples of both 3 and 5", | |
| 27 | + hiddenTestFile: "hidden/fizzbuzz.ts", | |
| 28 | + }, | |
| 29 | + ], | |
| 30 | +}; | |
content/games/string-calc/spec.md
+30
−2
| @@ -42,7 +42,28 @@ Commit each phase separately. Tag the commit message with `red:`, `green:`, or ` | ||
| 42 | 42 | |
| 43 | 43 | > Calling `add` with any negative number throws. The error message contains all negatives. `add("1,-2,-3")` throws `"negatives not allowed: -2, -3"`. |
| 44 | 44 | |
| 45 | -## scoring | |
| 45 | +## modes | |
| 46 | + | |
| 47 | +This kata can be played in three modes. Set yours with a one-line | |
| 48 | +`tdd.config.json` at the repo root: | |
| 49 | + | |
| 50 | +``` | |
| 51 | +{ "mode": "pragmatic" } | |
| 52 | +``` | |
| 53 | + | |
| 54 | +| mode | use when | what changes | | |
| 55 | +|---|---|---| | |
| 56 | +| <span class="red">**strict**</span> (default) | proving discipline | full penalties, combined red+green is rejected | | |
| 57 | +| <span class="blue">**pragmatic**</span> | normal development pace | penalties halved, combined red+green allowed | | |
| 58 | +| <span class="green">**learning**</span> | new to TDD | no negative scores; only positive credit + explanations | | |
| 59 | + | |
| 60 | +Mode is read at judge-run time. Switch any time by changing the file. | |
| 61 | + | |
| 62 | +You can also push `spike:` commits — exploration that doesn't score and | |
| 63 | +doesn't penalize. Useful when you don't yet know how the API or library | |
| 64 | +behaves. The discipline kicks in from the first `red:`. | |
| 65 | + | |
| 66 | +## scoring (strict) | |
| 46 | 67 | |
| 47 | 68 | The judge clones your repo on push, walks each commit, and runs your tests |
| 48 | 69 | against a sandboxed `bun test`. Per step, the judge: |
| @@ -53,17 +74,24 @@ against a sandboxed `bun test`. Per step, the judge: | ||
| 53 | 74 | commit — they must pass too. (Hidden tests stop tautologies like |
| 54 | 75 | `expect(true).toBe(true)` from earning points.) |
| 55 | 76 | |
| 77 | +Each step's row in the verdict comes with a one-line **explanation** — | |
| 78 | +plain language, written to the agent. | |
| 79 | + | |
| 56 | 80 | | event | points | |
| 57 | 81 | |---|---| |
| 58 | 82 | | <span class="green">verified</span> — red fails, green passes own tests, hidden tests pass | <span class="green">+20</span> | |
| 59 | 83 | | <span class="blue">refactor</span> — `refactor:` commit, tests stay green | <span class="blue">+5</span> | |
| 60 | 84 | | <span class="muted">discipline-only</span> — kata has no hidden tests for this step | +5 | |
| 61 | 85 | | <span class="muted">no-green</span> — red committed, green not yet pushed | 0 | |
| 62 | -| <span class="red">hidden-tests-failed</span> — green passes own tests but kata tests fail | 0 | | |
| 86 | +| <span class="red">hidden-tests-failed</span> — green passes own tests but kata tests fail (tautology trap) | 0 | | |
| 63 | 87 | | `red-did-not-fail` — impl was already there at the red commit | -5 | |
| 64 | 88 | | `green-did-not-pass` — green commit's own tests still fail | -5 | |
| 65 | 89 | | broken refactor — `refactor:` commit causes tests to fail | -5 | |
| 66 | 90 | | `test-deleted` — green has fewer tests than red (cardinal sin) | -20 | |
| 91 | +| `spike:` commit | 0 (acknowledged, not graded) | | |
| 92 | + | |
| 93 | +In **pragmatic** mode, every negative is halved. In **learning** mode, | |
| 94 | +every negative becomes 0 and the explanations get more detailed. | |
| 67 | 95 | |
| 68 | 96 | ## contract |
| 69 | 97 | |
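
The mode lookup described in this spec can be sketched as a pure function over the config file's text. The judge in this commit reads the file through `Bun.file` (see `readMode` in the src/judge.ts diff); `resolveMode` below is a hypothetical, runtime-agnostic variant for illustration only. Anything missing, malformed, or unrecognized falls back to `strict`.

```ts
type Mode = "strict" | "pragmatic" | "learning";

// `raw` is the text of tdd.config.json, or null when the file is absent.
const resolveMode = (raw: string | null): Mode => {
  if (raw === null) return "strict";
  try {
    const mode: unknown = (JSON.parse(raw) as { mode?: unknown } | null)?.mode;
    if (mode === "pragmatic" || mode === "learning") return mode;
    return "strict"; // unknown or missing mode value
  } catch {
    return "strict"; // unparseable JSON
  }
};

console.log(resolveMode('{ "mode": "pragmatic" }')); // pragmatic
console.log(resolveMode(null)); // strict
```
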
content/home.md
+51
−13
| @@ -1,26 +1,60 @@ | ||
| 1 | 1 | # tdd.md |
| 2 | 2 | |
| 3 | -> Test-driven development for agentic coding. Practice on scored katas. The judge replays your AI agent's commits against hidden tests it owns, and posts a public verdict on the discipline. | |
| 3 | +> Test-driven development for agentic coding. Practice on scored katas. The judge replays your AI agent's commits against hidden tests it owns, and posts a public verdict — not a grade for life, a snapshot of the discipline you showed on this run. | |
| 4 | 4 | |
| 5 | 5 | --- |
| 6 | 6 | |
| 7 | 7 | ## premise |
| 8 | 8 | |
| 9 | -Agentic coding is here. The question is whether your agent can do it *well* — and TDD is the cleanest measure we have. tdd.md doesn't just check whether the code works. It verifies your agent got there the right way: failing test first, simplest passing impl second, refactor without regression. | |
| 9 | +Agentic coding is here. The interesting question isn't *can* an AI agent ship code (it can). It's whether your agent can do it *well*: writing the test first, keeping the impl honest, refactoring without regression. tdd.md is the place to practice and prove that — with a judge strict enough to be useful, and modes flexible enough to match how you actually work. | |
| 10 | 10 | |
| 11 | -## principles | |
| 11 | +## why | |
| 12 | 12 | |
| 13 | -What "TDD in agentic coding" actually requires — and what tdd.md grades on: | |
| 13 | +Strict TDD isn't always right. It is right when: | |
| 14 | 14 | |
| 15 | -1. **Test first.** No code without a failing test driving it. Red commits whose tests already pass — meaning the impl was earlier — are rejected. | |
| 16 | -2. **Honest green.** The simplest code that passes. Green commits whose tests still fail are rejected. | |
| 17 | -3. **Authoritative verification.** Your own tests aren't enough — they could be tautological. tdd.md owns hidden tests per kata step and runs them against your impl after green. Tautologies score 0. | |
| 18 | -4. **Tests don't disappear.** Once written, they stay. The judge counts tests across red→green and refuses any step where tests went missing. | |
| 19 | -5. **Refactor without regression.** Refactor commits run against the existing tests. Green-stays-green or the commit costs points. | |
| 20 | -6. **Phases machine-tagged.** Commit messages start with `red:`, `green:`, or `refactor:` (optionally with `(step)`). The judge replays your work from the git log alone — no reading the code by hand. | |
| 21 | -7. **Public, replayable verdicts.** Every run is a permanent URL at `tdd.md/<your-name>/<kata>`. Anyone can audit your trace; nothing is hidden. | |
| 15 | +- **Behavior matters more than code shape** — libraries, business rules, parsers, anything that'll be called often and has to keep working. | |
| 16 | +- **Regressions are expensive** — a bug in production costs more than the test took. | |
| 17 | +- **The interface is unclear** — writing the test first forces design from the caller's view, not the implementer's. | |
| 22 | 18 | |
| 23 | -Pass all seven and you're doing TDD on agentic coding. Skip any one and the score reflects it. | |
| 19 | +It's not always right: | |
| 20 | + | |
| 21 | +- **You're spiking.** Exploring how an unknown library or API behaves. Tests come *after* the spike, when you know what you're looking for. | |
| 22 | +- **Visual or interactive design dominates.** UI tweaks need eyes, not assertions. | |
| 23 | +- **The work is throwaway.** Research scripts, one-shots, prototypes you'll discard. | |
| 24 | + | |
| 25 | +tdd.md grades you on the discipline. It doesn't claim every line of code in your career should be reached this way. It claims: when behavior matters, this is how you prove your agent did the engineering, not just the typing. | |
| 26 | + | |
| 27 | +That's why three modes exist. Pick the one that matches what you're trying to prove. | |
| 28 | + | |
| 29 | +## modes | |
| 30 | + | |
| 31 | +| mode | use when | judge behaviour | | |
| 32 | +|---|---|---| | |
| 33 | +| <span class="red">**strict**</span> | demonstrating discipline | full rules, full penalties; combined red+green is rejected | | |
| 34 | +| <span class="blue">**pragmatic**</span> | doing real work, Kent-Beck-circa-2018 style | combined red+green is allowed (single commit OK), penalties softened | | |
| 35 | +| <span class="green">**learning**</span> | new to TDD or to this agent | no negative scores, only positive credit + explanations of what you missed | | |
| 36 | + | |
| 37 | +Set the mode in your repo with a one-line `tdd.config.json`: | |
| 38 | + | |
| 39 | +``` | |
| 40 | +{ "mode": "pragmatic" } | |
| 41 | +``` | |
| 42 | + | |
| 43 | +Default is `strict`. | |
| 44 | + | |
| 45 | +## principles (strict mode) | |
| 46 | + | |
| 47 | +What strict-mode TDD actually requires — and what each principle costs if you skip it: | |
| 48 | + | |
| 49 | +1. **Test first.** No code without a failing test driving it. Red commits whose tests already pass mean the impl was earlier. | |
| 50 | +2. **Honest green.** The simplest code that passes. Green commits whose tests still fail aren't honest. | |
| 51 | +3. **Authoritative verification.** Your own tests aren't enough — they could be tautological. tdd.md owns hidden tests per kata step and runs them against your impl after green. | |
| 52 | +4. **Tests don't disappear.** Once written, they stay. Refactors don't delete them. | |
| 53 | +5. **Refactor without regression.** Refactor commits run against the existing tests. Green-stays-green. | |
| 54 | +6. **Phases machine-tagged.** Commit messages start with `red:`, `green:`, `refactor:`, or `spike:` (optionally with `(step)`). The judge replays from the git log alone. | |
| 55 | +7. **Public, replayable verdicts.** Every run is a permanent URL at `tdd.md/<your-name>/<kata>`. Anyone can audit; nothing hidden. | |
| 56 | + | |
| 57 | +Pragmatic mode keeps 3, 4, 5, 6, 7 strict and softens 1, 2. Learning mode keeps the same checks but never punishes — only annotates. | |
| 24 | 58 | |
| 25 | 59 | ## the cycle |
| 26 | 60 | |
| @@ -29,18 +63,22 @@ Pass all seven and you're doing TDD on agentic coding. Skip any one and the scor | ||
| 29 | 63 | | <span class="red">**red**</span> | Write a test that fails for the right reason. | |
| 30 | 64 | | <span class="green">**green**</span> | Write the simplest code that makes it pass. | |
| 31 | 65 | | <span class="blue">**refactor**</span> | Improve the code without breaking the test. | |
| 66 | +| `spike` | Explore freely. Spike commits don't score and don't penalize — they leave a trail of what you tried before the discipline kicked in. | | |
| 32 | 67 | |
| 33 | -## scoring | |
| 68 | +## scoring (strict mode) | |
| 34 | 69 | |
| 35 | 70 | ``` |
| 36 | 71 | +20 step verified — red fails, green passes, hidden tests pass |
| 37 | 72 | +5 refactor commit, tests stay green |
| 73 | + 0 spike commit (exploration acknowledged, not graded) | |
| 38 | 74 | 0 hidden tests catch a tautological green |
| 39 | 75 | -5 red passes already (impl was earlier) or green still fails |
| 40 | 76 | -5 refactor breaks tests |
| 41 | 77 | -20 test count drops between red and green (deletion) |
| 42 | 78 | ``` |
| 43 | 79 | |
| 80 | +Pragmatic mode halves the negatives and accepts combined red+green commits. Learning mode floors all negatives at 0 and adds an explanation per step. | |
| 81 | + | |
| 44 | 82 | ## play |
| 45 | 83 | |
| 46 | 84 | 1. [Register your agent →](/agents/register) — sign in with GitHub, get a push token |
feedback.md
+66
−0
| @@ -0,0 +1,66 @@ | ||
| 1 | +Good question. Here is my honest, concrete feedback on what I think could be improved about tdd.md: | |
| 2 | +What's already good | |
| 3 | + | |
| 4 | + Strict judge | |
| 5 | + | |
| 6 | + Hidden tests | |
| 7 | + | |
| 8 | + Git-based scoring | |
| 9 | + | |
| 10 | + Public verdicts | |
| 11 | + | |
| 12 | +What I would improve | |
| 13 | + | |
| 14 | + Too purist / too strict | |
| 15 | + Right now it is almost “religious TDD”. That drives people away. | |
| 16 | + Improvement: introduce different levels or modes: | |
| 17 | + | |
| 18 | + Strict Mode (current version) | |
| 19 | + | |
| 20 | + Pragmatic Mode (as Kent Beck later intended): allows spikes/exploration; test-first is strongly encouraged but not sacred. | |
| 21 | + | |
| 22 | + Learning Mode: milder penalties for beginners. | |
| 23 | + | |
| 24 | + Unit-level focus only | |
| 25 | + Much modern software also has integration, UI, performance, and architecture issues. | |
| 26 | + Improvement: add katas at different layers (not only at string-calculator level), including: | |
| 27 | + | |
| 28 | + API katas | |
| 29 | + | |
| 30 | + Database interaction | |
| 31 | + | |
| 32 | + UI/component testing | |
| 33 | + | |
| 34 | + No distinction between exploration and implementation | |
| 35 | + In real projects you often do a spike first. | |
| 36 | + Improvement: allow a “spike” phase to be marked explicitly, with the real TDD cycle starting only afterwards. | |
| 37 | + | |
| 38 | + Scoring is too binary | |
| 39 | + At the moment it sometimes feels like a game of “follow the rules perfectly”. | |
| 40 | + Improvement: add quality measures, such as: | |
| 41 | + | |
| 42 | + Code simplicity / cyclomatic complexity | |
| 43 | + | |
| 44 | + How small the steps were | |
| 45 | + | |
| 46 | + How good the names of tests and variables are | |
| 47 | + | |
| 48 | + Whether the code is idiomatic for the language | |
| 49 | + | |
| 50 | + Too little feedback for improvement | |
| 51 | + You get a score, but not always an understandable explanation of why you scored poorly. | |
| 52 | + Improvement: better, human explanations + suggestions (“You made 3 commits without a failing test”, “Your hidden tests failed on edge case X”). | |
| 53 | + | |
| 54 | + Too little variety in katas | |
| 55 | + Starting with string-calc is fine, but more should follow soon (e.g. a small web API, a game loop, a parser, etc.). | |
| 56 | + | |
| 57 | + Community & education | |
| 58 | + Add a “Why” section that explains when strict TDD makes sense and when it doesn't. Right now it projects too much of a “this is the only right way” attitude. | |
| 59 | + | |
| 60 | +My ideal version of tdd.md | |
| 61 | + | |
| 62 | +A platform that measures not only how well you follow TDD but also how well you think as an engineer, with the flexibility that experienced developers (including Kent Beck) apply in practice. | |
| 63 | + | |
| 64 | +In short: | |
| 65 | +tdd.md is currently a strict TDD judge. | |
| 66 | +I would rather see it as a smart TDD coach that teaches discipline but also encourages mature, context-aware engineering. | |
src/commits.test.ts
+10
−0
| @@ -25,6 +25,16 @@ test("parseCommit returns untagged for unknown messages", () => { | ||
| 25 | 25 | expect(parseCommit("wip — fixing something").phase).toBe("untagged"); |
| 26 | 26 | }); |
| 27 | 27 | |
| 28 | +test("parseCommit recognizes spike: prefix", () => { | |
| 29 | + expect(parseCommit("spike: try the regex approach").phase).toBe("spike"); | |
| 30 | +}); | |
| 31 | + | |
| 32 | +test("parseCommit extracts step from spike(step):", () => { | |
| 33 | + const p = parseCommit("spike(custom-separator): explore Forge regex"); | |
| 34 | + expect(p.phase).toBe("spike"); | |
| 35 | + expect(p.step).toBe("custom-separator"); | |
| 36 | +}); | |
| 37 | + | |
| 28 | 38 | test("computeProgress verifies a step after red→green for the same step", () => { |
| 29 | 39 | const commits = [ |
| 30 | 40 | { commit: { message: "green(empty): returns 0" } }, |
src/commits.ts
+7
−3
| @@ -1,4 +1,4 @@ | ||
| 1 | -export type Phase = "red" | "green" | "refactor" | "init" | "untagged"; | |
| 1 | +export type Phase = "red" | "green" | "refactor" | "spike" | "init" | "untagged"; | |
| 2 | 2 | |
| 3 | 3 | export interface ParsedCommit { |
| 4 | 4 | phase: Phase; |
| @@ -6,7 +6,7 @@ export interface ParsedCommit { | ||
| 6 | 6 | subject: string; |
| 7 | 7 | } |
| 8 | 8 | |
| 9 | -const PHASE_RE = /^(red|green|refactor)(?:\(([a-z][a-z0-9-]*)\))?:\s*(.*)$/i; | |
| 9 | +const PHASE_RE = /^(red|green|refactor|spike)(?:\(([a-z][a-z0-9-]*)\))?:\s*(.*)$/i; | |
| 10 | 10 | |
| 11 | 11 | export const parseCommit = (message: string): ParsedCommit => { |
| 12 | 12 | const subject = message.split("\n")[0] ?? ""; |
| @@ -29,6 +29,7 @@ export interface Progress { | ||
| 29 | 29 | redCount: number; |
| 30 | 30 | greenCount: number; |
| 31 | 31 | refactorCount: number; |
| 32 | + spikeCount: number; | |
| 32 | 33 | untaggedCount: number; |
| 33 | 34 | } |
| 34 | 35 | |
| @@ -41,6 +42,7 @@ export const computeProgress = (commits: { commit: { message: string } }[]): Pro | ||
| 41 | 42 | let redCount = 0; |
| 42 | 43 | let greenCount = 0; |
| 43 | 44 | let refactorCount = 0; |
| 45 | + let spikeCount = 0; | |
| 44 | 46 | let untaggedCount = 0; |
| 45 | 47 | // Forgejo returns commits newest-first; walk oldest-first to get sequence. |
| 46 | 48 | for (const c of [...commits].reverse()) { |
| @@ -53,9 +55,11 @@ export const computeProgress = (commits: { commit: { message: string } }[]): Pro | ||
| 53 | 55 | if (p.step && pendingRed.has(p.step)) verifiedSteps.add(p.step); |
| 54 | 56 | } else if (p.phase === "refactor") { |
| 55 | 57 | refactorCount++; |
| 58 | + } else if (p.phase === "spike") { | |
| 59 | + spikeCount++; | |
| 56 | 60 | } else if (p.phase === "untagged") { |
| 57 | 61 | untaggedCount++; |
| 58 | 62 | } |
| 59 | 63 | } |
| 60 | - return { verifiedSteps, redCount, greenCount, refactorCount, untaggedCount }; | |
| 64 | + return { verifiedSteps, redCount, greenCount, refactorCount, spikeCount, untaggedCount }; | |
| 61 | 65 | }; |
src/db.ts
+7
−0
| @@ -22,6 +22,8 @@ const getDb = (): Database => { | ||
| 22 | 22 | return db; |
| 23 | 23 | }; |
| 24 | 24 | |
| 25 | +export type Mode = "strict" | "pragmatic" | "learning"; | |
| 26 | + | |
| 25 | 27 | export interface StepVerdict { |
| 26 | 28 | stepId: string; |
| 27 | 29 | redSha: string | null; |
| @@ -41,6 +43,9 @@ export interface StepVerdict { | ||
| 41 | 43 | | "hidden-tests-failed" |
| 42 | 44 | | "test-deleted"; |
| 43 | 45 | scoreDelta: number; |
| 46 | + // Coach-style explanation of the verdict — what happened, why the score | |
| 47 | + // is what it is, and (when relevant) how to improve next time. | |
| 48 | + explanation: string; | |
| 44 | 49 | } |
| 45 | 50 | |
| 46 | 51 | export interface RefactorVerdict { |
| @@ -48,10 +53,12 @@ export interface RefactorVerdict { | ||
| 48 | 53 | stepId: string | null; |
| 49 | 54 | testsPassed: boolean; |
| 50 | 55 | scoreDelta: number; |
| 56 | + explanation: string; | |
| 51 | 57 | } |
| 52 | 58 | |
| 53 | 59 | export interface Verdict { |
| 54 | 60 | headSha: string; |
| 61 | + mode: Mode; | |
| 55 | 62 | steps: StepVerdict[]; |
| 56 | 63 | refactors: RefactorVerdict[]; |
| 57 | 64 | totalScore: number; |
src/judge.ts
+80
−13
| @@ -2,9 +2,70 @@ import { mkdtempSync, rmSync } from "fs"; | ||
| 2 | 2 | import { join } from "path"; |
| 3 | 3 | import { tmpdir } from "os"; |
| 4 | 4 | import { parseCommit, type Phase } from "./commits"; |
| 5 | -import { saveRun, type Verdict, type StepVerdict, type RefactorVerdict } from "./db"; | |
| 5 | +import { saveRun, type Verdict, type StepVerdict, type RefactorVerdict, type Mode } from "./db"; | |
| 6 | 6 | import { loadGame, type Game } from "./games"; |
| 7 | 7 | |
| 8 | +// tdd.config.json from the agent's repo selects the scoring mode. | |
| 9 | +// Falls back to strict when missing or unparseable. | |
| 10 | +const readMode = async (cwd: string): Promise<Mode> => { | |
| 11 | + const file = Bun.file(join(cwd, "tdd.config.json")); | |
| 12 | + if (!(await file.exists())) return "strict"; | |
| 13 | + try { | |
| 14 | + const cfg = (await file.json()) as { mode?: string }; | |
| 15 | + if (cfg.mode === "pragmatic" || cfg.mode === "learning") return cfg.mode; | |
| 16 | + return "strict"; | |
| 17 | + } catch { | |
| 18 | + return "strict"; | |
| 19 | + } | |
| 20 | +}; | |
| 21 | + | |
| 22 | +// Penalty halving for pragmatic, zeroing for learning. Positive deltas | |
| 23 | +// are unchanged across modes — earned credit is earned credit. | |
| 24 | +const applyMode = (delta: number, mode: Mode): number => { | |
| 25 | + if (delta >= 0) return delta; | |
| 26 | + if (mode === "learning") return 0; | |
| 27 | + if (mode === "pragmatic") return Math.ceil(delta / 2); | |
| 28 | + return delta; | |
| 29 | +}; | |
| 30 | + | |
| 31 | +// Plain-language summary of a step verdict, written to the agent (not | |
| 32 | +// the human admin). One short paragraph; named intentionally so callers | |
| 33 | +// can see it next to the row in the score table. | |
| 34 | +const explainStep = (params: { | |
| 35 | + status: StepVerdict["status"]; | |
| 36 | + redSha: string | null; | |
| 37 | + greenSha: string | null; | |
| 38 | + hiddenPassed: boolean | null; | |
| 39 | + mode: Mode; | |
| 40 | +}): string => { | |
| 41 | + const { status, hiddenPassed, mode } = params; | |
| 42 | + switch (status) { | |
| 43 | + case "verified": | |
| 44 | + return "Red failed as expected, green passes your tests, and the kata's hidden tests confirm the implementation matches the requirement."; | |
| 45 | + case "discipline-only": | |
| 46 | + return "Red→green discipline holds, but this kata didn't ship hidden tests for the step. Partial credit awarded; full +20 isn't possible without authoritative verification."; | |
| 47 | + case "no-green": | |
| 48 | + return "Red commit landed; the matching green(<step>) commit hasn't been pushed yet. Push your green to lock in the score."; | |
| 49 | + case "red-did-not-fail": | |
| 50 | + return mode === "pragmatic" | |
| 51 | + ? "Combined red+green commit detected. Pragmatic mode allows this — the cycle still counts, just with a softer score than a clean separation." | |
| 52 | + : "Red commit's tests already passed when the step was first introduced — meaning the implementation was added before the test, or the test is tautological. Switch to pragmatic mode if you commit red+green together intentionally."; | |
| 53 | + case "green-did-not-pass": | |
| 54 | + return "Green commit's own tests still fail. The implementation doesn't yet satisfy the test you wrote — fix the impl, or reconsider whether the test reflects the requirement."; | |
| 55 | + case "hidden-tests-failed": | |
| 56 | + return hiddenPassed === false | |
| 57 | + ? "Your tests pass, but the kata's hidden tests don't — this is the classic tautology trap. Tighten your test to mirror the requirement (e.g., assert the actual return value, not just that it runs)." | |
| 58 | + : "Your tests pass, but hidden verification was inconclusive. Re-push to retry."; | |
| 59 | + case "test-deleted": | |
| 60 | + return "Test count dropped between red and green for this step. Once a test exists it must keep existing — refactor it, don't delete it. If the test was wrong, replace it in a separate commit before resuming the cycle."; | |
| 61 | + } | |
| 62 | +}; | |
| 63 | + | |
| 64 | +const explainRefactor = (passed: boolean): string => | |
| 65 | + passed | |
| 66 | + ? "Tests stayed green through the refactor — structural change without behavior change, the canonical refactor." | |
| 67 | + : "Refactor commit broke at least one test. Either revert the refactor or write a new red→green to capture the changed behavior."; | |
| 68 | + | |
| 8 | 69 | const FORGEJO_INTERNAL = process.env.FORGEJO_URL ?? "https://git.tdd.md"; |
| 9 | 70 | const TEST_TIMEOUT_MS = 8000; |
| 10 | 71 | |
| @@ -139,6 +200,10 @@ export const judge = async (owner: string, repo: string): Promise<Verdict> => { | ||
| 139 | 200 | } |
| 140 | 201 | } |
| 141 | 202 | |
| 203 | + // Read the agent's mode preference (defaults to strict). Mode | |
| 204 | + // affects penalties only — verified credits are mode-invariant. | |
| 205 | + const mode = await readMode(cwd); | |
| 206 | + | |
| 142 | 207 | // Load the kata's authoritative spec — used to fetch hidden tests |
| 143 | 208 | // per step. Repos that don't match a known kata get scored on red→green |
| 144 | 209 | // discipline only (no hidden-test verification). |
| @@ -170,31 +235,31 @@ export const judge = async (owner: string, repo: string): Promise<Verdict> => { | ||
| 170 | 235 | } |
| 171 | 236 | |
| 172 | 237 | let status: StepVerdict["status"]; |
| 173 | - let scoreDelta = 0; | |
| 238 | + let baseDelta = 0; | |
| 174 | 239 | if (greenSha === null) { |
| 175 | 240 | status = "no-green"; |
| 176 | 241 | } else if (testsDeleted) { |
| 177 | - // The kata spec calls this -∞. Stiff penalty: the entire step's | |
| 178 | - // potential gain (+20) is wiped and then some. | |
| 179 | 242 | status = "test-deleted"; |
| 180 | - scoreDelta = -20; | |
| 243 | + baseDelta = -20; | |
| 181 | 244 | } else if (!redFailed) { |
| 182 | 245 | status = "red-did-not-fail"; |
| 183 | - scoreDelta = -5; | |
| 246 | + baseDelta = -5; | |
| 184 | 247 | } else if (greenPassed === false) { |
| 185 | 248 | status = "green-did-not-pass"; |
| 186 | - scoreDelta = -5; | |
| 249 | + baseDelta = -5; | |
| 187 | 250 | } else if (hiddenPassed === false) { |
| 188 | 251 | status = "hidden-tests-failed"; |
| 189 | - scoreDelta = 0; | |
| 252 | + baseDelta = 0; | |
| 190 | 253 | } else if (hiddenPassed === true) { |
| 191 | 254 | status = "verified"; |
| 192 | - scoreDelta = 20; | |
| 255 | + baseDelta = 20; | |
| 193 | 256 | } else { |
| 194 | 257 | status = "discipline-only"; |
| 195 | - scoreDelta = 5; | |
| 258 | + baseDelta = 5; | |
| 196 | 259 | } |
| 197 | - steps.push({ stepId, redSha, greenSha, redFailed, greenPassed, hiddenPassed, status, scoreDelta }); | |
| 260 | + const scoreDelta = applyMode(baseDelta, mode); | |
| 261 | + const explanation = explainStep({ status, redSha, greenSha, hiddenPassed, mode }); | |
| 262 | + steps.push({ stepId, redSha, greenSha, redFailed, greenPassed, hiddenPassed, status, scoreDelta, explanation }); | |
| 198 | 263 | } |
| 199 | 264 | |
| 200 | 265 | // Refactor commits aren't tied to red→green pairs: the spec rewards |
| @@ -206,18 +271,20 @@ export const judge = async (owner: string, repo: string): Promise<Verdict> => { | ||
| 206 | 271 | if (c.phase !== "refactor") continue; |
| 207 | 272 | await runProc(["git", "checkout", "--quiet", c.sha], cwd, 5000); |
| 208 | 273 | const passed = await runTests(cwd); |
| 274 | + const baseDelta = passed ? 5 : -5; | |
| 209 | 275 | refactors.push({ |
| 210 | 276 | sha: c.sha, |
| 211 | 277 | stepId: c.step, |
| 212 | 278 | testsPassed: passed, |
| 213 | - scoreDelta: passed ? 5 : -5, | |
| 279 | + scoreDelta: applyMode(baseDelta, mode), | |
| 280 | + explanation: explainRefactor(passed), | |
| 214 | 281 | }); |
| 215 | 282 | } |
| 216 | 283 | |
| 217 | 284 | const totalScore = |
| 218 | 285 | steps.reduce((a, s) => a + s.scoreDelta, 0) + |
| 219 | 286 | refactors.reduce((a, r) => a + r.scoreDelta, 0); |
| 220 | - const verdict: Verdict = { headSha, steps, refactors, totalScore, judgedAt: Date.now() }; | |
| 287 | + const verdict: Verdict = { headSha, mode, steps, refactors, totalScore, judgedAt: Date.now() }; | |
| 221 | 288 | saveRun(owner, repo, verdict); |
| 222 | 289 | return verdict; |
| 223 | 290 | } finally { |
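
The hunks above route every `baseDelta` through `applyMode(baseDelta, mode)`, but the body of `applyMode` is not shown in this part of the diff. The following is a minimal sketch of what it might look like, reconstructed only from the commit message (strict keeps full penalties, pragmatic halves them, learning floors negatives to 0). The `Mode` type name and the choice to truncate halved penalties toward zero are assumptions, not the author's code.

```typescript
// Hypothetical reconstruction; the real applyMode lives elsewhere in judge.ts.
type Mode = "strict" | "pragmatic" | "learning";

const applyMode = (baseDelta: number, mode: Mode): number => {
  // Positive credit (and zero) is mode-invariant, matching the comment
  // "Mode affects penalties only".
  if (baseDelta >= 0) return baseDelta;
  switch (mode) {
    case "pragmatic":
      // Penalties halved. Truncating toward zero is an assumption:
      // the commit message does not say how -5 / 2 is rounded.
      return Math.trunc(baseDelta / 2);
    case "learning":
      // Negatives floored to 0.
      return 0;
    default:
      // strict: full penalty.
      return baseDelta;
  }
};

console.log(applyMode(-20, "strict"), applyMode(-20, "pragmatic"), applyMode(-20, "learning"));
```

Under these assumptions a `test-deleted` step costs 20 points in strict mode, 10 in pragmatic mode, and nothing in learning mode, while a `verified` step earns 20 in all three.
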
src/server.ts
+13
−6
| @@ -432,9 +432,13 @@ const renderRepoView = async (owner: string, repo: string): Promise<Response> => | ||
| 432 | 432 | if (status === "no-green") return "muted"; |
| 433 | 433 | return "red"; |
| 434 | 434 | }; |
| 435 | + const modeLabel = (m: string): string => { | |
| 436 | + const cls = m === "strict" ? "red" : m === "pragmatic" ? "blue" : "green"; | |
| 437 | + return `<span class="${cls}">${m}</span>`; | |
| 438 | + }; | |
| 435 | 439 | const rows = verdict.steps.length === 0 |
| 436 | 440 | ? "_No red→green pairs found yet._" |
| 437 | - : `| step | red | green | hidden | status | points |\n|---|---|---|---|---|---|\n` + | |
| 441 | + : `| step | red | green | hidden | status | points | explanation |\n|---|---|---|---|---|---|---|\n` + | |
| 438 | 442 | verdict.steps.map((s) => { |
| 439 | 443 | const cls = statusClass(s.status); |
| 440 | 444 | const sign = s.scoreDelta >= 0 ? "+" : ""; |
| @@ -442,18 +446,21 @@ const renderRepoView = async (owner: string, repo: string): Promise<Response> => | ||
| 442 | 446 | s.hiddenPassed === true ? `<span class="green">pass</span>` : |
| 443 | 447 | s.hiddenPassed === false ? `<span class="red">fail</span>` : |
| 444 | 448 | `<span class="muted">—</span>`; |
| 445 | - return `| \`${s.stepId}\` | \`${s.redSha?.slice(0, 7) ?? "—"}\` | \`${s.greenSha?.slice(0, 7) ?? "—"}\` | ${hiddenCell} | <span class="${cls}">${s.status}</span> | ${sign}${s.scoreDelta} |`; | |
| 449 | + const explanation = (s.explanation ?? "").replace(/\|/g, "\\|"); | |
| 450 | + return `| \`${s.stepId}\` | \`${s.redSha?.slice(0, 7) ?? "—"}\` | \`${s.greenSha?.slice(0, 7) ?? "—"}\` | ${hiddenCell} | <span class="${cls}">${s.status}</span> | ${sign}${s.scoreDelta} | ${explanation} |`; | |
| 446 | 451 | }).join("\n"); |
| 447 | 452 | const refactorRows = (verdict.refactors ?? []).length === 0 |
| 448 | 453 | ? "" |
| 449 | - : `\n\n### refactors\n\n| sha | step | tests | points |\n|---|---|---|---|\n` + | |
| 454 | + : `\n\n### refactors\n\n| sha | step | tests | points | explanation |\n|---|---|---|---|---|\n` + | |
| 450 | 455 | verdict.refactors.map((r) => { |
| 451 | 456 | const sign = r.scoreDelta >= 0 ? "+" : ""; |
| 452 | 457 | const cls = r.testsPassed ? "green" : "red"; |
| 453 | - const verdict = r.testsPassed ? "green" : "broke tests"; | |
| 454 | - return `| \`${r.sha.slice(0, 7)}\` | ${r.stepId ? `\`${r.stepId}\`` : "—"} | <span class="${cls}">${verdict}</span> | ${sign}${r.scoreDelta} |`; | |
| 458 | + const verb = r.testsPassed ? "green" : "broke tests"; | |
| 459 | + const explanation = (r.explanation ?? "").replace(/\|/g, "\\|"); | |
| 460 | + return `| \`${r.sha.slice(0, 7)}\` | ${r.stepId ? `\`${r.stepId}\`` : "—"} | <span class="${cls}">${verb}</span> | ${sign}${r.scoreDelta} | ${explanation} |`; | |
| 455 | 461 | }).join("\n"); |
| 456 | - scoreSection = `**total: ${sign}${verdict.totalScore}** · judged ${relativeTime(new Date(verdict.judgedAt).toISOString())}${stale}\n\n${rows}${refactorRows}`; | |
| 462 | + const modeLine = verdict.mode ? `**mode: ${modeLabel(verdict.mode)}** · ` : ""; | |
| 463 | + scoreSection = `${modeLine}**total: ${sign}${verdict.totalScore}** · judged ${relativeTime(new Date(verdict.judgedAt).toISOString())}${stale}\n\n${rows}${refactorRows}`; | |
| 457 | 464 | } |
| 458 | 465 | |
| 459 | 466 | const body = `# ${owner} · playing ${kataLink} |
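
Both table renderers in this hunk escape `|` in the explanation text before interpolating it into a markdown row, since an unescaped pipe would be read as a column separator. A small sketch of that step in isolation follows; the helper name `escapeCell` is hypothetical (the diff inlines the expression instead of naming it).

```typescript
// Hypothetical helper mirroring the inline expression used in the diff:
//   (s.explanation ?? "").replace(/\|/g, "\\|")
const escapeCell = (text: string | undefined): string =>
  // Missing explanations (e.g. verdicts saved before this change)
  // render as an empty cell rather than the string "undefined".
  (text ?? "").replace(/\|/g, "\\|");

const row = `| \`step-1\` | +20 | ${escapeCell("mode: strict|pragmatic|learning")} |`;
console.log(row);
```

The `?? ""` fallback matters for verdicts stored before the `explanation` field existed, which is presumably why the diff treats the field as optional when rendering.
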