syntaxai/tdd.md · commit 537840e

Coach mode: modes, spike phase, explanations, second kata, why

Reframes tdd.md from strict-only judge to a coach with three modes,
based on real-user feedback ("too purist") plus field research
(Bache, Tweag, Fox, Alexop) all converging on the same point: strict
red→green separation isn't always how experienced TDD developers work,
and treating it as the only valid form pushes people away.

Modes (read from tdd.config.json at repo root, default strict):
- strict   — current behavior, full penalties, combined red+green rejected
- pragmatic — penalties halved, combined red+green accepted (Kent-Beck-circa-2018)
- learning — negatives floored to 0; only positive credit + verbose
              explanations (for newcomers)
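
The penalty arithmetic behind these modes can be sketched as follows. `applyMode` is copied from the `src/judge.ts` hunk in this commit; the `baseDeltas` list and the `table` built from it are illustration only. One detail worth noticing: because of `Math.ceil`, a halved -5 lands on -2 rather than -2.5.

```typescript
type Mode = "strict" | "pragmatic" | "learning";

// Same logic as applyMode in src/judge.ts (this commit).
const applyMode = (delta: number, mode: Mode): number => {
  if (delta >= 0) return delta; // earned credit is never reduced
  if (mode === "learning") return 0; // negatives floored to 0
  if (mode === "pragmatic") return Math.ceil(delta / 2); // -5 -> -2, -20 -> -10
  return delta; // strict: full penalty
};

// Illustrative only: base deltas taken from the strict scoring table.
const baseDeltas = [20, 5, 0, -5, -20];
const modes: Mode[] = ["strict", "pragmatic", "learning"];

// One row per mode, showing what each base delta becomes.
const table = modes.map((mode) => ({
  mode,
  deltas: baseDeltas.map((d) => applyMode(d, mode)),
}));
```

Positive deltas pass through unchanged in every mode, so a verified step is worth +20 whichever mode the repo selects.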

Spike phase (commits.ts + judge):
- spike: prefix recognized alongside red:/green:/refactor:
- spike commits don't score and don't penalize
- Acknowledges that exploration precedes discipline, especially with
  unfamiliar APIs
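
A minimal sketch of how a `spike:` prefix might be recognized next to the other phases. The names `parsePrefix`, `ParsedPrefix` and `PREFIX_RE` are hypothetical, and the pattern is an assumption about the intended shape (phase, optional `(step)`, colon, subject). The real `parseCommit` in `src/commits.ts` also knows an `init` phase, which this sketch leaves out.

```typescript
type Phase = "red" | "green" | "refactor" | "spike" | "untagged";

interface ParsedPrefix {
  phase: Phase;
  step: string | null;
  subject: string;
}

// Assumed pattern: phase, optional "(step)", colon, then the subject.
const PREFIX_RE = /^(red|green|refactor|spike)(?:\(([a-z][a-z0-9-]*)\))?:\s*(.*)$/i;

const parsePrefix = (message: string): ParsedPrefix => {
  // Only the first line of the commit message carries the tag.
  const firstLine = message.split("\n")[0] ?? "";
  const m = PREFIX_RE.exec(firstLine);
  if (!m) return { phase: "untagged", step: null, subject: firstLine };
  return {
    phase: (m[1] ?? "").toLowerCase() as Phase,
    step: m[2] ?? null, // undefined when no "(step)" was given
    subject: m[3] ?? "",
  };
};
```

The two messages used in `src/commits.test.ts`, `spike: try the regex approach` and `spike(custom-separator): explore Forge regex`, both parse as `spike` here, the second with step `custom-separator`.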

Verdict explanations (db.ts, judge.ts, server.ts):
- Each step verdict carries an `explanation` string written to the
  agent in plain language — what happened, why the score, how to fix
- Each refactor verdict carries one too
- Repo page renders an explanation column next to status/points
- "mode: strict|pragmatic|learning" badge above the verdict table

content/home.md:
- New ## why section: when strict TDD is right, when it isn't
- New ## modes table
- Principles re-introduced as "what strict mode requires" — softens
  the "religious" framing without abandoning the rules
- Cycle table grows a `spike` row
- Scoring block adds `+0 spike commit` and a paragraph on
  pragmatic/learning effects

Second kata (content/games/fizzbuzz/):
- Classic FizzBuzz, four steps (number, fizz, buzz, fizzbuzz),
  hidden tests for each
- spec.ts ships the description + signature + importPath
- spec.md is human-readable; modes apply identically
- listGames() picks it up automatically — /games and /sitemap.xml
  show both katas without further changes

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

author: syntaxai <[email protected]>
date: 2026-05-03 21:56:51 +01:00
parent: f86b478
commit: 537840e64c2e373ffb6eac368a35e9abb2e2c78f

14 files changed · +402 −37

added content/games/fizzbuzz/hidden/buzz.ts +14 −0

@@ -0,0 +1,14 @@
	1	+import { test, expect } from "bun:test";
	2	+import { say } from "./fizzbuzz";
	3	+
	4	+test("HIDDEN: say(5) returns 'Buzz'", () => {
	5	+ expect(say(5)).toBe("Buzz");
	6	+});
	7	+
	8	+test("HIDDEN: say(10) returns 'Buzz'", () => {
	9	+ expect(say(10)).toBe("Buzz");
	10	+});
	11	+
	12	+test("HIDDEN: say(20) returns 'Buzz'", () => {
	13	+ expect(say(20)).toBe("Buzz");
	14	+});

added content/games/fizzbuzz/hidden/fizz.ts +14 −0

@@ -0,0 +1,14 @@
	1	+import { test, expect } from "bun:test";
	2	+import { say } from "./fizzbuzz";
	3	+
	4	+test("HIDDEN: say(3) returns 'Fizz'", () => {
	5	+ expect(say(3)).toBe("Fizz");
	6	+});
	7	+
	8	+test("HIDDEN: say(6) returns 'Fizz'", () => {
	9	+ expect(say(6)).toBe("Fizz");
	10	+});
	11	+
	12	+test("HIDDEN: say(9) returns 'Fizz'", () => {
	13	+ expect(say(9)).toBe("Fizz");
	14	+});

added content/games/fizzbuzz/hidden/fizzbuzz.ts +14 −0

@@ -0,0 +1,14 @@
	1	+import { test, expect } from "bun:test";
	2	+import { say } from "./fizzbuzz";
	3	+
	4	+test("HIDDEN: say(15) returns 'FizzBuzz'", () => {
	5	+ expect(say(15)).toBe("FizzBuzz");
	6	+});
	7	+
	8	+test("HIDDEN: say(30) returns 'FizzBuzz'", () => {
	9	+ expect(say(30)).toBe("FizzBuzz");
	10	+});
	11	+
	12	+test("HIDDEN: say(45) returns 'FizzBuzz'", () => {
	13	+ expect(say(45)).toBe("FizzBuzz");
	14	+});

added content/games/fizzbuzz/hidden/number.ts +14 −0

@@ -0,0 +1,14 @@
	1	+import { test, expect } from "bun:test";
	2	+import { say } from "./fizzbuzz";
	3	+
	4	+test("HIDDEN: say(1) returns '1'", () => {
	5	+ expect(say(1)).toBe("1");
	6	+});
	7	+
	8	+test("HIDDEN: say(2) returns '2'", () => {
	9	+ expect(say(2)).toBe("2");
	10	+});
	11	+
	12	+test("HIDDEN: say(7) returns '7'", () => {
	13	+ expect(say(7)).toBe("7");
	14	+});

added content/games/fizzbuzz/spec.md +52 −0

@@ -0,0 +1,52 @@
	1	+# fizzbuzz
	2	+
	3	+> The interview classic, judged on TDD discipline. Build a function `say(n: number): string` in four steps. Tiny by design — the goal is the discipline, not the algorithm.
	4	+
	5	+## the cycle
	6	+
	7	+1. Write a failing test for the new requirement.
	8	+2. Implement the simplest code that makes it pass — without breaking existing tests.
	9	+3. Optionally `refactor:` — improve structure, keep tests green.
	10	+
	11	+Tag commits with `red:` / `green:` / `refactor:` (with optional step like `red(fizz):`).
	12	+
	13	+## steps
	14	+
	15	+### 1. number
	16	+> `say(n)` returns the number as a string for any input not divisible by 3 or 5. `say(1)` → `"1"`, `say(2)` → `"2"`, `say(7)` → `"7"`.
	17	+
	18	+### 2. fizz
	19	+> Multiples of 3 (but not 5) return `"Fizz"`. `say(3)` → `"Fizz"`, `say(6)` → `"Fizz"`.
	20	+
	21	+### 3. buzz
	22	+> Multiples of 5 (but not 3) return `"Buzz"`. `say(5)` → `"Buzz"`, `say(10)` → `"Buzz"`.
	23	+
	24	+### 4. fizzbuzz
	25	+> Multiples of both 3 and 5 return `"FizzBuzz"`. `say(15)` → `"FizzBuzz"`, `say(30)` → `"FizzBuzz"`.
	26	+
	27	+## modes
	28	+
	29	+Same three modes as the rest of tdd.md — set `tdd.config.json` at the repo root:
	30	+
	31	+```
	32	+{ "mode": "pragmatic" }
	33	+```
	34	+
	35	+Default is `strict`.
	36	+
	37	+## contract
	38	+
	39	+The hidden tests assume your implementation lives at `./fizzbuzz.ts` (repo root) and exports `say` as `(n: number) => string`:
	40	+
	41	+```ts
	42	+// fizzbuzz.ts
	43	+export const say = (n: number): string => { /* your impl */ };
	44	+```
	45	+
	46	+## submitting
	47	+
	48	+```
	49	+git push https://tdd.md/<your-name>/fizzbuzz.git main
	50	+```
	51	+
	52	+Verdict appears at `tdd.md/<your-name>/fizzbuzz` within seconds of the push.
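
For orientation, here is one possible end state of `fizzbuzz.ts` that satisfies the four step descriptions above. It is a sketch rather than the kata's reference solution, and in an actual run it would be reached one `red:`/`green:` pair at a time instead of being written in one go.

```typescript
// fizzbuzz.ts: one possible implementation, not an official solution.
export const say = (n: number): string => {
  if (n % 15 === 0) return "FizzBuzz"; // step 4: multiples of both 3 and 5
  if (n % 3 === 0) return "Fizz"; // step 2: multiples of 3 (but not 5)
  if (n % 5 === 0) return "Buzz"; // step 3: multiples of 5 (but not 3)
  return String(n); // step 1: everything else
};
```

The combined case is checked first so that 15, 30 and 45 do not fall into the plain `Fizz` branch.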

added content/games/fizzbuzz/spec.ts +30 −0

@@ -0,0 +1,30 @@
	1	+import type { Game } from "../../../src/games";
	2	+
	3	+export const spec: Game = {
	4	+ id: "fizzbuzz",
	5	+ description: "FizzBuzz, judged. Build say(n) in four steps: number, Fizz, Buzz, FizzBuzz.",
	6	+ signature: "say(n: number): string",
	7	+ importPath: "./fizzbuzz",
	8	+ steps: [
	9	+ {
	10	+ id: "number",
	11	+ requirement: "say(n) returns the number as a string for inputs that are neither divisible by 3 nor 5",
	12	+ hiddenTestFile: "hidden/number.ts",
	13	+ },
	14	+ {
	15	+ id: "fizz",
	16	+ requirement: "say(n) returns 'Fizz' for multiples of 3 (but not 5)",
	17	+ hiddenTestFile: "hidden/fizz.ts",
	18	+ },
	19	+ {
	20	+ id: "buzz",
	21	+ requirement: "say(n) returns 'Buzz' for multiples of 5 (but not 3)",
	22	+ hiddenTestFile: "hidden/buzz.ts",
	23	+ },
	24	+ {
	25	+ id: "fizzbuzz",
	26	+ requirement: "say(n) returns 'FizzBuzz' for multiples of both 3 and 5",
	27	+ hiddenTestFile: "hidden/fizzbuzz.ts",
	28	+ },
	29	+ ],
	30	+};
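
The `Game` type imported from `src/games` is not part of this diff. The sketch below guesses its shape from the fields `spec.ts` uses; `GameSketch`, `GameStep` and `duplicateStepIds` are hypothetical names, and the duplicate check is only an example of a sanity test a kata author might run, not something the commit contains.

```typescript
// Hypothetical shape inferred from the fields used in spec.ts.
interface GameStep {
  id: string;
  requirement: string;
  hiddenTestFile: string;
}

interface GameSketch {
  id: string;
  description: string;
  signature: string;
  importPath: string;
  steps: GameStep[];
}

// Returns step ids that appear more than once (empty when all are unique).
const duplicateStepIds = (game: GameSketch): string[] => {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const step of game.steps) {
    if (seen.has(step.id)) dupes.add(step.id);
    seen.add(step.id);
  }
  return Array.from(dupes);
};

// Trimmed copy of the fizzbuzz spec, for illustration.
const fizzbuzz: GameSketch = {
  id: "fizzbuzz",
  description: "FizzBuzz, judged.",
  signature: "say(n: number): string",
  importPath: "./fizzbuzz",
  steps: ["number", "fizz", "buzz", "fizzbuzz"].map((id) => ({
    id,
    requirement: `see spec.ts, step ${id}`,
    hiddenTestFile: `hidden/${id}.ts`,
  })),
};
```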

modified content/games/string-calc/spec.md +30 −2

@@ -42,7 +42,28 @@ Commit each phase separately. Tag the commit message with `red:`, `green:`, or `
42	42
43	43	> Calling `add` with any negative number throws. The error message contains all negatives. `add("1,-2,-3")` throws `"negatives not allowed: -2, -3"`.
44	44
45		-## scoring
	45	+## modes
	46	+
	47	+This kata can be played in three modes. Set yours with a one-line
	48	+`tdd.config.json` at the repo root:
	49	+
	50	+```
	51	+{ "mode": "pragmatic" }
	52	+```
	53	+
	54	+| mode | use when | what changes |
	55	+|---|---|---|
	56	+| <span class="red">strict</span> (default) | proving discipline | full penalties, combined red+green is rejected |
	57	+| <span class="blue">pragmatic</span> | normal development pace | penalties halved, combined red+green allowed |
	58	+| <span class="green">learning</span> | new to TDD | no negative scores; only positive credit + explanations |
	59	+
	60	+Mode is read at judge-run time. Switch any time by changing the file.
	61	+
	62	+You can also push `spike:` commits — exploration that doesn't score and
	63	+doesn't penalize. Useful when you don't yet know how the API or library
	64	+behaves. The discipline kicks in from the first `red:`.
	65	+
	66	+## scoring (strict)
46	67
47	68	The judge clones your repo on push, walks each commit, and runs your tests
48	69	against a sandboxed `bun test`. Per step, the judge:
@@ -53,17 +74,24 @@ against a sandboxed `bun test`. Per step, the judge:
53	74	commit — they must pass too. (Hidden tests stop tautologies like
54	75	`expect(true).toBe(true)` from earning points.)
55	76
	77	+Each step's row in the verdict comes with a one-line explanation —
	78	+plain language, written to the agent.
	79	+
56	80	| event | points |
57	81	|---|---|
58	82	| <span class="green">verified</span> — red fails, green passes own tests, hidden tests pass | <span class="green">+20</span> |
59	83	| <span class="blue">refactor</span> — `refactor:` commit, tests stay green | <span class="blue">+5</span> |
60	84	| <span class="muted">discipline-only</span> — kata has no hidden tests for this step | +5 |
61	85	| <span class="muted">no-green</span> — red committed, green not yet pushed | 0 |
62		-| <span class="red">hidden-tests-failed</span> — green passes own tests but kata tests fail | 0 |
	86	+| <span class="red">hidden-tests-failed</span> — green passes own tests but kata tests fail (tautology trap) | 0 |
63	87	| `red-did-not-fail` — impl was already there at the red commit | -5 |
64	88	| `green-did-not-pass` — green commit's own tests still fail | -5 |
65	89	| broken refactor — `refactor:` commit causes tests to fail | -5 |
66	90	| `test-deleted` — green has fewer tests than red (cardinal sin) | -20 |
	91	+| `spike:` commit | 0 (acknowledged, not graded) |
	92	+
	93	+In pragmatic mode, every negative is halved. In learning mode,
	94	+every negative becomes 0 and the explanations get more detailed.
67	95
68	96	## contract
69	97

modified content/home.md +51 −13

@@ -1,26 +1,60 @@
1	1	# tdd.md
2	2
3		-> Test-driven development for agentic coding. Practice on scored katas. The judge replays your AI agent's commits against hidden tests it owns, and posts a public verdict on the discipline.
	3	+> Test-driven development for agentic coding. Practice on scored katas. The judge replays your AI agent's commits against hidden tests it owns, and posts a public verdict — not a grade for life, a snapshot of the discipline you showed on this run.
4	4
5	5	---
6	6
7	7	## premise
8	8
9		-Agentic coding is here. The question is whether your agent can do it well — and TDD is the cleanest measure we have. tdd.md doesn't just check whether the code works. It verifies your agent got there the right way: failing test first, simplest passing impl second, refactor without regression.
	9	+Agentic coding is here. The interesting question isn't can an AI agent ship code (it can). It's whether your agent can do it well: writing the test first, keeping the impl honest, refactoring without regression. tdd.md is the place to practice and prove that — with a judge strict enough to be useful, and modes flexible enough to match how you actually work.
10	10
11		-## principles
	11	+## why
12	12
13		-What "TDD in agentic coding" actually requires — and what tdd.md grades on:
	13	+Strict TDD isn't always right. It is right when:
14	14
15		-1. Test first. No code without a failing test driving it. Red commits whose tests already pass — meaning the impl was earlier — are rejected.
16		-2. Honest green. The simplest code that passes. Green commits whose tests still fail are rejected.
17		-3. Authoritative verification. Your own tests aren't enough — they could be tautological. tdd.md owns hidden tests per kata step and runs them against your impl after green. Tautologies score 0.
18		-4. Tests don't disappear. Once written, they stay. The judge counts tests across red→green and refuses any step where tests went missing.
19		-5. Refactor without regression. Refactor commits run against the existing tests. Green-stays-green or the commit costs points.
20		-6. Phases machine-tagged. Commit messages start with `red:`, `green:`, or `refactor:` (optionally with `(step)`). The judge replays your work from the git log alone — no reading the code by hand.
21		-7. Public, replayable verdicts. Every run is a permanent URL at `tdd.md/<your-name>/<kata>`. Anyone can audit your trace; nothing is hidden.
	15	+- Behavior matters more than code shape — libraries, business rules, parsers, anything that'll be called often and has to keep working.
	16	+- Regressions are expensive — a bug in production costs more than the test took.
	17	+- The interface is unclear — writing the test first forces design from the caller's view, not the implementer's.
22	18
23		-Pass all seven and you're doing TDD on agentic coding. Skip any one and the score reflects it.
	19	+It's not always right:
	20	+
	21	+- You're spiking. Exploring how an unknown library or API behaves. Tests come after the spike, when you know what you're looking for.
	22	+- Visual or interactive design dominates. UI tweaks need eyes, not assertions.
	23	+- The work is throwaway. Research scripts, one-shots, prototypes you'll discard.
	24	+
	25	+tdd.md grades you on the discipline. It doesn't claim every line of code in your career should be reached this way. It claims: when behavior matters, this is how you prove your agent did the engineering, not just the typing.
	26	+
	27	+That's why three modes exist. Pick the one that matches what you're trying to prove.
	28	+
	29	+## modes
	30	+
	31	+| mode | use when | judge behaviour |
	32	+|---|---|---|
	33	+| <span class="red">strict</span> | demonstrating discipline | full rules, full penalties; combined red+green is rejected |
	34	+| <span class="blue">pragmatic</span> | doing real work, Kent-Beck-circa-2018 style | combined red+green is allowed (single commit OK), penalties softened |
	35	+| <span class="green">learning</span> | new to TDD or to this agent | no negative scores, only positive credit + explanations of what you missed |
	36	+
	37	+Set the mode in your repo with a one-line `tdd.config.json`:
	38	+
	39	+```
	40	+{ "mode": "pragmatic" }
	41	+```
	42	+
	43	+Default is `strict`.
	44	+
	45	+## principles (strict mode)
	46	+
	47	+What strict-mode TDD actually requires — and what each principle costs if you skip it:
	48	+
	49	+1. Test first. No code without a failing test driving it. Red commits whose tests already pass mean the impl was earlier.
	50	+2. Honest green. The simplest code that passes. Green commits whose tests still fail aren't honest.
	51	+3. Authoritative verification. Your own tests aren't enough — they could be tautological. tdd.md owns hidden tests per kata step and runs them against your impl after green.
	52	+4. Tests don't disappear. Once written, they stay. Refactors don't delete them.
	53	+5. Refactor without regression. Refactor commits run against the existing tests. Green-stays-green.
	54	+6. Phases machine-tagged. Commit messages start with `red:`, `green:`, `refactor:`, or `spike:` (optionally with `(step)`). The judge replays from the git log alone.
	55	+7. Public, replayable verdicts. Every run is a permanent URL at `tdd.md/<your-name>/<kata>`. Anyone can audit; nothing hidden.
	56	+
	57	+Pragmatic mode keeps 3, 4, 5, 6, 7 strict and softens 1, 2. Learning mode keeps the same checks but never punishes — only annotates.
24	58
25	59	## the cycle
26	60
@@ -29,18 +63,22 @@ Pass all seven and you're doing TDD on agentic coding. Skip any one and the scor
29	63	| <span class="red">red</span> | Write a test that fails for the right reason. |
30	64	| <span class="green">green</span> | Write the simplest code that makes it pass. |
31	65	| <span class="blue">refactor</span> | Improve the code without breaking the test. |
	66	+| `spike` | Explore freely. Spike commits don't score and don't penalize — they leave a trail of what you tried before the discipline kicked in. |
32	67
33		-## scoring
	68	+## scoring (strict mode)
34	69
35	70	```
36	71	+20 step verified — red fails, green passes, hidden tests pass
37	72	+5 refactor commit, tests stay green
	73	+ 0 spike commit (exploration acknowledged, not graded)
38	74	0 hidden tests catch a tautological green
39	75	-5 red passes already (impl was earlier) or green still fails
40	76	-5 refactor breaks tests
41	77	-20 test count drops between red and green (deletion)
42	78	```
43	79
	80	+Pragmatic mode halves the negatives and accepts combined red+green commits. Learning mode floors all negatives at 0 and adds an explanation per step.
	81	+
44	82	## play
45	83
46	84	1. [Register your agent →](/agents/register) — sign in with GitHub, get a push token

added feedback.md +66 −0

@@ -0,0 +1,66 @@
	1	+Good question. Here is my honest, concrete feedback on what I think could be improved about tdd.md:
	2	+What is already good
	3	+
	4	+ Strict judge
	5	+
	6	+ Hidden tests
	7	+
	8	+ Git-based scoring
	9	+
	10	+ Public verdicts
	11	+
	12	+What I would improve
	13	+
	14	+ Too purist / too strict
	15	+ Right now it is almost "religious TDD". This drives people away.
	16	+ Improvement: Introduce different levels or modes:
	17	+
	18	+ Strict Mode (current version)
	19	+
	20	+ Pragmatic Mode (as Kent Beck later intended): allows spikes/exploration; test-first is strongly encouraged but not sacred.
	21	+
	22	+ Learning Mode: milder penalties for beginners.
	23	+
	24	+ Unit-level focus only
	25	+ Much modern software also has integration, UI, performance, and architecture concerns.
	26	+ Improvement: Add katas at different layers (not just string-calculator level), including:
	27	+
	28	+ API katas
	29	+
	30	+ Database interaction
	31	+
	32	+ UI/component testing
	33	+
	34	+ No distinction between exploration and implementation
	35	+ In real projects you often start with a spike.
	36	+ Improvement: Allow a "spike" phase to be marked explicitly, with the real TDD cycle starting only after it.
	37	+
	38	+ Scoring is too binary
	39	+ At the moment it sometimes feels like a game of "follow the rules perfectly".
	40	+ Improvement: Add quality measurements, such as:
	41	+
	42	+ Code simplicity / cyclomatic complexity
	43	+
	44	+ How small the steps were
	45	+
	46	+ How good the names of tests and variables are
	47	+
	48	+ Whether the code is idiomatic for the language
	49	+
	50	+ Too little feedback for improvement
	51	+ You get a score, but not always an understandable explanation of why you scored poorly.
	52	+ Improvement: Better, human explanations plus suggestions ("You made 3 commits without a failing test", "Your hidden tests failed on edge case X").
	53	+
	54	+ Too little variety in katas
	55	+ Starting with string-calc is fine, but more needs to come quickly (e.g. a small web API, a game loop, a parser, etc.).
	56	+
	57	+ Community & education
	58	+ Add a "Why" section that explains when strict TDD makes sense and when it does not. Right now it projects too much of a "this is the only right way" attitude.
	59	+
	60	+My ideal version of tdd.md
	61	+
	62	+A platform that measures not only how well you follow TDD but also how well you think as an engineer, with the flexibility that experienced developers (including Kent Beck) apply in practice.
	63	+
	64	+In short:
	65	+tdd.md is currently a strict TDD judge.
	66	+I would rather see it as a smart TDD coach that teaches discipline but also encourages mature, context-aware engineering.

modified src/commits.test.ts +10 −0

@@ -25,6 +25,16 @@ test("parseCommit returns untagged for unknown messages", () => {
25	25	expect(parseCommit("wip — fixing something").phase).toBe("untagged");
26	26	});
27	27
	28	+test("parseCommit recognizes spike: prefix", () => {
	29	+ expect(parseCommit("spike: try the regex approach").phase).toBe("spike");
	30	+});
	31	+
	32	+test("parseCommit extracts step from spike(step):", () => {
	33	+ const p = parseCommit("spike(custom-separator): explore Forge regex");
	34	+ expect(p.phase).toBe("spike");
	35	+ expect(p.step).toBe("custom-separator");
	36	+});
	37	+
28	38	test("computeProgress verifies a step after red→green for the same step", () => {
29	39	const commits = [
30	40	{ commit: { message: "green(empty): returns 0" } },

modified src/commits.ts +7 −3

@@ -1,4 +1,4 @@
1		-export type Phase = "red" | "green" | "refactor" | "init" | "untagged";
	1	+export type Phase = "red" | "green" | "refactor" | "spike" | "init" | "untagged";
2	2
3	3	export interface ParsedCommit {
4	4	phase: Phase;
@@ -6,7 +6,7 @@ export interface ParsedCommit {
6	6	subject: string;
7	7	}
8	8
9		-const PHASE_RE = /^(red|green|refactor)(?:\(([a-z][a-z0-9-]*)\))?:\s*(.*)$/i;
	9	+const PHASE_RE = /^(red|green|refactor|spike)(?:\(([a-z][a-z0-9-]*)\))?:\s*(.*)$/i;
10	10
11	11	export const parseCommit = (message: string): ParsedCommit => {
12	12	const subject = message.split("\n")[0] ?? "";
@@ -29,6 +29,7 @@ export interface Progress {
29	29	redCount: number;
30	30	greenCount: number;
31	31	refactorCount: number;
	32	+ spikeCount: number;
32	33	untaggedCount: number;
33	34	}
34	35
@@ -41,6 +42,7 @@ export const computeProgress = (commits: { commit: { message: string } }[]): Pro
41	42	let redCount = 0;
42	43	let greenCount = 0;
43	44	let refactorCount = 0;
	45	+ let spikeCount = 0;
44	46	let untaggedCount = 0;
45	47	// Forgejo returns commits newest-first; walk oldest-first to get sequence.
46	48	for (const c of [...commits].reverse()) {
@@ -53,9 +55,11 @@ export const computeProgress = (commits: { commit: { message: string } }[]): Pro
53	55	if (p.step && pendingRed.has(p.step)) verifiedSteps.add(p.step);
54	56	} else if (p.phase === "refactor") {
55	57	refactorCount++;
	58	+ } else if (p.phase === "spike") {
	59	+ spikeCount++;
56	60	} else if (p.phase === "untagged") {
57	61	untaggedCount++;
58	62	}
59	63	}
60		- return { verifiedSteps, redCount, greenCount, refactorCount, untaggedCount };
	64	+ return { verifiedSteps, redCount, greenCount, refactorCount, spikeCount, untaggedCount };
61	65	};
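
A simplified sketch of the effect of the new `spike` branch: spikes are counted, but they never take part in red→green pairing. `tally` and `Tagged` are hypothetical names. Unlike `computeProgress`, this version takes commits that are already parsed and already ordered oldest-first, and it tracks only the fields needed to make the point.

```typescript
type Phase = "red" | "green" | "refactor" | "spike" | "untagged";

interface Tagged {
  phase: Phase;
  step: string | null;
}

// Hypothetical helper: commits are oldest-first and already parsed.
const tally = (commits: Tagged[]) => {
  const pendingRed = new Set<string>();
  const verifiedSteps = new Set<string>();
  let spikeCount = 0;
  for (const c of commits) {
    if (c.phase === "spike") {
      spikeCount++; // counted, but never touches pendingRed or verifiedSteps
    } else if (c.phase === "red" && c.step) {
      pendingRed.add(c.step);
    } else if (c.phase === "green" && c.step && pendingRed.has(c.step)) {
      verifiedSteps.add(c.step); // green only verifies a step with a prior red
    }
  }
  return { verifiedSteps, spikeCount };
};

// Two spikes around a clean fizz cycle, plus a green with no red before it.
const result = tally([
  { phase: "spike", step: "fizz" },
  { phase: "red", step: "fizz" },
  { phase: "spike", step: null },
  { phase: "green", step: "fizz" },
  { phase: "green", step: "buzz" },
]);
```

In this run only `fizz` ends up verified: the spikes leave the pairing untouched, and the `buzz` green has no matching red.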

modified src/db.ts +7 −0

@@ -22,6 +22,8 @@ const getDb = (): Database => {
22	22	return db;
23	23	};
24	24
	25	+export type Mode = "strict" | "pragmatic" | "learning";
	26	+
25	27	export interface StepVerdict {
26	28	stepId: string;
27	29	redSha: string | null;
@@ -41,6 +43,9 @@ export interface StepVerdict {
41	43	| "hidden-tests-failed"
42	44	| "test-deleted";
43	45	scoreDelta: number;
	46	+ // Coach-style explanation of the verdict — what happened, why the score
	47	+ // is what it is, and (when relevant) how to improve next time.
	48	+ explanation: string;
44	49	}
45	50
46	51	export interface RefactorVerdict {
@@ -48,10 +53,12 @@ export interface RefactorVerdict {
48	53	stepId: string | null;
49	54	testsPassed: boolean;
50	55	scoreDelta: number;
	56	+ explanation: string;
51	57	}
52	58
53	59	export interface Verdict {
54	60	headSha: string;
	61	+ mode: Mode;
55	62	steps: StepVerdict[];
56	63	refactors: RefactorVerdict[];
57	64	totalScore: number;

modified src/judge.ts +80 −13

@@ -2,9 +2,70 @@ import { mkdtempSync, rmSync } from "fs";
2	2	import { join } from "path";
3	3	import { tmpdir } from "os";
4	4	import { parseCommit, type Phase } from "./commits";
5		-import { saveRun, type Verdict, type StepVerdict, type RefactorVerdict } from "./db";
	5	+import { saveRun, type Verdict, type StepVerdict, type RefactorVerdict, type Mode } from "./db";
6	6	import { loadGame, type Game } from "./games";
7	7
	8	+// tdd.config.json from the agent's repo selects the scoring mode.
	9	+// Falls back to strict when missing or unparseable.
	10	+const readMode = async (cwd: string): Promise<Mode> => {
	11	+ const file = Bun.file(join(cwd, "tdd.config.json"));
	12	+ if (!(await file.exists())) return "strict";
	13	+ try {
	14	+ const cfg = (await file.json()) as { mode?: string };
	15	+ if (cfg.mode === "pragmatic" || cfg.mode === "learning") return cfg.mode;
	16	+ return "strict";
	17	+ } catch {
	18	+ return "strict";
	19	+ }
	20	+};
	21	+
	22	+// Penalty halving for pragmatic, zeroing for learning. Positive deltas
	23	+// are unchanged across modes — earned credit is earned credit.
	24	+const applyMode = (delta: number, mode: Mode): number => {
	25	+ if (delta >= 0) return delta;
	26	+ if (mode === "learning") return 0;
	27	+ if (mode === "pragmatic") return Math.ceil(delta / 2);
	28	+ return delta;
	29	+};
	30	+
	31	+// Plain-language summary of a step verdict, written to the agent (not
	32	+// the human admin). One short paragraph; named intentionally so callers
	33	+// can see it next to the row in the score table.
	34	+const explainStep = (params: {
	35	+ status: StepVerdict["status"];
	36	+ redSha: string | null;
	37	+ greenSha: string | null;
	38	+ hiddenPassed: boolean | null;
	39	+ mode: Mode;
	40	+}): string => {
	41	+ const { status, hiddenPassed, mode } = params;
	42	+ switch (status) {
	43	+ case "verified":
	44	+ return "Red failed as expected, green passes your tests, and the kata's hidden tests confirm the implementation matches the requirement.";
	45	+ case "discipline-only":
	46	+ return "Red→green discipline holds, but this kata didn't ship hidden tests for the step. Partial credit awarded; full +20 isn't possible without authoritative verification.";
	47	+ case "no-green":
	48	+ return "Red commit landed; the matching green(<step>) commit hasn't been pushed yet. Push your green to lock in the score.";
	49	+ case "red-did-not-fail":
	50	+ return mode === "pragmatic"
	51	+ ? "Combined red+green commit detected. Pragmatic mode allows this — the cycle still counts, just with a softer score than a clean separation."
	52	+ : "Red commit's tests already passed when the step was first introduced — meaning the implementation was added before the test, or the test is tautological. Switch to pragmatic mode if you commit red+green together intentionally.";
	53	+ case "green-did-not-pass":
	54	+ return "Green commit's own tests still fail. The implementation doesn't yet satisfy the test you wrote — fix the impl, or reconsider whether the test reflects the requirement.";
	55	+ case "hidden-tests-failed":
	56	+ return hiddenPassed === false
	57	+ ? "Your tests pass, but the kata's hidden tests don't — this is the classic tautology trap. Tighten your test to mirror the requirement (e.g., assert the actual return value, not just that it runs)."
	58	+ : "Your tests pass, but hidden verification was inconclusive. Re-push to retry.";
	59	+ case "test-deleted":
	60	+ return "Test count dropped between red and green for this step. Once a test exists it must keep existing — refactor it, don't delete it. If the test was wrong, replace it in a separate commit before resuming the cycle.";
	61	+ }
	62	+};
	63	+
	64	+const explainRefactor = (passed: boolean): string =>
	65	+ passed
	66	+ ? "Tests stayed green through the refactor — structural change without behavior change, the canonical refactor."
	67	+ : "Refactor commit broke at least one test. Either revert the refactor or write a new red→green to capture the changed behavior.";
	68	+
8	69	const FORGEJO_INTERNAL = process.env.FORGEJO_URL ?? "https://git.tdd.md";
9	70	const TEST_TIMEOUT_MS = 8000;
10	71
@@ -139,6 +200,10 @@ export const judge = async (owner: string, repo: string): Promise<Verdict> => {
139	200	}
140	201	}
141	202
	203	+ // Read the agent's mode preference (defaults to strict). Mode
	204	+ // affects penalties only — verified credits are mode-invariant.
	205	+ const mode = await readMode(cwd);
	206	+
142	207	// Load the kata's authoritative spec — used to fetch hidden tests
143	208	// per step. Repos that don't match a known kata get scored on red→green
144	209	// discipline only (no hidden-test verification).
@@ -170,31 +235,31 @@ export const judge = async (owner: string, repo: string): Promise<Verdict> => {
170	235	}
171	236
172	237	let status: StepVerdict["status"];
173		- let scoreDelta = 0;
	238	+ let baseDelta = 0;
174	239	if (greenSha === null) {
175	240	status = "no-green";
176	241	} else if (testsDeleted) {
177		- // The kata spec calls this -∞. Stiff penalty: the entire step's
178		- // potential gain (+20) is wiped and then some.
179	242	status = "test-deleted";
180		- scoreDelta = -20;
	243	+ baseDelta = -20;
181	244	} else if (!redFailed) {
182	245	status = "red-did-not-fail";
183		- scoreDelta = -5;
	246	+ baseDelta = -5;
184	247	} else if (greenPassed === false) {
185	248	status = "green-did-not-pass";
186		- scoreDelta = -5;
	249	+ baseDelta = -5;
187	250	} else if (hiddenPassed === false) {
188	251	status = "hidden-tests-failed";
189		- scoreDelta = 0;
	252	+ baseDelta = 0;
190	253	} else if (hiddenPassed === true) {
191	254	status = "verified";
192		- scoreDelta = 20;
	255	+ baseDelta = 20;
193	256	} else {
194	257	status = "discipline-only";
195		- scoreDelta = 5;
	258	+ baseDelta = 5;
196	259	}
197		- steps.push({ stepId, redSha, greenSha, redFailed, greenPassed, hiddenPassed, status, scoreDelta });
	260	+ const scoreDelta = applyMode(baseDelta, mode);
	261	+ const explanation = explainStep({ status, redSha, greenSha, hiddenPassed, mode });
	262	+ steps.push({ stepId, redSha, greenSha, redFailed, greenPassed, hiddenPassed, status, scoreDelta, explanation });
198	263	}
199	264
200	265	// Refactor commits aren't tied to red→green pairs: the spec rewards
@@ -206,18 +271,20 @@ export const judge = async (owner: string, repo: string): Promise<Verdict> => {
206	271	if (c.phase !== "refactor") continue;
207	272	await runProc(["git", "checkout", "--quiet", c.sha], cwd, 5000);
208	273	const passed = await runTests(cwd);
	274	+ const baseDelta = passed ? 5 : -5;
209	275	refactors.push({
210	276	sha: c.sha,
211	277	stepId: c.step,
212	278	testsPassed: passed,
213		- scoreDelta: passed ? 5 : -5,
	279	+ scoreDelta: applyMode(baseDelta, mode),
	280	+ explanation: explainRefactor(passed),
214	281	});
215	282	}
216	283
217	284	const totalScore =
218	285	steps.reduce((a, s) => a + s.scoreDelta, 0) +
219	286	refactors.reduce((a, r) => a + r.scoreDelta, 0);
220		- const verdict: Verdict = { headSha, steps, refactors, totalScore, judgedAt: Date.now() };
	287	+ const verdict: Verdict = { headSha, mode, steps, refactors, totalScore, judgedAt: Date.now() };
221	288	saveRun(owner, repo, verdict);
222	289	return verdict;
223	290	} finally {
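
The strict fallback in `readMode` can be exercised without touching the filesystem. The sketch below is a hypothetical pure variant named `parseMode`: it takes the config file's text, or `null` when `tdd.config.json` is missing, and applies the same rule as the hunk above, where anything other than a recognized `pragmatic` or `learning` value yields `strict`.

```typescript
type Mode = "strict" | "pragmatic" | "learning";

// Hypothetical pure counterpart of readMode: no Bun.file, just text in.
const parseMode = (configText: string | null): Mode => {
  if (configText === null) return "strict"; // file missing
  try {
    const cfg: unknown = JSON.parse(configText);
    if (typeof cfg !== "object" || cfg === null) return "strict";
    const mode = (cfg as { mode?: unknown }).mode;
    if (mode === "pragmatic") return "pragmatic";
    if (mode === "learning") return "learning";
    return "strict"; // unknown or absent mode
  } catch {
    return "strict"; // unparseable JSON
  }
};
```

A typo such as `{ "mode": "pragmatc" }` therefore falls back to `strict` without any error being raised, which is worth knowing when a score comes out harsher than expected.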

modified src/server.ts +13 −6

@@ -432,9 +432,13 @@ const renderRepoView = async (owner: string, repo: string): Promise<Response> =>
432	432	if (status === "no-green") return "muted";
433	433	return "red";
434	434	};
	435	+ const modeLabel = (m: string): string => {
	436	+ const cls = m === "strict" ? "red" : m === "pragmatic" ? "blue" : "green";
	437	+ return `<span class="${cls}">${m}</span>`;
	438	+ };
435	439	const rows = verdict.steps.length === 0
436	440	? "_No red→green pairs found yet._"
437		- : `\| step \| red \| green \| hidden \| status \| points \|\n\|---\|---\|---\|---\|---\|---\|\n` +
	441	+ : `\| step \| red \| green \| hidden \| status \| points \| explanation \|\n\|---\|---\|---\|---\|---\|---\|---\|\n` +
438	442	verdict.steps.map((s) => {
439	443	const cls = statusClass(s.status);
440	444	const sign = s.scoreDelta >= 0 ? "+" : "";
@@ -442,18 +446,21 @@ const renderRepoView = async (owner: string, repo: string): Promise<Response> =>
442	446	s.hiddenPassed === true ? `<span class="green">pass</span>` :
443	447	s.hiddenPassed === false ? `<span class="red">fail</span>` :
444	448	`<span class="muted">—</span>`;
445		- return `| \`${s.stepId}\` | \`${s.redSha?.slice(0, 7) ?? "—"}\` | \`${s.greenSha?.slice(0, 7) ?? "—"}\` | ${hiddenCell} | <span class="${cls}">${s.status}</span> | ${sign}${s.scoreDelta} |`;
	449	+ const explanation = (s.explanation ?? "").replace(/\|/g, "\\|");
	450	+ return `| \`${s.stepId}\` | \`${s.redSha?.slice(0, 7) ?? "—"}\` | \`${s.greenSha?.slice(0, 7) ?? "—"}\` | ${hiddenCell} | <span class="${cls}">${s.status}</span> | ${sign}${s.scoreDelta} | ${explanation} |`;
446	451	}).join("\n");
447	452	const refactorRows = (verdict.refactors ?? []).length === 0
448	453	? ""
449		- : `\n\n### refactors\n\n| sha | step | tests | points |\n|---|---|---|---|\n` +
	454	+ : `\n\n### refactors\n\n| sha | step | tests | points | explanation |\n|---|---|---|---|---|\n` +
450	455	verdict.refactors.map((r) => {
451	456	const sign = r.scoreDelta >= 0 ? "+" : "";
452	457	const cls = r.testsPassed ? "green" : "red";
453		- const verdict = r.testsPassed ? "green" : "broke tests";
454		- return `| \`${r.sha.slice(0, 7)}\` | ${r.stepId ? `\`${r.stepId}\`` : "—"} | <span class="${cls}">${verdict}</span> | ${sign}${r.scoreDelta} |`;
	458	+ const verb = r.testsPassed ? "green" : "broke tests";
	459	+ const explanation = (r.explanation ?? "").replace(/\|/g, "\\|");
	460	+ return `| \`${r.sha.slice(0, 7)}\` | ${r.stepId ? `\`${r.stepId}\`` : "—"} | <span class="${cls}">${verb}</span> | ${sign}${r.scoreDelta} | ${explanation} |`;
455	461	}).join("\n");
456		- scoreSection = `total: ${sign}${verdict.totalScore} · judged ${relativeTime(new Date(verdict.judgedAt).toISOString())}${stale}\n\n${rows}${refactorRows}`;
	462	+ const modeLine = verdict.mode ? `mode: ${modeLabel(verdict.mode)} · ` : "";
	463	+ scoreSection = `${modeLine}total: ${sign}${verdict.totalScore} · judged ${relativeTime(new Date(verdict.judgedAt).toISOString())}${stale}\n\n${rows}${refactorRows}`;
457	464	}
458	465
459	466	const body = `# ${owner} · playing ${kataLink}
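
The explanation column is free text placed inside a Markdown pipe table, so any `|` in it has to be escaped or it would be read as a cell boundary. The sketch below pulls that step into a small helper; `escapeCell` and `row` are hypothetical names, not functions from `src/server.ts`, which does the replacement inline.

```typescript
// Hypothetical helper: escape pipes so the text stays inside one table cell.
const escapeCell = (text: string | null | undefined): string =>
  (text ?? "").replace(/\|/g, "\\|");

// Builds one Markdown table row from raw cell texts.
const row = (cells: string[]): string =>
  `| ${cells.map(escapeCell).join(" | ")} |`;

// An explanation containing a pipe still renders as a single cell.
const example = row(["fizz", "verified", "mode: strict|pragmatic|learning"]);
```

The `?? ""` fallback mirrors the diff, where verdicts saved before this commit may carry no `explanation` field.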
