syntaxai/tdd.md · commit 537840e

Coach mode: modes, spike phase, explanations, second kata, why

Reframes tdd.md from strict-only judge to a coach with three modes,
based on real-user feedback ("te puristisch", i.e. "too purist") plus field research
(Bache, Tweag, Fox, Alexop) all converging on the same point: strict
red→green separation isn't always how experienced TDD developers work,
and treating it as the only valid form pushes people away.

Modes (read from tdd.config.json at repo root, default strict):
- strict   — current behavior, full penalties, combined red+green rejected
- pragmatic — penalties halved, combined red+green accepted (Kent-Beck-circa-2018)
- learning — negatives floored to 0; only positive credit + verbose
              explanations (for newcomers)
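
The mode lookup described above can be sketched as a pure function over the config file's text. This is a simplified stand-in for the judge's file-based lookup, not the shipped code: the name `parseMode` and the `string | null` input are assumptions made for illustration, while the fall-back-to-strict behavior follows the commit's description.

```typescript
// Hypothetical helper (not part of the commit): resolve the scoring mode
// from the raw text of tdd.config.json. Missing, malformed, or
// unrecognized values all resolve to "strict".
type Mode = "strict" | "pragmatic" | "learning";

const parseMode = (configText: string | null): Mode => {
  if (configText === null) return "strict"; // no config file in the repo
  try {
    const cfg = JSON.parse(configText) as { mode?: unknown } | null;
    const mode = cfg?.mode;
    if (mode === "pragmatic" || mode === "learning") return mode;
    return "strict"; // explicit "strict" and unknown values land here
  } catch (_err) {
    return "strict"; // unparseable JSON falls back to the default
  }
};

console.log(parseMode('{ "mode": "pragmatic" }')); // pragmatic
console.log(parseMode('{ "mode": "yolo" }'));      // strict
console.log(parseMode("not json"));                // strict
```

Treating every failure as strict means a typo in the config can never silently loosen the rules.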

Spike phase (commits.ts + judge):
- spike: prefix recognized alongside red:/green:/refactor:
- spike commits don't score and don't penalize
- Acknowledges that exploration precedes discipline, especially with
  unfamiliar APIs
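
A minimal sketch of how the `spike:` prefix might be parsed. The regular expression is the one from this commit; the body of `parseCommit`, the `step: string | null` field, and the choice to store the text after the prefix in `subject` are assumptions, since the diff shows only part of the function. The `init` phase is left out for brevity.

```typescript
type Phase = "red" | "green" | "refactor" | "spike" | "untagged";

interface ParsedCommit {
  phase: Phase;
  step: string | null; // assumed shape; the diff elides this field
  subject: string;
}

// Prefix, optional "(step)", colon, then the rest of the subject line.
const PHASE_RE = /^(red|green|refactor|spike)(?:\(([a-z][a-z0-9-]*)\))?:\s*(.*)$/i;

const parseCommit = (message: string): ParsedCommit => {
  const subject = message.split("\n")[0] ?? "";
  const m = PHASE_RE.exec(subject);
  if (!m) return { phase: "untagged", step: null, subject };
  return {
    phase: (m[1] ?? "").toLowerCase() as Phase, // regex is case-insensitive
    step: m[2] ?? null,
    subject: m[3] ?? "",
  };
};

console.log(parseCommit("spike(custom-separator): explore Forge regex"));
```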

Verdict explanations (db.ts, judge.ts, server.ts):
- Each step verdict carries an `explanation` string written to the
  agent in plain language — what happened, why the score, how to fix
- Each refactor verdict carries one too
- Repo page renders an explanation column next to status/points
- "mode: strict|pragmatic|learning" badge above the verdict table
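
Because the verdict is rendered as a Markdown table, a literal `|` inside an explanation would split the cell. The sketch below shows the escaping step in isolation; `StepRow`, `escapeCell`, and `renderRow` are hypothetical names, and the row is reduced to four columns for illustration.

```typescript
// Hypothetical, trimmed-down row type; the real StepVerdict has more fields.
interface StepRow {
  stepId: string;
  status: string;
  scoreDelta: number;
  explanation?: string; // may be absent on verdicts saved before this change
}

// Escape pipes so the explanation stays inside a single table cell.
const escapeCell = (text: string): string => text.replace(/\|/g, "\\|");

const renderRow = (s: StepRow): string => {
  const sign = s.scoreDelta >= 0 ? "+" : "";
  const explanation = escapeCell(s.explanation ?? "");
  return `| \`${s.stepId}\` | ${s.status} | ${sign}${s.scoreDelta} | ${explanation} |`;
};

console.log(renderRow({ stepId: "fizz", status: "verified", scoreDelta: 20, explanation: "a | b" }));
```

Defaulting a missing explanation to an empty string keeps older stored verdicts renderable.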

content/home.md:
- New ## why section: when strict TDD is right, when it isn't
- New ## modes table
- Principles re-introduced as "what strict mode requires" — softens
  the "religious" framing without abandoning the rules
- Cycle table grows a `spike` row
- Scoring block adds `+0 spike commit` and a paragraph on
  pragmatic/learning effects
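
As a worked example of the mode effects, the sketch below applies the commit's penalty rule to one hypothetical run: a verified step (+20), a clean refactor (+5), a red that did not fail (-5), and a deleted test (-20). `applyMode` follows the logic in `src/judge.ts`; `totalFor` and the sample run are illustrative only.

```typescript
type Mode = "strict" | "pragmatic" | "learning";

// Positive deltas pass through unchanged; only penalties are scaled.
const applyMode = (delta: number, mode: Mode): number => {
  if (delta >= 0) return delta;
  if (mode === "learning") return 0;
  if (mode === "pragmatic") return Math.ceil(delta / 2); // rounds toward zero
  return delta;
};

// Hypothetical helper: sum a run's base deltas under a given mode.
const totalFor = (baseDeltas: number[], mode: Mode): number =>
  baseDeltas.reduce((sum, d) => sum + applyMode(d, mode), 0);

const run = [20, 5, -5, -20];

console.log(totalFor(run, "strict"));    // 0
console.log(totalFor(run, "pragmatic")); // 13
console.log(totalFor(run, "learning"));  // 25
```

Note that "halved" rounds toward zero: `Math.ceil(-2.5)` is `-2`, so a -5 penalty costs 2 points in pragmatic mode, not 2.5 or 3.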

Second kata (content/games/fizzbuzz/):
- Classic FizzBuzz, four steps (number, fizz, buzz, fizzbuzz),
  hidden tests for each
- spec.ts ships the description + signature + importPath
- spec.md is human-readable; modes apply identically
- listGames() picks it up automatically — /games and /sitemap.xml
  show both katas without further changes
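
For orientation, here is one possible end state that would satisfy all four steps of the kata's contract. It is only a sketch of the finished function; a graded run would arrive at it through separate red and green commits per step rather than in one go.

```typescript
// fizzbuzz.ts: one possible final implementation, not the only valid one.
export const say = (n: number): string => {
  if (n % 15 === 0) return "FizzBuzz"; // multiples of both 3 and 5
  if (n % 3 === 0) return "Fizz";
  if (n % 5 === 0) return "Buzz";
  return String(n);
};

console.log([1, 3, 5, 15].map(say).join(" ")); // 1 Fizz Buzz FizzBuzz
```

Checking the combined case first keeps the `Fizz` and `Buzz` branches from shadowing it.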

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
author: syntaxai <[email protected]>
date: 2026-05-03 21:56:51 +01:00
parent: f86b478
commit: 537840e64c2e373ffb6eac368a35e9abb2e2c78f

14 files changed · +402 −37

added content/games/fizzbuzz/hidden/buzz.ts +14 −0
@@ -0,0 +1,14 @@
1+import { test, expect } from "bun:test";
2+import { say } from "./fizzbuzz";
3+
4+test("HIDDEN: say(5) returns 'Buzz'", () => {
5+ expect(say(5)).toBe("Buzz");
6+});
7+
8+test("HIDDEN: say(10) returns 'Buzz'", () => {
9+ expect(say(10)).toBe("Buzz");
10+});
11+
12+test("HIDDEN: say(20) returns 'Buzz'", () => {
13+ expect(say(20)).toBe("Buzz");
14+});
added content/games/fizzbuzz/hidden/fizz.ts +14 −0
@@ -0,0 +1,14 @@
1+import { test, expect } from "bun:test";
2+import { say } from "./fizzbuzz";
3+
4+test("HIDDEN: say(3) returns 'Fizz'", () => {
5+ expect(say(3)).toBe("Fizz");
6+});
7+
8+test("HIDDEN: say(6) returns 'Fizz'", () => {
9+ expect(say(6)).toBe("Fizz");
10+});
11+
12+test("HIDDEN: say(9) returns 'Fizz'", () => {
13+ expect(say(9)).toBe("Fizz");
14+});
added content/games/fizzbuzz/hidden/fizzbuzz.ts +14 −0
@@ -0,0 +1,14 @@
1+import { test, expect } from "bun:test";
2+import { say } from "./fizzbuzz";
3+
4+test("HIDDEN: say(15) returns 'FizzBuzz'", () => {
5+ expect(say(15)).toBe("FizzBuzz");
6+});
7+
8+test("HIDDEN: say(30) returns 'FizzBuzz'", () => {
9+ expect(say(30)).toBe("FizzBuzz");
10+});
11+
12+test("HIDDEN: say(45) returns 'FizzBuzz'", () => {
13+ expect(say(45)).toBe("FizzBuzz");
14+});
added content/games/fizzbuzz/hidden/number.ts +14 −0
@@ -0,0 +1,14 @@
1+import { test, expect } from "bun:test";
2+import { say } from "./fizzbuzz";
3+
4+test("HIDDEN: say(1) returns '1'", () => {
5+ expect(say(1)).toBe("1");
6+});
7+
8+test("HIDDEN: say(2) returns '2'", () => {
9+ expect(say(2)).toBe("2");
10+});
11+
12+test("HIDDEN: say(7) returns '7'", () => {
13+ expect(say(7)).toBe("7");
14+});
added content/games/fizzbuzz/spec.md +52 −0
@@ -0,0 +1,52 @@
1+# fizzbuzz
2+
3+> The interview classic, judged on TDD discipline. Build a function `say(n: number): string` in four steps. Tiny by design — the goal is the discipline, not the algorithm.
4+
5+## the cycle
6+
7+1. Write a failing test for the new requirement.
8+2. Implement the simplest code that makes it pass — without breaking existing tests.
9+3. Optionally `refactor:` — improve structure, keep tests green.
10+
11+Tag commits with `red:` / `green:` / `refactor:` (with optional step like `red(fizz):`).
12+
13+## steps
14+
15+### 1. number
16+> `say(n)` returns the number as a string for any input not divisible by 3 or 5. `say(1)` → `"1"`, `say(2)` → `"2"`, `say(7)` → `"7"`.
17+
18+### 2. fizz
19+> Multiples of 3 (but not 5) return `"Fizz"`. `say(3)` → `"Fizz"`, `say(6)` → `"Fizz"`.
20+
21+### 3. buzz
22+> Multiples of 5 (but not 3) return `"Buzz"`. `say(5)` → `"Buzz"`, `say(10)` → `"Buzz"`.
23+
24+### 4. fizzbuzz
25+> Multiples of both 3 and 5 return `"FizzBuzz"`. `say(15)` → `"FizzBuzz"`, `say(30)` → `"FizzBuzz"`.
26+
27+## modes
28+
29+Same three modes as the rest of tdd.md — set `tdd.config.json` at the repo root:
30+
31+```
32+{ "mode": "pragmatic" }
33+```
34+
35+Default is `strict`.
36+
37+## contract
38+
39+The hidden tests assume your implementation lives at `./fizzbuzz.ts` (repo root) and exports `say` as `(n: number) => string`:
40+
41+```ts
42+// fizzbuzz.ts
43+export const say = (n: number): string => { /* your impl */ };
44+```
45+
46+## submitting
47+
48+```
49+git push https://tdd.md/<your-name>/fizzbuzz.git main
50+```
51+
52+Verdict appears at `tdd.md/<your-name>/fizzbuzz` within seconds of the push.
added content/games/fizzbuzz/spec.ts +30 −0
@@ -0,0 +1,30 @@
1+import type { Game } from "../../../src/games";
2+
3+export const spec: Game = {
4+ id: "fizzbuzz",
5+ description: "FizzBuzz, judged. Build say(n) in four steps: number, Fizz, Buzz, FizzBuzz.",
6+ signature: "say(n: number): string",
7+ importPath: "./fizzbuzz",
8+ steps: [
9+ {
10+ id: "number",
11+ requirement: "say(n) returns the number as a string for inputs that are neither divisible by 3 nor 5",
12+ hiddenTestFile: "hidden/number.ts",
13+ },
14+ {
15+ id: "fizz",
16+ requirement: "say(n) returns 'Fizz' for multiples of 3 (but not 5)",
17+ hiddenTestFile: "hidden/fizz.ts",
18+ },
19+ {
20+ id: "buzz",
21+ requirement: "say(n) returns 'Buzz' for multiples of 5 (but not 3)",
22+ hiddenTestFile: "hidden/buzz.ts",
23+ },
24+ {
25+ id: "fizzbuzz",
26+ requirement: "say(n) returns 'FizzBuzz' for multiples of both 3 and 5",
27+ hiddenTestFile: "hidden/fizzbuzz.ts",
28+ },
29+ ],
30+};
modified content/games/string-calc/spec.md +30 −2
@@ -42,7 +42,28 @@ Commit each phase separately. Tag the commit message with `red:`, `green:`, or `
4242
4343 > Calling `add` with any negative number throws. The error message contains all negatives. `add("1,-2,-3")` throws `"negatives not allowed: -2, -3"`.
4444
45-## scoring
45+## modes
46+
47+This kata can be played in three modes. Set yours with a one-line
48+`tdd.config.json` at the repo root:
49+
50+```
51+{ "mode": "pragmatic" }
52+```
53+
54+| mode | use when | what changes |
55+|---|---|---|
56+| <span class="red">**strict**</span> (default) | proving discipline | full penalties, combined red+green is rejected |
57+| <span class="blue">**pragmatic**</span> | normal development pace | penalties halved, combined red+green allowed |
58+| <span class="green">**learning**</span> | new to TDD | no negative scores; only positive credit + explanations |
59+
60+Mode is read at judge-run time. Switch any time by changing the file.
61+
62+You can also push `spike:` commits — exploration that doesn't score and
63+doesn't penalize. Useful when you don't yet know how the API or library
64+behaves. The discipline kicks in from the first `red:`.
65+
66+## scoring (strict)
4667
4768 The judge clones your repo on push, walks each commit, and runs your tests
4869 against a sandboxed `bun test`. Per step, the judge:
@@ -53,17 +74,24 @@ against a sandboxed `bun test`. Per step, the judge:
5374 commit — they must pass too. (Hidden tests stop tautologies like
5475 `expect(true).toBe(true)` from earning points.)
5576
77+Each step's row in the verdict comes with a one-line **explanation** —
78+plain language, written to the agent.
79+
5680 | event | points |
5781 |---|---|
5882 | <span class="green">verified</span> — red fails, green passes own tests, hidden tests pass | <span class="green">+20</span> |
5983 | <span class="blue">refactor</span> — `refactor:` commit, tests stay green | <span class="blue">+5</span> |
6084 | <span class="muted">discipline-only</span> — kata has no hidden tests for this step | +5 |
6185 | <span class="muted">no-green</span> — red committed, green not yet pushed | 0 |
62-| <span class="red">hidden-tests-failed</span> — green passes own tests but kata tests fail | 0 |
86+| <span class="red">hidden-tests-failed</span> — green passes own tests but kata tests fail (tautology trap) | 0 |
6387 | `red-did-not-fail` — impl was already there at the red commit | -5 |
6488 | `green-did-not-pass` — green commit's own tests still fail | -5 |
6589 | broken refactor — `refactor:` commit causes tests to fail | -5 |
6690 | `test-deleted` — green has fewer tests than red (cardinal sin) | -20 |
91+| `spike:` commit | 0 (acknowledged, not graded) |
92+
93+In **pragmatic** mode, every negative is halved. In **learning** mode,
94+every negative becomes 0 and the explanations get more detailed.
6795
6896 ## contract
6997
modified content/home.md +51 −13
@@ -1,26 +1,60 @@
11 # tdd.md
22
3-> Test-driven development for agentic coding. Practice on scored katas. The judge replays your AI agent's commits against hidden tests it owns, and posts a public verdict on the discipline.
3+> Test-driven development for agentic coding. Practice on scored katas. The judge replays your AI agent's commits against hidden tests it owns, and posts a public verdict — not a grade for life, a snapshot of the discipline you showed on this run.
44
55 ---
66
77 ## premise
88
9-Agentic coding is here. The question is whether your agent can do it *well* — and TDD is the cleanest measure we have. tdd.md doesn't just check whether the code works. It verifies your agent got there the right way: failing test first, simplest passing impl second, refactor without regression.
9+Agentic coding is here. The interesting question isn't *can* an AI agent ship code (it can). It's whether your agent can do it *well*: writing the test first, keeping the impl honest, refactoring without regression. tdd.md is the place to practice and prove that — with a judge strict enough to be useful, and modes flexible enough to match how you actually work.
1010
11-## principles
11+## why
1212
13-What "TDD in agentic coding" actually requires — and what tdd.md grades on:
13+Strict TDD isn't always right. It is right when:
1414
15-1. **Test first.** No code without a failing test driving it. Red commits whose tests already pass — meaning the impl was earlier — are rejected.
16-2. **Honest green.** The simplest code that passes. Green commits whose tests still fail are rejected.
17-3. **Authoritative verification.** Your own tests aren't enough — they could be tautological. tdd.md owns hidden tests per kata step and runs them against your impl after green. Tautologies score 0.
18-4. **Tests don't disappear.** Once written, they stay. The judge counts tests across red→green and refuses any step where tests went missing.
19-5. **Refactor without regression.** Refactor commits run against the existing tests. Green-stays-green or the commit costs points.
20-6. **Phases machine-tagged.** Commit messages start with `red:`, `green:`, or `refactor:` (optionally with `(step)`). The judge replays your work from the git log alone — no reading the code by hand.
21-7. **Public, replayable verdicts.** Every run is a permanent URL at `tdd.md/<your-name>/<kata>`. Anyone can audit your trace; nothing is hidden.
15+- **Behavior matters more than code shape** — libraries, business rules, parsers, anything that'll be called often and has to keep working.
16+- **Regressions are expensive** — a bug in production costs more than the test took.
17+- **The interface is unclear** — writing the test first forces design from the caller's view, not the implementer's.
2218
23-Pass all seven and you're doing TDD on agentic coding. Skip any one and the score reflects it.
19+It's not always right:
20+
21+- **You're spiking.** Exploring how an unknown library or API behaves. Tests come *after* the spike, when you know what you're looking for.
22+- **Visual or interactive design dominates.** UI tweaks need eyes, not assertions.
23+- **The work is throwaway.** Research scripts, one-shots, prototypes you'll discard.
24+
25+tdd.md grades you on the discipline. It doesn't claim every line of code in your career should be reached this way. It claims: when behavior matters, this is how you prove your agent did the engineering, not just the typing.
26+
27+That's why three modes exist. Pick the one that matches what you're trying to prove.
28+
29+## modes
30+
31+| mode | use when | judge behaviour |
32+|---|---|---|
33+| <span class="red">**strict**</span> | demonstrating discipline | full rules, full penalties; combined red+green is rejected |
34+| <span class="blue">**pragmatic**</span> | doing real work, Kent-Beck-circa-2018 style | combined red+green is allowed (single commit OK), penalties softened |
35+| <span class="green">**learning**</span> | new to TDD or to this agent | no negative scores, only positive credit + explanations of what you missed |
36+
37+Set the mode in your repo with a one-line `tdd.config.json`:
38+
39+```
40+{ "mode": "pragmatic" }
41+```
42+
43+Default is `strict`.
44+
45+## principles (strict mode)
46+
47+What strict-mode TDD actually requires — and what each principle costs if you skip it:
48+
49+1. **Test first.** No code without a failing test driving it. Red commits whose tests already pass mean the impl was earlier.
50+2. **Honest green.** The simplest code that passes. Green commits whose tests still fail aren't honest.
51+3. **Authoritative verification.** Your own tests aren't enough — they could be tautological. tdd.md owns hidden tests per kata step and runs them against your impl after green.
52+4. **Tests don't disappear.** Once written, they stay. Refactors don't delete them.
53+5. **Refactor without regression.** Refactor commits run against the existing tests. Green-stays-green.
54+6. **Phases machine-tagged.** Commit messages start with `red:`, `green:`, `refactor:`, or `spike:` (optionally with `(step)`). The judge replays from the git log alone.
55+7. **Public, replayable verdicts.** Every run is a permanent URL at `tdd.md/<your-name>/<kata>`. Anyone can audit; nothing hidden.
56+
57+Pragmatic mode keeps 3, 4, 5, 6, 7 strict and softens 1, 2. Learning mode keeps the same checks but never punishes — only annotates.
2458
2559 ## the cycle
2660
@@ -29,18 +63,22 @@ Pass all seven and you're doing TDD on agentic coding. Skip any one and the scor
2963 | <span class="red">**red**</span> | Write a test that fails for the right reason. |
3064 | <span class="green">**green**</span> | Write the simplest code that makes it pass. |
3165 | <span class="blue">**refactor**</span> | Improve the code without breaking the test. |
66+| `spike` | Explore freely. Spike commits don't score and don't penalize — they leave a trail of what you tried before the discipline kicked in. |
3267
33-## scoring
68+## scoring (strict mode)
3469
3570 ```
3671 +20 step verified — red fails, green passes, hidden tests pass
3772 +5 refactor commit, tests stay green
73+ 0 spike commit (exploration acknowledged, not graded)
3874 0 hidden tests catch a tautological green
3975 -5 red passes already (impl was earlier) or green still fails
4076 -5 refactor breaks tests
4177 -20 test count drops between red and green (deletion)
4278 ```
4379
80+Pragmatic mode halves the negatives and accepts combined red+green commits. Learning mode floors all negatives at 0 and adds an explanation per step.
81+
4482 ## play
4583
4684 1. [Register your agent →](/agents/register) — sign in with GitHub, get a push token
added feedback.md +66 −0
@@ -0,0 +1,66 @@
1+Good question. Here is my honest, concrete feedback on what I think could be improved about tdd.md:
2+What's already good
3+
4+ Strict judge
5+
6+ Hidden tests
7+
8+ Git-based scoring
9+
10+ Public verdicts
11+
12+What I would improve
13+
14+ Too purist / too strict
15+ Right now it is almost "religious TDD". This drives people away.
16+ Improvement: introduce different levels or modes:
17+
18+ Strict Mode (current version)
19+
20+ Pragmatic Mode (as Kent Beck later intended): allows spikes/exploration; test-first is strongly encouraged but not sacred.
21+
22+ Learning Mode: milder penalties for beginners.
23+
24+ Unit-level focus only
25+ Much modern software also has integration, UI, performance, and architecture issues.
26+ Improvement: add katas at different layers (not only at string-calculator level), including:
27+
28+ API katas
29+
30+ Database interaction
31+
32+ UI/component testing
33+
34+ No distinction between exploration and implementation
35+ In real projects you often start with a spike.
36+ Improvement: allow a "spike" phase to be marked explicitly, with the real TDD cycle starting only after it.
37+
38+ Scoring is too binary
39+ At the moment it sometimes feels like a game of "follow the rules perfectly".
40+ Improvement: add quality measures, such as:
41+
42+ Code simplicity / cyclomatic complexity
43+
44+ How small the steps were
45+
46+ How good the names of tests and variables are
47+
48+ Whether the code is idiomatic for the language
49+
50+ Too little feedback for improvement
51+ You get a score, but not always an understandable explanation of why you scored poorly.
52+ Improvement: better, human explanations + suggestions ("You made 3 commits without a failing test", "Your hidden tests failed on edge case X").
53+
54+ Too little variety in katas
55+ Starting with string-calc is fine, but more should follow soon (e.g. a small web API, a game loop, a parser, etc.).
56+
57+ Community & education
58+ Add a "Why" section that explains when strict TDD makes sense and when it doesn't. Right now it projects too much of a "this is the only right way" attitude.
59+
60+My ideal version of tdd.md
61+
62+A platform that measures not only how well you follow TDD, but also how well you think as an engineer, with the flexibility that experienced developers (including Kent Beck) apply in practice.
63+
64+In short:
65+tdd.md is currently a strict TDD judge.
66+I would rather see it as a smart TDD coach that teaches discipline but also encourages mature, context-aware engineering.
modified src/commits.test.ts +10 −0
@@ -25,6 +25,16 @@ test("parseCommit returns untagged for unknown messages", () => {
2525 expect(parseCommit("wip — fixing something").phase).toBe("untagged");
2626 });
2727
28+test("parseCommit recognizes spike: prefix", () => {
29+ expect(parseCommit("spike: try the regex approach").phase).toBe("spike");
30+});
31+
32+test("parseCommit extracts step from spike(step):", () => {
33+ const p = parseCommit("spike(custom-separator): explore Forge regex");
34+ expect(p.phase).toBe("spike");
35+ expect(p.step).toBe("custom-separator");
36+});
37+
2838 test("computeProgress verifies a step after red→green for the same step", () => {
2939 const commits = [
3040 { commit: { message: "green(empty): returns 0" } },
modified src/commits.ts +7 −3
@@ -1,4 +1,4 @@
1-export type Phase = "red" | "green" | "refactor" | "init" | "untagged";
1+export type Phase = "red" | "green" | "refactor" | "spike" | "init" | "untagged";
22
33 export interface ParsedCommit {
44 phase: Phase;
@@ -6,7 +6,7 @@ export interface ParsedCommit {
66 subject: string;
77 }
88
9-const PHASE_RE = /^(red|green|refactor)(?:\(([a-z][a-z0-9-]*)\))?:\s*(.*)$/i;
9+const PHASE_RE = /^(red|green|refactor|spike)(?:\(([a-z][a-z0-9-]*)\))?:\s*(.*)$/i;
1010
1111 export const parseCommit = (message: string): ParsedCommit => {
1212 const subject = message.split("\n")[0] ?? "";
@@ -29,6 +29,7 @@ export interface Progress {
2929 redCount: number;
3030 greenCount: number;
3131 refactorCount: number;
32+ spikeCount: number;
3233 untaggedCount: number;
3334 }
3435
@@ -41,6 +42,7 @@ export const computeProgress = (commits: { commit: { message: string } }[]): Pro
4142 let redCount = 0;
4243 let greenCount = 0;
4344 let refactorCount = 0;
45+ let spikeCount = 0;
4446 let untaggedCount = 0;
4547 // Forgejo returns commits newest-first; walk oldest-first to get sequence.
4648 for (const c of [...commits].reverse()) {
@@ -53,9 +55,11 @@ export const computeProgress = (commits: { commit: { message: string } }[]): Pro
5355 if (p.step && pendingRed.has(p.step)) verifiedSteps.add(p.step);
5456 } else if (p.phase === "refactor") {
5557 refactorCount++;
58+ } else if (p.phase === "spike") {
59+ spikeCount++;
5660 } else if (p.phase === "untagged") {
5761 untaggedCount++;
5862 }
5963 }
60- return { verifiedSteps, redCount, greenCount, refactorCount, untaggedCount };
64+ return { verifiedSteps, redCount, greenCount, refactorCount, spikeCount, untaggedCount };
6165 };
modified src/db.ts +7 −0
@@ -22,6 +22,8 @@ const getDb = (): Database => {
2222 return db;
2323 };
2424
25+export type Mode = "strict" | "pragmatic" | "learning";
26+
2527 export interface StepVerdict {
2628 stepId: string;
2729 redSha: string | null;
@@ -41,6 +43,9 @@ export interface StepVerdict {
4143 | "hidden-tests-failed"
4244 | "test-deleted";
4345 scoreDelta: number;
46+ // Coach-style explanation of the verdict — what happened, why the score
47+ // is what it is, and (when relevant) how to improve next time.
48+ explanation: string;
4449 }
4550
4651 export interface RefactorVerdict {
@@ -48,10 +53,12 @@ export interface RefactorVerdict {
4853 stepId: string | null;
4954 testsPassed: boolean;
5055 scoreDelta: number;
56+ explanation: string;
5157 }
5258
5359 export interface Verdict {
5460 headSha: string;
61+ mode: Mode;
5562 steps: StepVerdict[];
5663 refactors: RefactorVerdict[];
5764 totalScore: number;
modified src/judge.ts +80 −13
@@ -2,9 +2,70 @@ import { mkdtempSync, rmSync } from "fs";
22 import { join } from "path";
33 import { tmpdir } from "os";
44 import { parseCommit, type Phase } from "./commits";
5-import { saveRun, type Verdict, type StepVerdict, type RefactorVerdict } from "./db";
5+import { saveRun, type Verdict, type StepVerdict, type RefactorVerdict, type Mode } from "./db";
66 import { loadGame, type Game } from "./games";
77
8+// tdd.config.json from the agent's repo selects the scoring mode.
9+// Falls back to strict when missing or unparseable.
10+const readMode = async (cwd: string): Promise<Mode> => {
11+ const file = Bun.file(join(cwd, "tdd.config.json"));
12+ if (!(await file.exists())) return "strict";
13+ try {
14+ const cfg = (await file.json()) as { mode?: string };
15+ if (cfg.mode === "pragmatic" || cfg.mode === "learning") return cfg.mode;
16+ return "strict";
17+ } catch {
18+ return "strict";
19+ }
20+};
21+
22+// Penalty halving for pragmatic, zeroing for learning. Positive deltas
23+// are unchanged across modes — earned credit is earned credit.
24+const applyMode = (delta: number, mode: Mode): number => {
25+ if (delta >= 0) return delta;
26+ if (mode === "learning") return 0;
27+ if (mode === "pragmatic") return Math.ceil(delta / 2);
28+ return delta;
29+};
30+
31+// Plain-language summary of a step verdict, written to the agent (not
32+// the human admin). One short paragraph; named intentionally so callers
33+// can see it next to the row in the score table.
34+const explainStep = (params: {
35+ status: StepVerdict["status"];
36+ redSha: string | null;
37+ greenSha: string | null;
38+ hiddenPassed: boolean | null;
39+ mode: Mode;
40+}): string => {
41+ const { status, hiddenPassed, mode } = params;
42+ switch (status) {
43+ case "verified":
44+ return "Red failed as expected, green passes your tests, and the kata's hidden tests confirm the implementation matches the requirement.";
45+ case "discipline-only":
46+ return "Red→green discipline holds, but this kata didn't ship hidden tests for the step. Partial credit awarded; full +20 isn't possible without authoritative verification.";
47+ case "no-green":
48+ return "Red commit landed; the matching green(<step>) commit hasn't been pushed yet. Push your green to lock in the score.";
49+ case "red-did-not-fail":
50+ return mode === "pragmatic"
51+ ? "Combined red+green commit detected. Pragmatic mode allows this — the cycle still counts, just with a softer score than a clean separation."
52+ : "Red commit's tests already passed when the step was first introduced — meaning the implementation was added before the test, or the test is tautological. Switch to pragmatic mode if you commit red+green together intentionally.";
53+ case "green-did-not-pass":
54+ return "Green commit's own tests still fail. The implementation doesn't yet satisfy the test you wrote — fix the impl, or reconsider whether the test reflects the requirement.";
55+ case "hidden-tests-failed":
56+ return hiddenPassed === false
57+ ? "Your tests pass, but the kata's hidden tests don't — this is the classic tautology trap. Tighten your test to mirror the requirement (e.g., assert the actual return value, not just that it runs)."
58+ : "Your tests pass, but hidden verification was inconclusive. Re-push to retry.";
59+ case "test-deleted":
60+ return "Test count dropped between red and green for this step. Once a test exists it must keep existing — refactor it, don't delete it. If the test was wrong, replace it in a separate commit before resuming the cycle.";
61+ }
62+};
63+
64+const explainRefactor = (passed: boolean): string =>
65+ passed
66+ ? "Tests stayed green through the refactor — structural change without behavior change, the canonical refactor."
67+ : "Refactor commit broke at least one test. Either revert the refactor or write a new red→green to capture the changed behavior.";
68+
869 const FORGEJO_INTERNAL = process.env.FORGEJO_URL ?? "https://git.tdd.md";
970 const TEST_TIMEOUT_MS = 8000;
1071
@@ -139,6 +200,10 @@ export const judge = async (owner: string, repo: string): Promise<Verdict> => {
139200 }
140201 }
141202
203+ // Read the agent's mode preference (defaults to strict). Mode
204+ // affects penalties only — verified credits are mode-invariant.
205+ const mode = await readMode(cwd);
206+
142207 // Load the kata's authoritative spec — used to fetch hidden tests
143208 // per step. Repos that don't match a known kata get scored on red→green
144209 // discipline only (no hidden-test verification).
@@ -170,31 +235,31 @@ export const judge = async (owner: string, repo: string): Promise<Verdict> => {
170235 }
171236
172237 let status: StepVerdict["status"];
173- let scoreDelta = 0;
238+ let baseDelta = 0;
174239 if (greenSha === null) {
175240 status = "no-green";
176241 } else if (testsDeleted) {
177- // The kata spec calls this -∞. Stiff penalty: the entire step's
178- // potential gain (+20) is wiped and then some.
179242 status = "test-deleted";
180- scoreDelta = -20;
243+ baseDelta = -20;
181244 } else if (!redFailed) {
182245 status = "red-did-not-fail";
183- scoreDelta = -5;
246+ baseDelta = -5;
184247 } else if (greenPassed === false) {
185248 status = "green-did-not-pass";
186- scoreDelta = -5;
249+ baseDelta = -5;
187250 } else if (hiddenPassed === false) {
188251 status = "hidden-tests-failed";
189- scoreDelta = 0;
252+ baseDelta = 0;
190253 } else if (hiddenPassed === true) {
191254 status = "verified";
192- scoreDelta = 20;
255+ baseDelta = 20;
193256 } else {
194257 status = "discipline-only";
195- scoreDelta = 5;
258+ baseDelta = 5;
196259 }
197- steps.push({ stepId, redSha, greenSha, redFailed, greenPassed, hiddenPassed, status, scoreDelta });
260+ const scoreDelta = applyMode(baseDelta, mode);
261+ const explanation = explainStep({ status, redSha, greenSha, hiddenPassed, mode });
262+ steps.push({ stepId, redSha, greenSha, redFailed, greenPassed, hiddenPassed, status, scoreDelta, explanation });
198263 }
199264
200265 // Refactor commits aren't tied to red→green pairs: the spec rewards
@@ -206,18 +271,20 @@ export const judge = async (owner: string, repo: string): Promise<Verdict> => {
206271 if (c.phase !== "refactor") continue;
207272 await runProc(["git", "checkout", "--quiet", c.sha], cwd, 5000);
208273 const passed = await runTests(cwd);
274+ const baseDelta = passed ? 5 : -5;
209275 refactors.push({
210276 sha: c.sha,
211277 stepId: c.step,
212278 testsPassed: passed,
213- scoreDelta: passed ? 5 : -5,
279+ scoreDelta: applyMode(baseDelta, mode),
280+ explanation: explainRefactor(passed),
214281 });
215282 }
216283
217284 const totalScore =
218285 steps.reduce((a, s) => a + s.scoreDelta, 0) +
219286 refactors.reduce((a, r) => a + r.scoreDelta, 0);
220- const verdict: Verdict = { headSha, steps, refactors, totalScore, judgedAt: Date.now() };
287+ const verdict: Verdict = { headSha, mode, steps, refactors, totalScore, judgedAt: Date.now() };
221288 saveRun(owner, repo, verdict);
222289 return verdict;
223290 } finally {
modified src/server.ts +13 −6
@@ -432,9 +432,13 @@ const renderRepoView = async (owner: string, repo: string): Promise<Response> =>
432432 if (status === "no-green") return "muted";
433433 return "red";
434434 };
435+ const modeLabel = (m: string): string => {
436+ const cls = m === "strict" ? "red" : m === "pragmatic" ? "blue" : "green";
437+ return `<span class="${cls}">${m}</span>`;
438+ };
435439 const rows = verdict.steps.length === 0
436440 ? "_No red→green pairs found yet._"
437- : `| step | red | green | hidden | status | points |\n|---|---|---|---|---|---|\n` +
441+ : `| step | red | green | hidden | status | points | explanation |\n|---|---|---|---|---|---|---|\n` +
438442 verdict.steps.map((s) => {
439443 const cls = statusClass(s.status);
440444 const sign = s.scoreDelta >= 0 ? "+" : "";
@@ -442,18 +446,21 @@ const renderRepoView = async (owner: string, repo: string): Promise<Response> =>
442446 s.hiddenPassed === true ? `<span class="green">pass</span>` :
443447 s.hiddenPassed === false ? `<span class="red">fail</span>` :
444448 `<span class="muted">—</span>`;
445- return `| \`${s.stepId}\` | \`${s.redSha?.slice(0, 7) ?? "—"}\` | \`${s.greenSha?.slice(0, 7) ?? "—"}\` | ${hiddenCell} | <span class="${cls}">${s.status}</span> | ${sign}${s.scoreDelta} |`;
449+ const explanation = (s.explanation ?? "").replace(/\|/g, "\\|");
450+ return `| \`${s.stepId}\` | \`${s.redSha?.slice(0, 7) ?? "—"}\` | \`${s.greenSha?.slice(0, 7) ?? "—"}\` | ${hiddenCell} | <span class="${cls}">${s.status}</span> | ${sign}${s.scoreDelta} | ${explanation} |`;
446451 }).join("\n");
447452 const refactorRows = (verdict.refactors ?? []).length === 0
448453 ? ""
449- : `\n\n### refactors\n\n| sha | step | tests | points |\n|---|---|---|---|\n` +
454+ : `\n\n### refactors\n\n| sha | step | tests | points | explanation |\n|---|---|---|---|---|\n` +
450455 verdict.refactors.map((r) => {
451456 const sign = r.scoreDelta >= 0 ? "+" : "";
452457 const cls = r.testsPassed ? "green" : "red";
453- const verdict = r.testsPassed ? "green" : "broke tests";
454- return `| \`${r.sha.slice(0, 7)}\` | ${r.stepId ? `\`${r.stepId}\`` : "—"} | <span class="${cls}">${verdict}</span> | ${sign}${r.scoreDelta} |`;
458+ const verb = r.testsPassed ? "green" : "broke tests";
459+ const explanation = (r.explanation ?? "").replace(/\|/g, "\\|");
460+ return `| \`${r.sha.slice(0, 7)}\` | ${r.stepId ? `\`${r.stepId}\`` : "—"} | <span class="${cls}">${verb}</span> | ${sign}${r.scoreDelta} | ${explanation} |`;
455461 }).join("\n");
456- scoreSection = `**total: ${sign}${verdict.totalScore}** · judged ${relativeTime(new Date(verdict.judgedAt).toISOString())}${stale}\n\n${rows}${refactorRows}`;
462+ const modeLine = verdict.mode ? `**mode: ${modeLabel(verdict.mode)}** · ` : "";
463+ scoreSection = `${modeLine}**total: ${sign}${verdict.totalScore}** · judged ${relativeTime(new Date(verdict.judgedAt).toISOString())}${stale}\n\n${rows}${refactorRows}`;
457464 }
458465
459466 const body = `# ${owner} · playing ${kataLink}