Greening our own dogfood: four sibling tests, the live verifier flipped from 3/4 to 4/4

The strongest claim a coding standard can make is "I follow my own rules on my own codebase, in public, on the page where I document the rules." The weakest version of that claim is the version you'd make about your own work where nobody can check. Halfway is a /sama/verify?repo=mine URL that anyone can hit — and that says three of four pillars green.

That's where this site was yesterday. Today it's 4/4. Here's the receipt.

The dogfood URL

/sama/verify is the public verifier on tdd.md. Drop in any GitHub owner/name and it runs the four SAMA checks against the default branch — same checks, same code, as the local sama check CLI. The "verify any public repo" link on the SAMA landing page seeds it with this site's own slug as an example:

https://tdd.md/sama/verify?repo=syntaxai/tdd.md

Yesterday that page showed:

Sorted       ✓ pass
Architecture ✓ pass
Modeled      ✗ 4 violations
Atomic       ✓ pass

The four Modeled violations were real. SAMA's Modeled rule is hard: every c32_*.ts file must have a sibling .test.ts. Four files were missing one:

c32_judge.ts        — no sibling test file at c32_judge.test.ts
c32_session.ts      — no sibling test file at c32_session.test.ts
c32_real_reports.ts — no sibling test file at c32_real_reports.test.ts
c32_real_tests.ts   — no sibling test file at c32_real_tests.test.ts
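
The rule behind those four lines is simple enough to sketch. What follows is a minimal illustration, not the verifier's actual code: the function name `findMissingSiblingTests` and the flat list-of-paths input are assumptions made for the example.

```typescript
// Hypothetical sketch of a "sibling test" check; not the real verifier.
// Input: a flat list of file paths. Output: one violation message per
// c32_*.ts source file that has no matching .test.ts in the same directory.
function findMissingSiblingTests(paths: string[]): string[] {
  const present = new Set(paths);
  const violations: string[] = [];
  for (const path of paths) {
    const name = path.slice(path.lastIndexOf("/") + 1);
    // Only c32_*.ts source files are subject to the rule; skip test files.
    if (!/^c32_.+\.ts$/.test(name) || name.endsWith(".test.ts")) continue;
    const sibling = path.replace(/\.ts$/, ".test.ts");
    if (!present.has(sibling)) {
      const siblingName = sibling.slice(sibling.lastIndexOf("/") + 1);
      violations.push(`${name}: no sibling test file at ${siblingName}`);
    }
  }
  return violations;
}
```

Because the check is a pure function of the file list, it can run unchanged against a local working tree or a remote file listing.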

They'd been flagged on every check for weeks. Long enough that the "we follow our own rules" narrative was starting to wear thin every time someone pasted the dogfood URL into a conversation.

What I did about it

Four sibling test files. One per violation. 55 new tests, all real unit tests with real expect() calls — the verifier flags placeholder tests too (zero expect() calls in a test() body) under the same Atomic pass, so cheating with empty bodies isn't an option.
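
To make "placeholder test" concrete, here is a deliberately crude sketch of such a detector. It is an assumption-laden illustration, not how the verifier works: it splits the source text at each `test(` or `it(` call and treats any chunk without an `expect(` call as a placeholder. A real check would parse the file properly.

```typescript
// Hypothetical heuristic, not the real verifier: count test()/it() blocks
// that contain no expect() call. Works on raw source text, so it can be
// fooled by comments, strings, or helper functions between tests.
function countPlaceholderTests(source: string): number {
  // Split at the start of each test(...) / it(...) call. The lookbehind
  // skips method calls such as someRegex.test(...). Chunk 0 is the preamble.
  const chunks = source.split(/(?<![.\w])(?:test|it)\s*\(/).slice(1);
  // A chunk runs from one test call to the next; no expect( means placeholder.
  return chunks.filter((chunk) => !/\bexpect\s*\(/.test(chunk)).length;
}
```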

| File | Tests | What's actually covered |
| --- | --- | --- |
| c32_session.test.ts | 24 | The pure helpers: parseCookies, timingSafeEqual, hmacSha256Hex, sessionCookieHeader, randomHex, plus the full signSession / verifySession round-trip, including forged-signature and expired-cookie rejection paths |
| c32_judge.test.ts | 9 | applyMode: the strict / pragmatic / learning penalty math (positive deltas pass through; pragmatic halves negatives; learning zeroes them). explainRefactor: the two-branch explanation strings |
| c32_real_reports.test.ts | 12 | detectAgent: Claude / Cursor / Aider attribution from commit footers, case-insensitive. buildTrend: 30-day daily commit sparkline (out-of-window drop, same-day stacking, empty input flat-lines) |
| c32_real_tests.test.ts | 10 | detectAgent again (this one returns null instead of "unknown", a documented difference); shortenTestLabel |
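
As an illustration of the penalty math those applyMode tests pin down, here is a minimal sketch. The name `applyModeSketch`, the `(delta, mode)` signature, and the assumption that strict mode leaves negative deltas untouched are mine; the real function in c32_judge.ts may be shaped differently.

```typescript
// Hypothetical reconstruction of the described behaviour, not the real applyMode.
type Mode = "strict" | "pragmatic" | "learning";

const applyModeSketch = (delta: number, mode: Mode): number => {
  // Positive (and zero) deltas pass through unchanged in every mode.
  if (delta >= 0) return delta;
  // Pragmatic halves negative deltas.
  if (mode === "pragmatic") return delta / 2;
  // Learning zeroes negative deltas.
  if (mode === "learning") return 0;
  // Strict: assumed to apply the full penalty.
  return delta;
};
```

Under those assumptions, a delta of -10 stays -10 in strict mode, becomes -5 in pragmatic mode, and becomes 0 in learning mode, while a delta of +4 is +4 everywhere.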

c32_session.ts was the easy one: every helper it ships is already exported. The other three needed a small visibility move: their const helpers (applyMode, explainRefactor, detectAgent, buildTrend, shortenTestLabel) had to become export const so the sibling tests could import them. No behaviour changed. The exports are the test surface SAMA's Modeled rule expects c32 files to have anyway; burying pure helpers as private const defeats the whole point of putting them in a c32 file in the first place.

What the verifier sees now

Local run:

$ bun scripts/sama-cli.ts check
SAMA verify · (local)/src · (working tree)
  examined 71 SAMA files / 16 tests / 74 src files

  S — Sorted: ✓ pass (71 files)
  A — Architecture: ✓ pass (71 files)
  M — Modeled: ✓ pass (55 files)
  A — Atomic: ✓ pass (71 files)

✓ all four checks passed

Live run — same code, just over HTTP:

$ curl -s 'https://tdd.md/sama/verify?repo=syntaxai/tdd.md' \
    | grep -oE '(Sorted|Architecture|Modeled|Atomic)[^<]*<'
Architecture · ✓ pass<
Atomic · ✓ pass<
Modeled · ✓ pass<
Sorted · ✓ pass<

Same answer, two delivery surfaces, one source of truth.
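
The same extraction can be scripted rather than eyeballed. The sketch below reuses the regular expression from the grep above on an HTML string; the function names are hypothetical, and like the grep it assumes each pillar name is followed by its status text before the next tag.

```typescript
// Hypothetical helpers mirroring the curl | grep pipeline above.
// Pull out "Pillar ... status" fragments that end at the next "<".
function extractPillarLines(html: string): string[] {
  const matches = html.match(/(Sorted|Architecture|Modeled|Atomic)[^<]*</g) ?? [];
  // Drop the trailing "<" that the pattern includes, then trim whitespace.
  return matches.map((m) => m.slice(0, -1).trim());
}

// True only when every one of the four pillars has a "✓ pass" line.
function allGreen(lines: string[]): boolean {
  const pillars = ["Sorted", "Architecture", "Modeled", "Atomic"];
  return pillars.every((p) =>
    lines.some((line) => line.startsWith(p) && line.includes("✓ pass")),
  );
}
```

Fed the response body from the verifier URL, `allGreen(extractPillarLines(body))` gives a single boolean that a CI step could assert on.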

Why this matters as evidence

Two things are true that weren't true yesterday:

  1. The publicly-hosted verifier shows this codebase clean. Anyone can run https://tdd.md/sama/verify?repo=syntaxai/tdd.md and see 4/4 ✓. Skeptics don't have to trust me; they can hit the URL.
  2. The fix shape matched the verifier's instruction exactly. The verifier said "no sibling test file at <path>." The fix was four sibling test files at those four paths. No metaprogramming, no reorganisation, no carve-out. The instruction was the diff.

That's the empirical loop SAMA's pitch turns on: the verifier names the missing artifact, the operator (human or agent) produces it, the verifier flips green. No judgement calls, no taste, no escalation. On a working codebase that you're shipping to users, you can run the loop in an afternoon.

What this doesn't prove

It doesn't prove the standard scales. It doesn't prove the standard works on Python or Rust. It doesn't prove AI agents using SAMA write better code than agents not using SAMA (that's issue #1 — 360 measured data points, still to come).

It proves one thing: on this codebase, the rules described in /sama/* are the rules running in bun scripts/sama-cli.ts check, which are the rules running on https://tdd.md/sama/verify, and the codebase passes all four of them. The website is the spec is the verifier is the test suite, and they agree on the same answer.

That's the smallest unit of "this standard isn't just a blog post." The blog post is the receipt; the URL is the proof.


See for yourself: