From rules to checks: shipping what the corpus post promised

The corpus post closed with a promise: "three of the ten threads describe failures only an actual test run can catch" — and named which checks would have caught which failure modes (some today, some with a small extension, some only with a sandbox runner). This post is the receipt. Three of those checks now ship: placeholder-test detection (the "one-evening sliver"), historical-commit testing via git worktree (the "next slice on the roadmap"), and /sama/verify (mechanical layer grep + sibling-test + line-count + placeholder check, runnable against any public repo).

why this post is short

The two previous posts made an argument. This one documents an outcome. If the argument was right, the outcome should be small and obvious: the rules become checks, the checks become routes, the routes catch the failure modes the corpus catalogued. Here's what that looks like in practice.

1 · placeholder-test detection (caught today)

Failure mode: r/ClaudeCode 1qix264, "Claude wrote 90 placeholder tests and reported 100% pass rate". The corpus post said:

"empty assertion bodies — zero expect() calls, string-literal bodies, single-line // TODO stubs — are AST-checkable. The test bundle already lives in content/git-history/syntaxai__tdd.md__tests.json; an empty-body check is a one-evening sliver."

The check is a regex-based brace walker that extracts every test(...) and it(...) body, counts expect( occurrences, and flags zero-count bodies. It runs at deploy time as part of the existing snapshot-tests.ts script and writes its findings to the bundle as placeholderTests: { name, file, reason }[]. The runtime renderer surfaces them on /reports/live/tests:

  • Zero placeholders → a small "no placeholder tests detected at this snapshot" note explaining what the check looks for.
  • One or more → a flagged section with a per-test table: name, file, reason ("no expect() calls", "empty test body", "comment-only stub").

It catches the most common shape of 1qix264 directly (expect() count is zero). It does not understand custom assertion helpers: a test that asserts only through a helper, with no expect( in its own body, gets flagged even though it is real. The regex is aimed at the failures the corpus actually recorded, not every imaginable one.
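
A minimal sketch of what such a brace walker can look like, not the code in snapshot-tests.ts. The names findPlaceholderTests, classify and TEST_OPEN are hypothetical, and the sketch assumes test callbacks take no destructured parameters and that braces do not appear inside string literals in the body.

```typescript
// Hypothetical sketch of a regex-plus-brace-walk placeholder check.
type PlaceholderTest = { name: string; file: string; reason: string };

// Opening of test("name", ... or it('name', ... (not preceded by "." or a word char).
const TEST_OPEN = /(?<![.\w])(?:test|it)\(\s*(["'`])(.*?)\1/g;

function classify(body: string): string | null {
  if (body.trim() === "") return "empty test body";
  // Strip line and block comments before looking for assertions.
  const code = body.replace(/\/\/.*$/gm, "").replace(/\/\*[\s\S]*?\*\//g, "");
  if (code.trim() === "") return "comment-only stub";
  if (!code.includes("expect(")) return "no expect() calls";
  return null;
}

function findPlaceholderTests(source: string, file: string): PlaceholderTest[] {
  const found: PlaceholderTest[] = [];
  for (const m of source.matchAll(TEST_OPEN)) {
    // Assumption: the first "{" after the test name opens the callback body.
    const start = source.indexOf("{", m.index! + m[0].length);
    if (start === -1) continue;
    // Walk forward, tracking brace depth, until the body closes.
    let depth = 0;
    let end = start;
    for (; end < source.length; end++) {
      if (source[end] === "{") depth++;
      else if (source[end] === "}" && --depth === 0) break;
    }
    const reason = classify(source.slice(start + 1, end));
    if (reason) found.push({ name: m[2], file, reason });
  }
  return found;
}
```

The reason strings match the three shown in the per-test table above; everything else about the shape is a guess at one workable design.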

2 · historical-commit testing (the sandbox runner sliver)

Failure mode: r/ClaudeCode 1rug14a, "Claude wrote Playwright tests that secretly patched the app at runtime". This is the failure that the previous reporting layer couldn't catch — the diff looks fine, the test passes in the agent's terminal, the test passes in the deploy-time bundle too if the bundle only ever ran HEAD. Catching this needs the same test to run somewhere it's never run before, against the actual code at that SHA.

The new mode: SAMA_HISTORY_DEPTH=N in the deploy environment makes the snapshot script also test the last N commits that aren't already in the bundle. Mechanically:

// scripts/p620/snapshot-tests.ts (excerpt)
WORKTREE="/tmp/tdd-md-wt-$SHA"
git worktree add --detach "$WORKTREE" "$SHA"
ln -s "$REPO_ROOT/node_modules" "$WORKTREE/node_modules"
(cd "$WORKTREE" && bun test --reporter=junit --reporter-outfile="/tmp/junit-$SHA.xml")
git worktree remove --force "$WORKTREE"

Each historical run produces the same TestRunRecord shape as a HEAD run, gets appended to the bundle keyed by SHA, and feeds the existing stability table. Two consequences:

  • Stability data builds faster. A first SAMA_HISTORY_DEPTH=10 deploy backfills ten historical runs in one go instead of waiting ten deploys.
  • Runtime-patching becomes detectable in principle. A test that passed in the agent's session AND in the original deploy run, but fails when re-run from a clean worktree at the same SHA, is the smoking-gun shape of 1rug14a. We're not yet wired to flag the discrepancy as a separate failure mode (that's the next sliver), but the data to compare is now in the bundle.
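
The comparison that is not wired up yet could be sketched roughly as follows. This is an illustration under stated assumptions, not shipped code: the post does not show the real TestRunRecord fields, so the source tag and the results map below are hypothetical, as is the function name.

```typescript
// Hypothetical record shape; the real TestRunRecord fields are not shown in the post.
type TestStatus = "pass" | "fail" | "skip";
type TestRunRecord = {
  sha: string;
  source: "deploy" | "worktree"; // assumed tag: original deploy run vs clean re-run
  results: Record<string, TestStatus>; // test name -> status
};
type Discrepancy = { sha: string; test: string };

// Flags tests that passed in the original deploy run but fail from a clean
// worktree at the same SHA.
function findCleanRunDiscrepancies(records: TestRunRecord[]): Discrepancy[] {
  const deployBySha = new Map<string, TestRunRecord>();
  for (const r of records) {
    if (r.source === "deploy") deployBySha.set(r.sha, r);
  }
  const out: Discrepancy[] = [];
  for (const clean of records) {
    if (clean.source !== "worktree") continue;
    const original = deployBySha.get(clean.sha);
    if (!original) continue; // nothing to compare against
    for (const [test, status] of Object.entries(clean.results)) {
      if (status === "fail" && original.results[test] === "pass") {
        out.push({ sha: clean.sha, test });
      }
    }
  }
  return out;
}
```

A hit here is a lead rather than a verdict: flaky tests and environment differences produce the same shape, so a real flag would probably need to hold across repeated runs.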

The default is still SAMA_HISTORY_DEPTH=0 (HEAD-only). Keeping it opt-in bounds deploy time; raising the default to 5 or 10 is a one-line change when we want it.
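
One way to read the depth setting and pick which commits to backfill, as a sketch. Both function names are hypothetical, and the sketch takes one interpretation of "the last N commits that aren't already in the bundle": walk recent commits newest-first, skip any SHA already bundled, and stop at N.

```typescript
// Parse the opt-in depth. Missing, non-numeric, fractional, or negative values
// all fall back to 0, which means HEAD-only.
function parseHistoryDepth(raw: string | undefined): number {
  const n = Number(raw ?? "");
  return Number.isInteger(n) && n > 0 ? n : 0;
}

// recentShas: newest-first, HEAD excluded (assumed to come from git log).
// bundled: SHAs that already have a run in the bundle.
function selectShasToBackfill(
  recentShas: string[],
  bundled: Set<string>,
  depth: number,
): string[] {
  return recentShas.filter((sha) => !bundled.has(sha)).slice(0, depth);
}
```

With a depth of 0 the selection is always empty, so the HEAD-only default needs no special case.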

3 · /sama/verify (mechanical check for any public repo)

The corpus post argued: "don't write a CLAUDE.md instruction the harness can overrule. Write a structural check the harness doesn't get to know about." That argument is hollow if the structural checks aren't actually runnable. The new route closes the loop:

/sama/verify — paste a public GitHub repo, get a four-discipline report. The mechanics:

  1. One GitHub API call to git/trees/<default-branch>?recursive=1 resolves the file list.
  2. Every src/cXX_*.ts file is fetched via raw.githubusercontent.com (does not count against the GitHub API rate limit, no token needed).
  3. Pure logic in c32_sama_verify.ts runs the four checks:
    • S — Sorted: every relative from "./..." import in a cXX_*.ts is parsed; flag if the target's prefix is higher than the source's.
    • A — Architecture: every cXX_ prefix is matched against the known set (c11, c13, c14, c21, c31, c32, c51); unknown ones flagged.
    • M — Modeled: every cXX_<name>.ts (non-test) is checked for a sibling cXX_<name>.test.ts. Hard-fails for c32_* (logic); informational for c31_* (often pure-data registries).
    • A — Atomic: line count over 700 → flagged. Test files → run the same placeholder check from sliver #1.
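
The Sorted and Architecture checks above reduce to parsing a two-digit prefix and comparing it. A rough sketch follows, assuming prefixes compare numerically. The helper names and the Violation shape are illustrative, not the contents of c32_sama_verify.ts.

```typescript
type Violation = { file: string; detail: string };

// The known set listed in the Architecture check.
const KNOWN_PREFIXES = new Set(["c11", "c13", "c14", "c21", "c31", "c32", "c51"]);

// "src/c32_sama_verify.ts" -> "c32"; null for files outside the cXX_ scheme.
function layerPrefix(path: string): string | null {
  const m = /(?:^|\/)(c\d{2})_[^/]*$/.exec(path);
  return m ? m[1] : null;
}

// Architecture: flag any cXX_ prefix that is not in the known set.
function checkArchitecture(files: string[]): Violation[] {
  const out: Violation[] = [];
  for (const file of files) {
    const prefix = layerPrefix(file);
    if (prefix && !KNOWN_PREFIXES.has(prefix)) {
      out.push({ file, detail: `unknown prefix ${prefix}` });
    }
  }
  return out;
}

// Sorted: flag relative imports whose target prefix is higher than the source's.
function checkSorted(file: string, source: string): Violation[] {
  const from = layerPrefix(file);
  if (!from) return [];
  const out: Violation[] = [];
  for (const m of source.matchAll(/from\s+["']\.\/([^"']+)["']/g)) {
    const to = layerPrefix(m[1]);
    if (to && Number(to.slice(1)) > Number(from.slice(1))) {
      out.push({ file, detail: `imports ./${m[1]} (${to} is higher than ${from})` });
    }
  }
  return out;
}
```

Because both functions take plain strings, they can be unit-tested without any network access, which fits the "pure logic" framing of step 3.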

Output: pass/fail per discipline, with up to 20 violations per check listed (file + detail). Cached for an hour per repo.
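
The Modeled and Atomic checks, together with the 20-violation cap, can be sketched the same way. The CheckResult shape and the function names are assumptions made for illustration. The sketch covers only the hard-failing c32_* case and counts lines by splitting on newlines, so a trailing newline adds one to the count.

```typescript
type Finding = { file: string; detail: string };
type CheckResult = { pass: boolean; total: number; violations: Finding[] };

const MAX_LISTED = 20; // violations listed per check
const MAX_LINES = 700; // atomic split threshold

// Keep the true count in `total`, but list at most MAX_LISTED findings.
function toResult(all: Finding[]): CheckResult {
  return { pass: all.length === 0, total: all.length, violations: all.slice(0, MAX_LISTED) };
}

// Modeled: every non-test c32_<name>.ts needs a sibling c32_<name>.test.ts.
function checkModeled(paths: string[]): CheckResult {
  const present = new Set(paths);
  const missing = paths
    .filter((p) => /(?:^|\/)c32_[^/]+\.ts$/.test(p) && !p.endsWith(".test.ts"))
    .filter((p) => !present.has(p.replace(/\.ts$/, ".test.ts")))
    .map((p) => ({ file: p, detail: "no sibling test file" }));
  return toResult(missing);
}

// Atomic: flag files whose line count exceeds the threshold.
function checkAtomic(files: Record<string, string>): CheckResult {
  const over = Object.entries(files)
    .map(([file, text]) => ({ file, lines: text.split("\n").length }))
    .filter((f) => f.lines > MAX_LINES)
    .map((f) => ({ file: f.file, detail: `${f.lines} lines (over ${MAX_LINES})` }));
  return toResult(over);
}
```

Keeping total separate from the capped list lets a report say "25 violations, first 20 shown" without a second pass over the files.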

Try it on this site: /sama/verify?repo=syntaxai/tdd.md. And here's the dogfood result, honestly:

| check | tdd.md self-verify result |
| --- | --- |
| S · Sorted | ✓ pass: no UI dependency leaks into foundation/data/logic |
| A · Architecture | ✓ pass: every prefix is in the known set |
| M · Modeled | ✗ 5 violations: c32_judge.ts, c32_session.ts, c32_real_reports.ts, c32_real_tests.ts, c32_sama_verify.ts lack sibling test files |
| A · Atomic | ✗ 1 violation: c21_app.ts is 1066 lines (over the 700-line split threshold) |

Two of four fail, and they're real. Five c32_* logic files — including c32_sama_verify.ts, the file that runs the verification — don't have sibling tests yet, and the route dispatcher has grown past the atomic threshold and now needs a per-domain split. Both findings were caught by the tool we just shipped, against the codebase we just shipped it from. That's the dogfood story: not "everything passes" but "the tool catches real things in real code, including its own". Both are on the very next slice of the roadmap.

what this changes about the case

The argument has now been made in three layers:

  1. The harness postmortem post said: structural rules survive harness chaos because they're enforced outside the agent's context window.
  2. The corpus post said: ten threads prove the failure modes are systematic, here are the rules that catch each, here's what we catch and what we don't yet.
  3. This post says: the rules are now checks, the checks are now URLs you can hit, and you can verify the case against any public repo including this one.

The leftover work — flagging a runtime-patching discrepancy as a distinct failure mode, hidden-test verification on real-project commits, AST-level placeholder detection beyond the regex — is in the open. It's smaller than what shipped this week.

tl;dr

The two previous posts made a case from text. This one ships the checks the case promised:

| sliver | route | catches | status |
| --- | --- | --- | --- |
| placeholder detection | /reports/live/tests | r/ClaudeCode 1qix264 ("90 placeholder tests, 100% pass") | live |
| historical-commit testing | snapshot script with SAMA_HISTORY_DEPTH=N | runtime-patching SHAs (groundwork for 1rug14a) | opt-in, default 0 |
| /sama/verify | /sama/verify | layer violations, missing sibling tests, oversized files, placeholder tests, in any public repo | live |

If the discipline is real, you should be able to point it at a repo and have it report findings. Now you can.