syntaxai/tdd.md · commit b0576f2

Blog: compliance proves rules followed, delta proves they were worth following

The receipt for today's §5 metrics emitter PR (#17). Walks through the
narrative arc: three weak step-1 candidates (skeleton, agent A/B, public-
repo audit), Bas's "heeft dit enige waarde?" pushback, rereading my own
spec, realising §5 + §6 literally name the empirical artefact I'd
skipped. Then the build: the five metrics, the shared parse-boundary
detector that locks the §4.4 check and the boundaryRatio metric
together by construction, the hand-traced worked example that makes
§0's "deterministic program" claim auditable. Ends on what this
baseline buys (delta-against measurement for any later experiment) and
what it deliberately isn't (proof of worth on its own).

Co-Authored-By: Claude Opus 4.7 <[email protected]>
author: syntaxai <[email protected]>
date: 2026-05-23 14:13:11 +01:00
parent: 0372919
commit: b0576f2eb52ed49b31b95aad5850e454bdd8432e

2 files changed · +236 −0

added content/blog/sama-v2-metrics-emitter.md +230 −0
@@ -0,0 +1,230 @@
1+# Compliance proves the rules were followed. Delta proves they were worth following.
2+
3+Yesterday's [post](/blog/sama-v2-verifier-and-the-rename) was about getting
4+[`/sama/v2/verify`](/sama/v2/verify) to report 7/7 ✓ against this repo —
5+seventy files renamed, three small fixes, the empirical chain "here is
6+the rule, here is the verifier, here is the codebase passing" closed.
7+
8+Today started with a question: now that v2 conforms to itself,
9+*what is step 1 of empirically proving v2 is worth following?* I had
10+three candidates and weighed each on my own for an hour.
11+
12+1. **A skeleton generator** — `bunx sama-init <name>` emits a
13+ scaffolded project that passes 7/7 out of the box. Useful for
14+ distribution. Doesn't prove anything about whether agents work
15+ better under v2 — that's a separate experiment.
16+2. **A controlled agent experiment** — same task, run twice, with
17+ and without SAMA v2 in `CLAUDE.md`, measure the deltas. Strongest
18+ claim but days of work to set up cleanly.
19+3. **A public-repo audit** — point the verifier at five popular OSS
20+ repos with hand-written profiles, publish what fell out. Cheap
21+ and publishable, but almost any external repo will fail v2 —
22+ not because v2 catches drift, but because that repo was never
23+ designed under any SAMA-like discipline. Uninteresting evidence.
24+
25+Bas read the list back at me: *"heeft dit enige waarde?"* — does any
26+of this actually have value?
27+
28+The honest answer was: limited. A second codebase I write under the
29+rules I know isn't generalisation evidence — it's me passing the same
30+exam twice with the answer key. The real evidence needs either
31+external adoption (someone else's profile, someone else's repo) or
32+the agent-comparison experiment. The three candidates were either
33+infrastructure for that, or evidence that doesn't actually prove what
34+the headline claim would suggest.
35+
36+So we backed up. *What does the spec itself say step 1 looks like?*
37+
38+## §5 says exactly what the empirical artefact is
39+
40+I reread [`/sama/v2`](/sama/v2). The verifier I'd shipped honours §1
41+(the layer law), §2 (profiles), §3 (consistency), and §4 (the seven
42+conformance checks). All of those produce a binary verdict: pass or
43+fail.
44+
45+§5 is a different shape. Quoting in full:
46+
47+> *Every conformant repo emits these, identically, regardless of
48+> language or profile. These are the variables for A/B measurement
49+> (`SAMA on` vs `off`) — and crucially, **none of them is a
50+> compliance score.** They measure properties an agent's task
51+> performance should correlate with:*
52+>
53+> - *Graph depth — longest path in the import DAG.*
54+> - *Fan-in / fan-out distribution per layer.*
55+> - *Boundary ratio — share of external-input parsing that occurs in Layer 2.*
56+> - *Working-set fit — share of files within the editor LOC sweet spot.*
57+> - *Violation count over time — emitted even on conforming repos as a trailing signal.*
58+>
59+> *Report the **delta** between SAMA-on and SAMA-off runs on these
60+> metrics — not the compliance rate. Compliance proves the rules were
61+> followed; the delta is what proves the rules were* worth *following.*
62+
63+I'd shipped the compliance half and was about to invent step 1 from
64+scratch. Step 1 was sitting in plain sight: the repo emits none of these
65+metrics today, so step 1 is to emit them.
66+
67+§6 reinforces this. A new profile is admitted as a *"falsifiable
68+hypothesis"*, measured against the §5 metrics, and promoted to
69+official only if the delta holds across multiple repos. The metrics
70+are the dependent variable. Without them, every later experiment is
71+running in the dark — we'd know the rules were followed, never
72+whether they helped.
73+
74+The `/goal` rewrote itself:
75+
76+> *"Implement and publish the SAMA v2 §5 core metrics emitter for
77+> this repo — the empirical artefact §6 requires before any later
78+> claim can be measured as a delta."*
79+
80+With operational definitions for each metric pinned in the goal text
81+itself (not "halt and document"), anti-fudge constraints on the
82+defaults, and a requirement to hand-trace one metric so the §0
83+*"deterministic program — no LLM judgment"* claim is auditable.
84+
85+## What gets built and what gets *shared*
86+
87+The five metrics map onto five small pieces of code, each pure, each
88+running over the same `(profile, files)` input the verifier consumes:
89+
90+- **`graphDepth`** — memoised DFS over the import graph, cycle-safe.
91+ For this repo: **7**.
92+- **`fanByLayer`** — for each canonical layer, the `{mean, p50, p95, max}`
93+ distributions of fan-in (edges arriving) and fan-out (edges
94+ leaving). Layer 3 (Entry) shows the expected pattern here:
95+ fan-in mean ≈ 1, fan-out max = 22 — the route table imports a lot
96+ of handlers, nothing imports the route table back.
97+- **`boundaryRatio`** — (parse-boundary call sites in Layer 2) ÷
98+ (parse-boundary call sites anywhere). **100.0%** for this repo;
99+ the rules say external input is parsed in Layer 2 and the numbers
100+ confirm it.
101+- **`workingSetFit`** — (source files with `50 ≤ LOC ≤ 500`) ÷ (total
102+ source files). **80.0%**. Reported as-is; not tuned.
103+- **`violationCounts`** — per-check violation count, reported even
104+ when a check passes. All seven are `0` today; that's the
105+ trailing-signal shape §5 demands, so when one of them starts
106+ drifting upward in a future commit, the signal exists.
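
As an illustration of the first bullet, here is a minimal sketch of a memoised, cycle-safe DFS for longest import-chain length. It is not the repo's code: the `ImportGraph` shape, the file names, and the choice to count depth in edges rather than nodes are all assumptions made for the example.

```typescript
// Hypothetical input shape: file path -> the files it imports.
type ImportGraph = Map<string, string[]>;

// Longest import chain, counted in edges (an assumption; the spec's
// operational definition may count nodes instead).
function graphDepth(graph: ImportGraph): number {
  const memo = new Map<string, number>();
  const onStack = new Set<string>();

  const depthFrom = (node: string): number => {
    const cached = memo.get(node);
    if (cached !== undefined) return cached;
    if (onStack.has(node)) return 0; // back edge: cut the cycle here
    onStack.add(node);
    let best = 0;
    for (const dep of graph.get(node) ?? []) {
      best = Math.max(best, 1 + depthFrom(dep));
    }
    onStack.delete(node);
    memo.set(node, best);
    return best;
  };

  let max = 0;
  for (const node of graph.keys()) max = Math.max(max, depthFrom(node));
  return max;
}

// Synthetic four-file graph, not this repo's import graph.
const exampleGraph: ImportGraph = new Map<string, string[]>([
  ["d_entry.ts", ["c_parse.ts", "b_render.ts"]],
  ["c_parse.ts", ["a_types.ts"]],
  ["b_render.ts", ["a_types.ts"]],
  ["a_types.ts", []],
]);
const exampleDepth = graphDepth(exampleGraph); // entry -> parse -> types
```

On a true DAG the memo makes each file's depth computed once. If a cycle exists, the back edge is simply ignored, so the function terminates, but the reported value then depends on traversal order.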
107+
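The `{mean, p50, p95, max}` shape from the `fanByLayer` bullet can be sketched as follows. The nearest-rank percentile method, the edge-list input, and the `layerOf` map are assumptions for illustration; the post does not say how the emitter computes percentiles.

```typescript
interface Distribution { mean: number; p50: number; p95: number; max: number; }

// Nearest-rank percentile on an ascending-sorted array (assumed method).
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function summarise(values: number[]): Distribution {
  if (values.length === 0) return { mean: 0, p50: 0, p95: 0, max: 0 };
  const sorted = [...values].sort((x, y) => x - y);
  const sum = sorted.reduce((acc, v) => acc + v, 0);
  return {
    mean: sum / sorted.length,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    max: sorted[sorted.length - 1],
  };
}

// edges are [importer, imported] pairs; layerOf maps each file to its layer.
function fanByLayer(
  edges: Array<[string, string]>,
  layerOf: Map<string, number>,
): Map<number, { fanIn: Distribution; fanOut: Distribution }> {
  const fanIn = new Map<string, number>();
  const fanOut = new Map<string, number>();
  for (const [from, to] of edges) {
    fanOut.set(from, (fanOut.get(from) ?? 0) + 1);
    fanIn.set(to, (fanIn.get(to) ?? 0) + 1);
  }
  const buckets = new Map<number, { ins: number[]; outs: number[] }>();
  for (const [file, layer] of layerOf) {
    const bucket = buckets.get(layer) ?? { ins: [], outs: [] };
    bucket.ins.push(fanIn.get(file) ?? 0);
    bucket.outs.push(fanOut.get(file) ?? 0);
    buckets.set(layer, bucket);
  }
  const result = new Map<number, { fanIn: Distribution; fanOut: Distribution }>();
  for (const [layer, bucket] of buckets) {
    result.set(layer, { fanIn: summarise(bucket.ins), fanOut: summarise(bucket.outs) });
  }
  return result;
}

// Synthetic example: one entry file importing two handlers.
const fanExample = fanByLayer(
  [["routes.ts", "h1.ts"], ["routes.ts", "h2.ts"], ["h1.ts", "types.ts"], ["h2.ts", "types.ts"]],
  new Map<string, number>([["routes.ts", 3], ["h1.ts", 2], ["h2.ts", 2], ["types.ts", 0]]),
);
```

In this toy graph the Layer 3 file has fan-out 2 and fan-in 0, the same qualitative pattern the post describes for the real route table.
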
108+The interesting structural move is one anti-fudge constraint from
109+the goal: *"the boundary call-site detector refactor must preserve
110+the existing Modeled-boundary check's verdict bit-for-bit."*
111+
112+The §4.4 Modeled-boundary verifier check and the §5 `boundaryRatio`
113+metric both depend on the same question: *what counts as parsing
114+external input?* If those two definitions are allowed to drift, the
115+verifier can say "0 violations" while the metric says "30% of
116+boundaries are in Layer 1" — contradictory pictures of the same code.
117+
118+So the detector now lives in one place — `findParseBoundaryCallSites`
119+in [`src/a31_sama_v2.ts`](/GIT/syntaxai/tdd.md/blob/main/src/a31_sama_v2.ts)
120+(Layer 0, pure). The verifier consumes it. The metric consumes it.
121+They share the regex, the comment/string-literal stripping, the file
122+iteration. The two cannot diverge — if a future commit changes what
123+"parse boundary" means, both the check and the metric move in lockstep
124+by construction. That's the kind of architectural lock §0 of the spec
125+asks for: *"A conformant verifier is a deterministic program. No LLM
126+judgment sits in the enforcement loop."*
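
A minimal sketch of that lock, under stated assumptions: only the name `findParseBoundaryCallSites` and the two patterns `JSON.parse(` and `new URL(` come from this post. The regexes, the `CallSite` shape, the two consumer functions, and the reading of the §4.4 check as "no parse-boundary call site outside Layer 2" are illustrative guesses, not the code in `src/a31_sama_v2.ts`.

```typescript
interface CallSite { file: string; line: number; call: string; }

// Comments and quoted literals, matched in source order so "//" inside a
// string is not mistaken for a comment. Regex literals and ${...} bodies
// inside template strings are not handled in this sketch.
const STRIP = /\/\*[\s\S]*?\*\/|\/\/[^\n]*|"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*'|`(?:\\.|[^`\\])*`/g;
const BOUNDARY = /\bJSON\.parse\(|\bnew\s+URL\(/g;

// Replace stripped spans with spaces, keeping newlines so line numbers survive.
function blankCommentsAndStrings(source: string): string {
  return source.replace(STRIP, (m) => m.replace(/[^\n]/g, " "));
}

// The single detector both consumers share.
function findParseBoundaryCallSites(files: Map<string, string>): CallSite[] {
  const sites: CallSite[] = [];
  for (const [file, source] of files) {
    const code = blankCommentsAndStrings(source);
    for (const match of code.matchAll(BOUNDARY)) {
      const line = code.slice(0, match.index).split("\n").length;
      sites.push({ file, line, call: match[0] });
    }
  }
  return sites;
}

// Consumer 1 (assumed semantics of the check): sites outside Layer 2.
function modeledBoundaryViolations(
  files: Map<string, string>,
  layerOf: (file: string) => number,
): CallSite[] {
  return findParseBoundaryCallSites(files).filter((s) => layerOf(s.file) !== 2);
}

// Consumer 2: share of sites inside Layer 2. Returning 1 when there are no
// sites at all is an assumption.
function boundaryRatio(
  files: Map<string, string>,
  layerOf: (file: string) => number,
): number {
  const sites = findParseBoundaryCallSites(files);
  if (sites.length === 0) return 1;
  return sites.filter((s) => layerOf(s.file) === 2).length / sites.length;
}

// Synthetic files and a hypothetical layer lookup, for demonstration only.
const sampleFiles = new Map<string, string>([
  ["src/c14_example.ts", '// JSON.parse( mentioned in a comment\nconst hint = "new URL(";\nconst body = JSON.parse(text);\n'],
  ["src/b20_example.ts", "const url = new URL(raw);\n"],
]);
const sampleLayerOf = (file: string): number => (file.includes("/c") ? 2 : 1);
```

Because both consumers call the same function, a violation count of zero and a ratio below 100% cannot coexist: any site the check ignores is also invisible to the metric.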
127+
128+The existing 20 verifier tests passed unchanged after the refactor.
129+The new 23 metrics tests passed on the first run. 300/300 total.
130+
131+## The hand-traced metric, because the spec says so
132+
133+§0 of v2 calls the verifier a *deterministic program*. That claim is
134+auditable only if the metric output can be reproduced by hand from
135+the same inputs. The goal required one of the five metrics to be
136+hand-traced on this repo's actual source, not a synthetic fixture.
137+
138+I picked `boundaryRatio`. A raw grep of `src/*.ts` (non-test) for
139+`JSON.parse(` and `new URL(` returns eleven hits. Four of them are
140+inside comments or string literals — *"`JSON.parse()` constructors
141+must not appear in..."* in the docstring of
142+[`c14_request_parse.ts`](/GIT/syntaxai/tdd.md/blob/main/src/c14_request_parse.ts),
143+*"parsing as `JSON.parse(` of arbitrary input..."* in the comment
144+header of
145+[`b32_sama_v2_verify.ts`](/GIT/syntaxai/tdd.md/blob/main/src/b32_sama_v2_verify.ts).
146+The detector strips comments and quoted literals first, so those
147+four drop out. Seven real call sites remain:
148+
149+| call site | file's prefix → layer |
150+|---|---|
151+| `src/c13_database.ts:133` `JSON.parse(row.verdict_json)` | `c13_` → 2 |
152+| `src/c13_database.ts:159` `JSON.parse(r.tracked_branches)` | `c13_` → 2 |
153+| `src/c13_database.ts:273` `JSON.parse(r.doc_json)` | `c13_` → 2 |
154+| `src/c13_database.ts:373` `JSON.parse(r.verdict_json)` | `c13_` → 2 |
155+| `src/c14_request_parse.ts:20` `new URL(text)` | `c14_` → 2 |
156+| `src/c14_request_parse.ts:28` `JSON.parse(text)` | `c14_` → 2 |
157+| `src/c14_client_bundle.ts:72` `new URL(import.meta.url)` | `c14_` → 2 |
158+
159+`boundaryRatio = 7 / 7 = 1.0 = 100.0%`. The verifier reports
160+exactly that. The hand count reproduces the emitted number, and the
161+§4.4 check cannot disagree with it because both consume
162+`findParseBoundaryCallSites`: one source of truth, no LLM judgment in the loop, the §0 claim made operational.
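
The arithmetic of the trace can be replayed mechanically from the table. This is a sketch, not the emitter: the call-site strings are copied from the table above, while `layerOfFile` and its prefix map are hypothetical and contain only the two letter-to-layer pairs this post states (`a` → 0, `c` → 2).

```typescript
// Only the mappings the post confirms; other letters are left out on purpose.
const LAYER_BY_LETTER: Record<string, number> = { a: 0, c: 2 };

// Hypothetical helper: read the layer from a file name such as "c13_database.ts".
function layerOfFile(path: string): number | undefined {
  const name = path.split("/").pop() ?? "";
  const match = /^([a-z])\d+_/.exec(name);
  return match ? LAYER_BY_LETTER[match[1]] : undefined;
}

// The seven real call sites from the table, as "file:line".
const tracedSites = [
  "src/c13_database.ts:133",
  "src/c13_database.ts:159",
  "src/c13_database.ts:273",
  "src/c13_database.ts:373",
  "src/c14_request_parse.ts:20",
  "src/c14_request_parse.ts:28",
  "src/c14_client_bundle.ts:72",
];

const tracedInLayer2 = tracedSites.filter(
  (site) => layerOfFile(site.split(":")[0]) === 2,
).length;
const tracedRatio = tracedInLayer2 / tracedSites.length; // 7 / 7
const tracedPercent = `${(tracedRatio * 100).toFixed(1)}%`; // "100.0%"
```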
163+
164+## What this is, and what it deliberately isn't
165+
166+It is: a baseline snapshot of this repo against the metrics §5 names.
167+The numbers are real, derived from the actual source and the actual
168+profile, reproducible run-over-run from the same inputs.
169+
170+It isn't: proof that SAMA v2 is worth following. **A single
171+data point is never a delta.** What this baseline buys is the
172+ability to *do* the comparison later — a future skeleton run, a
173+future agent A/B run, a future external-repo audit, all expressed
174+as numbers measurable against today's baseline.
175+
176+The scope cap matters too. The goal explicitly forbade building a
177+metrics journal — no per-commit history, no `/reports/metrics-over-time`
178+dashboard, no time-series storage. §5 names "violation count over
179+time" as a metric and I limited the implementation to "violation
180+count, now". Time-series is a separate piece of work; doing it
181+inside the same change would have been the kind of scope creep that
182+turns a one-day patch into a five-day branch.
183+
184+The working-set bounds turned out to be the most honest
185+embarrassment. *Anti-fudge: if `WORKING_SET_MIN/MAX` produce
186+`workingSetFit < 0.5` for this repo, document the number anyway —
187+that is the trailing signal §5 wants surfaced.* The repo scores
188+**80.0%**. That's better than the floor but worse than I'd have
189+guessed. Twenty percent of source files are outside the 50–500 LOC
190+sweet spot — mostly Layer 0 type-only files and some Layer 1
191+render shards. The reasoning for the bounds (below 50 = too small
192+to be a substantive module; above 500 = approaches the Atomic 700
193+cap with no headroom) is in the spec page at
194+[/sama/v2 §5](/sama/v2#5-operational--core-metrics-definitions),
195+written before the numbers were computed. If a future refactor
196+folds the type-only files into siblings, the metric should rise.
197+The metric isn't there to be flattered — it's there to surface
198+that decision.
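
For concreteness, a minimal sketch of the ratio with the 50 and 500 bounds quoted above. The constant names echo the goal text; how LOC is counted (here: every newline-delimited line, blank or not) and the value returned for an empty file set are assumptions, and the input files are synthetic.

```typescript
const WORKING_SET_MIN = 50;
const WORKING_SET_MAX = 500;

// Assumed LOC definition: number of newline-delimited lines, blanks included.
function countLoc(source: string): number {
  if (source.length === 0) return 0;
  const lines = source.split("\n");
  // A trailing newline produces one empty element that is not a line.
  return source.endsWith("\n") ? lines.length - 1 : lines.length;
}

// Share of files whose LOC falls inside the inclusive [MIN, MAX] range.
function workingSetFit(files: Map<string, string>): number {
  if (files.size === 0) return 0; // assumption for the empty case
  let inRange = 0;
  for (const source of files.values()) {
    const loc = countLoc(source);
    if (loc >= WORKING_SET_MIN && loc <= WORKING_SET_MAX) inRange++;
  }
  return inRange / files.size;
}

// Synthetic files with 12, 50, 120, 500 and 501 lines.
const syntheticFiles = new Map<string, string>(
  [12, 50, 120, 500, 501].map((n): [string, string] => [`file_${n}.ts`, "x\n".repeat(n)]),
);
const syntheticFit = workingSetFit(syntheticFiles); // 3 of 5 files are in range
```

Both bounds are inclusive, so the 50-line and 500-line files count as fitting while the 12-line and 501-line files do not.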
199+
200+## What changes tomorrow
201+
202+Every later empirical claim now has a defined "before". The
203+skeleton generator can be measured: does a `sama-init` scaffold
204+score higher than the average external repo on these metrics? The
205+agent A/B experiment can be measured: does an agent given SAMA v2
206+rules produce code with lower `graphDepth` and higher
207+`workingSetFit` than one without? An external repo audit can be
208+measured: pick three OSS repos, write profiles, get five numbers
209+each, compare them to ours.
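
What "measured against today's baseline" could look like in code, as a sketch only. The baseline values are the ones this post reports; the `candidate` snapshot is entirely made up to show the subtraction and is not a result of any experiment. The `MetricsSnapshot` shape is also an assumption and omits the fan and violation metrics for brevity.

```typescript
interface MetricsSnapshot {
  graphDepth: number;
  boundaryRatio: number; // 0..1
  workingSetFit: number; // 0..1
}

// Today's numbers for this repo, as published in the post.
const baseline: MetricsSnapshot = { graphDepth: 7, boundaryRatio: 1.0, workingSetFit: 0.8 };

// HYPOTHETICAL values for some later run; invented for illustration.
const candidate: MetricsSnapshot = { graphDepth: 5, boundaryRatio: 1.0, workingSetFit: 0.9 };

// Per-metric difference, later minus earlier.
function metricsDelta(before: MetricsSnapshot, after: MetricsSnapshot): MetricsSnapshot {
  return {
    graphDepth: after.graphDepth - before.graphDepth,
    boundaryRatio: after.boundaryRatio - before.boundaryRatio,
    workingSetFit: after.workingSetFit - before.workingSetFit,
  };
}

const exampleDelta = metricsDelta(baseline, candidate);
```

The sign convention matters when reading the output: a negative `graphDepth` delta means a shallower import graph, a positive `workingSetFit` delta means more files inside the sweet spot.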
210+
211+None of those are publishable yet. But each becomes a delta against
212+a number this repo already publishes, on a URL anyone can hit. The
213+chain that mattered yesterday — *"here is the rule, here is the
214+verifier, here is the codebase passing"* — extended one link
215+today: *"here are the metrics that prove following the rules was
216+worth the cost."* That link is still empty; today's commit just
217+laid the cable.
218+
219+---
220+
221+**See for yourself:**
222+
223+- Live verdict + metrics: <https://tdd.md/sama/v2/verify> (7/7 ✓, graphDepth=7, boundaryRatio=100%, workingSetFit=80%)
224+- The §5 operational definitions: <https://tdd.md/sama/v2#5-operational--core-metrics-definitions>
225+- The PR that landed the work: [#17](https://github.com/syntaxai/tdd.md/pull/17)
226+- Yesterday's post: [I built the SAMA v2 verifier...](/blog/sama-v2-verifier-and-the-rename)
227+- Earlier in the series:
228+ [c21 Atomic split](/blog/sama-empirical-c21-split) ·
229+ [Modeled green](/blog/sama-empirical-modeled-green) ·
230+ [Deploy that lies](/blog/deploy-that-lies-cascade)
modified src/a31_blog.ts +6 −0
@@ -12,6 +12,12 @@ export interface BlogEntry {
12 12 }
13 13
14 14 export const ALL_POSTS: BlogEntry[] = [
15+ {
16+ slug: "sama-v2-metrics-emitter",
17+ title: "Compliance proves the rules were followed. Delta proves they were worth following.",
18+ description: "Yesterday the v2 verifier said 7/7 ✓ against this repo and the empirical chain — rule, verifier, codebase passing — closed for §4. Today I went looking for step 1 of empirically proving v2 is worth following, ran through three weak candidates (skeleton, agent experiment, external-repo audit), and Bas pushed back: \"heeft dit enige waarde?\" Then I reread my own §5 + §6. The spec literally says compliance ≠ proof; the empirical artefact is the delta on five core metrics — graph depth, fan distribution per layer, boundary ratio, working-set fit, violation counts over time. We emitted zero of them. Built the §5 metrics emitter as one Layer 1 pure module sharing a parse-boundary detector with the §4.4 verifier check (they cannot diverge by construction). Real numbers for this repo: graphDepth=7, boundaryRatio=100%, workingSetFit=80%, violationCounts all zero. Hand-traced boundaryRatio against the seven real call sites to match the verifier's number, because §0 says the program is deterministic and that claim is only auditable if a human can reproduce it. This isn't proof v2 works — a baseline is never proof. It's the cable today's PR laid for tomorrow's delta-comparison work.",
19+ date: "2026-05-23",
20+ },
15 21 {
16 22 slug: "sama-v2-verifier-and-the-rename",
17 23 title: "I built the SAMA v2 verifier. It told me my own repo wasn't v2-compliant. Then I renamed 70 files.",