syntaxai/tdd.md · commit b0576f2

Blog: compliance proves rules followed, delta proves they were worth following

The receipt for today's §5 metrics emitter PR (#17). Walks through the
narrative arc: three weak step-1 candidates (skeleton, agent A/B, public-
repo audit), Bas's "heeft dit enige waarde?" pushback, rereading my own
spec, realising §5 + §6 literally name the empirical artefact I'd
skipped. Then the build: the five metrics, the shared parse-boundary
detector that locks the §4.4 check and the boundaryRatio metric
together by construction, the hand-traced worked example that makes
§0's "deterministic program" claim auditable. Ends on what this
baseline buys (delta-against measurement for any later experiment) and
what it deliberately isn't (proof of worth on its own).

Co-Authored-By: Claude Opus 4.7 <[email protected]>

author: syntaxai <[email protected]>
date: 2026-05-23 14:13:11 +01:00
parent: 0372919
commit: b0576f2eb52ed49b31b95aad5850e454bdd8432e

2 files changed · +236 −0

added content/blog/sama-v2-metrics-emitter.md +230 −0

@@ -0,0 +1,230 @@
	1	+# Compliance proves the rules were followed. Delta proves they were worth following.
	2	+
	3	+Yesterday's [post](/blog/sama-v2-verifier-and-the-rename) was about getting
	4	+[`/sama/v2/verify`](/sama/v2/verify) to report 7/7 ✓ against this repo —
	5	+seventy files renamed, three small fixes, the empirical chain "here is
	6	+the rule, here is the verifier, here is the codebase passing" closed.
	7	+
	8	+Today started with a question: now that v2 conforms to itself,
	9	+what is step 1 of empirically proving v2 is worth following? I had
	10	+three candidates and mulled each over on my own for an hour.
	11	+
	12	+1. A skeleton generator — `bunx sama-init <name>` emits a
	13	+ scaffolded project that passes 7/7 out of the box. Useful for
	14	+ distribution. Doesn't prove anything about whether agents work
	15	+ better under v2 — that's a separate experiment.
	16	+2. A controlled agent experiment — same task, run twice, with
	17	+ and without SAMA v2 in `CLAUDE.md`, measure the deltas. Strongest
	18	+ claim but days of work to set up cleanly.
	19	+3. A public-repo audit — point the verifier at five popular OSS
	20	+ repos with hand-written profiles, publish what fell out. Cheap
	21	+ and publishable, but almost any external repo will fail v2 —
	22	+ not because v2 catches drift, but because that repo was never
	23	+ designed under any SAMA-like discipline. Uninteresting evidence.
	24	+
	25	+Bas read the list back at me: "heeft dit enige waarde?" — does any
	26	+of this actually have value?
	27	+
	28	+The honest answer was: limited. A second codebase I write under the
	29	+rules I know isn't generalisation evidence — it's me passing the same
	30	+exam twice with the answer key. The real evidence needs either
	31	+external adoption (someone else's profile, someone else's repo) or
	32	+the agent-comparison experiment. The three candidates were either
	33	+infrastructure for that, or evidence that doesn't actually prove what
	34	+the headline claim would suggest.
	35	+
	36	+So we backed up. What does the spec itself say step 1 looks like?
	37	+
	38	+## §5 says exactly what the empirical artefact is
	39	+
	40	+I reread [`/sama/v2`](/sama/v2). The verifier I'd shipped honours §1
	41	+(the layer law), §2 (profiles), §3 (consistency), and §4 (the seven
	42	+conformance checks). All of those produce a binary verdict: pass or
	43	+fail.
	44	+
	45	+§5 is a different shape. Quoting in full:
	46	+
	47	+> *Every conformant repo emits these, identically, regardless of
	48	+> language or profile. These are the variables for A/B measurement
	49	+> (`SAMA on` vs `off`) — and crucially, **none of them is a
	50	+> compliance score.** They measure properties an agent's task
	51	+> performance should correlate with:*
	52	+>
	53	+> - Graph depth — longest path in the import DAG.
	54	+> - Fan-in / fan-out distribution per layer.
	55	+> - Boundary ratio — share of external-input parsing that occurs in Layer 2.
	56	+> - Working-set fit — share of files within the editor LOC sweet spot.
	57	+> - Violation count over time — emitted even on conforming repos as a trailing signal.
	58	+>
	59	+> Report the *delta* between SAMA-on and SAMA-off runs on these
	60	+> metrics — not the compliance rate. Compliance proves the rules were
	61	+> followed; the delta is what proves the rules were *worth following*.
	62	+
	63	+I'd shipped the compliance half and was about to invent step 1 from
	64	+scratch. Step 1 was sitting in plain sight: the repo emits none of
	65	+these metrics today, so the first step is to emit them.
	66	+
	67	+§6 reinforces this. A new profile is admitted as a *"falsifiable
	68	+hypothesis"*, measured against the §5 metrics, and promoted to
	69	+official only if the delta holds across multiple repos. The metrics
	70	+are the dependent variable. Without them, every later experiment is
	71	+running in the dark — we'd know the rules were followed, never
	72	+whether they helped.
	73	+
	74	+The `/goal` rewrote itself:
	75	+
	76	+> *"Implement and publish the SAMA v2 §5 core metrics emitter for
	77	+> this repo — the empirical artefact §6 requires before any later
	78	+> claim can be measured as a delta."*
	79	+
	80	+The goal text itself pinned operational definitions for each metric
	81	+(not "halt and document"), set anti-fudge constraints on the
	82	+defaults, and required hand-tracing one metric so the §0
	83	+"deterministic program — no LLM judgment" claim is auditable.
	84	+
	85	+## What gets built and what gets shared
	86	+
	87	+The five metrics map onto five small pieces of code, each pure, each
	88	+running over the same `(profile, files)` input the verifier consumes:
	89	+
	90	+- `graphDepth` — memoised DFS over the import graph, cycle-safe.
	91	+ For this repo: 7.
	92	+- `fanByLayer` — for each canonical layer, the `{mean, p50, p95, max}`
	93	+ distributions of fan-in (edges arriving) and fan-out (edges
	94	+ leaving). Layer 3 (Entry) shows the expected pattern here:
	95	+ fan-in mean ≈ 1, fan-out max = 22 — the route table imports a lot
	96	+ of handlers, nothing imports the route table back.
	97	+- `boundaryRatio` — (parse-boundary call sites in Layer 2) ÷
	98	+ (parse-boundary call sites anywhere). 100.0% for this repo;
	99	+ the rules say external input is parsed in Layer 2 and the numbers
	100	+ confirm it.
	101	+- `workingSetFit` — (source files with `50 ≤ LOC ≤ 500`) ÷ (total
	102	+ source files). 80.0%. Reported as-is; not tuned.
	103	+- `violationCounts` — per-check violation count, reported even
	104	+ when a check passes. All seven are `0` today; that's the
	105	+ trailing-signal shape §5 demands, so when one of them starts
	106	+ drifting upward in a future commit, the signal exists.
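A minimal sketch of the `graphDepth` idea, not the code that shipped in the PR. The `ImportGraph` shape, the demo file names, and the choice to count depth in edges are all assumptions made for illustration. Skipping back edges keeps the walk finite on a cycle, although the memoised result is only exact when the graph really is a DAG.

```typescript
// Hypothetical input shape: file name -> the files it imports.
type ImportGraph = Map<string, string[]>;

// Longest import chain, counted in edges (an assumption; the spec text
// only says "longest path"). Memoised DFS: each file's depth is computed
// once. Files currently on the DFS stack are skipped, so a cycle cannot
// recurse forever.
function graphDepth(graph: ImportGraph): number {
  const memo = new Map<string, number>();
  const onStack = new Set<string>();

  const depthFrom = (file: string): number => {
    const cached = memo.get(file);
    if (cached !== undefined) return cached;
    onStack.add(file);
    let best = 0;
    for (const dep of graph.get(file) ?? []) {
      if (onStack.has(dep)) continue; // back edge: ignore instead of looping
      best = Math.max(best, 1 + depthFrom(dep));
    }
    onStack.delete(file);
    memo.set(file, best);
    return best;
  };

  let longest = 0;
  graph.forEach((_imports, file) => {
    longest = Math.max(longest, depthFrom(file));
  });
  return longest;
}

// Invented four-file graph, not this repo's import graph.
const demoGraph: ImportGraph = new Map<string, string[]>([
  ["entry", ["handler", "types"]],
  ["handler", ["model"]],
  ["model", ["types"]],
  ["types", []],
]);
console.log(graphDepth(demoGraph)); // 3: entry -> handler -> model -> types
```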
	107	+
	108	+The interesting structural move is one anti-fudge constraint from
	109	+the goal: *"the boundary call-site detector refactor must preserve
	110	+the existing Modeled-boundary check's verdict bit-for-bit."*
	111	+
	112	+The §4.4 Modeled-boundary verifier check and the §5 `boundaryRatio`
	113	+metric both depend on the same question: *what counts as parsing
	114	+external input?* If those two definitions are allowed to drift, the
	115	+verifier can say "0 violations" while the metric says "30% of
	116	+boundaries are in Layer 1" — contradictory pictures of the same code.
	117	+
	118	+So the detector now lives in one place — `findParseBoundaryCallSites`
	119	+in [`src/a31_sama_v2.ts`](/GIT/syntaxai/tdd.md/blob/main/src/a31_sama_v2.ts)
	120	+(Layer 0, pure). The verifier consumes it. The metric consumes it.
	121	+They share the regex, the comment/string-literal stripping, the file
	122	+iteration. The two cannot diverge — if a future commit changes what
	123	+"parse boundary" means, both the check and the metric move in lockstep
	124	+by construction. That's the kind of architectural lock §0 of the spec
	125	+asks for: *"A conformant verifier is a deterministic program. No LLM
	126	+judgment sits in the enforcement loop."*
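The lock can be sketched as one detector with two consumers. This is an illustration under stated assumptions, not the contents of `src/a31_sama_v2.ts`: the names `findCallSitesSketch`, `modeledBoundaryViolations`, and `boundaryRatioSketch` are hypothetical, the stripper is a naive regex that ignores regex literals and code inside template-literal `${}` holes, and treating "call site outside Layer 2" as a violation is my reading of the paragraph above.

```typescript
interface SourceFile { path: string; text: string }
interface CallSite { path: string; line: number }

// Naive matcher for block comments, line comments, and quoted literals.
const COMMENT_OR_STRING =
  /\/\*[\s\S]*?\*\/|\/\/[^\n]*|"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*'|`(?:\\.|[^`\\])*`/g;
// The two constructs the post greps for.
const PARSE_BOUNDARY = /\bJSON\.parse\(|\bnew\s+URL\(/g;

// Replace comments and literals with spaces, keeping newlines so that
// line numbers still line up with the original file.
function blankCommentsAndStrings(text: string): string {
  return text.replace(COMMENT_OR_STRING, (m) => m.replace(/[^\n]/g, " "));
}

// The single shared detector: one regex, one stripper, one file loop.
function findCallSitesSketch(files: SourceFile[]): CallSite[] {
  const sites: CallSite[] = [];
  for (const file of files) {
    const lines = blankCommentsAndStrings(file.text).split("\n");
    lines.forEach((lineText, index) => {
      const hits = lineText.match(PARSE_BOUNDARY) ?? [];
      for (let k = 0; k < hits.length; k++) {
        sites.push({ path: file.path, line: index + 1 });
      }
    });
  }
  return sites;
}

type LayerOf = (path: string) => number;

// Consumer 1 (check): every call site outside Layer 2 is a violation.
function modeledBoundaryViolations(files: SourceFile[], layerOf: LayerOf): CallSite[] {
  return findCallSitesSketch(files).filter((s) => layerOf(s.path) !== 2);
}

// Consumer 2 (metric): share of call sites that sit in Layer 2.
function boundaryRatioSketch(files: SourceFile[], layerOf: LayerOf): number {
  const sites = findCallSitesSketch(files);
  if (sites.length === 0) return 1; // assumption: no boundaries counts as fully contained
  return sites.filter((s) => layerOf(s.path) === 2).length / sites.length;
}
```

Because both consumers call the same function, a ratio below 1 and an empty violation list cannot coexist for the same input.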
	127	+
	128	+The existing 20 verifier tests passed unchanged after the refactor.
	129	+The new 23 metrics tests passed on the first run. 300/300 total.
	130	+
	131	+## The hand-traced metric, because the spec says so
	132	+
	133	+§0 of v2 calls the verifier a deterministic program. That claim is
	134	+auditable only if the metric output can be reproduced by hand from
	135	+the same inputs. The goal required one of the five metrics to be
	136	+hand-traced on this repo's actual source, not a synthetic fixture.
	137	+
	138	+I picked `boundaryRatio`. A raw grep of `src/*.ts` (non-test) for
	139	+`JSON.parse(` and `new URL(` returns eleven hits. Four of them are
	140	+inside comments or string literals — *"`JSON.parse()` constructors
	141	+must not appear in..."* in the docstring of
	142	+[`c14_request_parse.ts`](/GIT/syntaxai/tdd.md/blob/main/src/c14_request_parse.ts),
	143	+"parsing as `JSON.parse(` of arbitrary input..." in the comment
	144	+header of
	145	+[`b32_sama_v2_verify.ts`](/GIT/syntaxai/tdd.md/blob/main/src/b32_sama_v2_verify.ts).
	146	+The detector strips comments and quoted literals first, so those
	147	+four drop out. Seven real call sites remain:
	148	+
	149	+| call site | file's prefix → layer |
	150	+|---|---|
	151	+| `src/c13_database.ts:133` `JSON.parse(row.verdict_json)` | `c13_` → 2 |
	152	+| `src/c13_database.ts:159` `JSON.parse(r.tracked_branches)` | `c13_` → 2 |
	153	+| `src/c13_database.ts:273` `JSON.parse(r.doc_json)` | `c13_` → 2 |
	154	+| `src/c13_database.ts:373` `JSON.parse(r.verdict_json)` | `c13_` → 2 |
	155	+| `src/c14_request_parse.ts:20` `new URL(text)` | `c14_` → 2 |
	156	+| `src/c14_request_parse.ts:28` `JSON.parse(text)` | `c14_` → 2 |
	157	+| `src/c14_client_bundle.ts:72` `new URL(import.meta.url)` | `c14_` → 2 |
	158	+
	159	+`boundaryRatio = 7 / 7 = 1.0 = 100.0%`. The verifier reports exactly
	160	+that. The hand count reproduces it, and the §4.4 check and the metric
	161	+cannot disagree because both consume `findParseBoundaryCallSites`: one
	162	+source of truth, no LLM judgment in the loop, the §0 claim made operational.
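The same arithmetic can be replayed mechanically from the table. This is a sketch only: the call sites are copied from the table, while `layerFromPrefix` and its rule that the first letter of the file name encodes the layer (`a` → 0, `b` → 1, `c` → 2) are assumptions inferred from the `c13_` → 2 and `c14_` → 2 rows, not the profile's actual mapping.

```typescript
// The seven call sites from the table, as "path:line" strings.
const tracedSites: string[] = [
  "src/c13_database.ts:133",
  "src/c13_database.ts:159",
  "src/c13_database.ts:273",
  "src/c13_database.ts:373",
  "src/c14_request_parse.ts:20",
  "src/c14_request_parse.ts:28",
  "src/c14_client_bundle.ts:72",
];

// Assumption: the first letter of the file name encodes the layer
// (a -> 0, b -> 1, c -> 2). Hypothetical helper, not the repo's profile.
function layerFromPrefix(site: string): number {
  const fileName = site.split("/").pop() ?? "";
  return fileName.charCodeAt(0) - "a".charCodeAt(0);
}

const inLayer2 = tracedSites.filter((s) => layerFromPrefix(s) === 2).length;
const tracedRatio = inLayer2 / tracedSites.length;

console.log(`${inLayer2} / ${tracedSites.length} = ${(tracedRatio * 100).toFixed(1)}%`);
// prints "7 / 7 = 100.0%"
```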
	163	+
	164	+## What this is, and what it deliberately isn't
	165	+
	166	+It is: a baseline snapshot of this repo against the metrics §5 names.
	167	+The numbers are real, derived from the actual source and the actual
	168	+profile, reproducible run-over-run from the same inputs.
	169	+
	170	+It isn't: proof that SAMA v2 is worth following. **A single
	171	+data point is never a delta.** What this baseline buys is the
	172	+ability to do the comparison later — a future skeleton run, a
	173	+future agent A/B run, a future external-repo audit, all expressed
	174	+as numbers measurable against today's baseline.
	175	+
	176	+The scope cap matters too. The goal explicitly forbade building a
	177	+metrics journal — no per-commit history, no `/reports/metrics-over-time`
	178	+dashboard, no time-series storage. §5 names "violation count over
	179	+time" as a metric and I limited the implementation to "violation
	180	+count, now". Time-series is a separate piece of work; doing it
	181	+inside the same change would have been the kind of scope creep that
	182	+turns a one-day patch into a five-day branch.
	183	+
	184	+The working-set bounds turned out to be the most honest
	185	+embarrassment. The goal's clause: *Anti-fudge: if `WORKING_SET_MIN/MAX` produce
	186	+`workingSetFit < 0.5` for this repo, document the number anyway —
	187	+that is the trailing signal §5 wants surfaced.* The repo scores
	188	+80.0%. That's better than the floor but worse than I'd have
	189	+guessed. Twenty percent of source files are outside the 50–500 LOC
	190	+sweet spot — mostly Layer 0 type-only files and some Layer 1
	191	+render shards. The reasoning for the bounds (below 50 = too small
	192	+to be a substantive module; above 500 = approaches the Atomic 700
	193	+cap with no headroom) is in the spec page at
	194	+[/sama/v2 §5](/sama/v2#5-operational--core-metrics-definitions),
	195	+written before the numbers were computed. If a future refactor
	196	+folds the type-only files into their siblings, the metric should rise.
	197	+The metric isn't there to be flattered — it's there to surface
	198	+that decision.
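A sketch of the working-set calculation with the 50 and 500 bounds from the text. The `FileLoc` shape and the demo file names are hypothetical, LOC arrives precomputed because this post does not say how lines are counted, and returning the out-of-range files alongside the ratio is an illustrative choice so the number cannot be read without seeing what dragged it down.

```typescript
// Bounds as stated in the post: 50 <= LOC <= 500, inclusive on both ends.
const WORKING_SET_MIN = 50;
const WORKING_SET_MAX = 500;

// Hypothetical input shape; how LOC is counted is left to the caller.
interface FileLoc { path: string; loc: number }

function workingSetFit(files: FileLoc[]): { fit: number; outside: FileLoc[] } {
  const outside = files.filter(
    (f) => f.loc < WORKING_SET_MIN || f.loc > WORKING_SET_MAX,
  );
  // Assumption: an empty file list reports 0 rather than dividing by zero.
  const fit = files.length === 0 ? 0 : (files.length - outside.length) / files.length;
  return { fit, outside };
}

// Invented sample, chosen so the boundary values 50 and 500 are exercised.
const sampleFiles: FileLoc[] = [
  { path: "a_types_only.ts", loc: 12 },
  { path: "b_small.ts", loc: 50 },
  { path: "b_medium.ts", loc: 120 },
  { path: "c_large.ts", loc: 300 },
  { path: "c_edge.ts", loc: 500 },
];
const sampleResult = workingSetFit(sampleFiles);
console.log(sampleResult.fit, sampleResult.outside.map((f) => f.path));
// prints 0.8 and the single out-of-range file, a_types_only.ts
```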
	199	+
	200	+## What changes tomorrow
	201	+
	202	+Every later empirical claim now has a defined "before". The
	203	+skeleton generator can be measured: does a `sama-init` scaffold
	204	+score higher than the average external repo on these metrics? The
	205	+agent A/B experiment can be measured: does an agent given SAMA v2
	206	+rules produce code with lower `graphDepth` and higher
	207	+`workingSetFit` than one without? An external repo audit can be
	208	+measured: pick three OSS repos, write profiles, get five numbers
	209	+each, compare them to ours.
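What "measured against the baseline" could look like, as a sketch. The baseline values are the ones published above; `MetricSnapshot`, `metricDelta`, and every number in `hypotheticalRun` are invented for illustration and are not results from any experiment.

```typescript
// Hypothetical snapshot shape covering the three scalar metrics.
interface MetricSnapshot {
  graphDepth: number;
  boundaryRatio: number;
  workingSetFit: number;
}

// Delta = candidate minus baseline, metric by metric. It only reports
// the difference; which direction counts as "better" is left to the
// experiment that uses it.
function metricDelta(baseline: MetricSnapshot, candidate: MetricSnapshot): MetricSnapshot {
  return {
    graphDepth: candidate.graphDepth - baseline.graphDepth,
    boundaryRatio: candidate.boundaryRatio - baseline.boundaryRatio,
    workingSetFit: candidate.workingSetFit - baseline.workingSetFit,
  };
}

// Today's published numbers for this repo.
const baseline: MetricSnapshot = { graphDepth: 7, boundaryRatio: 1.0, workingSetFit: 0.8 };

// Invented values standing in for some later run.
const hypotheticalRun: MetricSnapshot = { graphDepth: 9, boundaryRatio: 0.75, workingSetFit: 0.5 };

const delta = metricDelta(baseline, hypotheticalRun);
console.log(delta.graphDepth, delta.boundaryRatio.toFixed(2), delta.workingSetFit.toFixed(2));
```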
	210	+
	211	+None of those are publishable yet. But each becomes a delta against
	212	+a number this repo already publishes, on a URL anyone can hit. The
	213	+chain that mattered yesterday — *"here is the rule, here is the
	214	+verifier, here is the codebase passing"* — extended one link
	215	+today: *"here are the metrics that prove following the rules was
	216	+worth the cost."* That link is still empty; today's commit just
	217	+laid the cable.
	218	+
	219	+---
	220	+
	221	+See for yourself:
	222	+
	223	+- Live verdict + metrics: <https://tdd.md/sama/v2/verify> (7/7 ✓, graphDepth=7, boundaryRatio=100%, workingSetFit=80%)
	224	+- The §5 operational definitions: <https://tdd.md/sama/v2#5-operational--core-metrics-definitions>
	225	+- The PR that landed the work: [#17](https://github.com/syntaxai/tdd.md/pull/17)
	226	+- Yesterday's post: [I built the SAMA v2 verifier...](/blog/sama-v2-verifier-and-the-rename)
	227	+- Earlier in the series:
	228	+ [c21 Atomic split](/blog/sama-empirical-c21-split) ·
	229	+ [Modeled green](/blog/sama-empirical-modeled-green) ·
	230	+ [Deploy that lies](/blog/deploy-that-lies-cascade)

modified src/a31_blog.ts +6 −0

@@ -12,6 +12,12 @@ export interface BlogEntry {
12	12	}
13	13
14	14	export const ALL_POSTS: BlogEntry[] = [
	15	+ {
	16	+ slug: "sama-v2-metrics-emitter",
	17	+ title: "Compliance proves the rules were followed. Delta proves they were worth following.",
	18	+ description: "Yesterday the v2 verifier said 7/7 ✓ against this repo and the empirical chain — rule, verifier, codebase passing — closed for §4. Today I went looking for step 1 of empirically proving v2 is worth following, ran through three weak candidates (skeleton, agent experiment, external-repo audit), and Bas pushed back: \"heeft dit enige waarde?\" Then I reread my own §5 + §6. The spec literally says compliance ≠ proof; the empirical artefact is the delta on five core metrics — graph depth, fan distribution per layer, boundary ratio, working-set fit, violation counts over time. We emitted zero of them. Built the §5 metrics emitter as one Layer 1 pure module sharing a parse-boundary detector with the §4.4 verifier check (they cannot diverge by construction). Real numbers for this repo: graphDepth=7, boundaryRatio=100%, workingSetFit=80%, violationCounts all zero. Hand-traced boundaryRatio against the seven real call sites to match the verifier's number, because §0 says the program is deterministic and that claim is only auditable if a human can reproduce it. This isn't proof v2 works — a baseline is never proof. It's the cable today's PR laid for tomorrow's delta-comparison work.",
	19	+ date: "2026-05-23",
	20	+ },
15	21	{
16	22	slug: "sama-v2-verifier-and-the-rename",
17	23	title: "I built the SAMA v2 verifier. It told me my own repo wasn't v2-compliant. Then I renamed 70 files.",
