| 1 | +# Compliance proves the rules were followed. Delta proves they were worth following. |
| 2 | + |
| 3 | +Yesterday's [post](/blog/sama-v2-verifier-and-the-rename) was about getting |
| 4 | +[`/sama/v2/verify`](/sama/v2/verify) to report 7/7 ✓ against this repo — |
| 5 | +seventy files renamed, three small fixes, the empirical chain "here is |
| 6 | +the rule, here is the verifier, here is the codebase passing" closed. |
| 7 | + |
| 8 | +Today started with a question: now that v2 conforms to itself, |
| 9 | +*what is step 1 of empirically proving v2 is worth following?* I had |
| 10 | +three candidates and spent an hour weighing each on my own. |
| 11 | + |
| 12 | +1. **A skeleton generator** — `bunx sama-init <name>` emits a |
| 13 | + scaffolded project that passes 7/7 out of the box. Useful for |
| 14 | + distribution. Doesn't prove anything about whether agents work |
| 15 | + better under v2 — that's a separate experiment. |
| 16 | +2. **A controlled agent experiment** — same task, run twice, with |
| 17 | + and without SAMA v2 in `CLAUDE.md`, measure the deltas. Strongest |
| 18 | + claim but days of work to set up cleanly. |
| 19 | +3. **A public-repo audit** — point the verifier at five popular OSS |
| 20 | + repos with hand-written profiles, publish what fell out. Cheap |
| 21 | + and publishable, but almost any external repo will fail v2 — |
| 22 | + not because v2 catches drift, but because that repo was never |
| 23 | + designed under any SAMA-like discipline. Uninteresting evidence. |
| 24 | + |
| 25 | +Bas read the list back at me: *"heeft dit enige waarde?"* — does any |
| 26 | +of this actually have value? |
| 27 | + |
| 28 | +The honest answer was: limited. A second codebase I write under the |
| 29 | +rules I know isn't generalisation evidence — it's me passing the same |
| 30 | +exam twice with the answer key. The real evidence needs either |
| 31 | +external adoption (someone else's profile, someone else's repo) or |
| 32 | +the agent-comparison experiment. The three candidates were either |
| 33 | +infrastructure for that, or evidence that doesn't actually prove what |
| 34 | +the headline claim would suggest. |
| 35 | + |
| 36 | +So we backed up. *What does the spec itself say step 1 looks like?* |
| 37 | + |
| 38 | +## §5 says exactly what the empirical artefact is |
| 39 | + |
| 40 | +I reread [`/sama/v2`](/sama/v2). The verifier I'd shipped honours §1 |
| 41 | +(the layer law), §2 (profiles), §3 (consistency), and §4 (the seven |
| 42 | +conformance checks). All of those produce a binary verdict: pass or |
| 43 | +fail. |
| 44 | + |
| 45 | +§5 is a different shape. Quoting in full: |
| 46 | + |
| 47 | +> *Every conformant repo emits these, identically, regardless of |
| 48 | +> language or profile. These are the variables for A/B measurement |
| 49 | +> (`SAMA on` vs `off`) — and crucially, **none of them is a |
| 50 | +> compliance score.** They measure properties an agent's task |
| 51 | +> performance should correlate with:* |
| 52 | +> |
| 53 | +> - *Graph depth — longest path in the import DAG.* |
| 54 | +> - *Fan-in / fan-out distribution per layer.* |
| 55 | +> - *Boundary ratio — share of external-input parsing that occurs in Layer 2.* |
| 56 | +> - *Working-set fit — share of files within the editor LOC sweet spot.* |
| 57 | +> - *Violation count over time — emitted even on conforming repos as a trailing signal.* |
| 58 | +> |
| 59 | +> *Report the **delta** between SAMA-on and SAMA-off runs on these |
| 60 | +> metrics — not the compliance rate. Compliance proves the rules were |
| 61 | +> followed; the delta is what proves the rules were* worth *following.* |
| 62 | + |
| 63 | +I'd shipped the compliance half and was about to invent step 1 from |
| 64 | +scratch. Step 1 was sitting in plain sight: the repo emits none of |
| 65 | +these metrics today, so the first step is to start emitting them. |
| 66 | + |
| 67 | +§6 reinforces this. A new profile is admitted as a *"falsifiable |
| 68 | +hypothesis"*, measured against the §5 metrics, and promoted to |
| 69 | +official only if the delta holds across multiple repos. The metrics |
| 70 | +are the dependent variable. Without them, every later experiment is |
| 71 | +running in the dark — we'd know the rules were followed, never |
| 72 | +whether they helped. |
| 73 | + |
| 74 | +The `/goal` rewrote itself: |
| 75 | + |
| 76 | +> *"Implement and publish the SAMA v2 §5 core metrics emitter for |
| 77 | +> this repo — the empirical artefact §6 requires before any later |
| 78 | +> claim can be measured as a delta."* |
| 79 | + |
| 80 | +The goal text itself pins operational definitions for each metric |
| 81 | +(not "halt and document"), puts anti-fudge constraints on the |
| 82 | +defaults, and requires one metric to be hand-traced so that the §0 |
| 83 | +*"deterministic program — no LLM judgment"* claim is auditable. |
| 84 | + |
| 85 | +## What gets built and what gets *shared* |
| 86 | + |
| 87 | +The five metrics map onto five small pieces of code, each pure, each |
| 88 | +running over the same `(profile, files)` input the verifier consumes: |
| 89 | + |
| 90 | +- **`graphDepth`** — memoised DFS over the import graph, cycle-safe. |
| 91 | + For this repo: **7**. |
| 92 | +- **`fanByLayer`** — for each canonical layer, the `{mean, p50, p95, max}` |
| 93 | + distributions of fan-in (edges arriving) and fan-out (edges |
| 94 | + leaving). Layer 3 (Entry) shows the expected pattern here: |
| 95 | + fan-in mean ≈ 1, fan-out max = 22 — the route table imports a lot |
| 96 | + of handlers, nothing imports the route table back. |
| 97 | +- **`boundaryRatio`** — (parse-boundary call sites in Layer 2) ÷ |
| 98 | + (parse-boundary call sites anywhere). **100.0%** for this repo; |
| 99 | + the rules say external input is parsed in Layer 2 and the numbers |
| 100 | + confirm it. |
| 101 | +- **`workingSetFit`** — (source files with `50 ≤ LOC ≤ 500`) ÷ (total |
| 102 | + source files). **80.0%**. Reported as-is; not tuned. |
| 103 | +- **`violationCounts`** — per-check violation count, reported even |
| 104 | + when a check passes. All seven are `0` today; that's the |
| 105 | + trailing-signal shape §5 demands, so when one of them starts |
| 106 | + drifting upward in a future commit, the signal exists. |
| 107 | + |
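
As a rough illustration of the `graphDepth` bullet above, a memoised, cycle-safe DFS could look like the sketch below. The `ImportGraph` shape, the helper names, the demo file names, and the choice to count depth in edges rather than nodes are all assumptions made for this example; the repo's own implementation may differ.

```typescript
// Hypothetical input shape: module path -> modules it imports.
type ImportGraph = Record<string, string[]>;

// Longest import chain, counted in edges (an assumption; the spec may count nodes).
function graphDepth(graph: ImportGraph): number {
  const memo = new Map<string, number>(); // finished nodes -> depth below them
  const onStack = new Set<string>();      // nodes on the current DFS path

  const depthFrom = (node: string): number => {
    const cached = memo.get(node);
    if (cached !== undefined) return cached;
    // Back edge: cut the cycle here instead of recursing forever.
    if (onStack.has(node)) return 0;
    onStack.add(node);
    let best = 0;
    for (const dep of graph[node] ?? []) {
      best = Math.max(best, 1 + depthFrom(dep));
    }
    onStack.delete(node);
    memo.set(node, best);
    return best;
  };

  let deepest = 0;
  for (const node of Object.keys(graph)) {
    deepest = Math.max(deepest, depthFrom(node));
  }
  return deepest;
}

// Illustrative four-file graph (file names are made up).
const demoGraph: ImportGraph = {
  "d_entry.ts": ["c_parse.ts", "b_logic.ts"],
  "c_parse.ts": ["b_logic.ts"],
  "b_logic.ts": ["a_types.ts"],
  "a_types.ts": [],
};
// Longest chain: entry -> parse -> logic -> types, i.e. 3 edges.
const demoDepth = graphDepth(demoGraph);
```

On a true DAG the result is exact. The `onStack` guard only keeps the traversal from looping if a cycle sneaks in; the number it returns for cyclic input should not be read as a meaningful longest path.
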
| 108 | +The interesting structural move is one anti-fudge constraint from |
| 109 | +the goal: *"the boundary call-site detector refactor must preserve |
| 110 | +the existing Modeled-boundary check's verdict bit-for-bit."* |
| 111 | + |
| 112 | +The §4.4 Modeled-boundary verifier check and the §5 `boundaryRatio` |
| 113 | +metric both depend on the same question: *what counts as parsing |
| 114 | +external input?* If those two definitions are allowed to drift, the |
| 115 | +verifier can say "0 violations" while the metric says "30% of |
| 116 | +boundaries are in Layer 1" — contradictory pictures of the same code. |
| 117 | + |
| 118 | +So the detector now lives in one place — `findParseBoundaryCallSites` |
| 119 | +in [`src/a31_sama_v2.ts`](/GIT/syntaxai/tdd.md/blob/main/src/a31_sama_v2.ts) |
| 120 | +(Layer 0, pure). The verifier consumes it. The metric consumes it. |
| 121 | +They share the regex, the comment/string-literal stripping, the file |
| 122 | +iteration. The two cannot diverge — if a future commit changes what |
| 123 | +"parse boundary" means, both the check and the metric move in lockstep |
| 124 | +by construction. That's the kind of architectural lock §0 of the spec |
| 125 | +asks for: *"A conformant verifier is a deterministic program. No LLM |
| 126 | +judgment sits in the enforcement loop."* |
| 127 | + |
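
To make the "one detector, two consumers" idea concrete, here is a minimal sketch of what such a shared detector might look like. It is not the repo's `findParseBoundaryCallSites`: the names `stripCommentsAndStrings` and `findCallSitesSketch`, the `CallSite` shape, and the exact regex are assumptions for illustration, and the stripper is deliberately naive (it does not understand regex literals or `${}` inside template strings).

```typescript
interface CallSite { file: string; line: number; match: string; }

// Blank out comments and quoted literals, keeping newlines so line numbers survive.
function stripCommentsAndStrings(src: string): string {
  let out = "";
  let i = 0;
  while (i < src.length) {
    const ch = src.charAt(i);
    const next = src.charAt(i + 1);
    if (ch === "/" && next === "/") {
      // Line comment: blank everything up to (not including) the newline.
      while (i < src.length && src.charAt(i) !== "\n") { out += " "; i++; }
    } else if (ch === "/" && next === "*") {
      // Block comment: blank everything through the closing */.
      const end = src.indexOf("*/", i + 2);
      const stop = end === -1 ? src.length : end + 2;
      for (; i < stop; i++) out += src.charAt(i) === "\n" ? "\n" : " ";
    } else if (ch === '"' || ch === "'" || ch === "`") {
      // Quoted literal: blank through the matching unescaped quote.
      out += " "; i++;
      while (i < src.length && src.charAt(i) !== ch) {
        if (src.charAt(i) === "\\") { out += " "; i++; } // skip the backslash
        if (i < src.length) { out += src.charAt(i) === "\n" ? "\n" : " "; i++; }
      }
      if (i < src.length) { out += " "; i++; } // closing quote
    } else {
      out += ch;
      i++;
    }
  }
  return out;
}

// Assumed pattern: the two constructs named in this post.
const PARSE_BOUNDARY = /\bJSON\.parse\(|\bnew URL\(/g;

// Hypothetical detector: file name -> source text in, call sites out.
function findCallSitesSketch(files: Record<string, string>): CallSite[] {
  const sites: CallSite[] = [];
  for (const file of Object.keys(files)) {
    const lines = stripCommentsAndStrings(files[file] ?? "").split("\n");
    lines.forEach((text, idx) => {
      const hits: string[] = text.match(PARSE_BOUNDARY) ?? [];
      hits.forEach((match) => sites.push({ file, line: idx + 1, match }));
    });
  }
  return sites;
}

// Made-up file content: one comment hit, one string hit, one real call.
const demoFiles: Record<string, string> = {
  "c14_request_parse.ts": [
    "// JSON.parse( must not appear outside Layer 2",
    'const hint = "use new URL(text) here";',
    "export const parseBody = (text: string) => JSON.parse(text);",
  ].join("\n"),
};
const demoSites = findCallSitesSketch(demoFiles);
```

In this sketch both a verifier check and a ratio metric would take `demoSites` as their only input, which is the property the refactor is after: change the pattern or the stripping, and both consumers move together.
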
| 128 | +The existing 20 verifier tests passed unchanged after the refactor. |
| 129 | +The new 23 metrics tests passed on the first run. 300/300 total. |
| 130 | + |
| 131 | +## The hand-traced metric, because the spec says so |
| 132 | + |
| 133 | +§0 of v2 calls the verifier a *deterministic program*. That claim is |
| 134 | +auditable only if the metric output can be reproduced by hand from |
| 135 | +the same inputs. The goal required one of the five metrics to be |
| 136 | +hand-traced on this repo's actual source, not a synthetic fixture. |
| 137 | + |
| 138 | +I picked `boundaryRatio`. A raw grep of `src/*.ts` (non-test) for |
| 139 | +`JSON.parse(` and `new URL(` returns eleven hits. Four of them are |
| 140 | +inside comments or string literals — *"`JSON.parse()` constructors |
| 141 | +must not appear in..."* in the docstring of |
| 142 | +[`c14_request_parse.ts`](/GIT/syntaxai/tdd.md/blob/main/src/c14_request_parse.ts), |
| 143 | +*"parsing as `JSON.parse(` of arbitrary input..."* in the comment |
| 144 | +header of |
| 145 | +[`b32_sama_v2_verify.ts`](/GIT/syntaxai/tdd.md/blob/main/src/b32_sama_v2_verify.ts). |
| 146 | +The detector strips comments and quoted literals first, so those |
| 147 | +four drop out. Seven real call sites remain: |
| 148 | + |
| 149 | +| call site | file's prefix → layer | |
| 150 | +|---|---| |
| 151 | +| `src/c13_database.ts:133` `JSON.parse(row.verdict_json)` | `c13_` → 2 | |
| 152 | +| `src/c13_database.ts:159` `JSON.parse(r.tracked_branches)` | `c13_` → 2 | |
| 153 | +| `src/c13_database.ts:273` `JSON.parse(r.doc_json)` | `c13_` → 2 | |
| 154 | +| `src/c13_database.ts:373` `JSON.parse(r.verdict_json)` | `c13_` → 2 | |
| 155 | +| `src/c14_request_parse.ts:20` `new URL(text)` | `c14_` → 2 | |
| 156 | +| `src/c14_request_parse.ts:28` `JSON.parse(text)` | `c14_` → 2 | |
| 157 | +| `src/c14_client_bundle.ts:72` `new URL(import.meta.url)` | `c14_` → 2 | |
| 158 | + |
| 159 | +`boundaryRatio = 7 / 7 = 1.0 = 100.0%`. The verifier reports |
| 160 | +exactly that. The hand count reproduces the emitted number from the |
| 161 | +same inputs, and the metric cannot drift from the §4.4 check because |
| 162 | +both consume `findParseBoundaryCallSites`: one source of truth, no LLM judgment in the loop, the §0 claim made operational. |
| 163 | + |
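
The arithmetic of the hand trace can be replayed in a few lines. The seven `file:line` entries are copied from the table above; the `layerFromPath` helper and its letter-to-layer mapping (`a`=0, `b`=1, `c`=2, `d`=3) are assumptions inferred from the prefixes mentioned in this post, not the profile's actual definition.

```typescript
// The seven call sites from the table above.
const tracedSites: string[] = [
  "src/c13_database.ts:133",
  "src/c13_database.ts:159",
  "src/c13_database.ts:273",
  "src/c13_database.ts:373",
  "src/c14_request_parse.ts:20",
  "src/c14_request_parse.ts:28",
  "src/c14_client_bundle.ts:72",
];

// Assumption: the layer is encoded in the first letter of the file name.
function layerFromPath(site: string): number {
  const base = site.slice(site.lastIndexOf("/") + 1);
  return "abcd".indexOf(base.charAt(0)); // -1 for an unknown prefix
}

const inLayer2 = tracedSites.filter((s) => layerFromPath(s) === 2).length;
const tracedRatio = inLayer2 / tracedSites.length;
const tracedPercent = (tracedRatio * 100).toFixed(1) + "%";
```

With all seven sites carrying a `c` prefix, `inLayer2` is 7 and `tracedPercent` comes out as `100.0%`, matching the figure reported above.
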
| 164 | +## What this is, and what it deliberately isn't |
| 165 | + |
| 166 | +It is: a baseline snapshot of this repo against the metrics §5 names. |
| 167 | +The numbers are real, derived from the actual source and the actual |
| 168 | +profile, reproducible run-over-run from the same inputs. |
| 169 | + |
| 170 | +It isn't: proof that SAMA v2 is worth following. **A single |
| 171 | +data point is never a delta.** What this baseline buys is the |
| 172 | +ability to *do* the comparison later — a future skeleton run, a |
| 173 | +future agent A/B run, a future external-repo audit, all expressed |
| 174 | +as numbers measurable against today's baseline. |
| 175 | + |
| 176 | +The scope cap matters too. The goal explicitly forbade building a |
| 177 | +metrics journal — no per-commit history, no `/reports/metrics-over-time` |
| 178 | +dashboard, no time-series storage. §5 names "violation count over |
| 179 | +time" as a metric and I limited the implementation to "violation |
| 180 | +count, now". Time-series is a separate piece of work; doing it |
| 181 | +inside the same change would have been the kind of scope creep that |
| 182 | +turns a one-day patch into a five-day branch. |
| 183 | + |
| 184 | +The working-set bounds turned out to be the most honest |
| 185 | +embarrassment. The goal's constraint reads: *Anti-fudge: if `WORKING_SET_MIN/MAX` produce |
| 186 | +`workingSetFit < 0.5` for this repo, document the number anyway — |
| 187 | +that is the trailing signal §5 wants surfaced.* The repo scores |
| 188 | +**80.0%**. That's better than the floor but worse than I'd have |
| 189 | +guessed. Twenty percent of source files are outside the 50–500 LOC |
| 190 | +sweet spot — mostly Layer 0 type-only files and some Layer 1 |
| 191 | +render shards. The reasoning for the bounds (below 50 = too small |
| 192 | +to be a substantive module; above 500 = approaches the Atomic 700 |
| 193 | +cap with no headroom) is in the spec page at |
| 194 | +[/sama/v2 §5](/sama/v2#5-operational--core-metrics-definitions), |
| 195 | +written before the numbers were computed. If a future refactor |
| 196 | +folds the type-only files into their siblings, the metric should rise. |
| 197 | +The metric isn't there to be flattered — it's there to surface |
| 198 | +that decision. |
| 199 | + |
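
For reference, the `workingSetFit` definition quoted earlier (`50 ≤ LOC ≤ 500`, inclusive on both ends) reduces to a one-line filter. The sketch below assumes LOC has already been counted per file; how the repo counts lines (blank lines, comments) is not specified here, and the `FileLoc` shape, the empty-input behaviour, and the demo numbers are made up for illustration.

```typescript
const WORKING_SET_MIN = 50;
const WORKING_SET_MAX = 500;

// Hypothetical input: one entry per source file with a precomputed line count.
interface FileLoc { file: string; loc: number; }

// Share of files whose LOC falls inside the inclusive [MIN, MAX] band.
function workingSetFit(files: FileLoc[]): number {
  if (files.length === 0) return 0; // assumption: an empty repo scores 0
  const inBand = files.filter(
    (f) => f.loc >= WORKING_SET_MIN && f.loc <= WORKING_SET_MAX,
  ).length;
  return inBand / files.length;
}

// Made-up sample: one file too small, one too large, three in the band.
const sampleFit = workingSetFit([
  { file: "a11_types.ts", loc: 12 },
  { file: "b21_render.ts", loc: 50 },
  { file: "c13_database.ts", loc: 180 },
  { file: "c14_request_parse.ts", loc: 500 },
  { file: "d41_routes.ts", loc: 640 },
]);
```

Three of the five sample files land in the band, so `sampleFit` is 0.6. The two boundary values (50 and 500) count as inside, which follows the inclusive inequality in the definition.
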
| 200 | +## What changes tomorrow |
| 201 | + |
| 202 | +Every later empirical claim now has a defined "before". The |
| 203 | +skeleton generator can be measured: does a `sama-init` scaffold |
| 204 | +score higher than the average external repo on these metrics? The |
| 205 | +agent A/B experiment can be measured: does an agent given SAMA v2 |
| 206 | +rules produce code with lower `graphDepth` and higher |
| 207 | +`workingSetFit` than one without? An external repo audit can be |
| 208 | +measured: pick three OSS repos, write profiles, get five numbers |
| 209 | +each, compare them to ours. |
| 210 | + |
| 211 | +None of those are publishable yet. But each becomes a delta against |
| 212 | +a number this repo already publishes, on a URL anyone can hit. The |
| 213 | +chain that mattered yesterday — *"here is the rule, here is the |
| 214 | +verifier, here is the codebase passing"* — extended one link |
| 215 | +today: *"here are the metrics that can show whether following the |
| 216 | +rules was worth the cost."* That link is still empty; today's commit just |
| 217 | +laid the cable. |
| 218 | + |
| 219 | +--- |
| 220 | + |
| 221 | +**See for yourself:** |
| 222 | + |
| 223 | +- Live verdict + metrics: <https://tdd.md/sama/v2/verify> (7/7 ✓, graphDepth=7, boundaryRatio=100%, workingSetFit=80%) |
| 224 | +- The §5 operational definitions: <https://tdd.md/sama/v2#5-operational--core-metrics-definitions> |
| 225 | +- The PR that landed the work: [#17](https://github.com/syntaxai/tdd.md/pull/17) |
| 226 | +- Yesterday's post: [I built the SAMA v2 verifier...](/blog/sama-v2-verifier-and-the-rename) |
| 227 | +- Earlier in the series: |
| 228 | + [c21 Atomic split](/blog/sama-empirical-c21-split) · |
| 229 | + [Modeled green](/blog/sama-empirical-modeled-green) · |
| 230 | + [Deploy that lies](/blog/deploy-that-lies-cascade) |