b0576f2eb52ed49b31b95aad5850e454bdd8432e
diff --git a/content/blog/sama-v2-metrics-emitter.md b/content/blog/sama-v2-metrics-emitter.md
new file mode 100644
index 0000000000000000000000000000000000000000..ad38fa25276eb9b2b6af2288048f3bbb5f578939
--- /dev/null
+++ b/content/blog/sama-v2-metrics-emitter.md
@@ -0,0 +1,230 @@
+# Compliance proves the rules were followed. Delta proves they were worth following.
+
+Yesterday's [post](/blog/sama-v2-verifier-and-the-rename) was about getting
+[`/sama/v2/verify`](/sama/v2/verify) to report 7/7 ✓ against this repo —
+seventy files renamed, three small fixes, the empirical chain "here is
+the rule, here is the verifier, here is the codebase passing" closed.
+
+Today started with a question: now that v2 conforms to itself,
+*what is step 1 of empirically proving v2 is worth following?* I had
+three candidates and spent an hour weighing each of them on my own.
+
+1. **A skeleton generator** — `bunx sama-init <name>` emits a
+   scaffolded project that passes 7/7 out of the box. Useful for
+   distribution. Doesn't prove anything about whether agents work
+   better under v2 — that's a separate experiment.
+2. **A controlled agent experiment** — same task, run twice, with
+   and without SAMA v2 in `CLAUDE.md`, measure the deltas. Strongest
+   claim but days of work to set up cleanly.
+3. **A public-repo audit** — point the verifier at five popular OSS
+   repos with hand-written profiles, publish what fell out. Cheap
+   and publishable, but almost any external repo will fail v2 —
+   not because v2 catches drift, but because that repo was never
+   designed under any SAMA-like discipline. Uninteresting evidence.
+
+Bas read the list back at me: *"heeft dit enige waarde?"* — does any
+of this actually have value?
+
+The honest answer was: limited. A second codebase I write under the
+rules I know isn't generalisation evidence — it's me passing the same
+exam twice with the answer key. The real evidence needs either
+external adoption (someone else's profile, someone else's repo) or
+the agent-comparison experiment. The three candidates were either
+infrastructure for that, or evidence that doesn't actually prove what
+the headline claim would suggest.
+
+So we backed up. *What does the spec itself say step 1 looks like?*
+
+## §5 says exactly what the empirical artefact is
+
+I reread [`/sama/v2`](/sama/v2). The verifier I'd shipped honours §1
+(the layer law), §2 (profiles), §3 (consistency), and §4 (the seven
+conformance checks). All of those produce a binary verdict: pass or
+fail.
+
+§5 is a different shape. Quoting in full:
+
+> *Every conformant repo emits these, identically, regardless of
+> language or profile. These are the variables for A/B measurement
+> (`SAMA on` vs `off`) — and crucially, **none of them is a
+> compliance score.** They measure properties an agent's task
+> performance should correlate with:*
+>
+> - *Graph depth — longest path in the import DAG.*
+> - *Fan-in / fan-out distribution per layer.*
+> - *Boundary ratio — share of external-input parsing that occurs in Layer 2.*
+> - *Working-set fit — share of files within the editor LOC sweet spot.*
+> - *Violation count over time — emitted even on conforming repos as a trailing signal.*
+>
+> *Report the **delta** between SAMA-on and SAMA-off runs on these
+> metrics — not the compliance rate. Compliance proves the rules were
+> followed; the delta is what proves the rules were* worth *following.*
+
+I'd shipped the compliance half and was about to invent step 1 from
+scratch. Step 1 was sitting in plain sight: the repo emitted none of
+these metrics, so the first job was to emit them.
+
+§6 reinforces this. A new profile is admitted as a *"falsifiable
+hypothesis"*, measured against the §5 metrics, and promoted to
+official only if the delta holds across multiple repos. The metrics
+are the dependent variable. Without them, every later experiment is
+running in the dark — we'd know the rules were followed, never
+whether they helped.
+
+The `/goal` rewrote itself:
+
+> *"Implement and publish the SAMA v2 §5 core metrics emitter for
+> this repo — the empirical artefact §6 requires before any later
+> claim can be measured as a delta."*
+
+The goal text itself pinned operational definitions for each metric
+(not "halt and document"), put anti-fudge constraints on the
+defaults, and required one metric to be hand-traced so the §0
+*"deterministic program — no LLM judgment"* claim is auditable.
+
+## What gets built and what gets *shared*
+
+The five metrics map onto five small pieces of code, each pure, each
+running over the same `(profile, files)` input the verifier consumes:
+
+- **`graphDepth`** — memoised DFS over the import graph, cycle-safe.
+  For this repo: **7**.
+- **`fanByLayer`** — for each canonical layer, the `{mean, p50, p95, max}`
+  distributions of fan-in (edges arriving) and fan-out (edges
+  leaving). Layer 3 (Entry) shows the expected pattern here:
+  fan-in mean ≈ 1, fan-out max = 22 — the route table imports a lot
+  of handlers, nothing imports the route table back.
+- **`boundaryRatio`** — (parse-boundary call sites in Layer 2) ÷
+  (parse-boundary call sites anywhere). **100.0%** for this repo;
+  the rules say external input is parsed in Layer 2 and the numbers
+  confirm it.
+- **`workingSetFit`** — (source files with `50 ≤ LOC ≤ 500`) ÷ (total
+  source files). **80.0%**. Reported as-is; not tuned.
+- **`violationCounts`** — per-check violation count, reported even
+  when a check passes. All seven are `0` today; that's the
+  trailing-signal shape §5 demands, so when one of them starts
+  drifting upward in a future commit, the signal exists.
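+
+As an illustration of the first bullet, here is a minimal sketch of a
+memoised, cycle-safe longest-path walk. It is not the code in this
+repo: the `ImportGraph` shape and the name `graphDepthSketch` are
+hypothetical, and it assumes depth is counted in edges. If the real
+emitter counts nodes instead, its number would be one higher.
+
+```typescript
+// Hypothetical input shape: file path -> paths it imports.
+type ImportGraph = Map<string, string[]>;
+
+function graphDepthSketch(graph: ImportGraph): number {
+  const memo = new Map<string, number>(); // longest path starting at a node
+  const onStack = new Set<string>(); // nodes on the current DFS path
+
+  const depthFrom = (node: string): number => {
+    const cached = memo.get(node);
+    if (cached !== undefined) return cached;
+    // A node already on the stack means a back edge. Cut it so the walk
+    // terminates; on a true DAG this branch is never taken.
+    if (onStack.has(node)) return 0;
+    onStack.add(node);
+    let best = 0;
+    for (const dep of graph.get(node) ?? []) {
+      best = Math.max(best, 1 + depthFrom(dep));
+    }
+    onStack.delete(node);
+    memo.set(node, best);
+    return best;
+  };
+
+  let longest = 0;
+  for (const node of graph.keys()) {
+    longest = Math.max(longest, depthFrom(node));
+  }
+  return longest;
+}
+```
+
+On an acyclic graph the memo makes each node's depth computed once.
+With cycles the walk still terminates, but the value can depend on
+visit order, which is one reason to treat a cycle as a finding in its
+own right.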
+
+The interesting structural move is one anti-fudge constraint from
+the goal: *"the boundary call-site detector refactor must preserve
+the existing Modeled-boundary check's verdict bit-for-bit."*
+
+The §4.4 Modeled-boundary verifier check and the §5 `boundaryRatio`
+metric both depend on the same question: *what counts as parsing
+external input?* If those two definitions are allowed to drift, the
+verifier can say "0 violations" while the metric says "30% of
+boundaries are in Layer 1" — contradictory pictures of the same code.
+
+So the detector now lives in one place — `findParseBoundaryCallSites`
+in [`src/a31_sama_v2.ts`](/GIT/syntaxai/tdd.md/blob/main/src/a31_sama_v2.ts)
+(Layer 0, pure). The verifier consumes it. The metric consumes it.
+They share the regex, the comment/string-literal stripping, the file
+iteration. The two cannot diverge — if a future commit changes what
+"parse boundary" means, both the check and the metric move in lockstep
+by construction. That's the kind of architectural lock §0 of the spec
+asks for: *"A conformant verifier is a deterministic program. No LLM
+judgment sits in the enforcement loop."*
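+
+The lockstep property can be sketched as follows. This is a
+simplified illustration, not the contents of `a31_sama_v2.ts`: the
+`SourceFile` shape, the regexes, and every function name ending in
+`Sketch` are assumptions, and the comment/string stripping is
+deliberately naive (it ignores regex literals and blanks template
+literals whole).
+
+```typescript
+interface SourceFile { path: string; layer: number; text: string }
+interface CallSite { path: string; layer: number; line: number }
+
+// Comments and quoted literals, matched in one pass so whichever starts
+// first wins (a "//" inside a string is not treated as a comment).
+const NON_CODE =
+  /\/\*[\s\S]*?\*\/|\/\/[^\n]*|"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*'|`(?:\\.|[^`\\])*`/g;
+const BOUNDARY = /\bJSON\.parse\(|\bnew URL\(/g;
+
+// Blank out non-code but keep newlines, so line numbers survive.
+const stripNonCode = (text: string): string =>
+  text.replace(NON_CODE, (m) => m.replace(/[^\n]/g, " "));
+
+// The single detector. Both consumers below call this and nothing else.
+function findCallSitesSketch(files: SourceFile[]): CallSite[] {
+  const sites: CallSite[] = [];
+  for (const f of files) {
+    const code = stripNonCode(f.text);
+    for (const m of code.matchAll(BOUNDARY)) {
+      const line = code.slice(0, m.index ?? 0).split("\n").length;
+      sites.push({ path: f.path, layer: f.layer, line });
+    }
+  }
+  return sites;
+}
+
+// Consumer 1: the check. Any boundary outside Layer 2 is a violation.
+const boundaryViolationsSketch = (files: SourceFile[]): CallSite[] =>
+  findCallSitesSketch(files).filter((s) => s.layer !== 2);
+
+// Consumer 2: the metric. Returning 1 for "no call sites" is an assumption.
+function boundaryRatioSketch(files: SourceFile[]): number {
+  const all = findCallSitesSketch(files);
+  if (all.length === 0) return 1;
+  return all.filter((s) => s.layer === 2).length / all.length;
+}
+```
+
+Because both consumers are thin filters over one list, "0 violations"
+and "ratio below 100%" cannot be reported for the same input.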
+
+The existing 20 verifier tests passed unchanged after the refactor.
+The new 23 metrics tests passed on the first run. 300/300 total.
+
+## The hand-traced metric, because the spec says so
+
+§0 of v2 calls the verifier a *deterministic program*. That claim is
+auditable only if the metric output can be reproduced by hand from
+the same inputs. The goal required one of the five metrics to be
+hand-traced on this repo's actual source, not a synthetic fixture.
+
+I picked `boundaryRatio`. A raw grep of `src/*.ts` (non-test) for
+`JSON.parse(` and `new URL(` returns eleven hits. Four of them are
+inside comments or string literals — *"`JSON.parse()` constructors
+must not appear in..."* in the docstring of
+[`c14_request_parse.ts`](/GIT/syntaxai/tdd.md/blob/main/src/c14_request_parse.ts),
+*"parsing as `JSON.parse(` of arbitrary input..."* in the comment
+header of
+[`b32_sama_v2_verify.ts`](/GIT/syntaxai/tdd.md/blob/main/src/b32_sama_v2_verify.ts).
+The detector strips comments and quoted literals first, so those
+four drop out. Seven real call sites remain:
+
+| call site | file's prefix → layer |
+|---|---|
+| `src/c13_database.ts:133` `JSON.parse(row.verdict_json)` | `c13_` → 2 |
+| `src/c13_database.ts:159` `JSON.parse(r.tracked_branches)` | `c13_` → 2 |
+| `src/c13_database.ts:273` `JSON.parse(r.doc_json)` | `c13_` → 2 |
+| `src/c13_database.ts:373` `JSON.parse(r.verdict_json)` | `c13_` → 2 |
+| `src/c14_request_parse.ts:20` `new URL(text)` | `c14_` → 2 |
+| `src/c14_request_parse.ts:28` `JSON.parse(text)` | `c14_` → 2 |
+| `src/c14_client_bundle.ts:72` `new URL(import.meta.url)` | `c14_` → 2 |
+
+`boundaryRatio = 7 / 7 = 1.0 = 100.0%`. The verifier reports
+exactly that. The hand count reproduces it from grep output alone; the
+check and the metric agree because both consume `findParseBoundaryCallSites`.
+One source of truth, no LLM judgment in the loop, the §0 claim made operational.
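+
+The arithmetic of the trace can be replayed from the table alone. The
+snippet below is only a restatement of the numbers above: the
+`PREFIX_LAYER` lookup contains just the two prefixes the table shows,
+and the helper names are hypothetical.
+
+```typescript
+// Counts taken from the text: eleven raw grep hits, four of them inside
+// comments or string literals.
+const rawGrepHits = 11;
+const hitsInCommentsOrStrings = 4;
+
+// The seven remaining call sites, copied from the table.
+const callSites = [
+  { file: "src/c13_database.ts", line: 133 },
+  { file: "src/c13_database.ts", line: 159 },
+  { file: "src/c13_database.ts", line: 273 },
+  { file: "src/c13_database.ts", line: 373 },
+  { file: "src/c14_request_parse.ts", line: 20 },
+  { file: "src/c14_request_parse.ts", line: 28 },
+  { file: "src/c14_client_bundle.ts", line: 72 },
+];
+
+// Only the prefix -> layer pairs that appear in the table.
+const PREFIX_LAYER: Record<string, number> = { c13_: 2, c14_: 2 };
+
+const layerOf = (path: string): number | undefined => {
+  const name = path.split("/").pop() ?? "";
+  return PREFIX_LAYER[name.slice(0, 4)];
+};
+
+const realSites = rawGrepHits - hitsInCommentsOrStrings;
+const inLayer2 = callSites.filter((s) => layerOf(s.file) === 2).length;
+const handTracedRatio = inLayer2 / callSites.length;
+
+console.log({ realSites, inLayer2, handTracedRatio });
+```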
+
+## What this is, and what it deliberately isn't
+
+It is: a baseline snapshot of this repo against the metrics §5 names.
+The numbers are real, derived from the actual source and the actual
+profile, reproducible run-over-run from the same inputs.
+
+It isn't: proof that SAMA v2 is worth following. **A single
+data point is never a delta.** What this baseline buys is the
+ability to *do* the comparison later — a future skeleton run, a
+future agent A/B run, a future external-repo audit, all expressed
+as numbers measurable against today's baseline.
+
+The scope cap matters too. The goal explicitly forbade building a
+metrics journal — no per-commit history, no `/reports/metrics-over-time`
+dashboard, no time-series storage. §5 names "violation count over
+time" as a metric and I limited the implementation to "violation
+count, now". Time-series is a separate piece of work; doing it
+inside the same change would have been the kind of scope creep that
+turns a one-day patch into a five-day branch.
+
+The working-set bounds turned out to be the most honest
+embarrassment. *Anti-fudge: if `WORKING_SET_MIN/MAX` produce
+`workingSetFit < 0.5` for this repo, document the number anyway —
+that is the trailing signal §5 wants surfaced.* The repo scores
+**80.0%**. That's better than the floor but worse than I'd have
+guessed. Twenty percent of source files are outside the 50–500 LOC
+sweet spot — mostly Layer 0 type-only files and some Layer 1
+render shards. The reasoning for the bounds (below 50 = too small
+to be a substantive module; above 500 = approaches the Atomic 700
+cap with no headroom) is in the spec page at
+[/sama/v2 §5](/sama/v2#5-operational--core-metrics-definitions),
+written before the numbers were computed. If a future refactor
+folds the type-only files into their siblings, the metric should rise.
+The metric isn't there to be flattered — it's there to surface
+that decision.
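+
+For reference, the metric itself is a few lines. This sketch assumes
+the bounds are inclusive at both ends, as `50 ≤ LOC ≤ 500` suggests,
+and takes per-file LOC as a ready-made input; how the real emitter
+counts lines (blank lines, comments) is not specified here, and
+`workingSetFitSketch` is a hypothetical name.
+
+```typescript
+const WORKING_SET_MIN = 50;
+const WORKING_SET_MAX = 500;
+
+// locByFile: source file path -> line count, however LOC is defined.
+function workingSetFitSketch(locByFile: Map<string, number>): number {
+  if (locByFile.size === 0) return 0; // assumption: an empty repo scores 0
+  let inRange = 0;
+  for (const loc of locByFile.values()) {
+    if (loc >= WORKING_SET_MIN && loc <= WORKING_SET_MAX) inRange++;
+  }
+  return inRange / locByFile.size;
+}
+
+// Five made-up files: two fall outside the bounds, three inside.
+const example = new Map<string, number>([
+  ["types_only.ts", 10],
+  ["at_floor.ts", 50],
+  ["typical.ts", 200],
+  ["at_ceiling.ts", 500],
+  ["oversized.ts", 650],
+]);
+console.log(workingSetFitSketch(example));
+```
+
+Keeping the bounds as named constants, fixed before the numbers are
+computed, is what makes the anti-fudge clause checkable: a later change
+to either constant shows up in the diff.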
+
+## What changes tomorrow
+
+Every later empirical claim now has a defined "before". The
+skeleton generator can be measured: does a `sama-init` scaffold
+score higher than the average external repo on these metrics? The
+agent A/B experiment can be measured: does an agent given SAMA v2
+rules produce code with lower `graphDepth` and higher
+`workingSetFit` than one without? An external repo audit can be
+measured: pick three OSS repos, write profiles, get five numbers
+each, compare them to ours.
+
+None of those are publishable yet. But each becomes a delta against
+a number this repo already publishes, on a URL anyone can hit. The
+chain that mattered yesterday — *"here is the rule, here is the
+verifier, here is the codebase passing"* — extended one link
+today: *"here are the metrics that can show whether following the rules
+was worth the cost."* That link is still empty; today's commit just
+laid the cable.
+
+---
+
+**See for yourself:**
+
+- Live verdict + metrics: <https://tdd.md/sama/v2/verify> (7/7 ✓, graphDepth=7, boundaryRatio=100%, workingSetFit=80%)
+- The §5 operational definitions: <https://tdd.md/sama/v2#5-operational--core-metrics-definitions>
+- The PR that landed the work: [#17](https://github.com/syntaxai/tdd.md/pull/17)
+- Yesterday's post: [I built the SAMA v2 verifier...](/blog/sama-v2-verifier-and-the-rename)
+- Earlier in the series:
+  [c21 Atomic split](/blog/sama-empirical-c21-split) ·
+  [Modeled green](/blog/sama-empirical-modeled-green) ·
+  [Deploy that lies](/blog/deploy-that-lies-cascade)
diff --git a/src/a31_blog.ts b/src/a31_blog.ts
index e6d9427491cd91e4409391a4c71fbb23ee25459c..9fa42f6fd7e8d3e37c408612be7ac1087cd99a9b 100644
--- a/src/a31_blog.ts
+++ b/src/a31_blog.ts
@@ -12,6 +12,12 @@ export interface BlogEntry {
 }
 
 export const ALL_POSTS: BlogEntry[] = [
+  {
+    slug: "sama-v2-metrics-emitter",
+    title: "Compliance proves the rules were followed. Delta proves they were worth following.",
+    description: "Yesterday the v2 verifier said 7/7 ✓ against this repo and the empirical chain — rule, verifier, codebase passing — closed for §4. Today I went looking for step 1 of empirically proving v2 is worth following, ran through three weak candidates (skeleton, agent experiment, external-repo audit), and Bas pushed back: \"heeft dit enige waarde?\" Then I reread my own §5 + §6. The spec literally says compliance ≠ proof; the empirical artefact is the delta on five core metrics — graph depth, fan distribution per layer, boundary ratio, working-set fit, violation counts over time. We emitted zero of them. Built the §5 metrics emitter as one Layer 1 pure module sharing a parse-boundary detector with the §4.4 verifier check (they cannot diverge by construction). Real numbers for this repo: graphDepth=7, boundaryRatio=100%, workingSetFit=80%, violationCounts all zero. Hand-traced boundaryRatio against the seven real call sites to match the verifier's number, because §0 says the program is deterministic and that claim is only auditable if a human can reproduce it. This isn't proof v2 works — a baseline is never proof. It's the cable today's PR laid for tomorrow's delta-comparison work.",
+    date: "2026-05-23",
+  },
   {
     slug: "sama-v2-verifier-and-the-rename",
     title: "I built the SAMA v2 verifier. It told me my own repo wasn't v2-compliant. Then I renamed 70 files.",