b0576f2eb52ed49b31b95aad5850e454bdd8432e
diff --git a/content/blog/sama-v2-metrics-emitter.md b/content/blog/sama-v2-metrics-emitter.md
new file mode 100644
index 0000000000000000000000000000000000000000..ad38fa25276eb9b2b6af2288048f3bbb5f578939
--- /dev/null
+++ b/content/blog/sama-v2-metrics-emitter.md
@@ -0,0 +1,230 @@
+# Compliance proves the rules were followed. Delta proves they were worth following.
+
+Yesterday's [post](/blog/sama-v2-verifier-and-the-rename) was about getting
+[`/sama/v2/verify`](/sama/v2/verify) to report 7/7 ✓ against this repo —
+seventy files renamed, three small fixes, the empirical chain "here is
+the rule, here is the verifier, here is the codebase passing" closed.
+
+Today started with a question: now that v2 conforms to itself,
+*what is step 1 of empirically proving v2 is worth following?* I had
+three candidates and ran each by myself for an hour.
+
+1. **A skeleton generator** — `bunx sama-init ` emits a
+   scaffolded project that passes 7/7 out of the box. Useful for
+   distribution. Doesn't prove anything about whether agents work
+   better under v2 — that's a separate experiment.
+2. **A controlled agent experiment** — same task, run twice, with
+   and without SAMA v2 in `CLAUDE.md`, measure the deltas. Strongest
+   claim but days of work to set up cleanly.
+3. **A public-repo audit** — point the verifier at five popular OSS
+   repos with hand-written profiles, publish what fell out. Cheap
+   and publishable, but almost any external repo will fail v2 —
+   not because v2 catches drift, but because that repo was never
+   designed under any SAMA-like discipline. Uninteresting evidence.
+
+Bas read the list back at me: *"heeft dit enige waarde?"* — does any
+of this actually have value?
+
+The honest answer was: limited. A second codebase I write under the
+rules I know isn't generalisation evidence — it's me passing the same
+exam twice with the answer key.
+The real evidence needs either
+external adoption (someone else's profile, someone else's repo) or
+the agent-comparison experiment. The three candidates were either
+infrastructure for that, or evidence that doesn't actually prove what
+the headline claim would suggest.
+
+So we backed up. *What does the spec itself say step 1 looks like?*
+
+## §5 says exactly what the empirical artefact is
+
+I reread [`/sama/v2`](/sama/v2). The verifier I'd shipped honours §1
+(the layer law), §2 (profiles), §3 (consistency), and §4 (the seven
+conformance checks). All of those produce a binary verdict: pass or
+fail.
+
+§5 is a different shape. Quoting in full:
+
+> *Every conformant repo emits these, identically, regardless of
+> language or profile. These are the variables for A/B measurement
+> (`SAMA on` vs `off`) — and crucially, **none of them is a
+> compliance score.** They measure properties an agent's task
+> performance should correlate with:*
+>
+> - *Graph depth — longest path in the import DAG.*
+> - *Fan-in / fan-out distribution per layer.*
+> - *Boundary ratio — share of external-input parsing that occurs in Layer 2.*
+> - *Working-set fit — share of files within the editor LOC sweet spot.*
+> - *Violation count over time — emitted even on conforming repos as a trailing signal.*
+>
+> *Report the **delta** between SAMA-on and SAMA-off runs on these
+> metrics — not the compliance rate. Compliance proves the rules were
+> followed; the delta is what proves the rules were* worth *following.*
+
+I'd shipped the compliance half and was about to invent step 1 from
+scratch. Step 1 was sitting in plain sight: this repo emitted none of
+these metrics, so the first job was to emit them.
+
+§6 reinforces this. A new profile is admitted as a *"falsifiable
+hypothesis"*, measured against the §5 metrics, and promoted to
+official only if the delta holds across multiple repos. The metrics
+are the dependent variable.
+Without them, every later experiment is
+running in the dark — we'd know the rules were followed, never
+whether they helped.
+
+The `/goal` rewrote itself:
+
+> *"Implement and publish the SAMA v2 §5 core metrics emitter for
+> this repo — the empirical artefact §6 requires before any later
+> claim can be measured as a delta."*
+
+It came with operational definitions for each metric pinned in the goal
+text itself (not "halt and document"), anti-fudge constraints on the
+defaults, and a requirement to hand-trace one metric so the §0
+*"deterministic program — no LLM judgment"* claim is auditable.
+
+## What gets built and what gets *shared*
+
+The five metrics map onto five small pieces of code, each pure, each
+running over the same `(profile, files)` input the verifier consumes:
+
+- **`graphDepth`** — memoised DFS over the import graph, cycle-safe.
+  For this repo: **7**.
+- **`fanByLayer`** — for each canonical layer, the `{mean, p50, p95, max}`
+  distributions of fan-in (edges arriving) and fan-out (edges
+  leaving). Layer 3 (Entry) shows the expected pattern here:
+  fan-in mean ≈ 1, fan-out max = 22 — the route table imports a lot
+  of handlers, nothing imports the route table back.
+- **`boundaryRatio`** — (parse-boundary call sites in Layer 2) ÷
+  (parse-boundary call sites anywhere). **100.0%** for this repo;
+  the rules say external input is parsed in Layer 2 and the numbers
+  confirm it.
+- **`workingSetFit`** — (source files with `50 ≤ LOC ≤ 500`) ÷ (total
+  source files). **80.0%**. Reported as-is; not tuned.
+- **`violationCounts`** — per-check violation count, reported even
+  when a check passes. All seven are `0` today; that's the
+  trailing-signal shape §5 demands, so when one of them starts
+  drifting upward in a future commit, the signal exists.
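
Editorial sketch (not part of the diff): the `graphDepth` bullet above describes a memoised, cycle-safe DFS over the import graph. The block below is one minimal way such a function could look. The `ImportGraph` shape and the name `graphDepthSketch` are hypothetical, and counting depth in edges rather than nodes is an assumption, since the quoted spec text only says "longest path in the import DAG". It is not the repo's actual implementation.

```typescript
// Hypothetical input shape: module path -> the module paths it imports.
type ImportGraph = Map<string, string[]>;

// Longest import chain, counted in edges (an assumption, see lead-in).
function graphDepthSketch(graph: ImportGraph): number {
  const memo = new Map<string, number>(); // finished nodes -> longest chain below them
  const onStack = new Set<string>();      // nodes on the current DFS path

  const depthFrom = (node: string): number => {
    const cached = memo.get(node);
    if (cached !== undefined) return cached;

    onStack.add(node);
    let best = 0;
    for (const dep of graph.get(node) ?? []) {
      // Skip back edges so an import cycle cannot recurse forever.
      if (onStack.has(dep)) continue;
      best = Math.max(best, 1 + depthFrom(dep));
    }
    onStack.delete(node);

    memo.set(node, best);
    return best;
  };

  let longest = 0;
  for (const node of graph.keys()) {
    longest = Math.max(longest, depthFrom(node));
  }
  return longest;
}
```

On an acyclic graph the memo makes each node's depth computed once. With cycles, skipping back edges guarantees termination, but the value stored for a node inside a cycle can depend on traversal order, so a real emitter would probably want to report cycles separately.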
+
+The interesting structural move is one anti-fudge constraint from
+the goal: *"the boundary call-site detector refactor must preserve
+the existing Modeled-boundary check's verdict bit-for-bit."*
+
+The §4.4 Modeled-boundary verifier check and the §5 `boundaryRatio`
+metric both depend on the same question: *what counts as parsing
+external input?* If those two definitions are allowed to drift, the
+verifier can say "0 violations" while the metric says "30% of
+boundaries are in Layer 1" — contradictory pictures of the same code.
+
+So the detector now lives in one place — `findParseBoundaryCallSites`
+in [`src/a31_sama_v2.ts`](/GIT/syntaxai/tdd.md/blob/main/src/a31_sama_v2.ts)
+(Layer 0, pure). The verifier consumes it. The metric consumes it.
+They share the regex, the comment/string-literal stripping, the file
+iteration. The two cannot diverge — if a future commit changes what
+"parse boundary" means, both the check and the metric move in lockstep
+by construction. That's the kind of architectural lock §0 of the spec
+asks for: *"A conformant verifier is a deterministic program. No LLM
+judgment sits in the enforcement loop."*
+
+The existing 20 verifier tests passed unchanged after the refactor.
+The new 23 metrics tests passed on the first run. 300/300 total.
+
+## The hand-traced metric, because the spec says so
+
+§0 of v2 calls the verifier a *deterministic program*. That claim is
+auditable only if the metric output can be reproduced by hand from
+the same inputs. The goal required one of the five metrics to be
+hand-traced on this repo's actual source, not a synthetic fixture.
+
+I picked `boundaryRatio`. A raw grep of `src/*.ts` (non-test) for
+`JSON.parse(` and `new URL(` returns eleven hits.
+Four of them are
+inside comments or string literals — *"`JSON.parse()` constructors
+must not appear in..."* in the docstring of
+[`c14_request_parse.ts`](/GIT/syntaxai/tdd.md/blob/main/src/c14_request_parse.ts),
+*"parsing as `JSON.parse(` of arbitrary input..."* in the comment
+header of
+[`b32_sama_v2_verify.ts`](/GIT/syntaxai/tdd.md/blob/main/src/b32_sama_v2_verify.ts).
+The detector strips comments and quoted literals first, so those
+four drop out. Seven real call sites remain:
+
+| call site | file's prefix → layer |
+|---|---|
+| `src/c13_database.ts:133` `JSON.parse(row.verdict_json)` | `c13_` → 2 |
+| `src/c13_database.ts:159` `JSON.parse(r.tracked_branches)` | `c13_` → 2 |
+| `src/c13_database.ts:273` `JSON.parse(r.doc_json)` | `c13_` → 2 |
+| `src/c13_database.ts:373` `JSON.parse(r.verdict_json)` | `c13_` → 2 |
+| `src/c14_request_parse.ts:20` `new URL(text)` | `c14_` → 2 |
+| `src/c14_request_parse.ts:28` `JSON.parse(text)` | `c14_` → 2 |
+| `src/c14_client_bundle.ts:72` `new URL(import.meta.url)` | `c14_` → 2 |
+
+`boundaryRatio = 7 / 7 = 1.0 = 100.0%`. The verifier reports
+exactly that. The hand count and the verifier match because both
+consume `findParseBoundaryCallSites` — the same source of truth, no
+LLM judgment in the loop, the §0 claim made operational.
+
+## What this is, and what it deliberately isn't
+
+It is: a baseline snapshot of this repo against the metrics §5 names.
+The numbers are real, derived from the actual source and the actual
+profile, reproducible run-over-run from the same inputs.
+
+It isn't: proof that SAMA v2 is worth following. **A single
+data point is never a delta.** What this baseline buys is the
+ability to *do* the comparison later — a future skeleton run, a
+future agent A/B run, a future external-repo audit, all expressed
+as numbers measurable against today's baseline.
+
+The scope cap matters too.
+The goal explicitly forbade building a
+metrics journal — no per-commit history, no `/reports/metrics-over-time`
+dashboard, no time-series storage. §5 names "violation count over
+time" as a metric and I limited the implementation to "violation
+count, now". Time-series is a separate piece of work; doing it
+inside the same change would have been the kind of scope creep that
+turns a one-day patch into a five-day branch.
+
+The working-set bounds turned out to be the most honest
+embarrassment. *Anti-fudge: if `WORKING_SET_MIN/MAX` produce
+`workingSetFit < 0.5` for this repo, document the number anyway —
+that is the trailing signal §5 wants surfaced.* The repo scores
+**80.0%**. That's better than the floor but worse than I'd have
+guessed. Twenty percent of source files are outside the 50–500 LOC
+sweet spot — mostly Layer 0 type-only files and some Layer 1
+render shards. The reasoning for the bounds (below 50 = too small
+to be a substantive module; above 500 = approaches the Atomic 700
+cap with no headroom) is in the spec page at
+[/sama/v2 §5](/sama/v2#5-operational--core-metrics-definitions),
+written before the numbers were computed. If a future refactor
+folds the type-only files into siblings, the metric should rise.
+The metric isn't there to be flattered — it's there to surface
+that decision.
+
+## What changes tomorrow
+
+Every later empirical claim now has a defined "before". The
+skeleton generator can be measured: does a `sama-init` scaffold
+score higher than the average external repo on these metrics? The
+agent A/B experiment can be measured: does an agent given SAMA v2
+rules produce code with lower `graphDepth` and higher
+`workingSetFit` than one without? An external repo audit can be
+measured: pick three OSS repos, write profiles, get five numbers
+each, compare them to ours.
+
+None of those are publishable yet. But each becomes a delta against
+a number this repo already publishes, on a URL anyone can hit.
+The chain that mattered yesterday — *"here is the rule, here is the
+verifier, here is the codebase passing"* — extended one link
+today: *"here are the metrics that prove following the rules was
+worth the cost."* That link is still empty; today's commit just
+laid the cable.
+
+---
+
+**See for yourself:**
+
+- Live verdict + metrics: (7/7 ✓, graphDepth=7, boundaryRatio=100%, workingSetFit=80%)
+- The §5 operational definitions:
+- The PR that landed the work: [#17](https://github.com/syntaxai/tdd.md/pull/17)
+- Yesterday's post: [I built the SAMA v2 verifier...](/blog/sama-v2-verifier-and-the-rename)
+- Earlier in the series:
+  [c21 Atomic split](/blog/sama-empirical-c21-split) ·
+  [Modeled green](/blog/sama-empirical-modeled-green) ·
+  [Deploy that lies](/blog/deploy-that-lies-cascade)
diff --git a/src/a31_blog.ts b/src/a31_blog.ts
index e6d9427491cd91e4409391a4c71fbb23ee25459c..9fa42f6fd7e8d3e37c408612be7ac1087cd99a9b 100644
--- a/src/a31_blog.ts
+++ b/src/a31_blog.ts
@@ -12,6 +12,12 @@ export interface BlogEntry {
 }

 export const ALL_POSTS: BlogEntry[] = [
+  {
+    slug: "sama-v2-metrics-emitter",
+    title: "Compliance proves the rules were followed. Delta proves they were worth following.",
+    description: "Yesterday the v2 verifier said 7/7 ✓ against this repo and the empirical chain — rule, verifier, codebase passing — closed for §4. Today I went looking for step 1 of empirically proving v2 is worth following, ran through three weak candidates (skeleton, agent experiment, external-repo audit), and Bas pushed back: \"heeft dit enige waarde?\" Then I reread my own §5 + §6. The spec literally says compliance ≠ proof; the empirical artefact is the delta on five core metrics — graph depth, fan distribution per layer, boundary ratio, working-set fit, violation counts over time. We emitted zero of them. Built the §5 metrics emitter as one Layer 1 pure module sharing a parse-boundary detector with the §4.4 verifier check (they cannot diverge by construction).
Real numbers for this repo: graphDepth=7, boundaryRatio=100%, workingSetFit=80%, violationCounts all zero. Hand-traced boundaryRatio against the seven real call sites to match the verifier's number, because §0 says the program is deterministic and that claim is only auditable if a human can reproduce it. This isn't proof v2 works — a baseline is never proof. It's the cable today's PR laid for tomorrow's delta-comparison work.",
+    date: "2026-05-23",
+  },
   {
     slug: "sama-v2-verifier-and-the-rename",
     title: "I built the SAMA v2 verifier. It told me my own repo wasn't v2-compliant. Then I renamed 70 files.",