Compliance proves the rules were followed. Delta proves they were worth following.

Yesterday's post was about getting /sama/v2/verify to report 7/7 ✓ against this repo — seventy files renamed, three small fixes, the empirical chain "here is the rule, here is the verifier, here is the codebase passing" closed.

Today started with a question: now that v2 conforms to itself, what is step 1 of empirically proving v2 is worth following? I had three candidates and spent an hour weighing each on my own.

  1. A skeleton generator: bunx sama-init <name> emits a scaffolded project that passes 7/7 out of the box. Useful for distribution, but it proves nothing about whether agents work better under v2; that's a separate experiment.
  2. A controlled agent experiment — same task, run twice, with and without SAMA v2 in CLAUDE.md, measure the deltas. Strongest claim but days of work to set up cleanly.
  3. A public-repo audit — point the verifier at five popular OSS repos with hand-written profiles, publish what fell out. Cheap and publishable, but almost any external repo will fail v2 — not because v2 catches drift, but because that repo was never designed under any SAMA-like discipline. Uninteresting evidence.

Bas read the list back at me: "heeft dit enige waarde?" — does any of this actually have value?

The honest answer was: limited. A second codebase I write under the rules I know isn't generalisation evidence — it's me passing the same exam twice with the answer key. The real evidence needs either external adoption (someone else's profile, someone else's repo) or the agent-comparison experiment. The three candidates were either infrastructure for that, or evidence that doesn't actually prove what the headline claim would suggest.

So we backed up. What does the spec itself say step 1 looks like?

§5 says exactly what the empirical artefact is

I reread /sama/v2. The verifier I'd shipped honours §1 (the layer law), §2 (profiles), §3 (consistency), and §4 (the seven conformance checks). All of those produce a binary verdict: pass or fail.

§5 is a different shape. Quoting in full:

Every conformant repo emits these, identically, regardless of language or profile. These are the variables for A/B measurement (SAMA on vs off) — and crucially, none of them is a compliance score. They measure properties an agent's task performance should correlate with:

  • Graph depth — longest path in the import DAG.
  • Fan-in / fan-out distribution per layer.
  • Boundary ratio — share of external-input parsing that occurs in Layer 2.
  • Working-set fit — share of files within the editor LOC sweet spot.
  • Violation count over time — emitted even on conforming repos as a trailing signal.

Report the delta between SAMA-on and SAMA-off runs on these metrics — not the compliance rate. Compliance proves the rules were followed; the delta is what proves the rules were worth following.

I'd shipped the compliance half and was about to invent step 1 from scratch. Step 1 was sitting in plain sight: the repo emits none of these metrics today, so make it emit them tomorrow.

§6 reinforces this. A new profile is admitted as a "falsifiable hypothesis", measured against the §5 metrics, and promoted to official only if the delta holds across multiple repos. The metrics are the dependent variable. Without them, every later experiment is running in the dark — we'd know the rules were followed, never whether they helped.

The /goal rewrote itself:

"Implement and publish the SAMA v2 §5 core metrics emitter for this repo — the empirical artefact §6 requires before any later claim can be measured as a delta."

With operational definitions for each metric pinned in the goal text itself (not "halt and document"), anti-fudge constraints on the defaults, and a requirement to hand-trace one metric so the §0 "deterministic program — no LLM judgment" claim is auditable.

What gets built and what gets shared

The five metrics map onto five small pieces of code, each pure, each running over the same (profile, files) input the verifier consumes:

  • graphDepth — memoised DFS over the import graph, cycle-safe. For this repo: 7.
  • fanByLayer — for each canonical layer, the {mean, p50, p95, max} distributions of fan-in (edges arriving) and fan-out (edges leaving). Layer 3 (Entry) shows the expected pattern here: fan-in mean ≈ 1, fan-out max = 22 — the route table imports a lot of handlers, nothing imports the route table back.
  • boundaryRatio — (parse-boundary call sites in Layer 2) ÷ (parse-boundary call sites anywhere). 100.0% for this repo; the rules say external input is parsed in Layer 2 and the numbers confirm it.
  • workingSetFit — (source files with 50 ≤ LOC ≤ 500) ÷ (total source files). 80.0%. Reported as-is; not tuned.
  • violationCounts — per-check violation count, reported even when a check passes. All seven are 0 today; that's the trailing-signal shape §5 demands, so when one of them starts drifting upward in a future commit, the signal exists.
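The two graph-shaped metrics in the list above can be sketched as follows. This is a minimal sketch under stated assumptions, not the repo's implementation: the `ImportGraph` shape, the `summarise` helper, the nearest-rank percentile rule, counting depth in edges, and the toy file names are all hypothetical. Only the metric names `graphDepth` and `fanByLayer` come from the post.

```typescript
// Hypothetical input shape: every source file maps to the files it imports.
type ImportGraph = Map<string, string[]>;

interface Summary { mean: number; p50: number; p95: number; max: number }

// Longest import chain, counted in edges (an assumption about the unit).
// Exact on a DAG; if a cycle sneaks in, the back edge is cut so the walk terminates,
// and the result is then only a lower bound that can depend on traversal order.
function graphDepth(graph: ImportGraph): number {
  const memo = new Map<string, number>();
  const onStack = new Set<string>();
  const depthFrom = (file: string): number => {
    const cached = memo.get(file);
    if (cached !== undefined) return cached;
    if (onStack.has(file)) return 0; // back edge: cut the cycle instead of recursing
    onStack.add(file);
    let best = 0;
    for (const dep of graph.get(file) ?? []) best = Math.max(best, 1 + depthFrom(dep));
    onStack.delete(file);
    memo.set(file, best);
    return best;
  };
  let deepest = 0;
  graph.forEach((_deps, file) => { deepest = Math.max(deepest, depthFrom(file)); });
  return deepest;
}

// Nearest-rank percentiles: an assumed definition, the quoted spec text does not pin one.
function summarise(values: number[]): Summary {
  if (values.length === 0) return { mean: 0, p50: 0, p95: 0, max: 0 };
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = (p: number): number => sorted[Math.max(0, Math.ceil(p * sorted.length) - 1)] ?? 0;
  const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  return { mean, p50: rank(0.5), p95: rank(0.95), max: sorted[sorted.length - 1] ?? 0 };
}

// Fan-in = edges arriving at a file, fan-out = edges leaving it, summarised per layer.
function fanByLayer(graph: ImportGraph, layerOf: (file: string) => number) {
  const fanIn = new Map<string, number>();
  graph.forEach((deps) => deps.forEach((dep) => fanIn.set(dep, (fanIn.get(dep) ?? 0) + 1)));

  const buckets = new Map<number, { incoming: number[]; outgoing: number[] }>();
  graph.forEach((deps, file) => {
    const layer = layerOf(file);
    const bucket = buckets.get(layer) ?? { incoming: [], outgoing: [] };
    bucket.incoming.push(fanIn.get(file) ?? 0);
    bucket.outgoing.push(deps.length);
    buckets.set(layer, bucket);
  });

  const result = new Map<number, { fanIn: Summary; fanOut: Summary }>();
  buckets.forEach((b, layer) =>
    result.set(layer, { fanIn: summarise(b.incoming), fanOut: summarise(b.outgoing) }));
  return result;
}

// Toy graph with made-up file names; layer is taken from the first letter (a=0 ... d=3).
const toyGraph: ImportGraph = new Map<string, string[]>([
  ["d_entry.ts", ["c_parse.ts", "b_render.ts"]],
  ["c_parse.ts", ["a_types.ts"]],
  ["b_render.ts", ["a_types.ts"]],
  ["a_types.ts", []],
]);
const toyLayerOf = (file: string): number => file.charCodeAt(0) - 97;

const toyDepth = graphDepth(toyGraph);
const toyFan = fanByLayer(toyGraph, toyLayerOf);
console.log(toyDepth, toyFan.get(3), toyFan.get(0));
```

On the toy graph the entry file has fan-out 2 and fan-in 0 while the type file has the reverse, which is the same shape the post reports for Layer 3 at a much larger scale.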

The interesting structural move is one anti-fudge constraint from the goal: "the boundary call-site detector refactor must preserve the existing Modeled-boundary check's verdict bit-for-bit."

The §4.4 Modeled-boundary verifier check and the §5 boundaryRatio metric both depend on the same question: what counts as parsing external input? If those two definitions are allowed to drift, the verifier can say "0 violations" while the metric says "30% of boundaries are in Layer 1" — contradictory pictures of the same code.

So the detector now lives in one place — findParseBoundaryCallSites in src/a31_sama_v2.ts (Layer 0, pure). The verifier consumes it. The metric consumes it. They share the regex, the comment/string-literal stripping, the file iteration. The two cannot diverge — if a future commit changes what "parse boundary" means, both the check and the metric move in lockstep by construction. That's the kind of architectural lock §0 of the spec asks for: "A conformant verifier is a deterministic program. No LLM judgment sits in the enforcement loop."
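The single-source-of-truth arrangement described above might look roughly like this. It is a sketch under assumptions, not the code in src/a31_sama_v2.ts: the regexes, the `CallSite` shape, the first-letter layer rule (a=0, b=1, c=2), the two consumer functions, and the demo file names are all hypothetical. Only the name `findParseBoundaryCallSites` and the two patterns `JSON.parse(` and `new URL(` come from the post.

```typescript
interface CallSite { file: string; line: number; layer: number }

// Comments and quoted literals, matched in one left-to-right pass so that a quote
// inside a comment (or a // inside a string) is not misread. Simplified: regex
// literals and code inside template-literal ${...} holes are not handled.
const BLANKABLE =
  /\/\*[\s\S]*?\*\/|\/\/[^\n]*|"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*'|`(?:\\[\s\S]|[^`\\])*`/g;
const PARSE_BOUNDARY = /\bJSON\.parse\s*\(|\bnew\s+URL\s*\(/;

// Replace comments and strings with spaces, keeping newlines so line numbers survive.
function blankCommentsAndStrings(source: string): string {
  return source.replace(BLANKABLE, (m) => m.replace(/[^\n]/g, " "));
}

// Assumed convention: the first letter of the file name encodes the layer.
function layerOfFile(file: string): number {
  const base = file.split("/").pop() ?? file;
  return base.charCodeAt(0) - 97;
}

// The one detector. Both consumers below read its output and nothing else.
function findParseBoundaryCallSites(files: Record<string, string>): CallSite[] {
  const sites: CallSite[] = [];
  for (const file of Object.keys(files)) {
    const code = blankCommentsAndStrings(files[file] ?? "");
    const pattern = new RegExp(PARSE_BOUNDARY.source, "g");
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(code)) !== null) {
      const line = code.slice(0, match.index).split("\n").length;
      sites.push({ file, line, layer: layerOfFile(file) });
    }
  }
  return sites;
}

// Consumer 1 (assumed shape of the check): any call site outside Layer 2 is a violation.
function modeledBoundaryViolations(sites: CallSite[]): CallSite[] {
  return sites.filter((s) => s.layer !== 2);
}

// Consumer 2: the metric. Returning 1 for "no call sites at all" is an assumption.
function boundaryRatio(sites: CallSite[]): number {
  if (sites.length === 0) return 1;
  return sites.filter((s) => s.layer === 2).length / sites.length;
}

// Demo input: made-up files, held as strings.
const demoFiles: Record<string, string> = {
  "src/c14_demo_parse.ts": [
    "// JSON.parse( in a comment is ignored",
    'const note = "new URL( in a string is ignored";',
    "const body = JSON.parse(text);",
  ].join("\n"),
  "src/b32_demo_render.ts": "const u = new URL(href);",
};

const demoSites = findParseBoundaryCallSites(demoFiles);
console.log(demoSites, modeledBoundaryViolations(demoSites).length, boundaryRatio(demoSites));
```

Because both consumers take the same `CallSite[]`, a change to what counts as a parse boundary lands in the check and the metric at once; neither has its own copy of the regex to forget.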

The existing 20 verifier tests passed unchanged after the refactor. The new 23 metrics tests passed on the first run. 300/300 total.

The hand-traced metric, because the spec says so

§0 of v2 calls the verifier a deterministic program. That claim is auditable only if the metric output can be reproduced by hand from the same inputs. The goal required one of the five metrics to be hand-traced on this repo's actual source, not a synthetic fixture.

I picked boundaryRatio. A raw grep of src/*.ts (non-test) for JSON.parse( and new URL( returns eleven hits. Four of them are inside comments or string literals — "JSON.parse() constructors must not appear in..." in the docstring of c14_request_parse.ts, "parsing as JSON.parse( of arbitrary input..." in the comment header of b32_sama_v2_verify.ts. The detector strips comments and quoted literals first, so those four drop out. Seven real call sites remain:

| call site | file's prefix → layer |
| --- | --- |
| src/c13_database.ts:133 JSON.parse(row.verdict_json) | c13_ → 2 |
| src/c13_database.ts:159 JSON.parse(r.tracked_branches) | c13_ → 2 |
| src/c13_database.ts:273 JSON.parse(r.doc_json) | c13_ → 2 |
| src/c13_database.ts:373 JSON.parse(r.verdict_json) | c13_ → 2 |
| src/c14_request_parse.ts:20 new URL(text) | c14_ → 2 |
| src/c14_request_parse.ts:28 JSON.parse(text) | c14_ → 2 |
| src/c14_client_bundle.ts:72 new URL(import.meta.url) | c14_ → 2 |

boundaryRatio = 7 / 7 = 1.0 = 100.0%. The verifier reports exactly that. The hand count reproduces the emitted number, and the check and the metric agree with each other because both consume findParseBoundaryCallSites: one source of truth, no LLM judgment in the loop, the §0 claim made operational.

What this is, and what it deliberately isn't

It is: a baseline snapshot of this repo against the metrics §5 names. The numbers are real, derived from the actual source and the actual profile, reproducible run-over-run from the same inputs.

It isn't: proof that SAMA v2 is worth following. A single data point is never a delta. What this baseline buys is the ability to do the comparison later — a future skeleton run, a future agent A/B run, a future external-repo audit, all expressed as numbers measurable against today's baseline.

The scope cap matters too. The goal explicitly forbade building a metrics journal — no per-commit history, no /reports/metrics-over-time dashboard, no time-series storage. §5 names "violation count over time" as a metric and I limited the implementation to "violation count, now". Time-series is a separate piece of work; doing it inside the same change would have been the kind of scope creep that turns a one-day patch into a five-day branch.

The working-set bounds turned out to be the most honest embarrassment. The anti-fudge constraint in the goal: if WORKING_SET_MIN/MAX produce workingSetFit < 0.5 for this repo, document the number anyway, because that is the trailing signal §5 wants surfaced. The repo scores 80.0%. That's better than the floor but worse than I'd have guessed. Twenty percent of source files are outside the 50–500 LOC sweet spot, mostly Layer 0 type-only files and some Layer 1 render shards. The reasoning for the bounds (below 50 = too small to be a substantive module; above 500 = approaches the Atomic 700 cap with no headroom) is in the spec page at /sama/v2 §5, written before the numbers were computed. If a future refactor folds the type-only files into siblings, the metric should rise. The metric isn't there to be flattered; it's there to surface that decision.
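For concreteness, the working-set calculation can be sketched in a few lines. The constant names and the 50 and 500 bounds come from the post; everything else is an assumption, in particular that LOC means non-blank lines and that an empty file list scores 0. The sample numbers are invented and are not this repo's file sizes.

```typescript
// Bounds as stated in the post; fixed up front rather than tuned to the result.
const WORKING_SET_MIN = 50;
const WORKING_SET_MAX = 500;

// Assumed LOC definition: non-blank lines. The real emitter may count differently.
function countLoc(source: string): number {
  return source.split("\n").filter((line) => line.trim().length > 0).length;
}

// Share of files whose LOC falls inside the inclusive [MIN, MAX] window.
function workingSetFit(locPerFile: number[]): number {
  if (locPerFile.length === 0) return 0; // assumption: nothing to measure scores 0
  const inRange = locPerFile.filter(
    (loc) => loc >= WORKING_SET_MIN && loc <= WORKING_SET_MAX,
  ).length;
  return inRange / locPerFile.length;
}

// Invented sizes: one tiny type-only file, one oversized file, three inside the window.
const sampleLoc = [12, 50, 120, 500, 640];
const sampleFit = workingSetFit(sampleLoc);
console.log(sampleFit); // 3 of 5 files are in range
```

Merging the 12-line file into a neighbour removes one out-of-range entry from the list, which is the mechanism by which the score would rise after such a refactor.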

What changes tomorrow

Every later empirical claim now has a defined "before". The skeleton generator can be measured: does a sama-init scaffold score higher than the average external repo on these metrics? The agent A/B experiment can be measured: does an agent given SAMA v2 rules produce code with lower graphDepth and higher workingSetFit than one without? An external repo audit can be measured: pick three OSS repos, write profiles, get five numbers each, compare them to ours.
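A delta against the baseline is then plain subtraction per metric. The sketch below assumes a flattened snapshot shape (`ScalarMetrics`, with the seven per-check counts summed into one `totalViolations` field and the fan distributions left out); that shape and the comparison values are hypothetical. The baseline numbers are the ones reported in this post.

```typescript
// Hypothetical flattened snapshot; the real emitter output has more structure.
interface ScalarMetrics {
  graphDepth: number;
  boundaryRatio: number;
  workingSetFit: number;
  totalViolations: number;
}

// Today's published baseline for this repo, as reported above.
const baselineMetrics: ScalarMetrics = {
  graphDepth: 7,
  boundaryRatio: 1.0,
  workingSetFit: 0.8,
  totalViolations: 0,
};

// Delta = other run minus baseline. Which sign counts as "better" is deliberately
// not encoded here; that is a claim for the experiment to make, not the arithmetic.
function metricDelta(baseline: ScalarMetrics, other: ScalarMetrics): ScalarMetrics {
  return {
    graphDepth: other.graphDepth - baseline.graphDepth,
    boundaryRatio: other.boundaryRatio - baseline.boundaryRatio,
    workingSetFit: other.workingSetFit - baseline.workingSetFit,
    totalViolations: other.totalViolations - baseline.totalViolations,
  };
}

// Invented numbers standing in for some future comparison run.
const comparisonRun: ScalarMetrics = {
  graphDepth: 9,
  boundaryRatio: 0.75,
  workingSetFit: 0.5,
  totalViolations: 3,
};

const runDelta = metricDelta(baselineMetrics, comparisonRun);
console.log(runDelta);
```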

None of those are publishable yet. But each becomes a delta against a number this repo already publishes, on a URL anyone can hit. The chain that mattered yesterday — "here is the rule, here is the verifier, here is the codebase passing" — extended one link today: "here are the metrics that prove following the rules was worth the cost." That link is still empty; today's commit just laid the cable.


See for yourself: