Was the dive/ripgrep convergence real? Seven measured workingSetFit datapoints

The dive audit and ripgrep audit closed with a quietly interesting finding: when I ported the §5 workingSetFit metric to Go and Rust and ran it against both repos, they landed within two percentage points of each other — dive at 52.17% (@d6c69194) and ripgrep at 54.00% (@4519153e). I noted in the home page table that "workingSetFit in the 50–55% range may be characteristic of mature compiled-language CLI tools — a hypothesis that needs more datapoints to confirm."

This post tests that hypothesis. n=2 → n=7, same tool, same bounds, same exclusion rules. Pinned SHAs throughout. The headline:

The convergence was an n=2 coincidence. The actual baseline distribution among seven mature compiled-language CLI tools spans 27 percentage points, from 46.27% (bat) to 73.59% (cli/gh), with mean 60.68% and sample stddev 10.14 pp.

But the convergence wasn't entirely an artefact: five of the seven projects fall inside the band [52%, 70%] (an 18-point window, not 2), and that clustering does suggest something real about how mature CLI codebases distribute their file sizes. The story is just more textured than n=2 implied.

The corpus

Five new repos cloned and measured, joining dive and ripgrep:

| project | language | role | stars (approx) | clone command |
| --- | --- | --- | --- | --- |
| sharkdp/bat | Rust | syntax-highlighted cat | ~50k | git clone --depth=1 https://github.com/sharkdp/bat.git |
| sharkdp/fd | Rust | user-friendly find | ~37k | git clone --depth=1 https://github.com/sharkdp/fd.git |
| eza-community/eza | Rust | modern ls (fork of exa) | ~12k | git clone --depth=1 https://github.com/eza-community/eza.git |
| jesseduffield/lazygit | Go | terminal UI for git | ~60k | git clone --depth=1 https://github.com/jesseduffield/lazygit.git |
| cli/cli | Go | GitHub's official gh CLI | ~37k | git clone --depth=1 https://github.com/cli/cli.git |

Corpus criteria: each project is a CLI tool, widely used (10k+ stars), mature (5+ year codebase), and primarily written in its target language. dive and ripgrep from the prior audits round out a 4-Rust / 3-Go split.

Methodology

The polyglot §5 emitter at scripts/measure-working-set.ts was used unchanged. The bounds [50, 500] LOC inclusive are imported from WORKING_SET_MIN_LOC and WORKING_SET_MAX_LOC in src/a31_sama_v2.ts — the same constants the /sama/v2/verify page uses against this site's own source. Single source of truth: the cross-repo numbers are computed against the exact band the spec defines.

LOC for each file = content.split("\n").length, matching the TS reference implementation's count exactly. Note that under this rule a file ending in a trailing newline counts one line more than wc -l would report. Test-file exclusion rule: Go excludes *_test.go (mirroring TS's *.test.ts exclusion); Rust includes all .rs files because Rust's convention is inline #[cfg(test)] mod tests, formalised at /sama/v2 §6.2 inline-tests dialect. Skipped directories: .git/, target/, vendor/, node_modules/, and all other dotdirs.
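The counting and exclusion rules are small enough to sketch. The Python below is an illustrative restatement, not the emitter itself (which is TypeScript). The function names and the SKIP_DIRS set are hypothetical, the bounds are copied from the text rather than imported from src/a31_sama_v2.ts, and paths are assumed to be relative to the repo root.

```python
from pathlib import PurePosixPath

# Bounds copied from the post; the real emitter imports them as
# WORKING_SET_MIN_LOC / WORKING_SET_MAX_LOC.
MIN_LOC, MAX_LOC = 50, 500

# Hypothetical name: non-dot directories that are skipped.
SKIP_DIRS = {"target", "vendor", "node_modules"}


def count_loc(content: str) -> int:
    """Newline count plus one, like splitting on a newline in JavaScript."""
    return len(content.split("\n"))


def in_band(loc: int) -> bool:
    """True when loc falls inside [50, 500], both ends inclusive."""
    return MIN_LOC <= loc <= MAX_LOC


def is_measured(path: str, lang: str) -> bool:
    """Apply the directory skips and the per-language test-file rule."""
    p = PurePosixPath(path)
    # Skip the named build/vendor dirs and every dot-directory (.git, ...).
    if any(d in SKIP_DIRS or d.startswith(".") for d in p.parts[:-1]):
        return False
    if lang == "go":
        return p.suffix == ".go" and not p.name.endswith("_test.go")
    if lang == "rust":
        return p.suffix == ".rs"  # inline #[cfg(test)] modules stay in
    return False
```

One consequence worth noticing: count_loc("a\nb\n") is 3, not 2, because the trailing newline produces a final empty element. A 49-line file that ends in a newline therefore counts as 50 and lands in band.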

Hand-trace — bat (the lowest measurement)

Per /sama/v2 §0 the verifier is a deterministic program; that claim is only auditable if a human can reproduce the number from the data. So:

cd /tmp/bat   # at SHA f3d07734
find . -name '*.rs' -type f \
  -not -path '*/.git/*' -not -path '*/target/*' \
  -not -path '*/vendor/*' -not -path '*/node_modules/*' \
  | wc -l
# 67   total .rs files

# For each file, count newlines, add 1, check [50, 500] inclusive:
in_band=0
while read -r f; do
  newlines=$(tr -cd '\n' < "$f" | wc -c)
  lines=$((newlines + 1))
  if [ "$lines" -ge 50 ] && [ "$lines" -le 500 ]; then
    in_band=$((in_band + 1))
  fi
done < <(find . -name '*.rs' -type f \
            -not -path '*/.git/*' -not -path '*/target/*' \
            -not -path '*/vendor/*' -not -path '*/node_modules/*')
echo "in band: $in_band"
# 31
echo "ratio: $(echo "scale=4; $in_band / 67" | bc)"
# .4626   (bc truncates at scale=4; 31/67 = 0.46268...)

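The polyglot emitter produces the same numbers: 67 total, 31 included, ratio 0.4627. The hand-trace shows .4626 only because bc truncates at scale=4 instead of rounding; the difference is in the fourth decimal and the underlying fraction (31/67) is identical. 46.27% measured. Auditable per §0.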
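The .4626 versus 0.4627 gap between the hand-trace and the emitter can be reproduced in isolation. This is a minimal sketch using Python's decimal module, where ROUND_DOWN stands in for bc's truncating division at scale=4; it is not part of the emitter.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

ratio = Decimal(31) / Decimal(67)   # 0.46268656...
step = Decimal("0.0001")            # four decimal places

# bc with scale=4 drops the extra digits instead of rounding them.
truncated = ratio.quantize(step, rounding=ROUND_DOWN)
# Conventional rounding, which is what the emitter's 0.4627 reflects.
rounded = ratio.quantize(step, rounding=ROUND_HALF_UP)

print(truncated, rounded)  # 0.4626 0.4627
```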

The seven datapoints

Sorted by workingSetFit descending:

| rank | project | language | SHA | total | included | ratio | % |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | cli/cli (gh) | Go | e53ff321 | 515 | 379 | 0.7359 | 73.59% |
| 2 | sharkdp/fd | Rust | 42b2ab8a | 23 | 16 | 0.6957 | 69.57% |
| 3 | jesseduffield/lazygit | Go | 608c90ae | 883 | 595 | 0.6738 | 67.38% |
| 4 | eza-community/eza | Rust | eed27ed0 | 68 | 42 | 0.6176 | 61.76% |
| 5 | BurntSushi/ripgrep | Rust | 4519153e | 100 | 54 | 0.5400 | 54.00% |
| 6 | wagoodman/dive | Go | d6c69194 | 92 | 48 | 0.5217 | 52.17% |
| 7 | sharkdp/bat | Rust | f3d07734 | 67 | 31 | 0.4627 | 46.27% |

For reference (not included in the cross-repo baseline because it's the SAMA-disciplined dogfood, not a non-SAMA mature CLI tool): tdd.md (this site, TypeScript) measures 80.00% at the live /sama/v2/verify endpoint.

Distribution

   46  50      55      60      65      70      75
   |---|-------|-------|-------|-------|-------|
   bat                                                       46.27
                  dive                                       52.17
                    ripgrep                                  54.00
                              eza                            61.76
                                       lazygit               67.38
                                          fd                 69.57
                                                cli/gh       73.59
                                                            (mature CLI baseline)
                                                            ---80.00---  tdd.md (SAMA)
  • Range: 46.27% – 73.59% (spread 27.32 percentage points)
  • Mean: 60.68%
  • Median: 61.76% (eza)
  • Sample stddev: 10.14 pp
  • Inter-quartile range (sort positions 2 and 6): 52.17% – 69.57% (spread 17.40 pp)

Five of seven projects fall in [52%, 70%] — a real clustering, though wider than the dive/ripgrep coincidence suggested.
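The summary statistics can be recomputed from the raw included/total counts in the seven-row table. The sketch below uses Python's statistics module; it is a cross-check under the assumption that the table's counts are the full input, not the code that produced the published figures.

```python
import statistics

# (included, total) per project, copied from the seven-row table.
COUNTS = {
    "cli/cli": (379, 515),
    "fd": (16, 23),
    "lazygit": (595, 883),
    "eza": (42, 68),
    "ripgrep": (54, 100),
    "dive": (48, 92),
    "bat": (31, 67),
}

pct = {name: 100 * inc / tot for name, (inc, tot) in COUNTS.items()}
values = sorted(pct.values())

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)   # sample stddev (n - 1 denominator)
spread = values[-1] - values[0]
# Sorted positions 2 and 6, the quartile convention used in the text.
q_low, q_high = values[1], values[-2]
clustered = [n for n, v in pct.items() if 52 <= v <= 70]

print(f"mean={mean:.2f} median={median:.2f} stdev={stdev:.2f}")
print(f"spread={spread:.2f} quartiles={q_low:.2f}..{q_high:.2f}")
print(f"in [52, 70]: {len(clustered)} of {len(pct)}")
```

statistics.stdev is the sample form; statistics.pstdev (population form, n denominator) would give a smaller figure and is not what the bullet list reports.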

Go vs Rust subset

| subset | n | mean | median | range |
| --- | --- | --- | --- | --- |
| Go (cli, lazygit, dive) | 3 | 64.38% | 67.38% | 52.17–73.59 (21.42 pp) |
| Rust (fd, eza, ripgrep, bat) | 4 | 57.90% | 57.88% | 46.27–69.57 (23.30 pp) |

Go's mean is about 6.5 percentage points higher than Rust's (64.38% vs 57.90%) at n=3 vs n=4. Sample sizes are small; the gap may not survive a larger corpus. Neither subset clusters cleanly: Go spans 21.42 pp and Rust 23.30 pp, so the spread within each language is more than three times the gap between them. The hypothesis that "Go projects are tighter than Rust projects on this axis" is consistent with the data but not evidenced by it.
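The per-language figures follow from the same counts, grouped by language. A minimal sketch, with the rows copied from the seven-row table and the variable names chosen for illustration:

```python
import statistics
from collections import defaultdict

# (language, included, total), copied from the seven-row table.
ROWS = [
    ("Go", 379, 515), ("Rust", 16, 23), ("Go", 595, 883), ("Rust", 42, 68),
    ("Rust", 54, 100), ("Go", 48, 92), ("Rust", 31, 67),
]

by_lang = defaultdict(list)
for lang, included, total in ROWS:
    by_lang[lang].append(100 * included / total)

summary = {
    lang: {
        "n": len(v),
        "mean": statistics.mean(v),
        "median": statistics.median(v),
        "spread": max(v) - min(v),
    }
    for lang, v in by_lang.items()
}
# Between-language gap in means, to compare against each within-language spread.
gap = summary["Go"]["mean"] - summary["Rust"]["mean"]

for lang, s in summary.items():
    print(lang, {k: round(x, 2) for k, x in s.items()})
print("gap:", round(gap, 2))
```

Comparing gap with each subset's spread is the quickest way to see why the language split is not yet evidence of anything: the within-group variation dominates.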

Per-project notes

A 1-2 sentence read on what each project's distribution implies. The polyglot emitter's --verbose flag emits the per-file LOC breakdown if you want to follow up.

  • cli/cli at 73.59% — the highest measured score. 515 Go files, of which 379 land in band. Reading the over-band tail reveals it's mostly large command-handler files (pkg/cmd/repo/sync/sync.go and similar) — natural behavioural cohesion, not god-classes. Likely a real architectural fit signal.

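  • sharkdp/fd at 69.57%: second highest, and the smallest project in the corpus by file count (23 .rs files). The high workingSetFit partly reflects that a project this small has little room for tiny stub files. With n=23, a single file moving in or out of band shifts the score by about 4.35 points (1/23), so the metric is noisier here than elsewhere in the table; worth flagging.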

  • jesseduffield/lazygit at 67.38% — the biggest project in the corpus (883 .go files) and still clears 67%. That's the impressive number in the table: even at scale, a Go TUI keeps two-thirds of its files in the substantive-module band.

  • eza-community/eza at 61.76% — median of the seven. The audit-style observation: eza inherits its layout from exa (its predecessor) and the file-size distribution looks deliberate — small modules tend to be the leaf-renderers for one column-formatter each, not stubs.

  • BurntSushi/ripgrep at 54.00% — the prior audit identified 30 files over 500 LOC. Most are the textbook declarative-exempt cases the §6.3 declarative-exemption dialect was drafted for; the raw metric doesn't distinguish them. The audit goes into more detail.

  • wagoodman/dive at 52.17% — the prior audit identified the opposite shape: 0 files over 500 LOC, 44 under 50 LOC. Tiny type-stubs and platform-shims pull the score down, not god-classes.

  • sharkdp/bat at 46.27% — the lowest measurement. Reading the distribution: the over-band tail (src/printer.rs at ~2,100 LOC, src/assets.rs, src/config.rs) is sizeable, but the under-50 tail is also substantial. Bat has many small "language definition" modules that pre-build syntax highlighting for the supported languages — by-construction declarative shards. Like the ripgrep defs.rs case, the raw metric doesn't distinguish them from "this file is too small."
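The notes above keep separating two ways a score gets pulled down: an over-band tail (ripgrep) and an under-band tail (dive). Given per-file LOC values, that split is a three-way count. The sketch below is hypothetical: tail_breakdown is not an emitter function, the format of the --verbose output is not specified here, and the sample values are made up.

```python
from collections import Counter

MIN_LOC, MAX_LOC = 50, 500  # bounds from the Methodology section


def tail_breakdown(locs):
    """Count files under the band, inside it, and over it."""
    counts = Counter(
        "under" if n < MIN_LOC else "over" if n > MAX_LOC else "in"
        for n in locs
    )
    return {k: counts.get(k, 0) for k in ("under", "in", "over")}


# Made-up per-file LOC values, for illustration only.
sample = [12, 30, 49, 50, 120, 500, 501, 2100]
print(tail_breakdown(sample))  # {'under': 3, 'in': 3, 'over': 2}
```

For dive, the figures quoted from the prior audit correspond to under=44, in=48, over=0 out of 92 files.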

What this answers and what it doesn't

Answers the convergence question: the dive/ripgrep 2-point landing was n=2 coincidence. The real distribution spans ~27 percentage points. But there's still a real clustering effect: five of the seven tools land between 50% and 70%, with the median at 61.76% and the mean at 60.68%.

Does not yet answer the SAMA-vs-non-SAMA question. That requires a second SAMA-disciplined repo measured against the same axes, and only one exists today (this site, at 80%). One SAMA datapoint above the entire non-SAMA distribution is suggestive — tdd.md's 80% sits 6.4 percentage points above the top of the mature-CLI baseline (cli/gh, 73.59%) — but n=1 vs n=7 is far from a SAMA-worth-following claim. §6 of the spec is explicit that promotion requires cross-repo deltas, not a single dogfood.

What this run does establish:

  1. The empirical chain is now n=7 measured against the same bounds. Before today, the cross-repo argument was "tdd.md is measured, the audits are hand-estimated." Now the audits and five new baseline datapoints are measured. The estimates are gone from this column of the table.
  2. The metric is more discriminating than n=2 implied. A 27-point spread is meaningful — workingSetFit does distinguish projects from one another, even within the narrow category of "mature compiled-language CLI tools."
  3. The §6 falsifiable experiment is now well-conditioned. When a second SAMA repo exists, comparing its workingSetFit against this seven-row baseline is a real test, not a vibes call. The baseline distribution (mean, range, stddev) is what the test compares against.
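As an illustration of what comparing against the baseline could look like, the sketch below places a candidate score relative to the seven-row distribution. The z-score framing is an assumption made for this example; this post does not describe the test §6 actually prescribes, and with n=7 against n=1 the result is a description of position, not a significance test.

```python
import statistics

# Baseline workingSetFit percentages from the seven-row table.
BASELINE = [73.59, 69.57, 67.38, 61.76, 54.00, 52.17, 46.27]


def z_score(candidate: float, baseline: list[float]) -> float:
    """Distance of candidate from the baseline mean, in sample stddevs."""
    return (candidate - statistics.mean(baseline)) / statistics.stdev(baseline)


def above_range(candidate: float, baseline: list[float]) -> float:
    """Points by which candidate exceeds the baseline max (negative if inside)."""
    return candidate - max(baseline)


# tdd.md's own score, used here only as an example candidate.
print(round(z_score(80.0, BASELINE), 2), round(above_range(80.0, BASELINE), 2))
```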

Reproducibility

Anyone with the polyglot emitter and the pinned SHAs can reproduce these numbers exactly. The repo has the tool; the SHAs are in the table above; the bounds live in source as constants. Run:

# Full clone: a --depth=1 clone only contains the pinned SHA while it is still HEAD.
git clone https://github.com/sharkdp/bat.git /tmp/bat
cd /tmp/bat && git checkout f3d077346824eae07fbac4b56466d27049b9616e
bun /path/to/tdd.md/scripts/measure-working-set.ts /tmp/bat --lang rust
# {"total": 67, "included": 31, "ratio": 0.4626865671641791, "ratioPercent": 46.27}

That's the §0 contract: the program is deterministic; the same source tree + same bounds produces the same number; a human can reproduce it from the spec. Seven times over, now.
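A reproduced run can also be checked for internal consistency. The sketch below parses the output line quoted above and confirms that ratio and ratioPercent follow from the two counts. The field names come from that quoted line; is_consistent is a hypothetical helper, not part of the emitter.

```python
import json
import math

# Output line quoted in the post for bat at f3d07734.
record = json.loads(
    '{"total": 67, "included": 31, '
    '"ratio": 0.4626865671641791, "ratioPercent": 46.27}'
)


def is_consistent(rec: dict) -> bool:
    """Check that ratio and ratioPercent follow from included / total."""
    ratio = rec["included"] / rec["total"]
    return (
        math.isclose(rec["ratio"], ratio, rel_tol=1e-12)
        and math.isclose(rec["ratioPercent"], round(ratio * 100, 2), abs_tol=1e-9)
    )


print(is_consistent(record))
```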

