Was the dive/ripgrep convergence real? Seven measured workingSetFit datapoints
The dive audit and ripgrep audit closed with a quietly interesting finding: when I ported the §5 workingSetFit metric to Go and Rust and ran it against both repos, they landed within two percentage points of each other — dive at 52.17% (@d6c69194) and ripgrep at 54.00% (@4519153e). I noted in the home page table that "workingSetFit in the 50–55% range may be characteristic of mature compiled-language CLI tools — a hypothesis that needs more datapoints to confirm."
This post tests that hypothesis. n=2 → n=7, same tool, same bounds, same exclusion rules. Pinned SHAs throughout. The headline:
The convergence was an n=2 coincidence. The actual baseline distribution among seven mature compiled-language CLI tools spans 27 percentage points — from 46.27% (bat) to 73.59% (cli/gh) — with mean 60.68% and sample stddev 10.13pp.
But the convergence wasn't entirely an artefact: five of the seven projects fall inside the band [52%, 70%] (an 18-point window, not 2), and that clustering does suggest something real about how mature CLI codebases distribute their file sizes. The story is just more textured than n=2 implied.
The corpus
Five new repos cloned and measured, joining dive and ripgrep:
| project | language | role | stars (approx) | clone command |
|---|---|---|---|---|
| sharkdp/bat | Rust | syntax-highlighted cat | ~50k | git clone --depth=1 https://github.com/sharkdp/bat.git |
| sharkdp/fd | Rust | user-friendly find | ~37k | git clone --depth=1 https://github.com/sharkdp/fd.git |
| eza-community/eza | Rust | modern ls (fork of exa) | ~12k | git clone --depth=1 https://github.com/eza-community/eza.git |
| jesseduffield/lazygit | Go | terminal UI for git | ~60k | git clone --depth=1 https://github.com/jesseduffield/lazygit.git |
| cli/cli | Go | GitHub's official gh CLI | ~37k | git clone --depth=1 https://github.com/cli/cli.git |
Corpus criteria: each project is a CLI tool, widely used (10k+ stars), mature (5+ year codebase), and primarily written in its target language. dive and ripgrep from the prior audits round out a 4-Rust / 3-Go split.
Methodology
The polyglot §5 emitter at scripts/measure-working-set.ts was used unchanged. The bounds [50, 500] LOC inclusive are imported from WORKING_SET_MIN_LOC and WORKING_SET_MAX_LOC in src/a31_sama_v2.ts — the same constants the /sama/v2/verify page uses against this site's own source. Single source of truth: the cross-repo numbers are computed against the exact band the spec defines.
LOC for each file = content.split("\n").length, matching the TS reference implementation byte-for-byte. Test-file exclusion rule: Go excludes *_test.go (mirroring TS's *.test.ts exclusion); Rust includes all .rs files because Rust's convention is inline #[cfg(test)] mod tests — formalised at /sama/v2 §6.2 inline-tests dialect. Skipped directories: .git/, target/, vendor/, node_modules/, all dotdirs.
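The counting and exclusion rules above can be sketched as a few pure functions. This is a minimal illustration, not the code in scripts/measure-working-set.ts: the function names (countLoc, inBand, isCounted, workingSetFit) and the Lang type are hypothetical, paths are assumed to be repo-relative with "/" separators, and the bounds are inlined here rather than imported from src/a31_sama_v2.ts.

```typescript
// Hypothetical sketch of the rules described above; names are illustrative only.
const WORKING_SET_MIN_LOC = 50; // inlined here; the post imports these constants
const WORKING_SET_MAX_LOC = 500;

type Lang = "go" | "rust";

// LOC as defined in the post: the number of "\n"-separated segments.
// A file that ends with a trailing newline therefore counts one extra (empty) segment.
function countLoc(content: string): number {
  return content.split("\n").length;
}

// Inclusive band check against [50, 500].
function inBand(loc: number): boolean {
  return loc >= WORKING_SET_MIN_LOC && loc <= WORKING_SET_MAX_LOC;
}

const SKIPPED_DIRS = ["target", "vendor", "node_modules"];

// Decide whether a repo-relative path takes part in the measurement.
function isCounted(path: string, lang: Lang): boolean {
  const parts = path.split("/");
  const file = parts[parts.length - 1];
  const dirs = parts.slice(0, -1);
  // Skip the named directories and every dot-directory (.git, .github, ...).
  const skipped = dirs.some(
    (d) => SKIPPED_DIRS.indexOf(d) !== -1 || (d !== "." && d.startsWith(".")),
  );
  if (skipped) return false;
  if (lang === "go") return file.endsWith(".go") && !file.endsWith("_test.go");
  return file.endsWith(".rs"); // Rust: tests are inline, so every .rs file counts
}

// Ratio of in-band files to counted files.
function workingSetFit(files: Array<{ path: string; content: string }>, lang: Lang) {
  const counted = files.filter((f) => isCounted(f.path, lang));
  const included = counted.filter((f) => inBand(countLoc(f.content))).length;
  const total = counted.length;
  return { total, included, ratio: total === 0 ? 0 : included / total };
}
```

Note the trailing-newline behaviour: a 49-line file that ends with a newline splits into 50 segments and so lands in band, which is the same count the hand-trace below gets from "newlines + 1".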
Hand-trace — bat (the lowest measurement)
Per /sama/v2 §0 the verifier is a deterministic program; that claim is only auditable if a human can reproduce the number from the data. So:
```bash
cd /tmp/bat # at SHA f3d07734
find . -name '*.rs' -type f \
  -not -path '*/.git/*' -not -path '*/target/*' \
  -not -path '*/vendor/*' -not -path '*/node_modules/*' \
  | wc -l
# 67 total .rs files

# For each file, count newlines, add 1, check [50, 500] inclusive:
in_band=0
while read -r f; do
  newlines=$(tr -cd '\n' < "$f" | wc -c)
  lines=$((newlines + 1))
  if [ "$lines" -ge 50 ] && [ "$lines" -le 500 ]; then
    in_band=$((in_band + 1))
  fi
done < <(find . -name '*.rs' -type f \
  -not -path '*/.git/*' -not -path '*/target/*' \
  -not -path '*/vendor/*' -not -path '*/node_modules/*')
echo "in band: $in_band"
# 31
echo "ratio: $(echo "scale=4; $in_band / 67" | bc)"
# .4626
```
The polyglot emitter produces the same numbers: 67 total, 31 included, ratio 0.4627. The last digit differs from the hand-trace only because bc truncates at scale=4 while the emitter's figure is rounded (31/67 = 0.46268...). 46.27% measured. Auditable per §0.
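The .4626 versus 0.4627 gap comes down to how the quotient is cut to four decimals. A small sketch of the two behaviours follows; the helper names are hypothetical, and round-to-nearest is an assumption about how the emitter's displayed figure is derived, since its rounding code is not shown in this post.

```typescript
const included = 31;
const total = 67;
const ratio = included / total; // 0.46268656716...

// Truncate toward zero at `digits` decimals, which is what bc does with scale=4.
function truncateTo(x: number, digits: number): number {
  const f = Math.pow(10, digits);
  return Math.trunc(x * f) / f;
}

// Round to nearest at `digits` decimals.
function roundTo(x: number, digits: number): number {
  const f = Math.pow(10, digits);
  return Math.round(x * f) / f;
}

console.log(truncateTo(ratio, 4)); // 0.4626
console.log(roundTo(ratio, 4)); // 0.4627
console.log(roundTo(ratio * 100, 2)); // 46.27
```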
The seven datapoints
Sorted by workingSetFit descending:
| rank | project | language | SHA | total | included | ratio | % |
|---|---|---|---|---|---|---|---|
| 1 | cli/cli (gh) | Go | e53ff321 | 515 | 379 | 0.7359 | 73.59% |
| 2 | sharkdp/fd | Rust | 42b2ab8a | 23 | 16 | 0.6957 | 69.57% |
| 3 | jesseduffield/lazygit | Go | 608c90ae | 883 | 595 | 0.6738 | 67.38% |
| 4 | eza-community/eza | Rust | eed27ed0 | 68 | 42 | 0.6176 | 61.76% |
| 5 | BurntSushi/ripgrep | Rust | 4519153e | 100 | 54 | 0.5400 | 54.00% |
| 6 | wagoodman/dive | Go | d6c69194 | 92 | 48 | 0.5217 | 52.17% |
| 7 | sharkdp/bat | Rust | f3d07734 | 67 | 31 | 0.4627 | 46.27% |
For reference (not included in the cross-repo baseline because it's the SAMA-disciplined dogfood, not a non-SAMA mature CLI tool): tdd.md (this site, TypeScript) measures 80.00% at the live /sama/v2/verify endpoint.
Distribution
```
          45   50   55   60   65   70   75   80
          |----|----|----|----|----|----|----|
bat        * 46.27
dive             * 52.17
ripgrep            * 54.00
eza                        * 61.76
lazygit                         * 67.38
fd                                 * 69.57
cli/gh                                 * 73.59
tdd.md                                       * 80.00 (SAMA)
(bat through cli/gh: mature CLI baseline; tdd.md shown for reference only)
```
- Range: 46.27% – 73.59% (spread 27.32 percentage points)
- Mean: 60.68%
- Median: 61.76% (eza)
- Sample stddev: 10.13 pp
- Inter-quartile range (sort positions 2 and 6): 52.17% – 69.57% (spread 17.40 pp)
Five of seven projects fall in [52%, 70%] — a real clustering, though wider than the dive/ripgrep coincidence suggested.
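The summary figures can be re-derived from the seven percentages in the table. The sketch below is a plain re-computation under the usual definitions (sample standard deviation with an n - 1 divisor); the helper names are illustrative and not part of the emitter.

```typescript
// workingSetFit percentages copied from the seven-row table.
const baseline: Array<[string, number]> = [
  ["cli/gh", 73.59], ["fd", 69.57], ["lazygit", 67.38], ["eza", 61.76],
  ["ripgrep", 54.0], ["dive", 52.17], ["bat", 46.27],
];

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function median(xs: number[]): number {
  const s = xs.slice().sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Sample standard deviation (divides by n - 1).
function sampleStddev(xs: number[]): number {
  const m = mean(xs);
  const ss = xs.reduce((acc, x) => acc + (x - m) * (x - m), 0);
  return Math.sqrt(ss / (xs.length - 1));
}

const values = baseline.map((row) => row[1]);
const sorted = values.slice().sort((a, b) => a - b);

console.log("mean  ", mean(values).toFixed(2)); // 60.68
console.log("median", median(values).toFixed(2)); // 61.76
console.log("stddev", sampleStddev(values).toFixed(1)); // 10.1
console.log("spread", (sorted[6] - sorted[0]).toFixed(2)); // 27.32
// Sort positions 2 and 6 (1-indexed), as used for the inner range in the text.
console.log("inner ", sorted[1], sorted[5]); // 52.17 69.57
const inWindow = values.filter((v) => v >= 52 && v <= 70).length;
console.log("in [52, 70]:", inWindow); // 5
```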
Go vs Rust subset
| subset | n | mean | median | range |
|---|---|---|---|---|
| Go (cli, lazygit, dive) | 3 | 64.38% | 67.38% | 52.17–73.59 (21.42 pp) |
| Rust (fd, eza, ripgrep, bat) | 4 | 57.90% | 57.88% | 46.27–69.57 (23.30 pp) |
Go averages about 6.5 percentage points higher than Rust (64.38% vs 57.90%) at n=3 vs n=4. Sample sizes are small; the gap may not survive a larger corpus. But: nothing in either subset cleanly clusters; both span ~20+ points. The hypothesis that "Go projects are tighter than Rust projects on this axis" is consistent with the data but not evidenced by it.
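The per-language rows can be checked the same way. A short sketch, with the Row type and summarize helper as illustrative names, groups the seven measurements by language and recomputes n, mean, median and spread:

```typescript
type Row = { project: string; lang: "Go" | "Rust"; pct: number };

// Percentages copied from the seven-row table.
const rows: Row[] = [
  { project: "cli/cli", lang: "Go", pct: 73.59 },
  { project: "lazygit", lang: "Go", pct: 67.38 },
  { project: "dive", lang: "Go", pct: 52.17 },
  { project: "fd", lang: "Rust", pct: 69.57 },
  { project: "eza", lang: "Rust", pct: 61.76 },
  { project: "ripgrep", lang: "Rust", pct: 54.0 },
  { project: "bat", lang: "Rust", pct: 46.27 },
];

function summarize(xs: number[]) {
  const s = xs.slice().sort((a, b) => a - b);
  const n = s.length;
  const mean = s.reduce((a, b) => a + b, 0) / n;
  const mid = Math.floor(n / 2);
  // Even-sized samples take the average of the two middle values.
  const median = n % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
  return { n, mean, median, spread: s[n - 1] - s[0] };
}

function subset(lang: "Go" | "Rust") {
  return summarize(rows.filter((x) => x.lang === lang).map((x) => x.pct));
}

const go = subset("Go");
const rust = subset("Rust");
console.log("Go  ", go.n, go.mean.toFixed(2), go.median.toFixed(2), go.spread.toFixed(2));
// Go   3 64.38 67.38 21.42
console.log("Rust", rust.n, rust.mean.toFixed(2), rust.median.toFixed(2), rust.spread.toFixed(2));
// Rust 4 57.90 57.88 23.30
console.log("gap ", (go.mean - rust.mean).toFixed(2)); // 6.48
```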
Per-project notes
A 1-2 sentence read on what each project's distribution implies. The polyglot emitter's --verbose flag emits the per-file LOC breakdown if you want to follow up.
- cli/cli at 73.59%: the highest measured score. 515 Go files, of which 379 land in band. Reading the over-band tail reveals it's mostly large command-handler files (pkg/cmd/repo/sync/sync.go and similar): natural behavioural cohesion, not god-classes. Likely a real architectural fit signal.
- sharkdp/fd at 69.57%: second highest, and the smallest project in the corpus by file count (23 .rs files). High workingSetFit partly reflects that there are few files to be tiny stubs against. With n=23, the metric is noisier; honest to report.
- jesseduffield/lazygit at 67.38%: the biggest project in the corpus (883 .go files) and still clears 67%. That's the impressive number in the table: even at scale, a Go TUI keeps two-thirds of its files in the substantive-module band.
- eza-community/eza at 61.76%: median of the seven. The audit-style observation: eza inherits its layout from exa (its predecessor) and the file-size distribution looks deliberate; small modules tend to be the leaf-renderers for one column-formatter each, not stubs.
- BurntSushi/ripgrep at 54.00%: the prior audit identified 30 files over 500 LOC. Most are the textbook declarative-exempt cases the §6.3 declarative-exemption dialect was drafted for; the raw metric doesn't distinguish them. The audit goes into more detail.
- wagoodman/dive at 52.17%: the prior audit identified the opposite shape: 0 files over 500 LOC, 44 under 50 LOC. Tiny type-stubs and platform-shims pull the score down, not god-classes.
- sharkdp/bat at 46.27%: the lowest measurement. Reading the distribution: the over-band tail (src/printer.rs at ~2,100 LOC, src/assets.rs, src/config.rs) is sizeable, but the under-50 tail is also substantial. Bat has many small "language definition" modules that pre-build syntax highlighting for the supported languages; these are by-construction declarative shards. Like the ripgrep defs.rs case, the raw metric doesn't distinguish them from "this file is too small."
What this answers and what it doesn't
Answers the convergence question: the dive/ripgrep 2-point landing was an n=2 coincidence. The real distribution spans ~27 percentage points. But there's still a real clustering effect: five of the seven tools land between 52% and 70%, with the median at 61.76%.
Does not yet answer the SAMA-vs-non-SAMA question. That requires a second SAMA-disciplined repo measured against the same axes, and only one exists today (this site, at 80%). One SAMA datapoint above the entire non-SAMA distribution is suggestive — tdd.md's 80% sits 6.4 percentage points above the top of the mature-CLI baseline (cli/gh, 73.59%) — but n=1 vs n=7 is far from a SAMA-worth-following claim. §6 of the spec is explicit that promotion requires cross-repo deltas, not a single dogfood.
What this run does establish:
- The empirical chain is now n=7 measured against the same bounds. Before today, the cross-repo argument was "tdd.md is measured, the audits are hand-estimated." Now the audits and five new baseline datapoints are measured. The estimates are gone from this column of the table.
- The metric is more discriminating than n=2 implied. A 27-point spread is meaningful — workingSetFit does distinguish projects from one another, even within the narrow category of "mature compiled-language CLI tools."
- The §6 falsifiable experiment is now well-conditioned. When a second SAMA repo exists, comparing its workingSetFit against this seven-row baseline is a real test, not a vibes call. The baseline distribution (mean, range, stddev) is what the test compares against.
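One possible way to frame that comparison, purely as an illustration: the spec's §6 procedure is not reproduced in this post, so the z-score and the "above the baseline maximum" check below are assumptions about what such a test could look at, not the spec's criteria. The function name compareToBaseline is hypothetical. It is run here against the one SAMA datapoint that exists today (tdd.md at 80.00%).

```typescript
// Baseline percentages copied from the seven-row table.
const baselinePct = [73.59, 69.57, 67.38, 61.76, 54.0, 52.17, 46.27];

// Hypothetical comparison of one candidate measurement against the baseline.
function compareToBaseline(candidatePct: number, baseline: number[]) {
  const n = baseline.length;
  const mean = baseline.reduce((a, b) => a + b, 0) / n;
  // Sample variance (n - 1 divisor), matching the "sample stddev" in the text.
  const variance =
    baseline.reduce((acc, x) => acc + (x - mean) * (x - mean), 0) / (n - 1);
  const stddev = Math.sqrt(variance);
  const max = Math.max.apply(null, baseline) as number;
  return {
    deltaFromMean: candidatePct - mean,
    zScore: (candidatePct - mean) / stddev,
    aboveMax: candidatePct > max,
    marginOverMax: candidatePct - max,
  };
}

const result = compareToBaseline(80.0, baselinePct);
console.log(result.deltaFromMean.toFixed(2)); // 19.32
console.log(result.zScore.toFixed(2)); // 1.91
console.log(result.aboveMax, result.marginOverMax.toFixed(2)); // true 6.41
```

With seven baseline rows and a single candidate, none of these figures is a significance test; they only make the size of the gap explicit.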
Reproducibility
Anyone with the polyglot emitter and the pinned SHAs can reproduce these numbers exactly. The repo has the tool; the SHAs are in the table above; the bounds live in source as constants. Run:
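```bash
# A --depth=1 clone contains only the current tip, so a pinned SHA that is no
# longer HEAD cannot be checked out from it. Clone with full history instead.
git clone https://github.com/sharkdp/bat.git /tmp/bat
cd /tmp/bat && git checkout f3d077346824eae07fbac4b56466d27049b9616e
bun /path/to/tdd.md/scripts/measure-working-set.ts /tmp/bat --lang rust
# {"total": 67, "included": 31, "ratio": 0.4626865671641791, "ratioPercent": 46.27}
```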
That's the §0 contract: the program is deterministic; the same source tree + same bounds produces the same number; a human can reproduce it from the spec. Seven times over, now.
Companion posts:
- The dive audit — where the dive measurement is hand-traced
- The ripgrep audit — where the ripgrep measurement is hand-traced
- The §5 metrics emitter post — why measurement matters more than estimates
- The v2.1 dialects (§6.1–6.3) — particularly §6.2 inline-tests (load-bearing for the Rust file-counting rule above) and §6.3 declarative-exemption (the policy lens for what the raw metric can't distinguish)