SAMA

Architecture as code, for code AI writes.

SAMA is to agent-written code what Conventional Commits is to git history: a small, named, verifiable standard your CI enforces so your AI coding agents stop drifting.

Four pillars. One verifier. Zero ambiguity for your agent.

This site is the live dogfood

The formal specification — frozen core + profile mechanism, written so a deterministic verifier in any language can ingest it — lives at /sama/v2 (v2.0 draft). The legacy practitioner-facing v1 pages live at /sama.

The verifier at /sama/v2/verify runs the seven §4 conformance checks against this very repository's source on every deploy. Right now it reports 7 of 7 ✓ conforming · 91 files examined. The TypeScript code that implements the verifier is checked by the verifier. The website is the spec is the verifier is the test suite.

The empirical claim the spec actually makes is not the compliance score. Quoting §5 verbatim:

Compliance proves the rules were followed; the delta is what proves the rules were worth following.

The five §5 core metrics — graphDepth · fanByLayer · boundaryRatio · workingSetFit · violationCounts — are emitted alongside the verdict (live, scroll to "Core metrics") so any later claim about SAMA's value can be measured as a delta against today's baseline rather than against itself.

Run the verifier on your own code

The same §4 checks, as a one-line install (bash plus standard Unix tools, no language runtime):

curl -fsSL https://tdd.md/install | bash

The installer drops sama into ~/.sama-cli/ and symlinks it as ~/.local/bin/sama. Then, in any repo with a sama.profile.toml:

sama check       # all seven §4 checks
sama graph       # graphviz import diagram
sama doctor      # tool availability

This is the second independent oracle for the §4 spec — a shell verifier (bash + find + grep + awk + wc) that reads the spec separately from the TypeScript one at /sama/v2/verify. Both report 7 of 7 ✓ against this repo; cross-verifier agreement is what makes the verdict empirical rather than just deterministic. Source: tools/sama-cli · context: the verifier-second-opinion drama post.
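To make "cross-verifier agreement" concrete, here is a minimal sketch of comparing two verdicts check by check. The `Verdict` shape and the check ids are hypothetical, chosen for illustration; neither verifier's real output format is described on this page.

```typescript
// Hypothetical verdict shape: check id -> whether that check passed.
type Verdict = Record<string, boolean>;

// Returns the ids on which the two verifiers disagree. A check that is
// missing from one verdict counts as a disagreement.
function disagreements(a: Verdict, b: Verdict): string[] {
  const ids = Object.keys(a).concat(Object.keys(b));
  const unique = ids.filter((id, index) => ids.indexOf(id) === index);
  return unique.filter((id) => a[id] !== b[id]).sort();
}

// Illustrative data only, not real verifier output.
const typescriptVerdict: Verdict = { "check-1": true, "check-2": true, "check-3": true };
const shellVerdict: Verdict = { "check-1": true, "check-2": false };

const differing = disagreements(typescriptVerdict, shellVerdict);
console.log(differing);
```

An empty result means the two implementations read the spec the same way for this repository; any entry is a place where either the spec or one of the verifiers needs attention.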

The four pillars

S — Sorted. Lexicographic file order equals import direction. The dependency graph is the file tree.
A — Architecture. Every file's prefix maps to one layer with explicit allowed/forbidden contents. No rogue files.
M — Modeled. Every behavior file has a sibling test. Every external input is parsed at the boundary, never cast.
A — Atomic. Files cap at ~700 lines. Split per domain, never via barrel re-exports.
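
As an illustration of the "parsed at the boundary, never cast" half of Modeled, the sketch below contrasts a cast with a runtime parse. The `RunResult` shape and the function names are hypothetical and not taken from the spec.

```typescript
// Hypothetical external input shape, for illustration only.
interface RunResult {
  agent: string;
  passed: boolean;
}

// A cast would compile but check nothing at runtime:
//   const result = JSON.parse(raw) as RunResult;

// Type predicate: narrows unknown to a plain object without casting.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Parses raw JSON text into a RunResult, throwing on any shape mismatch.
function parseRunResult(raw: string): RunResult {
  const value: unknown = JSON.parse(raw);
  if (!isRecord(value)) {
    throw new Error("run result must be a JSON object");
  }
  const { agent, passed } = value;
  if (typeof agent !== "string" || agent.length === 0) {
    throw new Error("run result needs a non-empty string 'agent'");
  }
  if (typeof passed !== "boolean") {
    throw new Error("run result needs a boolean 'passed'");
  }
  return { agent, passed };
}

const parsed = parseRunResult('{"agent":"example-agent","passed":true}');
console.log(parsed.agent, parsed.passed);
```

Code past the boundary can then rely on the type, because every value of that type went through the parser.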

SAMA in your agent-coding stack

SAMA composes with the tools you already use. Use AGENTS.md to instruct the agent and SAMA to shape the code; use Factory's scorecard for breadth and SAMA for depth on the architectural pillar; run SWE-bench to grade the agent and SAMA to grade what the agent left behind.

   SWE-bench              ·  grades the agent
   ─────────────────────────────────────────────────────────
   Factory.ai Readiness   ·  scores the repo (8 pillars)
   ─────────────────────────────────────────────────────────
   SAMA              ★    ·  shapes the code (4 pillars, CI-gated)
   ─────────────────────────────────────────────────────────
   AGENTS.md              ·  instructs the agent

★ SAMA is the architecture layer — the only one with a binary CI gate. The others assess, grade, or instruct; SAMA constrains.

Tool	What it does	SAMA's role alongside it
SWE-bench	Scores agents on real GitHub issues	SAMA scores codebases, not agents
AGENTS.md	Tells the agent what to do, in markdown	SAMA constrains what the code can be
Factory.ai Agent Readiness	8-pillar repo maturity scorecard	SAMA enforces four rules with a binary CI gate
Tweag Agentic Handbook	Describes patterns that work	SAMA prescribes — and verifies

Why this matters

LLMs degrade as input context grows. Chroma's Context Rot research shows the effect across all 18 frontier models tested, well within their advertised windows. Aider's repo-map — structural, not semantic — operates at 4–7% context utilization while semantic indexers spend 14%+ on the same task. Multiple practitioner studies converge on a 150–500 LOC sweet spot per file for AI editors.

SAMA bundles those findings into four constraints a CI job can enforce. Sorted makes structural retrieval cheap. Atomic keeps every file inside the agent's working set. Modeled makes every change reviewable by its sibling test. Architecture lets the agent answer "where does this go?" without re-deriving the tree each session.
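
As one example of how mechanical such a constraint can be, the sibling-test half of Modeled reduces to a path lookup. This is a sketch under two assumptions that the spec may not share: every non-test `.ts` file counts as a behavior file, and the sibling of `foo.ts` is `foo.test.ts` in the same directory.

```typescript
// Assumption: test files are recognised by the ".test.ts" suffix.
function isTestFile(path: string): boolean {
  return /\.test\.ts$/.test(path);
}

// Returns every non-test .ts file that has no sibling test in the list.
function missingSiblingTests(paths: string[]): string[] {
  return paths
    .filter((path) => /\.ts$/.test(path) && !isTestFile(path))
    .filter((path) => paths.indexOf(path.replace(/\.ts$/, ".test.ts")) === -1);
}

// Illustrative file list only.
const untested = missingSiblingTests([
  "src/orders.ts",
  "src/orders.test.ts",
  "src/refunds.ts",
]);
console.log(untested);
```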

The load-bearing property isn't that LLMs have small context windows — modern models have 200k+ tokens. The load-bearing property is mechanical enforceability: the verifier fails the build when a file crosses the line cap or an import points the wrong way. Discipline that lives only in code review quietly slips under agent pressure; discipline that lives in a CI gate keeps its shape across an arbitrary number of agent commits. The context-window research above explains the why; the verifier explains the how.
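
A minimal sketch of what such a gate could look like, covering only the two failures named above. The `FileFacts` shape is hypothetical, the cap is taken as exactly 700 lines, and "wrong way" is assumed to mean importing a path that does not sort strictly before the importing file; the spec may define any of these differently.

```typescript
// Hypothetical per-file facts a verifier might collect before checking.
interface FileFacts {
  path: string;      // repo-relative path
  lineCount: number;
  imports: string[]; // repo-relative paths this file imports
}

// Assumption: the "~700 lines" cap is treated as a hard limit of 700.
const LINE_CAP = 700;

function findViolations(files: FileFacts[]): string[] {
  const violations: string[] = [];
  for (const file of files) {
    if (file.lineCount > LINE_CAP) {
      violations.push(`${file.path}: ${file.lineCount} lines exceeds cap of ${LINE_CAP}`);
    }
    for (const target of file.imports) {
      // Assumption: a file may only import paths that sort before its own.
      if (target >= file.path) {
        violations.push(`${file.path}: import of ${target} points the wrong way`);
      }
    }
  }
  return violations;
}

// Binary gate: any violation fails the build.
function exitCode(violations: string[]): number {
  return violations.length === 0 ? 0 : 1;
}

// Illustrative data only.
const violations = findViolations([
  { path: "a-types/order.ts", lineCount: 40, imports: [] },
  { path: "b-parse/order-input.ts", lineCount: 120, imports: ["a-types/order.ts"] },
  { path: "c-service/orders.ts", lineCount: 812, imports: ["a-types/order.ts"] },
  { path: "a-types/money.ts", lineCount: 30, imports: ["c-service/orders.ts"] },
]);
console.log(violations.join("\n"));
console.log("exit code:", exitCode(violations));
```

The point of the sketch is the last function: there is no score to argue about, only a zero or non-zero exit status that a CI job can act on.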

Datapoints on the same axes

Empirical baseline so far. The §4 score for this site is computed live; the §4 scores for the other repos are hand-estimated. The workingSetFit column is now measured for the SAMA dogfood (this site) and seven non-SAMA mature compiled-language CLI tools by the polyglot §5 emitter at scripts/measure-working-set.ts — see the seven-datapoint baseline post for the full table, distribution, and hand-trace.

project	language	§4 score	workingSetFit	boundaryRatio	graphDepth
tdd.md (this site, SAMA-disciplined)	TypeScript	7 / 7 ✓ (measured)	80.00% (measured)	100% (measured)	7 (measured)
cli/cli (gh)	Go	n/a (not audited)	73.59% (measured, @e53ff321)	—	—
sharkdp/fd	Rust	n/a (not audited)	69.57% (measured, @42b2ab8a)	—	—
jesseduffield/lazygit	Go	n/a (not audited)	67.38% (measured, @608c90ae)	—	—
eza-community/eza	Rust	n/a (not audited)	61.76% (measured, @eed27ed0)	—	—
BurntSushi/ripgrep	Rust	~3-5 / 7 (estimated, depends on v2.1 dialect uptake)	54.00% (measured, @4519153e)	~95% (estimated)	~5 (estimated)
wagoodman/dive	Go	~5 / 7 (estimated)	52.17% (measured, @d6c69194)	~85% (estimated)	~5 (estimated)
sharkdp/bat	Rust	n/a (not audited)	46.27% (measured, @f3d07734)	—	—
Open Graph plugin	PHP / WordPress	0 / 7 (estimated)	~47% (estimated)	<10% (estimated)	~3 (estimated)

The cross-repo signal that emerged: across the seven non-SAMA mature CLI tools, workingSetFit ranges from 46.27% (bat) to 73.59% (cli/cli), a 27-point spread with mean 60.68% and sample standard deviation 10.13pp. Five of the seven cluster inside [52%, 70%]. The original dive/ripgrep 2-point convergence at n=2 was coincidence; the actual distribution is wider, but the clustering is real. tdd.md (the SAMA-disciplined dogfood) measures 80.00%, 6.4 percentage points above the top of the non-SAMA baseline. That is suggestive, but n=1 against n=7 is far from evidence that SAMA is worth following. §6 of the spec is explicit that promotion requires cross-repo deltas across multiple SAMA-disciplined repos; only one exists today. What this nine-row table does establish: the empirical chain now holds eight workingSetFit values measured against the same bounds the spec defines, which is the prerequisite §6 was always asking for.
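
The summary statistics above can be recomputed from the measured workingSetFit column of the table. The sketch below uses only the seven non-SAMA values and the tdd.md value as printed; the variable and function names are illustrative.

```typescript
// Measured workingSetFit (%) for the seven non-SAMA tools, from the table.
const nonSamaFit = [73.59, 69.57, 67.38, 61.76, 54.0, 52.17, 46.27];
// Measured workingSetFit (%) for tdd.md, the SAMA-disciplined dogfood.
const samaFit = 80.0;

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Sample standard deviation (divides by n - 1).
function sampleStdDev(values: number[]): number {
  const m = mean(values);
  const sumSquares = values.reduce((sum, v) => sum + (v - m) * (v - m), 0);
  return Math.sqrt(sumSquares / (values.length - 1));
}

const top = Math.max(...nonSamaFit);
const bottom = Math.min(...nonSamaFit);
const spread = top - bottom;                 // width of the non-SAMA range
const baselineMean = mean(nonSamaFit);
const baselineStdDev = sampleStdDev(nonSamaFit);
const inCluster = nonSamaFit.filter((v) => v >= 52 && v <= 70).length;
const deltaAboveTop = samaFit - top;         // tdd.md vs best non-SAMA value

console.log({ spread, baselineMean, baselineStdDev, inCluster, deltaAboveTop });
```

Computed from the two-decimal table values, the results should land close to the figures quoted above; small last-digit differences can come from rounding of the inputs.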

See it in practice

Pick a kata: small codebases that get scored against SAMA, with public verdicts per agent run.
Leaderboard: current standings across registered agents.
Blog: what the runs revealed about Claude Code, Cursor, and Aider, plus the audit-and-rebuild series on a WordPress plugin and a Go project.

Agent-specific walkthroughs: Claude Code · Cursor · Aider.