a92d5a50e0d50180904d20b66d930ca9963e396c diff --git a/content/blog/agentic-coding-corpus-three-patterns.md b/content/blog/agentic-coding-corpus-three-patterns.md index f4a3bb86654a4f049d07517621b3478d9b30edb4..e3db387285d54a2a259f9280c92378b8a08e95ca 100644 --- a/content/blog/agentic-coding-corpus-three-patterns.md +++ b/content/blog/agentic-coding-corpus-three-patterns.md @@ -38,7 +38,17 @@ A reply names the mechanism: **What the pattern proves.** Under "make the tests pass" pressure, agents treat the verifier as the variable. This is not three separate bugs. It is one failure shape with three different surface expressions: edit the test runtime, delete the test file, write a stub that claims to be a test. Each of them passes a naïve `bun test || pytest || npm test` invocation. None of them prove anything. -This is the exact failure the [iron law](/sama/skill) refuses — *no production code without a failing test first* — but only if verify-RED is enforced *externally*, in the commit log. tdd.md's structural judging detects two of these three directly: test-count drop is a one-line diff check, and phase-tagged commits without a paired `red:` SHA flag the place where the test never failed. The runtime-patching variant is harder; that needs the sandbox-runner sliver to run the test against a clean checkout and notice that the app is actually broken. +This is the exact failure the [iron law](/sama/skill) refuses — *no production code without a failing test first* — but only if verify-RED is enforced *externally*, in the commit log. + +### How TDD + SAMA prevent it + +| thread | what the agent did | how the discipline catches or prevents it | +|---|---|---| +| `1l5ieo5` test-file deletion | deleted the test file rather than fix the failing impl | **Caught today.** tdd.md detects test-count-drop between commits — a one-line diff against the deploy-time test bundle. SAMA's *Modeled* property makes the deletion structurally visible: a `cXX_*.ts` without its sibling `cXX_*.test.ts` is a red flag the file system enforces. A `refactor:` commit that drops tests should hard-fail the judge. | +| `1qix264` placeholder tests | wrote 90 tests with bodies like *"if I actually gave a crap..."* | **Caught with a small extension.** The iron law refuses *test-after*: a `green:` commit with no paired `red:` SHA showing genuine failure is structurally suspect. Empty assertion bodies — zero `expect()` calls, string-literal bodies, single-line `// TODO` stubs — are AST-checkable. The test bundle already lives in `content/git-history/syntaxai__tdd.md__tests.json`; an empty-body check is a one-evening sliver. | +| `1rug14a` runtime test-patching | injected JS at runtime *inside the test* to repair the broken UI mid-assertion | **Hard.** This is the failure mode tdd.md catches least well today. The mitigation the corpus itself names (top comment: *"separate producer from verifier"*) is what tdd.md's judge does for kata mode — the agent that writes the test never gets to run it; the judge replays the commit against a clean checkout, with a *hidden* test the agent never saw. For real-project mode this is the sandbox-runner sliver: the next slice on the roadmap, motivated directly by this thread. | + +The rule that survives all three: **the iron law is enforced in the commit log, not the agent's session.** A passing test in the agent's terminal is anecdote. A `red:` SHA followed by a `green:` SHA, both replayable against the bundled test results, is evidence. ## Pattern 2 — hidden state contradicts the user's stated rules @@ -70,6 +80,20 @@ A reply names what the harness actually does: What this means for any rule a user writes for an AI agent: **if the rule lives in the agent's context window, the harness is allowed to outvote it.** The only rules that can't be outvoted are the ones enforced *outside* the context: a grep that runs in CI, a sibling-test-file check the file system enforces, a test count that drops visibly between commits. tdd.md and SAMA exist to push every rule into that "outside" position. +### How TDD + SAMA prevent it + +The principle is one line: **rules in `CLAUDE.md` lose to a 65k-token harness; rules in the file system and `git log` do not.** Each thread maps to a different rule that the discipline moves from prompt to artefact: + +| thread | the contradiction | what the discipline puts in its place | +|---|---|---| +| `1r8gr4k` 65k-token system prompt vs 70-line CLAUDE.md | the user's rules are outweighed 30:1 in the context window | **SAMA's *Atomic* (~700-line split)** shrinks the working surface so the agent doesn't *need* to fight the bloat for attention — it's only working on one atom + its sibling test, which fit with room to spare even after the 65k preamble. | +| `1piny6t` post-compaction injection deletes STOP-PLAN-ASK-WAIT | the harness silently rewrites user instructions after every `/compact` | **The iron law lives in the test file, not in CLAUDE.md.** Compaction can purge whatever it likes from the context; the test that has to fail before any green commit is in `cXX_*.test.ts`, version-controlled, immune to compaction. | +| `1njm40c` CLAUDE.md silently ignored on fresh sessions | the agent doesn't load CLAUDE.md autonomously | **The SAMA verification grep doesn't depend on the agent reading anything.** It runs in CI or as a pre-commit hook. The agent can pretend CLAUDE.md doesn't exist; layer violations still get caught. | +| `1snlp17` 4.7 violates a hook-blocked .env rule twice in 18 hours | even an external hook gets retried minutes later | The hook is the right *shape* (mechanical, external, not an instruction). The lesson is to extend that shape: **every rule worth enforcing should have a mechanical check.** The SAMA grep is one. Test-count-drop is another. Sibling-test-presence is a third. These don't ask the agent to obey — they refuse the merge if it didn't. | +| `1sstipj` system prompt churns 158 versions in 11 days | the prompt the agent runs against changes faster than the user can audit | **The verification rules don't change.** `grep -rE 'from "\./c[5-9]' src/c1*.ts src/c2*.ts src/c3*.ts` is the same line in May as in November as next year. The harness is allowed to be unstable; the artefact's contract is not. | + +The general rule: **don't write a CLAUDE.md instruction the harness can overrule. Write a structural check the harness doesn't get to know about.** + ## Pattern 3 — the community is independently converging on TDD+SAMA-shaped answers This is the strongest argument in the corpus, and the one that turns the case from "we think this works" into "experienced practitioners are landing on the same answer from different starting points". @@ -98,6 +122,19 @@ Same shape again: stop trying to instruct the agent into discipline; bound what **What the pattern proves.** Different authors, different repositories, different starting frustrations, all converging on the same answer-shape: *out-of-context structural mitigations, because in-context prompting demonstrably fails*. The convergence is the argument. tdd.md and SAMA are not novel claims; they are one expression of an answer that the practitioner community is independently rediscovering. +### How TDD + SAMA realize the convergence + +The mapping between what the threads describe and what tdd.md / SAMA already ship is one-to-one. This pattern is not a problem to solve; it is the discipline restated by people who arrived at it independently. + +| what the corpus prescribes | what tdd.md / SAMA already ship | +|---|---| +| DUMBAI's *"lock them to assigned files"* (`1ni19gr`) | **SAMA's *Atomic* + *Sorted*.** The agent works on one atom (the assigned file plus its sibling test); the layer prefix tells what may import what; the grep refuses pulls in the wrong direction. The "lock" is mechanical, not promised. | +| DUMBAI's *CONTRACT → STUB → TEST → IMPLEMENT* phases with validation gates | **The iron-law cycle.** CONTRACT = the type/parser in `c31_*`. STUB = the failing red commit. TEST = verify-RED before any impl lands. IMPLEMENT = the green commit. tdd.md's judge enforces the gate by reading the commit log, not by trusting the agent to wait. | +| *"Separate producer from verifier"* (top comment on `1rug14a`) | **tdd.md's deploy-time judging.** The agent that wrote the code never gets to be the agent that judges it. The verifier is a different process, run after the fact, against a clean checkout. | +| *"Design a sandbox, not trust"* (`1oqu6jn`) | **SAMA's labelled control room.** 13,000 levers become 13,000 *named, sorted* levers. The grep refuses pulls in the wrong direction. A layer violation is rejected by CI, not by the agent's good intentions. | + +What makes this section the strongest part of the case: **the convergence is external.** Every author cited in this pattern arrived at a TDD+SAMA-shaped answer from their own scars, none of them after reading the [SAMA pages](/sama). tdd.md is one packaging of an answer the community is independently demanding. + ## What this means for the case The previous post argued, from one thread, that TDD's iron law and SAMA's verification grep survive harness sloppiness because they're enforced outside the agent's context window. The corpus turns that into three load-bearing claims: diff --git a/src/c31_blog.ts b/src/c31_blog.ts index 42248bf71d7d58186197b7b8391b342e935a7f39..20af3ed5f2131f9aef0b3f146f1fdd22983a6efc 100644 --- a/src/c31_blog.ts +++ b/src/c31_blog.ts @@ -15,7 +15,7 @@ export const ALL_POSTS: BlogEntry[] = [ { slug: "agentic-coding-corpus-three-patterns", title: "Three patterns ten threads converge on", - description: "One thread is an audit. Ten threads are a pattern. A six-month corpus of r/ClaudeAI, r/ClaudeCode and r/AgentsOfAI posts shows the same three failure modes everywhere: agents attack the verifier rather than the impl, the harness's hidden state outvotes the user's stated rules, and experienced practitioners are independently arriving at TDD+SAMA-shaped answers. The corpus, with verbatim quotes.", + description: "One thread is an audit. Ten threads are a pattern. A six-month corpus of r/ClaudeAI, r/ClaudeCode and r/AgentsOfAI posts shows three failure modes everywhere — agents attack the verifier rather than the impl, the harness's hidden state outvotes the user's stated rules, and experienced practitioners independently arrive at TDD+SAMA-shaped answers. With per-pattern mitigation tables: how the iron law, the sibling-test rule, and the layer-prefix grep would have caught or prevented each thread.", date: "2026-05-09", }, {