syntaxai/tdd.md · commit 4905da9

robots.txt: explicitly Allow major AI bots (defense vs Cloudflare-injected Disallow)

Cloudflare's 'Block AI Crawlers' feature injects Disallow blocks for ClaudeBot, GPTBot, CCBot, Google-Extended, etc. BEFORE the app's robots.txt body. This is the opposite of what tdd.md wants — the entire empirical-chain argument depends on AI agents being able to read the site.

The canonical fix is in the Cloudflare dashboard (Security → Bots → 'Block AI Crawlers' off, or Content Signals → ai-train=yes). This commit is defense-in-depth: explicit per-bot Allow blocks at the app level, plus a comment in the source documenting where the real fix lives.

For crawlers that resolve conflicting rules last-match-wins, the app's Allow blocks now override Cloudflare's injected Disallows. For crawlers that resolve them first-match-wins, only the dashboard fix helps, but the comment in the source now points the next maintainer at it.
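The difference between the two parsing behaviours can be sketched as follows. This is a minimal illustration, not code from the repository: `Rule`, `Strategy`, `isAllowed`, and `claudeBotRules` are hypothetical names, path matching is a plain prefix check with no wildcard support, and the rule list assumes a crawler has already merged the injected and app-level groups for one user agent in file order. The `longest-match` strategy is not mentioned in this commit message; it is included as an assumption about crawlers that follow RFC 9309, where the most specific path wins and an Allow is preferred on a tie.

```typescript
type Rule = { allow: boolean; path: string };
type Strategy = "first-match" | "last-match" | "longest-match";

// Rules seen for one user agent, in the order they appear in the served file.
// Assumption: the edge-injected Disallow comes first, the app's Allow second.
const claudeBotRules: Rule[] = [
  { allow: false, path: "/" }, // injected at the Cloudflare edge
  { allow: true, path: "/" }, // emitted by the app
];

function isAllowed(rules: Rule[], url: string, strategy: Strategy): boolean {
  // Keep only rules whose path is a prefix of the requested URL path.
  const matching = rules.filter((r) => url.startsWith(r.path));
  const first = matching[0];
  const last = matching[matching.length - 1];
  // No applicable rule: crawling is allowed by default.
  if (first === undefined || last === undefined) return true;

  if (strategy === "first-match") return first.allow;
  if (strategy === "last-match") return last.allow;

  // longest-match: the most specific path wins; on a tie, prefer Allow.
  const longest = Math.max(...matching.map((r) => r.path.length));
  return matching.filter((r) => r.path.length === longest).some((r) => r.allow);
}

const firstMatch = isAllowed(claudeBotRules, "/goals", "first-match");
const lastMatch = isAllowed(claudeBotRules, "/goals", "last-match");
const longestMatch = isAllowed(claudeBotRules, "/goals", "longest-match");
```

Under these assumptions only the first-match crawler stays blocked, which is the case the dashboard fix has to cover.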

Co-Authored-By: Claude Opus 4.7 <[email protected]>

author: syntaxai <[email protected]>
date: 2026-05-25 18:47:32 +01:00
parent: 84b9f84
commit: 4905da9096145c6948fde38718fa5e687d41502a

1 file changed · +28 −1

modified src/d21_app.ts +28 −1

@@ -189,7 +189,34 @@ export const createApp = (port: number) => Bun.serve({
  "/healthz": new Response("ok"),

  "/robots.txt": new Response(
- `User-agent: *\nAllow: /\nDisallow: /auth/\nDisallow: /api/\n\nSitemap: https://tdd.md/sitemap.xml\n`,
+ // tdd.md is built for AI agents to read, audit, and learn from. We
+ // explicitly ALLOW the major AI crawlers + training agents. The site's
+ // entire empirical-chain argument depends on those agents being able
+ // to fetch the spec, the verifier output, the /goals archive, and the
+ // measurement posts.
+ //
+ // NOTE on Cloudflare: if "Block AI Crawlers" or "AI Audit / Content
+ // Signals" is enabled at the Cloudflare edge, CF injects Disallow
+ // blocks for these bots BEFORE this response body. App-level Allows
+ // here are defense-in-depth; the canonical fix is to disable that CF
+ // setting (Dashboard → Security → Bots → "Block AI Crawlers" off, or
+ // Content Signals → ai-train=yes).
+ `# tdd.md welcomes AI crawlers, agents, and training bots.\n` +
+ `# The empirical chain is meant to be read.\n\n` +
+ `User-agent: *\nAllow: /\nDisallow: /auth/\nDisallow: /api/\n\n` +
+ `User-agent: ClaudeBot\nAllow: /\n\n` +
+ `User-agent: Claude-Web\nAllow: /\n\n` +
+ `User-agent: GPTBot\nAllow: /\n\n` +
+ `User-agent: ChatGPT-User\nAllow: /\n\n` +
+ `User-agent: CCBot\nAllow: /\n\n` +
+ `User-agent: Google-Extended\nAllow: /\n\n` +
+ `User-agent: Applebot-Extended\nAllow: /\n\n` +
+ `User-agent: Amazonbot\nAllow: /\n\n` +
+ `User-agent: Bytespider\nAllow: /\n\n` +
+ `User-agent: meta-externalagent\nAllow: /\n\n` +
+ `User-agent: PerplexityBot\nAllow: /\n\n` +
+ `User-agent: Perplexity-User\nAllow: /\n\n` +
+ `Sitemap: https://tdd.md/sitemap.xml\n`,
  { headers: { "Content-Type": "text/plain; charset=utf-8" } },
  ),

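The per-bot groups in the hunk above all share one shape, so the same body could be generated from a list. The sketch below is one possible way to do that, not what the commit does: `AI_BOTS` and `buildRobotsTxt` are hypothetical names, and the bot list, header comments, and sitemap URL are copied from the diff.

```typescript
// User agents taken from the diff above; the constant name is illustrative.
const AI_BOTS: readonly string[] = [
  "ClaudeBot", "Claude-Web", "GPTBot", "ChatGPT-User", "CCBot",
  "Google-Extended", "Applebot-Extended", "Amazonbot", "Bytespider",
  "meta-externalagent", "PerplexityBot", "Perplexity-User",
];

function buildRobotsTxt(bots: readonly string[], sitemapUrl: string): string {
  const header =
    "# tdd.md welcomes AI crawlers, agents, and training bots.\n" +
    "# The empirical chain is meant to be read.\n\n";
  // Default group for every other crawler, unchanged from the previous body.
  const defaultGroup =
    "User-agent: *\nAllow: /\nDisallow: /auth/\nDisallow: /api/\n\n";
  // One explicit Allow group per named bot.
  const botGroups = bots
    .map((bot) => `User-agent: ${bot}\nAllow: /\n\n`)
    .join("");
  return header + defaultGroup + botGroups + `Sitemap: ${sitemapUrl}\n`;
}

const robotsTxt = buildRobotsTxt(AI_BOTS, "https://tdd.md/sitemap.xml");
```

The resulting string could be passed to `new Response(...)` in place of the concatenated template literals, and adding a crawler would then be a one-line change to the list.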
