syntaxai/tdd.md · commit 4905da9

robots.txt: explicitly Allow major AI bots (defense vs Cloudflare-injected Disallow)

Cloudflare's 'Block AI Crawlers' feature injects Disallow blocks for ClaudeBot, GPTBot, CCBot, Google-Extended, etc. BEFORE the app's robots.txt body. This is the opposite of what tdd.md wants — the entire empirical-chain argument depends on AI agents being able to read the site.
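One way to check whether the edge is injecting such blocks is to take the body actually served at /robots.txt and list the user-agents that carry a blanket Disallow. The following is a minimal sketch, assuming the injected blocks are ordinary `User-agent` / `Disallow: /` groups. `parseGroups` and `blanketDisallowed` are hypothetical helpers for illustration, not part of the tdd.md source.

```typescript
// Hypothetical helpers for illustration; not part of the tdd.md source.
type Rule = { kind: "allow" | "disallow"; path: string };
type Group = { agents: string[]; rules: Rule[] };

// Split a robots.txt body into groups: consecutive User-agent lines open
// a group, and the Allow/Disallow lines that follow belong to it.
function parseGroups(body: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let sawRule = false; // true once the current group has received a rule

  for (const raw of body.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (key === "user-agent") {
      // A User-agent line after a rule starts a new group.
      if (current === null || sawRule) {
        current = { agents: [], rules: [] };
        groups.push(current);
        sawRule = false;
      }
      current.agents.push(value);
    } else if (key === "allow" || key === "disallow") {
      if (current === null) continue; // rule before any User-agent line
      const kind: Rule["kind"] = key === "allow" ? "allow" : "disallow";
      current.rules.push({ kind, path: value });
      sawRule = true;
    }
    // Other keys (for example Sitemap) are ignored in this sketch.
  }
  return groups;
}

// User-agents whose group contains a blanket "Disallow: /".
function blanketDisallowed(body: string): string[] {
  const agents: string[] = [];
  for (const group of parseGroups(body)) {
    const blocked = group.rules.some(
      (r) => r.kind === "disallow" && r.path === "/",
    );
    if (blocked) {
      for (const agent of group.agents) agents.push(agent);
    }
  }
  return agents;
}
```

Run against the served body rather than the app's own string, since the injection happens at the edge and is invisible from inside the app.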

The canonical fix is in the Cloudflare dashboard (Security → Bots → 'Block AI Crawlers' off, or Content Signals → ai-train=yes). This commit is defense-in-depth: explicit per-bot Allow blocks at the app level, plus a comment in the source documenting where the real fix lives.

For crawlers that resolve conflicting rules last-match-wins, the app-level blocks now override Cloudflare's injected Disallows. For crawlers that resolve them first-match-wins, only the dashboard fix helps, but the comment in the source now points the next maintainer to it.
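The difference between these parsing behaviours can be sketched as follows. This is an illustration under stated assumptions, not a model of any particular crawler: it assumes the crawler merges every group that names it, in document order, so the injected `Disallow: /` comes before the app's `Allow: /`. The `isAllowed` function and the `Strategy` names are hypothetical. The third strategy follows the rule in RFC 9309, where the longest matching path wins and an Allow wins a tie.

```typescript
// Hypothetical illustration; not part of the tdd.md source.
type Rule = { kind: "allow" | "disallow"; path: string };
type Strategy = "first-match" | "last-match" | "longest-match";

// Decide whether urlPath may be fetched, given the merged rules for one
// crawler in document order. Matching is by plain path prefix only.
function isAllowed(rules: Rule[], urlPath: string, strategy: Strategy): boolean {
  const matches = rules.filter(
    (r) => r.path !== "" && urlPath.startsWith(r.path),
  );
  const first = matches[0];
  if (first === undefined) return true; // no rule applies: allowed by default

  if (strategy === "first-match") return first.kind === "allow";

  if (strategy === "last-match") {
    const last = matches[matches.length - 1];
    return last !== undefined && last.kind === "allow";
  }

  // longest-match: the most specific path wins, and Allow wins a tie.
  let best = first;
  for (const rule of matches) {
    const longer = rule.path.length > best.path.length;
    const tieGoesToAllow =
      rule.path.length === best.path.length && rule.kind === "allow";
    if (longer || tieGoesToAllow) best = rule;
  }
  return best.kind === "allow";
}

// Injected block first, app-level block second.
const merged: Rule[] = [
  { kind: "disallow", path: "/" },
  { kind: "allow", path: "/" },
];
```

With `merged` as above, only the first-match strategy blocks the crawler, which is the case where the dashboard fix is the only remedy.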

Co-Authored-By: Claude Opus 4.7 <[email protected]>
author: syntaxai <[email protected]>
date: 2026-05-25 18:47:32 +01:00
parent: 84b9f84
commit: 4905da9096145c6948fde38718fa5e687d41502a

1 file changed · +28 −1

modified src/d21_app.ts +28 −1
@@ -189,7 +189,34 @@ export const createApp = (port: number) => Bun.serve({
  "/healthz": new Response("ok"),

  "/robots.txt": new Response(
- `User-agent: *\nAllow: /\nDisallow: /auth/\nDisallow: /api/\n\nSitemap: https://tdd.md/sitemap.xml\n`,
+ // tdd.md is built for AI agents to read, audit, and learn from. We
+ // explicitly ALLOW the major AI crawlers + training agents. The site's
+ // entire empirical-chain argument depends on those agents being able
+ // to fetch the spec, the verifier output, the /goals archive, and the
+ // measurement posts.
+ //
+ // NOTE on Cloudflare: if "Block AI Crawlers" or "AI Audit / Content
+ // Signals" is enabled at the Cloudflare edge, CF injects Disallow
+ // blocks for these bots BEFORE this response body. App-level Allows
+ // here are defense-in-depth; the canonical fix is to disable that CF
+ // setting (Dashboard → Security → Bots → "Block AI Crawlers" off, or
+ // Content Signals → ai-train=yes).
+ `# tdd.md welcomes AI crawlers, agents, and training bots.\n` +
+ `# The empirical chain is meant to be read.\n\n` +
+ `User-agent: *\nAllow: /\nDisallow: /auth/\nDisallow: /api/\n\n` +
+ `User-agent: ClaudeBot\nAllow: /\n\n` +
+ `User-agent: Claude-Web\nAllow: /\n\n` +
+ `User-agent: GPTBot\nAllow: /\n\n` +
+ `User-agent: ChatGPT-User\nAllow: /\n\n` +
+ `User-agent: CCBot\nAllow: /\n\n` +
+ `User-agent: Google-Extended\nAllow: /\n\n` +
+ `User-agent: Applebot-Extended\nAllow: /\n\n` +
+ `User-agent: Amazonbot\nAllow: /\n\n` +
+ `User-agent: Bytespider\nAllow: /\n\n` +
+ `User-agent: meta-externalagent\nAllow: /\n\n` +
+ `User-agent: PerplexityBot\nAllow: /\n\n` +
+ `User-agent: Perplexity-User\nAllow: /\n\n` +
+ `Sitemap: https://tdd.md/sitemap.xml\n`,
  { headers: { "Content-Type": "text/plain; charset=utf-8" } },
  ),

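The per-bot blocks in the diff above follow one fixed pattern, so the body could also be generated from a list of bot names. The sketch below is one possible shape, not the code in this commit: `buildRobotsTxt` and its `repeatDisallows` flag are hypothetical. With the flag off, it reproduces the same text layout as the diff. The flag exists because, under RFC 9309, a crawler that finds a group naming it follows that group and falls back to the `User-agent: *` group only when no named group matches, so the `/auth/` and `/api/` Disallows in the wildcard group may not apply to the named bots unless they are repeated.

```typescript
// Hypothetical helper for illustration; not part of the tdd.md source.
function buildRobotsTxt(
  bots: string[],
  sitemapUrl: string,
  repeatDisallows = false,
): string {
  // Paths taken from the wildcard group in the diff.
  const disallows = ["/auth/", "/api/"]
    .map((path) => `Disallow: ${path}\n`)
    .join("");

  const header =
    "# tdd.md welcomes AI crawlers, agents, and training bots.\n" +
    "# The empirical chain is meant to be read.\n\n";

  const wildcard = `User-agent: *\nAllow: /\n${disallows}\n`;

  // One group per bot; optionally repeat the private-path Disallows so
  // they also bind crawlers that only read their own group.
  const perBot = bots
    .map(
      (bot) =>
        `User-agent: ${bot}\nAllow: /\n${repeatDisallows ? disallows : ""}\n`,
    )
    .join("");

  return header + wildcard + perBot + `Sitemap: ${sitemapUrl}\n`;
}
```

Calling `buildRobotsTxt(["ClaudeBot", "GPTBot"], "https://tdd.md/sitemap.xml")` yields the comment header, the wildcard group, two `Allow: /` groups, and the Sitemap line. Passing `true` as the third argument changes crawler-visible behaviour, so it is a separate decision from this commit.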