robots.txt: explicitly Allow major AI bots (defense vs Cloudflare-injected Disallow)
Cloudflare's 'Block AI Crawlers' feature injects Disallow blocks for ClaudeBot, GPTBot, CCBot, Google-Extended, etc. BEFORE the app's robots.txt body. This is the opposite of what tdd.md wants: the entire empirical-chain argument depends on AI agents being able to read the site.

The canonical fix is in the Cloudflare dashboard (Security → Bots → 'Block AI Crawlers' off, or Content Signals → ai-train=yes).

This commit is defense-in-depth: explicit per-bot Allow blocks at the app level, plus a comment in the source documenting where the real fix lives. For crawlers that resolve conflicting rules last-match-wins, the app blocks now override Cloudflare's injected Disallows. For crawlers that resolve them first-match-wins, only the dashboard fix helps, but the app comment now points the next maintainer at it.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
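To make the precedence argument concrete, here is a minimal sketch of how one served robots.txt can yield different answers under the three resolution strategies a crawler might use. The `served` text is a hypothetical stand-in for an edge-injected Disallow followed by the app body (the exact text Cloudflare injects is an assumption), and `rulesFor` / `isAllowed` are illustrative helpers, not part of the app. The `longest-match` branch follows RFC 9309: groups naming the same user agent are merged, the longest matching path wins, and Allow wins a tie. Wildcards (`*`, `$`) in paths are not handled.

```typescript
type Rule = { allow: boolean; path: string };
type Strategy = "first-match" | "last-match" | "longest-match";

// Collect, in document order, every rule from every group that names `agent`.
// Groups for the same agent are merged into one rule list.
function rulesFor(robotsTxt: string, agent: string): Rule[] {
  const wanted = agent.toLowerCase();
  const rules: Rule[] = [];
  let groupAgents: string[] = [];
  let inRules = false; // true once the current group has seen a rule line
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const colon = line.indexOf(":");
    if (colon < 0) continue;
    const key = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (key === "user-agent") {
      // A User-agent line after rule lines starts a new group.
      if (inRules) { groupAgents = []; inRules = false; }
      groupAgents.push(value.toLowerCase());
    } else if (key === "allow" || key === "disallow") {
      inRules = true;
      if (value !== "" && groupAgents.includes(wanted)) {
        rules.push({ allow: key === "allow", path: value });
      }
    }
  }
  return rules;
}

function isAllowed(robotsTxt: string, agent: string, path: string, strategy: Strategy): boolean {
  let rules = rulesFor(robotsTxt, agent);
  if (rules.length === 0) rules = rulesFor(robotsTxt, "*"); // fall back to the wildcard group
  let winner: Rule | undefined;
  for (const r of rules) {
    if (!path.startsWith(r.path)) continue;
    if (winner === undefined) { winner = r; continue; }
    if (strategy === "first-match") continue;          // keep the earliest hit
    if (strategy === "last-match") { winner = r; continue; }
    // longest-match: more specific path wins; on equal length, Allow wins
    const longer = r.path.length > winner.path.length;
    const tieForAllow = r.path.length === winner.path.length && r.allow;
    if (longer || tieForAllow) winner = r;
  }
  return winner === undefined ? true : winner.allow;
}

// Hypothetical served file: injected Disallow first, then the app body.
const served =
  "User-agent: ClaudeBot\nDisallow: /\n\n" +
  "User-agent: *\nAllow: /\nDisallow: /auth/\nDisallow: /api/\n\n" +
  "User-agent: ClaudeBot\nAllow: /\n";

for (const s of ["first-match", "last-match", "longest-match"] as const) {
  console.log(s, isAllowed(served, "ClaudeBot", "/goals", s));
}
```

Under these assumptions only the `first-match` strategy still blocks ClaudeBot on `/goals`; the other two honor the app-level Allow.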
1 file changed · +28 −1
src/d21_app.ts
| @@ -189,7 +189,34 @@ export const createApp = (port: number) => Bun.serve({ | ||
| 189 | 189 | "/healthz": new Response("ok"), |
| 190 | 190 | |
| 191 | 191 | "/robots.txt": new Response( |
| 192 | - `User-agent: *\nAllow: /\nDisallow: /auth/\nDisallow: /api/\n\nSitemap: https://tdd.md/sitemap.xml\n`, | |
| 192 | + // tdd.md is built for AI agents to read, audit, and learn from. We | |
| 193 | + // explicitly ALLOW the major AI crawlers + training agents. The site's | |
| 194 | + // entire empirical-chain argument depends on those agents being able | |
| 195 | + // to fetch the spec, the verifier output, the /goals archive, and the | |
| 196 | + // measurement posts. | |
| 197 | + // | |
| 198 | + // NOTE on Cloudflare: if "Block AI Crawlers" or "AI Audit / Content | |
| 199 | + // Signals" is enabled at the Cloudflare edge, CF injects Disallow | |
| 200 | + // blocks for these bots BEFORE this response body. App-level Allows | |
| 201 | + // here are defense-in-depth; the canonical fix is to disable that CF | |
| 202 | + // setting (Dashboard → Security → Bots → "Block AI Crawlers" off, or | |
| 203 | + // Content Signals → ai-train=yes). | |
| 204 | + `# tdd.md welcomes AI crawlers, agents, and training bots.\n` + | |
| 205 | + `# The empirical chain is meant to be read.\n\n` + | |
| 206 | + `User-agent: *\nAllow: /\nDisallow: /auth/\nDisallow: /api/\n\n` + | |
| 207 | + `User-agent: ClaudeBot\nAllow: /\n\n` + | |
| 208 | + `User-agent: Claude-Web\nAllow: /\n\n` + | |
| 209 | + `User-agent: GPTBot\nAllow: /\n\n` + | |
| 210 | + `User-agent: ChatGPT-User\nAllow: /\n\n` + | |
| 211 | + `User-agent: CCBot\nAllow: /\n\n` + | |
| 212 | + `User-agent: Google-Extended\nAllow: /\n\n` + | |
| 213 | + `User-agent: Applebot-Extended\nAllow: /\n\n` + | |
| 214 | + `User-agent: Amazonbot\nAllow: /\n\n` + | |
| 215 | + `User-agent: Bytespider\nAllow: /\n\n` + | |
| 216 | + `User-agent: meta-externalagent\nAllow: /\n\n` + | |
| 217 | + `User-agent: PerplexityBot\nAllow: /\n\n` + | |
| 218 | + `User-agent: Perplexity-User\nAllow: /\n\n` + | |
| 219 | + `Sitemap: https://tdd.md/sitemap.xml\n`, | |
| 193 | 220 | { headers: { "Content-Type": "text/plain; charset=utf-8" } }, |
| 194 | 221 | ), |
| 195 | 222 | |
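One thing worth checking in the hunk above: under RFC 9309 a crawler that matches a named group obeys only that group, so the per-bot `Allow: /` groups do not inherit `Disallow: /auth/` and `Disallow: /api/` from the `*` group. The sketch below assumes the intent is to keep those two paths off-limits for the named bots as well, and builds the same body from a list instead of hand-written string concatenation. `buildRobotsTxt`, `AI_BOTS`, and `PRIVATE_PATHS` are hypothetical names, not identifiers from `src/d21_app.ts`.

```typescript
// Bot names copied from the diff; the constants and helper are illustrative.
const AI_BOTS = [
  "ClaudeBot", "Claude-Web", "GPTBot", "ChatGPT-User", "CCBot",
  "Google-Extended", "Applebot-Extended", "Amazonbot", "Bytespider",
  "meta-externalagent", "PerplexityBot", "Perplexity-User",
] as const;

const PRIVATE_PATHS = ["/auth/", "/api/"] as const;

// One robots.txt group: Allow everything, then re-state the private paths,
// because a named group does not inherit rules from the `*` group.
function group(agent: string, disallow: readonly string[]): string {
  const blocked = disallow.map((p) => `Disallow: ${p}\n`).join("");
  return `User-agent: ${agent}\nAllow: /\n${blocked}\n`;
}

function buildRobotsTxt(sitemapUrl: string): string {
  const header =
    "# tdd.md welcomes AI crawlers, agents, and training bots.\n" +
    "# The empirical chain is meant to be read.\n\n";
  const groups = ["*", ...AI_BOTS].map((a) => group(a, PRIVATE_PATHS)).join("");
  return `${header}${groups}Sitemap: ${sitemapUrl}\n`;
}

const robotsTxt = buildRobotsTxt("https://tdd.md/sitemap.xml");
console.log(robotsTxt);
```

The resulting string could be passed to `new Response(...)` in place of the concatenated template literals. If the named bots are in fact meant to reach `/auth/` and `/api/`, the current groups already do that and no change is needed.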