The AI Crawlability Checklist: Beyond robots.txt
Your robots.txt file is one line in a 47-point inspection. AI crawlers from OpenAI, Anthropic, Google, and Perplexity each behave differently, get blocked differently, and index your site differently. This is the full AI crawlability audit — what to check, what to fix, and what most teams miss entirely.
Most teams add a `User-agent: GPTBot` directive, maybe block or allow it, and call it done. That covers roughly 8% of what actually determines whether an AI crawler can read your site. The other 92% sits in server configurations, JavaScript rendering pipelines, CDN rules, and infrastructure decisions that nobody on the marketing team even knows about.
We audited 63 enterprise sites between October 2025 and February 2026 for AI crawlability. 41 of them had correct robots.txt directives for at least one AI crawler. Only 9 passed a full crawlability audit across all major AI systems. The gap between “we updated robots.txt” and “AI crawlers can actually read our site” is enormous.
Here’s what a real audit looks like, broken into the specific checks that matter, with the exact user agents, configuration changes, and verification methods you need.
Which AI Crawlers Are Hitting Your Site Right Now?
| AI Crawler | User Agent | Default Status | Action Needed |
|---|---|---|---|
| GPTBot | GPTBot/1.0 | Allowed (crawls unless blocked) | Explicit allow in robots.txt; verify in server logs |
| ChatGPT-User | ChatGPT-User | Allowed (live browsing agent) | Allow separately from GPTBot; they serve different functions |
| Claude-Web | Claude-Web | Allowed (respects robots.txt) | Explicit allow; confirm Anthropic IP ranges aren’t blocked at CDN |
| PerplexityBot | PerplexityBot | Blocked by Cloudflare by default | Whitelist in Cloudflare WAF rules; add explicit allow in robots.txt |
| Google-Extended | Google-Extended | Allowed (controls Gemini training, not search) | Allow for AI visibility; blocking only affects Gemini, not Google Search |
| Bytespider | Bytespider | Allowed (TikTok/Doubao AI training) | Allow or block based on whether you want presence in Bytedance AI products |
| Meta-ExternalAgent | Meta-ExternalAgent/1.0 | Allowed (Meta AI training) | Allow or block; controls inclusion in Meta AI responses |
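As a starting point, a robots.txt that addresses each crawler in the table explicitly might look like the sketch below. The allow-everything policy and the sitemap URL are placeholders; adjust each block to your own access policy.

```
# Explicit per-crawler rules for AI bots
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```

Listing each agent separately means a later change to `User-agent: *` rules can’t silently cut off an AI crawler you meant to allow.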
Why Does Cloudflare Block PerplexityBot by Default?
Cloudflare isn’t the only CDN with this problem. Akamai’s Bot Manager and AWS WAF both flag AI crawlers as suspicious by default. If you’re on any enterprise CDN, add “verify AI crawler access” to your quarterly infrastructure review.

“We had a client losing 12,000 monthly Perplexity citations because a developer enabled Cloudflare Bot Fight Mode during a DDoS scare and forgot to whitelist AI crawlers. Took 3 minutes to fix. Perplexity re-indexed them within 6 days. That’s the frustrating part — the fix is trivial, but nobody checks.”
Hardik Shah, Founder of ScaleGrowth.Digital
How Do You Audit Server Logs for AI Crawler Activity?
- Total requests. How many times did each crawler visit? GPTBot typically crawls 200-500 pages per day on a mid-size site (10,000+ pages). If you’re seeing zero requests from a crawler you’ve explicitly allowed, something upstream is blocking it.
- Response codes. What percentage returned 200 vs. 403, 429, or 503? Anything above 5% non-200 responses for a specific AI crawler signals a problem. We’ve seen sites returning 429 (rate limiting) to Claude-Web because their rate limiter didn’t distinguish between AI crawlers and scraper bots.
- Pages crawled. Which URLs are AI crawlers actually hitting? If they’re stuck on your homepage and a few top-level pages, your internal linking or sitemap may not be guiding them deeper. A healthy crawl pattern covers at least 60% of your indexable pages within a 30-day window.
- Crawl frequency. How often is each crawler returning? GPTBot tends to revisit pages every 7-14 days. PerplexityBot crawls in real time per query. If GPTBot hasn’t visited in 30+ days, your content isn’t being refreshed in ChatGPT’s training data.
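The four checks above can be scripted against a standard combined-format access log. Here is a minimal sketch; the user-agent substrings come from the table earlier, while the log format assumption (Apache/nginx combined format) is mine — adjust the regex if your format differs.

```python
import re
from collections import defaultdict

AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "Claude-Web", "PerplexityBot",
               "Google-Extended", "Bytespider", "Meta-ExternalAgent"]

# Combined log format: ... "METHOD /path HTTP/1.1" status size "referer" "user-agent"
LINE_RE = re.compile(r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$')

def audit_log(lines):
    """Per-crawler request count, non-200 rate, and distinct URLs crawled."""
    stats = defaultdict(lambda: {"total": 0, "non_200": 0, "paths": set()})
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        for bot in AI_CRAWLERS:
            if bot in m.group("ua"):
                s = stats[bot]
                s["total"] += 1
                if m.group("status") != "200":
                    s["non_200"] += 1
                s["paths"].add(m.group("path"))
                break
    return {bot: {"requests": s["total"],
                  "non_200_pct": round(100 * s["non_200"] / s["total"], 1),
                  "unique_pages": len(s["paths"])}
            for bot, s in stats.items()}
```

Run it over the last 30 days of logs and compare the output against the thresholds above: a non-200 rate over 5% or an allowed crawler missing from the results entirely is the signal to dig into WAF and rate-limiter rules.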
Does JavaScript Rendering Break AI Crawlers?
GPTBot doesn’t execute JavaScript. It reads the raw HTML your server returns on the initial response. If your site is a client-rendered single-page application, GPTBot sees `<div id="app"></div>` and nothing else. Your page effectively doesn’t exist for ChatGPT’s training data.
PerplexityBot has limited JavaScript rendering capability, but it’s inconsistent. In our testing across 150 JavaScript-heavy pages, PerplexityBot successfully rendered content about 40% of the time. The other 60%, it either timed out or returned partial content. It appears to have a 5-second rendering budget — if your JavaScript doesn’t execute and populate the DOM within 5 seconds, PerplexityBot gives up.
Google-Extended benefits from Googlebot’s full rendering pipeline, including Chrome-based rendering. This makes it the most JavaScript-friendly AI crawler. But that advantage only helps with Gemini’s access to your content, not the other 6 crawlers.
Claude-Web fetches server-side HTML. It doesn’t execute JavaScript. Same limitation as GPTBot.
The fix is server-side rendering (SSR) or static site generation (SSG). If you’re on a JavaScript framework, ensure that the HTML sent on the initial server response includes your full page content. Next.js, Nuxt, and SvelteKit all support this natively. For single-page applications without SSR, pre-rendering services like Prerender.io or Rendertron can generate static HTML snapshots that bots receive while users get the JavaScript version.
Here’s the quick test: open your browser’s developer tools, disable JavaScript, and reload your page. Whatever you see is what 5 of 7 AI crawlers see. If the page is blank or missing critical content, you have a rendering problem that no amount of robots.txt configuration will fix.
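You can approximate the same test in code by inspecting the raw server HTML (what a non-rendering crawler receives) for meaningful visible text. This is a rough heuristic sketch — the 200-character threshold is my assumption, not a documented crawler limit:

```python
import re

def looks_client_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: does the raw HTML carry enough visible text for a
    non-JS crawler, or is it just an empty app shell?"""
    # Drop script/style blocks, then strip remaining tags to leave visible text.
    stripped = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html,
                      flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", stripped)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) < min_text_chars

# Typical SPA shell: nothing for a non-rendering crawler to read.
spa_shell = ('<html><body><div id="app"></div>'
             '<script src="/bundle.js"></script></body></html>')
```

Point it at the response body from a plain `curl` of your URL (not the browser-rendered DOM) to see your pages the way the non-rendering crawlers do.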
What Is llm.txt and Should You Implement It?
llm.txt is an emerging standard for AI crawlers. The file sits at yoursite.com/llm.txt and tells AI crawlers what your site is about, what content matters most, and how to interpret your pages.
The format is simple markdown with a defined structure:
- A title and one-paragraph description of your site
- A list of your most important URLs with brief descriptions
- Optional sections for API documentation, product details, or company information
- An extended version (llm-full.txt) with more detailed content for models that want deeper context
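Under that structure, a minimal llm.txt might look like the following sketch. The company name, URLs, and descriptions are all placeholders:

```markdown
# Example Company

> Example Company makes scheduling software for field-service teams.

## Key pages

- [Pricing](https://www.example.com/pricing): Plan tiers and feature comparison
- [Product docs](https://www.example.com/docs): Setup guides and API reference
- [Blog](https://www.example.com/blog): Scheduling and dispatch best practices

## Company

- [About](https://www.example.com/about): Company background and team
```

Keep it short: the file is a map for models, not a second sitemap, so list only the pages you most want AI systems to read and cite.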
How Does Sitemap Freshness Affect AI Crawling?
In our audit data, pages with a `<lastmod>` date within the past 30 days get crawled 3.2x more frequently by GPTBot than pages with lastmod dates older than 90 days. That’s not coincidence. AI crawlers, like Googlebot before them, use lastmod as a priority signal.
The problems we see most often:
- No lastmod tags at all. 27% of sitemaps we audited in the past 6 months had no lastmod dates. The crawler has no way to know what’s fresh, so it either crawls everything (expensive) or samples randomly (inefficient).
- Fake lastmod dates. Some CMS configurations update lastmod every time a plugin runs, even when the page content hasn’t changed. This trains crawlers to ignore your lastmod signals. GPTBot appears to compare fetched content against its cache. If the content hasn’t changed but the lastmod has, it may reduce crawl priority for your entire domain.
- Missing pages. New content published but not added to the sitemap. This is common with custom-built sites that don’t auto-generate sitemaps. If a page isn’t in the sitemap and isn’t linked internally from at least 2-3 other pages, AI crawlers may never find it.
- Sitemap file size. Sitemaps over 50MB or 50,000 URLs need to be split into sitemap index files. Several AI crawlers timeout on oversized sitemaps and abandon the parse entirely.
- No Sitemap: directive in robots.txt. 18% of sites we audited had a sitemap but never told crawlers where to find it.
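Most of these sitemap problems are easy to catch with a script. Here is a sketch that flags URLs with missing or stale `<lastmod>` entries; the 90-day staleness threshold mirrors the crawl-frequency figures above:

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def audit_sitemap(xml_text: str, stale_days: int = 90):
    """Return (urls missing <lastmod>, urls with lastmod older than stale_days)."""
    root = ET.fromstring(xml_text)
    missing, stale = [], []
    cutoff = datetime.now(timezone.utc) - timedelta(days=stale_days)
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None:
            missing.append(loc)
            continue
        # Accept both date-only and full W3C datetime forms.
        dt = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        if dt < cutoff:
            stale.append(loc)
    return missing, stale
```

Running this against the sitemap, then diffing its URL list against a full site crawl, covers the missing-lastmod, fake-lastmod, and missing-page checks in one pass.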
Does Page Speed Matter for AI Crawlers?
- Pages responding under 1.5 seconds: crawled by all AI crawlers that had access
- Pages responding in 1.5-3 seconds: crawled by most, but PerplexityBot dropped 23% of these
- Pages responding in 3-5 seconds: PerplexityBot dropped 58%, GPTBot reduced revisit frequency by roughly half
- Pages responding over 5 seconds: effectively invisible to PerplexityBot, GPTBot crawled but rarely re-crawled
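Those tiers are easy to encode in a monitoring check so that a slow deploy gets flagged before crawl volume drops. A small sketch mapping a measured full-HTML response time onto the risk levels above (the tier labels are mine, summarizing the observed numbers):

```python
def crawl_speed_risk(seconds: float) -> str:
    """Map a full-HTML response time onto the observed AI-crawl risk tiers."""
    if seconds < 1.5:
        return "ok: crawled by all AI crawlers with access"
    if seconds < 3.0:
        return "warning: PerplexityBot dropped ~23% of pages in this range"
    if seconds < 5.0:
        return "high: PerplexityBot dropped ~58%; GPTBot revisits roughly halved"
    return "critical: effectively invisible to PerplexityBot"
```

Feed it the timing from your synthetic monitoring for your top pages and alert on anything that leaves the "ok" tier.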
“Speed optimization for AI crawlers is the highest-ROI technical work you can do right now. One client moved from a 3.8-second response time to 1.1 seconds, and their GPTBot crawl volume tripled in 2 weeks. Same content, same robots.txt, same everything. Just faster servers.”
Hardik Shah, Founder of ScaleGrowth.Digital
What Does the Complete AI Crawlability Checklist Look Like?
- Explicit `User-agent` directives for GPTBot, ChatGPT-User, Claude-Web, PerplexityBot, Google-Extended, Bytespider, and Meta-ExternalAgent
- Separate allow/disallow rules per crawler (don’t rely on `User-agent: *` for AI bots)
- `Sitemap:` directive pointing to your XML sitemap
- No accidental wildcards blocking AI crawlers (check for `Disallow: /` under `User-agent: *`)
- Validate with Google’s robots.txt tester and manual cURL tests for each user agent
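The per-agent validation step can be scripted with Python’s standard-library robots.txt parser. A sketch that checks every AI user agent against a robots.txt body — the example rules in the test are placeholders, so swap in your live file:

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ChatGPT-User", "Claude-Web", "PerplexityBot",
             "Google-Extended", "Bytespider", "Meta-ExternalAgent"]

def check_agents(robots_txt: str, url: str = "/") -> dict:
    """Return {agent: allowed?} for a given robots.txt body and URL path."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {agent: rp.can_fetch(agent, url) for agent in AI_AGENTS}
```

Note that this only tests the robots.txt layer; a crawler can pass here and still be blocked by the CDN or WAF rules in the next section, which is why the log review matters.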
- Cloudflare: Bot Fight Mode configured to exclude AI crawlers
- Cloudflare: WAF custom rule whitelisting PerplexityBot user agent
- AWS WAF / Akamai / Sucuri: verify AI crawler user agents aren’t in bot block lists
- Rate limiting: AI crawlers exempted or given higher thresholds (minimum 120 requests/minute)
- Geographic blocking: confirm no country-level blocks affecting crawler origin IPs (GPTBot: US; PerplexityBot: US; Claude-Web: US)
- Confirm each allowed AI crawler appears in logs within the past 14 days
- 200 response rate above 95% per crawler
- No unexpected 403, 429, or 503 responses
- Crawl depth: AI crawlers reaching pages beyond level 2 of your site architecture
- Crawl frequency: GPTBot visiting at least weekly, Perplexity in real time
- Disable JavaScript in browser and verify all critical content renders in HTML source
- SSR or SSG configured for all JavaScript framework pages
- Pre-rendering service active if SSR isn’t possible
- No content hidden behind click-to-expand, tabs, or accordions that require JS to reveal
- Fetch and render test using Google Search Console’s URL Inspection tool (approximates AI crawler behavior)
- XML sitemap accessible and returns 200
- All indexable pages included (compare against Screaming Frog crawl)
- Accurate `<lastmod>` dates reflecting actual content changes
- Sitemap size under 50,000 URLs / 50MB per file (use sitemap index if needed)
- Sitemap URL declared in robots.txt
- No non-canonical or redirecting URLs in sitemap
- TTFB under 800ms for top 50 pages
- Full HTML response under 1.5 seconds
- Server-side caching active (Redis, Varnish, or equivalent)
- No render-blocking resources preventing initial HTML delivery
- CDN configured for HTML caching (not just static assets)
- llm.txt file created and deployed at domain root
- llm.txt contains site description, priority URLs, and content categories
- Optional: llm-full.txt with extended content summaries
- Monitor for new standards (ai.txt, model-access.txt proposals are in discussion)
What Happens After You Fix AI Crawlability?
Find Out What AI Crawlers Actually See on Your Site
We audit your site across all 7 AI crawlers, fix access issues, and build the technical foundation for AI visibility. Most fixes take under a week. Get Your AI Crawlability Audit →