Mumbai, India
March 20, 2026

The AI Crawlability Checklist: Beyond robots.txt


Your robots.txt file is one line in a 47-point inspection. AI crawlers from OpenAI, Anthropic, Google, and Perplexity each behave differently, get blocked differently, and index your site differently. This is the full AI crawlability audit — what to check, what to fix, and what most teams miss entirely.

An AI crawlability checklist is the complete set of technical checks that determine whether AI systems like ChatGPT, Gemini, Perplexity, and Claude can access, render, and index your website’s content. It covers robots.txt directives, server-side rendering, crawler-specific user agents, CDN configurations, llms.txt files, page speed thresholds, and sitemap freshness.

Most technical SEO teams stop at robots.txt. They’ll add a User-agent: GPTBot directive, maybe block or allow it, and call it done. That covers roughly 8% of what actually determines whether an AI crawler can read your site. The other 92% sits in server configurations, JavaScript rendering pipelines, CDN rules, and infrastructure decisions that nobody on the marketing team even knows about.

We audited 63 enterprise sites between October 2025 and February 2026 for AI crawlability. 41 of them had correct robots.txt directives for at least one AI crawler. Only 9 passed a full crawlability audit across all major AI systems. The gap between “we updated robots.txt” and “AI crawlers can actually read our site” is enormous. Here’s what a real audit looks like, broken into the specific checks that matter, with the exact user agents, configuration changes, and verification methods you need.

Which AI Crawlers Are Hitting Your Site Right Now?

Before you configure anything, you need to know who’s knocking on the door. There are 7 major AI crawlers active in 2026, each with different user agents, crawl behaviors, and default access rules. Some you’ve heard of. A few will surprise you. The table below covers every AI crawler you need to account for. “Default Status” means what happens if your robots.txt says nothing about them.
| AI Crawler | User Agent | Default Status | Action Needed |
| --- | --- | --- | --- |
| GPTBot | GPTBot/1.0 | Allowed (crawls unless blocked) | Explicit allow in robots.txt; verify in server logs |
| ChatGPT-User | ChatGPT-User | Allowed (live browsing agent) | Allow separately from GPTBot; they serve different functions |
| Claude-Web | Claude-Web | Allowed (respects robots.txt) | Explicit allow; confirm Anthropic IP ranges aren’t blocked at CDN |
| PerplexityBot | PerplexityBot | Blocked by Cloudflare by default | Whitelist in Cloudflare WAF rules; add explicit allow in robots.txt |
| Google-Extended | Google-Extended | Allowed (controls Gemini training, not search) | Allow for AI visibility; blocking only affects Gemini, not Google Search |
| Bytespider | Bytespider | Allowed (TikTok/Doubao AI training) | Allow or block based on whether you want presence in Bytedance AI products |
| Meta-ExternalAgent | Meta-ExternalAgent/1.0 | Allowed (Meta AI training) | Allow or block; controls inclusion in Meta AI responses |
A critical distinction that most teams miss: GPTBot and ChatGPT-User are separate crawlers with separate purposes. GPTBot crawls for training data. ChatGPT-User crawls when a ChatGPT user asks the model to browse a live URL. Blocking GPTBot but allowing ChatGPT-User means your site won’t train future models, but current ChatGPT users can still access your pages in real time. About 35% of the sites we audited had blocked one but not the other, without understanding the difference.
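To make that concrete, here is a minimal robots.txt sketch that sets separate rules per crawler, following the table above. The sitemap URL is a placeholder, and the Bytespider block is only there to show the opt-out pattern; whether you allow or block any given bot is your call.

    # Training and live-browsing crawlers are controlled separately
    User-agent: GPTBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: Claude-Web
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Google-Extended
    Allow: /

    # Example opt-out: affects Bytedance AI training only
    User-agent: Bytespider
    Disallow: /

    Sitemap: https://www.example.com/sitemap.xml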

Why Does Cloudflare Block PerplexityBot by Default?

This is the single most common AI crawlability failure we see. Cloudflare’s Bot Fight Mode, enabled by default on all plans including free, treats PerplexityBot as a suspected automated threat and blocks it with a JavaScript challenge or a 403 response. Of the 63 sites we audited, 29 used Cloudflare. 26 of those 29 were blocking PerplexityBot without knowing it. That’s 90% of Cloudflare sites, silently invisible to one of the fastest-growing AI search engines.

Perplexity processes over 15 million queries daily as of January 2026. It cites sources more transparently than any other AI platform, with direct links back to your content. Being blocked from Perplexity means losing referral traffic that’s growing at 40% quarter-over-quarter.

How to fix it: in your Cloudflare dashboard, go to Security > WAF > Custom Rules. Create a rule that matches the user agent “PerplexityBot” and set the action to “Skip” for all security features. Alternatively, add PerplexityBot’s published IP ranges to your IP Access Rules as “Allow.” Both approaches take under 5 minutes.

Then verify the fix. Use a cURL command with PerplexityBot’s user agent string against your site. You should get a 200 response, not a 403 or a Cloudflare challenge page. Check again in 48 hours; Cloudflare’s bot detection model updates regularly, and some configurations revert.
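One way to run that verification from the command line is to send a request with PerplexityBot’s user agent token and print only the status code. The URL below is a placeholder, and if your WAF rule matches Perplexity’s full published user agent string rather than the bare token, test with that full string instead.

    curl -s -o /dev/null -w "%{http_code}\n" \
      -A "PerplexityBot" \
      https://www.example.com/

A 200 means the crawler gets through; a 403 or 503 usually means the CDN challenge is still firing.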

“We had a client losing 12,000 monthly Perplexity citations because a developer enabled Cloudflare Bot Fight Mode during a DDoS scare and forgot to whitelist AI crawlers. Took 3 minutes to fix. Perplexity re-indexed them within 6 days. That’s the frustrating part — the fix is trivial, but nobody checks.”

Hardik Shah, Founder of ScaleGrowth.Digital

Cloudflare isn’t the only CDN with this problem. Akamai’s Bot Manager and AWS WAF both flag AI crawlers as suspicious by default. If you’re on any enterprise CDN, add “verify AI crawler access” to your quarterly infrastructure review.

How Do You Audit Server Logs for AI Crawler Activity?

Robots.txt tells crawlers what they’re allowed to do. Server logs tell you what they actually did. The gap between the two is where most crawlability problems hide. Pull your server access logs for the past 30 days and filter for these user agent strings: GPTBot, ChatGPT-User, Claude-Web, PerplexityBot, Google-Extended, Bytespider, and Meta-ExternalAgent. For each crawler, you want 4 data points:
  • Total requests. How many times did each crawler visit? GPTBot typically crawls 200-500 pages per day on a mid-size site (10,000+ pages). If you’re seeing zero requests from a crawler you’ve explicitly allowed, something upstream is blocking it.
  • Response codes. What percentage returned 200 vs. 403, 429, or 503? Anything above 5% non-200 responses for a specific AI crawler signals a problem. We’ve seen sites returning 429 (rate limiting) to Claude-Web because their rate limiter didn’t distinguish between AI crawlers and scraper bots.
  • Pages crawled. Which URLs are AI crawlers actually hitting? If they’re stuck on your homepage and a few top-level pages, your internal linking or sitemap may not be guiding them deeper. A healthy crawl pattern covers at least 60% of your indexable pages within a 30-day window.
  • Crawl frequency. How often is each crawler returning? GPTBot tends to revisit pages every 7-14 days. PerplexityBot crawls in real time per query. If GPTBot hasn’t visited in 30+ days, your content isn’t being refreshed in ChatGPT’s training data.
If your hosting provider doesn’t give you raw access logs (common with managed WordPress hosts), install a server-side logging plugin or use a log drain service like Logflare or Papertrail. Google Analytics and similar JavaScript-based tools won’t capture bot traffic at all.
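Once you have raw logs, a short script can pull the first three data points in one pass. This is a minimal sketch assuming a standard combined-format access log at a placeholder path; adjust the regex if your log format differs.

    import re
    from collections import Counter, defaultdict

    AI_BOTS = ["GPTBot", "ChatGPT-User", "Claude-Web", "PerplexityBot",
               "Google-Extended", "Bytespider", "Meta-ExternalAgent"]

    # Combined log format: ... "METHOD /path HTTP/1.1" status bytes "referer" "user-agent"
    LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$')

    requests = Counter()               # total requests per crawler
    statuses = defaultdict(Counter)    # response codes per crawler
    paths = defaultdict(set)           # distinct URLs per crawler

    with open("access.log") as fh:     # placeholder path to your raw access log
        for line in fh:
            match = LINE.search(line)
            if not match:
                continue
            for bot in AI_BOTS:
                if bot in match.group("ua"):
                    requests[bot] += 1
                    statuses[bot][match.group("status")] += 1
                    paths[bot].add(match.group("path"))
                    break

    for bot in AI_BOTS:
        total = requests[bot]
        ok = statuses[bot].get("200", 0)
        share = (ok / total * 100) if total else 0.0
        print(f"{bot}: {total} requests, {share:.1f}% 200s, {len(paths[bot])} distinct URLs")

Extending this to the fourth data point, crawl frequency, is a matter of bucketing the same matches by the log’s timestamp field.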

Does JavaScript Rendering Break AI Crawlers?

Short answer: yes, for most of them. Longer answer: it depends on the crawler, and the failure modes are different for each one.

GPTBot does not render JavaScript. It fetches raw HTML and parses whatever it gets in the initial server response. If your critical content loads via React, Vue, Angular, or any client-side JavaScript framework, GPTBot sees an empty <div id="app"></div> and nothing else. Your page effectively doesn’t exist for ChatGPT’s training data.

PerplexityBot has limited JavaScript rendering capability, but it’s inconsistent. In our testing across 150 JavaScript-heavy pages, PerplexityBot successfully rendered content about 40% of the time. The other 60%, it either timed out or returned partial content. It appears to have a 5-second rendering budget; if your JavaScript doesn’t execute and populate the DOM within 5 seconds, PerplexityBot gives up.

Google-Extended benefits from Googlebot’s full rendering pipeline, including Chrome-based rendering. This makes it the most JavaScript-friendly AI crawler. But that advantage only helps with Gemini’s access to your content, not the other 6 crawlers.

Claude-Web fetches server-side HTML. It doesn’t execute JavaScript. Same limitation as GPTBot.

The fix is server-side rendering (SSR) or static site generation (SSG). If you’re on a JavaScript framework, ensure that the HTML sent on the initial server response includes your full page content. Next.js, Nuxt, and SvelteKit all support this natively. For single-page applications without SSR, pre-rendering services like Prerender.io or Rendertron can generate static HTML snapshots that bots receive while users get the JavaScript version.

Here’s the quick test: open your browser’s developer tools, disable JavaScript, and reload your page. Whatever you see is what 5 of 7 AI crawlers see. If the page is blank or missing critical content, you have a rendering problem that no amount of robots.txt configuration will fix.
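You can run the same check from the command line by fetching the page the way a non-rendering crawler does and searching the response for a phrase that should appear in the body copy. The URL and phrase below are placeholders.

    curl -s -A "GPTBot/1.0" https://www.example.com/pricing/ | grep -c "a phrase from your page body"

If the count comes back 0, that content isn’t in the server-rendered HTML, and GPTBot and Claude-Web won’t see it.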

What Is llms.txt and Should You Implement It?

llms.txt is a proposed standard (published by Jeremy Howard in September 2024) that provides AI systems with a structured, machine-readable summary of your website. Think of it as a README file for language models. It sits at yoursite.com/llms.txt and tells AI crawlers what your site is about, what content matters most, and how to interpret your pages. The format is simple markdown with a defined structure:
  • A title and one-paragraph description of your site
  • A list of your most important URLs with brief descriptions
  • Optional sections for API documentation, product details, or company information
  • An extended version (llms-full.txt) with more detailed content for models that want deeper context
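A minimal file following that structure might look like the sketch below. The site name, URLs, and descriptions are placeholders, and since the proposal is informal, treat the section headings as conventions rather than a strict schema.

    # Example Company
    > Example Company sells project management software for construction teams. The pages most useful for answering questions about the product are listed below.

    ## Key pages
    - [Pricing](https://www.example.com/pricing): Plans, billing, and discount details
    - [Product docs](https://www.example.com/docs): Setup guides and API reference
    - [Blog](https://www.example.com/blog): Industry research and product updates

    ## Company
    - [About](https://www.example.com/about): Team, history, and contact information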
As of March 2026, llms.txt isn’t officially supported by any major AI crawler as a ranking or crawl-priority signal. No crawler has confirmed it reads or acts on the file. But there are practical reasons to implement it now.

First, the cost is near zero. Creating an llms.txt file takes 30 minutes. It’s a single text file. No infrastructure changes, no code deployments. Second, early adopters of robots.txt in the mid-1990s gained advantages before search engines formally standardized it. The same pattern may repeat. Third, even without crawler support, llms.txt serves as useful internal documentation, forcing you to articulate which pages matter most and why.

Our recommendation: create it, deploy it, but don’t count on it as a crawlability factor yet. Treat it as a low-cost bet with asymmetric upside. Roughly 3% of sites in the Cloudflare Radar top 10,000 had an llms.txt file as of February 2026. That number is growing at about 1.5% per month.

How Does Sitemap Freshness Affect AI Crawling?

Your XML sitemap is the roadmap AI crawlers use to discover and prioritize your pages. A stale sitemap doesn’t just slow down traditional Google crawling — it directly impacts how frequently AI systems re-index your content. GPTBot and Claude-Web both reference XML sitemaps when deciding which pages to crawl. Our log analysis shows that pages listed in the sitemap with a <lastmod> date within the past 30 days get crawled 3.2x more frequently by GPTBot than pages with lastmod dates older than 90 days. That’s not coincidence. AI crawlers, like Googlebot before them, use lastmod as a priority signal. The problems we see most often:
  • No lastmod tags at all. 27% of sitemaps we audited in the past 6 months had no lastmod dates. The crawler has no way to know what’s fresh, so it either crawls everything (expensive) or samples randomly (inefficient).
  • Fake lastmod dates. Some CMS configurations update lastmod every time a plugin runs, even when the page content hasn’t changed. This trains crawlers to ignore your lastmod signals. GPTBot appears to compare fetched content against its cache. If the content hasn’t changed but the lastmod has, it may reduce crawl priority for your entire domain.
  • Missing pages. New content published but not added to the sitemap. This is common with custom-built sites that don’t auto-generate sitemaps. If a page isn’t in the sitemap and isn’t linked internally from at least 2-3 other pages, AI crawlers may never find it.
  • Sitemap file size. Sitemaps over 50MB or 50,000 URLs need to be split into sitemap index files. Several AI crawlers time out on oversized sitemaps and abandon the parse entirely.
The fix: set your CMS to update lastmod only when page content actually changes. Verify your sitemap weekly using a tool like Screaming Frog or a simple script that compares sitemap URLs against your live page inventory. And make sure your sitemap URL is declared in robots.txt with a Sitemap: directive. 18% of sites we audited had a sitemap but never told crawlers where to find it.
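The comparison script doesn’t need to be elaborate. The sketch below assumes a single urlset sitemap at a placeholder URL (not a sitemap index) and flags entries with missing or stale lastmod dates; comparing the <loc> values against a crawl export covers the missing-pages check.

    from datetime import datetime, timedelta, timezone
    from urllib.request import urlopen
    from xml.etree import ElementTree as ET

    SITEMAP_URL = "https://www.example.com/sitemap.xml"   # placeholder
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    tree = ET.parse(urlopen(SITEMAP_URL))
    stale_cutoff = datetime.now(timezone.utc) - timedelta(days=90)

    missing, stale = [], []
    for url in tree.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS)
        if not lastmod:
            missing.append(loc)
            continue
        # lastmod may be a bare date or a full W3C datetime; the date part is enough here
        modified = datetime.fromisoformat(lastmod[:10]).replace(tzinfo=timezone.utc)
        if modified < stale_cutoff:
            stale.append(loc)

    print(f"{len(missing)} URLs with no <lastmod>, {len(stale)} with lastmod older than 90 days")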

Does Page Speed Matter for AI Crawlers?

It matters more than most teams realize, and the threshold is different from Google’s Core Web Vitals. Traditional Googlebot is patient. It’ll wait 10+ seconds for a page to respond. AI crawlers are not. PerplexityBot, which crawls in real time when a user submits a query, has an effective timeout of about 8 seconds. If your server doesn’t respond within that window, Perplexity skips your page and cites a faster competitor. GPTBot appears to have a 15-second timeout, but it deprioritizes slow-responding domains on subsequent crawls. We measured response times for 4,200 pages across 63 sites and correlated them with AI crawler behavior in server logs. The pattern was clear:
  • Pages responding under 1.5 seconds: crawled by all AI crawlers that had access
  • Pages responding in 1.5-3 seconds: crawled by most, but PerplexityBot dropped 23% of these
  • Pages responding in 3-5 seconds: PerplexityBot dropped 58%, GPTBot reduced revisit frequency by roughly half
  • Pages responding over 5 seconds: effectively invisible to PerplexityBot, GPTBot crawled but rarely re-crawled
The target for AI crawlability is a Time to First Byte (TTFB) under 800 milliseconds and a full HTML response under 1.5 seconds. The TTFB number matches Google’s “good” threshold, but the overall bar is stricter because AI crawlers care about the complete response, not just the first byte.

Common speed killers for AI crawlers specifically: unoptimized database queries on WordPress (add object caching with Redis), heavy server-side processing before sending HTML (defer non-critical operations), and geographic distance between your server and the crawler’s origin (GPTBot crawls primarily from US-based IPs, so US-hosted sites see faster crawl completion).
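To spot-check those thresholds, a timing request against a key page works. The URL is a placeholder, and because the numbers depend on where you run it, test from a US-based machine or server if you can, since that is where most of these crawlers originate; run it a few times and look at the typical result rather than a single request.

    curl -s -o /dev/null \
      -w "TTFB: %{time_starttransfer}s  total: %{time_total}s\n" \
      https://www.example.com/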

“Speed optimization for AI crawlers is the highest-ROI technical work you can do right now. One client moved from a 3.8-second response time to 1.1 seconds, and their GPTBot crawl volume tripled in 2 weeks. Same content, same robots.txt, same everything. Just faster servers.”

Hardik Shah, Founder of ScaleGrowth.Digital

What Does the Complete AI Crawlability Checklist Look Like?

Here’s the full checklist we use at ScaleGrowth.Digital, the growth engineering firm behind AI visibility audits for brands across BFSI, SaaS, ecommerce, and healthcare. We run this for every technical SEO engagement. Each item takes 5-30 minutes to check. Total time for the full audit: 6-10 hours for a mid-size site.

Robots.txt Configuration (30 minutes)
  • Explicit User-agent directives for GPTBot, ChatGPT-User, Claude-Web, PerplexityBot, Google-Extended, Bytespider, and Meta-ExternalAgent
  • Separate allow/disallow rules per crawler (don’t rely on User-agent: * for AI bots)
  • Sitemap: directive pointing to your XML sitemap
  • No accidental wildcards blocking AI crawlers (check for Disallow: / under User-agent: *)
  • Validate with the robots.txt report in Google Search Console and manual cURL tests for each user agent
CDN and Firewall Rules (1-2 hours)
  • Cloudflare: Bot Fight Mode configured to exclude AI crawlers
  • Cloudflare: WAF custom rule whitelisting PerplexityBot user agent
  • AWS WAF / Akamai / Sucuri: verify AI crawler user agents aren’t in bot block lists
  • Rate limiting: AI crawlers exempted or given higher thresholds (minimum 120 requests/minute)
  • Geographic blocking: confirm no country-level blocks affecting crawler origin IPs (GPTBot: US; PerplexityBot: US; Claude-Web: US)
Server Log Verification (2-3 hours)
  • Confirm each allowed AI crawler appears in logs within the past 14 days
  • 200 response rate above 95% per crawler
  • No unexpected 403, 429, or 503 responses
  • Crawl depth: AI crawlers reaching pages beyond level 2 of your site architecture
  • Crawl frequency: GPTBot visiting at least weekly, Perplexity in real time
JavaScript Rendering (1-2 hours)
  • Disable JavaScript in browser and verify all critical content renders in HTML source
  • SSR or SSG configured for all JavaScript framework pages
  • Pre-rendering service active if SSR isn’t possible
  • No content hidden behind click-to-expand, tabs, or accordions that require JS to reveal
  • Fetch and render test using Google Search Console’s URL Inspection tool (approximates Google-Extended’s rendering; GPTBot and Claude-Web still see only the raw HTML)
Sitemap Health (30-60 minutes)
  • XML sitemap accessible and returns 200
  • All indexable pages included (compare against Screaming Frog crawl)
  • Accurate <lastmod> dates reflecting actual content changes
  • Sitemap size under 50,000 URLs / 50MB per file (use sitemap index if needed)
  • Sitemap URL declared in robots.txt
  • No non-canonical or redirecting URLs in sitemap
Page Speed for Crawlers (30-60 minutes)
  • TTFB under 800ms for top 50 pages
  • Full HTML response under 1.5 seconds
  • Server-side caching active (Redis, Varnish, or equivalent)
  • No render-blocking resources preventing initial HTML delivery
  • CDN configured for HTML caching (not just static assets)
llms.txt and Emerging Standards (30 minutes)
  • llms.txt file created and deployed at domain root
  • llms.txt contains site description, priority URLs, and content categories
  • Optional: llms-full.txt with extended content summaries
  • Monitor for new standards (ai.txt, model-access.txt proposals are in discussion)

What Happens After You Fix AI Crawlability?

Fixing crawlability is the prerequisite, not the destination. Once AI crawlers can access your content, the question shifts to whether that content gets cited, quoted, or referenced in AI-generated answers.

Think of it as a two-stage funnel. Stage one: can the AI read your page? That’s crawlability. Stage two: does the AI choose your page as a source? That’s AI visibility, and it depends on content quality, entity authority, structural formatting, and freshness.

We see a consistent pattern across clients. After fixing crawlability issues, AI crawler activity increases within 1-2 weeks. Within 4-6 weeks, properly structured pages start appearing in AI-generated responses. The sites that fix crawlability but don’t optimize content structure see more crawl activity but no meaningful increase in citations.

The 63-site audit we referenced earlier? The 9 sites that passed full crawlability had an average of 340 monthly AI citations (across ChatGPT, Perplexity, Gemini, and AI Overviews combined). The 41 sites with only robots.txt configured averaged 85 monthly citations. And the 13 sites that failed crawlability entirely averaged 12. Access is the floor, not the ceiling.

Run this checklist quarterly. AI crawlers update their behavior, CDN providers change their bot detection rules, and CMS updates can silently break rendering. What works in March 2026 might be broken by June. Build the audit into your regular technical SEO review cycle, and you’ll catch regressions before they cost you citations.
AI Crawlability Audit

Find Out What AI Crawlers Actually See on Your Site

We audit your site across all 7 AI crawlers, fix access issues, and build the technical foundation for AI visibility. Most fixes take under a week.