
AI crawlability determines whether AI platforms like ChatGPT, Perplexity, Google AI Mode, and Claude can access your website content and use it to generate responses. If your site blocks AI crawlers, whether intentionally or by accident, your brand won’t appear in AI-generated answers, regardless of how good your content is.
This guide walks through the full audit process: checking your robots.txt, verifying server logs for AI bot activity, testing specific AI crawler access, and fixing the gaps.
“We run AI crawlability audits for every client, and the result surprises people every time. About seven out of ten enterprise sites block at least one major AI crawler without knowing it. Their security team or hosting provider added the rule, and nobody in marketing was told. The brand wonders why it’s invisible on ChatGPT, and the answer is a three-line robots.txt rule they never reviewed.”
Hardik Shah, Founder of ScaleGrowth.Digital
What Are AI Crawlers and Why Do They Matter?
AI crawlers are bots that visit your website, read your content, and feed it into AI systems. Some crawlers collect training data for AI models. Others fetch content in real time to generate responses with citations.
The distinction matters for your strategy:
Training crawlers collect content to train or fine-tune AI models. If you block these, your content won’t be part of the model’s knowledge base, but it might still get cited through real-time search.
Retrieval crawlers access your content at the moment a user asks a question. If you block these, you can’t get cited in real-time AI responses, even if the model already knows about your brand from training data.
For most brands that want AI visibility, blocking retrieval crawlers is the bigger problem. Blocking training crawlers is a business decision with different trade-offs.
The Major AI Crawlers You Need to Know
| Bot Name | Company | Purpose | Respects robots.txt | User-Agent String |
|---|---|---|---|---|
| GPTBot | OpenAI | Training + retrieval for ChatGPT | Yes | GPTBot |
| OAI-SearchBot | OpenAI | Real-time search for ChatGPT | Yes | OAI-SearchBot |
| ChatGPT-User | OpenAI | ChatGPT browsing mode | Yes | ChatGPT-User |
| ClaudeBot | Anthropic | Training for Claude | Yes | ClaudeBot |
| PerplexityBot | Perplexity AI | Real-time search + citation | Yes | PerplexityBot |
| Google-Extended | Google | Training for Gemini/AI Mode | Yes | Google-Extended |
| Googlebot | Google | Search indexing + AI Mode retrieval | Yes | Googlebot |
| Bytespider | ByteDance | Training for TikTok AI features | Partially | Bytespider |
| Applebot-Extended | Apple | Training for Apple Intelligence | Yes | Applebot-Extended |
| cohere-ai | Cohere | Training for enterprise AI | Yes | cohere-ai |
Step 1: Audit Your robots.txt File
Your robots.txt file is the first thing any crawler reads when it visits your site. It’s located at yourdomain.com/robots.txt. Open it and look for any rules targeting AI crawlers.
What to look for:
# These rules BLOCK AI crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
If you see Disallow: / under any of these user agents, that bot is blocked from your entire site. If you see Disallow: with specific paths (like /admin/ or /private/), only those paths are blocked, which may be intentional.
Also check for blanket blocks:
# This blocks EVERYTHING, including AI crawlers
User-agent: *
Disallow: /
A wildcard disallow blocks every bot, including AI crawlers. This is rare on production sites but common on staging environments that accidentally go live.
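You can script this review instead of eyeballing the file. The sketch below uses Python’s standard-library robots.txt parser to report which of the major AI bots are barred from the site root; the bot list comes from the table above, and you would feed in the contents of your own robots.txt:

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "PerplexityBot", "Google-Extended", "Applebot-Extended"]

def blocked_bots(robots_txt: str) -> list[str]:
    """Return the AI bots that may not fetch the site root ('/')."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, "/")]

# Example: GPTBot is fully blocked; everyone else only loses /admin/
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""
print(blocked_bots(sample))  # ['GPTBot']
```

Run it against the live file by fetching yourdomain.com/robots.txt first; any bot the function returns is fully shut out.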
What your robots.txt should look like for AI visibility:
# Allow all major AI crawlers
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
# Block sensitive areas from all bots
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /account/
Step 2: Check Your Server Logs for AI Bot Activity
Robots.txt tells bots what they’re allowed to do. Server logs tell you what bots are actually doing. Even if your robots.txt allows AI crawlers, they may not be visiting, or they may be visiting and getting errors.
How to check:
Access your server logs (usually in /var/log/nginx/access.log or /var/log/apache2/access.log, or through your hosting control panel). Search for the AI bot user-agent strings listed in the table above.
What to look for in the logs:
# Healthy AI crawler visit
203.0.113.10 - - [14/Mar/2026:10:23:45 +0000] "GET /blog/your-article/ HTTP/2.0" 200 45230 "-" "GPTBot/1.0"
# Blocked AI crawler
203.0.113.11 - - [14/Mar/2026:10:23:45 +0000] "GET /blog/your-article/ HTTP/2.0" 403 1234 "-" "ClaudeBot/1.0"
# Rate-limited AI crawler
203.0.113.12 - - [14/Mar/2026:10:23:45 +0000] "GET /blog/your-article/ HTTP/2.0" 429 567 "-" "PerplexityBot/1.0"
Status codes to watch for:
| Status Code | Meaning | Impact on AI Visibility | Action Required |
|---|---|---|---|
| 200 | Success: content served | Good: content accessible | None |
| 301/302 | Redirect | Usually fine if it resolves | Verify the redirect chain isn’t broken |
| 403 | Forbidden | Bad: bot is blocked by the server | Check firewall, WAF, and .htaccess rules |
| 429 | Rate limited | Bad: bot can’t crawl enough pages | Adjust rate limiting for AI bots |
| 500/503 | Server error | Bad: bot sees a broken site | Fix server issues |
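Grepping line by line gets tedious on a large log. A quick way to summarize the whole file is to tally status codes per AI bot; this minimal Python sketch assumes the combined log format shown above, where the user agent is the final quoted field (the default for both Nginx and Apache):

```python
import re
from collections import Counter

AI_BOTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "PerplexityBot", "Google-Extended", "Bytespider")

# Matches the status code, response size, referrer, and the final
# quoted field (the user agent) of a combined-format log line.
LINE_RE = re.compile(r'" (\d{3}) \d+ "[^"]*" "([^"]*)"$')

def ai_bot_hits(log_lines):
    """Count (bot, status) pairs for AI crawler requests."""
    hits = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        status, agent = m.groups()
        for bot in AI_BOTS:
            if bot in agent:
                hits[(bot, status)] += 1
    return hits

sample = [
    '203.0.113.7 - - [14/Mar/2026:10:23:45 +0000] "GET /blog/a/ HTTP/2.0" 200 45230 "-" "GPTBot/1.0"',
    '203.0.113.8 - - [14/Mar/2026:10:24:02 +0000] "GET /blog/b/ HTTP/2.0" 403 1234 "-" "ClaudeBot/1.0"',
]
print(ai_bot_hits(sample))
```

Feed it `open("/var/log/nginx/access.log")` and any (bot, 403) or (bot, 429) entry in the tally is a crawlability problem worth chasing.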
Step 3: Test AI Crawler Access Directly
Don’t just rely on robots.txt and logs. Test whether AI crawlers can actually fetch your content by simulating their requests.
Using curl to simulate AI crawler requests:
# Test GPTBot access
curl -A "GPTBot/1.0" -I https://yourdomain.com/your-important-page/
# Test ClaudeBot access
curl -A "ClaudeBot/1.0" -I https://yourdomain.com/your-important-page/
# Test PerplexityBot access
curl -A "PerplexityBot/1.0" -I https://yourdomain.com/your-important-page/
# Test OAI-SearchBot access
curl -A "OAI-SearchBot/1.0" -I https://yourdomain.com/your-important-page/
You want to see a 200 OK response for each. If you get 403 Forbidden, something beyond robots.txt is blocking the bot: typically a web application firewall (WAF), a CDN rule, or server-level configuration.
Common hidden blocking mechanisms:
- Cloudflare Bot Fight Mode: Cloudflare’s bot protection can block AI crawlers by default. Check your Cloudflare dashboard under Security > Bots.
- WAF rules: Some web application firewalls block bots based on user-agent patterns. AI crawler user-agents might match generic “bot” blocking rules.
- .htaccess rules: Apache servers sometimes have bot-blocking rules in .htaccess that target specific user agents or IP ranges.
- Hosting provider defaults: Some managed hosting providers (especially those focused on WordPress) block AI crawlers at the server level as a “security” feature.
Step 4: Check Your Content Delivery
Even if AI crawlers can access your pages, they might not see your actual content. This happens when content is rendered with JavaScript after the page loads.
The JavaScript rendering problem: If your site uses a JavaScript framework (React, Vue, Angular) for client-side rendering, AI crawlers may see an empty page or a loading spinner. Most AI crawlers don’t execute JavaScript; they read the raw HTML.
How to check:
# Fetch your page without JavaScript execution
curl -s https://yourdomain.com/your-page/ | grep -c "your-key-content-phrase"
# If the count is 0, the content is JavaScript-rendered and invisible to AI crawlers
The fix: Server-side rendering (SSR) or static site generation (SSG). Your content must be in the initial HTML response, not loaded after JavaScript executes.
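One caveat on the grep check above: it can produce a false pass when your key phrase appears inside a `<script>` tag (for example, in a JSON data blob the framework hydrates from) rather than in visible markup. A small Python sketch that hedges against this by stripping script blocks first:

```python
import re

# Strip <script>...</script> blocks so phrases hiding inside
# JavaScript data blobs don't count as "visible" content.
SCRIPT_RE = re.compile(r"<script\b.*?</script>", re.S | re.I)

def visible_in_raw_html(html: str, phrase: str) -> bool:
    """True if the phrase appears in the raw HTML outside <script>
    blocks, i.e. it is present without any JavaScript execution."""
    return phrase in SCRIPT_RE.sub("", html)

ssr_page = "<html><body><h1>Pricing plans</h1></body></html>"
csr_page = ('<html><body><div id="root"></div>'
            '<script>window.DATA={"h1":"Pricing plans"}</script>'
            "</body></html>")
print(visible_in_raw_html(ssr_page, "Pricing plans"))  # True
print(visible_in_raw_html(csr_page, "Pricing plans"))  # False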
Step 5: Verify Schema Markup Is Accessible
Schema markup (structured data) helps AI crawlers understand your content’s context. Check that your schema is present in the page source and valid.
Test with Google’s Rich Results Test: Enter your URL at search.google.com/test/rich-results. This shows you exactly what structured data Google can see on your page, which is a reasonable proxy for what other AI crawlers can see.
Priority schema types for AI crawlability:
- Organization: Tells AI who you are, what you do, where you’re located
- Article: Tells AI who wrote the content, when it was published, when it was updated
- FAQPage: Provides clean question-answer pairs that AI can extract directly
- HowTo: Provides step-by-step processes in a format AI can cite
- WebPage: Basic page metadata that helps AI categorize your content
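To confirm the markup is actually in your page source (and not injected by JavaScript after load), you can extract and parse the JSON-LD blocks from the raw HTML. A minimal Python sketch; the regex extraction is a convenience for well-formed pages, not a full HTML parser:

```python
import json
import re

JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.S | re.I,
)

def schema_types(html: str) -> list[str]:
    """Return the @type of every well-formed JSON-LD block in the source."""
    types = []
    for block in JSONLD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is effectively invisible to crawlers
        for item in (data if isinstance(data, list) else [data]):
            if isinstance(item, dict) and item.get("@type"):
                types.append(item["@type"])
    return types

page = """<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "AI Crawlability"}
</script>"""
print(schema_types(page))  # ['Article']
```

If the function returns an empty list for a page you believe has schema, the markup is either malformed or being added client-side, and either way AI crawlers won’t see it.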
Step 6: Review Your AI Crawl Budget
AI crawlers have limits on how many pages they’ll crawl per site. If your site has thousands of pages, the crawler may not reach your most important content.
Prioritize by:
- Ensuring your most important pages are in your XML sitemap
- Keeping your site’s internal linking structure clean: important pages should be reachable within 2-3 clicks from the homepage
- Removing or noindexing thin content pages that waste crawl budget
- Fixing broken internal links that create dead ends for crawlers
The Full AI Crawlability Audit Checklist
| Check | What to Test | Pass Criteria | Tools |
|---|---|---|---|
| robots.txt review | AI bot directives | No blanket disallow for AI bots | Browser, robots.txt validator |
| Server log analysis | AI bot visit frequency and status codes | 200 status, regular visits | Server logs, log analyzer |
| Direct crawler simulation | curl with AI user-agents | 200 OK for all major bots | curl, terminal |
| WAF/CDN check | Bot blocking rules | AI bots whitelisted | Cloudflare/WAF dashboard |
| JavaScript rendering | Content in raw HTML | Key content visible without JS | curl, view-source |
| Schema validation | Structured data present and valid | No errors in Rich Results Test | Google Rich Results Test |
| Sitemap coverage | Key pages in XML sitemap | All priority pages included | Screaming Frog, manual check |
| Internal linking | Click depth to key pages | Important pages within 3 clicks | Screaming Frog, site crawler |
| Response time | Server response under load | Under 500ms TTFB | curl timing, WebPageTest |
| Content freshness signals | dateModified in schema and headers | Updated within 90 days | Schema validator, page source |
What to Do When You Find Problems
If robots.txt blocks AI crawlers: Edit the file to allow the specific bots you want. If your CMS manages robots.txt (WordPress does via plugins like Yoast or Rank Math), update it through the plugin settings.
If your WAF blocks AI crawlers: Add exceptions for the specific user-agent strings. In Cloudflare, go to Security > WAF > Custom Rules and create an allow rule for GPTBot, ClaudeBot, PerplexityBot, etc.
If your content is JavaScript-rendered: This is a bigger fix. You’ll need to implement server-side rendering or pre-rendering. For WordPress sites, this usually isn’t an issue since WordPress serves HTML by default. For React/Vue/Angular sites, consider Next.js, Nuxt.js, or a pre-rendering service.
If AI crawlers visit but don’t cite you: Crawlability is fine; the problem is your content structure. Read our guide on building citable content blocks to format your content for AI extraction.
How Often Should You Run This Audit?
Run a full AI crawlability audit quarterly. The AI crawler landscape changes frequently: new bots appear, existing bots change their user-agent strings, and your hosting or security team may update configurations without informing marketing.
Between quarterly audits, set up automated monitoring:
- Alert on robots.txt changes (use a monitoring tool or script that checks the file daily)
- Monitor server logs weekly for AI bot 403/429 responses
- Track AI citation frequency monthly; a sudden drop often indicates a crawlability problem
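The robots.txt alert in particular is easy to script: hash the file on each run and compare against the previous hash. A minimal Python sketch (the state-file path is an assumption, and fetching the body, e.g. with curl or urllib, is left to your cron job):

```python
import hashlib
from pathlib import Path

def robots_changed(robots_body: bytes, state_file: Path) -> bool:
    """Compare a fresh copy of robots.txt against the hash from the last run.

    Fetch the body yourself, e.g.:
        urllib.request.urlopen("https://yourdomain.com/robots.txt").read()
    Returns True (alert!) when the file differs from the previous run.
    """
    new_hash = hashlib.sha256(robots_body).hexdigest()
    old_hash = state_file.read_text().strip() if state_file.exists() else None
    state_file.write_text(new_hash)  # remember this version for next time
    return old_hash is not None and old_hash != new_hash
```

On the first run there is no stored hash, so it records one and stays quiet; on later runs any edit to the file, including one your hosting provider makes silently, flips the return value to True.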
Check whether your brand is currently visible to AI with our AI visibility self-test.
Get a Professional AI Crawlability Audit
If you want a thorough audit without doing it manually, our SEO audit service includes a full AI crawlability assessment as part of the technical audit. We check every AI crawler, review your server logs, test your WAF configuration, validate your schema, and deliver a prioritized fix list.
Our Organic Growth Engine includes ongoing AI crawlability monitoring, so you’re not just auditing once; you’re staying visible as the AI search environment evolves. Reach out to start the conversation.