AI Crawling

AI Crawling refers to the process by which automated bots operated by AI companies—such as GPTBot, ClaudeBot, and PerplexityBot—visit and collect content from websites. The collected data is used for a variety of purposes, including large language model (LLM) training, AI search result generation, and real-time question answering.

Why It Matters

As of 2025–2026, AI crawler traffic is growing rapidly as a share of total bot traffic, with training-purpose crawling accounting for approximately 80% of all AI bot activity. For content creators, AI Crawling matters in two ways. First, you may want to prevent your content from being used, without authorization, as training data for AI models. Second, if you want your content to be cited and surfaced in AI search engines (Perplexity, ChatGPT Search, Gemini, etc.), you must allow the relevant search crawlers to access your site. In other words, managing AI Crawling is a strategic challenge of balancing content protection against securing AI visibility (LLM Visibility).

Major AI Crawlers

As of 2026, the major AI crawlers, their operators, and primary purposes are as follows:

| User-Agent | Operator | Primary Purpose |
|---|---|---|
| GPTBot | OpenAI | Model training data collection |
| OAI-SearchBot | OpenAI | ChatGPT search result generation |
| ChatGPT-User | OpenAI | Real-time page retrieval during user conversations |
| ClaudeBot | Anthropic | Model training data collection |
| Claude-SearchBot | Anthropic | Claude search result indexing |
| Claude-User | Anthropic | Real-time page retrieval for user queries |
| Google-Extended | Google | Gemini model training control token |
| PerplexityBot | Perplexity | Web crawling for AI search |
| CCBot | Common Crawl | Open web archive (used for training many AI models) |
| Bytespider | ByteDance | TikTok search and AI features |
| meta-externalagent | Meta | Meta AI feature support |
| Applebot-Extended | Apple | Apple Intelligence training |
| Amazonbot | Amazon | Alexa and Amazon AI services |

Googlebot accounts for 38.7% of all AI-related bot requests, followed by GPTBot at 12.8%, meta-externalagent at 11.6%, and ClaudeBot at 11.4%—these four crawlers collectively represent approximately 74% of all AI bot traffic.

How to Allow or Block AI Crawlers

AI crawler access is controlled through the robots.txt file. Most major AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) officially state that they comply with robots.txt directives.

Example: Blocking all AI training crawlers:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

Example: Blocking training while allowing AI search visibility:

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search/real-time retrieval crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Note that Google-Extended is a control token rather than a traditional crawler, so it does not appear directly in server logs. It is used to restrict Gemini training without blocking Googlebot itself.
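As a sanity check, rules like the ones above can be evaluated programmatically with Python's standard urllib.robotparser. The sketch below parses an inline robots.txt (the rules and example.com URL are placeholders, not a real deployment) and confirms that a training bot is blocked while a search/live-retrieval bot is allowed:

```python
from urllib import robotparser

# Hypothetical robots.txt mirroring the example above.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Training bot is blocked; search/live-retrieval bot is allowed.
print(rp.can_fetch("GPTBot", "https://example.com/post"))        # False
print(rp.can_fetch("ChatGPT-User", "https://example.com/post"))  # True
```

Note that user-agents with no matching rule (e.g. PerplexityBot in this fragment) default to allowed, which is why explicit Disallow entries matter for every training bot you want to block.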

Strategic Considerations

Trade-off between training blocking and AI search visibility: Blocking all AI crawlers wholesale protects your content but prevents it from being cited in AI search results. Selectively allowing access, distinguishing training bots from search bots, is the generally recommended strategy as of 2026.

Regular audits are essential: AI companies frequently introduce new crawler User-Agents. When Anthropic consolidated its earlier anthropic-ai and Claude-Web agents into ClaudeBot, sites that had blocked only the old names were inadvertently left open to the new crawler. Review your robots.txt at least once per quarter.
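A simple audit like the one described can be scripted: parse your robots.txt for User-agent lines and report which known AI crawlers have no explicit rule at all. This is a minimal sketch; the agent list is drawn from the table above and should be extended as new crawlers appear:

```python
# Known AI crawler tokens from the table above (extend over time).
KNOWN_AI_AGENTS = {"GPTBot", "ClaudeBot", "CCBot", "Google-Extended",
                   "Bytespider", "PerplexityBot", "meta-externalagent"}

def audit_robots(robots_txt):
    """Return known AI agents that have no explicit User-agent rule."""
    declared = {line.split(":", 1)[1].strip()
                for line in robots_txt.splitlines()
                if line.lower().startswith("user-agent:")}
    return sorted(KNOWN_AI_AGENTS - declared)

# Usage with a hypothetical robots.txt that only covers two bots:
sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n"
print(audit_robots(sample))
```

Running this quarterly against your live robots.txt surfaces newly introduced crawlers before they accumulate unmonitored access.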

Cloudflare Pay-per-Crawl: In July 2025, Cloudflare launched a Pay-per-Crawl feature that allows site owners to receive micropayments of $0.01–$0.05 per AI bot crawl request. This has attracted attention as a new option for content monetization.

Server log monitoring: Even after configuring robots.txt, it is important to verify through server logs that crawlers are actually complying with your directives. Some smaller AI crawlers have been reported to ignore robots.txt, in which case firewall-level blocking may be necessary.
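Such compliance checks can be automated by scanning access logs for known AI crawler user-agent strings. A minimal sketch, assuming combined-format log lines (the sample lines and IPs are placeholders):

```python
import re
from collections import Counter

# Known AI crawler tokens from the table above; extend as new agents appear.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot",
             "Bytespider", "meta-externalagent", "Amazonbot"]
pattern = re.compile("|".join(map(re.escape, AI_AGENTS)))

def count_ai_hits(log_lines):
    """Count requests per AI crawler found in access-log lines."""
    hits = Counter()
    for line in log_lines:
        m = pattern.search(line)
        if m:
            hits[m.group(0)] += 1
    return hits

# Usage with hypothetical log lines:
sample = [
    '1.2.3.4 - - [..] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0 ... GPTBot/1.2"',
    '5.6.7.8 - - [..] "GET / HTTP/1.1" 200 128 "-" "Mozilla/5.0 ... ClaudeBot/1.0"',
]
print(count_ai_hits(sample))
```

Hits from a crawler you have disallowed in robots.txt indicate non-compliance and are candidates for firewall-level blocking.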

How inblog Helps

inblog's robots.txt allows search engine crawlers by default. Per-bot allow/block settings for AI crawlers can be managed through the dashboard's robots.txt editor.