GetCited

Which AI Crawlers Actually Matter in 2026

By Malcolm Michaels · January 3, 2026

The AI crawler landscape shifted dramatically in 2025. Bytespider, which dominated at 42% of AI traffic, collapsed to 7%. GPTBot surged from 5% to 30%. And 13.26% of AI bots now ignore robots.txt entirely, up from 3.3% a year ago.

I maintain a database of 38 AI crawlers for GetCited. Here’s what actually matters based on the latest traffic data.

The Real Numbers

Cloudflare’s data shows AI bots (excluding Googlebot) now account for 4.2% of global HTML requests. Add Googlebot’s 4.5%, and nearly 9% of web traffic is crawler-related.

But the mix changed fast:

Crawler	2024 Share	2025 Share	Change
GPTBot	5%	30%	+305%
Meta-ExternalAgent	n/a	19%	New entry
Bytespider	42%	7%	-85%
ClaudeBot	11.7%	5.4%	-46%
Amazonbot	n/a	n/a	-35%

ChatGPT-User requests grew 2,825%. PerplexityBot grew 157,490% in raw request volume. These aren’t typos.

The Uncomfortable Truth About Crawl-to-Referral Ratios

Here’s what nobody talks about: most AI crawling sends almost no traffic back.

Research shows Anthropic’s crawl-to-referral ratio as high as 500,000:1. They crawl 500,000 pages for every 1 visitor they send. OpenAI’s ratio: 3,700:1.

The breakdown of why AI crawlers hit your site:

80%: Training data collection
18%: AI search indexing
2%: User-triggered actions

If you’re allowing crawlers hoping for traffic, the math isn’t great for most of them. Perplexity is the exception: they actually send referral traffic because they cite sources visibly. How to get cited by Perplexity.

Which Crawlers to Allow

Based on current data, here’s how I’d prioritize:

For AI Search Citations (Allow These)

These crawlers are attached to products that cite sources and send traffic:

PerplexityBot: The clear winner for publishers. 157,490% growth and they show citations on every answer. Real referral traffic. Docs

OAI-SearchBot: ChatGPT’s search feature. Newer, growing. Docs

YouBot: You.com’s search. Smaller but cites sources. Docs

DuckAssistBot: DuckDuckGo’s AI features. Privacy-focused audience. Docs

xAI-Grok-Bot: Grok now cites sources, so there’s visibility upside. Docs

For User-Triggered Access (Allow These)

These crawlers fetch pages when a user asks the AI to look something up:

ChatGPT-User: Up 2,825%. When someone asks ChatGPT to browse a URL, this is what hits your server.

Perplexity-User: Real-time fetches for Perplexity queries.

Claude-User: Claude’s equivalent.

Training Crawlers (Your Call)

These collect data for model training. No direct citation benefit:

GPTBot: Dominates at 30% of AI traffic. Feeds ChatGPT training and features. Blocking it means invisibility to ChatGPT entirely, but the crawl-to-referral ratio is poor. Docs

ClaudeBot: Down 46% but still significant. Training + some real-time access. Docs

Google-Extended: Gemini and AI Overviews. Separate from Googlebot (search rankings unaffected). Docs

CCBot: Common Crawl’s nonprofit archive. Used by many AI companies for training data. Docs

Meta-ExternalAgent: Strong new entry at 19% share. Aggressive crawling for Meta AI.

anthropic-ai: Anthropic’s training crawler. 500,000:1 crawl-to-referral ratio.

Which Crawlers to Block

Bytespider (ByteDance): Collapsed from 42% to 7% market share, but still known for aggressive behavior. No documentation, no transparency. I block it.

Stealth crawlers: Here’s the problem. ChatGPT’s Atlas browser and Operator agent use standard Chrome user-agent strings. They’re indistinguishable from regular browser traffic. You can’t block them via robots.txt. There’s no solution for this yet. More on what agentic AI means for publishers.

The Blocking Landscape

For context on what other publishers are doing:

5.6 million websites now block GPTBot (up 70% since July 2025)
5.8 million websites block ClaudeBot

Source: The Register

The trend is clear: publishers are increasingly blocking training crawlers while trying to preserve search/citation visibility.

The Enforcement Problem

robots.txt is advisory. Bots can ignore it.

And increasingly, they do. 13.26% of AI bot requests ignored robots.txt directives in Q2 2025, up from 3.3% in Q4 2024. That’s a 4x increase in non-compliance in six months.

If a crawler ignores your robots.txt, your options are:

Block by user-agent at the server level (.htaccess or nginx)
Rate-limit suspicious traffic
Accept that some crawling is uncontrollable

The stealth crawlers using Chrome user-agents? There’s currently no way to block those without blocking real users.

My Recommendations

If you want maximum AI visibility: Allow everything. Accept that most crawling is for training with poor referral returns, but you’ll be in the training data and searchable across all AI products.

If you want citations without training contribution:

# Allow search/citation crawlers
User-agent: PerplexityBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: xAI-Grok-Bot
Allow: /

# Block training crawlers
User-agent: GPTBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /

If you want to block aggressively: Block all AI crawlers and accept invisibility to AI products. Join the 5.6 million sites blocking GPTBot.

Quick Reference

Crawler	Share	Recommendation
GPTBot	30%	Your call: dominates but poor referral ratio
PerplexityBot	Growing fast	Allow: best for citations
Meta-ExternalAgent	19%	Block: aggressive, training only
Bytespider	7% (down from 42%)	Block: opaque, aggressive
ClaudeBot	5.4%	Your call: declining share
ChatGPT-User	+2,825%	Allow: user-triggered
OAI-SearchBot	Growing	Allow: ChatGPT search
xAI-Grok-Bot	n/a	Allow: now cites sources

Full list with user-agent strings: AI Crawlers Reference

FAQ

Should I allow GPTBot?

Depends on your goals. It’s 30% of AI crawler traffic and feeds ChatGPT’s features. But the crawl-to-referral ratio is 3,700:1. You’re contributing training data, not getting much traffic back. I’d allow OAI-SearchBot and ChatGPT-User while blocking GPTBot if you want ChatGPT search visibility without training contribution.

Why did Bytespider collapse?

Unclear. ByteDance hasn’t commented. Possibly reduced crawling after backlash, possibly shifted to other methods. Either way, it’s no longer the dominant force it was.

Can I stop stealth crawlers?

Not currently. ChatGPT’s Atlas and Operator use standard Chrome user-agents. Blocking them means blocking real Chrome users. This is an unsolved problem.

Do AI companies respect robots.txt?

Less than they used to. 13.26% of requests now ignore robots.txt, up from 3.3% a year ago. The major companies (OpenAI, Anthropic, Google) publicly commit to respecting it, but enforcement is impossible.

Is blocking even worth it?

For training crawlers, maybe. You’re not getting traffic from them anyway. For search/citation crawlers, blocking hurts your visibility with no upside. The nuanced approach is blocking training while allowing search.

Malcolm Michaels is the founder of HeyTC and creator of GetCited. This post was developed with AI assistance: human direction, AI drafting, human editing.