By Nikhil Kumar, Founder of LandKit. Last updated May 2026.
You can already see AI crawlers in your access logs. The question is whether the bot calling itself GPTBot is actually GPTBot, what each one is doing with your pages, and which to wave through.
This 2026 AI crawlers list covers the 15 user agents that account for nearly every AI crawl request hitting public websites today. For each bot you get the exact user agent string, what the bot does (training, inference, or search indexing), how to verify it in server logs, and a recommendation on whether to allow or block it based on your business model. Cloudflare's Q1 2026 network data shows AI crawlers now generate 22% of all bot traffic, so getting this list right has real consequences for your traffic and your training-data exposure.
What is the difference between a training crawler, an inference fetcher, and an AI search bot?
Training crawlers download pages in bulk to feed model training datasets and send back roughly zero referral traffic. Inference fetchers pull a specific URL only when a user types a prompt that needs it, behaving more like a browser than a spider. AI search bots index content for citation inside ChatGPT Search, Claude, Perplexity, and Gemini answers, and these are the only AI bots that send meaningful traffic back to your site.
The categories matter because the right block decision depends on which one you are looking at.
A training crawler has no upside for you unless you want your content in the model. An AI search bot is the closest thing to Googlebot in this stack and blocking it kills your AI citation surface.
According to Cloudflare's crawl-to-click gap analysis published in 2025, dedicated AI training crawlers generated 49.9% of all AI bot traffic by Q1 2026, all but reaching the 50% milestone a full quarter ahead of the previous forecast.
The crawl-to-refer ratio numbers tell the story even more brutally. Anthropic's ClaudeBot crawls 20,583 pages for every one referral it sends back, per Cloudflare's Q1 2026 Radar publisher analysis. OpenAI sits at 1,255:1. Meta sends zero.
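To compute the same ratio for your own site, a minimal sketch follows. It assumes you already have two counts for a given bot over the same period: crawl requests from your access logs and referral visits from your analytics. The function name and the sample counts are hypothetical.

```python
def crawl_to_refer_ratio(crawl_requests: int, referral_visits: int) -> float:
    """Return pages crawled per referral visit sent back.

    Returns infinity when the bot crawls but refers nothing,
    and 0.0 when there was no crawl activity at all.
    """
    if crawl_requests < 0 or referral_visits < 0:
        raise ValueError("counts must be non-negative")
    if referral_visits == 0:
        return float("inf") if crawl_requests else 0.0
    return crawl_requests / referral_visits


# Hypothetical monthly counts for one bot on one site.
print(crawl_to_refer_ratio(41_166, 2))  # 20583.0
```

A ratio in the thousands means the bot takes far more than it returns; an infinite ratio is the "sends zero" case.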
If you want a quick pre-flight check on your own robots.txt before you start tuning anything, run it through the LandKit robots.txt validator so you are tuning a clean file, not chasing ghost rules.
Which AI crawlers should I allow vs block in 2026?
Allow every bot tied to AI search and inference (OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot, PerplexityBot, Google standard crawlers, Applebot) because these are how AI engines find and cite your content for users actively asking questions. Block training-only crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Bytespider, Meta-ExternalAgent, CCBot) if you sell content, run a niche moat, or otherwise lose by feeding the model. The mix is rarely all-or-nothing.
The framework I run for every LandKit user comes down to three questions.
Does your business depend on being cited inside AI answers? If yes, allow the search and inference bots without exception.
Is your content your moat (paywall, training data, originality)? If yes, block the training crawlers.
Are you a generic top-of-funnel content site? If yes, the most defensible default in 2026 is allow search bots, block training bots, and revisit quarterly.
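The three questions above can be sketched as a small decision function. This is an illustration of the framework, not a LandKit API: the `SiteProfile` fields, the `recommend_policy` name, and the "undecided" fallback for sites that answer no to all three are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteProfile:
    depends_on_ai_citations: bool  # question 1
    content_is_moat: bool          # question 2
    generic_top_of_funnel: bool    # question 3


def recommend_policy(profile: SiteProfile) -> dict[str, str]:
    """Map the three answers to a stance per bot category."""
    policy = {"search_and_inference": "undecided", "training": "undecided"}
    if profile.generic_top_of_funnel:
        # Default for generic content sites: allow search, block training.
        policy = {"search_and_inference": "allow", "training": "block"}
    if profile.depends_on_ai_citations:
        policy["search_and_inference"] = "allow"
    if profile.content_is_moat:
        policy["training"] = "block"
    return policy


print(recommend_policy(SiteProfile(True, True, False)))
```

Whatever the function returns, the quarterly revisit still applies, because the answers to the three questions change as the business does.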
Cloudflare's Q1 2026 robots.txt analysis found that GPTBot is the most blocked AI crawler on the web, followed by CCBot, ClaudeBot, and Google-Extended. The bots people actually want to block tend to be the training crawlers, which is the right instinct.
The complete 2026 AI crawlers list
Below is the working directory I keep refreshed for LandKit users. Every user agent is verified against the operator's official documentation as of May 2026. For each row, the "Purpose" column maps the bot to one of three jobs: training, inference, or search.
| User agent | Operator | Purpose | Honors robots.txt | Recommended action |
|---|---|---|---|---|
| GPTBot | OpenAI | Training | Yes | Block if you protect content; allow if you want training inclusion |
| OAI-SearchBot | OpenAI | Search indexing for ChatGPT Search | Yes | Allow |
| ChatGPT-User | OpenAI | User-triggered fetch | Yes | Allow |
| ClaudeBot | Anthropic | Training | Yes | Block if you protect content |
| Claude-SearchBot | Anthropic | Search indexing for Claude | Yes | Allow |
| Claude-User | Anthropic | User-triggered fetch | Yes | Allow |
| Google-Extended | Google | Gemini training token (no separate crawl) | Yes | Block if you protect content |
| Googlebot | Google | Web search index | Yes | Allow (this is your search traffic) |
| PerplexityBot | Perplexity | Indexing for Perplexity answers | Sometimes | Allow with monitoring |
| Perplexity-User | Perplexity | User-triggered fetch | Sometimes | Allow with monitoring |
| Applebot | Apple | Siri, Spotlight, Apple Intelligence search | Yes | Allow |
| Applebot-Extended | Apple | Apple Intelligence training token | Yes | Block if you protect content |
| Bytespider | ByteDance | Training for Doubao plus TikTok features | Inconsistent | Block in most cases |
| Meta-ExternalAgent | Meta | Training plus product indexing | Yes | Block if you protect content |
| CCBot | Common Crawl | Public web archive used by most LLM trainers | Yes | Block if you protect content |
OpenAI's official bots documentation is the canonical reference for the three OpenAI agents. Anthropic's crawler help article, updated February 20, 2026, covers all three Claude agents.
Once you have decided which to allow and block, the LandKit AI crawler robots generator will produce the exact robots.txt block in one paste.
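If you would rather see what such a file looks like, here is a minimal sketch that builds a robots.txt from the block/allow split in the table and checks it with Python's standard-library parser. The two lists reflect the "protect content" column; the helper names and the example URL are hypothetical. Python's parser is only an approximation of how each operator's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Agent names come from the table above; adjust both lists to your own policy.
BLOCK = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
         "Bytespider", "Meta-ExternalAgent", "CCBot"]
ALLOW = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "Claude-User",
         "PerplexityBot", "Perplexity-User", "Googlebot", "Applebot"]


def build_robots_txt(block: list[str], allow: list[str]) -> str:
    """Emit one group per agent, with blocked groups first."""
    lines: list[str] = []
    for agent in block:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allow:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    return "\n".join(lines)


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Ask the stdlib parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return bool(parser.can_fetch(agent, url))


robots_txt = build_robots_txt(BLOCK, ALLOW)
print(robots_txt)
```

Blocked groups are written first on purpose: the standard-library parser matches agent names by substring and stops at the first matching group, so Applebot-Extended has to appear before Applebot for the check to behave as intended.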
What is the GPTBot user agent string and what does it actually do?
GPTBot is OpenAI's training crawler. The user agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot and the bot's job is to download web content that may be used to train future GPT models. It honors robots.txt and OpenAI says changes to your file propagate within roughly 24 hours. GPTBot is the most blocked AI crawler on the web in 2026, per Cloudflare's network-wide analysis.
This is the bot publishers care about most.
GPTBot's share of all crawler traffic grew from 2.2% to 7.7% between May 2024 and May 2025, and Cloudflare Radar reported a 305% increase in its request volume over the same period. Measured against AI crawler traffic alone, which is a different denominator, its share stood at 9.84% by April 2026, its second consecutive month of decline as Applebot picked up share.
About 25% of the top 1,000 websites now block GPTBot in robots.txt, up from 5% in 2023, the year the crawler launched, according to coverage of the same Cloudflare dataset reported by Search Engine Journal in 2025. That is one of the steepest opt-out adoption curves the robots.txt standard has seen.
Blocking GPTBot has zero impact on your Google Search rankings. It also has zero impact on whether ChatGPT can cite you in live answers, because that surface uses OAI-SearchBot, which is a different agent.
What is the ClaudeBot user agent and how is it different from Claude-User?
ClaudeBot is Anthropic's training crawler. Its user agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com). Claude-User is the user-triggered fetcher, called only when a Claude conversation pulls a specific URL. Claude-SearchBot is the indexing agent for Claude's search-style answers. The three bots can be controlled independently in robots.txt and they serve different jobs.
This separation is the cleanest in the AI industry right now.
If you block ClaudeBot, you stop new content from feeding Claude training. You do not stop Claude from citing your existing pages when a user asks about your topic.
If you block Claude-SearchBot, you do stop the citations. That is rarely what site owners want.
The framing I give LandKit operators: block ClaudeBot if your content is your moat, allow the other two so Claude users can still find and reference you. Anthropic's bots.json IP list lets you verify by IP rather than just user agent, which matters because user agent strings are trivially spoofed.
How do I recognize AI bots in server logs and tell real bots from fakes?
Look for the official user agent strings (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Claude-SearchBot, Applebot, Bytespider, Meta-ExternalAgent) in your access logs, then verify each hit with reverse DNS plus forward confirmation. A 2025 study cited by Arcjet found that 5.7% of traffic claiming to be a well-known AI crawler is fake, and the ChatGPT user agent in particular shows a 16.7% spoof rate. User agent string alone is not proof.
The real test is reverse DNS.
Run a reverse DNS lookup on the source IP. Legitimate GPTBot traffic resolves to openai.com. Then run a forward DNS lookup on that hostname and confirm the IP matches.
Anthropic, OpenAI, Google, and Apple all publish official IP ranges so you can verify without DNS calls. Arcjet's user agent identification guide walks through the full HTTP-signature approach for the more careful checks.
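Checking a source IP against published ranges needs only the standard `ipaddress` module. In this sketch the CIDR blocks are placeholders from reserved documentation address space, not real operator ranges; in practice you would load the current list from the operator's published file.

```python
import ipaddress


def ip_in_ranges(ip: str, cidrs: list[str]) -> bool:
    """Return True if `ip` falls inside any of the CIDR blocks."""
    address = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to pass.
    return any(address in ipaddress.ip_network(cidr, strict=False)
               for cidr in cidrs)


# Placeholder ranges only; replace with the operator's published list.
PUBLISHED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]

print(ip_in_ranges("192.0.2.44", PUBLISHED_RANGES))
print(ip_in_ranges("198.51.100.9", PUBLISHED_RANGES))
```

Because published ranges change, refresh the list on a schedule rather than hard-coding it.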
If you see a "GPTBot" user agent coming from a residential ASN with no openai.com reverse DNS, that is a scraper hiding behind OpenAI's name. Block it on the IP, not just the user agent.
Which AI crawlers ignore robots.txt and what do you do about them?
Perplexity is the highest-profile case. In August 2025, Cloudflare published evidence that Perplexity was using stealth, undeclared crawlers to evade robots.txt and IP blocks, switching to a Chrome-on-macOS impersonator user agent and rotating ASNs. Bytespider has a long history of inconsistent robots.txt compliance. The defensive answer is not robots.txt at all; it is WAF rules, ASN-level blocks, or a managed AI crawler firewall.
Cloudflare's August 4, 2025 stealth crawler report documented 3-6 million daily stealth requests across tens of thousands of domains, on top of the 20-25 million daily declared requests.
Cloudflare de-listed Perplexity as a verified bot in response and rolled new detection heuristics into its managed AI rules.
The lesson for the rest of us is that robots.txt is a request, not a fence. If a crawler decides to ignore it, you need an actual control plane: WAF rules, rate limiting, ASN blocks, or a dedicated AI crawler firewall like Cloudflare AI Crawl Control.
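To make "rate limiting" concrete, here is a minimal in-process sketch of a sliding-window limiter keyed by client IP. It is an illustration of the idea only: the class name and thresholds are hypothetical, and a real deployment would normally enforce this at the WAF or CDN edge rather than in application code.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per key within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        """Record a request at time `now` and report whether it is allowed."""
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Hypothetical budget: 2 requests per 60 seconds per IP.
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
for timestamp in (0.0, 1.0, 2.0, 60.5):
    print(timestamp, limiter.allow("203.0.113.7", timestamp))
```

Passing the timestamp in explicitly keeps the limiter easy to test; in a live handler you would pass `time.monotonic()`.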
If your robots.txt is the only thing you have, run it through the robots.txt validator so you at least know it is parseable. Then layer real controls on top.
Does Google-Extended actually crawl pages or is it just a token?
Google-Extended does not crawl your site. It is a robots.txt control token that Google uses to decide whether content already crawled by Googlebot can be passed into Gemini and Vertex AI training. Disallowing Google-Extended opts your content out of Google's generative AI training without affecting Search rankings or AI Overview eligibility for already-indexed pages. The same pattern applies to Applebot-Extended.
This is one of the most misunderstood entries on the AI crawlers list.
People expect Google-Extended to show up in their server logs as a separate hit. It does not.
Googlebot does the crawling. Google-Extended is a routing flag. When you disallow Google-Extended in robots.txt, you are telling Google not to use already-fetched content for AI training. You are not stopping any HTTP request from happening.
Google announced the token on September 28, 2023, per the original TechCrunch coverage. Apple introduced Applebot-Extended with the same architecture roughly a year later. Both are pure opt-out signals.
That detail matters when you are auditing logs and thinking you have a missing user agent. You do not. There is nothing to audit.
How does blocking AI crawlers affect SEO and AI citations?
Blocking training crawlers (GPTBot, ClaudeBot, Google-Extended, Bytespider) has no effect on Google Search rankings, no effect on Bing rankings, and no effect on whether ChatGPT, Claude, or Perplexity can cite your already-existing content in live answers. Blocking AI search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) is the one move that does kill AI citations. Most operators conflate the two and lose AI visibility while trying to protect training data.
The split matters more than any other single decision in this guide.
Per OpenAI's publisher FAQ, blocking GPTBot does not affect ChatGPT search citations because those use OAI-SearchBot.
Anthropic confirmed the same separation when it updated its crawler documentation in February 2026.
The right pattern for almost every site I audit through the LandKit free SEO audit is: block training, allow search, allow inference. That defends the content asset and keeps you visible across the AI surfaces buyers actually use.
If you also have an llms.txt file telling AI crawlers what content you actually want surfaced, run it through the LandKit llms.txt validator before you ship it.
Frequently asked questions
What is the GPTBot user agent string?
The exact GPTBot user agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot. OpenAI publishes this and two others (OAI-SearchBot, ChatGPT-User) in its official bots documentation. To verify a request actually came from OpenAI, run reverse DNS on the source IP and confirm it resolves to openai.com, then forward-resolve to confirm the IP matches.
Should I block ClaudeBot if I want Claude to cite my site?
No, blocking ClaudeBot does not stop Claude from citing you. ClaudeBot is the training crawler. Citations in Claude's live answers come from Claude-SearchBot and Claude-User, which are independent agents. The standard pattern is to disallow ClaudeBot in robots.txt while allowing Claude-SearchBot and Claude-User. Anthropic confirmed this separation in its February 2026 documentation update.
What is the difference between Google-Extended and Googlebot?
Googlebot is Google's web search crawler and does the actual page fetching. Google-Extended is a robots.txt-only token that controls whether content already fetched by Googlebot can be used for Gemini and Vertex AI training. Google-Extended does not appear as its own user agent in server logs because it never makes a separate HTTP request. Disallowing it has zero impact on Google Search.
Is PerplexityBot safe to block?
PerplexityBot is the declared user agent and it does honor robots.txt when it announces itself. The problem is that, as Cloudflare documented in August 2025, Perplexity also uses undeclared stealth crawlers impersonating Chrome on macOS, generating 3-6 million daily requests across tens of thousands of domains. If you decide to block Perplexity, layer ASN blocks and WAF rules on top of robots.txt because robots.txt alone will not hold.
How do I verify a bot is really GPTBot and not a fake?
Run a reverse DNS lookup on the source IP address. Legitimate GPTBot traffic resolves to openai.com. Then run a forward DNS lookup on that hostname and confirm the IP matches the original request IP. A 2025 Arcjet study found 5.7% of AI crawler traffic is fake, with the ChatGPT user agent showing a 16.7% spoof rate, so user agent string alone is not enough.
Will blocking AI crawlers hurt my Google rankings?
Blocking GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Bytespider, or CCBot has no effect on Google Search rankings, Bing rankings, or AI Overview eligibility. These agents are either training-only crawlers or training opt-out tokens. Google has confirmed Google-Extended is exclusive to Gemini Apps and has no impact on Google Search. Blocking AI search bots like OAI-SearchBot or Claude-SearchBot does affect AI citations, which is a different question.
Pick the three to block this week
Open your robots.txt today. Add disallow rules for GPTBot, ClaudeBot, and CCBot if you protect your content, then verify the file parses cleanly, then watch your access logs for one week to confirm those user agents stop showing up. Layer Cloudflare AI Crawl Control or an equivalent firewall on top if Perplexity or Bytespider keep hitting after the block. Revisit the list every quarter, because this directory shifts faster than any other corner of SEO right now.
Nikhil Kumar is the founder of LandKit, the SEO and AI visibility growth OS used by solo founders and lean SaaS teams to track brand mentions across ChatGPT, Claude, Gemini, and Perplexity. He writes about AI search, technical SEO, and what actually moves traffic for under-resourced teams.