LandKit

Free Reference

AI Crawler User-Agent Reference

The complete directory of AI crawlers and bots. User-agents, robots.txt rules, documentation links, and best practices for every major AI training, search, and browsing crawler on the open web.

23 bots tracked · 17 companies · Last updated May 2026

AI crawler directory

GPTBot

Training · OpenAI

OpenAI's primary training crawler. GPTBot fetches publicly available web pages so they can be used to train future versions of GPT and related foundation models. Blocking GPTBot prevents your content from being added to OpenAI training datasets going forward, but does not remove anything already ingested.

Allow GPTBot
User-agent: GPTBot
Allow: /
Block GPTBot
User-agent: GPTBot
Disallow: /
Note: OpenAI publishes GPTBot's IP ranges in a JSON file (gptbot.json). GPTBot reads and obeys robots.txt rules that target it specifically.

ChatGPT-User

Browsing · OpenAI

On-demand fetcher used when a ChatGPT user pastes a URL or asks ChatGPT to read a specific page. ChatGPT-User does not crawl proactively. Each request is triggered by a real user action inside ChatGPT or a custom GPT.

Allow ChatGPT-User
User-agent: ChatGPT-User
Allow: /
Block ChatGPT-User
User-agent: ChatGPT-User
Disallow: /
Note: Blocking ChatGPT-User stops users from pulling your pages into a ChatGPT conversation. Different from GPTBot (training) and OAI-SearchBot (search index).

OAI-SearchBot

Search · OpenAI

Powers ChatGPT Search results and the search-grounded answers shown in ChatGPT. OAI-SearchBot is the indexing crawler for OpenAI's search experience. It is intentionally separated from GPTBot so publishers can opt in to ChatGPT Search visibility while still blocking model training.

Allow OAI-SearchBot
User-agent: OAI-SearchBot
Allow: /
Block OAI-SearchBot
User-agent: OAI-SearchBot
Disallow: /
Note: Blocking GPTBot does NOT block OAI-SearchBot. They are independent. If you want to appear in ChatGPT Search you must allow OAI-SearchBot.

ClaudeBot

Training · Anthropic

Anthropic's consolidated training crawler for Claude foundation models. ClaudeBot replaced and unified earlier Anthropic crawlers. It fetches public web content for use in pre-training and fine-tuning Claude.

Allow ClaudeBot
User-agent: ClaudeBot
Allow: /
Block ClaudeBot
User-agent: ClaudeBot
Disallow: /
Note: ClaudeBot is the user-agent to target in robots.txt to opt out of Claude training. Anthropic publishes its IP ranges in a public JSON file.

anthropic-ai

Training · Anthropic

Legacy Anthropic crawler user-agent that predates ClaudeBot. Many published robots.txt files still target this token, and some Anthropic systems may still send it for compatibility. Best practice is to list both anthropic-ai and ClaudeBot.

Allow anthropic-ai
User-agent: anthropic-ai
Allow: /
Block anthropic-ai
User-agent: anthropic-ai
Disallow: /
Note: Considered superseded by ClaudeBot. List both for full coverage.

claude-web

Browsing · Anthropic

On-demand fetcher used when Claude needs to read a specific URL provided by a user during a conversation. Like ChatGPT-User, it is user-initiated rather than a continuous crawler.

Allow claude-web
User-agent: claude-web
Allow: /
Block claude-web
User-agent: claude-web
Disallow: /
Note: Blocking claude-web stops Claude users from fetching your pages in real time. Independent from training (ClaudeBot).

Google-Extended

Training · Google

A robots.txt-only token, not an actual user-agent string. Google-Extended controls whether your content can be used to train Gemini, Vertex AI generative APIs, and future Google AI models. Disallowing Google-Extended does not affect Google Search ranking, indexing, or appearance in regular search results.

Allow Google-Extended
User-agent: Google-Extended
Allow: /
Block Google-Extended
User-agent: Google-Extended
Disallow: /
Note: IMPORTANT: Google-Extended does not appear as a user-agent header. It exists only as a robots.txt control. Blocking it has zero effect on Googlebot or Google Search visibility.

PerplexityBot

Search · Perplexity

Perplexity's indexing crawler. PerplexityBot crawls the web to build the index that powers Perplexity AI search and answer cards. Perplexity cites sources in its answers, so being indexed can drive referral traffic.

Allow PerplexityBot
User-agent: PerplexityBot
Allow: /
Block PerplexityBot
User-agent: PerplexityBot
Disallow: /
Note: Perplexity has faced public criticism over whether it honors robots.txt. Perplexity states that PerplexityBot now respects robots.txt, and it publishes IP ranges for verification.

Perplexity-User

Browsing · Perplexity

On-demand fetcher used when a Perplexity user requests a specific URL or asks Perplexity to read a page. Like other Browsing-class agents, it is user-initiated rather than a continuous index crawler.

Allow Perplexity-User
User-agent: Perplexity-User
Allow: /
Block Perplexity-User
User-agent: Perplexity-User
Disallow: /
Note: Per Perplexity's guidance, Perplexity-User is not bound to robots.txt rules in the way a crawler is, because each fetch is user-initiated, similar to a browser request. A robots.txt block may therefore not be sufficient on its own.

Applebot-Extended

Training · Apple

Controls whether Apple can use your content to train Apple Intelligence and Apple foundation models. Applebot-Extended is the AI-training opt-out, separate from regular Applebot which powers Siri and Spotlight Search.

Allow Applebot-Extended
User-agent: Applebot-Extended
Allow: /
Block Applebot-Extended
User-agent: Applebot-Extended
Disallow: /
Note: Like Google-Extended, this is a robots.txt-only token. Blocking Applebot-Extended does not remove your site from Siri or Spotlight results.

Meta-ExternalAgent

Training · Meta

Meta's primary AI training crawler. Used to gather public web data to train Llama and other Meta AI models. Meta-ExternalAgent is the modern user-agent for Meta's AI training pipeline.

Allow Meta-ExternalAgent
User-agent: Meta-ExternalAgent
Allow: /
Block Meta-ExternalAgent
User-agent: Meta-ExternalAgent
Disallow: /
Note: Different from FacebookExternalHit, which is the link-preview fetcher used when someone shares a URL on Facebook or Instagram.

FacebookBot

Training · Meta

Meta crawler used historically for training conversational AI products. Listed alongside Meta-ExternalAgent for full coverage of Meta-owned AI training agents.

Allow FacebookBot
User-agent: FacebookBot
Allow: /
Block FacebookBot
User-agent: FacebookBot
Disallow: /

Bytespider

Training · ByteDance

ByteDance's training crawler, used to gather data for Doubao (ByteDance's AI assistant) and other ByteDance AI products, including TikTok features. Bytespider has been one of the most aggressive AI crawlers measured by request volume on independent publisher logs.

Allow Bytespider
User-agent: Bytespider
Allow: /
Block Bytespider
User-agent: Bytespider
Disallow: /
Note: Mixed reports on robots.txt compliance. Many publishers block Bytespider at the firewall or CDN level due to aggressive crawl rates.

CCBot

Training · Common Crawl

CCBot crawls the web for the Common Crawl Foundation, which publishes a free, openly downloadable dataset of the web. Common Crawl data has been used to train models at OpenAI, Anthropic, Google, Meta, and most other major AI labs. Blocking CCBot is one of the highest-leverage actions a publisher can take to opt out of AI training.

Allow CCBot
User-agent: CCBot
Allow: /
Block CCBot
User-agent: CCBot
Disallow: /
Note: Blocking CCBot prevents your content appearing in future Common Crawl snapshots, but does not retroactively remove it from existing snapshots already published.

Diffbot

Training · Diffbot

Diffbot operates a structured-data crawler that powers a knowledge graph used by enterprise AI systems and commercial LLM training pipelines. Diffbot turns unstructured web pages into structured entities (people, products, articles).

Allow Diffbot
User-agent: Diffbot
Allow: /
Block Diffbot
User-agent: Diffbot
Disallow: /
Note: Often used as an enterprise data source. Blocking removes your content from the Diffbot Knowledge Graph.

YouBot

Search · You.com

You.com's indexing crawler for its AI-powered search engine. YouBot builds the index that backs You.com's answer experience and the YouChat assistant.

Allow YouBot
User-agent: YouBot
Allow: /
Block YouBot
User-agent: YouBot
Disallow: /

omgili

Training · webz.io

omgili (now operated by webz.io) crawls news, forums, blogs, and reviews. The resulting datasets are sold to AI labs, enterprise NLP teams, and threat-intelligence platforms. A common but often unrecognized AI training source.

Allow omgili
User-agent: omgili
Allow: /
Block omgili
User-agent: omgili
Disallow: /
Note: Worth blocking if you want to opt out of commercial AI training datasets sold to third parties.

ImagesiftBot

Image · TheHive.ai

Image-focused crawler operated by TheHive.ai. Used to gather images for visual AI training datasets, image recognition models, and reverse-image search products.

Allow ImagesiftBot
User-agent: ImagesiftBot
Allow: /
Block ImagesiftBot
User-agent: ImagesiftBot
Disallow: /
Note: Particularly relevant if you publish original photography, illustration, or visual editorial content.

cohere-ai

Training · Cohere

Cohere's training crawler. Cohere builds enterprise-focused large language models, retrieval systems, and embedding APIs. cohere-ai gathers public web content used in Cohere's training pipeline.

Allow cohere-ai
User-agent: cohere-ai
Allow: /
Block cohere-ai
User-agent: cohere-ai
Disallow: /

Amazonbot

Training · Amazon

Amazon's general-purpose web crawler. Amazonbot powers Alexa answers, Amazon's search systems, and contributes to training Amazon-owned AI models including Titan and Nova.

Allow Amazonbot
User-agent: Amazonbot
Allow: /
Block Amazonbot
User-agent: Amazonbot
Disallow: /
Note: Amazon publishes IP ranges and a verification process so publishers can confirm requests genuinely come from Amazonbot.

MistralAI-User

Browsing · Mistral

On-demand fetcher used by Mistral chat products (Le Chat) when a user asks Mistral to read a specific URL. User-initiated rather than a continuous crawler.

Allow MistralAI-User
User-agent: MistralAI-User
Allow: /
Block MistralAI-User
User-agent: MistralAI-User
Disallow: /

xAI-Crawler

Training · xAI

xAI's crawler used to gather web content for training Grok and successor models. xAI publishes user-agent and IP-range information for verification.

Allow xAI-Crawler
User-agent: xAI-Crawler
Allow: /
Block xAI-Crawler
User-agent: xAI-Crawler
Disallow: /

DuckAssistBot

Search · DuckDuckGo

DuckDuckGo's AI-answer crawler, used to power the AI-generated summaries shown above DuckDuckGo search results. Distinct from DuckDuckBot, which is the regular search crawler.

Allow DuckAssistBot
User-agent: DuckAssistBot
Allow: /
Block DuckAssistBot
User-agent: DuckAssistBot
Disallow: /

How AI crawlers differ from search engine bots

Traditional search engine bots like Googlebot and Bingbot exist to build a search index. They fetch pages, follow links, store the content, and rank pages so they appear in search results. The deal is straightforward: you let the bot in, you get visibility and referral traffic.

AI crawlers do something different. Training crawlers like GPTBot, ClaudeBot, and Bytespider download your content to train large language models. The model learns from your writing and then generates answers without sending users back to your site. There is no referral traffic in that exchange: you contribute the data, and the model captures the value.

A second class of AI crawler powers AI answer engines. OAI-SearchBot, PerplexityBot, YouBot, and DuckAssistBot index the web to ground AI-generated answers and provide citations. These behave more like search engines: they index, they cite, and clicks on those citations send referral traffic. Allowing them is closer to allowing a regular search crawler.

A third class handles on-demand fetches. ChatGPT-User, claude-web, Perplexity-User, and MistralAI-User are not continuous crawlers. They fetch a single page when a real user pastes a URL into a chat. Treating these as crawlers is technically wrong: each request is a user action.

Should I allow or block AI crawlers?

There is no universally correct answer. The decision depends on your business model, your traffic sources, and your view on AI training. Use the framework below to make a deliberate choice rather than defaulting to either extreme.

Allow AI Search and Browsing crawlers if

  • You want visibility in ChatGPT Search, Perplexity, You.com, and DuckAssist answers
  • You depend on referral traffic and can monetize visits
  • Your content is editorial, news, or how-to material that benefits from broad distribution
  • You publish product information that you want surfaced in AI answers

Block AI Training crawlers if

  • Your business model depends on selling original IP, courses, or proprietary research
  • Your content is licensed or paywalled and being trained on would damage that license
  • You publish original creative work (fiction, photography, illustration) and want to keep it out of training corpora
  • You believe AI training without compensation devalues your work

Hybrid approach (most common)

  • Allow OAI-SearchBot, PerplexityBot, YouBot, DuckAssistBot for AI search visibility
  • Block GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Bytespider, CCBot for training opt-out
  • Allow ChatGPT-User, claude-web, Perplexity-User, MistralAI-User for user-initiated fetches
  • Use a CDN or WAF to enforce blocks against bots that ignore robots.txt
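
The hybrid bullets above can be turned into a robots.txt file mechanically. Here is a minimal Python sketch, assuming the allow and block lists are exactly the tokens named in those bullets. The helper name build_robots_txt is hypothetical and not part of any LandKit tool.

```python
# Hypothetical helper: the two lists mirror the hybrid bullets above.
ALLOW = ["OAI-SearchBot", "PerplexityBot", "YouBot", "DuckAssistBot",
         "ChatGPT-User", "claude-web", "Perplexity-User", "MistralAI-User"]
BLOCK = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
         "Bytespider", "CCBot"]


def build_robots_txt(allow, block):
    """Return robots.txt text with one record per user-agent token."""
    records = []
    for token in allow:
        records.append(f"User-agent: {token}\nAllow: /")
    for token in block:
        records.append(f"User-agent: {token}\nDisallow: /")
    # Records are separated by a blank line, as in the snippets on this page.
    return "\n\n".join(records) + "\n"


hybrid_robots_txt = build_robots_txt(ALLOW, BLOCK)
print(hybrid_robots_txt)
```

The last bullet still applies: this file is only a request, so bots that ignore robots.txt need an edge-level block as well.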

Common confusions and gotchas

GPTBot vs ChatGPT-User vs OAI-SearchBot

These are three independent OpenAI crawlers. GPTBot trains models. ChatGPT-User fetches on demand when a user pastes a URL. OAI-SearchBot indexes the web for ChatGPT Search. Blocking GPTBot does not block ChatGPT-User or OAI-SearchBot. You can opt out of training while still appearing in ChatGPT Search.
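
One way to sanity-check this independence is to feed a robots.txt to Python's standard-library parser and ask it about each token. This is a sketch only: urllib.robotparser applies its own matching rules, which may differ in edge cases from how OpenAI's crawlers interpret the file, and the example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Block training, explicitly allow the search indexer.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/pricing"):
    """Map each user-agent token to whether the parser lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(ROBOTS_TXT, ["GPTBot", "ChatGPT-User", "OAI-SearchBot"])
print(result)
```

Only GPTBot comes back as disallowed. ChatGPT-User has no record of its own and the file has no wildcard record, so the parser treats it as allowed.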

Google-Extended is not Googlebot

Google-Extended controls AI training only. It does not affect Google Search ranking, indexing, or visibility. Disallowing Google-Extended will not deindex your site. The same logic applies to Applebot-Extended versus regular Applebot.

ClaudeBot replaced anthropic-ai (mostly)

Anthropic consolidated its training activity under ClaudeBot. Older robots.txt files target anthropic-ai. List both in robots.txt for full coverage.
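
Listing both tokens does not require two separate records. The Robots Exclusion Protocol lets several User-agent lines share one rule group. The sketch below checks such a group with Python's urllib.robotparser, which is an assumption about tooling rather than a statement of how Anthropic parses the file.

```python
from urllib.robotparser import RobotFileParser

# One group, two user-agent tokens, one shared rule.
ANTHROPIC_OPT_OUT = """\
User-agent: ClaudeBot
User-agent: anthropic-ai
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ANTHROPIC_OPT_OUT.splitlines())

# True means the parser considers the agent blocked from the site root.
blocked = {
    agent: not parser.can_fetch(agent, "https://example.com/")
    for agent in ("ClaudeBot", "anthropic-ai", "claude-web")
}
print(blocked)
```

Both training tokens are blocked, while claude-web is untouched because it is not named in the group.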

CCBot is the high-leverage block

Common Crawl data is consumed by almost every major AI lab. Blocking CCBot has more compounding effect than blocking any single lab's training crawler, because the same crawl feeds many models.

robots.txt is honor-based

Robots.txt is a request, not enforcement. Major commercial AI labs publicly commit to obeying it. Bytespider and a long tail of smaller scrapers do not always comply. The only enforceable block is at the edge: WAF, Cloudflare bot management, IP blocks, or rate limiting.

Configure all of these in one click

Use the AI Crawler Robots.txt Generator to allow, block, or hybrid-configure every crawler on this page in seconds.

Deep dive

The complete 2026 AI crawlers list: every user agent, what it does, and what to do about it

By Nikhil Kumar, Founder of LandKit. Last updated May 2026.

You can already see them in your access logs. The question is whether the bot calling itself GPTBot is actually GPTBot, what each one is doing with your pages, and which to wave through.

This 2026 AI crawlers list covers the 15 user agents that account for nearly every AI crawl request hitting public websites today. For each bot you get the exact user agent string, what the bot does (training, inference, search indexing), how to verify it in server logs, and a recommendation on whether to allow or block it based on your business model. Cloudflare's Q1 2026 network data shows AI crawlers now generate 22% of all bot traffic, so getting this list right has real consequences for your traffic and your training-data exposure.

What is the difference between a training crawler, an inference fetcher, and an AI search bot?

Training crawlers download pages in bulk to feed model training datasets and send back roughly zero referral traffic. Inference fetchers pull a specific URL only when a user types a prompt that needs it, behaving more like a browser than a spider. AI search bots index content for citation inside ChatGPT Search, Claude, Perplexity, and Gemini answers, and these are the only AI bots that send meaningful traffic back to your site.

The categories matter because the right block decision depends on which one you are looking at.

A training crawler has no upside for you unless you want your content in the model. An AI search bot is the closest thing to Googlebot in this stack and blocking it kills your AI citation surface.

According to Cloudflare's crawl-to-click gap analysis published in 2025, dedicated AI training crawlers generated 49.9% of all AI bot traffic by Q1 2026, hitting the 50% milestone a full quarter ahead of the previous forecast.

The crawl-to-refer ratio numbers tell the story even more brutally. Anthropic's ClaudeBot crawls 20,583 pages for every one referral it sends back, per Cloudflare's Q1 2026 Radar publisher analysis. OpenAI sits at 1,255:1. Meta sends zero.

If you want a quick pre-flight check on your own robots.txt before you start tuning anything, run it through the LandKit robots.txt validator so you are tuning a clean file, not chasing ghost rules.

Which AI crawlers should I allow vs block in 2026?

Allow every bot tied to AI search and inference (OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot, PerplexityBot, Google standard crawlers, Applebot) because these are how AI engines find and cite your content for users actively asking questions. Block training-only crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Bytespider, Meta-ExternalAgent, CCBot) if you sell content, run a niche moat, or otherwise lose by feeding the model. The mix is rarely all-or-nothing.

The framework I run for every LandKit user comes down to three questions.

Does your business depend on being cited inside AI answers? If yes, allow the search and inference bots without exception.

Is your content your moat (paywall, training data, originality)? If yes, block the training crawlers.

Are you a generic top-of-funnel content site? If yes, the most defensible default in 2026 is allow search bots, block training bots, and revisit quarterly.

Cloudflare's Q1 2026 robots.txt analysis found that GPTBot is the most blocked AI crawler on the web, followed by CCBot, ClaudeBot, and Google-Extended. The bots people actually want to block tend to be the training crawlers, which is the right instinct.

The complete 2026 AI crawlers list

Below is the working directory I keep refreshed for LandKit users. Every user agent is verified against the operator's official documentation as of May 2026. For each row, the "Purpose" column maps the bot to one of three jobs: training, inference, or search.

User agent | Operator | Purpose | Honors robots.txt | Recommended action
GPTBot | OpenAI | Training | Yes | Block if you protect content; allow if you want training inclusion
OAI-SearchBot | OpenAI | Search indexing for ChatGPT Search | Yes | Allow
ChatGPT-User | OpenAI | User-triggered fetch | Yes | Allow
ClaudeBot | Anthropic | Training | Yes | Block if you protect content
Claude-SearchBot | Anthropic | Search indexing for Claude | Yes | Allow
Claude-User | Anthropic | User-triggered fetch | Yes | Allow
Google-Extended | Google | Gemini training token (no separate crawl) | Yes | Block if you protect content
Googlebot | Google | Web search index | Yes | Allow (this is your search traffic)
PerplexityBot | Perplexity | Indexing for Perplexity answers | Sometimes | Allow with monitoring
Perplexity-User | Perplexity | User-triggered fetch | Sometimes | Allow with monitoring
Applebot | Apple | Siri, Spotlight, Apple Intelligence search | Yes | Allow
Applebot-Extended | Apple | Apple Intelligence training token | Yes | Block if you protect content
Bytespider | ByteDance | Training for Doubao plus TikTok features | Inconsistent | Block in most cases
Meta-ExternalAgent | Meta | Training plus product indexing | Yes | Block if you protect content
CCBot | Common Crawl | Public web archive used by most LLM trainers | Yes | Block if you protect content

OpenAI's official bots documentation is the canonical reference for the three OpenAI agents. Anthropic's crawler help article, updated February 20, 2026, covers all three Claude agents.

Once you have decided which to allow and block, the LandKit AI crawler robots generator will produce the exact robots.txt block in one paste.

What is the GPTBot user agent string and what does it actually do?

GPTBot is OpenAI's training crawler. The user agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot and the bot's job is to download web content that may be used to train future GPT models. It honors robots.txt and OpenAI says changes to your file propagate within roughly 24 hours. GPTBot is the most blocked AI crawler on the web in 2026, per Cloudflare's network-wide analysis.

This is the bot most publishers care about most.

GPTBot's share of all crawler traffic grew from 2.2% to 7.7% between May 2024 and May 2025, and Cloudflare Radar reports a 305% increase in its request volume over that period. Measured against AI crawler traffic alone, its share stood at 9.84% by April 2026, marking its second consecutive month of decline as Applebot picked up share.

About 25% of the top 1,000 websites now block GPTBot in robots.txt, up from 5% in early 2023, according to coverage of the same Cloudflare dataset reported by Search Engine Journal in 2025. That is the steepest opt-out adoption curve in the history of the robots.txt standard.

Blocking GPTBot has zero impact on your Google Search rankings. It also has zero impact on whether ChatGPT can cite you in live answers, because that surface uses OAI-SearchBot, which is a different agent.

What is the ClaudeBot user agent and how is it different from Claude-User?

ClaudeBot is Anthropic's training crawler. Its user agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com). Claude-User is the user-triggered fetcher, called only when a Claude conversation pulls a specific URL. Claude-SearchBot is the indexing agent for Claude's search-style answers. The three bots can be controlled independently in robots.txt and they serve different jobs.

This separation is the cleanest in the AI industry right now.

If you block ClaudeBot, you stop new content from feeding Claude training. You do not stop Claude from citing your existing pages when a user asks about your topic.

If you block Claude-SearchBot, you do stop the citations. That is rarely what site owners want.

The framing I give LandKit operators: block ClaudeBot if your content is your moat, allow the other two so Claude users can still find and reference you. Anthropic's bots.json IP list lets you verify by IP rather than just user agent, which matters because user agent strings are trivially spoofed.

How do I recognize AI bots in server logs and tell real bots from fakes?

Look for the official user agent strings (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Claude-SearchBot, Applebot, Bytespider, Meta-ExternalAgent) in your access logs, then verify each hit with reverse DNS plus forward confirmation. A 2025 study cited by Arcjet found that 5.7% of traffic claiming to be a well-known AI crawler is fake, and the ChatGPT user agent in particular shows a 16.7% spoof rate. User agent string alone is not proof.
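
The first step, spotting the tokens in an access log, can be sketched in a few lines of Python. The log lines below are made-up samples in a combined-log style, and the shortened Bytespider user agent is a placeholder rather than the real string. A match here only means a request claimed to be that bot.

```python
import re
from collections import Counter

# Tokens from the paragraph above; extend the list as needed.
AI_BOT_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot",
                 "Claude-SearchBot", "Applebot", "Bytespider", "Meta-ExternalAgent"]

_TOKEN_RE = re.compile("|".join(re.escape(t) for t in AI_BOT_TOKENS), re.IGNORECASE)
_CANONICAL = {t.lower(): t for t in AI_BOT_TOKENS}


def count_ai_bot_hits(log_lines):
    """Count log lines whose text contains a known AI bot token."""
    counts = Counter()
    for line in log_lines:
        match = _TOKEN_RE.search(line)  # first token found on the line
        if match:
            counts[_CANONICAL[match.group(0).lower()]] += 1
    return counts


# Hypothetical sample lines (documentation IPs, invented timestamps).
SAMPLE_LOG = [
    '203.0.113.7 - - [01/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot"',
    '203.0.113.8 - - [01/May/2026:10:00:02 +0000] "GET /blog HTTP/1.1" 200 900 "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15"',
    '203.0.113.9 - - [01/May/2026:10:00:05 +0000] "GET /docs HTTP/1.1" 200 700 "-" '
    '"Mozilla/5.0 (compatible; Bytespider)"',
]

hits = count_ai_bot_hits(SAMPLE_LOG)
print(hits)
```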

The real test is reverse DNS.

Run a reverse DNS lookup on the source IP. Legitimate GPTBot traffic resolves to openai.com. Then run a forward DNS lookup on that hostname and confirm the IP matches.
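
The two lookups described above (often called forward-confirmed reverse DNS) can be sketched with Python's socket module. Treat this as an illustration under assumptions: the function names are hypothetical, the hostname in the demo is invented, and the resolvers are injectable so the demo runs without touching the network.

```python
import socket


def _reverse_lookup(ip):
    """PTR lookup: return the primary hostname for an IP."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    """Return the set of IP addresses the hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_forward_confirmed(ip, allowed_suffixes,
                         reverse_lookup=_reverse_lookup,
                         forward_lookup=_forward_lookup):
    """True if ip reverse-resolves under an allowed domain and that
    hostname forward-resolves back to the same ip."""
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:  # no PTR record or lookup failure
        return False
    # Require an exact domain or a true subdomain, not a lookalike suffix.
    if not any(hostname == s or hostname.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False


# Demo with stub resolvers; the hostnames and IPs are hypothetical.
FAKE_PTR = {"192.0.2.10": "crawler-10.openai.com", "192.0.2.99": "evilopenai.com"}
FAKE_A = {"crawler-10.openai.com": {"192.0.2.10"}}

genuine = is_forward_confirmed("192.0.2.10", ["openai.com"],
                               FAKE_PTR.__getitem__, lambda h: FAKE_A.get(h, set()))
lookalike = is_forward_confirmed("192.0.2.99", ["openai.com"],
                                 FAKE_PTR.__getitem__, lambda h: FAKE_A.get(h, set()))
print(genuine, lookalike)
```

In production you would call is_forward_confirmed(ip, ["openai.com"]) with the default resolvers and cache the result, since DNS lookups on every request are slow.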

Anthropic, OpenAI, Google, and Apple all publish official IP ranges so you can verify without DNS calls. Arcjet's user agent identification guide walks through the full HTTP-signature approach for the more careful checks.
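
Checking a source IP against published ranges needs nothing beyond Python's ipaddress module. The CIDR blocks below are documentation-reserved placeholders, not real crawler ranges; in practice you would load the operator's current published list. The function name is hypothetical.

```python
import ipaddress

# Placeholder ranges from documentation-reserved blocks, NOT real crawler ranges.
EXAMPLE_PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24", "2001:db8::/32"],
}


def ip_matches_operator(ip, claimed_bot, published_ranges):
    """True if ip falls inside any range published for the claimed bot."""
    address = ipaddress.ip_address(ip)
    for cidr in published_ranges.get(claimed_bot, []):
        network = ipaddress.ip_network(cidr)
        if address.version == network.version and address in network:
            return True
    return False


print(ip_matches_operator("192.0.2.44", "GPTBot", EXAMPLE_PUBLISHED_RANGES))
print(ip_matches_operator("203.0.113.5", "GPTBot", EXAMPLE_PUBLISHED_RANGES))
```

A request that claims a bot's user agent but fails this check is a candidate for an IP-level block.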

If you see a "GPTBot" user agent coming from a residential ASN with no openai.com reverse DNS, that is a scraper hiding behind OpenAI's name. Block it on the IP, not just the user agent.

Which AI crawlers ignore robots.txt and what do you do about them?

Perplexity is the highest-profile case. In August 2025, Cloudflare published evidence that Perplexity was using stealth, undeclared crawlers to evade robots.txt and IP blocks, switching to a Chrome-on-macOS impersonator user agent and rotating ASNs. Bytespider has a long history of inconsistent robots.txt compliance. The defensive answer is not robots.txt at all; it is WAF rules, ASN-level blocks, or a managed AI crawler firewall.

Cloudflare's August 4, 2025 stealth crawler report documented 3-6 million daily stealth requests across tens of thousands of domains, on top of the 20-25 million daily declared requests.

Cloudflare de-listed Perplexity as a verified bot in response and rolled new detection heuristics into its managed AI rules.

The lesson for the rest of us is that robots.txt is a request, not a fence. If a crawler decides to ignore it, you need an actual control plane: WAF rules, rate limiting, ASN blocks, or a dedicated AI crawler firewall like Cloudflare AI Crawl Control.
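
If you have no WAF in front of the site, the same idea can be approximated inside the application. The sketch below is a minimal WSGI middleware that returns 403 for a hypothetical set of blocked tokens. It is an illustration only: it matches on the user agent header, so it does nothing against stealth crawlers that impersonate a browser, which is why edge controls with IP and ASN rules remain the stronger option.

```python
BLOCKED_TOKENS = ("bytespider", "gptbot", "ccbot")  # hypothetical policy


class BlockAIBots:
    """WSGI middleware that rejects requests whose User-Agent contains
    one of the blocked tokens (case-insensitive substring match)."""

    def __init__(self, app, blocked_tokens=BLOCKED_TOKENS):
        self.app = app
        self.blocked_tokens = tuple(t.lower() for t in blocked_tokens)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in self.blocked_tokens):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)


def hello_app(environ, start_response):
    """Stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


application = BlockAIBots(hello_app)
```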

If your robots.txt is the only thing you have, run it through the robots.txt validator so you at least know it is parseable. Then layer real controls on top.

Does Google-Extended actually crawl pages or is it just a token?

Google-Extended does not crawl your site. It is a robots.txt control token that Google uses to decide whether content already crawled by Googlebot can be passed into Gemini and Vertex AI training. Disallowing Google-Extended opts your content out of Google's generative AI training without affecting Search rankings or AI Overview eligibility for already-indexed pages. The same pattern applies to Applebot-Extended.

This is one of the most misunderstood entries on the AI crawlers list.

People expect Google-Extended to show up in their server logs as a separate hit. It does not.

Googlebot does the crawling. Google-Extended is a routing flag. When you disallow Google-Extended in robots.txt, you are telling Google not to use already-fetched content for AI training. You are not stopping any HTTP request from happening.

Google announced the token on September 28, 2023, per the original TechCrunch coverage. Apple introduced Applebot-Extended with the same architecture roughly a year later. Both are pure opt-out signals.

That detail matters when you are auditing logs and thinking you have a missing user agent. You do not. There is nothing to audit.

How does blocking AI crawlers affect SEO and AI citations?

Blocking training crawlers (GPTBot, ClaudeBot, Google-Extended, Bytespider) has no effect on Google Search rankings, no effect on Bing rankings, and no effect on whether ChatGPT, Claude, or Perplexity can cite your already-existing content in live answers. Blocking AI search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) is the one move that does kill AI citations. Most operators conflate the two and lose AI visibility while trying to protect training data.

The split matters more than any other single decision in this guide.

Per OpenAI's publisher FAQ, blocking GPTBot does not affect ChatGPT search citations because those use OAI-SearchBot.

Anthropic confirmed the same separation when it updated its crawler documentation in February 2026.

The right pattern for almost every site I audit through the LandKit free SEO audit is: block training, allow search, allow inference. That defends the content asset and keeps you visible across the AI surfaces buyers actually use.

If you also have an llms.txt file telling AI crawlers what content you actually want surfaced, run it through the LandKit llms.txt validator before you ship it.

Frequently asked questions

What is the GPTBot user agent string?

The exact GPTBot user agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot. OpenAI publishes this and two others (OAI-SearchBot, ChatGPT-User) in its official bots documentation. To verify a request actually came from OpenAI, run reverse DNS on the source IP and confirm it resolves to openai.com, then forward-resolve to confirm the IP matches.

Should I block ClaudeBot if I want Claude to cite my site?

No, blocking ClaudeBot does not stop Claude from citing you. ClaudeBot is the training crawler. Citations in Claude's live answers come from Claude-SearchBot and Claude-User, which are independent agents. The standard pattern is to disallow ClaudeBot in robots.txt while allowing Claude-SearchBot and Claude-User. Anthropic confirmed this separation in its February 2026 documentation update.

What is the difference between Google-Extended and Googlebot?

Googlebot is Google's web search crawler and does the actual page fetching. Google-Extended is a robots.txt-only token that controls whether content already fetched by Googlebot can be used for Gemini and Vertex AI training. Google-Extended does not appear as its own user agent in server logs because it never makes a separate HTTP request. Disallowing it has zero impact on Google Search.

Is PerplexityBot safe to block?

PerplexityBot is the declared user agent, and it does honor robots.txt when it announces itself. The problem is that Cloudflare documented in August 2025 that Perplexity also uses undeclared stealth crawlers impersonating Chrome on macOS, generating 3-6 million daily requests across tens of thousands of domains. If you decide to block Perplexity, layer ASN blocks and WAF rules on top of robots.txt because robots.txt alone will not hold.

How do I verify a bot is really GPTBot and not a fake?

Run a reverse DNS lookup on the source IP address. Legitimate GPTBot traffic resolves to openai.com. Then run a forward DNS lookup on that hostname and confirm the IP matches the original request IP. A 2025 Arcjet study found 5.7% of AI crawler traffic is fake, with the ChatGPT user agent showing a 16.7% spoof rate, so user agent string alone is not enough.

Will blocking AI crawlers hurt my Google rankings?

Blocking GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Bytespider, or CCBot has no effect on Google Search rankings, Bing rankings, or AI Overview eligibility. These are training-only crawlers and opt-out tokens. Google has confirmed that Google-Extended governs only generative AI training use (Gemini and Vertex AI) and has no impact on Google Search. Blocking AI search bots like OAI-SearchBot or Claude-SearchBot does affect AI citations, which is a different question.

Pick the three to block this week

Open your robots.txt today. Add disallow rules for GPTBot, ClaudeBot, and CCBot if you protect your content, then verify the file parses cleanly, then watch your access logs for one week to confirm those user agents stop showing up. Layer Cloudflare AI Crawl Control or an equivalent firewall on top if Perplexity or Bytespider keep hitting after the block. Revisit the list every quarter, because this directory shifts faster than any other corner of SEO right now.

Nikhil Kumar is the founder of LandKit, the SEO and AI visibility growth OS used by solo founders and lean SaaS teams to track brand mentions across ChatGPT, Claude, Gemini, and Perplexity. He writes about AI search, technical SEO, and what actually moves traffic for under-resourced teams.