LandKit

AI Crawler Robots.txt Generator

Control which AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and more) can access your site. Generates a valid robots.txt block you can paste into your site or use as a starting point.

Quick presets

Start from a sensible default, then fine-tune below.


OpenAI

GPTBot (Training)

Crawls the web to train GPT models

ChatGPT-User (Browsing)

Fetches pages on demand when ChatGPT users include URLs

OAI-SearchBot (Search)

Powers ChatGPT Search results

Anthropic

ClaudeBot (Training)

Crawls the web to train Claude models

anthropic-ai (Training)

Legacy Anthropic crawler

claude-web (Browsing)

Claude browsing tool fetcher

Google

Google-Extended (Training)

Controls AI training (Gemini, Vertex AI). Does NOT affect normal Google Search.

Perplexity

PerplexityBot (Search)

Indexes pages for Perplexity AI search

Perplexity-User (Browsing)

Fetches pages on demand for Perplexity users

Apple

Applebot-Extended (Training)

Controls Apple Intelligence training. Does NOT affect Siri search.

Meta

Meta-ExternalAgent (Training)

Used by Meta for AI training

FacebookBot (Mixed)

Used by Meta for various crawling

ByteDance

Bytespider (Training)

TikTok / ByteDance AI crawler

Common Crawl

CCBot (Training)

Common Crawl, used by many AI training datasets

Other AI crawlers

Diffbot (Training)

Diffbot's AI knowledge graph crawler

omgili (Training)

webz.io crawler used by various AI services

ImagesiftBot (Training)

TheHive.ai image crawler

YouBot (Search)

You.com search and AI crawler

Amazonbot (Mixed)

Amazon AI crawler

Live robots.txt preview

# AI Crawler rules generated by LandKit
# https://landkit.pro/free-tools/ai-crawler-robots-generator

# No rules selected. Toggle bots above to generate output.

How to use this robots.txt

  1. Generate your rules above.

    Toggle each AI crawler to Allow, Block or No rule. Use a preset to start fast.

  2. Add the block to your site root.

    If you do not yet have a robots.txt, download the file and place it at /robots.txt. If you already have one, paste these AI rules in (do not duplicate any existing user-agent sections).

  3. Verify with the LandKit Robots.txt Validator.

    After deploying, scan your live /robots.txt for syntax errors and confirm each rule is parsed the way you intended.
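For a quick local spot-check of step 3, here is a minimal sketch using Python's standard-library urllib.robotparser. It is not a replacement for a validator. The function name check_policy and the sample rules are illustrative assumptions. The stdlib parser matches user-agent tokens by substring and ignores directives it does not recognize, so real crawlers may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser

def check_policy(robots_txt, expected, path="/"):
    """Return {bot: (expected_allowed, actual_allowed)} for every mismatch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    mismatches = {}
    for bot, should_allow in expected.items():
        actual = parser.can_fetch(bot, path)
        if actual != should_allow:
            mismatches[bot] = (should_allow, actual)
    return mismatches

# Illustrative rules only; paste your own generated block here.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

# An empty dict means every listed bot behaves as intended for this path.
print(check_policy(SAMPLE, {"GPTBot": False, "OAI-SearchBot": True}))
```

Any non-empty result names the bot whose parsed rule differs from what you intended.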

Should you allow or block AI crawlers?

There is no single right answer. Here is the honest tradeoff so you can decide based on your business.

Reasons to block

  • Protect original content (research, journalism, premium writing) from being absorbed into training data.
  • Reduce server bandwidth and CPU spend on aggressive crawlers like Bytespider and CCBot.
  • Protect a paywalled or subscriber-only product where your moat is the content itself.
  • Compliance and licensing reasons (you sell content licenses and do not want it scraped for free).

Reasons to allow

  • Get cited in ChatGPT, Perplexity, Claude and Gemini answers (the new top-of-funnel for SaaS, B2B and SEO content).
  • Brand mention volume in AI responses is becoming a measurable channel (this is the GEO opportunity).
  • Most blocking does not actually prevent scraping by rogue actors anyway; allowing the well-behaved bots costs little.
  • Your competitors who allow AI crawlers will show up in AI answers; you will not.

Common middle ground: allow AI search bots (OAI-SearchBot, PerplexityBot, Perplexity-User, ChatGPT-User) so you stay visible inside AI search products, but block training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot) so your content is not used to train new models. The "Allow search bots, block training" preset above does exactly this.

AI crawler myths debunked

Myth: Blocking GPTBot also blocks ChatGPT Search.

Truth: False. ChatGPT Search uses a separate user-agent called OAI-SearchBot. You can block GPTBot training without losing ChatGPT Search visibility.

Myth: Google-Extended affects Google Search rankings.

Truth: False. Google-Extended only controls Gemini and Vertex AI training. Googlebot remains a separate user-agent and your Google Search rankings are not affected.

Myth: Robots.txt is enough to keep AI scrapers off my content.

Truth: Not always. Robots.txt is a polite request. For hard enforcement, combine it with server-side user-agent and IP blocks, plus rate limiting at your CDN or origin.

Myth: Blocking all AI crawlers will get my content stolen less.

Truth: Mostly false. Most AI re-publication risk comes from rogue scrapers that ignore robots.txt entirely. The bigger gain from blocking is protecting your content from being used as training data, not preventing re-publication.
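The server-side user-agent block mentioned in the robots.txt myth above could look roughly like this at the application layer. This is a sketch under assumptions: it uses a plain WSGI middleware, the class name BlockAgents and the token list are hypothetical, and in practice the same rule often lives at the CDN or web server instead.

```python
BLOCKED_TOKENS = ("bytespider", "ccbot")  # assumption: example choices only

class BlockAgents:
    """WSGI middleware that answers 403 when the User-Agent contains a blocked token."""

    def __init__(self, app, tokens=BLOCKED_TOKENS):
        self.app = app
        self.tokens = tuple(t.lower() for t in tokens)

    def __call__(self, environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in self.tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)

def hello_app(environ, start_response):
    """Stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]

app = BlockAgents(hello_app)
```

User-agent strings are easy to spoof, so a filter like this only stops crawlers that identify themselves honestly. That is why the advice above pairs it with IP blocks and rate limiting.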

Deep dive

AI crawler robots.txt: how to control GPTBot, ClaudeBot, and Google-Extended in 2026

By Nikhil Kumar, Founder of LandKit. Last updated May 2026.

Most sites are still running a 2023 robots.txt against a 2026 problem. The bots changed. The rules changed. Your file probably did not.

An AI crawler robots.txt is the file that tells GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and 15-plus other AI agents which parts of your site they can read, train on, or cite in real-time answers. In 2026, you need three independent decisions per bot: training (ai-train), real-time grounding (ai-input), and search indexing (search). Cloudflare now manages this on more than 3.8 million domains, per their Content Signals Policy launch on September 24, 2025.

Why your 2023 robots.txt is broken in 2026

In 2023, "block AI" meant one rule: User-agent: GPTBot followed by Disallow: /. That worked when GPTBot was the only bot you cared about and OpenAI ran one crawler. Today there are at least three OpenAI agents, three Anthropic agents, a separate Google-Extended token, plus PerplexityBot, Bytespider, Amazonbot, Meta-ExternalAgent, and Applebot-Extended. A blanket disallow now blocks the bot that would have cited you alongside the one scraping your archive.

Cloudflare's data tells the real story. AI bots hit 4.2% of all HTML requests across their network in 2025, per their Year in Review. On a typical content site, AI crawlers now generate more requests than your top three search engines combined.

The blunt block also breaks visibility. Anthropic's crawl-to-refer ratio dropped from 286,930:1 in January 2025 to 38,066:1 by July 2025 after Claude added clickable citations, per Cloudflare's crawl-to-click analysis. That is still bad. It is also 7.5x better than it was, and the trajectory is the point. Block ClaudeBot today and you also opt out of every future improvement.
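The 7.5x figure follows directly from the two ratios quoted above. A quick check, using only the numbers in the text:

```python
jan_2025 = 286_930  # pages crawled per referral, January 2025 (from the text)
jul_2025 = 38_066   # same ratio, July 2025 (from the text)

improvement = jan_2025 / jul_2025
print(f"{improvement:.1f}x fewer crawls per referral")
```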

Should I block GPTBot or allow it?

Allow GPTBot only if you are willing to feed your content into model training without compensation or referral guarantees. As of April 2026, 62% of the top 100 US and UK news sites block GPTBot, and 79% block at least one training bot, per BuzzStream's 100-site analysis. Most operators block GPTBot but allow OAI-SearchBot so ChatGPT search can still link back. That split keeps citation upside while removing training-only consumption.

The decision is not "AI good vs AI bad." It is "what does this specific bot do and does it pay me back."

OpenAI runs three named crawlers. GPTBot trains models. OAI-SearchBot powers ChatGPT search results, which include source links. ChatGPT-User is the on-demand fetcher used when a user asks ChatGPT to read a specific URL.

Block GPTBot, allow the other two, and you have a clean posture: no training, full citation eligibility.

What's the difference between ClaudeBot, Claude-User, and Claude-SearchBot?

ClaudeBot trains Anthropic's models. Claude-User is fetched on demand when a Claude user pastes your URL into a conversation. Claude-SearchBot indexes pages so Claude's search feature can cite them. All three respect robots.txt and operate independently, per Anthropic's published bot documentation. Blocking ClaudeBot does nothing to Claude-SearchBot. That granularity is new as of late 2025 and most generators have not caught up.

This matters because Anthropic still has one of the worst crawl-to-refer ratios in Cloudflare's data (38,066:1 as of July 2025). Block ClaudeBot, keep Claude-SearchBot open, and you exclude the consumption that costs you most while staying eligible for citations inside Claude's interface.

ClaudeBot is also the bot that most aggressively burns server resources on long-tail content. Soar Agency's April 2026 robots.txt scan found ClaudeBot named in 514 of the top files measured, second only to GPTBot at 614.

The clean two-line block:

User-agent: ClaudeBot
Disallow: /

Leave Claude-User and Claude-SearchBot unmentioned and they default to allowed, provided no User-agent: * group in the same file disallows the paths you care about.

How does the Content-Signal robots.txt syntax actually work?

Content Signals are a one-line addition to robots.txt that separates three uses of your content: search (classical search indexing), ai-input (real-time grounding for AI answers), and ai-train (training and fine-tuning). Cloudflare launched the policy on September 24, 2025 and has it active on over 3.8 million domains, per Cloudflare's announcement. The IETF AIPREF working group is now standardizing the same vocabulary, with a first draft targeted for early 2026.

The syntax sits at the top of robots.txt:

User-Agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /

That single line says: index me for search, cite me in AI answers, do not use me for training. It does not require any user-agent updates. Bots that understand the signal honor it. Bots that do not still see the existing User-agent and Disallow rules below.
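To make the format concrete, here is a small sketch that reads Content-Signal lines out of a robots.txt and turns them into booleans. It is an illustration under assumptions, not an official parser: parse_content_signals is a hypothetical helper, it treats any value other than "yes" as "no", and it ignores which user-agent group each line belongs to.

```python
def parse_content_signals(robots_txt):
    """Return a list of {signal: bool} dicts, one per Content-Signal line."""
    results = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "content-signal":
            continue
        signals = {}
        for pair in value.split(","):
            name, eq, setting = pair.partition("=")
            if eq:
                signals[name.strip().lower()] = setting.strip().lower() == "yes"
        results.append(signals)
    return results

EXAMPLE = """\
User-Agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /
"""

print(parse_content_signals(EXAMPLE))
```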

The catch is enforceability. Compliance is voluntary in 2026, the same way classical robots.txt has always been voluntary. Cloudflare is the first major infrastructure player pushing the policy, but the IETF draft is what turns it from a private convention into an open standard. If you want a separate sanity check on your file once you write it, the LandKit robots.txt validator flags broken syntax and missing AI-bot rules.

Will blocking AI crawlers hurt my SEO rankings?

No. Blocking GPTBot, ClaudeBot, Google-Extended, or PerplexityBot has zero direct impact on your Google or Bing search rankings. These are separate user-agents from Googlebot and Bingbot, and Google has confirmed Google-Extended is purely an AI-training token. The risk is indirect: you lose visibility inside ChatGPT, Claude, Perplexity, and Google AI Overviews, which now intercept a meaningful share of queries that used to land on your site.

Pew Research's March 2025 study of 900 US adults found that users who saw a Google AI Overview clicked through to a source 8% of the time, versus 15% without an AI Overview, per the Pew analysis. That is a 47% drop in click-through rate on AI Overview searches. Worse, users clicked the source links inside the AI Overview itself only 1% of the time.
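The 47% figure is a relative drop, not a share of all clicks. It comes from the two click-through rates quoted above:

```python
ctr_without_overview = 0.15  # click-through rate without an AI Overview (from the text)
ctr_with_overview = 0.08     # click-through rate with an AI Overview (from the text)

relative_drop = (ctr_without_overview - ctr_with_overview) / ctr_without_overview
print(f"{relative_drop:.0%} relative drop")
```

In absolute terms the gap is 7 percentage points.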

The takeaway for ranking strategy: classical SEO traffic is shrinking faster than most operators model. You either cede the new surface (block everything, accept that AI answers will recycle competitors' content) or you take a position on which bots earn citation rights.

The 15 AI bots you actually need rules for

There are dozens of named AI crawlers in the wild. Most are noise. The 15 below cover roughly 95% of AI bot requests on a typical site in 2026, based on observed user-agent share in Cloudflare's bot analysis and Anthropic's, OpenAI's, and Google's own documentation.

Bot | Operator | Purpose | Default recommendation
--- | --- | --- | ---
GPTBot | OpenAI | Training data | Block
OAI-SearchBot | OpenAI | ChatGPT search citations | Allow
ChatGPT-User | OpenAI | User-triggered URL fetch | Allow
ClaudeBot | Anthropic | Training data | Block
Claude-User | Anthropic | User-triggered URL fetch | Allow
Claude-SearchBot | Anthropic | Claude search citations | Allow
Google-Extended | Google | Gemini and Vertex AI training | Block (allow if you want Gemini citations)
PerplexityBot | Perplexity | Real-time answers and indexing | Allow
Perplexity-User | Perplexity | User-triggered fetch | Allow
Bytespider | ByteDance | Training (TikTok, Doubao) | Block
Amazonbot | Amazon | Alexa and Q training | Block
Applebot-Extended | Apple | Apple Intelligence training | Block (allow Applebot for Siri search)
Meta-ExternalAgent | Meta | Llama training | Block
CCBot | Common Crawl | Open dataset, used by most LLMs | Block
DuckAssistBot | DuckDuckGo | DuckAssist answers | Allow

The pattern: block the training-only bots, keep the search and on-demand fetcher bots open. That is the consensus posture across publishers in 2026.

What does a citation-friendly robots.txt actually look like?

A citation-friendly file blocks training, allows search, and uses the Content-Signal line so future-compliant bots get the same answer in one read. It runs around 30 lines, names every bot explicitly, and never relies on User-agent: * to do the heavy lifting for AI policy. Most generators ship a 6-line skeleton that is functionally useless against the 2026 bot landscape. Below is a working starter that pairs with the LandKit AI crawler reference for the full bot list and update history.

# Content Signals (Cloudflare / IETF AIPREF draft)
User-Agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /

# Block training-only bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

# Allow search and citation bots (no rules = allowed by default)
# OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot,
# PerplexityBot, Perplexity-User, DuckAssistBot, Applebot all run by default

Sitemap: https://yourdomain.com/sitemap.xml

This file covers the three jobs every robots.txt has in 2026: it tells classical crawlers what to index, it tells training crawlers to leave, and it leaves citation crawlers a clear runway. Pair it with an llms.txt file at your root to summarize site architecture for LLMs that look for it; the LandKit llms.txt validator checks the format.

What about the trade-off between ai-train and ai-input?

ai-train is the long tail. ai-input is today. Choosing between them is a bet on whether AI engines will eventually pay for training rights or whether you would rather give up training in exchange for being cited in real-time answers right now. As of Q1 2026, no major AI operator pays for ai-train consumption directly, but Cloudflare's pay-per-crawl pilot is live and the Cloudflare announcement on managed robots.txt puts monetization on a 12 to 18 month horizon for participating sites.

The pragmatic split for 2026: ai-train=no, ai-input=yes, search=yes. You give up training data nobody has paid for. You stay cited inside ChatGPT, Claude, Perplexity, and Gemini. You stay indexed in Google. That is the largest possible audience surface with the smallest possible giveaway.

The opposite posture, ai-train=yes, ai-input=no, search=yes, is rarer but defensible if your site is a niche reference (developer docs, API spec) where being trained into the model is more valuable than being cited per query.

The third option, blocking everything, costs you measurable visibility. Given that click-through rates are already roughly 47% lower on searches that show an AI Overview, per Pew, "block all AI" is also "concede the new surface to your competitors." A free LandKit SEO audit will show you which AI engines currently cite your site and which ignore you, which is the real input for this trade-off.

How often do I need to update my robots.txt for new AI bots?

Quarterly minimum, monthly if you are a publisher. New AI bots launch every 4 to 6 weeks. Anthropic added Claude-User and Claude-SearchBot in late 2025, OpenAI added OAI-SearchBot in mid-2024, and Apple's Applebot-Extended landed quietly in 2024 alongside Apple Intelligence. AI-bot blocking by reputable sites grew from 23% in September 2023 to nearly 60% by May 2025, per analysis cited in Search Engine Journal. The robots.txt you wrote 12 months ago is probably missing at least three live bots.

The maintenance pattern that works: pin a calendar reminder for the first Monday of each quarter. Pull the latest user-agent list from a maintained source like the GitHub ai-robots-txt repo, diff against your file, paste in any new entries, ship it.
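The "diff against your file" step can be scripted. The sketch below assumes you have already copied the current bot names from your maintained source into a Python list. The helper names and the sample reference list are hypothetical, and the comparison is a plain case-insensitive match on User-agent lines.

```python
def named_agents(robots_txt):
    """Collect every user-agent token named in a robots.txt, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0]  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent" and value.strip():
            agents.add(value.strip().lower())
    return agents

def missing_agents(robots_txt, reference):
    """Reference bots that the file does not mention, in reference order."""
    present = named_agents(robots_txt)
    return [bot for bot in reference if bot.lower() not in present]

CURRENT_FILE = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

# Assumption: a hand-typed stand-in for the list you pull each quarter.
REFERENCE = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider"]

print(missing_agents(CURRENT_FILE, REFERENCE))
```

A bot showing up as missing is not automatically a problem. Search and on-demand fetchers may be left unmentioned on purpose so they stay allowed by default.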

If you want to skip the manual work, the LandKit free-tools hub maintains a live AI crawler reference and a robots.txt generator that pulls the current bot list every time you generate.

Frequently asked questions

Does blocking GPTBot hurt my Google rankings?

No. GPTBot is OpenAI's crawler, completely separate from Googlebot, and Google has confirmed publicly that AI-bot blocking does not affect search rankings. Blocking GPTBot only removes your content from OpenAI's training pipeline. The indirect cost is reduced visibility inside ChatGPT answers, but ChatGPT search itself uses OAI-SearchBot, which is a separate user-agent you can keep allowed.

How do I block AI bots from training on my content but still let them cite me?

Block the training-specific user-agents (GPTBot, ClaudeBot, Google-Extended, Bytespider, CCBot, Meta-ExternalAgent, Applebot-Extended) and leave the search and on-demand fetcher bots unmentioned (OAI-SearchBot, Claude-SearchBot, Claude-User, ChatGPT-User, PerplexityBot). Add the Content-Signal line ai-train=no, ai-input=yes, search=yes at the top so AIPREF-compliant bots get the same answer in one read.

What's the difference between ai-train, ai-input, and search in Content-Signal?

ai-train means using your content to train or fine-tune a model. ai-input means using your content as real-time context for an AI answer (RAG, grounding, search citations). Search means classical search indexing without AI summarization. The three are independent. You can allow search and ai-input while blocking ai-train, which is the most common posture among publishers as of Q1 2026.

Do AI bots actually respect robots.txt?

The major ones do. OpenAI, Anthropic, Google, Perplexity, and Apple all publicly commit to honoring robots.txt directives. Smaller scrapers and some open-source crawlers ignore the file entirely, which is why robots.txt is preference signaling, not enforcement. For hard blocking, use Cloudflare's bot management or server-level user-agent rules. For policy expression, robots.txt is the standard.

Should a small blog block AI crawlers in 2026?

If you write original analysis, block training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) and allow citation bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot). If you publish syndicated or aggregated content, the trade-off matters less because you have weaker citation upside. Solo blogs that block everything see measurable drops in AI-mention share, which is the new traffic surface most operators underweight.

How do I know if my robots.txt is working?

Check the robots.txt report in Google Search Console (it replaced the older robots.txt tester), run the file through a syntax validator, then watch your server logs for the user-agents named in your rules. If GPTBot still hits paths under Disallow within 30 days of the change, the bot is non-compliant or your file has a typo. The LandKit robots.txt validator catches syntax errors that Google's report misses for AI-specific user-agents.
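The log-watching step might be sketched as follows. The pattern assumes the common "combined" access-log layout. The function name blocked_hits and the sample log lines are made up for illustration, and requests for /robots.txt itself are skipped because compliant bots are expected to fetch it.

```python
import re
from collections import Counter

# Assumes the "combined" access-log layout; adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def blocked_hits(log_lines, blocked_bots, disallowed_prefix="/"):
    """Count requests from blocked bots to paths under the disallowed prefix."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or match["path"] == "/robots.txt":
            continue
        if not match["path"].startswith(disallowed_prefix):
            continue
        agent = match["agent"].lower()
        for bot in blocked_bots:
            if bot.lower() in agent:
                hits[bot] += 1
    return hits

# Made-up sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/May/2026:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.5 - - [10/May/2026:10:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 90 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.7 - - [10/May/2026:10:00:02 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0 Firefox/126.0"',
]

print(blocked_hits(SAMPLE_LOG, ["GPTBot", "ClaudeBot"]))
```

Because the match is on the self-reported user-agent string, a non-zero count is a prompt to investigate, not proof of which operator sent the request.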

Pick your posture and ship the file

Most operators are still running a 2023 robots.txt against a 2026 bot landscape, and that is the real opportunity. Decide your three positions: train, cite, search. Write a 30-line file that names the bots explicitly, adds the Content-Signal line, and gets reviewed every quarter. Then point your strategy at the surface that actually matters now, which is whether ChatGPT, Claude, Perplexity, and Gemini cite your brand when buyers ask for it. Track that with LandKit's growth OS and you stop guessing about AI visibility.

Nikhil Kumar is the founder of LandKit, the SEO and AI visibility growth OS for solo operators and small teams. He writes about how independent founders win on Google and inside AI engines without enterprise budgets. Find him on LinkedIn.