Free AI Crawler Robots.txt Generator, GPTBot + More

AI crawler robots.txt: how to control GPTBot, ClaudeBot, and Google-Extended in 2026

By Nikhil Kumar, Founder of LandKit. Last updated May 2026.

Most sites are still running a 2023 robots.txt against a 2026 problem. The bots changed. The rules changed. Your file probably did not.

An AI crawler robots.txt is the file that tells GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and 15-plus other AI agents which parts of your site they can read, train on, or cite in real-time answers. In 2026, you need three independent decisions per bot: training (ai-train), real-time grounding (ai-input), and search indexing (search). Cloudflare now manages this on more than 3.8 million domains, per their Content Signals Policy launch on September 24, 2025.

Why your 2023 robots.txt is broken in 2026

Two years ago, "block AI" meant one rule: User-agent: GPTBot followed by Disallow: /. That worked when GPTBot was the only bot you cared about and OpenAI ran one crawler. Today there are at least three OpenAI agents, three Anthropic agents, a separate Google-Extended token, plus PerplexityBot, Bytespider, Amazonbot, Meta-ExternalAgent, and Applebot-Extended. A blanket disallow now blocks the bot that would have cited you alongside the one scraping your archive.

Cloudflare's data tells the real story. AI bots accounted for 4.2% of all HTML requests across their network in 2025, per their Year in Review. On a typical content site, AI crawlers now generate more requests than your top three search engines combined.

The blunt block also breaks visibility. Anthropic's crawl-to-refer ratio dropped from 286,930:1 in January 2025 to 38,066:1 by July 2025 after Claude added clickable citations, per Cloudflare's crawl-to-click analysis. That is still bad. It is also 7.5x better than it was, and the trajectory is the point. Block ClaudeBot today and you also opt out of every future improvement.

Should I block GPTBot or allow it?

Allow GPTBot only if you are willing to feed your content into model training without compensation or referral guarantees. As of April 2026, 62% of the top 100 US and UK news sites block GPTBot, and 79% block at least one training bot, per BuzzStream's 100-site analysis. Most operators block GPTBot but allow OAI-SearchBot so ChatGPT search can still link back. That split keeps citation upside while removing training-only consumption.

The decision is not "AI good vs AI bad." It is "what does this specific bot do and does it pay me back."

OpenAI runs three named crawlers. GPTBot trains models. OAI-SearchBot powers ChatGPT search results, which include source links. ChatGPT-User is the on-demand fetcher that runs when a user asks ChatGPT to read a specific URL.

Block GPTBot, allow the other two, and you have a clean posture: no training, full citation eligibility.

What's the difference between ClaudeBot, Claude-User, and Claude-SearchBot?

ClaudeBot trains Anthropic's models. Claude-User fetches a page on demand when a Claude user pastes your URL into a conversation. Claude-SearchBot indexes pages so Claude's search feature can cite them. All three respect robots.txt and operate independently, per Anthropic's published bot documentation. Blocking ClaudeBot does nothing to Claude-SearchBot. That granularity is new as of late 2025 and most generators have not caught up.

This matters because Anthropic's training crawler still has the worst payback on the internet. Block ClaudeBot, keep Claude-SearchBot open, and you exclude the consumption that costs you most while staying eligible for citations inside Claude's interface.

ClaudeBot is also the bot that most aggressively burns server resources on long-tail content. Soar Agency's April 2026 robots.txt scan found ClaudeBot named in 514 of the top files measured, second only to GPTBot at 614.

The clean two-line block:

User-agent: ClaudeBot
Disallow: /

Leave Claude-User and Claude-SearchBot unmentioned and they default to allowed.
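
One way to sanity-check that claim locally is to run the rule through Python's standard-library robots.txt parser. This is a minimal sketch: the URL is a placeholder, and Python's parser matches user-agent tokens by case-insensitive substring, which may not match how each real crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the article.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /
"""

URL = "https://example.com/blog/post"  # placeholder URL, not a real page


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("ClaudeBot", "Claude-User", "Claude-SearchBot"):
    print(agent, is_allowed(ROBOTS_TXT, agent, URL))
# ClaudeBot False
# Claude-User True
# Claude-SearchBot True
```

The same helper works for the OpenAI split: block GPTBot and confirm OAI-SearchBot and ChatGPT-User still come back as allowed.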

How does the Content-Signal robots.txt syntax actually work?

Content Signals are a one-line addition to robots.txt that separates three uses of your content: search (classical search indexing), ai-input (real-time grounding for AI answers), and ai-train (training and fine-tuning). Cloudflare launched the policy on September 24, 2025 and has it active on over 3.8 million domains, per Cloudflare's announcement. The IETF AIPREF working group is now standardizing the same vocabulary, with a first draft targeted for early 2026.

The syntax sits at the top of robots.txt:

User-Agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /

That single line says: index me for search, cite me in AI answers, do not use me for training. It does not require any user-agent updates. Bots that understand the signal honor it. Bots that do not still see the existing User-agent and Disallow rules below.
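
To make the three-way split concrete, here is a rough sketch of how a script might read that line back out of a robots.txt file. The function name is hypothetical and the parsing is deliberately simplified: it ignores which User-Agent group the line sits under and keeps the first value it sees for each signal. It is not a reference implementation of the Cloudflare policy or the AIPREF draft.

```python
def parse_content_signals(robots_txt: str) -> dict[str, bool]:
    """Collect Content-Signal values from robots.txt text (first value wins)."""
    signals: dict[str, bool] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, flag = pair.partition("=")
            key, flag = key.strip().lower(), flag.strip().lower()
            if eq and flag in ("yes", "no"):
                signals.setdefault(key, flag == "yes")
    return signals


sample = """\
User-Agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /
"""
print(parse_content_signals(sample))
# {'search': True, 'ai-train': False, 'ai-input': True}
```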

The catch is enforceability. Compliance is voluntary in 2026, the same way classical robots.txt has always been voluntary. Cloudflare is the first major infrastructure player pushing the policy, but the IETF draft is what turns it from a private convention into an open standard. If you want a separate sanity check on your file once you write it, the LandKit robots.txt validator flags broken syntax and missing AI-bot rules.

Will blocking AI crawlers hurt my SEO rankings?

No. Blocking GPTBot, ClaudeBot, Google-Extended, or PerplexityBot has zero direct impact on your Google or Bing search rankings. These are separate user-agents from Googlebot and Bingbot, and Google has confirmed Google-Extended is purely an AI-training token. The risk is indirect: you lose visibility inside ChatGPT, Claude, Perplexity, and Google AI Overviews, which now intercept a meaningful share of queries that used to land on your site.

Pew Research's March 2025 study of 900 US adults found that users who saw a Google AI Overview clicked through to a source 8% of the time, versus 15% without an AI Overview, per the Pew analysis. That is a 47% drop in click-through rate on AI Overview searches. Worse, users clicked the source links inside the AI Overview itself only 1% of the time.
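
The 47% figure is a relative drop, not a share of clicks. A quick check of the arithmetic, using only the two percentages quoted above:

```python
# Click-through rates quoted from the Pew analysis above.
ctr_without_overview = 0.15
ctr_with_overview = 0.08

# Relative drop: how much of the original click-through rate disappears.
relative_drop = (ctr_without_overview - ctr_with_overview) / ctr_without_overview
print(f"{relative_drop:.0%}")  # 47%
```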

The takeaway for ranking strategy: classical SEO traffic is shrinking faster than most operators model. You either cede the new surface (block everything, accept that AI answers will recycle competitors' content) or you take a position on which bots earn citation rights.

The 15 AI bots you actually need rules for

There are dozens of named AI crawlers in the wild. Most are noise. The 15 below cover roughly 95% of AI bot requests on a typical site in 2026, based on observed user-agent share in Cloudflare's bot analysis and Anthropic's, OpenAI's, and Google's own documentation.

Bot	Operator	Purpose	Default recommendation
GPTBot	OpenAI	Training data	Block
OAI-SearchBot	OpenAI	ChatGPT search citations	Allow
ChatGPT-User	OpenAI	User-triggered URL fetch	Allow
ClaudeBot	Anthropic	Training data	Block
Claude-User	Anthropic	User-triggered URL fetch	Allow
Claude-SearchBot	Anthropic	Claude search citations	Allow
Google-Extended	Google	Gemini training and grounding	Block (allow if you want Gemini grounding; no effect on Google Search)
PerplexityBot	Perplexity	Real-time answers and indexing	Allow
Perplexity-User	Perplexity	User-triggered fetch	Allow
Bytespider	ByteDance	Training (TikTok, Doubao)	Block
Amazonbot	Amazon	Alexa and Q training	Block
Applebot-Extended	Apple	Apple Intelligence training	Block (allow Applebot for Siri search)
Meta-ExternalAgent	Meta	Llama training	Block
CCBot	Common Crawl	Open dataset, used by most LLMs	Block
DuckAssistBot	DuckDuckGo	DuckAssist answers	Allow

The pattern: block the training-only bots, keep the search and on-demand fetcher bots open. That is the consensus posture across publishers in 2026.

What does a citation-friendly robots.txt actually look like?

A citation-friendly file blocks training, allows search, and uses the Content-Signal line so future-compliant bots get the same answer in one read. It runs around 30 lines, names every bot explicitly, and never relies on User-agent: * to do the heavy lifting for AI policy. Most generators ship a 6-line skeleton that is functionally useless against the 2026 bot landscape. Below is a working starter that pairs with the LandKit AI crawler reference for the full bot list and update history.

# Content Signals (Cloudflare / IETF AIPREF draft)
User-Agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /

# Block training-only bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

# Allow search and citation bots (no rules = allowed by default)
# OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot,
# PerplexityBot, Perplexity-User, DuckAssistBot, Applebot all run by default

Sitemap: https://yourdomain.com/sitemap.xml

This file passes the three jobs every robots.txt has in 2026: it tells classical crawlers what to index, it tells training crawlers to leave, and it leaves citation crawlers a clear runway. Pair it with an llms.txt file at your root to summarize site architecture for LLMs that look for it; the LandKit llms.txt validator checks the format.
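
If you would rather generate the starter than hand-edit it, a small script can build the same layout from a list of blocked bots. This is a sketch under assumptions: the function name is made up for illustration, the bot list simply mirrors the starter file, and the sitemap URL is the same placeholder.

```python
# Training-only bots blocked in the starter file.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Bytespider",
    "CCBot", "Meta-ExternalAgent", "Applebot-Extended", "Amazonbot",
]


def build_robots_txt(blocked: list[str], sitemap_url: str) -> str:
    """Build a robots.txt that blocks the given bots and leaves the rest open."""
    lines = [
        "# Content Signals (Cloudflare / IETF AIPREF draft)",
        "User-Agent: *",
        "Content-Signal: search=yes, ai-train=no, ai-input=yes",
        "Allow: /",
        "",
        "# Block training-only bots",
    ]
    for bot in blocked:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(TRAINING_BOTS, "https://yourdomain.com/sitemap.xml")
print(robots_txt)
```

Keeping the bot list in one place means a quarterly review becomes a one-line change instead of a copy-paste job.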

What about the trade-off between ai-train and ai-input?

ai-train is the long tail. ai-input is today. Choosing between them is a bet on whether AI engines will eventually pay for training rights or whether you would rather give up training in exchange for being cited in real-time answers right now. As of Q1 2026, no major AI operator pays for ai-train consumption directly, but Cloudflare's pay-per-crawl pilot is live and the Cloudflare announcement on managed robots.txt puts monetization on a 12 to 18 month horizon for participating sites.

The pragmatic split for 2026: ai-train=no, ai-input=yes, search=yes. You give up training data nobody has paid for. You stay cited inside ChatGPT, Claude, Perplexity, and Gemini. You stay indexed in Google. That is the largest possible audience surface with the smallest possible giveaway.

The opposite posture, ai-train=yes, ai-input=no, search=yes, is rarer but defensible if your site is a niche reference (developer docs, API spec) where being trained into the model is more valuable than being cited per query.

The third option, blocking everything, costs you measurable visibility. With AI Overviews already cutting click-through by roughly 47% per Pew, "block all AI" is also "concede the new surface to your competitors." A free LandKit SEO audit will show you which AI engines currently cite your site and which ignore you, which is the real input for this trade-off.

How often do I need to update my robots.txt for new AI bots?

Quarterly minimum, monthly if you are a publisher. New AI bots launch every 4 to 6 weeks. Anthropic added Claude-User and Claude-SearchBot in late 2025, OpenAI added OAI-SearchBot in mid-2024, and Apple's Applebot-Extended landed quietly in 2024 alongside Apple Intelligence. AI blocking by reputable sites grew from 23% in September 2023 to nearly 60% by May 2025, per analysis cited in Search Engine Journal. The robots.txt you wrote 12 months ago is missing at least three live bots.

The maintenance pattern that works: pin a calendar reminder for the first Monday of each quarter. Pull the latest user-agent list from a maintained source like the GitHub ai-robots-txt repo, diff against your file, paste in any new entries, ship it.
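
The diff step can be scripted. The sketch below assumes you paste the current bot names into a Python list by hand; it does not fetch or parse any repository, and both function names are hypothetical.

```python
def named_agents(robots_txt: str) -> set[str]:
    """Return the lowercased user-agent tokens named in a robots.txt text."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def missing_agents(robots_txt: str, known_bots: list[str]) -> list[str]:
    """List bots from known_bots that have no User-agent entry in the file."""
    present = named_agents(robots_txt)
    return [bot for bot in known_bots if bot.lower() not in present]


current_file = """\
User-agent: GPTBot
Disallow: /
"""
# Bot names pasted by hand from whatever maintained list you follow.
print(missing_agents(current_file, ["GPTBot", "ClaudeBot", "CCBot"]))
# ['ClaudeBot', 'CCBot']
```

A missing name is not automatically a problem: bots you intend to allow can stay unmentioned. The output is a review list, not a block list.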

If you want to skip the manual work, the LandKit free-tools hub maintains a live AI crawler reference and a robots.txt generator that pulls the current bot list every time you generate.

Frequently asked questions

Does blocking GPTBot hurt my Google rankings?

No. GPTBot is OpenAI's crawler, completely separate from Googlebot, and Google has confirmed publicly that AI-bot blocking does not affect search rankings. Blocking GPTBot only removes your content from OpenAI's training pipeline. The indirect cost is reduced visibility inside ChatGPT answers, but ChatGPT search itself uses OAI-SearchBot, which is a separate user-agent you can keep allowed.

How do I block AI bots from training on my content but still let them cite me?

Block the training-specific user-agents (GPTBot, ClaudeBot, Google-Extended, Bytespider, CCBot, Meta-ExternalAgent, Applebot-Extended, Amazonbot) and leave the search and on-demand fetcher bots unmentioned (OAI-SearchBot, Claude-SearchBot, Claude-User, ChatGPT-User, PerplexityBot). Add the Content-Signal line ai-train=no, ai-input=yes, search=yes at the top so AIPREF-compliant bots get the same answer in one read.

What's the difference between ai-train, ai-input, and search in Content-Signal?

ai-train means using your content to train or fine-tune a model. ai-input means using your content as real-time context for an AI answer (RAG, grounding, search citations). Search means classical search indexing without AI summarization. The three are independent. You can allow search and ai-input while blocking ai-train, which is the most common posture among publishers as of Q1 2026.

Do AI bots actually respect robots.txt?

The major ones do. OpenAI, Anthropic, Google, Perplexity, and Apple all publicly commit to honoring robots.txt directives. Smaller scrapers and some open-source crawlers ignore the file entirely, which is why robots.txt is preference signaling, not enforcement. For hard blocking, use Cloudflare's bot management or server-level user-agent rules. For policy expression, robots.txt is the standard.
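
For a sense of what a server-level user-agent rule looks like, here is a minimal WSGI middleware sketch that returns 403 when the User-Agent header contains a blocked token. The token list and the demo app are illustrative assumptions. User-agent strings can be spoofed, so treat this as a coarse filter rather than real bot management.

```python
# Illustrative tokens only; choose your own list.
BLOCKED_TOKENS = ("bytespider", "ccbot")


def block_ai_bots(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app and refuse requests whose User-Agent contains a blocked token."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


application = block_ai_bots(hello_app)
```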

Should a small blog block AI crawlers in 2026?

If you write original analysis, block training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) and allow citation bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot). If you publish syndicated or aggregated content, the trade-off matters less because you have weaker citation upside. Solo blogs that block everything see measurable drops in AI-mention share, which is the new traffic surface most operators underweight.

How do I know if my robots.txt is working?

Check the robots.txt report in Google Search Console, run the file through a syntax validator, then watch your server logs for the user-agents named in your rules. If GPTBot still hits paths under Disallow within 30 days of the change, the bot is non-compliant or your file has a typo. The LandKit robots.txt validator catches syntax errors that Google's report misses for AI-specific user-agents.
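
The log check can be a few lines of Python. This sketch assumes your server writes the common "combined" access log format. The sample lines are invented for illustration and do not show any crawler's real user-agent string.

```python
import re

# Matches the request path and user-agent in a combined-format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def bot_hits(log_lines, bot_token: str, disallowed_prefix: str = "/") -> list[str]:
    """Return paths under disallowed_prefix requested by agents containing bot_token."""
    token = bot_token.lower()
    hits = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        path = match.group("path")
        if path == "/robots.txt":
            continue  # fetching robots.txt itself is expected behavior
        if token in match.group("ua").lower() and path.startswith(disallowed_prefix):
            hits.append(path)
    return hits


sample_log = [
    '203.0.113.5 - - [01/May/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleUA GPTBot/1.0"',
    '203.0.113.5 - - [01/May/2026:10:00:02 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "ExampleUA GPTBot/1.0"',
    '198.51.100.7 - - [01/May/2026:10:00:03 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "ExampleBrowser/1.0"',
]
print(bot_hits(sample_log, "GPTBot"))
# ['/blog/post']
```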

Pick your posture and ship the file

Most operators are still running a 2023 robots.txt against a 2026 bot landscape, and that is the real opportunity. Decide your three positions: train, cite, search. Write a 30-line file that names the bots explicitly, adds the Content-Signal line, and gets reviewed every quarter. Then point your strategy at the surface that actually matters now, which is whether ChatGPT, Claude, Perplexity, and Gemini cite your brand when buyers ask for it. Track that with LandKit's growth OS and you stop guessing about AI visibility.

Nikhil Kumar is the founder of LandKit, the SEO and AI visibility growth OS for solo operators and small teams. He writes about how independent founders win on Google and inside AI engines without enterprise budgets. Find him on LinkedIn.
