robots.txt Tester
Test robots.txt allow/block decisions for a given user-agent and URL. Build a new robots.txt with an AI crawler preset (GPTBot, CCBot, anthropic-ai, Google-Extended).
robots.txt
Check a URL
Parsed groups
- *: allow /; disallow /admin/, /api/, /private/
- AhrefsBot: disallow /admin/
Sitemaps
- https://example.com/sitemap.xml
About this tool
robots.txt is the plain-text file a site places at its root to tell crawlers what they can and can't fetch. It's a polite request: well-behaved bots honor it and abusive ones don't. But for the bots that matter (Googlebot, Bingbot, Applebot, the major AI training crawlers), robots.txt is the official way to opt in or out.
Test mode takes an existing robots.txt body, a user-agent, and a URL path, and tells you whether the URL would be allowed or blocked. The decision follows Google's algorithm: pick the most specific User-agent group, then apply the longest matching rule (Allow wins on ties). The parsed groups and rules are shown so you can see exactly what robots.txt is doing.
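The longest-match step described above can be sketched as follows. This is a minimal illustration, not the tool's actual code; `rules` is a hypothetical list of (verb, pattern) pairs taken from the already-selected User-agent group:

```python
import re

def pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into an anchored regex:
    '*' matches any sequence; a trailing '$' anchors to end-of-URL."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "^" + "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(rules, path: str) -> bool:
    """Apply the longest matching pattern; Allow wins on ties.
    If no rule matches, the path is allowed."""
    best = None  # (pattern length, is_allow) -- max() favors Allow on ties
    for verb, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), verb == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]
```

With a group like `allow /` plus `disallow /admin/`, the path /admin/users is blocked because the seven-character /admin/ pattern outranks the one-character / pattern.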
Build mode generates a new robots.txt from form input — one or more groups of User-agent + Allow/Disallow rules, a Sitemap reference, and an optional AI crawler block preset that adds Disallow rules for the well-known AI training crawlers. The list is current as of 2026 and covers OpenAI, Anthropic, Common Crawl, Google (via its Google-Extended training opt-out token), Perplexity, Apple (via Applebot-Extended), ByteDance, and Meta.
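For illustration, a file built with a few of the preset's user-agents plus a sitemap might look like this (the exact list depends on which preset entries are enabled):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

Sitemap: https://example.com/sitemap.xml
```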
One important caveat: blocking AI crawlers via robots.txt only stops crawlers that respect robots.txt. The major commercial AI companies publicly commit to honoring their declared user-agents, but enforcement is on the honor system. For stricter control, combine robots.txt with Cloudflare's "AI Audit" or "Block AI Bots" rules, which enforce at the network edge.
Frequently asked questions
How does the tester decide allow vs block?
It implements Google’s robots.txt parsing algorithm: find the most specific User-agent group whose token appears in the requested user-agent string, then within that group apply the longest matching pattern (Allow wins on tie). This matches Googlebot’s behaviour for the cases that matter.
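The group-selection step can be sketched like this — a minimal sketch under the assumption that `groups` maps each User-agent token to its list of rules:

```python
def select_group(groups: dict, user_agent: str) -> list:
    """Pick the most specific group: the longest User-agent token that
    occurs (case-insensitively) in the requested user-agent string,
    falling back to the '*' group, then to an empty rule list."""
    ua = user_agent.lower()
    matching = [token for token in groups if token != "*" and token.lower() in ua]
    if matching:
        return groups[max(matching, key=len)]
    return groups.get("*", [])
```

So a request from "Mozilla/5.0 (compatible; AhrefsBot/7.0)" picks the AhrefsBot group, while any other agent falls back to the * group.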
What does the AI crawler preset block?
GPTBot (OpenAI), ChatGPT-User, OAI-SearchBot, CCBot (Common Crawl, used by many AI training datasets), anthropic-ai and ClaudeBot (Anthropic), Claude-Web, Google-Extended (Google’s flag for AI training opt-out, separate from Googlebot), PerplexityBot, Applebot-Extended, meta-externalagent (Meta), Bytespider (ByteDance), Diffbot, and Omgilibot. This is the current set of well-known AI training crawlers.
Will blocking AI crawlers actually stop AI training?
For crawlers that respect robots.txt — yes. For crawlers that don’t — no. robots.txt is a polite request, not enforcement. Most reputable AI companies do honor it (OpenAI, Anthropic, Google all publicly committed to respecting their declared user-agents). For stricter control, combine robots.txt blocks with server-side user-agent rejection and a Cloudflare AI bot block rule.
What pattern syntax is supported?
`*` matches any sequence of characters. `$` at the end anchors to end-of-URL. So `Disallow: /*.pdf$` blocks all PDFs but not PDF-named directories. Other characters are matched literally. Patterns are implicitly anchored to the start of the path.
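A quick way to see the `$` anchor in action — a sketch with a hypothetical `matches` helper that applies the translation rules above, not an official API:

```python
import re

def matches(pattern: str, path: str) -> bool:
    """robots.txt pattern match: '*' becomes '.*', a trailing '$' anchors
    the end, everything else is literal, matching starts at the path start."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "^" + "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(matches("/*.pdf$", "/docs/report.pdf"))  # True: path ends in .pdf
print(matches("/*.pdf$", "/files.pdf/index"))  # False: .pdf is not at the end
```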
Why does `Disallow: /admin/` not block `/admin` (without trailing slash)?
Because robots.txt patterns are literal prefix matches: `/admin/` does not match the URL `/admin`. To block both, use either `Disallow: /admin` (no trailing slash, which matches `/admin` and `/admin/...`) or write two rules.
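You can check this behavior with Python's standard-library parser, which implements the same literal prefix matching (though it does not support `*` wildcards in paths):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

print(rp.can_fetch("*", "/admin/users"))  # False: /admin/ is a prefix of the path
print(rp.can_fetch("*", "/admin"))        # True: /admin/ is not a prefix of /admin
```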