SEO & AI Engine Optimization Framework · May 2026

robots.txt & Crawl Budget: directives, allow/disallow, crawl prioritization

The Bot Access Control Layer, the AI Crawler Decision Matrix, and the Operational Reference for Crawl Budget Diagnosis

A comprehensive operational reference for robots.txt directive engineering, the 2026 AI crawler access matrix (30+ named bots across training, search, and user agent categories), sitemap declaration patterns, the crawl budget concept and when it actually matters, server log analysis for crawl diagnostics, and the layered defenses that protect staging, admin, and private content. This document is a tactical companion to framework-technicalseo.md, which lists robots.txt in its audit checklist but offers no depth on the directives themselves or on the AI bot decisions that now define editorial posture. Dual purpose: install reference and audit document.

Cross stack implementation note: the code samples in this framework are written in plain HTML and nginx config for clarity. For React, Vue, Svelte, Next.js, Nuxt, SvelteKit, Astro, Hugo, 11ty, Remix, WordPress, Shopify, and Webflow equivalents of the build time robots.txt generation patterns, see framework-cross-stack-implementation.md. For pure client rendered SPAs see framework-react.md. For framework specific sitemap generation see framework-nextjs.md and framework-wordpress.md.


1. Document Purpose

1.1 What This Document Is

This is the canonical operational reference for the robots.txt file and the crawl budget concept. In 2020 the document would have fit on two pages. In 2026 it is a framework of its own, because the bot landscape has tripled, AI crawler access has become an editorial decision rather than a technical default, and the consequences of getting the directives wrong have moved from "lost some crawl budget" to "your site is invisible to ChatGPT, Claude, and Perplexity."

The file at /robots.txt is the single most consequential 500 bytes on a website. It tells every named crawler on the public web which paths it may request. Well behaved bots fetch it before crawling and refresh their cached copy regularly (RFC 9309 says a cached copy should not be used for more than 24 hours). It is the only public statement a site makes about its bot access posture. A misconfigured line in robots.txt can remove a site from search results within 72 hours. A correctly configured robots.txt is invisible. The asymmetry is the reason this framework exists.

1.2 What This Document Is Not

It is not a robots.txt generator. It is not a CMS plugin manual. It is not a list of "bots to block" copied from another article. It is the operational reasoning behind every line a builder might write in a robots.txt file, so the builder can defend the decision when a client asks why GPTBot is allowed but Bytespider is not.

It is also not a security mechanism. Anything genuinely sensitive belongs behind authentication, not behind a Disallow rule. The robots.txt file is a public, advisory protocol read by polite crawlers. Hostile actors ignore it. Several "AI" crawlers in 2025 have been documented ignoring it. Section 8 covers the layered defense pattern that actually protects content.

1.3 Three Operating Modes

Mode A, Install Mode. Author a robots.txt for a new site or rewrite an existing one. Follow Sections 2 through 12 in order. Most engagements land here.

Mode B, Audit Mode. Evaluate an existing robots.txt and the surrounding crawl posture against the rubric. Skip to Section 13.

Mode C, Hybrid Mode. Audit first, install fixes for the failing rubric items. The default for an inherited site.

1.4 How Claude Code CLI Should Consume This Document

  1. Read Section 2 and collect client variables. The AI crawler posture and faceted navigation flag determine which downstream sections matter most.
  2. Read Sections 3 and 4 to ground the directive reasoning. These sections explain why each line in a robots.txt exists.
  3. Read Section 5 and the four pillars context in Section 5.13 to make the AI crawler decision matrix concrete.
  4. Apply Sections 6 through 8 to author the robots.txt, declare sitemaps, and configure the staging defense layer.
  5. Apply Section 9 only if the site exceeds the thresholds described there. Most sites do not.
  6. Apply Section 10 when crawl issues surface in Google Search Console (GSC) or in server logs.
  7. Apply Section 11 if the client wants llms.txt infrastructure alongside robots.txt. The two files complement each other but solve different problems.
  8. Use Section 12 as a final QA gate. Section 13 is the audit rubric. Section 14 is the maintenance cadence.

1.5 Conflict Resolution Rules

| Conflict | Rule |
| --- | --- |
| Existing robots.txt blocks /wp-content/ or /assets/ and the client wants Google to render JavaScript correctly | Critical. Unblock immediately. Googlebot needs CSS and JS to render. Source: Google Search Central documentation, 2024. |
| Client wants to block all AI crawlers because "AI is stealing our content" | Discuss the trade. A blanket block removes citation as well as training. Search and answer surfaces are the new top of funnel for many query categories. Section 5 lays out the decision matrix. |
| Existing robots.txt uses Crawl-delay: 10 and the client is concerned about Googlebot load | Googlebot ignores Crawl-delay. Bingbot honors it. Google retired the Search Console crawl rate limiter in January 2024; to slow Googlebot temporarily, serve 503 or 429 responses. Source: Google Search Central, 2024 update on unsupported directives. |
| Existing robots.txt has Disallow: / left over from a staging deploy | Emergency. The site is being de-indexed. Deploy the fix within minutes, request a recrawl in GSC, and monitor coverage for 7 days. |
| Client wants noindex enforced on a directory and tries to use Disallow to achieve it | Educational moment. Disallow blocks crawling, not indexing. Section 3 explains the distinction. Use noindex meta or X-Robots-Tag instead. |
| Faceted navigation is generating millions of crawled URLs and burning crawl budget | Apply the Section 9 parameter handling pattern: URL fragments, Disallow for non indexable parameters, canonicals for indexable ones. |
| Client has a 250 page brochure site and wants to "optimize crawl budget" | No. Crawl budget does not matter at that scale. Spend the engagement on content and links. Section 9.1 has the threshold. |

1.6 Required Tools


2. Client Variables Intake YAML

# ============================================
# ROBOTS.TXT AND CRAWL BUDGET FRAMEWORK
# CLIENT VARIABLES INTAKE
# ============================================

# --- Site Identity (REQUIRED) ---
primary_domain: ""
hosting_environment: ""               # bubbles_debian_nginx | shared_host | other
sitemap_url: ""                       # absolute URL of the primary sitemap or sitemap index
sitemap_url_count: 0                  # how many URLs are in the sitemap
total_indexable_url_count: 0          # rough estimate, used for crawl budget threshold
faceted_nav_present: false            # ecommerce filters, faceted search, parameter heavy
parameter_examples: []                # e.g. ["?color=", "?sort=", "?page=", "?utm_"]

# --- Search Crawler Posture (REQUIRED) ---
google_search_console_verified: false
bing_webmaster_verified: false
crawl_issues_in_gsc: false            # any spike in "Discovered - currently not indexed" or "Crawled - currently not indexed"
crawl_issues_in_logs: false           # 5xx clusters, repeated 404s, bot hammering
historical_googlebot_drop: false      # have crawl rates fallen unexpectedly

# --- AI Crawler Posture (REQUIRED) ---
# For each, set "allow", "disallow", or "undecided"
ai_posture_gptbot: ""
ai_posture_oai_searchbot: ""
ai_posture_chatgpt_user: ""
ai_posture_claudebot: ""
ai_posture_claude_searchbot: ""
ai_posture_claude_user: ""
ai_posture_perplexitybot: ""
ai_posture_perplexity_user: ""
ai_posture_google_extended: ""
ai_posture_applebot_extended: ""
ai_posture_meta_externalagent: ""
ai_posture_ccbot: ""
ai_posture_bytespider: ""
ai_posture_diffbot: ""
ai_posture_imagesiftbot: ""
ai_posture_amazonbot: ""
ai_posture_seo_tool_crawlers: ""      # AhrefsBot, SemrushBot, MJ12bot, DotBot
client_reasoning_documented: false    # did the client provide a rationale for blocks

# --- Sensitive Areas (REQUIRED) ---
admin_paths: []                       # /wp-admin/, /admin/, /backend/
staging_subdomains: []                # staging.example.com, dev.example.com
auth_required_paths: []               # /account/, /dashboard/, /portal/
parameter_only_pages: []              # search results, internal filters

# --- llms.txt Posture (RECOMMENDED) ---
llms_txt_present: false
llms_full_txt_present: false
llms_txt_strategy: ""                 # "skip", "minimal", "comprehensive"

# --- Server Log Access (REQUIRED for audit) ---
nginx_log_path: ""                    # /var/log/nginx/access.log on bubbles
log_retention_days: 0
log_analysis_tool: ""                 # "awk_grep", "goaccess", "screaming_frog", "none"

# --- Decision Authority (REQUIRED) ---
client_owns_robots_decision: true     # is the agency authorized to change robots.txt
review_cadence: ""                    # "monthly", "quarterly", "ad_hoc"

The AI crawler posture block is the editorial heart of the intake. Every "disallow" decision should have a documented business reason. Vague concerns about "AI training" without specifics are the most common driver of self inflicted invisibility damage.
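
As a rough illustration of how the intake values could drive the file, the sketch below turns a posture mapping into robots.txt groups. The function name build_robots_groups, the subset of intake keys it covers, and the choice to emit a comment instead of a rule for undecided bots are all assumptions made for this example, not part of the framework.

```python
# Hypothetical mapping from intake keys to User-agent tokens (illustrative subset).
INTAKE_KEY_TO_TOKEN = {
    "ai_posture_gptbot": "GPTBot",
    "ai_posture_oai_searchbot": "OAI-SearchBot",
    "ai_posture_claudebot": "ClaudeBot",
    "ai_posture_perplexitybot": "PerplexityBot",
    "ai_posture_bytespider": "Bytespider",
}


def build_robots_groups(intake: dict) -> str:
    """Render one User-agent group per bot whose posture is decided."""
    parts = []
    undecided = []
    for key, token in INTAKE_KEY_TO_TOKEN.items():
        posture = intake.get(key, "")
        if posture == "allow":
            parts.append(f"User-agent: {token}\nAllow: /")
        elif posture == "disallow":
            parts.append(f"User-agent: {token}\nDisallow: /")
        else:
            # Empty or "undecided": do not guess a rule.
            undecided.append(token)
    if undecided:
        # Surface open decisions as a comment so they are visible in review.
        parts.append("# Undecided, no rule emitted: " + ", ".join(undecided))
    return "\n\n".join(parts) + "\n"
```

Calling build_robots_groups({"ai_posture_gptbot": "disallow", "ai_posture_oai_searchbot": "allow"}) yields a GPTBot group with Disallow: /, an OAI-SearchBot group with Allow: /, and a trailing comment naming the bots that still need a decision.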


3. What Robots.txt Is and What It Isn't

3.1 The Single Most Common Misunderstanding

The robots.txt file blocks crawling. It does not block indexing.

When a Disallow rule prevents a crawler from requesting a URL, the crawler does not fetch the page body. It does not see the content. It does not see the <meta name="robots"> tag. It cannot apply noindex. But if the URL is linked from somewhere else on the public web, Google can still index the URL itself, with no content, showing only the URL and possibly a snippet from the link anchor text. The result is a URL appearing in search results with the SERP message "A description for this result is not available because of this site's robots.txt." Source: Google Search Central documentation, robots.txt introduction, 2024.

The correct way to prevent indexing is to allow crawling and serve noindex. The crawler fetches the page, sees the directive, and removes the URL from the index. If the URL is also blocked in robots.txt, the noindex is invisible to the crawler and ineffective.

This is the failure mode behind a large fraction of "this page keeps showing up in search but I told Google not to" support threads. The fix is always the same. Remove the Disallow. Serve noindex. Wait for the recrawl.
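
The conflict described above can be checked mechanically. The following is a minimal sketch, assuming the list of paths that serve noindex is already known (for example from a crawl export); find_hidden_noindex is a hypothetical helper name. It relies on Python's standard urllib.robotparser, which does simple prefix matching and does not implement the * and $ wildcards, so it is only suitable for plain path rules.

```python
from urllib import robotparser


def find_hidden_noindex(robots_txt: str, noindex_paths: list, agent: str = "Googlebot") -> list:
    """Return the noindex paths that `agent` is not allowed to crawl.

    A crawler never sees the noindex on these paths, so the directive
    has no effect until the Disallow is removed.
    """
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in noindex_paths if not parser.can_fetch(agent, path)]


robots = "User-agent: *\nDisallow: /drafts/\n"
hidden = find_hidden_noindex(robots, ["/drafts/old-post", "/thanks/"])
# hidden == ["/drafts/old-post"]: that page's noindex is invisible to the crawler
```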

3.2 What Robots.txt Actually Controls

It controls:

  1. Whether a polite bot requests a path
  2. Which rule group a bot applies to itself, selected by matching its User-agent token
  3. The location of one or more sitemaps the bot can use to discover content
  4. (For Bingbot and a handful of other crawlers) the throttling applied via Crawl-delay

It does not control:

  1. Whether a URL is indexed (use noindex for that)
  2. Whether a hostile bot honors any of the directives (it is advisory, not enforceable)
  3. The crawl rate of Googlebot (it ignores Crawl-delay; Section 4.6 covers the alternatives)
  4. Anything about authenticated paths (use HTTP auth and access controls)

3.3 The Three Failure Modes

The file at /robots.txt produces three kinds of failure when misconfigured:

Over blocking. The directive blocks paths that should be crawled. CSS and JavaScript are the canonical examples. Googlebot needs them to render. Blocking them produces "Crawled but indexed without rendering" entries in GSC and ranking degradation. A November 2024 industry study found robots.txt configuration errors on 73 percent of audited sites and attributed search visibility damage to them. Source: industry SEO research, November 2024 study.

Under blocking. The directive allows paths that should be blocked. Parameterized URLs proliferate. Internal search result pages get crawled and sometimes indexed. Crawl budget burns on duplicate content. Faceted navigation explodes into millions of URLs. Section 9 covers this in depth.

Wrong tool selected. The operator uses Disallow to try to prevent indexing, when Disallow only blocks crawling. Section 3.1 describes this failure mode. The fix is to use the correct tool, which is noindex, not Disallow.

3.4 The File Mechanics

The file must be served at the protocol and host root, exactly at /robots.txt. Subdirectory placement does nothing. Per RFC 9309, the Robots Exclusion Protocol specification published in 2022, a crawler fetches https://example.com/robots.txt and applies the rules it finds there to any URL under https://example.com/. A separate file is required for https://www.example.com/ if that host is canonical, and a separate file for each subdomain. Source: RFC 9309, Robots Exclusion Protocol, 2022.

The response must be HTTP 200, Content-Type: text/plain (most crawlers tolerate other text types but text/plain is the spec), and well under 500 KB. Google truncates the file at 500 KB and ignores everything after. Source: Google Search Central, How Google Interprets the robots.txt Specification, 2024.

A 4xx response is interpreted as "no robots.txt" and the crawler proceeds to crawl everything it can find (Google makes an exception for 429, which it treats like a server error). A 5xx response is interpreted as "site is unhealthy, back off": RFC 9309 tells crawlers to assume a complete disallow while robots.txt is unreachable, so crawls may be temporarily suspended. A 5xx response sustained for hours can produce dramatic drops in crawl rate and indexing volume.
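
The file mechanics above reduce to two small decisions: which robots.txt governs a URL, and how a fetch status is interpreted. The sketch below illustrates both. The function names and the returned labels are made up for this example, and the 429 branch follows Google's documented behavior rather than anything every crawler does.

```python
from urllib.parse import urlsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def crawl_policy_for_status(status: int) -> str:
    """Map a robots.txt fetch status to the crawler behavior described above."""
    if 200 <= status < 300:
        return "parse"        # apply the rules found in the body
    if status == 429:
        return "back_off"     # assumption: treated like a server error, as Google documents
    if 400 <= status < 500:
        return "allow_all"    # treated as "no robots.txt"
    if 500 <= status < 600:
        return "back_off"     # site looks unhealthy, crawling may pause
    return "unspecified"      # e.g. a redirect that still has to be followed
```

For instance, robots_url_for("https://www.example.com/blog/post?x=1") returns "https://www.example.com/robots.txt", which is a different file from the one at https://example.com/robots.txt.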


4. The Robots Exclusion Protocol

4.1 The Five Directives

The RFC 9309 specification defines a minimal directive set. The five directives that matter operationally:

User-agent: <name>
Disallow: <path-prefix>
Allow: <path-prefix>
Sitemap: <absolute-url>
Crawl-delay: <seconds>

Of these, Crawl-delay is non standard. RFC 9309 does not include it. Some crawlers honor it. Most modern crawlers do not. Section 4.6 covers the support matrix.

4.2 User-agent

The User-agent line declares which crawler the following rules apply to. The token is matched case insensitively against the User-Agent string the crawler sends in its HTTP request. The special token * matches every crawler not otherwise named.

User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /admin/

Above, Googlebot is told to skip /private/. Every other crawler is told to skip /admin/. Critically, Googlebot is not told to skip /admin/, because once a User-agent group matches, that crawler ignores other groups. This is one of the most consequential rules in the protocol and the source of many subtle bugs. Source: Google Search Central, robots.txt specification, 2024.

A crawler with no matching User-agent group falls back to the * group. A crawler with a matching User-agent group ignores the * group entirely. If Bingbot should also skip /admin/, the rule must be repeated under the Bingbot group:

User-agent: Bingbot
Disallow: /admin/
Crawl-delay: 5

User-agent: *
Disallow: /admin/
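
The group selection rule is easy to confirm with Python's standard urllib.robotparser. This sketch parses the first example from this section and asks what each crawler may fetch; ExampleBot is a made-up crawler name standing in for any bot without its own group.

```python
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /admin/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matched its own group, so the * group's /admin/ rule does not apply to it.
googlebot_admin = parser.can_fetch("Googlebot", "/admin/")        # True
googlebot_private = parser.can_fetch("Googlebot", "/private/x")   # False
# A crawler with no named group falls back to the * group.
other_admin = parser.can_fetch("ExampleBot", "/admin/")           # False
```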

4.3 Disallow

Disallow blocks the crawler from requesting paths matching the prefix. Path matching is case sensitive, unlike User-agent matching, which is case insensitive. The prefix /admin matches /admin, /admin/, /admin/users, and /administrator. To be specific about a directory boundary, end the path with a slash:

Disallow: /admin/        # matches /admin/anything but NOT /administrator

The wildcard * matches any sequence of characters. The terminator $ matches end of URL. Examples:

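Disallow: /*.pdf$        # block any URL ending in .pdf
Disallow: /*?utm_        # block URLs whose first query parameter starts with utm_
Disallow: /*&utm_        # block URLs where a utm_ parameter follows another parameter
Disallow: /search?       # block /search?q=anything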

Google and Bing both support * and $. Less mainstream crawlers may not. RFC 9309 does not require wildcard support. Source: RFC 9309, 2022, and Google Search Central robots.txt specification, 2024.
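
To make the wildcard semantics concrete, here is a minimal sketch that translates a rule path into a regular expression, with * as "any sequence of characters" and a trailing $ as "end of URL". It is an illustration of the matching behavior described above, not the code any crawler actually runs, and the function names are invented for the example.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' becomes '.*'; every other character is matched literally.
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("^" + body + ("$" if anchored else ""))


def rule_matches(pattern: str, url_path: str) -> bool:
    """True if the rule pattern matches the path (plus query string)."""
    return pattern_to_regex(pattern).match(url_path) is not None


rule_matches("/*.pdf$", "/docs/guide.pdf")             # True
rule_matches("/*.pdf$", "/docs/guide.pdf?download=1")  # False, $ requires end of URL
rule_matches("/*?utm_", "/page?utm_source=x")          # True
rule_matches("/*?utm_", "/page?id=1&utm_source=x")     # False, utm_ is not the first parameter
```

The last case is worth noting when writing parameter rules: a pattern anchored on ?utm_ does not catch a utm_ parameter that follows another parameter.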

Disallow: with empty value is equivalent to "allow everything" for that User-agent group. Disallow: / is "block everything." The two characters between them are the difference between a healthy site and an invisible one.

4.4 Allow

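Allow overrides a broader Disallow for a specific path. The rule of thumb is that the more specific rule (the one with the longer matching path) wins; when an Allow and a Disallow match equally, Google applies the less restrictive Allow. Source: Google Search Central, robots.txt specification, 2024.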

User-agent: Googlebot
Disallow: /private/
Allow: /private/public-summary/

Googlebot is blocked from /private/ except for the specific subpath /private/public-summary/. The Allow rule is necessary because Disallow: /private/ would otherwise also block /private/public-summary/.

Most major crawlers support Allow. CCBot, Bingbot, Googlebot, GPTBot, ClaudeBot, and PerplexityBot all honor it. Older or simpler crawlers sometimes ignore Allow and obey only Disallow. The safe pattern is to write Disallow rules narrowly enough that Allow overrides are rarely needed.
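
The "longest match wins" rule can be sketched in a few lines. This example handles plain prefixes only (no wildcards) and resolves ties in favor of Allow, which is how Google documents it; is_allowed is a hypothetical helper, and other crawlers may resolve conflicts differently.

```python
def is_allowed(rules: list, url_path: str) -> bool:
    """Decide crawl permission for one User-agent group.

    rules is a list of (directive, path_prefix) tuples,
    e.g. [("Disallow", "/private/"), ("Allow", "/private/public-summary/")].
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not url_path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_len
        tie_prefers_allow = len(prefix) == best_len and is_allow
        if longer or tie_prefers_allow:
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/private/"), ("Allow", "/private/public-summary/")]
is_allowed(rules, "/private/public-summary/intro")  # True, the longer Allow wins
is_allowed(rules, "/private/notes")                 # False
```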

4.5 Sitemap

Sitemap declares the absolute URL of a sitemap or sitemap index. Multiple Sitemap lines are allowed. The directive is not associated with any User-agent group. It is a top level statement.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Sitemap: https://example.com/sitemap-images.xml

Google, Bing, and most other major crawlers honor the directive. Source: sitemaps.org specification, 2008, with the Sitemap autodiscovery extension added in 2007. Section 6 covers sitemap declaration in depth.
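
Because Sitemap lines sit outside any User-agent group, they can be collected with a simple line scan. The sketch below is one way to do that during an audit; sitemap_urls is a hypothetical helper, and it treats the field name case insensitively.

```python
def sitemap_urls(robots_txt: str) -> list:
    """Collect every Sitemap URL declared in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop trailing comments
        field, sep, value = line.partition(":")    # split on the first colon only
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


sample = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/sitemap-news.xml
"""
declared = sitemap_urls(sample)
```

Here declared holds both sitemap URLs in file order, regardless of where in the file the lines appear.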

4.6 Crawl-delay and Request-rate

Crawl-delay requests a minimum number of seconds between successive requests from the named crawler. Request-rate requests a fractional crawl rate (requests per second). Neither is part of RFC 9309. Support varies:

| Crawler | Crawl-delay | Request-rate | Notes |
| --- | --- | --- | --- |
| Googlebot | Ignored | Ignored | Source: Google Search Central, robots.txt unsupported directives, 2024. Google retired the Search Console crawl rate limiter in January 2024; temporary 503 or 429 responses are the documented way to slow Googlebot. |
| Bingbot | Honored, integer seconds only | Ignored | Source: Bing Webmaster blog, 2009 and subsequent confirmation 2024. |
| GPTBot | Honored | Ignored | Source: OpenAI bot documentation, 2024. |
| ClaudeBot | Honored | Ignored | Source: Anthropic Help Center, claude bot documentation, 2025 and February 2026 update. |
| PerplexityBot | Honored | Ignored | Source: Perplexity bot documentation, 2025. |
| CCBot | Honored | Ignored | Source: Common Crawl CCBot documentation, 2024. |
| YandexBot | Honored | Honored | One of the few. |
| Bytespider | Inconsistent | Inconsistent | Documented to ignore in 2023 industry reports. |

The operational implication for a self hosted nginx site on Bubbles or similar Debian infrastructure is that Crawl-delay is a useful brake on Bingbot and on some AI crawlers when they hammer the server. It does nothing for Googlebot. Google retired the Search Console crawl rate limiter in January 2024; for the rare case when reducing Googlebot crawl is needed, Google's documentation recommends temporarily returning 500, 503, or 429 responses.
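
To check which delay a file actually declares for a given crawler, Python's standard urllib.robotparser exposes crawl_delay(). The sketch below reads the Bingbot example from Section 4.2; ExampleBot is a made-up name for a crawler that falls back to the * group. This shows what the file says, not whether a particular crawler honors it.

```python
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: Bingbot
Disallow: /admin/
Crawl-delay: 5

User-agent: *
Disallow: /admin/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

bing_delay = parser.crawl_delay("Bingbot")       # 5, from the Bingbot group
other_delay = parser.crawl_delay("ExampleBot")   # None, the * group declares no delay
```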

4.7 What Google Specifically Ignores

In addition to Crawl-delay and Request-rate, Google ignores several other directives that appear in legacy robots.txt files, including noindex and nofollow.

Source: Google Search Central, robots.txt unsupported directives documentation, 2024. The April 2024 update consolidated the list and confirmed Google is unlikely to add support for any of them.

A November 2024 industry survey of robots.txt files in the wild found that approximately one in every thousand files contained the unsupported noindex directive, almost always alongside conflicting supported directives. Source: industry SEO research, 2024 robots.txt audit study, sample of approximately 200,000 sites.


5. AI Crawler Access Strategy

5.1 The 2026 Landscape

In 2020 the bot landscape was small. Googlebot, Bingbot, Applebot, a handful of social network preview bots, and the major SEO tool crawlers. A robots.txt could fit in five lines.

In 2026 the landscape includes more than 30 named AI crawlers across training, search and retrieval, user agent, and undeclared categories. New bots emerge multiple times per year. The decision of which to allow and which to disallow is now an editorial position, comparable to deciding which industry publications a business pitches to. Source: industry crawler reports, 2025 ecosystem mapping and Cloudflare Radar 2025 Year in Review, January 2026.

The five functional categories of AI user agents in 2026:

Training crawlers. Fetch content to train large language models. Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google DeepMind, technically a token not a crawler), Applebot-Extended (Apple Intelligence, also a token), Meta-ExternalAgent (Meta AI), CCBot (Common Crawl). Blocking these prevents training inclusion. It does not prevent search or answer surface inclusion.

Search and retrieval crawlers. Fetch content to build real time retrieval indexes. Examples: OAI-SearchBot (OpenAI's search index), Claude-SearchBot (Anthropic's search), PerplexityBot (Perplexity's index), Bingbot (which powers Microsoft Copilot answers). Blocking these removes the site from the corresponding answer engine entirely. Source: OpenAI bot documentation, 2024 and Anthropic Help Center, 2025.

User triggered fetchers. Fetch a URL when a human asks the assistant to read it. Examples: ChatGPT-User (OpenAI), Claude-User (Anthropic), Perplexity-User (Perplexity), Google-Agent (Google's user triggered fetcher). These are not bulk crawlers. They visit a single URL when prompted. Blocking them means a user asking "summarize this page" gets a "cannot access" message. Source: OpenAI bot documentation, 2024.

Opt out tokens. Robots.txt directives that control training only. Examples: Google-Extended, Applebot-Extended. These are not crawlers. They make no HTTP requests. They never appear in server logs. They are tokens recognized by existing crawlers (Googlebot in the case of Google-Extended) as instructions about how the fetched content may be used downstream. Source: industry crawler reports, 2025 user agent reference.

Undeclared and masquerading traffic. Crawlers that ignore robots.txt or spoof browser user agent strings. Documented examples: Bytespider's aggressive crawl behavior; an August 2025 industry crawler report documenting stealth crawlers using generic Chrome user agents and rotating ASN ranges to evade no crawl directives. Source: industry crawler reports, August 2025 stealth crawler analysis.
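
For log analysis, the categories above can be applied to raw User-Agent strings with a simple lookup. The sketch below uses only bot names listed in this section; the category labels and the function name are invented for the example. Opt out tokens are left out on purpose, since they never appear in server logs, and undeclared traffic by definition cannot be identified this way.

```python
# Declared bots from this section, grouped by functional category.
BOT_CATEGORIES = {
    "training": ["GPTBot", "ClaudeBot", "Meta-ExternalAgent", "CCBot"],
    "search_retrieval": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Bingbot"],
    "user_triggered": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def categorize_user_agent(user_agent: str) -> str:
    """Return the category of a declared AI bot, or "unknown"."""
    ua = user_agent.lower()
    for category, tokens in BOT_CATEGORIES.items():
        if any(token.lower() in ua for token in tokens):
            return category
    return "unknown"  # browsers, other bots, or masquerading traffic


gptbot_ua = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot"
category = categorize_user_agent(gptbot_ua)   # "training"
```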

5.2 The Decision Matrix

For each named crawler, the business decides allow or disallow. The decision has trade offs. The matrix below captures the trade for the major bots a 2026 business will encounter:

| Bot | Operator | Category | Allow Benefit | Allow Cost | Default Recommendation |
| --- | --- | --- | --- | --- | --- |
| Googlebot | Google | Search | Search visibility | None | Allow |
| Bingbot | Microsoft | Search | Bing visibility, powers Copilot | None | Allow |
| GPTBot | OpenAI | Training | ChatGPT model includes site knowledge | Content used in training | Allow for most businesses, disallow for licensed content |
| OAI-SearchBot | OpenAI | Search | ChatGPT search cites the site | None | Allow |
| ChatGPT-User | OpenAI | User triggered | User prompted summaries work | None | Allow |
| ClaudeBot | Anthropic | Training | Claude model includes site knowledge | Content used in training | Allow for most businesses |
| Claude-SearchBot | Anthropic | Search | Claude search cites the site | None | Allow |
| Claude-User | Anthropic | User triggered | User prompted summaries work | None | Allow |
| PerplexityBot | Perplexity | Search | Perplexity answers cite the site | None | Allow. Perplexity is the most citation rich AI engine. |
| Perplexity-User | Perplexity | User triggered | User prompted fetches work | None | Allow |
| Google-Extended | Google | Token | Site included in Gemini training and AI Overviews training | Content used in training | Allow. Disallowing affects AI Overview citation eligibility. |
| Applebot-Extended | Apple | Token | Site included in Apple Intelligence training | Content used in training | Allow for most businesses |
| Meta-ExternalAgent | Meta | Training | Meta AI model includes site knowledge | Content used in training | Allow |
| CCBot | Common Crawl | Training | Site available in every Common Crawl based model (GPT, Claude, Llama, many others) | Content widely distributed | Allow. Blocking CCBot removes the site from foundational LLM training data across the industry. Source: Common Crawl FAQ, 2024. |
| Bytespider | ByteDance | Training | TikTok/Doubao model knowledge | Content used in training, aggressive crawl rate | Allow with caution. Documented to ignore robots.txt and to hammer servers. Use server level rate limiting. Source: industry crawler reports, 2023 to 2025. |
| Diffbot | Diffbot | Knowledge graph | Site present in Diffbot's knowledge graph (used by multiple downstream services) | Content scraped | Allow unless content licensing prevents it |
| ImagesiftBot | ImageSift | Training | Site images included in image AI training | Image content used | Allow for image rich sites |
| Amazonbot | Amazon | Training | Site knowledge available to Amazon AI | Content used in training | Allow |

The matrix is a starting point, not a verdict. Every business has different constraints. A law firm in a jurisdiction with content licensing concerns may disallow all training crawlers and allow only search retrieval bots. A B2B SaaS company aiming for ChatGPT and Perplexity citation may allow everything except undeclared traffic. A media company with paid content may disallow everything that does not pay for licensed access.

5.3 The Common Mistake: Block All AI

A pattern observed in 2024 and 2025 in robots.txt files across many sites: a blanket block of every AI crawler the operator can name, often copy pasted from a "block AI bots" blog post.

# anti-pattern: blanket AI block, no rationale
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: anthropic-ai
User-agent: Meta-ExternalAgent
User-agent: Applebot-Extended
Disallow: /

The intent is "do not let AI use our content." The effect is "do not let AI cite our content." The same directive is applied to training crawlers and search retrieval crawlers alike, because the operator does not distinguish between the categories. The result is that the business loses visibility on every AI surface where users are increasingly asking questions, and gains nothing, either because the major training crawlers had already harvested the content before the block was added or because Common Crawl had already harvested it and that snapshot is still being distributed.

If the business genuinely does not want its content surfacing in AI answers, this is a coherent (if costly) editorial position. If the business simply has not thought through the trade, this is self inflicted invisibility. The audit question is always: what specific business outcome does the disallow protect?
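
One way to make the audit question concrete is to list which citation related bots a given file shuts out entirely. The sketch below does that with Python's standard urllib.robotparser; the bot list is taken from the search and user triggered categories in Section 5.1, and blocked_citation_bots is a hypothetical helper. The parser does prefix matching only, which is sufficient for detecting a site wide Disallow: /.

```python
from urllib import robotparser

# Search/retrieval and user triggered tokens named in Section 5.1.
CITATION_BOTS = [
    "OAI-SearchBot", "ChatGPT-User",
    "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
]


def blocked_citation_bots(robots_txt: str) -> list:
    """Return the citation related bots that may not fetch the site root."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in CITATION_BOTS if not parser.can_fetch(bot, "/")]


restrictive = "User-agent: *\nDisallow: /\n"
blocked = blocked_citation_bots(restrictive)   # every bot in CITATION_BOTS
```

Each name the function returns is a surface where the site cannot be cited, which gives the client conversation a specific list to justify rather than a general stance on "AI".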

5.4 The Permissive Default Pattern

For most small to medium businesses without licensing constraints, the recommended default is permissive across all polite crawlers, with explicit handling of the few documented bad actors:

# Search crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Applebot
Allow: /

# AI search and retrieval (citation engines)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# AI training crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: CCBot
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: Diffbot
Allow: /

# Aggressive or problematic crawlers
User-agent: Bytespider
Disallow: /

# SEO tool crawlers (decided per client)
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: DotBot
Disallow: /

# Default
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /*?utm_
Disallow: /*?fbclid=
Disallow: /*?gclid=
Disallow: /*?sessionid=

Sitemap: https://example.com/sitemap.xml

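The pattern says: search engines welcome, AI search and answer engines welcome, AI training crawlers welcome, the documented aggressive crawler is blocked, the SEO tool crawlers are blocked to deny competitors free backlink intelligence (this is a defensible default for small businesses, not a universal rule). The default * group catches everyone else with a sensible parameter and admin block list. Because a crawler that matches a named group ignores the * group (Section 4.2), the named crawlers above do not inherit the admin and parameter blocks; repeat those Disallow lines inside each named group where they should apply.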

5.5 The Restrictive Default Pattern

For a business with content licensing concerns, regulatory constraints, or paid content models, the inverse pattern is also coherent:

# Allow only search crawlers that drive paid clicks
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Allow citation engines (they drive clicks to source)
User-agent: PerplexityBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Block training crawlers (no compensation for training value)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Default: deny anything we have not explicitly allowed
User-agent: *
Disallow: /

Sitemap: https://example.com/sitemap.xml

This pattern is defensible for publishers, paywalled content sites, and businesses with strict content licensing requirements. The default Disallow: / is the unusual move. It signals that the site does not want general purpose crawling, which can affect aggregators and feed readers.

5.6 The Citation Optimized Pattern

For a business explicitly optimizing for AI search citation (the dominant pattern for thatdeveloperguy.com style consulting engagements), the recommended posture is fully permissive for citation engines and declared training crawlers, with aggressive crawlers and competitor SEO tools blocked:

# Allow all polite search crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Applebot
Allow: /

# Allow all citation surfaces
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Allow training crawlers (citation builds on training corpus presence)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: CCBot
Allow: /

# Block undeclared and aggressive
User-agent: Bytespider
Disallow: /

# Block competitor SEO intelligence by default
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

# Default
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /*?utm_
Disallow: /*?fbclid=

Sitemap: https://example.com/sitemap.xml

This is the default for the agency's own properties and for client engagements where the goal is AI citation. Cross reference: framework-aicitations.md for the broader AI citation strategy.

5.7 GPTBot Specifically

OpenAI publishes three named bots. Each has a distinct purpose and can be controlled independently. Source: OpenAI bot documentation, 2024.

# Training crawler
User-agent: GPTBot
# Identifies as: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot

# Search retrieval crawler (powers ChatGPT search results)
User-agent: OAI-SearchBot
# Identifies as: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot

# User triggered fetcher (when a person asks ChatGPT to read a page)
User-agent: ChatGPT-User
# Identifies as: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot

OpenAI documents a delay of roughly 24 hours between a robots.txt change and its systems reflecting that change. Source: OpenAI bot documentation, 2024. Plan accordingly when launching or rolling back a posture change.

The agentic browser product ("ChatGPT Operator") fetches pages while taking action on behalf of a user, which places it in the user triggered fetcher category. Blocking it via Disallow removes the user's ability to ask the agent to interact with the site. As of late 2025, the Operator agent uses the same ChatGPT-User token, though OpenAI documentation indicates the agent identification may evolve.

5.8 ClaudeBot and the Anthropic Trio

Anthropic publishes three named bots as of the February 2026 documentation update. Source: Anthropic Help Center, claude bot documentation, February 2026.

# Training crawler
User-agent: ClaudeBot
# Replaces legacy anthropic-ai and claude-web tokens

# Search retrieval crawler
User-agent: Claude-SearchBot

# User triggered fetcher
User-agent: Claude-User

The legacy anthropic-ai and claude-web tokens are deprecated as of the February 2026 update but appear in older logs. Including them in robots.txt does no harm.

# Legacy tokens, deprecated but safe to include
User-agent: anthropic-ai
User-agent: claude-web

The Claude Code CLI (the tool authoring this very document, for example) identifies as claude-code. It is a developer tool that fetches URLs when an engineer issues a command. Blocking it has limited effect; most Claude Code usage is on private intranets or with explicit user consent.

5.9 PerplexityBot Specifically

Perplexity publishes two named bots. Source: Perplexity bot documentation, 2025.

# Index building crawler
User-agent: PerplexityBot
# IP addresses published at perplexity.com/perplexitybot.json

# User triggered fetcher
User-agent: Perplexity-User
# IP addresses published at perplexity.com/perplexity-user.json

A blocked PerplexityBot does not entirely remove the domain from Perplexity. Per the company's documentation, when PerplexityBot is disallowed, Perplexity may still index the domain name, headline, and a brief factual summary. The full text indexing is what the directive blocks. Source: Perplexity bot documentation, 2025.

An August 2025 industry crawler report documented Perplexity stealth crawlers using generic Chrome user agent strings and rotating ASN ranges to bypass robots.txt directives. Source: industry crawler reports, August 2025 stealth crawler analysis. Perplexity disputed the findings. The operational implication for a self hosted site is that robots.txt is necessary but not sufficient for sites that intend to block Perplexity. Server level user agent filtering and rate limiting via nginx are complementary defenses.
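Where a vendor publishes its crawler IP ranges, a user agent claim can be cross checked against them before it is trusted. The following is a minimal Python sketch, assuming the published file is JSON with a prefixes list holding ipv4Prefix and ipv6Prefix entries. That shape, the function names, and the sample prefixes are assumptions for illustration; inspect the real file at the published URL before relying on it.

```python
import ipaddress
import json

# Hypothetical payload: the JSON shape and the prefixes are illustrative only.
PUBLISHED = json.loads("""
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv4Prefix": "198.51.100.0/25"}]}
""")


def load_networks(payload: dict) -> list:
    """Turn the published prefix list into ip_network objects."""
    networks = []
    for entry in payload.get("prefixes", []):
        for key in ("ipv4Prefix", "ipv6Prefix"):
            if key in entry:
                networks.append(ipaddress.ip_network(entry[key]))
    return networks


def is_published_ip(remote_addr: str, networks: list) -> bool:
    """True when the request IP falls inside any published range."""
    ip = ipaddress.ip_address(remote_addr)
    return any(ip in net for net in networks)


networks = load_networks(PUBLISHED)
for addr in ("198.51.100.42", "198.51.100.200"):
    verdict = "inside published ranges" if is_published_ip(addr, networks) else "NOT in published ranges"
    print(f"{addr}: {verdict}")
```

A request that claims to be PerplexityBot but arrives from outside the published ranges is a candidate for the server level filtering and rate limiting described above.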

5.10 Google-Extended and Applebot-Extended

These are tokens, not crawlers. They never appear in server logs. They are read by Googlebot and Applebot respectively to determine how the fetched content may be used downstream. Source: industry crawler reports, 2025 user agent reference.

# Opt out of Google AI training (Gemini Apps, Vertex AI)
# Does NOT affect Googlebot search inclusion
User-agent: Google-Extended
Disallow: /

# Opt out of Apple Intelligence training
# Does NOT affect Applebot search inclusion
User-agent: Applebot-Extended
Disallow: /

Disallowing Google-Extended removes the site from Gemini's training data and may affect citation eligibility in Google AI Overviews. The exact correlation between Google-Extended permission and AI Overview citation is not formally documented by Google but multiple 2025 industry analyses suggest a positive relationship. Source: industry SEO research, 2025 AIO citation correlation studies.

5.11 Meta-ExternalAgent

Meta publishes meta-externalagent as its primary AI training crawler. Source: Meta crawler documentation, 2024 to 2025. A distinct preview crawler, the older facebookexternalhit, continues to fetch Open Graph metadata for link previews on Facebook and Instagram. The training crawler and the preview crawler should be controlled separately:

# Meta AI training (block to opt out of Meta AI training)
User-agent: Meta-ExternalAgent
Disallow: /

# Meta link preview (do not block, used for OG previews on Facebook/Instagram shares)
User-agent: facebookexternalhit
Allow: /

The Meta-WebIndexer crawler is documented as Meta's search retrieval bot for Meta AI search results. Source: Meta crawler documentation, 2025. Allowing it supports citation in Meta AI responses.

5.12 CCBot and the Common Crawl Question

CCBot is Common Crawl's web crawler. The resulting Common Crawl dataset is published monthly and is one of the foundational training data sources for virtually every major large language model, including GPT, Claude, Llama, PaLM, and many others. Source: Common Crawl FAQ, 2024.

Blocking CCBot removes the site from this distribution. The block applies forward (future Common Crawl snapshots will not include the site) but does not retroactively remove the site from existing snapshots already in distribution. A site that was crawled by CCBot in 2022 and later blocks CCBot in 2026 will still appear in the 2022 snapshot used to train models released through 2026.

The operational implication is that blocking CCBot in 2026 is mostly symbolic for sites with longer histories. For new sites or sites that have only recently come online, the block prevents future training corpus inclusion across the entire LLM industry. The trade is real: training corpus presence correlates positively with citation propensity in 2026 AI engines.

5.13 Four Pillars Context

The AI crawler access decision is the foundation of the GEO pillar (Generative Engine Optimization) in the four pillars visibility architecture: SEO (classic ranking, ten blue links), AEO (Answer Engine Optimization, featured snippets, voice answers), AIO (AI Overview Optimization, Google AI Overviews specifically; see framework-aioverviews.md), GEO (Generative Engine Optimization, broader AI citation across ChatGPT, Claude, Perplexity, and other engines; see framework-aicitations.md).

The SEO pillar requires Googlebot and Bingbot access. The AEO pillar requires Googlebot access plus the schema and content patterns covered in framework-featuredsnippets.md. The AIO pillar requires Googlebot access plus Google-Extended access (for AI Overview citation eligibility), plus the AIO substrate patterns in framework-aioverviews.md. The GEO pillar requires permissive access to GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, Meta-ExternalAgent, and CCBot, plus the GEO patterns covered in framework-aicitations.md and framework-searchgpt.md.

A robots.txt that blocks OAI-SearchBot has surrendered the ChatGPT search surface, and one that blocks GPTBot has opted out of OpenAI training; Section 5.7 covers the distinction between the two. A robots.txt that blocks Google-Extended has reduced AIO citation eligibility. A robots.txt that blocks PerplexityBot has surrendered full text indexing on Perplexity. These are not theoretical losses. They show up as zero citations on tracked query sets across the affected engines within 30 to 90 days of the block. Cross reference: framework-perplexityspaces.md for Perplexity specific strategy.


6. Sitemap Declaration in Robots.txt

6.1 The Sitemap Directive

The Sitemap: directive in robots.txt is the most widely supported sitemap discovery mechanism. Every major search engine and AI search crawler reads it. The directive is independent of User-agent groups; it is a top level statement.

Sitemap: https://example.com/sitemap.xml

The URL must be absolute, including scheme and host. Relative paths are ignored. The URL should match the canonical host (HTTPS, with or without www per site convention).
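The absolute URL requirement is easy to check mechanically. Below is a small Python sketch that pulls every Sitemap: line out of a robots.txt body and flags the ones a crawler would ignore. The function name and the sample file are hypothetical, and the comment stripping is deliberately naive.

```python
from urllib.parse import urlsplit


def sitemap_urls(robots_txt: str) -> list[tuple[str, bool]]:
    """Return (url, is_absolute) for every Sitemap: line in a robots.txt body."""
    found = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "sitemap":
            continue
        url = value.strip()
        parts = urlsplit(url)
        is_absolute = parts.scheme in ("http", "https") and bool(parts.netloc)
        found.append((url, is_absolute))
    return found


SAMPLE = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
sitemap: /sitemap-news.xml
"""

for url, ok in sitemap_urls(SAMPLE):
    print(("ok      " if ok else "IGNORED ") + url)
```

The second declaration in the sample is relative, so it is reported as ignored.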

6.2 Multiple Sitemaps

Multiple Sitemap: lines are allowed. Each declares an additional sitemap or sitemap index.

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-images.xml
Sitemap: https://example.com/sitemap-news.xml

For sites with separate sitemaps for distinct content types (pages, posts, images, news, video, hreflang), declaring all of them ensures crawlers discover all of them. The alternative is a sitemap index file:

Sitemap: https://example.com/sitemap-index.xml

The sitemap index references the individual sitemaps. For sites with more than 50,000 URLs or more than a single content type, the sitemap index pattern is the standard. Source: sitemaps.org specification, 2008 with subsequent updates.

6.3 Sitemap Validation

Before declaring a sitemap in robots.txt, verify it:

# Verify the sitemap returns 200 with correct content type
curl -I https://example.com/sitemap.xml

# Expect: HTTP/2 200, Content-Type: application/xml or text/xml

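# Verify the sitemap parses as XML (xmllint ships in the libxml2-utils package on Debian)
curl -s https://example.com/sitemap.xml | xmllint --noout - && echo "well formed"

# Count URLs declared (grep -o counts every <loc>, even when the file is minified onto one line)
curl -s https://example.com/sitemap.xml | grep -o '<loc>' | wc -l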

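The sitemap must return HTTP 200 with an XML content type, parse as well formed XML, list absolute URLs on the canonical host, and stay within the sitemaps.org limits of 50,000 URLs and 50 MB uncompressed per file.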

<changefreq> and <priority> are ignored by Google. Source: Google Search Central, sitemap best practices, 2024. Including them is harmless but unnecessary.

6.4 The Sitemap Index Pattern

For sites with more than 50,000 indexable URLs or with separate content types, the sitemap index is the standard structure. The index is itself an XML file, served at a canonical location, referenced from robots.txt:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-05-13</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-05-13</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-images.xml</loc>
    <lastmod>2026-05-13</lastmod>
  </sitemap>
</sitemapindex>

Each referenced sitemap is then a standard URL list sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page-1/</loc>
    <lastmod>2026-05-13</lastmod>
  </url>
  <url>
    <loc>https://example.com/page-2/</loc>
    <lastmod>2026-05-13</lastmod>
  </url>
</urlset>
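The split into child sitemaps plus an index can be generated at build time. The following Python sketch, under stated assumptions, chunks a URL list at the 50,000 URL per file limit and emits the index XML. The sitemap-N.xml file names, the base URL, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit from the sitemaps.org protocol


def chunk(urls: list[str], size: int = MAX_URLS) -> list[list[str]]:
    """Split the URL list into groups that each fit in one sitemap file."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(base: str, n_files: int, lastmod: str) -> str:
    """Return sitemap index XML referencing sitemap-1.xml through sitemap-N.xml."""
    root = ET.Element("sitemapindex", xmlns=NS)
    for i in range(1, n_files + 1):
        node = ET.SubElement(root, "sitemap")
        ET.SubElement(node, "loc").text = f"{base}/sitemap-{i}.xml"
        ET.SubElement(node, "lastmod").text = lastmod
    # encoding="utf-8" returns bytes that include the XML declaration
    return ET.tostring(root, encoding="utf-8", xml_declaration=True).decode("utf-8")


urls = [f"https://example.com/page-{i}/" for i in range(120_000)]
parts = chunk(urls)
print([len(p) for p in parts])
print(build_index("https://example.com", len(parts), "2026-05-13"))
```

With 120,000 URLs the sketch produces three child sitemaps, and only the index needs to be declared in robots.txt.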

6.5 Robots.txt Sitemap Placement

The Sitemap: directive can appear anywhere in the file. Convention places it at the bottom, after all User-agent blocks. This is purely stylistic; crawlers read it regardless of position.

User-agent: *
Allow: /
Disallow: /admin/

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

6.6 Sitemap Submission Beyond Robots.txt

Declaring the sitemap in robots.txt makes it discoverable. Direct submission accelerates discovery:

  1. Submit the sitemap URL in Google Search Console, Sitemaps section
  2. Submit the sitemap URL in Bing Webmaster Tools, Sitemaps section
  3. For Yandex (rarely needed for US English language sites), submit via Yandex Webmaster
  4. For Naver (Korean market), submit via Naver Webmaster Tools
  5. For IndexNow supporting engines (Bing, Yandex, Seznam, Naver, DuckDuckGo via Bing), publish content via IndexNow ping rather than waiting for sitemap crawl

Cross reference: framework-technicalseo.md Section 8 for the full sitemap structure reference and framework-migration.md for sitemap handling during URL migrations.


7. User Agent Specificity Patterns

7.1 The Three Architectural Patterns

A robots.txt can be structured three ways. Each has trade offs.

Pattern 1: Broad allow with specific disallow. The default group allows everything; specific User-agent groups disallow specific bots. This is the recommended default for most sites in 2026.

User-agent: Bytespider
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap.xml

The named bots get the Disallow: /. Everyone else (including all the AI search and citation crawlers) falls back to the * group which is permissive.

Pattern 2: Specific allow with broad disallow. The default group denies everything; specific User-agent groups allow specific bots. This is the restrictive pattern, suitable for paywalled content sites or sites with licensing constraints.

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /

Sitemap: https://example.com/sitemap.xml

The named bots get explicit allow. Everyone else gets blocked by the * group default Disallow: /. This pattern requires maintenance every time a new desirable bot emerges; if Perplexity launches a new variant or Apple launches an AI search bot, the file needs updating or the new bot is blocked by default.

Pattern 3: Explicit allow lists across the board. Every bot of interest is explicitly named with an explicit posture. The default * group is restrictive or absent.

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /

This pattern is the most expressive and the most maintenance heavy. Recommended for sites with strong editorial opinions about which crawlers are welcome.

7.2 The Specificity Hierarchy

When a crawler matches multiple User-agent groups, only the most specific group applies. Matching is case insensitive against the crawler's product token (Googlebot, Googlebot-News), not the full User-Agent header, and the longest matching group name wins.

User-agent: Googlebot                          # matches Googlebot, Googlebot-News, Googlebot-Image
Disallow: /private/

User-agent: Googlebot-News                     # more specific, takes precedence for Googlebot-News
Disallow: /draft-articles/

A Googlebot-News crawler sees the second group and applies only its rule (Disallow: /draft-articles/). It does not also apply the broader Googlebot rule. This is the most consequential interaction in the protocol and the source of many subtle bugs. Source: Google Search Central, robots.txt specification, 2024.

The implication: if a more specific group is added for a bot, it must replicate any relevant rules from broader groups, because the broader groups no longer apply to it. The Bing documentation makes this explicit, recommending repetition of relevant directives in bot specific sections. Source: Bing Webmaster blog, 2018 robots.txt tip.
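The selection rule can be modelled in a few lines. The Python sketch below picks the single group a crawler obeys, assuming a simplified prefix test between the group name and the crawler's product token. Real crawlers define their own token fallbacks, so treat this as an illustration of the most specific wins rule rather than any engine's actual parser; the function name is hypothetical.

```python
def select_group(group_names: list[str], product_token: str) -> str:
    """Return the one User-agent group a crawler obeys (most specific match wins)."""
    token = product_token.lower()
    best = ""
    for name in group_names:
        candidate = name.lower()
        # Simplification: a group applies when its name is a prefix of the crawler token.
        if candidate != "*" and token.startswith(candidate) and len(candidate) > len(best):
            best = name
    if best:
        return best
    # No named group matched: fall back to the wildcard group if the file has one.
    return "*" if "*" in group_names else ""


GROUPS = ["Googlebot", "Googlebot-News", "*"]
for token in ("Googlebot-News", "Googlebot-Image", "GPTBot"):
    print(f"{token} obeys group: {select_group(GROUPS, token)}")
```

Because exactly one group is returned, any rule that exists only in the broader Googlebot group or in the * group is invisible to Googlebot-News until it is copied into the Googlebot-News group.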

7.3 The Safest Defaults

For an unfamiliar site where the engagement is brief and the editorial posture has not been worked through, the safest starting robots.txt is:

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /backend/
Disallow: /*?utm_
Disallow: /*?fbclid=
Disallow: /*?gclid=
Disallow: /*?sessionid=

Sitemap: https://example.com/sitemap.xml

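This file allows every crawler by default, asks polite crawlers to skip the admin and backend paths, blocks URLs whose query string starts with a tracking or session parameter, and declares the sitemap. It is a defensible default that can be customized once the engagement scope deepens and the editorial posture is worked through. It is also the file that should be in place during the first day of any inherited site engagement before any other changes are made.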
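The wildcard rules in a file like this are easy to misread. The Python sketch below is one way to model the two pattern characters (* for any run of characters, a trailing $ for end of URL); the function names are hypothetical and this is not any crawler's actual parser.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces and join them with ".*" where the pattern had "*".
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """Patterns match from the start of the path (re.match anchors at position 0)."""
    return robots_pattern_to_regex(pattern).match(path) is not None


for path in ("/page?utm_source=email", "/page?color=red&utm_source=email"):
    print(path, "| /*?utm_ ->", matches("/*?utm_", path), "| /*?*utm_ ->", matches("/*?*utm_", path))
```

Under this model, /*?utm_ matches only when utm_ directly follows the question mark. A tracking parameter that appears later in the query string needs the /*?*utm_ form that Section 9.4 uses for product listings.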


8. Protecting Staging and Admin Environments

8.1 The Layered Defense Pattern

Anything genuinely sensitive belongs behind authentication, not behind a Disallow rule. The robots.txt file is a public document. Adding Disallow: /admin/ advertises the existence of an admin path. Adding Disallow: /staging/ advertises the existence of a staging environment. Hostile actors read robots.txt for reconnaissance. Source: industry security research, 2020 to 2025 reconnaissance pattern analysis.

The layered defense pattern uses four layers, each with a distinct purpose:

Layer 1: HTTP basic authentication. nginx serves a 401 challenge for protected paths. Bots and humans without credentials get nothing. Together with Layer 2, this is what actually prevents access.

Layer 2: IP allowlist. nginx restricts the protected paths to a small set of known IPs (office, VPN, deploy server). Even with credentials, requests from unknown IPs are rejected with 403.

Layer 3: X-Robots-Tag headers. For paths that must be public but should not appear in any index (rendered PDFs, JSON data files, version controlled assets), nginx sends X-Robots-Tag: noindex, nofollow in the HTTP response.

Layer 4: Robots.txt Disallow. As a final advisory layer, the robots.txt file disallows the path for polite crawlers. This is the least effective layer in isolation; it is the cherry on top of the actual defenses.

8.2 Layer 1: nginx HTTP Basic Authentication

On Bubbles or any Debian nginx host, basic auth is a few lines per server block:

# Inside /etc/nginx/sites-available/example.com
server {
    listen 443 ssl http2;
    server_name example.com;

    # Public site
    location / {
        root /var/www/sites/example.com/public;
        try_files $uri $uri/ =404;
    }

    # Protected admin (basic auth)
    location /admin/ {
        auth_basic "Admin Area";
        auth_basic_user_file /var/www/sites/example.com/.htpasswd;
        root /var/www/sites/example.com/private;
    }
}

Create the password file:

sudo apt install apache2-utils
sudo htpasswd -c /var/www/sites/example.com/.htpasswd admin
sudo chown www-data:www-data /var/www/sites/example.com/.htpasswd
sudo chmod 640 /var/www/sites/example.com/.htpasswd

Validate and reload:

sudo nginx -t && sudo systemctl reload nginx

Anyone hitting /admin/ without credentials gets a 401 challenge. Bots see the 401 and do not enter the protected area. The path is invisible to indexing because crawlers cannot access it.

8.3 Layer 2: IP Allowlist

For sites where the admin interface should only be reachable from specific IPs (office, VPN, deploy server), nginx can restrict by IP before the basic auth challenge fires:

location /admin/ {
    # IP allowlist
    allow 192.0.2.0/24;        # office network
    allow 198.51.100.42;       # deploy server
    allow 100.64.0.0/10;       # tailscale CGNAT range
    deny all;

    # Basic auth on top of IP allowlist
    auth_basic "Admin Area";
    auth_basic_user_file /var/www/sites/example.com/.htpasswd;

    root /var/www/sites/example.com/private;
}

Requests from disallowed IPs get a 403. Requests from allowed IPs get the basic auth challenge. Validate and reload:

sudo nginx -t && sudo systemctl reload nginx

8.4 Layer 3: X-Robots-Tag for Non HTML Resources

For paths that must be publicly accessible but should not be indexed (rendered PDFs at /downloads/, JSON data files at /data/, build artifacts at /assets/build/), HTTP headers signal indexing intent without preventing access:

# PDF documents that should not be indexed
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex, nofollow" always;
}

# JSON data files
location /data/ {
    add_header X-Robots-Tag "noindex, nofollow" always;
}

# Build artifacts
location /assets/build/ {
    add_header X-Robots-Tag "noindex" always;
}

The always flag ensures the header is sent even on error responses. Validate and reload:

sudo nginx -t && sudo systemctl reload nginx

Googlebot, Bingbot, and most AI crawlers honor X-Robots-Tag. Source: Google Search Central, robots meta tag and X-Robots-Tag specification, 2024.

8.5 Layer 4: Robots.txt Disallow as Advisory Layer

After layers 1 through 3 are in place, the robots.txt Disallow becomes the final advisory layer:

User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /backend/
Disallow: /data/
Disallow: /downloads/private/

This layer assumes the polite crawler is also blocked by the upstream layers; the Disallow simply asks the crawler not to attempt a request in the first place, saving server resources. Hostile actors ignore it. Authentication and IP allowlist are what actually defend.

8.6 Staging Subdomains

The pattern most often missed: a separate robots.txt for the staging subdomain. Staging sites usually live at staging.example.com or dev.example.com. They are separate hosts. The robots.txt at the production domain does not control them. They need their own file at https://staging.example.com/robots.txt:

User-agent: *
Disallow: /

# No Sitemap line: staging URLs should not be advertised to crawlers

Combined with basic auth and (preferably) IP allowlist:

# /etc/nginx/sites-available/staging.example.com
server {
    listen 443 ssl http2;
    server_name staging.example.com;

    # Allow only known IPs
    allow 192.0.2.0/24;
    allow 198.51.100.42;
    allow 100.64.0.0/10;
    deny all;

    # Basic auth on top
    auth_basic "Staging";
    auth_basic_user_file /var/www/sites/staging.example.com/.htpasswd;

    root /var/www/sites/staging.example.com/public;

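    # Also tell polite crawlers to stay out. return runs in the rewrite phase,
    # before the allow/deny and auth_basic checks, so this file is served to everyone.
    location = /robots.txt {
        default_type text/plain;   # sets Content-Type; add_header would send a second one
        return 200 "User-agent: *\nDisallow: /\n";
    }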

    # Belt and suspenders X-Robots-Tag
    add_header X-Robots-Tag "noindex, nofollow" always;
}

Validate and reload:

sudo nginx -t && sudo systemctl reload nginx

The trifecta of IP allowlist, basic auth, and Disallow: / means staging environments do not leak into search indexes. Cross reference: framework-security.md for broader security posture and framework-migration.md for the staging to production transition pattern (where Disallow: / becomes Allow: / on launch day).


9. The Crawl Budget Concept

9.1 When It Matters

Crawl budget is the number of URLs Googlebot crawls in a given period. Google defines it as "the set of URLs that Google can and wants to crawl." Source: Google Search Central, Crawl Budget Management documentation, 2024.

For most sites, crawl budget does not matter. Gary Illyes of Google's Search Relations team stated on Search Off the Record that "most people don't have to worry about it, and when I say most, it's probably over 90 percent of sites on the internet don't have to worry about it." Source: Google Search Off the Record podcast, 2024.

The Google documentation lists three scenarios where crawl budget management does matter:

Large sites with 1 million+ unique pages and content that changes moderately often (once a week or more).

Medium to large sites with 10,000+ unique pages and content that changes very rapidly (daily or more).

Sites with a large portion of their total URLs classified as "Discovered but currently not indexed" in Google Search Console.

Source: Google Search Central, Crawl Budget Management for Large Sites, 2024. These are not exact thresholds. A 5,000 page ecommerce site with faceted navigation generating 500,000 URL variations may have a real crawl budget problem. A 100,000 page brochure site with no parameters and stable content may have no crawl budget problem.

The threshold to assess is not page count but URL count, including all parameterized variants Googlebot actually requests. Section 10 covers the diagnostic.

9.2 What Crawl Budget Is Made Of

Google's documentation decomposes crawl budget into two factors:

Crawl rate. How many requests per second Googlebot can make to the site without harming server performance. Determined by Google's measurement of server response times and error rates.

Crawl demand. How many URLs Googlebot wants to crawl on the site. Determined by the perceived freshness needs of the content, the popularity of URLs (more popular URLs get crawled more often), and site wide signals like recent site moves.

Crawl budget is the smaller of crawl rate and crawl demand. A site with fast servers (high crawl rate) but boring static content (low crawl demand) gets crawled less than a site with the same crawl rate and frequently updated content. A site with high crawl demand but slow servers gets throttled to protect server health.

The only ways to increase crawl budget per Google's documentation are to increase the serving capacity of the site (faster servers, lower response times) and to increase the value of the content (more popular pages get more crawl demand). Source: Google Search Central, Crawl Budget Management, 2024.

9.3 Signs of Crawl Budget Waste

In Google Search Console, several signals indicate crawl budget is being wasted on low value URLs:

Many URLs in the "Crawled but currently not indexed" bucket of the Pages report. Google fetched the URL, looked at the content, decided not to add it to the index. The crawl was wasted from an indexing perspective.

Many URLs in the "Discovered but currently not indexed" bucket. Google knows about the URL (from a link or sitemap) but has not crawled it yet because crawl budget is exhausted on other URLs. This is the most direct signal of crawl budget waste.

Slow propagation of new content to the index. A new page sits at "Discovered but not indexed" for weeks before being crawled.

In the Crawl Stats report (under Settings in GSC), high crawl rates on parameter heavy URLs, internal search results, faceted nav variants, or pagination URLs. These are URLs that rarely should be in the index but consume crawl budget at the expense of pages that should be.

The Crawl Stats report is the single best instrument for diagnosing crawl budget waste. It shows daily crawl request counts by file type, response code, purpose (discovery vs refresh), and Googlebot type. Spikes in 404 responses or in low value URL types signal waste.

9.4 Faceted Navigation, the Primary Cause

Faceted navigation is the most common source of crawl budget waste. A product listing with filters for color, size, price, brand, sort order, and pagination produces a combinatorial explosion of URLs. A modest faceted product listing can produce millions of URLs from a few hundred actual products. Source: Google Search Central blog, Crawling December 2024, Faceted Navigation post.

/products/?color=red
/products/?color=red&size=large
/products/?color=red&size=large&sort=price-asc
/products/?color=red&size=large&sort=price-asc&page=2
/products/?color=red&size=large&sort=price-asc&page=2&utm_source=email

Each variant is a distinct URL to Googlebot. Each one consumes crawl budget. Most are near duplicates of each other. The signal Google extracts from crawling them is minimal. The signal lost from not crawling the actual product pages is substantial.
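The combinatorial growth is simple arithmetic. A worked example in Python, where every facet size is a hypothetical number chosen for illustration:

```python
from math import prod

# Hypothetical facet sizes for one category listing; every number is illustrative.
facets = {
    "color": 12,
    "size": 8,
    "brand": 25,
    "price_band": 6,
    "sort": 4,
    "page": 10,
}

# Each facet is either absent from the URL or set to one of its values,
# so it contributes (values + 1) choices to the URL space.
url_variants = prod(count + 1 for count in facets.values())
print(f"{url_variants:,} crawlable URL variants")  # prints 1,171,170 crawlable URL variants
```

That is 13 x 9 x 26 x 7 x 5 x 11 = 1,171,170 URLs for a single listing, before counting tracking parameters or alternate parameter orderings.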

Google's 2024 guidance on faceted navigation is direct:

If faceted URLs do not need to be indexed, block them in robots.txt. Source: Google Search Central blog, Crawling December 2024.

If facet parameters control behavior that should be invisible to search (sorting, pagination of duplicate content), use URL fragments (#) instead of query parameters. Crawlers ignore fragments.

If faceted URLs do need to be indexed (a specific facet combination is a meaningful entry point), use canonicalization to consolidate signals and limit crawlable combinations.

Return 404 for facet combinations with no results. Empty result pages should not consume crawl budget.

The pattern for a parameter heavy site:

# Block parameter combinations that should not be crawled
User-agent: *
Disallow: /products/*?*color=
Disallow: /products/*?*sort=
Disallow: /products/*?*page=
Disallow: /products/*?*utm_

# Allow the canonical product listings
Allow: /products/
Allow: /products/$

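With Disallow rules in robots.txt covering the non indexable parameters, and rel="canonical" on every listing variant that remains crawlable pointing to the canonical no parameter version, the crawl budget waste is contained. The two mechanisms cover different URLs: a crawler never sees the canonical tag on a URL it is disallowed from fetching.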
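To see how the Allow and Disallow lines in the pattern above interact, the following Python sketch evaluates them with the longest match precedence Google documents (the longest matching pattern wins, and Allow wins a tie). It is a simplified model with hypothetical function names, not a reference implementation.

```python
import re


def _to_regex(pattern: str) -> re.Pattern:
    """* matches any run of characters; a trailing $ anchors the end of the URL."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest matching pattern wins; Allow wins a tie; no match means allowed."""
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if not _to_regex(pattern).match(path):
            continue
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed


RULES = [
    ("Disallow", "/products/*?*color="),
    ("Disallow", "/products/*?*sort="),
    ("Disallow", "/products/*?*page="),
    ("Disallow", "/products/*?*utm_"),
    ("Allow", "/products/"),
    ("Allow", "/products/$"),
]

for path in ("/products/", "/products/blue-widget/",
             "/products/?color=red&size=large", "/products/?size=large"):
    print(f"{path}: {'crawlable' if is_allowed(RULES, path) else 'blocked'}")
```

Under this model /products/?size=large stays crawlable, because size is not among the disallowed parameters. Every facet parameter that should stay out of the crawl needs its own Disallow line.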

9.5 Parameter Handling Patterns

The four patterns for URL parameters, ranked by safety:

URL fragments. Use # instead of ? when the parameter controls only display behavior (sort order, pagination of an effective duplicate). Crawlers ignore the fragment. The URL /products#sort=price is the same URL as /products to a crawler.

Robots.txt Disallow. Use for parameters that produce noindex worthy content. Disallow: /*?utm_ blocks tracking parameters. Disallow: /*?sessionid= blocks session IDs. The crawler does not waste a request on these URLs.

Canonical tag. Use for parameters that produce a meaningful but duplicate page. The ?color=red variant has a canonical pointing to the no parameter version. Crawlers still request the URL but understand the canonical, consolidating signals.

Noindex meta. Use for parameters that produce a meaningful but non indexable page. The page is fetched, the noindex is read, the URL is excluded from the index. Crawl budget is still consumed for the fetch but the index stays clean.

Cross reference: framework-ecommerceseo.md for faceted navigation strategy in ecommerce contexts.

9.6 The 10,000 URL Threshold Heuristic

A working heuristic for when crawl budget actually deserves engineering attention:

URL count               Content freshness   Crawl budget priority
Under 1,000             Any                 Not a concern
1,000 to 10,000         Stable              Not a concern
1,000 to 10,000         Daily updates       Modest concern
10,000 to 100,000       Stable              Modest concern
10,000 to 100,000       Daily updates       Real concern
100,000 to 1,000,000    Any                 Real concern
Over 1,000,000          Any                 Critical concern

The right column corresponds to engagement intensity. "Not a concern" means do not spend audit hours on crawl budget. "Modest concern" means run the diagnostic in Section 10 once and act on findings. "Real concern" means quarterly crawl budget review. "Critical concern" means dedicated crawl budget engineering with monthly monitoring.
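For audit tooling, the table can be encoded as a small lookup. This Python sketch assumes half open bands (a count of exactly 10,000 falls in the 10,000 to 100,000 row), which the table itself leaves ambiguous; the function name is hypothetical.

```python
def crawl_budget_priority(url_count: int, daily_updates: bool) -> str:
    """Map total crawlable URL count and content freshness to the heuristic's priority."""
    if url_count < 1_000:
        return "Not a concern"
    if url_count < 10_000:
        return "Modest concern" if daily_updates else "Not a concern"
    if url_count < 100_000:
        return "Real concern" if daily_updates else "Modest concern"
    if url_count <= 1_000_000:
        return "Real concern"
    return "Critical concern"


# Count URLs, not pages: a 5,000 page store whose facets expose 500,000 URLs
# is assessed on the 500,000 figure.
print(crawl_budget_priority(5_000, daily_updates=False))
print(crawl_budget_priority(500_000, daily_updates=False))
```

The input should be the number of URLs Googlebot can actually request, including parameterized variants, as Section 9.1 notes.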

Most thatdeveloperguy.com client engagements (local services, small B2B, content sites under 5,000 pages) sit in "not a concern." Time spent on crawl budget is time not spent on content and links. Cross reference: framework-technicalseo.md Section 3.3 for the original threshold guidance.


10. Diagnosing Crawl Issues

10.1 Server Log Analysis on Bubbles

The nginx access log is the canonical source of truth for what crawlers actually do on a site. On Bubbles (Debian + nginx), the log is at /var/log/nginx/access.log by default, with rotation producing access.log.1, access.log.2.gz, etc.

The default nginx log format includes the remote IP, timestamp, request, status, bytes, referer, and user agent. The user agent is the field used to identify the crawler.

Basic command line diagnostics:

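# Count requests per crawler in the most recent log (one count per request, case insensitive
# because bingbot and meta-externalagent send lowercase tokens). Google-Extended is not
# listed: it is a robots.txt token and never appears in logs.
awk -F'"' -v list='Googlebot Bingbot GPTBot OAI-SearchBot ChatGPT-User ClaudeBot Claude-SearchBot Claude-User PerplexityBot Perplexity-User CCBot Bytespider Applebot Meta-ExternalAgent AhrefsBot SemrushBot MJ12bot' '
  BEGIN { n = split(list, bots, " ") }
  { ua = tolower($6); for (i = 1; i <= n; i++) if (index(ua, tolower(bots[i]))) { hits[bots[i]]++; break } }
  END { for (b in hits) printf "%8d %s\n", hits[b], b }' /var/log/nginx/access.log \
  | sort -rn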

This produces a table of crawler request counts:

   12453 Googlebot
    3219 Bingbot
    1842 GPTBot
    1655 PerplexityBot
     921 ClaudeBot
     412 CCBot
     289 Applebot
      53 Bytespider

The output is the operational evidence of which crawlers are actually visiting. Crawlers that should be visiting but are not are diagnostic gold. A site that has been live for 90 days with zero GPTBot visits should investigate whether GPTBot can reach the site, whether the IP range is being blocked at the firewall level, and whether the robots.txt accidentally disallows it.
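The robots.txt part of that investigation can be scripted with Python's standard library. The sketch below parses a robots.txt body and reports whether each named bot may fetch the homepage. The sample file and the allowed_bots helper are hypothetical. Note that urllib.robotparser applies rules in file order and does not implement wildcards, so it is used here only for a root level check.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
Disallow: /admin/
"""


def allowed_bots(robots_txt: str, bots: list[str], url: str) -> dict[str, bool]:
    """Return {bot: may_fetch} for one URL, based on the given robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


report = allowed_bots(ROBOTS_TXT, ["GPTBot", "ClaudeBot", "Bytespider"], "https://example.com/")
for bot, ok in report.items():
    print(f"{bot}: {'allowed' if ok else 'BLOCKED'}")
```

A bot reported as allowed here but absent from the access log points the investigation toward the firewall or CDN rather than the robots.txt file.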

10.2 Crawler Response Code Analysis

The next diagnostic is response codes by crawler:

# Googlebot status code distribution
awk -F'"' '$6 ~ /Googlebot/ {split($0, a, " "); print a[9]}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn

Expected output:

   11203 200
     642 304
     358 301
     217 404
      33 503

Healthy Googlebot traffic is mostly 200s with some 304s (conditional GET successes) and some 301s (redirect chains being walked). Significant 404 counts mean Googlebot is hitting URLs that no longer exist; check internal linking and old URL handling. Significant 503 counts mean server health issues are degrading crawl rate; investigate server load.
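The same distribution can be computed for several crawlers at once. The Python sketch below assumes the nginx combined log format described in Section 10.1. The regular expression, the function name, and the sample lines are illustrative and are not taken from a real log.

```python
import re
from collections import Counter

# Combined format: ip - user [time] "request" status bytes "referer" "user agent"
LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def status_by_crawler(lines, crawlers):
    """Count (crawler, status) pairs; crawler tokens are matched case insensitively."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip lines that are not in combined format
        agent = m.group("agent").lower()
        for name in crawlers:
            if name.lower() in agent:
                counts[(name, m.group("status"))] += 1
                break  # one count per request
    return counts


SAMPLE = [
    '192.0.2.10 - - [13/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [13/May/2026:10:00:05 +0000] "GET /old-page/ HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.20 - - [13/May/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '192.0.2.30 - - [13/May/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64) Chrome/124.0 Safari/537.36"',
]

for (crawler, status), n in sorted(status_by_crawler(SAMPLE, ["Googlebot", "Bingbot"]).items()):
    print(f"{crawler} {status}: {n}")
```

For a real log, pass an open file handle in place of SAMPLE and extend the crawler list to match the bots of interest.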

10.3 Most Crawled URLs

The URLs Googlebot crawls most often are diagnostic of what Google considers important on the site:

# Top 20 URLs Googlebot fetches
awk -F'"' '$6 ~ /Googlebot/ {print $2}' /var/log/nginx/access.log \
  | awk '{print $2}' | sort | uniq -c | sort -rn | head -20

If the top URLs are the site's most important pages, crawl budget is being well allocated. If the top URLs are parameter variants, internal search results, or pagination URLs, crawl budget is being wasted on low value content. Section 9.4 covers the faceted nav remediation.

10.4 GSC Crawl Stats Report

The Google Search Console Crawl Stats report (under Settings) is the canonical Google view of crawl activity. It is not a substitute for server logs but it provides the Google specific perspective.

Three diagnostic patterns in the Crawl Stats report:

Crawl rate dropping. Indicates server health issues, robots.txt errors, or Google losing interest in the site. Investigate server response times, recent robots.txt changes, and recent indexing volume.

Discovery percentage rising. Indicates Googlebot is spending a growing share of its requests on URLs it has never fetched rather than refreshing known ones, often a signal that new content is being published but not getting indexed. Cross reference with the Pages report's "Discovered - currently not indexed" bucket.

5xx response rate rising. Server is failing on Googlebot requests. Investigate immediately; sustained 5xx during Googlebot crawls drops pages from the index. Source: Google Search Central, crawl errors documentation, 2024.

10.5 Bingbot Crawl Patterns

Bing Webmaster Tools provides a similar but less detailed view. The Crawl Information report shows crawl request counts and response code distribution. Bingbot patterns differ from Googlebot patterns in important ways:

Source: Bing Webmaster blog, crawl behavior documentation, 2018 to 2024 updates.

10.6 AI Crawler Verification

For AI crawlers, the verification question is whether the crawler is actually visiting the site after being explicitly allowed in robots.txt. The diagnostic:

# Named AI crawler visits across all retained logs (log retention must cover 30 days)
# zcat -f reads both the plain current log and gzip-compressed rotations
for bot in GPTBot OAI-SearchBot ChatGPT-User ClaudeBot Claude-SearchBot Claude-User PerplexityBot Perplexity-User CCBot Applebot Meta-ExternalAgent; do
  count=$(zcat -f /var/log/nginx/access.log* 2>/dev/null | awk -F'"' -v bot="$bot" '$6 ~ bot' | wc -l)
  echo "${bot}: ${count}"
done

Example output for a healthy AI optimized site (counts are illustrative):

GPTBot: 1842
OAI-SearchBot: 421
ChatGPT-User: 89
ClaudeBot: 921
Claude-SearchBot: 312
Claude-User: 47
PerplexityBot: 1655
Perplexity-User: 178
CCBot: 412
Applebot: 289
Meta-ExternalAgent: 234

Zero visits for a bot that should be allowed is a diagnostic. The investigation order:

  1. Verify the robots.txt actually allows the bot. Fetch robots.txt and grep.
  2. Verify the bot is reaching the server. Check firewall logs, CDN logs (if any), DDoS protection rules.
  3. Verify the site is discoverable to the bot. New sites need time and inbound links before AI crawlers find them. Submission via the engine's documented submission channel (where available) accelerates discovery.
  4. Verify no server level rate limiting is dropping the bot. Check nginx error log for connection drops.
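Step 1 of the investigation order can be scripted instead of eyeballed with grep. This is a minimal sketch using Python's standard urllib.robotparser; blocked_bots is a hypothetical helper, and fetching the live file is left to the caller. One caveat: the standard library parser applies rules in file order rather than by longest match, so on files with overlapping Allow and Disallow rules its answer can differ from Googlebot's.

```python
from urllib.robotparser import RobotFileParser

def blocked_bots(robots_txt, bots, path="/"):
    """Return the bots from `bots` that the given robots.txt text disallows for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, path)]

# Illustrative file: GPTBot is blocked, every other crawler is allowed.
EXAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_bots(EXAMPLE_ROBOTS, ["GPTBot", "ClaudeBot", "PerplexityBot"]))
```

Running the same check against the production file for every bot in the access matrix turns "verify the robots.txt actually allows the bot" into a repeatable test.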

10.7 Bot Verification

User agent strings are trivial to spoof. Verifying a crawler claiming to be Googlebot is actually Googlebot requires reverse DNS:

# Reverse DNS of a claimed Googlebot IP
host 66.249.66.1
# Expected: 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

# Forward DNS confirmation
host crawl-66-249-66-1.googlebot.com
# Expected: crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Both directions must resolve. The reverse DNS must end in .googlebot.com or .google.com. Source: Google Search Central, verifying Googlebot documentation, 2024.

For other crawlers:

A user agent claiming to be Googlebot from an IP that does not reverse resolve to .googlebot.com or .google.com is an impostor. Server level filtering can drop such traffic without impacting real Googlebot. Cross reference: framework-security.md for hostile bot defense patterns.
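The reverse and forward lookups can be combined into one check. This is a sketch of forward confirmed reverse DNS using Python's socket module; is_verified_googlebot is a hypothetical name, the resolver arguments exist only so the logic can be exercised without network access, and the default forward lookup covers IPv4 only.

```python
import socket

# Suffixes named in the verification rule above.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse=None, forward=None):
    """Forward confirmed reverse DNS: PTR must end in a Google suffix and resolve back to ip."""
    # Default resolvers use the system DNS; gethostbyname_ex returns IPv4 addresses only.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward(hostname)
    except (OSError, KeyError):
        # Lookup failures (or a missing entry in a stub resolver) mean "not verified".
        return False
```

A lookup per request is too slow for the hot path. In practice the result would be cached per IP, or the check would run offline against the distinct IPs found in the access log.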

10.8 The Three Most Common Crawl Issues

Across the audits I conduct, three patterns account for the majority of crawl issues:

Pattern 1: 5xx spike during peak hours. Server cannot keep up with combined human and crawler load. Solution: server scaling or Crawl-delay for Bingbot (does nothing for Googlebot, which auto adjusts crawl rate when 5xx rate rises).

Pattern 2: Parameter explosion in crawled URLs. Faceted nav, internal search, tracking parameters generating millions of crawled URLs. Solution: Section 9.4 faceted nav remediation.

Pattern 3: Bot blocked at the firewall. A crawler that should be allowed is being dropped at the firewall level (often because an aggressive WAF rule misclassified it as hostile). Solution: explicit allowlist for verified crawler IP ranges in firewall config.


11. llms.txt as Complement to Robots.txt

11.1 What llms.txt Is

llms.txt is a proposed standard for AI engines, designed by Jeremy Howard of Answer.AI and Fast.ai in September 2024. It is a markdown file at the domain root that provides a curated map of the site's most important content for large language models. Source: llmstxt.org specification, September 2024.

Unlike robots.txt, llms.txt is not advisory access control. It is editorial guidance. It says "if you want to understand this site, here are the canonical entry points." It contains links and descriptions, not directives.

Example structure for a small business site:

# ThatDeveloperGuy

> Independent SDVOSB web development and SEO consultancy in Bentonville, Arkansas, specializing in AI search citation optimization, technical SEO, and content first architecture.

## Primary Documentation

- [About](https://thatdeveloperguy.com/about/): Background on Joseph Anady, the founder, and the business
- [Services](https://thatdeveloperguy.com/services/): What the practice offers
- [Editorial Policy](https://thatdeveloperguy.com/editorial-policy/): How content is created and reviewed
- [AI Disclosure](https://thatdeveloperguy.com/disclosure/): How AI tools are used in client work

## Core Topics

- [SEO Frameworks](https://thatdeveloperguy.com/frameworks/): The framework library this site is best known for
- [AI Citation Optimization](https://thatdeveloperguy.com/topics/ai-citation/): Optimizing content for AI search engines
- [Technical SEO](https://thatdeveloperguy.com/topics/technical-seo/): The infrastructure layer of search visibility

## Foundational Frameworks

- [E-E-A-T](https://thatdeveloperguy.com/framework-eeat/)
- [Helpful Content System](https://thatdeveloperguy.com/framework-hcs/)
- [Knowledge Graph](https://thatdeveloperguy.com/framework-knowledgegraph/)
- [AI Citations](https://thatdeveloperguy.com/framework-aicitations/)
- [Robots.txt and Crawl Budget](https://thatdeveloperguy.com/framework-robotstxt-crawlbudget/)

## Contact

For inquiries, contact joseph.w.anady@icloud.com.

The file is human readable and machine readable. It is intentionally simple. The premise is that an LLM ingesting the file at retrieval time can quickly understand what the site is about and where its canonical content lives.
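To make "machine readable" concrete, here is a small Python sketch that reads the structure used in the example: one H1 title, one blockquote summary, and H2 sections containing link lists. parse_llms_txt is a hypothetical helper written for this illustration; it is not a reference parser for the llmstxt.org proposal.

```python
import re

# A list item of the form "- [Title](https://url): optional description"
LINK = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<note>.*))?$")

def parse_llms_txt(text):
    """Split an llms.txt document into its title, summary, and per-section links."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("> ") and doc["summary"] is None:
            doc["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif current is not None:
            match = LINK.match(line)
            if match:
                doc["sections"][current].append((match.group("title"), match.group("url")))
    return doc

SAMPLE_LLMS = """\
# Example Site

> One sentence describing the site.

## Docs

- [About](https://example.com/about/): Who we are
- [Services](https://example.com/services/)
"""

print(parse_llms_txt(SAMPLE_LLMS))
```

A parser like this doubles as a lint step: an empty title, a missing summary, or a section with no links is worth catching before the file is published.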

11.2 What llms.txt Does That Robots.txt Does Not

robots.txt controls access. llms.txt provides curation. They solve different problems and complement each other.

Robots.txt says "you may or may not crawl this path." It is access control (advisory: compliant crawlers honor it voluntarily).

Llms.txt says "if you crawl, here are the pages that matter most." It is curation (editorial).

A site with neither has no editorial guidance to AI engines beyond what they discover on their own. A site with robots.txt but no llms.txt controls access but does not curate. A site with llms.txt but no robots.txt curates without controlling access (almost always a misconfiguration). A site with both is in the strongest position.

The llms-full.txt companion file (also part of the original proposal) is a longer file containing the concatenated full text of the most important pages. It is intended for LLMs that want to ingest substantial site content without traversing many pages.

11.3 Current Adoption Reality

Adoption of llms.txt is non trivial but the effect on AI citation is contested.

A SE Ranking study analyzed approximately 300,000 domains in November 2025 and found 10.13 percent had an llms.txt file. The study's machine learning analysis found no measurable effect of llms.txt presence on AI engine citation frequency. Source: SE Ranking llms.txt adoption analysis, November 2025.

A BuiltWith tracking report from October 25, 2025 identified over 844,000 websites with llms.txt files in place. Adoption was concentrated in developer documentation sites and SEO tooling sites. Source: BuiltWith llms.txt usage statistics, October 2025.

As of mid 2025, no major LLM provider (OpenAI, Anthropic, Google, Perplexity, Meta) had officially announced that they use llms.txt files in production retrieval. Source: industry analysis of LLM provider documentation, 2025.

The operational implication: llms.txt is cheap to publish, may have downstream effects as adoption increases, and has no documented downside. The recommended posture for thatdeveloperguy.com client engagements is to publish a minimal llms.txt and skip llms-full.txt for most clients. The investment is low and the optionality is real.

For deeper guidance on llms.txt and the surrounding AI citation infrastructure, see framework-aicitations.md Section 6.2.

11.4 Robots.txt and llms.txt Coexistence

The two files coexist without interaction. They are at different paths (/robots.txt and /llms.txt), serve different purposes, and have no shared syntax.

A site can disallow GPTBot in robots.txt and still publish an llms.txt. The llms.txt will simply not be retrieved by GPTBot, because the robots.txt blocks it. OpenAI's other agents (OAI-SearchBot, ChatGPT-User) follow their own groups, and the file may still be retrieved by other engines.

A site can allow all AI crawlers in robots.txt and provide no llms.txt. AI engines will crawl the site and form their own model of its content. The llms.txt is a way to influence that model.

The pattern for a site optimizing for AI citation:

# robots.txt: permissive for citation engines (see Section 5.6)

# llms.txt at /llms.txt: curated guide for AI engines

# llms-full.txt at /llms-full.txt: optional concatenated full text

All three files are public, served at the domain root, returning 200 with text/plain (robots.txt) or text/markdown (llms.txt and llms-full.txt).

11.5 Publishing llms.txt on nginx

On a self hosted nginx site, llms.txt is a static file at the document root:

# Author the file
sudo nano /var/www/sites/example.com/public/llms.txt

# Verify it serves
curl -I https://example.com/llms.txt
# Expect: HTTP/2 200, Content-Type: text/markdown or text/plain

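nginx serves .txt files as text/plain by default, which is acceptable for llms.txt. A MIME mapping for markdown only affects files ending in .md, so it does not change how llms.txt is served:

# Add to the existing types { } block in /etc/nginx/mime.types.
# A new types { } block inside a server block replaces the inherited map.
types {
    text/markdown md;
}

To serve llms.txt itself as text/markdown, override the type per location in the server block. Use default_type rather than add_header, because add_header emits a second Content-Type header alongside the one nginx already sends:

location = /llms.txt {
    types { }
    default_type text/markdown;
    charset utf-8;
    charset_types text/markdown;
}

location = /llms-full.txt {
    types { }
    default_type text/markdown;
    charset utf-8;
    charset_types text/markdown;
}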

Validate and reload:

sudo nginx -t && sudo systemctl reload nginx

For static site generators (Hugo, Astro, Eleventy, Next.js static export), the llms.txt is a content file in the source repository, built into the output directory like any other static asset. For WordPress, a plugin or theme function generates it from selected pages.


12. Common Robots.txt Mistakes

The ten most consequential anti patterns, ranked by frequency and severity:

12.1 Blocking CSS and JavaScript

The most damaging single mistake. Googlebot needs CSS and JavaScript to render pages. Blocking /wp-content/, /wp-includes/, /assets/, or /static/ because they "are not content" prevents rendering. The page is indexed as the unrendered HTML, which often omits the navigation, schema, and visible content.

# Anti pattern
User-agent: *
Disallow: /wp-content/
Disallow: /wp-includes/

Fix: remove the directives. Googlebot renders pages with an evergreen Chromium engine, so it must be able to fetch every CSS and JS file the page depends on.

12.2 Using Disallow to Prevent Indexing

The misunderstanding from Section 3.1. Disallow blocks crawling; it does not prevent indexing. A URL with inbound links can be indexed without content even when Disallowed.

# Anti pattern: trying to noindex via Disallow
User-agent: *
Disallow: /private-but-linked-from-elsewhere/

Fix: allow crawling, serve noindex via meta tag or X-Robots-Tag HTTP header.

12.3 Case Sensitivity Errors

User agent matching is case insensitive in the protocol. Path matching is case sensitive. The common error is assuming both are case insensitive, leading to rules that do not match the actual URL.

# Anti pattern: case mismatch
User-agent: *
Disallow: /Admin/

# Actual URL on the site
GET /admin/dashboard
# Result: NOT blocked, the case did not match

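Fix: match the actual URL case exactly. robots.txt has no case insensitive matching mode: RFC 9309 defines only the * and $ special characters, not regular expressions, so inline flags such as (?i) are not supported. If the site serves several case variants of a path, list each one or normalize URL case with a redirect.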
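Case mismatches are easy to detect mechanically by comparing Disallow prefixes against paths the site actually serves (from the sitemap or the access log). This is a small Python sketch; case_only_mismatches is a hypothetical helper and it handles literal prefixes only, not * or $ patterns.

```python
def case_only_mismatches(disallow_paths, served_paths):
    """Return Disallow prefixes that match a served path only when case is ignored."""
    flagged = []
    for rule in disallow_paths:
        exact = any(path.startswith(rule) for path in served_paths)
        loose = any(path.lower().startswith(rule.lower()) for path in served_paths)
        if loose and not exact:
            # The rule was probably meant to block these URLs but never will.
            flagged.append(rule)
    return flagged

print(case_only_mismatches(["/Admin/", "/private/"], ["/admin/dashboard", "/private/notes"]))
```

Any rule the function returns is almost certainly a typo in the robots.txt rather than a deliberate choice.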

12.4 Blocking the Wrong User Agent

Bot user agent strings have changed over time. claude-web was renamed to ClaudeBot. anthropic-ai was deprecated in February 2026. Sites with old robots.txt files often block deprecated names while leaving the current names allowed. The block is meaningless because the new name passes through.

# Anti pattern: outdated bot names
User-agent: claude-web
Disallow: /

User-agent: anthropic-ai
Disallow: /
# But ClaudeBot is not blocked

Fix: maintain the bot list quarterly. Section 14 has the cadence.
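The quarterly review can include an automated check for groups that name a retired user agent while the current name has no group at all. The sketch below is in Python; stale_agent_warnings is a hypothetical helper, and the RENAMED mapping is an illustrative assumption built from the names in this section that should be maintained from each vendor's own documentation.

```python
# Illustrative mapping (assumption): retired name -> name that should be covered instead.
RENAMED = {"claude-web": "ClaudeBot", "anthropic-ai": "ClaudeBot"}

def stale_agent_warnings(robots_txt, renamed):
    """List retired User-agent names whose current replacement has no group in the file."""
    named = {
        line.split(":", 1)[1].strip().lower()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("user-agent:")
    }
    return sorted(
        old for old, new in renamed.items()
        if old.lower() in named and new.lower() not in named
    )

OUTDATED = """\
User-agent: claude-web
Disallow: /

User-agent: anthropic-ai
Disallow: /
"""

print(stale_agent_warnings(OUTDATED, RENAMED))
```

An empty list means every retired name in the file is accompanied by a group for its current replacement.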

12.5 Misplaced or Missing Sitemap Declaration

The Sitemap: directive must be an absolute URL. Relative paths are ignored. The directive is independent of User-agent groups, so it applies wherever it appears, but keep it at the top level of the file so it is not mistaken for a group scoped rule.

# Anti pattern: relative sitemap URL
Sitemap: /sitemap.xml

# Anti pattern: sitemap inside User-agent block (still works in most crawlers but conceptually wrong)
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

Fix: absolute URL, top level placement.
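The absolute URL requirement is simple to lint. This is a minimal Python sketch using urllib.parse; relative_sitemaps is a hypothetical helper that only checks for an http or https scheme and a host. It does not fetch or validate the sitemap itself.

```python
from urllib.parse import urlparse

def relative_sitemaps(robots_txt):
    """Return Sitemap: values that are not absolute http(s) URLs."""
    bad = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            url = urlparse(value.strip())
            if url.scheme not in ("http", "https") or not url.netloc:
                bad.append(value.strip())
    return bad

EXAMPLE_SITEMAPS = """\
User-agent: *
Allow: /

Sitemap: /sitemap.xml
Sitemap: https://example.com/sitemap.xml
"""

print(relative_sitemaps(EXAMPLE_SITEMAPS))
```

Only the relative entry is reported; the absolute URL passes.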

12.6 Disallow: / Left Over from Staging

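The classic. A deploy pushes the staging robots.txt to production. Google generally caches robots.txt for up to 24 hours, so crawling stops within about a day, and over the following days pages lose their snippets and drop out of results. The fix is to deploy the correct robots.txt and request a recrawl in Search Console.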

# Anti pattern: staging robots.txt on production
User-agent: *
Disallow: /

Fix: separate robots.txt files for staging and production. Deploy scripts that generate the correct file for the target environment. Pre deploy validation that the production robots.txt is not Disallow: /.
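The pre deploy validation can be a few lines in the deploy pipeline. What follows is a sketch in Python, assuming the pipeline has the candidate production file as text; blocks_all_crawlers is a hypothetical helper. It is deliberately conservative: it flags any bare Disallow: / in a group that includes User-agent: *, even if Allow rules in the same group reopen some paths.

```python
def blocks_all_crawlers(robots_txt):
    """True if a group that includes User-agent: * contains a bare Disallow: /."""
    agents, in_rules, blocked = [], False, False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if in_rules:                          # a rule line closed the previous group
                agents, in_rules = [], False
            agents.append(value)
        elif key in ("allow", "disallow"):
            in_rules = True
            if key == "disallow" and value == "/" and "*" in agents:
                blocked = True
    return blocked

STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nDisallow: /admin/\n\nUser-agent: GPTBot\nDisallow: /\n"

print(blocks_all_crawlers(STAGING), blocks_all_crawlers(PRODUCTION))
```

A deploy script would abort with a non zero exit code when the function returns True for the production target.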

12.7 Trailing Slash Mismatches

The path matching rule treats /admin and /admin/ as different patterns. Disallow: /admin matches /admin, /admin/, /administration, /admin-tools. Disallow: /admin/ matches only /admin/, /admin/anything.

# Anti pattern: unintentionally broad disallow
User-agent: *
Disallow: /admin
# Also blocks /administration, /admin-tools, /administrative-leave-policy/

Fix: be deliberate about trailing slashes. End directory disallows with a slash. Use $ end of URL anchor when matching exact paths.
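The difference between the two forms is easiest to see by evaluating patterns against concrete paths. This Python sketch covers the matching rules described here (prefix match, * wildcard, $ end anchor); rule_matches is a hypothetical helper and it ignores percent encoding, which a production matcher would need to normalize.

```python
import re

def rule_matches(pattern, path):
    """Match a robots.txt path pattern: prefix match, * wildcard, optional $ end anchor."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal text and turn each * into "any run of characters".
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    regex = "^" + regex + ("$" if anchored else "")
    return re.match(regex, path) is not None

print(rule_matches("/admin", "/administration"))       # True: unintentionally broad
print(rule_matches("/admin/", "/administration"))      # False
print(rule_matches("/admin/", "/admin/users"))         # True
print(rule_matches("/admin$", "/admin/"))              # False: $ requires an exact end
print(rule_matches("/*.pdf$", "/files/report.pdf"))    # True
```

Running every Disallow pattern against the full list of sitemap URLs is a quick way to find paths that a broad prefix blocks by accident.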

12.8 Conflicting Rules in Group

When the same User-agent group has both Allow and Disallow rules covering the same path, the more specific rule wins (longer matching prefix). If specificity is equal, behavior is implementation defined (Googlebot resolves to Allow, other crawlers may resolve to Disallow).

# Ambiguous: same length specificity
User-agent: *
Allow: /products/
Disallow: /products/

Fix: make rules unambiguous. The more specific rule should be the override; the less specific rule is the broad policy.
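The precedence rule can be sketched in a few lines of Python. is_allowed is a hypothetical helper that uses literal prefixes only (no * or $), and it resolves ties to Allow, which matches the Googlebot behavior described above; other crawlers may resolve ties differently.

```python
def is_allowed(rules, path):
    """Resolve Allow/Disallow by longest matching prefix; ties go to Allow."""
    best_len, allowed = -1, True              # no matching rule means allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_for_allow:
            best_len, allowed = len(prefix), kind == "allow"
    return allowed

RULES = [("disallow", "/products/"), ("allow", "/products/featured/")]

print(is_allowed(RULES, "/products/featured/widget"))   # True: the longer Allow wins
print(is_allowed(RULES, "/products/archive/widget"))    # False: only the Disallow matches
```

Because rule order does not affect the result, the same function can be used to confirm that reordering a group leaves the effective policy unchanged.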

12.9 Comments Inside Directives

Comments are introduced with #. RFC 9309 permits a comment on its own line or after a directive value, and Google's parser accepts both. Not every parser does, so a trailing comment on a directive line is a portability risk.

# Anti pattern: inline comment inside directive
Disallow: /admin/ # internal admin only

Some parsers treat the entire line as the disallow value, including the # internal admin only text. The directive then fails to match the actual /admin/ path.

Fix: put comments on their own line, before the directive.

# internal admin only
Disallow: /admin/

12.10 The File Is Not at the Root

The robots.txt file must be at /robots.txt of the host. Not /somewhere/else/robots.txt. Not /robots/robots.txt. The root of the host, exactly. Subdomains have their own robots.txt files; the production domain robots.txt does not control the staging subdomain.

# Anti pattern: file at a subdirectory
https://example.com/site-config/robots.txt

Fix: move the file to /robots.txt. For nginx with a static site at /var/www/sites/example.com/public/, the file is at /var/www/sites/example.com/public/robots.txt.

The November 2024 industry audit study (cited in Section 3.3) found that approximately 8 percent of audited sites had a robots.txt at a non root path, effectively meaning the site had no robots.txt at all from a crawler's perspective. Source: industry SEO research, 2024 robots.txt audit study.


13. Audit Rubric

13.1 Per Site Audit

The full audit covers 30 criteria across protocol compliance, AI crawler posture, sitemap integration, defense layer, and operational health.

# Criterion Pass / Fail
RC1 robots.txt returns HTTP 200 with text/plain content type
RC2 robots.txt is at /robots.txt of the production host, not a subdirectory
RC3 File size under 500 KB (Google truncation limit)
RC4 robots.txt does NOT block CSS or JavaScript directories
RC5 robots.txt does NOT contain Disallow: / for the default * group
RC6 robots.txt does NOT use noindex directive (unsupported by Google)
RC7 Sitemap: directive present with absolute URL
RC8 Sitemap URL returns 200 and parses as valid XML
RC9 All declared sitemaps are accurate and current
RC10 AI crawler posture documented in client variables
RC11 GPTBot, OAI-SearchBot, ChatGPT-User posture explicit
RC12 ClaudeBot, Claude-SearchBot, Claude-User posture explicit
RC13 PerplexityBot, Perplexity-User posture explicit
RC14 Google-Extended and Applebot-Extended posture explicit
RC15 Meta-ExternalAgent and CCBot posture explicit
RC16 Bytespider and other aggressive crawlers handled (block or rate limit)
RC17 Staging subdomain has its own robots.txt with Disallow: /
RC18 Staging subdomain protected by basic auth or IP allowlist
RC19 Admin paths protected by basic auth at nginx level (not relying on Disallow)
RC20 Tracking parameters (utm_, fbclid, gclid, sessionid) disallowed
RC21 GSC Crawl Stats report reviewed in last 90 days
RC22 Server logs analyzed for crawler activity in last 90 days
RC23 Named AI crawlers verified visiting the site (or block reason documented)
RC24 No 5xx response spikes in server logs in last 30 days
RC25 Faceted nav (if present) handled per Section 9.4 pattern
RC26 Crawl budget assessment complete (URL count vs threshold)
RC27 llms.txt published (if engagement covers AI citation)
RC28 Bot verification logic in place (no UA spoofing scrapers treated as bots)
RC29 robots.txt reviewed for new AI crawlers in last 90 days
RC30 No conflicting Allow and Disallow rules at equal specificity

Maximum score: 30. World class: 27 or higher. Acceptable: 24 to 26. Below 24 indicates a meaningful operational gap.
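For audit tooling that reports the tier automatically, the bands translate directly into code. This is a minimal Python sketch; rubric_tier is a hypothetical helper and the tier labels are taken from the bands above.

```python
def rubric_tier(passed, total=30):
    """Map the number of passed criteria to the tier named in the rubric."""
    if not 0 <= passed <= total:
        raise ValueError(f"passed must be between 0 and {total}")
    if passed >= 27:
        return "world class"
    if passed >= 24:
        return "acceptable"
    return "operational gap"

print(rubric_tier(28), rubric_tier(25), rubric_tier(19))
```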

13.2 First 90 Days Subset

For a new engagement, the priority subset to clear in the first 90 days:

  1. Day 1. RC5: file does not contain Disallow: / for the default group
  2. Day 1. RC4: file does not block CSS or JS
  3. Day 1. RC1: file returns HTTP 200
  4. Day 1. RC2: file is at the host root
  5. Week 1. RC7, RC8, RC9: sitemap declared and current
  6. Week 1. RC10, RC11, RC12, RC13: AI crawler posture documented and applied
  7. Week 2. RC17, RC18: staging environment defended
  8. Week 2. RC19: admin paths protected by basic auth
  9. Week 4. RC22, RC23: server log baseline established
  10. Week 4. RC21: GSC Crawl Stats baseline established
  11. Month 2. RC25, RC26: crawl budget assessment if applicable
  12. Month 2. RC27: llms.txt published
  13. Month 3. RC29: AI crawler list refresh
  14. Month 3. RC30: directive consistency review

The first day items are emergency level. The rest is operational hygiene.


14. Maintenance Schedule and Report Templates

14.1 Cadence

Weekly. Spot check robots.txt is unchanged from intended state. Check GSC Crawl Stats for sudden changes. Confirm no 5xx spikes in server log.

Monthly. Review server log for AI crawler activity. Verify the crawlers that should be visiting are visiting. Review Crawl Stats trends.

Quarterly. Full audit per Section 13. Refresh the AI crawler bot list (new bots emerge multiple times per year). Update llms.txt if site content has expanded. Review crawl budget assessment if site has grown.

Annually. Comprehensive robots.txt rebuild from scratch. Validate every directive against the current bot list. Re evaluate editorial posture (does the business want different access for new content categories). Refresh basic auth credentials for admin paths.

On Major AI Engine Announcement. When OpenAI, Anthropic, Google, Perplexity, or Meta announces a new crawler or a posture change, evaluate the impact within one week. Add or remove user agents as appropriate.

14.2 Implementation Report Template

# Robots.txt and Crawl Budget Implementation Report

Site: {{BUSINESS_NAME}}
Implementation Date: {{TODAY}}

## Summary

- `robots.txt` audited and rewritten: {{YES/NO}}
- Sitemap declaration verified: {{YES/NO}}
- AI crawler posture established: {{YES/NO}}
- Staging defense layer applied: {{YES/NO}}
- `llms.txt` published: {{YES/NO}}

## Robots.txt Posture

- Pattern selected: {{permissive | restrictive | citation_optimized}}
- AI crawlers allowed: {{COUNT}} of {{TOTAL_NAMED_BOTS}}
- AI crawlers disallowed: {{COUNT}}
- SEO tool crawlers disallowed: {{COUNT}}
- Default group: {{Allow | Disallow}}

## Sitemap

- Sitemap URL: {{ABSOLUTE_URL}}
- Sitemap URL count: {{COUNT}}
- Sitemap returns 200: {{YES/NO}}
- Sitemap submitted to GSC: {{YES/NO}}
- Sitemap submitted to Bing Webmaster: {{YES/NO}}

## Defense Layer

- Admin paths protected by basic auth: {{YES/NO}}
- Staging subdomain blocked from indexing: {{YES/NO}}
- Staging subdomain IP allowlisted: {{YES/NO}}
- `X-Robots-Tag` applied to non HTML resources: {{YES/NO}}

## Crawl Budget Assessment

- Total indexable URL count: {{NUMBER}}
- Crawl budget priority: {{not_a_concern | modest | real | critical}}
- Faceted nav present: {{YES/NO}}
- Parameter handling pattern applied: {{YES/NO}}

## Sign Off

14.3 Audit Report Template

# Robots.txt and Crawl Budget Audit Report

Site: {{BUSINESS_NAME}}
Audit Date: {{TODAY}}

## Executive Summary

{{ONE_PARAGRAPH_ASSESSMENT}}

## Site Score

{{X}}/30 per Section 13 rubric.

## Critical Failures

{{LIST_OF_FAILED_RUBRIC_ITEMS_PRIORITY_ORDERED}}

## Robots.txt Posture Analysis

- Pattern in use: {{description}}
- AI crawler posture: {{summary}}
- Documented editorial reasoning: {{YES/NO}}

## Sitemap Integration

- Declared sitemaps: {{LIST}}
- Sitemap validity: {{summary}}
- Coverage match between sitemap and indexable URLs: {{summary}}

## Server Log Findings

- Top crawlers by request volume: {{LIST}}
- AI crawler visit rates: {{LIST}}
- Anomalies detected: {{LIST}}

## GSC Crawl Stats Findings

- Crawl rate trend: {{summary}}
- 5xx rate: {{summary}}
- Discovery vs refresh ratio: {{summary}}

## Crawl Budget Assessment

- URL count: {{NUMBER}}
- Threshold tier: {{description}}
- Faceted nav status: {{summary}}

## llms.txt Status

- Present: {{YES/NO}}
- Coverage of canonical content: {{summary}}

## Recommended Remediation Order

{{PRIORITIZED_LIST}}

## Trend vs Previous Audit

{{COMPARISON_IF_APPLICABLE}}

## Sign Off

End of Framework Document

Document version: 1.0
Last updated: 2026-05-14
Maintained by: Joseph W. Anady, ThatDeveloperGuy, SDVOSB

The robots.txt file at the root of a domain is the single most consequential 500 bytes on a website. It controls bot access, declares sitemaps, and codifies the editorial posture toward the 30 plus named AI crawlers that now define visibility on search and answer surfaces. Misconfigured, it removes a site from search results within days. Correctly configured, it is invisible. The asymmetry is the whole reason this framework exists.

The crawl budget concept matters for a minority of sites, but for those sites the diagnostic discipline matters a great deal. Server log analysis is the canonical instrument. GSC Crawl Stats is the second instrument. Faceted navigation is the primary cause of waste. Section 9 and Section 10 together are the playbook for the cases where this matters.

The llms.txt complement is inexpensive optionality. Adoption is approximately 10 percent of sites and growing. The downstream effect on AI citation is contested. The recommended posture is to publish a minimal file and skip the larger llms-full.txt for most clients. If AI engines do start to use llms.txt at scale, the site is positioned. If they do not, the cost was a single markdown file.

Companion documents: see framework-masterindex.md for the complete index.
