SEO & AI Engine Optimization Framework · May 2026

Technical SEO: crawling, indexing, canonicalization, JS rendering

Crawlability, Indexing, Canonicalization, Redirects, URL Structure, and the Bot-Facing Foundation

A comprehensive installation and audit reference for technical SEO — the bedrock layer that determines whether search engines and AI crawlers can discover, render, and index a site at all. Every other framework in this library assumes the technical foundation works. This document specifies what "works" means and how to verify it. Dual-purpose: installation manual and audit document.

Cross-stack implementation note: the code samples in this framework are written in plain HTML for clarity. For React, Vue, Svelte, Next.js, Nuxt, SvelteKit, Astro, Hugo, 11ty, Remix, WordPress, Shopify, and Webflow equivalents of every pattern below, see framework-cross-stack-implementation.md. For pure client-rendered SPAs (no SSR/SSG) see framework-react.md. For Tailwind-specific concerns (purge, dynamic classes, dark-mode CLS, focus accessibility) see framework-tailwind.md.


1. Document Purpose

This is the canonical reference for technical SEO. Content quality, authority signals, schema markup, and AI optimization are all wasted if a crawler cannot reach a page, cannot render it, or cannot decide which version to index. Technical SEO is the prerequisite. It is not glamorous. It is not optional.

Technical SEO in 2026 differs from 2020 in three ways. First, JavaScript rendering is no longer a bleeding-edge concern: Google renders JS reliably, and other major crawlers (Bingbot, GPTBot, ClaudeBot, PerplexityBot, Applebot) have caught up to varying degrees. Second, the bot landscape has exploded: a 2026 site receives traffic from a dozen named AI crawlers in addition to the four major search engines, so robots.txt policy is now an editorial decision, not a technical default. Third, indexing has become more selective: Google indexes a smaller percentage of crawled URLs than it did a decade ago, so wasting crawl budget on duplicate, parameterized, or thin URLs has a direct cost.

1.1 Required Tools

1.2 Document Scope

Covers: crawl access, robots.txt, XML sitemaps, canonicalization, redirects, URL structure, status codes, JS rendering, mobile-first indexing, HTTPS posture, hreflang, and crawler observability. Touches but does not exhaust: page experience (own framework: framework-pageexperience.md), schema (framework-schema.md), internal linking (framework-internallinking.md), security (framework-security.md).


2. Client Variables Intake

domain_apex: ""
www_or_non_www_canonical: ""           # decide which is canonical
http_or_https: "https"                 # always https in 2026
trailing_slash_policy: ""              # with-slash | without-slash
url_case_policy: "lowercase"
cms_or_framework: ""                   # WordPress | Next.js | Astro | Hugo | Shopify | Webflow | static
hosting_environment: ""
cdn: ""                                # Cloudflare | Fastly | none
search_console_verified: false
bing_webmaster_verified: false
indexnow_key_deployed: false
known_indexing_issues: []
recent_migrations: []
international_targets: []              # fill in only if hreflang is needed

3. Crawl Access Layer

3.1 robots.txt

The robots.txt file at the domain root tells crawlers which paths they may request. It is advisory, not a security mechanism — anything genuinely sensitive belongs behind authentication, not behind a Disallow rule.

Minimum viable robots.txt for a production site:

User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /staging/
Disallow: /*?*sessionid=
Disallow: /*?*utm_*

Sitemap: https://example.com/sitemap.xml

Required validations:

AI crawler policy (2026 baseline):

In 2026, the question is no longer "do we block AI crawlers" but "which AI crawlers do we want citing us, and which do we want to block." A typical client-facing posture:

# AI search crawlers — usually allow (citation traffic)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

# Aggressive scrapers — block by default unless client requests otherwise
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: MJ12Bot
Disallow: /

Cross-reference: framework-aicitations.md for the full AI-crawler matrix.
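
A policy like the one above can be checked before deployment. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the `POLICY` text is an abbreviated copy of the example, and `allowed_agents` is a hypothetical helper name. One caveat: `robotparser` does prefix matching only, so it will not evaluate wildcard patterns such as `/*?*sessionid=` the way Googlebot does.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the AI-crawler policy shown above (illustrative).
POLICY = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: AhrefsBot
Disallow: /
"""


def allowed_agents(policy: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return {agent: may_fetch} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(policy.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = allowed_agents(
        POLICY, ["GPTBot", "ClaudeBot", "AhrefsBot"], "https://example.com/pricing/"
    )
    for agent, ok in result.items():
        print(f"{agent:12} {'allow' if ok else 'block'}")
```

Running this against the real robots.txt in CI is one way to catch an accidental blanket Disallow before it ships.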

3.2 X-Robots-Tag

For non-HTML resources (PDFs, images, JSON files) that should not be indexed, X-Robots-Tag is the only signal. Set via server config:

# nginx
location ~* \.(pdf|json)$ {
  add_header X-Robots-Tag "noindex, nofollow";
}

For HTML pages, <meta name="robots"> is preferred because it is more visible to humans editing pages.

3.3 Crawl Budget

Crawl budget is the number of URLs Googlebot crawls in a given period. For sites under ~50,000 URLs this rarely matters. Above that, crawl-budget waste shows as:

Crawl budget conservation:


4. Indexing Layer

4.1 The Two-Step Process

Indexing is two steps, not one:

  1. Discovery and crawl — the bot finds the URL (sitemap, link, IndexNow ping) and requests it.
  2. Indexing — the bot parses the response, decides whether to add it to the index, and what to associate with it.

A page can be crawled but not indexed. A page can be in the index but not ranked. These are distinct states with distinct fixes.

4.2 Index Status in Google Search Console

Use the Page indexing report (listed as Pages in the sidebar, formerly Coverage). The buckets:

Status | Meaning | Action
Indexed | In Google's index | Monitor for unexpected changes
Indexed, not submitted in sitemap | Found via links, not in sitemap | Add to sitemap if it should be indexed
Crawled - currently not indexed | Crawled but Google chose not to index | Improve content quality, add internal links, check duplication
Discovered - currently not indexed | Found but not yet crawled | Often crawl-budget signal; check site authority + reduce low-value URLs
Excluded by 'noindex' tag | Intentionally excluded | Verify intent
Page with redirect | Redirected; not indexed itself | Verify the redirect target is indexed
Duplicate, Google chose different canonical | Google ignored your canonical | Check internal linking + canonical signals consistency
Soft 404 | Returns 200 but content looks like a 404 page | Either return real 404 or fix the page
Server error (5xx) | Crawl failed | Fix the server error

4.3 IndexNow

IndexNow is a push-based indexing protocol supported by Bing, Yandex, Seznam, and Naver (not Google). When a URL changes, you POST it to IndexNow and supported engines crawl within minutes instead of days.

Implementation:

  1. Generate an API key (random 32-character string)
  2. Place the key as /{key}.txt at the domain root with the key as its content
  3. POST URL changes to https://api.indexnow.org/indexnow:
POST /indexnow HTTP/1.1
Host: api.indexnow.org
Content-Type: application/json

{
  "host": "example.com",
  "key": "abc123...",
  "keyLocation": "https://example.com/abc123.txt",
  "urlList": [
    "https://example.com/new-page/",
    "https://example.com/updated-page/"
  ]
}

For WordPress, plugin support exists. For Next.js / static sites, integrate as a build-time hook.
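
For a build-time hook, the request above can be assembled with the standard library alone. This is a sketch under stated assumptions: `build_indexnow_request` is a hypothetical helper, the key file is assumed to live at `/{key}.txt` as in step 2, and the actual send is left commented out so the example has no side effects.

```python
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host: str, key: str, urls: list[str]) -> urllib.request.Request:
    """Build (but do not send) the IndexNow POST for a batch of changed URLs."""
    # Every submitted URL must belong to the host that owns the key file.
    foreign = [u for u in urls if not u.startswith(f"https://{host}/")]
    if foreign:
        raise ValueError(f"URLs outside {host}: {foreign}")
    payload = {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_indexnow_request("example.com", "abc123", ["https://example.com/new-page/"])
    # To submit for real, uncomment:
    # with urllib.request.urlopen(req, timeout=10) as resp:
    #     print(resp.status)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Batching all URLs changed by one deploy into a single request keeps the hook to one network call per build.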

4.4 Mobile-First Indexing

Since 2023, Google indexes the mobile version of every site. The desktop version is largely ignored for ranking. Verify:


5. Canonicalization

5.1 Why Canonicalization Matters

Modern sites generate many URLs that resolve to the same content:

Without explicit canonical signals, Google picks one and may not pick the one you want. Ranking signals split across variants. Indexing decisions become inconsistent.

5.2 Canonical Signal Stack

Canonical signals reinforce each other. Use all of them:

  1. <link rel="canonical" href="..."> — the explicit declaration. Self-referential on the canonical URL itself.
  2. 301 redirects — for true duplicates, redirect non-canonical to canonical (preferred over rel=canonical when content is genuinely identical).
  3. Internal linking — every internal link points to the canonical URL, never to a redirected variant.
  4. XML sitemap — only canonical URLs appear in the sitemap.
  5. hreflang annotations — when present, must reference canonical URLs only.
  6. HTTP Link header — equivalent to rel=canonical, used for non-HTML resources.

If these signals disagree, Google picks one and ignores the rest. Consistency is the rule.
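
Signal 1 is the easiest to audit automatically. The sketch below, using only `html.parser` from the standard library, classifies a page by its rel=canonical tags; `CanonicalFinder` and `canonical_status` are hypothetical names, and the comparison is an exact string match, which assumes URLs are already normalized.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect every <link rel="canonical"> href found in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.canonicals: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_status(page_url: str, html: str) -> str:
    """Classify a page as 'self', 'other', 'missing', or 'conflict'."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    found = set(finder.canonicals)
    if not found:
        return "missing"
    if len(found) > 1:
        return "conflict"  # two different canonicals on one page
    return "self" if found == {page_url} else "other"
```

A page listed in the sitemap should come back as "self"; anything else on a sitemap URL indicates the signals disagree.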

5.3 Common Canonicalization Mistakes

5.4 Trailing Slash and Case

Pick one and enforce it sitewide via 301 redirect. Do not rely on canonicals alone — redirects collapse the duplication, canonicals only signal it.

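# nginx: enforce lowercase + trailing slash (sketch)
# Stock nginx cannot lowercase a URI; this assumes ngx_http_perl_module is loaded.
# http {} context:
perl_set $uri_lower 'sub { my $r = shift; return lc($r->uri); }';
# server {} context:
if ($uri ~ "[A-Z]") { return 301 $scheme://$host$uri_lower$is_args$args; }
rewrite ^([^.]*[^/])$ $1/ permanent;   # add trailing slash to extensionless paths
location / {
  try_files $uri $uri/ =404;
}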

For Next.js, set trailingSlash: true (or false) in next.config.js and stick with it. Mixing breaks canonicalization.
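
Whatever layer enforces the policy, it helps to compute the final canonical form in one step so that a mixed-case, slashless, HTTP request is redirected once rather than three times. A minimal sketch with `urllib.parse`; `canonical_form` and `needs_redirect` are hypothetical helpers, and lowercasing the path assumes the site never serves case-sensitive paths.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_form(url: str, host: str = "example.com", trailing_slash: bool = True) -> str:
    """Return the single-hop redirect target for any variant of a URL."""
    parts = urlsplit(url)
    path = parts.path.lower() or "/"
    last_segment = path.rsplit("/", 1)[-1]
    looks_like_file = "." in last_segment  # e.g. report.pdf keeps its form
    if trailing_slash and not path.endswith("/") and not looks_like_file:
        path += "/"
    elif not trailing_slash and path.endswith("/"):
        path = path.rstrip("/") or "/"
    # Force https and the chosen host; keep the query string, drop the fragment.
    return urlunsplit(("https", host, path, parts.query, ""))


def needs_redirect(url: str, **policy) -> bool:
    """True when the requested URL is not already in canonical form."""
    return canonical_form(url, **policy) != url
```

The same function can drive redirect rules, sitemap generation, and internal-link checks, which keeps the signals consistent.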


6. Redirects

6.1 Status Codes

Code | Use case
301 Moved Permanently | The URL has permanently moved. Passes ranking signals. Default for migrations.
302 Found | Temporary redirect. Use only for actual temporary moves (A/B tests, seasonal pages). Misused 302s leak link equity.
307 Temporary Redirect | Like 302 but preserves request method. Rare in SEO context.
308 Permanent Redirect | Like 301 but preserves request method. Functionally equivalent for SEO.
410 Gone | Page is permanently removed and not coming back. Faster removal from index than 404.
451 Unavailable for Legal Reasons | Use when content removed for legal reasons (DMCA, jurisdictional).

6.2 Redirect Chains

A chain is A → B → C. Eliminate them. Every redirect should be a single hop to the final URL. Chains:

Maintain a redirect map spreadsheet for any migration. After deploying redirects, crawl the site and verify zero chains.
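
If the redirect map is exported as source/destination pairs, chains can be detected and collapsed before deployment. The sketch below assumes the map fits in a plain dictionary; `flatten_redirects` and `find_chains` are hypothetical names.

```python
def flatten_redirects(redirects: dict[str, str]) -> dict[str, str]:
    """Rewrite every source so it points straight at its final destination."""
    flat: dict[str, str] = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        while target in redirects:  # the target is itself redirected: a chain
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat


def find_chains(redirects: dict[str, str]) -> list[str]:
    """Return sources whose first hop is not the final destination."""
    flat = flatten_redirects(redirects)
    return [src for src, dst in redirects.items() if flat[src] != dst]


if __name__ == "__main__":
    redirect_map = {"/a": "/b", "/b": "/c", "/old": "/new"}
    print(find_chains(redirect_map))        # sources that start a chain
    print(flatten_redirects(redirect_map))  # single-hop replacement map
```

Deploying the flattened map instead of the original one removes chains at the source, and the loop check catches A → B → A mistakes that would otherwise only surface in a crawl.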

6.3 Redirect Implementation Layers

In order of preference:

  1. Server config (nginx, Apache, Cloudflare Rules) — fastest, most reliable, executed before page load.
  2. CMS-level redirect plugin — fine for low-volume changes, performance penalty at scale.
  3. JavaScript redirects — last resort. Slow, fragile, sometimes ignored by crawlers.
  4. Meta refresh redirects — never use. Treated as low-quality signal.

6.4 The Migration Redirect Pattern

When migrating URL structure:

  1. Generate a 1:1 map of old URL → new URL for every indexed page.
  2. Implement 301s in server config before deploy.
  3. Update internal links to point to new URLs (do not rely on the redirect).
  4. Update XML sitemap to list only new URLs.
  5. Submit new sitemap in GSC.
  6. Monitor GSC's Coverage report for 30-90 days.
  7. Keep redirects in place permanently — old URLs have inbound links from sites you don't control.

Cross-reference: framework-migration.md for full migration methodology.


7. URL Structure

7.1 The Eight URL Rules

  1. Lowercase. Always.
  2. Hyphens between words. Not underscores. Not camelCase.
  3. Under 60 characters when possible. Long URLs index fine but truncate in SERPs.
  4. No stop words unless meaningful. /the-best-web-hosting/ reads better as /best-web-hosting/.
  5. Descriptive, not numeric. /blog/post-1234/ is opaque; /blog/local-seo-checklist/ is meaningful.
  6. No file extensions where avoidable. /about/ over /about.html or /about.php.
  7. One canonical separator policy. Don't mix /category/post/ and /category-post/.
  8. Stable. Once published, do not change a URL without redirecting.
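
Several of these rules are mechanical enough to lint. The following is an illustrative sketch, not a complete checker: `lint_url` is a hypothetical helper, the stop-word set is a small sample, the 60-character limit is applied to the path only, and "numeric slug" is approximated as a final segment ending in three or more digits. Rules 7 and 8 need site-wide context and are not covered.

```python
import re
from urllib.parse import urlsplit

STOP_WORDS = {"a", "an", "the", "of", "and"}  # illustrative, not exhaustive


def lint_url(url: str, max_length: int = 60) -> list[str]:
    """Return rule violations for one URL; an empty list means it passes."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    problems = []
    if path != path.lower():
        problems.append("rule 1: uppercase characters")
    if "_" in path:
        problems.append("rule 2: underscores instead of hyphens")
    if len(path) > max_length:
        problems.append(f"rule 3: path longer than {max_length} characters")
    if any(word in STOP_WORDS for s in segments for word in s.split("-")):
        problems.append("rule 4: stop word in slug")
    if segments and re.search(r"\d{3,}$", segments[-1]):
        problems.append("rule 5: numeric rather than descriptive slug")
    if segments and re.search(r"\.(html?|php|aspx?)$", segments[-1]):
        problems.append("rule 6: file extension")
    return problems
```

Treat the output as prompts for review rather than hard failures; rule 4 in particular allows stop words when they carry meaning.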

7.2 URL Hierarchy and Crawl Depth

A URL's path depth (/a/b/c/d/page/) is independent of crawl depth (clicks from homepage). Crawl depth matters for SEO; path depth is only loosely related.

Target: every important page reachable in 3 clicks or fewer from the homepage. Verify with Sitebulb's Crawl Depth report.

Cross-reference: framework-internallinking.md for hub-and-spoke architecture.
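
Crawl depth is a shortest-path calculation over the internal link graph. As a minimal sketch, assuming the graph has already been exported from a crawler as a dictionary of page to outbound links (`crawl_depths` and `too_deep` are hypothetical names):

```python
from collections import deque


def crawl_depths(links: dict[str, list[str]], home: str = "/") -> dict[str, int]:
    """Breadth-first search: minimum clicks from the homepage to each page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def too_deep(links: dict[str, list[str]], home: str = "/", limit: int = 3):
    """Pages deeper than the limit, plus pages never reached (orphans)."""
    depths = crawl_depths(links, home)
    deep = sorted(p for p, d in depths.items() if d > limit)
    orphans = sorted(set(links) - set(depths))
    return deep, orphans
```

Note that a page at `/a/b/c/d/page/` linked directly from the homepage has crawl depth 1 despite a path depth of 5, which is the distinction drawn above.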

7.3 Parameter Handling

URL parameters create duplication. Strategies:


8. XML Sitemaps

8.1 What Belongs in a Sitemap

Only canonical, indexable, 200-status URLs that you want indexed. Everything else stays out:

8.2 Sitemap Structure

A single sitemap file may hold at most 50,000 URLs (or 50 MB uncompressed), so sites below that size need only one file. Above that, use a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-images.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
</sitemapindex>

Per-sitemap entry:

<url>
  <loc>https://example.com/page/</loc>
  <lastmod>2026-05-05</lastmod>
</url>

<changefreq> and <priority> are ignored by Google. <lastmod> is honored when accurate; if you fake it (always = today), Google starts ignoring it.
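
Accurate lastmod values are easiest to maintain when the sitemap is generated from real modification dates. A sketch using `xml.etree.ElementTree`; `build_sitemap` is a hypothetical helper, and the caller is assumed to supply each page's true last-modified date (for example from the CMS or from git history) rather than the build date.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries: list[tuple[str, date]]) -> str:
    """Serialize (url, last_modified) pairs into a <urlset> document."""
    if len(entries) > 50_000:
        raise ValueError("split into multiple files and use a sitemap index")
    ET.register_namespace("", SITEMAP_NS)  # emit xmlns="..." with no prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, modified in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap([("https://example.com/page/", date(2026, 5, 5))]))
```

ElementTree escapes characters such as `&` in URLs automatically, which avoids a common source of invalid sitemaps.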

8.3 Specialized Sitemaps

8.4 Submission


9. Status Codes — The Operator's Reference

Beyond redirects, status codes communicate site health to crawlers. The full inventory worth knowing:

9.1 2xx Success

9.2 3xx Redirection

9.3 4xx Client Errors

9.4 5xx Server Errors

5xx codes are urgent. Sustained 5xx during a Googlebot crawl drops pages from the index.


10. JavaScript Rendering

10.1 The Two-Wave Indexing Model (Mostly Obsolete)

For years, Google's two-wave model meant JS sites were indexed late: the first wave indexed HTML, the second wave indexed rendered content days later. As of 2025, Google renders the vast majority of pages within hours of the first crawl. The two-wave problem is no longer a structural blocker for most sites.

It is still real for AI crawlers. GPTBot, ClaudeBot, and PerplexityBot do not all render JS reliably. For AI search visibility, server-side rendered content matters more than for traditional Google SEO.

10.2 Rendering Strategy by Content Type

Content type | Recommended rendering
Marketing pages, landing pages | SSG or SSR (no client-side rendering for primary content)
Blog posts, articles | SSG
Ecommerce product pages | SSR with hydration
Logged-in user pages | CSR is fine (these aren't indexed anyway)
Real-time data displays | SSR shell + client hydration

10.3 Validation

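Use the URL Inspection tool's "Test Live URL" in Search Console (or the Rich Results Test for sites you cannot verify) to see what Googlebot actually renders; the standalone Mobile-Friendly Test was retired in December 2023. If primary content is missing, the page is not effectively indexed even if it returns 200.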

For AI crawler visibility, curl -A "GPTBot" https://example.com/page and inspect the HTML response body. If the content is in <noscript> only or arrives via fetch/XHR, AI crawlers miss it.

Cross-reference: framework-headless.md and framework-nextjs.md for framework-specific rendering patterns.
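
The raw-HTML check described above can be automated once the response body has been saved (for example from the curl command). This sketch uses the standard-library `html.parser`; `RawText` and `content_in_raw_html` are hypothetical names, and `<noscript>` is skipped to follow this document's guidance rather than any documented crawler behavior.

```python
from html.parser import HTMLParser


class RawText(HTMLParser):
    """Collect text present in server-delivered HTML, skipping non-content tags."""

    # noscript is skipped per the guidance above; adjust if your target differs.
    SKIP = {"script", "style", "template", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def content_in_raw_html(html: str, phrase: str) -> bool:
    """True if the phrase appears in the HTML text before any JavaScript runs."""
    parser = RawText()
    parser.feed(html)
    parser.close()
    return phrase.lower() in " ".join(parser.chunks).lower()
```

Checking for the page's H1 and one sentence from the body is usually enough to tell a server-rendered page from an empty client-side shell.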


11. HTTPS

In 2026, HTTPS is non-negotiable. HTTP-only sites suffer ranking penalties, browser warnings, and lost trust signals.

11.1 Required Configuration

11.2 Certificate Maintenance

Cross-reference: framework-security.md for broader security posture.


12. International / hreflang

For sites targeting multiple languages or regions, hreflang annotations tell Google which version to show which user.

12.1 hreflang Implementation

Three valid placement methods, listed in order of preference:

  1. HTTP Link header — non-HTML resources, full programmatic control.
  2. XML sitemap hreflang — preferred for large sites; centralizes all annotations in one place.
  3. <link rel="alternate"> tags in <head> — most common; works but harder to maintain at scale.
<link rel="alternate" hreflang="en-US" href="https://example.com/en-us/page/">
<link rel="alternate" hreflang="en-GB" href="https://example.com/en-gb/page/">
<link rel="alternate" hreflang="x-default" href="https://example.com/page/">

12.2 hreflang Rules

Cross-reference: framework-international.md for full hreflang depth.


13. Crawler Observability

Knowing what crawlers actually do on a site, not what you think they do.

13.1 Server Log Analysis

The single best technical SEO data source. Server logs (nginx access logs, Apache access logs, Cloudflare logs) record every bot request with status, response time, and user agent.
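
As a starting point before adopting a dedicated tool, bot requests can be tallied straight from an access log. The sketch below assumes the common "combined" log format used by nginx and Apache defaults; `bot_status_counts` is a hypothetical helper, and matching on user-agent substrings alone does not prove identity (see 13.2).

```python
import re
from collections import Counter

# Combined Log Format:
# ip - user [time] "METHOD path proto" status bytes "referer" "user-agent"
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
BOTS = ("Googlebot", "bingbot", "GPTBot", "ClaudeBot", "PerplexityBot", "Applebot")


def bot_status_counts(lines) -> Counter:
    """Count (bot, status class) pairs, e.g. ('Googlebot', '4xx')."""
    counts: Counter = Counter()
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip lines in another format
        ua = m["ua"].lower()
        bot = next((b for b in BOTS if b.lower() in ua), None)
        if bot:
            counts[(bot, m["status"][0] + "xx")] += 1
    return counts


if __name__ == "__main__":
    import sys
    for (bot, status), n in sorted(bot_status_counts(sys.stdin).items()):
        print(f"{bot:14} {status} {n}")
```

Piping a day of logs through this gives a quick view of how much crawler activity is landing on 3xx, 4xx, and 5xx responses.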

Tools:

What to look for:

13.2 Bot Verification

Anyone can claim to be Googlebot. Verify:

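For Bing: reverse DNS resolves under search.msn.com. For Apple: applebot.apple.com. AI crawlers vary: OpenAI publishes IP ranges for GPTBot and Perplexity publishes them for PerplexityBot; confirm each vendor's current documentation (including Anthropic's for ClaudeBot) before building allowlists.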
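
Reverse-then-forward DNS verification can be sketched with the standard `socket` module. The hostname suffixes below reflect the vendors' public verification guidance as best understood here and should be confirmed before use; `verify_bot` and `TRUSTED_SUFFIXES` are hypothetical names. The resolver functions are parameters so the logic can be tested without network access, and the forward lookup shown covers IPv4 only.

```python
import socket

# Assumed suffixes; confirm against each vendor's current documentation.
TRUSTED_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
    "applebot": (".applebot.apple.com",),
}


def verify_bot(ip, claimed, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    suffixes = TRUSTED_SUFFIXES.get(claimed.lower())
    if not suffixes:
        return False  # no DNS rule known; fall back to published IP ranges
    try:
        hostname = reverse(ip)[0]
        if not hostname.lower().endswith(suffixes):
            return False
        # The hostname must resolve back to the same IP, or it could be forged.
        return ip in forward(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

DNS lookups are slow, so in production the result would normally be cached per IP rather than computed on every request.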

13.3 GSC URL Inspection

For specific URLs, GSC's URL Inspection tool shows:

Use this to debug specific indexing problems.


14. Audit Mode

# Criterion Pass/Fail
TS1 robots.txt returns 200 plain text, allows critical resources
TS2 XML sitemap returns 200, validates, lists only canonical indexable URLs
TS3 Sitemap submitted to Google Search Console and Bing Webmaster
TS4 All canonical URLs return 200
TS5 Self-referential rel=canonical on every indexable page
TS6 HTTP-to-HTTPS 301 redirect, single hop
TS7 www / non-www unified via 301, single hop
TS8 Trailing slash policy enforced sitewide
TS9 URLs lowercase, no mixed-case duplicates
TS10 Zero redirect chains (every redirect single-hop)
TS11 No 4xx URLs in sitemap or internal links
TS12 No 5xx URLs detected in last 30 days
TS13 HSTS header present with min 1-year max-age
TS14 TLS 1.2+ enforced, valid certificate, no mixed content
TS15 Mobile rendering verified (URL Inspection live test; the Mobile-Friendly Test was retired in 2023)
TS16 JS-rendered content visible to Googlebot via URL Inspection
TS17 IndexNow key deployed (for Bing/Yandex/Naver indexing)
TS18 hreflang correctly implemented if multi-region
TS19 Bot verification logic in place (no UA-spoofing scrapers treated as bots)
TS20 Server logs analyzed at least quarterly for crawl-budget waste
TS21 Zero soft 404s in GSC
TS22 Crawled — currently not indexed bucket under 10% of total
TS23 Discovered — currently not indexed bucket under 5% of total
TS24 URL parameter strategy documented (block / canonical / index per parameter)
TS25 Pagination strategy documented (rel=next/prev replaced or supplemented)
TS26 AI crawler policy in robots.txt explicit and documented
TS27 Duplicate-content audit completed in last 90 days
TS28 All 301 redirects retained from migrations (don't expire redirects)
TS29 Crawl depth report shows zero pages over depth 3 (small sites) or 5 (large)
TS30 URLs under 60 characters where possible
TS31 Server response time under 600ms for HTML responses
TS32 GSC Coverage report shows zero "server error" URLs
TS33 GSC URL Inspection on 5 random pages confirms canonical, indexed status, no rendering issues
TS34 Lighthouse SEO score 100 on representative sample of pages
TS35 No JavaScript-only navigation (every link reachable via crawl without rendering)

Maximum score: 35. World-class: 33+/35.


15. Common Mistakes

  1. Blocking CSS/JS in robots.txt — Google needs them to render. Frequently breaks indexing.
  2. Canonical pointing to a URL that itself redirects: the declared canonical is not the final URL, so it conflicts with the redirect and may be ignored.
  3. Multiple canonical signals disagreeing — internal links say A, sitemap says B, rel=canonical says C; Google picks one and ignores the others.
  4. Trailing slash inconsistency — half the site with /, half without; treated as duplicate URLs.
  5. Redirect chains — A→B→C→D wastes crawl, leaks signals.
  6. 302s where 301 belongs — temporary redirect on a permanent move; ranking signals leak.
  7. Soft 404s — page returns 200 but says "not found"; Google detects and demotes.
  8. Indexable thin pages: tag archives, paginated category pages, and internal search result pages left indexable despite offering no standalone value.
  9. JavaScript navigation only — links rendered via JS; crawl depth report shows orphaned pages.
  10. Inaccurate <lastmod> in sitemap: every URL claims today's date, so Google starts ignoring lastmod entirely.
  11. HTTPS deployed but HTTP not redirected: both versions serve and both can be indexed, splitting ranking signals across duplicates.
  12. Mixed content — HTTPS page loads HTTP resources; browser blocks, layout breaks, ranking suffers.
  13. AI crawler blocking by accident: a wildcard Disallow: / rule also blocks legitimate AI crawlers, costing citation traffic.
  14. No IndexNow — Bing, Yandex, Naver indexing days late when push-based could do it in minutes.
  15. Forgotten redirects after migration — old redirects removed prematurely; old inbound links 404.

16. Maintenance

Weekly:

Monthly:

Quarterly:

Annually:


17. Companion Documents


Document version: 1.0 Last updated: 2026-05-05 Owner: Joseph W. Anady — ThatDeveloperGuy — SDVOSB
