SEO & AI Engine Optimization Framework · May 2026

Technical SEO: crawling, indexing, canonicalization, JS rendering

By Joseph W. Anady — Founder & Lead Engineer, ThatDevPro (BA Computer Engineering, MA Cybersecurity) · Updated May 2026

Technical SEO covers the back-end factors that let search engines crawl, render, and index a site: clean URL architecture, robots.txt and sitemaps, fast Core Web Vitals, structured data, and HTTPS. It's the foundation that lets your content and links actually rank.

Crawlability, Indexing, Canonicalization, Redirects, URL Structure, and the Bot-Facing Foundation

A comprehensive installation and audit reference for technical SEO — the bedrock layer that determines whether search engines and AI crawlers can discover, render, and index a site at all. Every other framework in this library assumes the technical foundation works. This document specifies what "works" means and how to verify it. Dual-purpose: installation manual and audit document.

Cross-stack implementation note: the code samples in this framework are written in plain HTML for clarity. For React, Vue, Svelte, Next.js, Nuxt, SvelteKit, Astro, Hugo, 11ty, Remix, WordPress, Shopify, and Webflow equivalents of every pattern below, see framework-cross-stack-implementation.md. For pure client-rendered SPAs (no SSR/SSG) see framework-react.md. For Tailwind-specific concerns (purge, dynamic classes, dark-mode CLS, focus accessibility) see framework-tailwind.md.

Quick answer

Technical SEO covers the back-end factors that let search engines and AI crawlers discover, render, and index a site: clean URL architecture, robots.txt and XML sitemaps, canonicalization, redirects, fast Core Web Vitals, structured data, and HTTPS. It is the foundation that lets content and links actually rank. This page is a dual-purpose installation manual and audit document built around 35 pass/fail checks.

1. Document Purpose

This is the canonical reference for technical SEO. Content quality, authority signals, schema markup, and AI optimization are all wasted if a crawler cannot reach a page, cannot render it, or cannot decide which version to index. Technical SEO is the prerequisite. It is not glamorous. It is not optional.

Technical SEO in 2026 differs from 2020 in three ways. First, JavaScript rendering is no longer the bleeding-edge concern: Google renders JS reliably, and other major crawlers (Bingbot, GPTBot, ClaudeBot, PerplexityBot, Applebot) have caught up to varying degrees. Second, the bot landscape exploded: a 2026 site receives traffic from a dozen named AI crawlers in addition to the four major search engines, and robots.txt policy is now an editorial decision, not a technical default. Third, indexing has become more selective: Google indexes a smaller percentage of crawled URLs than it did a decade ago, so wasting crawl budget on duplicate, parameterized, or thin URLs has a direct cost.

1.1 Required Tools

Google Search Console — search.google.com/search-console — coverage, sitemaps, URL inspection
Bing Webmaster Tools — www.bing.com/webmasters — Bing-specific coverage and IndexNow submission
Screaming Frog SEO Spider — desktop crawler, free up to 500 URLs, paid for unlimited
Sitebulb — desktop crawler, alternative to Screaming Frog with stronger reporting
Ahrefs Site Audit / Semrush Site Audit — cloud-based crawlers with historical tracking
Google Rich Results Test — search.google.com/test/rich-results — render + schema validation
Google Lighthouse — Chrome DevTools performance/SEO audit
GTmetrix / WebPageTest — performance and waterfall analysis
curl / httpie — manual header inspection
Cloudflare / nginx access logs — server-level crawl observation
IndexNow — www.indexnow.org — push-based indexing for Bing, Yandex, Naver

1.2 Document Scope

Covers: crawl access, robots.txt, XML sitemaps, canonicalization, redirects, URL structure, status codes, JS rendering, mobile-first indexing, HTTPS posture, hreflang, and crawler observability. Touches but does not exhaust: page experience (own framework: framework-pageexperience.md), schema (framework-schema.md), internal linking (framework-internallinking.md), security (framework-security.md).

2. Client Variables Intake

domain_apex: ""
www_or_non_www_canonical: ""           # decide which is canonical
http_or_https: "https"                 # always https in 2026
trailing_slash_policy: ""              # with-slash | without-slash
url_case_policy: "lowercase"
cms_or_framework: ""                   # WordPress | Next.js | Astro | Hugo | Shopify | Webflow | static
hosting_environment: ""
cdn: ""                                # Cloudflare | Fastly | none
search_console_verified: false
bing_webmaster_verified: false
indexnow_key_deployed: false
known_indexing_issues: []
recent_migrations: []
international_targets: []              # if any hreflang need

3. Crawl Access Layer

3.1 robots.txt

The robots.txt file at the domain root tells crawlers which paths they may request. It is advisory, not a security mechanism — anything genuinely sensitive belongs behind authentication, not behind a Disallow rule.

Minimum viable robots.txt for a production site:

User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /staging/
Disallow: /*?*sessionid=
Disallow: /*?*utm_*

Sitemap: https://example.com/sitemap.xml

Required validations:

The file is served at /robots.txt with Content-Type: text/plain and HTTP 200.
It does NOT block CSS, JS, or image directories. Googlebot needs those to render.
It does NOT block fonts (/fonts/, /assets/fonts/) or web manifest assets.
The Sitemap directive uses an absolute URL.
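Wildcards (*) are used sparingly and tested before deploy. GSC's standalone robots.txt tester has been retired, so confirm what Google fetched in GSC's robots.txt report and test individual paths with a local parser.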
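The path rules above can be checked offline before deploy. What follows is a minimal Python sketch of longest-match evaluation as described in RFC 9309 and Google's documentation: the most specific matching rule wins, and Allow wins a tie. The names is_allowed and RULES are illustrative, and edge cases such as percent-encoding are not handled. Python's built-in urllib.robotparser is deliberately not used here, because it applies rules in file order and does not expand * wildcards, so it would report /wp-admin/ as allowed for the sample file.

```python
import re

def _rule_to_regex(pattern: str) -> re.Pattern:
    # '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(rules: list[tuple[str, str]], url_path: str) -> bool:
    """rules: (directive, pattern) pairs belonging to one user-agent group."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty Disallow matches nothing
        if _rule_to_regex(pattern).match(url_path):
            is_allow = directive.lower() == "allow"
            # Longest pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

# The sample group from the minimum viable robots.txt above.
RULES = [
    ("Allow", "/"),
    ("Disallow", "/wp-admin/"),
    ("Disallow", "/staging/"),
    ("Disallow", "/*?*sessionid="),
    ("Disallow", "/*?*utm_*"),
]

for path in ["/assets/site.css", "/wp-admin/options.php", "/page?utm_source=email"]:
    print(path, "allowed" if is_allowed(RULES, path) else "blocked")
```

Running a list of must-crawl asset paths (CSS, JS, fonts) through is_allowed in CI is one way to catch an accidental block before it ships.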

AI crawler policy (2026 baseline):

In 2026, the question is no longer "do we block AI crawlers" but "which AI crawlers do we want citing us, and which do we want to block." A typical client-facing posture:

# AI search crawlers — usually allow (citation traffic)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

# Aggressive scrapers — block by default unless client requests otherwise
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: MJ12Bot
Disallow: /

Cross-reference: framework-aicitations.md for the full AI-crawler matrix.

3.2 X-Robots-Tag

For non-HTML resources (PDFs, images, JSON files) that should not be indexed, X-Robots-Tag is the only signal. Set via server config:

# nginx
location ~* \.(pdf|json)$ {
  add_header X-Robots-Tag "noindex, nofollow";
}

For HTML pages, <meta name="robots"> is preferred because it is more visible to humans editing pages.

3.3 Crawl Budget

Crawl budget is the number of URLs Googlebot crawls in a given period. For sites under ~50,000 URLs this rarely matters. Above that, crawl-budget waste shows as:

Many URLs in GSC's "Crawled — currently not indexed"
Many URLs in "Discovered — currently not indexed"
Slow propagation of new content to the index
Frequent crawls of low-value parameterized URLs

Crawl budget conservation:

Block parameterized/sessioned URLs in robots.txt
410 truly dead URLs (faster than 404 to be removed from crawl)
Reduce internal links to low-value pages
Use noindex on pages that should not be indexed (404 templates, search result pages, tag archives with thin content). Keep those URLs crawlable: a crawler blocked by robots.txt never sees the noindex.
Use a clean XML sitemap so the crawler has a prioritized list

4. Indexing Layer

4.1 The Two-Step Process

Indexing is two steps, not one:

Discovery and crawl — the bot finds the URL (sitemap, link, IndexNow ping) and requests it.
Indexing — the bot parses the response, decides whether to add it to the index, and what to associate with it.

A page can be crawled but not indexed. A page can be in the index but not ranked. These are distinct states with distinct fixes.

4.2 Index Status in Google Search Console

Use the Page indexing report (formerly Coverage). The main statuses:

Status	Meaning	Action
Indexed	In Google's index	Monitor for unexpected changes
Indexed, not submitted in sitemap	Found via links, not in sitemap	Add to sitemap if it should be indexed
Crawled — currently not indexed	Crawled but Google chose not to index	Improve content quality, add internal links, check duplication
Discovered — currently not indexed	Found but not yet crawled	Often crawl-budget signal; check site authority + reduce low-value URLs
Excluded by 'noindex' tag	Intentionally excluded	Verify intent
Page with redirect	Redirected; not indexed itself	Verify the redirect target is indexed
Duplicate, Google chose different canonical than user	Google ignored your canonical	Check internal linking + canonical signals consistency
Soft 404	Returns 200 but content looks like a 404 page	Either return real 404 or fix the page
Server error (5xx)	Crawl failed	Fix the server error

4.3 IndexNow

IndexNow is a push-based indexing protocol supported by Bing, Yandex, Seznam, and Naver (not Google). When a URL changes, you POST it to IndexNow and supported engines crawl within minutes instead of days.

Implementation:

Generate an API key (random 32-character string)
Place the key as /{key}.txt at the domain root with the key as its content
POST URL changes to https://api.indexnow.org/indexnow:

POST /indexnow HTTP/1.1
Host: api.indexnow.org
Content-Type: application/json

{
  "host": "example.com",
  "key": "abc123...",
  "keyLocation": "https://example.com/abc123.txt",
  "urlList": [
    "https://example.com/new-page/",
    "https://example.com/updated-page/"
  ]
}

For WordPress, plugin support exists. For Next.js / static sites, integrate as a build-time hook.
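For a build-time hook, the POST above can be assembled with the Python standard library. This is a sketch only: build_indexnow_request and submit are hypothetical helper names, and it assumes the key file sits at the domain root and is named after the key, as in the implementation steps above.

```python
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"

def build_indexnow_request(host: str, key: str, urls: list[str]) -> urllib.request.Request:
    """Build (but do not send) the IndexNow POST for a batch of changed URLs."""
    payload = {
        "host": host,
        "key": key,
        # Assumption: the key file is served at https://{host}/{key}.txt
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )

def submit(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    so callers should be prepared to catch it.
    """
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the build step separate from the network call makes the payload easy to inspect in a dry run before anything is submitted.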

4.4 Mobile-First Indexing

Since 2023, Google indexes the mobile version of every site. The desktop version is largely ignored for ranking. Verify:

All content visible on mobile (not hidden behind "click to expand" with mobile-only display:none)
Structured data identical on mobile and desktop
Meta tags identical on mobile and desktop
Internal linking identical (no mobile-only menu omitting key links)
Images load on mobile (no desktop-only assets)

5. Canonicalization

5.1 Why Canonicalization Matters

Modern sites generate many URLs that resolve to the same content:

https://example.com/page and https://example.com/page/
https://example.com/page and https://www.example.com/page
https://example.com/page?utm_source=email
https://example.com/PAGE (some servers serve the same content for any case)
https://example.com/page?session=abc123

Without explicit canonical signals, Google picks one and may not pick the one you want. Ranking signals split across variants. Indexing decisions become inconsistent.
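The variants listed above can be collapsed to one form programmatically, which is useful when deduplicating a crawl export or building a redirect map. A minimal sketch, assuming an https, non-www, lowercase, trailing-slash policy; the tracking-parameter lists and the function name canonicalize are illustrative and should be adjusted to the client's intake variables.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed tracking parameters; extend per site.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"sessionid", "session", "fbclid"}

def canonicalize(url: str, trailing_slash: bool = True, strip_www: bool = True) -> str:
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if strip_www and host.startswith("www."):
        host = host[4:]
    path = parts.path.lower() or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if trailing_slash and not path.endswith("/") and "." not in last_segment:
        path += "/"  # add the slash, but leave file-like paths (e.g. .pdf) alone
    elif not trailing_slash and len(path) > 1:
        path = path.rstrip("/")
    # Drop tracking parameters, keep everything else in its original order.
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_KEYS and not k.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(("https", host, path, urlencode(kept), ""))

print(canonicalize("http://www.Example.com/PAGE?utm_source=email"))
```

Lowercasing the path is only safe when the site's case policy is lowercase, as in the intake defaults.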

5.2 Canonical Signal Stack

Canonical signals reinforce each other. Use all of them:

<link rel="canonical" href="..."> — the explicit declaration. Self-referential on the canonical URL itself.
301 redirects — for true duplicates, redirect non-canonical to canonical (preferred over rel=canonical when content is genuinely identical).
Internal linking — every internal link points to the canonical URL, never to a redirected variant.
XML sitemap — only canonical URLs appear in the sitemap.
hreflang annotations — when present, must reference canonical URLs only.
HTTP Link header — equivalent to rel=canonical, used for non-HTML resources.

If these signals disagree, Google picks one and ignores the rest. Consistency is the rule.
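Two of the signals above, rel=canonical and sitemap membership, are easy to cross-check in a script. The sketch below uses Python's html.parser; canonical_conflicts is a hypothetical helper, and it assumes page_url is itself meant to be the canonical URL (a parameterized variant would legitimately point elsewhere).

```python
from html.parser import HTMLParser

class CanonicalParser(HTMLParser):
    """Collects every <link rel="canonical"> href found in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonicals.append(attr["href"])

def canonical_conflicts(page_url: str, html: str, sitemap_urls: set[str]) -> list[str]:
    parser = CanonicalParser()
    parser.feed(html)
    problems = []
    if len(parser.canonicals) != 1:
        problems.append(f"expected 1 rel=canonical, found {len(parser.canonicals)}")
    for href in parser.canonicals:
        if href != page_url:
            problems.append(f"rel=canonical points elsewhere: {href}")
    if page_url not in sitemap_urls:
        problems.append("page is not listed in the sitemap")
    return problems
```

An empty list means the two signals agree for that page; internal links and redirects still need a crawler to verify.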

5.3 Common Canonicalization Mistakes

Canonical to a redirecting URL. Always canonical to the final URL that returns 200, never to one that 301s.
Cross-domain canonical without authority. You can canonical mirror.example.com to example.com, but canonicals across unrelated domains are usually ignored.
Self-canonical with parameters present. A URL like /page?utm_source=x should canonical to /page (no parameters), not self-canonical.
Conflicting canonicals between hreflang clusters. Each hreflang cluster member must canonical to itself, not to an English default.
HTTP version canonicalizing to HTTPS but not redirecting. Use a 301 plus self-canonical on HTTPS, not just rel=canonical.

5.4 Trailing Slash and Case

Pick one and enforce it sitewide via 301 redirect. Do not rely on canonicals alone — redirects collapse the duplication, canonicals only signal it.

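# nginx — enforce trailing slash + lowercase
# nginx cannot lowercase natively; this sketch assumes ngx_http_perl_module is available.
# http context:
perl_set $uri_lower 'sub { my $r = shift; return lc($r->uri); }';
# server context (an uppercase path without a slash takes two hops here):
if ($uri ~ [A-Z]) { return 301 $scheme://$host$uri_lower$is_args$args; }
rewrite ^([^.]*[^/])$ $1/ permanent;
location / {
  try_files $uri $uri/ =404;
}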

For Next.js, set trailingSlash: true (or false) in next.config.js and stick with it. Mixing breaks canonicalization.

6. Redirects

6.1 Status Codes

Code	Use case
301 Moved Permanently	The URL has permanently moved. Passes ranking signals. Default for migrations.
302 Found	Temporary redirect. Use only for actual temporary moves (A/B tests, seasonal pages). Misused 302s leak link equity.
307 Temporary Redirect	Like 302 but preserves request method. Rare in SEO context.
308 Permanent Redirect	Like 301 but preserves request method. Functionally equivalent for SEO.
410 Gone	Page is permanently removed and not coming back. Faster removal from index than 404.
451 Unavailable for Legal Reasons	Use when content removed for legal reasons (DMCA, jurisdictional).

6.2 Redirect Chains

A chain is A → B → C. Eliminate them. Every redirect should be a single hop to the final URL. Chains:

Waste crawl budget
Add latency for users
Risk dropping signals at each hop
Break when one link in the chain dies

Maintain a redirect map spreadsheet for any migration. After deploying redirects, crawl the site and verify zero chains.
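If the redirect map is kept as old-URL to new-URL pairs, chains can be collapsed before the rules reach the server. A minimal sketch, assuming the map fits in a dict; flatten_redirects is an illustrative name.

```python
def flatten_redirects(redirects: dict[str, str]) -> dict[str, str]:
    """Rewrite every entry so it points straight at its final destination."""
    flat = {}
    for source, target in redirects.items():
        seen = {source}
        # Follow the chain until the target is no longer itself redirected.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat

chained = {"/old/": "/interim/", "/interim/": "/new/"}
print(flatten_redirects(chained))
```

Loops raise an error instead of being silently dropped, since a loop in the map is a deploy blocker.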

6.3 Redirect Implementation Layers

In order of preference:

Server config (nginx, Apache, Cloudflare Rules) — fastest, most reliable, executed before page load.
CMS-level redirect plugin — fine for low-volume changes, performance penalty at scale.
JavaScript redirects — last resort. Slow, fragile, sometimes ignored by crawlers.
Meta refresh redirects — never use. Treated as low-quality signal.

6.4 The Migration Redirect Pattern

When migrating URL structure:

Generate a 1:1 map of old URL → new URL for every indexed page.
Implement 301s in server config before deploy.
Update internal links to point to new URLs (do not rely on the redirect).
Update XML sitemap to list only new URLs.
Submit new sitemap in GSC.
Monitor GSC's Coverage report for 30-90 days.
Keep redirects in place permanently — old URLs have inbound links from sites you don't control.

Cross-reference: framework-migration.md for full migration methodology.

7. URL Structure

7.1 The Eight URL Rules

Lowercase. Always.
Hyphens between words. Not underscores. Not camelCase.
Under 60 characters when possible. Long URLs index fine but truncate in SERPs.
No stop words unless meaningful. /the-best-web-hosting/ reads better as /best-web-hosting/.
Descriptive, not numeric. /blog/post-1234/ is opaque; /blog/local-seo-checklist/ is meaningful.
No file extensions where avoidable. /about/ over /about.html or /about.php.
One canonical separator policy. Don't mix /category/post/ and /category-post/.
Stable. Once published, do not change a URL without redirecting.
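Several of these rules are mechanical enough to lint. The sketch below checks a URL path against the lowercase, hyphen, length, extension, and descriptive-slug rules; lint_url_path is a hypothetical helper, the 60-character limit is assumed to apply to the path, and the numeric-slug check is a rough heuristic that will flag some legitimate slugs.

```python
import re

def lint_url_path(path: str, max_length: int = 60) -> list[str]:
    issues = []
    if path != path.lower():
        issues.append("contains uppercase characters")
    if "_" in path:
        issues.append("uses underscores instead of hyphens")
    if len(path) > max_length:
        issues.append(f"longer than {max_length} characters")
    if re.search(r"\.(html?|php|aspx?)$", path):
        issues.append("exposes a file extension")
    # Heuristic: a last segment like /1234/ or /post-1234/ is treated as opaque.
    if re.search(r"/\d+/?$", path) or re.search(r"/[a-z]+-\d+/?$", path):
        issues.append("last segment looks numeric rather than descriptive")
    return issues

for candidate in ["/blog/local-seo-checklist/", "/Blog/My_Post.html", "/blog/post-1234/"]:
    print(candidate, lint_url_path(candidate))
```

Stop words, separator consistency, and URL stability need human review or a diff against the previous crawl, so they are left out.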

7.2 URL Hierarchy and Crawl Depth

A URL's path depth (/a/b/c/d/page/) is not the same as its crawl depth (clicks from the homepage). Crawl depth is what matters for SEO; path depth is only loosely correlated with it.

Target: every important page reachable in 3 clicks or fewer from the homepage. Verify with Sitebulb's Crawl Depth report.

Cross-reference: framework-internallinking.md for hub-and-spoke architecture.

7.3 Parameter Handling

URL parameters create duplication. Strategies:

Block in robots.txt — for tracking parameters that should never be indexed (?utm_*, ?sessionid=, ?fbclid=).
Canonical to parameterless version — for sort/filter parameters where the parameterless version is canonical.
Self-canonical with noindex — for parameter combinations that are unique pages but should not be indexed.
GSC URL Parameter tool — deprecated as of 2022. Use canonical signals instead.

8. XML Sitemaps

8.1 What Belongs in a Sitemap

Only canonical, indexable, 200-status URLs that you want indexed. Everything else stays out:

Excluded: noindex pages, redirected URLs, 4xx URLs, duplicate URLs, parameterized variants
Excluded: pagination URLs (page/2/, page/3/) unless you have a strategic reason
Excluded: search result pages, login pages, account pages

8.2 Sitemap Structure

For sites under 50,000 URLs, a single sitemap is fine. Above that, use a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-images.xml</loc>
    <lastmod>2026-05-05</lastmod>
  </sitemap>
</sitemapindex>

Per-sitemap entry:

<url>
  <loc>https://example.com/page/</loc>
  <lastmod>2026-05-05</lastmod>
</url>

<changefreq> and <priority> are ignored by Google. <lastmod> is honored when accurate; if you fake it (always = today), Google starts ignoring it.
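The per-URL entries above can be generated with Python's standard library. A sketch only: build_sitemap and its (loc, lastmod) input shape are illustrative, and the lastmod values are assumed to come from real content modification dates, not the build date.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries: list[tuple[str, str]]) -> str:
    """entries: (canonical URL, last-modified date as YYYY-MM-DD) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # ElementTree escapes & and < for us
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

print(build_sitemap([("https://example.com/page/", "2026-05-05")]))
```

Feeding this function only URLs that already passed the canonical and status checks keeps non-indexable entries out by construction.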

8.3 Specialized Sitemaps

Image sitemap — for image-heavy sites; accelerates Image Pack inclusion.
Video sitemap — recommended for self-hosted video content; works alongside or instead of VideoObject schema.
News sitemap — for sites approved for Google News; only includes articles from the last 48 hours.
hreflang sitemap — programmatic alternative to inline hreflang tags.

8.4 Submission

Submit sitemap URL in GSC and Bing Webmaster Tools.
Reference sitemap URL in robots.txt (Sitemap: https://...).
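For large sites, do not rely on sitemap pings: Google retired its ping endpoint in 2023. Keep <lastmod> accurate so changed URLs are picked up when the sitemap is re-read, and use IndexNow for the engines that support it.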

9. Status Codes — The Operator's Reference

Beyond redirects, status codes communicate site health to crawlers. The full inventory worth knowing:

9.1 2xx Success

200 OK — page delivered. Default for healthy URLs.
201 Created — resource created (POST endpoints).
204 No Content — success, no body returned.
206 Partial Content — range request (video, large download resume).

9.2 3xx Redirection

301 / 308 — permanent. Default for migrations.
302 / 307 — temporary. Use sparingly.
304 Not Modified — conditional GET succeeded; client uses its cached copy. Healthy at scale.

9.3 4xx Client Errors

400 Bad Request — malformed request. Usually bot or attack traffic.
401 Unauthorized — auth required.
403 Forbidden — access denied. If on a public page, check nginx/htaccess rules.
404 Not Found — URL does not exist. Track in GSC and either redirect or 410.
405 Method Not Allowed — right URL, wrong verb. Common form-handler misconfiguration.
410 Gone — permanently removed. Faster removal from index than 404.
429 Too Many Requests — rate limited. Usually bot abuse.
451 Unavailable for Legal Reasons — content removed for legal cause.

9.4 5xx Server Errors

500 Internal Server Error — application crash.
502 Bad Gateway — nginx couldn't reach upstream.
503 Service Unavailable — maintenance or overload. Use Retry-After header.
504 Gateway Timeout — upstream too slow.

5xx codes are urgent. Sustained 5xx during a Googlebot crawl drops pages from the index.

10. JavaScript Rendering

10.1 The Two-Wave Indexing Model (Mostly Obsolete)

For years, Google's two-wave model meant JS sites were indexed late: the first wave indexed HTML, the second wave indexed rendered content days later. As of 2025, Google renders the vast majority of pages within hours of the first crawl. The two-wave problem is no longer a structural blocker for most sites.

It is still real for AI crawlers. GPTBot, ClaudeBot, and PerplexityBot do not all render JS reliably. For AI search visibility, server-side rendered content matters more than for traditional Google SEO.

10.2 Rendering Strategy by Content Type

Content type	Recommended rendering
Marketing pages, landing pages	SSG or SSR (no client-side rendering for primary content)
Blog posts, articles	SSG
Ecommerce product pages	SSR with hydration
Logged-in user pages	CSR is fine (these aren't indexed anyway)
Real-time data displays	SSR shell + client hydration

10.3 Validation

Use the URL Inspection tool's "Test Live URL" (or the Rich Results Test; Google retired the standalone Mobile-Friendly Test in December 2023) to see what Googlebot actually renders. If primary content is missing, the page is not effectively indexed even if it returns 200.

For AI crawler visibility, curl -A "GPTBot" https://example.com/page and inspect the HTML response body. If the content is in <noscript> only or arrives via fetch/XHR, AI crawlers miss it.
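The curl check above can be scripted across a list of URLs. The sketch below fetches the raw HTML with a chosen User-Agent and tests whether a known phrase from the primary content is present without executing JavaScript. The helper names are hypothetical, and text inside script, style, and noscript is skipped, following the section's treatment of noscript-only content as missed.

```python
import urllib.request
from html.parser import HTMLParser

class RawTextExtractor(HTMLParser):
    """Collects text nodes, ignoring script, style, and noscript content."""
    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def phrase_in_raw_html(html: str, phrase: str) -> bool:
    parser = RawTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # normalize whitespace
    return phrase.lower() in text.lower()

def fetch_raw_html(url: str, user_agent: str = "GPTBot") -> str:
    """Fetch a page the way a non-rendering crawler would (no JS execution)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

A page passes when phrase_in_raw_html(fetch_raw_html(url), "a sentence from the main content") is true. Sending a crawler's User-Agent string only tests what the server returns for that header; it does not reproduce the crawler's IP-based treatment by a CDN or firewall.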

Cross-reference: framework-headless.md and framework-nextjs.md for framework-specific rendering patterns.

11. HTTPS

In 2026, HTTPS is non-negotiable. HTTP-only sites suffer ranking penalties, browser warnings, and lost trust signals.

11.1 Required Configuration

TLS 1.2 minimum, TLS 1.3 preferred.
Valid certificate from a recognized CA (Let's Encrypt, Cloudflare, paid).
HSTS header: Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
HSTS preload list submission for high-value domains: hstspreload.org
HTTP-to-HTTPS 301 redirect, single hop.
Canonical and internal links use HTTPS exclusively.
Mixed-content audit: no HTTP resources loaded on HTTPS pages.
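The HSTS requirement can be verified from a captured response header. A small sketch: check_hsts is an illustrative name, and the one-year minimum mirrors the header value recommended above.

```python
def check_hsts(header_value: str, min_max_age: int = 31536000) -> list[str]:
    """Return a list of problems found in a Strict-Transport-Security value."""
    directives = {}
    for part in header_value.split(";"):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')
    problems = []
    max_age = directives.get("max-age", "")
    if not max_age.isdigit():
        problems.append("max-age missing or not a number")
    elif int(max_age) < min_max_age:
        problems.append(f"max-age below {min_max_age}")
    if "includesubdomains" not in directives:
        problems.append("includeSubDomains missing")
    if "preload" not in directives:
        problems.append("preload missing (only needed for preload-list submission)")
    return problems

print(check_hsts("max-age=31536000; includeSubDomains; preload"))
```

The header value itself can be captured with curl -sI against the HTTPS origin and passed to the function.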

11.2 Certificate Maintenance

Auto-renewal via Let's Encrypt or hosting provider.
Monitor expiration via UptimeRobot or similar.
Verify certificate transparency log entries via crt.sh.

Cross-reference: framework-security.md for broader security posture.

12. International / hreflang

For sites targeting multiple languages or regions, hreflang annotations tell Google which version to show which user.

12.1 hreflang Implementation

Three valid placement methods, listed in order of preference:

HTTP Link header — non-HTML resources, full programmatic control.
XML sitemap hreflang — preferred for large sites; centralizes all annotations in one place.
<link rel="alternate"> tags in <head> — most common; works but harder to maintain at scale.

<link rel="alternate" hreflang="en-US" href="https://example.com/en-us/page/">
<link rel="alternate" hreflang="en-GB" href="https://example.com/en-gb/page/">
<link rel="alternate" hreflang="x-default" href="https://example.com/page/">

12.2 hreflang Rules

Every page in a cluster must reference every other page in the cluster (return-link requirement).
Every page must self-reference.
Use x-default for the fallback when no language matches.
Use ISO 639-1 language + ISO 3166-1 region (en-US, not en_US).
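The rules above lend themselves to an automated check once the annotations have been extracted from a crawl. A minimal sketch, assuming the data is available as a mapping from each page URL to its {hreflang code: alternate URL} annotations; hreflang_problems is a hypothetical helper and does not validate the language or region codes themselves.

```python
def hreflang_problems(clusters: dict[str, dict[str, str]]) -> list[str]:
    """clusters maps each page URL to its {hreflang_code: alternate_url} annotations."""
    problems = []
    for page, alternates in clusters.items():
        if page not in alternates.values():
            problems.append(f"{page}: missing self-reference")
        for code, target in alternates.items():
            if "_" in code:
                problems.append(f"{page}: '{code}' uses an underscore, expected a hyphen")
            if target == page:
                continue
            # Return-link rule: the target must annotate this page back.
            if page not in clusters.get(target, {}).values():
                problems.append(f"{page}: no return link from {target}")
        if "x-default" not in alternates:
            problems.append(f"{page}: no x-default fallback")
    return problems
```

A target that was never crawled counts as having no return link, which is usually the right default for an audit.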

Cross-reference: framework-international.md for full hreflang depth.

13. Crawler Observability

Knowing what crawlers actually do on a site, not what you think they do.

13.1 Server Log Analysis

The single best technical SEO data source. Server logs (nginx access logs, Apache access logs, Cloudflare logs) record every bot request with status, response time, and user agent.

Tools:

Screaming Frog Log File Analyser — desktop, parses nginx/Apache logs
GoAccess — terminal log viewer with web reports
Custom log pipeline — for clients with infrastructure depth, ship logs to BigQuery / DuckDB and query directly

What to look for:

Crawl frequency per URL (which pages does Googlebot revisit most often?)
Crawl-budget waste (low-value URLs with high crawl rate)
4xx/5xx clusters (pages crawlers are repeatedly hitting that error)
Slow URLs (high response time correlates with reduced crawl rate)
Bot fingerprint verification (real Googlebot? or a UA-spoofing scraper?)
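The first three items can be pulled from an access log with a short script. The sketch below assumes nginx's default "combined" log format and matches bots by User-Agent substring only; crawl_summary is an illustrative name, and because a UA match does not prove the request came from the real crawler, IP verification remains a separate step.

```python
import re
from collections import Counter

# nginx "combined" format: ip - user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def crawl_summary(lines, bot_token="Googlebot"):
    """Count hits per URL and per status class for one bot's requests."""
    hits_per_path, status_counts = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or bot_token.lower() not in match["ua"].lower():
            continue
        hits_per_path[match["path"]] += 1
        status_counts[match["status"][0] + "xx"] += 1
    return hits_per_path, status_counts
```

Calling hits_per_path.most_common(20) on the result surfaces the most-crawled URLs; parameterized paths near the top of that list are the crawl-budget waste described above.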

13.2 Bot Verification

Anyone can claim to be Googlebot. Verify:

Reverse DNS lookup of the bot's IP must resolve to googlebot.com or google.com
Forward DNS of that hostname must resolve back to the same IP
Google publishes its IP ranges at developers.google.com/search/apis/ipranges/googlebot.json

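For Bing: reverse DNS resolves to search.msn.com. For Apple: applebot.apple.com. AI crawlers vary: OpenAI (GPTBot) and Perplexity (PerplexityBot) publish IP range files, and vendor practices change, so check each vendor's current documentation (including Anthropic's for ClaudeBot) before allowlisting.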
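The reverse-then-forward DNS check for Googlebot can be sketched with Python's socket module. The function name and the injectable lookup parameters are illustrative (they exist so the logic can be tested without network access), and the default forward lookup is IPv4-only.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip: str, reverse_lookup=None, forward_lookup=None) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    # Defaults use real DNS; gethostbyname_ex is IPv4-only, so IPv6 callers
    # would need a getaddrinfo-based forward lookup instead.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP, or the PTR record is spoofed.
        return ip in forward_lookup(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

The leading dot in each suffix matters: it rejects hostnames such as evilgooglebot.com that merely end with the brand string. Caching results per IP keeps this off the request path.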

13.3 GSC URL Inspection

For specific URLs, GSC's URL Inspection tool shows:

Last crawl date
Last response code
Indexed status
User-declared canonical vs. Google-selected canonical
Rendered HTML (live test)
Discovered referring URLs

Use this to debug specific indexing problems.

14. Audit Mode

#	Criterion	Pass/Fail
TS1	robots.txt returns 200 plain text, allows critical resources
TS2	XML sitemap returns 200, validates, lists only canonical indexable URLs
TS3	Sitemap submitted to Google Search Console and Bing Webmaster
TS4	All canonical URLs return 200
TS5	Self-referential rel=canonical on every indexable page
TS6	HTTP-to-HTTPS 301 redirect, single hop
TS7	www / non-www unified via 301, single hop
TS8	Trailing slash policy enforced sitewide
TS9	URLs lowercase, no mixed-case duplicates
TS10	Zero redirect chains (every redirect single-hop)
TS11	No 4xx URLs in sitemap or internal links
TS12	No 5xx URLs detected in last 30 days
TS13	HSTS header present with min 1-year max-age
TS14	TLS 1.2+ enforced, valid certificate, no mixed content
TS15	Mobile rendering verified (URL Inspection live test or Lighthouse mobile emulation)
TS16	JS-rendered content visible to Googlebot via URL Inspection
TS17	IndexNow key deployed (for Bing/Yandex/Naver indexing)
TS18	hreflang correctly implemented if multi-region
TS19	Bot verification logic in place (no UA-spoofing scrapers treated as legitimate crawlers)
TS20	Server logs analyzed at least quarterly for crawl-budget waste
TS21	Zero soft 404s in GSC
TS22	Crawled — currently not indexed bucket under 10% of total
TS23	Discovered — currently not indexed bucket under 5% of total
TS24	URL parameter strategy documented (block / canonical / index per parameter)
TS25	Pagination strategy documented (rel=next/prev replaced or supplemented)
TS26	AI crawler policy in robots.txt explicit and documented
TS27	Duplicate-content audit completed in last 90 days
TS28	All 301 redirects retained from migrations (don't expire redirects)
TS29	Crawl depth report shows zero pages over depth 3 (small sites) or 5 (large)
TS30	URLs under 60 characters where possible
TS31	Server response time under 600ms for HTML responses
TS32	GSC Coverage report shows zero "server error" URLs
TS33	GSC URL Inspection on 5 random pages confirms canonical, indexed status, no rendering issues
TS34	Lighthouse SEO score 100 on representative sample of pages
TS35	No JavaScript-only navigation (every link reachable via crawl without rendering)

Scoring: one point per passing check, 35 maximum. World-class: 33+/35.

15. Common Mistakes

Blocking CSS/JS in robots.txt — Google needs them to render. Frequently breaks indexing.
Canonical pointing to a redirecting URL — invalidates the canonical signal.
Multiple canonical signals disagreeing — internal links say A, sitemap says B, rel=canonical says C; Google picks one and ignores the others.
Trailing slash inconsistency — half the site with /, half without; treated as duplicate URLs.
Redirect chains — A→B→C→D wastes crawl, leaks signals.
302s where 301 belongs — temporary redirect on a permanent move; ranking signals leak.
Soft 404s — page returns 200 but says "not found"; Google detects and demotes.
Indexable thin pages — tag archives, paginated category pages, and internal search result pages left indexable despite offering no standalone value.
JavaScript navigation only — links rendered via JS; crawl depth report shows orphaned pages.
Fake <lastmod> in sitemap — every URL claims today's date; Google starts ignoring lastmod entirely.
HTTPS deployed but HTTP not redirected — both versions serve and both can be indexed, splitting ranking signals across duplicates.
Mixed content — HTTPS page loads HTTP resources; browser blocks, layout breaks, ranking suffers.
AI crawler blocking by accident — a blanket Disallow: / or an overbroad wildcard also blocks legitimate AI crawlers, costing citation traffic.
No IndexNow — Bing, Yandex, Naver indexing days late when push-based could do it in minutes.
Forgotten redirects after migration — old redirects removed prematurely; old inbound links 404.

16. Maintenance

Weekly:

Review GSC Coverage report for new errors
Check Bing Webmaster for crawl issues
Spot-check new URLs are indexable and in sitemap

Monthly:

Full GSC report review across all categories
Sitemap regeneration verification
Redirect map audit
Server log spot-check for 4xx/5xx clusters

Quarterly:

Comprehensive crawl audit (Screaming Frog or Sitebulb)
Server log deep analysis
Crawl-budget evaluation
Canonical audit
HTTPS / HSTS verification
AI crawler policy review (new bots emerge frequently)

Annually:

Full URL inventory and structural review
Migration redirect retention verification
Hreflang validation if international
IndexNow key rotation if compromised

17. Companion Documents

framework-pageexperience.md — Core Web Vitals, mobile, intrusive interstitials
framework-schema.md — Structured data implementation
framework-internallinking.md — Hub-and-spoke architecture
framework-migration.md — Site moves and URL restructures
framework-security.md — Broader security posture
framework-international.md — Full hreflang depth
framework-aicitations.md — AI crawler policy and visibility
framework-headless.md — Headless CMS rendering patterns

Document version: 1.0 · Last updated: 2026-05-05 · Owner: Joseph W. Anady — ThatDeveloperGuy — SDVOSB

Frequently asked questions

How do I conserve crawl budget on a large site?

Crawl budget matters above roughly 50,000 URLs. To conserve it: block parameterized and sessioned URLs in robots.txt, serve 410 on truly dead URLs (faster removal than 404), reduce internal links to low-value pages, apply noindex to pages that shouldn't be indexed (404 templates, search results, thin tag archives), and maintain a clean XML sitemap so the crawler has a prioritized list.

What is the canonical signal stack and why use all of it?

Canonical signals reinforce each other, so use all: a self-referential rel=canonical link, 301 redirects for true duplicates, internal links that always point to the canonical URL, an XML sitemap containing only canonical URLs, hreflang annotations referencing canonical URLs only, and the HTTP Link header for non-HTML resources. If these signals disagree, Google picks one and ignores the rest. Consistency is the rule.

Does JavaScript rendering still hurt SEO in 2026?

For Google, mostly no. The old two-wave model is largely obsolete; as of 2025 Google renders most pages within hours of the first crawl. But it still matters for AI crawlers: GPTBot, ClaudeBot, and PerplexityBot do not all render JS reliably, so server-side rendered content matters more for AI search visibility. Test with curl -A "GPTBot" and inspect the HTML body.

How should I set my robots.txt AI crawler policy?

In 2026 the question is which AI crawlers you want citing you versus blocking. A typical posture allows GPTBot, ClaudeBot, PerplexityBot, and Google-Extended for citation traffic, while blocking aggressive scrapers like AhrefsBot, SemrushBot, and MJ12Bot by default. robots.txt is advisory, not security; never block CSS, JS, images, or fonts that crawlers need to render.

What is IndexNow and which engines support it?

IndexNow is a push-based indexing protocol supported by Bing, Yandex, Seznam, and Naver, but not Google. When a URL changes you POST it to api.indexnow.org and supported engines crawl within minutes instead of days. Implementation: generate a random 32-character API key, place it as /{key}.txt at the domain root, then POST the host, key, keyLocation, and urlList as JSON.
