SEO & AI Engine Optimization Framework · May 2026

Multimodal Search: text, image, voice, and video query optimization

Visual + Audio + Text Understanding, Google Lens Optimization, Image-Based Discovery, Audio Content Indexing, and the Convergence of Search Modalities

A comprehensive installation and audit reference for multimodal search optimization — the layer where AI systems understand content across visual, audio, video, and text simultaneously. Where traditional search treated images as accompaniments to text and audio as an afterthought, modern AI systems (Gemini, GPT-4o, Claude with vision) reason about content holistically across modalities.

Cross-stack implementation note: the code samples in this framework are written in plain HTML for clarity. For React, Vue, Svelte, Next.js, Nuxt, SvelteKit, Astro, Hugo, 11ty, Remix, WordPress, Shopify, and Webflow equivalents of every pattern below, see framework-cross-stack-implementation.md. For pure client-rendered SPAs (no SSR/SSG) see framework-react.md. For Tailwind-specific concerns (purge, dynamic classes, dark-mode CLS, focus accessibility) see framework-tailwind.md.


1. Document Purpose

The traditional SEO mental model: text content gets indexed, images get alt text for accessibility and image search, videos get tags and descriptions, and audio is barely indexed at all. Each modality is optimized more or less independently.

The multimodal AI reality: a single model processes all modalities simultaneously. When a user asks an AI engine "What's the cheapest way to fix this?" while showing a photo of a broken fence, the AI sees the photo, understands the damage, retrieves relevant content from indexed sites, and synthesizes a response. The text content of indexed sites must align with what's visible in user images. The visual content of indexed sites must convey information accurately. The audio content of indexed videos must be transcribable and meaningful.

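This framework specifies optimization across the integrated modality landscape: visual content (Section 2), audio content (Section 3), video (Section 4), cross-modal content strategy (Section 5), Google Lens (Section 6), and voice search (Section 7).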

The 2025-2026 emergence of robust multimodal AI (Gemini 2.0+, GPT-4o, Claude 4 with vision) makes this layer increasingly important for content discovery.

1.1 Required Tools


2. Visual Content Optimization

Beyond traditional image SEO (covered in framework-imageseo.md), multimodal optimization requires:

2.1 Visual Content Should Convey Information

Traditional image SEO: "image with alt text describing what's in it."

Multimodal optimization: "image that conveys information AI can reason about."

multimodal_image_principles:
  
  informational_density:
    description: "Images should communicate information, not just decoration"
    examples:
      good:
        - "Diagram showing relationship between concepts"
        - "Annotated screenshot with labeled elements"
        - "Before/after comparison with clear labels"
      weak:
        - "Generic stock photo of person typing on laptop"
  
  text_in_images:
    description: "Text within images should be readable by OCR"
    standards:
      - "High contrast text"
      - "Sans-serif fonts (better OCR)"
      - "Sufficient size (12pt minimum at intended display size)"
      - "Avoid stylized text where information matters"
  
  diagram_clarity:
    description: "Diagrams should communicate relationships clearly"
    standards:
      - "Clear labels"
      - "Logical flow"
      - "Distinguishable elements"
      - "Legend if needed"
  
  consistency_with_surrounding_content:
    description: "Image should match what surrounding text describes"
    why: "AI extracts both; inconsistency confuses"

2.2 Alt Text for AI Reasoning

Traditional alt text: brief description.

Multimodal-optimized alt text: comprehensive description enabling reasoning.

<!-- Traditional alt text -->
<img src="/diagram.png" alt="14-tier framework diagram">

<!-- Multimodal-optimized alt text -->
<img src="/diagram.png" alt="ThatDeveloperGuy 14-tier optimization framework diagram showing hierarchical structure: Tier 1 Foundation containing 28 items including HTML structure and meta tags, Tier 2 Search Visibility containing 28 items, Tier 3 AI Domination containing 20 items including AEO, GEO, and LLMO, continuing through Tier 14 Advanced and Immersive containing 8 items including AGO, ARO, VRO, and SPC. Total of 183 individual optimizations across the framework.">

The longer alt text:

For decorative images, empty alt is still appropriate. For informational images, comprehensive alt unlocks AI extraction.

2.3 Image Captions

Captions visible to users serve dual purpose: human reading and AI understanding.

<figure>
  <img src="/team-photo.jpg" alt="ThatDeveloperGuy founder Joseph Anady at Cassville Missouri office">
  <figcaption>
    Joseph W. Anady, founder of ThatDeveloperGuy, in the Cassville, Missouri office (April 2026). 
    The Service-Disabled Veteran-Owned Small Business currently manages 130+ production websites 
    on self-managed Linux infrastructure.
  </figcaption>
</figure>

Captions provide:

2.4 Image Schema

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "ImageObject",
  "@id": "https://example.com/images/team-photo.jpg",
  "url": "https://example.com/images/team-photo.jpg",
  "contentUrl": "https://example.com/images/team-photo.jpg",
  "width": 1600,
  "height": 900,
  "caption": "ThatDeveloperGuy founder Joseph Anady at Cassville office",
  "description": "Photograph showing Joseph W. Anady, Service-Disabled Veteran-Owned Small Business founder, in the Cassville Missouri office where ThatDeveloperGuy operates. The 130+ production client websites are managed from this location on self-managed Linux infrastructure named Bubbles.",
  "creator": {"@id": "https://thatdeveloperguy.com/#organization"},
  "creditText": "Photo by ThatDeveloperGuy",
  "copyrightNotice": "© 2026 ThatDeveloperGuy",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "datePublished": "2026-04-15",
  "about": {
    "@id": "https://thatdeveloperguy.com/about/joseph-anady/#person"
  },
  "contentLocation": {
    "@type": "Place",
    "name": "Cassville, Missouri",
    "geo": {
      "@type": "GeoCoordinates",
      "latitude": 36.6770,
      "longitude": -93.8730
    }
  }
}
</script>

Comprehensive image schema enables AI to understand:

2.5 Visual Content for Specific Use Cases

visual_content_by_purpose:
  
  product_photography:
    multimodal_optimization:
      - "Multiple angles for AI to understand product fully"
      - "Scale references (objects of known size)"
      - "Detail shots with clear focus on features"
      - "Use context shots showing product in use"
    schema:
      - "Product schema with image array"
      - "Detailed alt text describing product features"
  
  team_photos:
    multimodal_optimization:
      - "Real team members, not stock photos (reused stock imagery is easy to match across sites)"
      - "Professional but authentic"
      - "Names associated with photos via Person schema"
      - "Consistent across site (same photos in author bios, about page)"
  
  diagrams_and_infographics:
    multimodal_optimization:
      - "Self-contained (don't require surrounding context)"
      - "Clear visual hierarchy"
      - "Embedded text readable"
      - "Description below image summarizing for AI extraction"
  
  screenshots:
    multimodal_optimization:
      - "Annotated when steps shown"
      - "Original quality (don't compress to unreadability)"
      - "Up-to-date (UI changes break understanding)"
      - "Caption explaining what's shown"
  
  data_visualizations:
    multimodal_optimization:
      - "Chart type appropriate to data"
      - "Axis labels clear"
      - "Values readable"
      - "Title and legend present"
      - "Provide data table alongside chart for AI extraction"
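The last item above (a data table alongside the chart) can be sketched in plain HTML roughly as follows. The file name, chart title, and figures are placeholder values invented for illustration, not measurements from this framework.

```html
<!-- Chart image for humans, plus the same values as a real table so they exist as text -->
<figure>
  <img src="/charts/sessions-by-source.png"
       alt="Bar chart of monthly sessions by traffic source; exact values are listed in the table below"
       width="1200" height="675">
  <figcaption>Monthly sessions by traffic source (placeholder data).</figcaption>
</figure>

<table>
  <caption>Monthly sessions by traffic source (placeholder data)</caption>
  <thead>
    <tr>
      <th scope="col">Source</th>
      <th scope="col">Sessions</th>
    </tr>
  </thead>
  <tbody>
    <tr><th scope="row">Organic search</th><td>1,200</td></tr>
    <tr><th scope="row">AI assistants</th><td>300</td></tr>
    <tr><th scope="row">Direct</th><td>500</td></tr>
  </tbody>
</table>
```

Keeping the table in the page markup, rather than only inside the image, means the numbers stay extractable even when the chart itself is not parsed.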

3. Audio Content Optimization

Audio content (podcasts, video voiceovers, audiobooks, etc.) is increasingly indexed and extractable.

3.1 Transcription as Foundation

Audio without transcripts is largely invisible to text-based search and AI.

transcription_strategy:
  
  podcasts:
    requirement: "Full episode transcripts on website"
    format: "Time-stamped if possible, allowing seek to specific moments"
    location: "Dedicated episode page on website + show notes on podcast platforms"
  
  video_voiceovers:
    requirement: "Captions/subtitles + companion article transcript"
    format: "WebVTT or SRT for video; full text on page"
  
  audiobooks_and_audio_courses:
    requirement: "Searchable transcript per chapter/lesson"
    benefit: "Captures long-tail queries from spoken content"
  
  recorded_webinars:
    requirement: "Full transcript + key timestamps"
    format: "Embedded with video; companion article"
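The "captions plus companion transcript" pattern above might look like this in plain HTML. This is a minimal sketch: the file paths and the transcript wording are placeholders, and the WebVTT file is assumed to exist at the given path.

```html
<!-- Video with a WebVTT caption track -->
<video controls preload="metadata" poster="/videos/overview-poster.jpg" width="1280" height="720">
  <source src="/videos/overview.mp4" type="video/mp4">
  <track kind="captions" src="/videos/overview.en.vtt" srclang="en" label="English" default>
</video>

<!-- Full transcript as ordinary page text, so it is indexable without playing the video -->
<section id="transcript">
  <h2>Transcript</h2>
  <p><strong>0:00</strong> Placeholder: opening remarks and context.</p>
  <p><strong>1:23</strong> Placeholder: explanation of the core concept.</p>
  <p><strong>3:45</strong> Placeholder: step-by-step walkthrough.</p>
</section>
```

The caption track serves viewers inside the player; the on-page transcript is what text-based crawlers are most likely to read.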

3.2 Audio Schema

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "PodcastEpisode",
  "@id": "https://example.com/podcast/episode-42/#episode",
  "name": "How AI Search is Reshaping Small Business Marketing",
  "description": "Comprehensive discussion of AI search optimization, AEO, and how small businesses can adapt.",
  "datePublished": "2026-04-29",
  "duration": "PT45M30S",
  "url": "https://example.com/podcast/episode-42/",
  "associatedMedia": {
    "@type": "AudioObject",
    "contentUrl": "https://example.com/podcast/episode-42.mp3",
    "encodingFormat": "audio/mpeg",
    "duration": "PT45M30S",
    "transcript": "Full transcript text goes here"
  },
  "partOfSeries": {
    "@type": "PodcastSeries",
    "name": "ThatDeveloperGuy Podcast",
    "url": "https://example.com/podcast/"
  },
  "actor": [
    {"@type": "Person", "name": "Joseph W. Anady"},
    {"@type": "Person", "name": "Guest Name"}
  ]
}
</script>

For standalone audio content:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "AudioObject",
  "name": "Audio content title",
  "description": "...",
  "contentUrl": "https://example.com/audio.mp3",
  "duration": "PT12M",
  "transcript": "Full transcript text goes here"
}
</script>

3.3 Podcast SEO Specifically

podcast_seo:
  
  podcast_directories:
    - Apple Podcasts (largest US audience)
    - Spotify (rapidly growing podcast platform)
    - Google Podcasts (shut down in 2024; listeners were migrated to YouTube Music)
    - Amazon Music
    - YouTube Podcasts (growing rapidly)
    - Niche directories per industry
  
  episode_optimization:
    - Title with primary keyword and compelling hook
    - Show notes 500+ words with timestamps and links
    - Full transcript on episode page
    - Episode artwork distinct per episode
    - Guest information with links
  
  per_episode_landing_page:
    structure:
      - Episode title and metadata
      - Embedded player
      - Show notes with timestamps
      - Full transcript
      - Resources mentioned (with links)
      - Guest bio (if applicable)
      - Related episodes
      - PodcastEpisode schema
  
  episode_promotion:
    - Social media (audiograms — short video clips with audio)
    - Email newsletter
    - Cross-promotion with other podcasts
    - Coverage in industry publications

4. Video Multimodal Optimization

Video combines visual + audio. See framework-videoseo.md for foundations. Multimodal additions:

4.1 Video for AI Multimodal Reasoning

When AI reasons about video content:

Optimization implications:

4.2 Video Chapters with Multimodal Awareness

Chapters should reflect content meaningful to multimodal extraction:

0:00 Introduction — establish context and goals
1:23 The fundamental concept (with visual diagram)
3:45 Step-by-step walkthrough (screen recording)
7:12 Common mistakes (with annotated examples)
9:30 Advanced applications (with case studies)
12:00 Summary and next steps (with key visuals)

Each chapter should have visual content matching audio content. AI extracting "common mistakes" from this video gets both the spoken explanation and the visual examples.
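For self-hosted video, the chapter list above can also be expressed in structured data as Clip entries on a VideoObject. The sketch below converts the timestamps to seconds (1:23 is 83, 3:45 is 225, and so on). The video title, URLs, and upload date are placeholders, and the `?t=` deep links assume the page's player actually honors that parameter. The last clip has no endOffset because the total video length is not stated here.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "VideoObject",
  "name": "Placeholder video title",
  "description": "Placeholder description of the tutorial.",
  "thumbnailUrl": "https://example.com/videos/tutorial/thumb.jpg",
  "uploadDate": "2026-04-29T08:00:00+00:00",
  "contentUrl": "https://example.com/videos/tutorial.mp4",
  "hasPart": [
    {"@type": "Clip", "name": "Introduction", "startOffset": 0, "endOffset": 83,
     "url": "https://example.com/videos/tutorial/?t=0"},
    {"@type": "Clip", "name": "The fundamental concept", "startOffset": 83, "endOffset": 225,
     "url": "https://example.com/videos/tutorial/?t=83"},
    {"@type": "Clip", "name": "Step-by-step walkthrough", "startOffset": 225, "endOffset": 432,
     "url": "https://example.com/videos/tutorial/?t=225"},
    {"@type": "Clip", "name": "Common mistakes", "startOffset": 432, "endOffset": 570,
     "url": "https://example.com/videos/tutorial/?t=432"},
    {"@type": "Clip", "name": "Advanced applications", "startOffset": 570, "endOffset": 720,
     "url": "https://example.com/videos/tutorial/?t=570"},
    {"@type": "Clip", "name": "Summary and next steps", "startOffset": 720,
     "url": "https://example.com/videos/tutorial/?t=720"}
  ]
}
</script>
```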

4.3 Live Streaming and Multimodal AI

Live content increasingly indexed:

For business-relevant live content (webinars, product launches, Q&A sessions):


5. Cross-Modal Content Strategy

5.1 The Pillar Content Approach

For high-value topics, create comprehensive coverage across modalities:

pillar_topic_multimodal_coverage:
  
  example_topic: "AI Search Optimization for Small Business"
  
  text_content:
    - 5,000-word comprehensive article
    - Subtopic articles linking to pillar
    - Glossary of key terms
  
  visual_content:
    - Hero diagram showing the framework
    - Per-section illustrations
    - Annotated screenshots
    - Comparison tables
  
  video_content:
    - 15-minute overview video
    - Per-section deep-dive videos
    - Live Q&A recording
  
  audio_content:
    - Podcast episode discussing topic
    - Audio version of article (read aloud)
  
  interactive_content:
    - Self-assessment quiz
    - Calculator for implementation effort
    - Decision tree tool
  
  consistency:
    - All modalities aligned on key messages
    - Cross-references between modalities
    - Same examples used consistently
    - Same entities/terms used consistently

5.2 Cross-Modal Internal Linking

<!-- In the article -->
<aside class="related-content">
  <h3>Also Available</h3>
  <ul>
    <li><a href="/videos/ai-search-overview/">📹 Watch the 15-minute video version</a></li>
    <li><a href="/podcast/episode-42/">🎙️ Listen to the podcast discussion</a></li>
    <li><a href="/tools/aeo-self-assessment/">🛠️ Take the self-assessment</a></li>
  </ul>
</aside>
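The visible links can be mirrored in structured data so the relationship is machine-readable too. One conservative way to do that, sketched here, is the `mentions` property on the article, pointing at the `@id` of each related item. The article URL and the video and tool identifiers are hypothetical; the episode `@id` matches the Section 3.2 sample.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "@id": "https://example.com/guides/ai-search-optimization/#article",
  "headline": "AI Search Optimization for Small Business",
  "mentions": [
    {"@type": "VideoObject", "@id": "https://example.com/videos/ai-search-overview/#video"},
    {"@type": "PodcastEpisode", "@id": "https://example.com/podcast/episode-42/#episode"},
    {"@type": "WebPage", "@id": "https://example.com/tools/aeo-self-assessment/"}
  ]
}
</script>
```

`mentions` is deliberately weak: it only says the article references these items. If the video is actually embedded in the article, the more specific `video` property would be a reasonable alternative.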

Cross-references between modalities:

5.3 Avoiding Modality Silos

Common anti-pattern: separate teams creating text, video, and audio content with no coordination.

multimodal_coordination:
  
  editorial_calendar:
    - Plan content topics together
    - Identify which topics warrant multimodal coverage
    - Schedule production across modalities
  
  shared_assets:
    - Diagrams created once, used across blog/video/podcast
    - Quotes appear in multiple formats
    - Examples consistent across versions
  
  unified_messaging:
    - Same key points
    - Same examples
    - Same calls to action
    - Same brand voice

6. Google Lens Optimization

Google Lens identifies entities in images: businesses, products, places, plants, animals, text, and more.

6.1 Visual Entity Recognition

For your business/products/locations to be recognized in Google Lens:

google_lens_optimization:
  
  business_recognition:
    - Distinct, recognizable storefront photos
    - Logo prominently displayed in business photos
    - Multiple angles of business location
    - Schema with geo coordinates
    - GBP photos comprehensive
  
  product_recognition:
    - Multiple high-quality product photos
    - Distinct product photography (not generic stock)
    - Logo/branding visible on products where applicable
    - Product schema comprehensive
    - Consistent product imagery across site
  
  location_recognition:
    - Distinctive landmarks in photos
    - Geographic context visible
    - GeoCoordinates in schema
    - Multiple recognized features

6.2 OCR-Friendly Visual Content

Google Lens reads text in images. To benefit:

6.3 Style Transfer & Visual Similarity

Lens uses visual similarity for "find more like this" features. Implications:


7. Voice Search & Conversational AI

7.1 Voice Query Optimization

Voice queries differ from typed:

voice_query_optimization:
  
  natural_language_content:
    - Write conversationally where appropriate
    - Use full sentences, not just keywords
    - Include question variants explicitly
  
  question_answer_format:
    - Address common questions directly
    - FAQ sections with Question/Answer schema
    - Featured snippet optimization (often source for voice)
  
  local_intent:
    - Local business optimization (see framework-localseo.md)
    - '"Near me" query optimization'
    - Voice-friendly business descriptions
  
  schema_for_voice:
    - Speakable schema (where applicable)
    - FAQPage with concise answers (40-50 words ideal for voice)
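A FAQPage entry with a voice-length answer might look like the sketch below. The question is an illustrative choice, and the answer restates the transcript guidance from Section 3.1 in 47 words, inside the 40-50 word range suggested above.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "Do podcast episodes need transcripts?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes. Audio without a transcript is largely invisible to text-based search and AI systems. Publish the full transcript on a dedicated episode page, add timestamps where possible so listeners can jump to specific moments, and keep shorter show notes on the podcast platforms that carry the episode."
      }
    }
  ]
}
</script>
```

The same question and answer should also appear as visible text on the page; the markup describes content, it does not replace it.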

7.2 Speakable Schema

For news/article content optimized for voice reading:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebPage",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".article-headline", ".article-summary"]
  }
}
</script>

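This identifies the parts of a page best suited to text-to-speech playback. Google documents speakable as a beta feature with limited availability, so treat it as a low-cost signal for voice assistants rather than a guaranteed rich result.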
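The cssSelector values only work if matching elements exist in the page. A minimal sketch of markup that the two selectors above would target follows; the headline is borrowed from the Section 3.2 sample and the summary text is a placeholder.

```html
<article>
  <!-- Matched by ".article-headline" -->
  <h1 class="article-headline">How AI Search is Reshaping Small Business Marketing</h1>

  <!-- Matched by ".article-summary": two or three plain sentences that make sense when read aloud -->
  <p class="article-summary">
    Placeholder summary written in short, complete sentences with no reliance on
    images, tables, or links to be understood.
  </p>

  <!-- Not listed in the speakable selectors, so it is not flagged for voice reading -->
  <div class="article-body">
    <p>Placeholder body content.</p>
  </div>
</article>
```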


8. Audit Mode

#      Criterion                                                  Pass/Fail
MM1    Images convey information beyond decoration
MM2    Comprehensive alt text on informational images
MM3    Captions present where appropriate
MM4    ImageObject schema on important images
MM5    Audio content has transcripts
MM6    Podcast episodes have dedicated landing pages
MM7    PodcastEpisode schema implemented
MM8    Video content has captions
MM9    Video chapters meaningful
MM10   High-value topics covered across multiple modalities
MM11   Cross-modal internal linking present
MM12   Modality coordination in editorial process
MM13   Google Lens-friendly visual content
MM14   Voice query optimization patterns implemented
MM15   Multimodal AI tested for content extraction

Maximum score: 15 (one point per passed criterion). World-class multimodal optimization: 14 or more out of 15.


9. Common Mistakes

  1. Decorative-only images — missed information conveyance opportunity
  2. Stock photo team images — modern AI detects, reduces trust
  3. No transcripts for audio/video — content invisible to text-based AI
  4. Modality silos — text team, video team, audio team don't coordinate
  5. Inconsistent messaging across modalities — confuses AI extraction
  6. Generic alt text — "image of woman typing" instead of specific, informational
  7. No image schema — missing context for AI understanding
  8. Voice queries treated as text queries — different intent patterns
  9. Multimodal tested only for humans, not AI — AI may extract differently
  10. Ignoring podcast SEO — major audio platform missed

10. Stack-Specific Implementation Notes

10.1 WordPress

10.2 Static Site Generators

10.3 Headless / Custom


End of Framework Document

Document version: 1.0 · Last updated: 2026-04-29

Multimodal AI is reshaping content discovery. Sites that produce content across modalities, with proper schema and consistency, are positioned to be cited and surfaced by next-generation AI engines. The principles in this framework — informational density, structured data, cross-modal coordination — apply universally.

Companion documents:
