Evergreen guide
The complete guide to
AI discoverability.
Whether AI systems can access, understand, and cite your site. Four configuration layers, ordered by impact. Every section ships with copy-paste examples.
01
Why AI discoverability is different from SEO.
Search engine optimization and AI discoverability solve different problems. SEO is about ranking in Google search results. AI discoverability is about whether AI systems can access your content at all, understand what your site is about, and choose to cite it in their answers.
The two have almost nothing in common. Different bots, different configuration files, different signals. A site can rank on page one of Google while being entirely inaccessible to ChatGPT, Claude, and Perplexity.
Training bots vs retrieval bots
AI bots fall into two categories with different behaviors and different urgency for site owners.
Training bots crawl the web to build a model's underlying knowledge. Examples include ClaudeBot (Anthropic), GPTBot (OpenAI), CCBot (Common Crawl), and Bytespider (ByteDance). These bots do not produce immediate citations. Their crawls shape what the model knows about your brand and domain over months and model update cycles.
Retrieval bots power live AI search results and AI overviews. Examples include PerplexityBot and Google-Extended (for AI Overviews and Gemini). These bots can surface your content as a cited source the same day they crawl it. Blocking them means your site is absent from AI search results in real time.
Both types require explicit per-bot access configuration. Standard SEO practice (a wildcard User-agent: * rule) is not a reliable way to cover AI bots; each bot must be listed by name.
02
Configure robots.txt for AI bots.
robots.txt is the first thing AI crawlers check. If a bot's User-agent string is not listed explicitly by name with a permissive rule, the bot may treat itself as blocked, depending on how the operator has configured their crawler.
A wildcard rule does not reliably grant AI crawlers permission. A User-agent: * block may be ignored by a bot that looks only for its own named rules, so each bot should have its own named block. This is different from standard SEO, where wildcard rules typically cover all crawlers.
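You can check how a given robots.txt treats a specific bot with Python's standard-library robots.txt parser. This sketch (the file contents, bot names, and URLs are illustrative) shows that a named block takes precedence over the wildcard block for that bot:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt where the wildcard allows most crawling,
# but GPTBot has its own named block that blocks everything.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot matches its named block, so the wildcard rules are
# ignored for it: the bot is fully blocked.
print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False

# A bot with no named block falls back to the wildcard rules.
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

The same parser can be pointed at a live site with set_url() and read() to audit your deployed file.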
The fourteen AI bots you need to allow
Copy-paste robots.txt configuration
Add these blocks to your existing robots.txt. Order does not matter. Place them alongside your existing Googlebot or wildcard rules.
# AI crawler access - each bot requires its own named block
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Bytespider
Allow: /
User-agent: CCBot
Allow: /
User-agent: cohere-ai
Allow: /
User-agent: bingbot
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: DuckAssistBot
Allow: /
User-agent: MistralAI-User
Allow: /
If you want to block specific bots from crawling certain paths while allowing the rest, use Disallow: /private/ within that bot's block. An empty Disallow or an explicit Allow: / grants full access.
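For example, a block that lets GPTBot crawl everything except a hypothetical /private/ directory (the path is a placeholder) looks like this:

```
User-agent: GPTBot
Allow: /
Disallow: /private/
```

Under the Robots Exclusion Protocol (RFC 9309), the most specific matching rule wins, so /private/ and everything under it is blocked while the rest of the site stays open.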
03
Create llms.txt.
llms.txt is a Markdown file placed at your domain root that gives AI models a structured summary of your site. It was proposed by Jeremy Howard (Answer.AI) in 2024 and has seen rapid adoption among sites that want to improve AI discoverability without waiting for AI systems to crawl every page.
The file is not a ratified standard, but it is supported in practice by several AI systems and is the closest thing to a universal AI context file currently available.
The format
llms.txt uses standard Markdown. The structure is straightforward:
- H1 (required): The first line must be a Markdown H1 heading with your site name. This is the primary label AI systems use to identify your site.
- Blockquote (optional): A brief description of what your site does, prefixed with > to mark it as a blockquote.
- H2 sections (required): Organize your content into categories. At least one H2 is needed for valid structure.
- Markdown links (required): List your key pages as [Label](https://full-url). Use full absolute URLs.
# Your Company Name
> One or two sentences describing what your company does and who it serves.
> Keep this concise - it is what AI models use to introduce your organization.
## Products
- [Product Name](https://yoursite.com/product): Brief description of what this product does.
- [Another Product](https://yoursite.com/other): What this one does and who it is for.
## Resources
- [Documentation](https://yoursite.com/docs): Technical documentation and API reference.
- [Blog](https://yoursite.com/blog): Articles and updates from the team.
## About
- [About us](https://yoursite.com/about): Company background, team, and mission.
- [Contact](https://yoursite.com/contact): How to get in touch.
llms-full.txt (optional)
You can also create an llms-full.txt
at your domain root containing the full text of your most important pages.
This gives AI systems deeper context without requiring them to crawl each
page individually. It is most useful for documentation sites, knowledge
bases, and content-heavy sites where page depth matters.
Beaconly checks for its existence but does not require it. It is marked as optional in the audit.
Practical tips
- Keep llms.txt under 100 KB. AI systems may truncate very large files.
- Use absolute URLs, not relative paths. AI systems may read the file out of context.
- Include your most linked-to and most important pages, not every URL on the site.
- Update it when you add major new content sections or products. It does not need to be updated constantly.
- Serve it as plain text (Content-Type: text/plain). Most servers handle this automatically for .txt files.
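These tips can be checked mechanically. A minimal validator sketch (the function name and exact checks are ours, mirroring the guidance above):

```python
def check_llms_txt(text, max_bytes=100_000):
    """Return a list of problems found in an llms.txt candidate (empty means it passes)."""
    problems = []
    if len(text.encode("utf-8")) > max_bytes:
        problems.append("file exceeds 100 KB and may be truncated by AI systems")
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("# "):
        problems.append('first line must be an H1 ("# Site Name")')
    if not any(line.startswith("## ") for line in lines):
        problems.append("at least one H2 section is required")
    for line in lines:
        # Markdown links must use absolute URLs, not relative paths.
        if "](" in line and "](http" not in line:
            problems.append("relative link found: " + line.strip())
    return problems

sample = (
    "# Acme\n\n"
    "> Tools for builders.\n\n"
    "## Products\n"
    "- [Widget](https://acme.com/widget): What it does.\n"
)
print(check_llms_txt(sample))  # []
```

Run it against your file before deploying; an empty list means the structure matches the format described above.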
04
Add JSON-LD structured data.
JSON-LD structured data tells AI systems who you are, what your site covers, and whether your content is current. It uses the Schema.org vocabulary and is placed in a <script type="application/ld+json"> tag in your page head. It is not visible to users but is read by crawlers that parse HTML.
The most effective pattern is an @graph block that combines multiple node types in a single script tag. This lets you link nodes to each other using @id references rather than duplicating data across blocks.
Organization node
The Organization node is the foundation. It establishes who operates the site. Every other node can reference it by @id.
- @id: A stable URL identifier, by convention https://yourdomain.com/#organization. Must be unique and permanent.
- name: Your organization's name, exactly as you want it cited.
- url: Your canonical homepage URL.
- sameAs: An array of verified profile URLs (LinkedIn, GitHub, Crunchbase, etc.). AI systems use these to cross-reference your identity.
- knowsAbout: An array of strings describing your organization's areas of expertise or core topics.
FAQPage node
FAQPage schema is one of the highest-value schema types for AI citation. It structures your questions and answers in a format AI systems can read directly without parsing prose content. If your page has a FAQ section, mark it up.
WebPage node with speakable
The WebPage node signals the canonical URL, name, and modification date of a page. The speakable property, using SpeakableSpecification, tells AI systems which sections of your page contain the most important content. Point it at CSS selectors for your hero and key content sections.
Working example: complete @graph
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@graph": [
{
"@type": "Organization",
"@id": "https://yoursite.com/#organization",
"name": "Your Company Name",
"url": "https://yoursite.com/",
"sameAs": [
"https://www.linkedin.com/company/your-company/",
"https://github.com/yourorg"
],
"knowsAbout": ["your topic", "another topic", "core service area"]
},
{
"@type": "WebPage",
"@id": "https://yoursite.com/page#webpage",
"url": "https://yoursite.com/page",
"name": "Page Title | Site Name",
"description": "A clear description of what this page covers.",
"dateModified": "2026-04-13",
"speakable": {
"@type": "SpeakableSpecification",
"cssSelector": [".hero-content", ".main-content"]
}
},
{
"@type": "FAQPage",
"@id": "https://yoursite.com/page#faq",
"mainEntity": [
{
"@type": "Question",
"name": "Your first question?",
"acceptedAnswer": {
"@type": "Answer",
"text": "A complete, accurate answer to the question."
}
}
]
}
]
}
</script>
Update dateModified whenever you make significant changes to a page. AI systems use this date to judge content freshness. Stale or missing dates reduce the likelihood of being cited in time-sensitive queries.
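A quick way to confirm your JSON-LD actually parses is to extract it with Python's standard-library HTML parser and run it through json.loads. A sketch (class and function names are ours):

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects and parses the contents of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self._collecting = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._collecting = True
            self._buf = []

    def handle_data(self, data):
        if self._collecting:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._collecting:
            self._collecting = False
            # Raises ValueError if the block is not valid JSON.
            self.blocks.append(json.loads("".join(self._buf)))

def extract_jsonld(html):
    parser = JSONLDExtractor()
    parser.feed(html)
    return parser.blocks
```

Feed it the raw HTML of any page; a ValueError means a script block contains malformed JSON that crawlers will also fail to read.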
05
Page structure basics.
Beyond robots.txt, llms.txt, and schema, AI crawlers evaluate basic page signals to decide whether a page is a reliable, well-structured source. These are not AI-specific requirements, but they matter for AI discoverability the same way they matter for SEO.
Single H1 heading
Every page should have exactly one <h1> tag. AI systems treat the H1 as the definitive title of the page. Zero H1s leaves the page without a clear title; multiple H1s create ambiguity about what the page is actually about.
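This is easy to verify with a few lines of stdlib Python (a sketch; the helper names are ours):

```python
from html.parser import HTMLParser

class H1Counter(HTMLParser):
    """Counts <h1> opening tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.count += 1

def count_h1(html):
    parser = H1Counter()
    parser.feed(html)
    return parser.count

print(count_h1("<html><body><h1>Title</h1><h2>Section</h2></body></html>"))  # 1
```

Anything other than 1 on a page you care about is worth fixing.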
Meta description (120 to 160 characters)
The meta description is used as the default summary when AI systems cite your page. Keep it between 120 and 160 characters. Write it as a complete sentence that accurately describes the page. Descriptions that are too short lack useful context; too long and they get truncated.
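In HTML, that is a single tag in the page head. The copy below is illustrative (a fictional company, written to land inside the 120-to-160 character window):

```html
<meta name="description" content="Acme builds deployment tooling for small engineering teams, including one-command rollbacks, zero-config preview environments, and audit logs.">
```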
Canonical URL
Add <link rel="canonical" href="https://yoursite.com/page"> to every page. This tells crawlers which URL is the authoritative version. Without it, content accessible at multiple URLs (with and without trailing slashes, with query strings, etc.) may be treated as duplicate pages.
Open Graph tags
Include og:title, og:description, and og:image on every page. These tags are used when AI systems generate link previews and citations in interfaces that display rich cards.
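Using the same placeholder site as the earlier examples (the image path is an assumption), the minimum set looks like this:

```html
<meta property="og:title" content="Page Title | Site Name">
<meta property="og:description" content="A clear description of what this page covers.">
<meta property="og:image" content="https://yoursite.com/og-image.png">
```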
HTTPS
Serve your site over HTTPS. HTTP-only pages are deprioritized by modern crawl infrastructure. Free SSL certificates are available through most hosting providers via Let's Encrypt.
Server-rendered content
Most AI crawlers do not execute JavaScript. If your page content is rendered client-side only (React, Vue, Angular without SSR), those crawlers see an empty or near-empty HTML shell. Key content must be present in the initial server-rendered HTML. Use server-side rendering or static site generation for pages you want AI systems to read.
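One way to spot the problem is to inspect the raw HTML your server returns, before any JavaScript runs, and measure how much visible text it contains. A rough heuristic sketch (the names and the 200-character threshold are our assumptions):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Accumulates visible text, skipping script, style, and noscript contents."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data.strip())

def visible_text_length(html):
    parser = TextCollector()
    parser.feed(html)
    return len(" ".join(chunk for chunk in parser.chunks if chunk))

def looks_like_empty_shell(html, threshold=200):
    """True when the server-rendered HTML carries almost no readable text."""
    return visible_text_length(html) < threshold
```

Fetch the page with curl or urllib (neither executes JavaScript) and feed the body to this check; a near-zero result means crawlers that skip JavaScript see the same empty shell.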
Response time
Aim for sub-2-second response times. Crawlers have strict timeout budgets. Pages that are slow to respond may be skipped entirely, particularly for supplementary resources like robots.txt and sitemaps. CDN caching and static file serving are the most reliable ways to hit this target.
Free audit tool
Check your current configuration.
Beaconly runs 35 checks against everything covered in this guide and shows you exactly what is missing, with specific fixes for each failed check.