Evergreen guide
The complete guide to
AI discoverability.
Whether AI systems can access, understand, and cite your site. Four configuration layers, ordered by impact. Every section ships with copy-paste examples.
01
Why AI discoverability is different from SEO.
Search engine optimization and AI discoverability solve different problems. SEO is about ranking in Google search results. AI discoverability is about whether AI systems can access your content at all, understand what your site is about, and choose to cite it in their answers.
The two have almost nothing in common. Different bots, different configuration files, different signals. A site can rank on page one of Google while being entirely inaccessible to ChatGPT, Claude, and Perplexity.
Training bots vs retrieval bots
AI bots fall into two categories with different behaviors and different urgency for site owners.
Training bots crawl the web to build a model's underlying knowledge. Examples include ClaudeBot (Anthropic), GPTBot (OpenAI), CCBot (Common Crawl), and Bytespider (ByteDance). These bots do not produce immediate citations. Their crawls shape what the model knows about your brand and domain over months and model update cycles.
Retrieval bots power live AI search results and AI overviews. Examples include PerplexityBot and Google-Extended (for AI Overviews and Gemini). These bots can surface your content as a cited source the same day they crawl it. Blocking them means your site is absent from AI search results in real time.
Both types require explicit per-bot access configuration. Standard SEO practice (a wildcard User-agent: * rule) is not a reliable way to cover AI bots; each bot must be listed by name.
02
Configure robots.txt for AI bots.
robots.txt is the first thing AI crawlers check. If a bot's User-agent string is not listed explicitly by name with a permissive rule, the bot may treat itself as blocked, depending on how the operator has configured their crawler.
A wildcard rule does not reliably grant AI crawlers permission. A User-agent: * block may be ignored by a bot that looks only for its own named rules, so each bot should have its own named block. This is different from standard SEO, where wildcard rules typically cover all crawlers.
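You can check how a given robots.txt treats a specific bot with Python's standard-library robots.txt parser. This sketch (the file contents, bot names, and URLs are illustrative) shows that a named block takes precedence over the wildcard block for that bot:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt where the wildcard allows most crawling,
# but GPTBot has its own named block that blocks everything.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# GPTBot matches its named block, so the wildcard rules are
# ignored for it: the bot is fully blocked.
print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False

# A bot with no named block falls back to the wildcard rules.
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

The same parser can be pointed at a live site with set_url() and read() to audit your deployed file.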
The fourteen AI bots you need to allow
Copy-paste robots.txt configuration
Add these blocks to your existing robots.txt. Order does not matter. Place them alongside your existing Googlebot or wildcard rules.
# AI crawler access - each bot requires its own named block
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Bytespider
Allow: /
User-agent: CCBot
Allow: /
User-agent: cohere-ai
Allow: /
User-agent: bingbot
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: DuckAssistBot
Allow: /
User-agent: MistralAI-User
Allow: /
If you want to block specific bots from crawling certain paths while allowing the rest, use Disallow: /private/ within that bot's block. An empty Disallow or an explicit Allow: / grants full access.
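For example, a block that lets GPTBot crawl everything except a hypothetical /private/ directory (the path is a placeholder) looks like this:

```
User-agent: GPTBot
Allow: /
Disallow: /private/
```

Under the Robots Exclusion Protocol (RFC 9309), the most specific matching rule wins, so /private/ and everything under it is blocked while the rest of the site stays open.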
03
Create llms.txt.
llms.txt is a Markdown file placed at your domain root that gives AI models a structured summary of your site. It was proposed by Jeremy Howard (Answer.AI) in 2024 and has seen rapid adoption among sites that want to improve AI discoverability without waiting for AI systems to crawl every page.
The file is not a ratified standard, but it is supported in practice by several AI systems and is the closest thing to a universal AI context file currently available.
The format
llms.txt uses standard Markdown. The structure is straightforward:
- H1 (required): The first line must be a Markdown H1 heading with your site name. This is the primary label AI systems use to identify your site.
- Blockquote (optional): A brief description of what your site does, prefixed with > to mark it as a blockquote.
- H2 sections (required): Organize your content into categories. At least one H2 is needed for valid structure.
- Markdown links (required): List your key pages as [Label](https://full-url). Use full absolute URLs.
# Your Company Name
> One or two sentences describing what your company does and who it serves.
> Keep this concise - it is what AI models use to introduce your organization.
## Products
- [Product Name](https://yoursite.com/product): Brief description of what this product does.
- [Another Product](https://yoursite.com/other): What this one does and who it is for.
## Resources
- [Documentation](https://yoursite.com/docs): Technical documentation and API reference.
- [Blog](https://yoursite.com/blog): Articles and updates from the team.
## About
- [About us](https://yoursite.com/about): Company background, team, and mission.
- [Contact](https://yoursite.com/contact): How to get in touch.
llms-full.txt (optional)
You can also create an llms-full.txt
at your domain root containing the full text of your most important pages.
This gives AI systems deeper context without requiring them to crawl each
page individually. It is most useful for documentation sites, knowledge
bases, and content-heavy sites where page depth matters.
Beaconly checks for its existence but does not require it. It is marked as optional in the audit.
Practical tips
- Keep llms.txt under 100 KB. AI systems may truncate very large files.
- Use absolute URLs, not relative paths. AI systems may read the file out of context.
- Include your most linked-to and most important pages, not every URL on the site.
- Update it when you add major new content sections or products. It does not need to be updated constantly.
- Serve it as plain text (Content-Type: text/plain). Most servers handle this automatically for .txt files.
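These tips can be checked mechanically. A minimal validator sketch (the function name and exact checks are ours, mirroring the guidance above):

```python
def check_llms_txt(text, max_bytes=100_000):
    """Return a list of problems found in an llms.txt candidate (empty means it passes)."""
    problems = []
    if len(text.encode("utf-8")) > max_bytes:
        problems.append("file exceeds 100 KB and may be truncated by AI systems")
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("# "):
        problems.append('first line must be an H1 ("# Site Name")')
    if not any(line.startswith("## ") for line in lines):
        problems.append("at least one H2 section is required")
    for line in lines:
        # Markdown links must use absolute URLs, not relative paths.
        if "](" in line and "](http" not in line:
            problems.append("relative link found: " + line.strip())
    return problems

sample = (
    "# Acme\n\n"
    "> Tools for builders.\n\n"
    "## Products\n"
    "- [Widget](https://acme.com/widget): What it does.\n"
)
print(check_llms_txt(sample))  # []
```

Run it against your file before deploying; an empty list means the structure matches the format described above.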
04
Add JSON-LD structured data.
JSON-LD structured data tells AI systems who you are, what your site covers, and whether your content is current. It uses the Schema.org vocabulary and is placed in a <script type="application/ld+json"> tag in your page head. It is not visible to users but is read by crawlers that parse HTML.
The most effective pattern is an @graph block that combines multiple node types in a single script tag. This lets you link nodes to each other using @id references rather than duplicating data across blocks.
Organization node
The Organization node is the foundation. It establishes who operates the site. Every other node can reference it by @id.
- @id: A stable URL identifier, by convention https://yourdomain.com/#organization. Must be unique and permanent.
- name: Your organization's name, exactly as you want it cited.
- url: Your canonical homepage URL.
- sameAs: An array of verified profile URLs (LinkedIn, GitHub, Crunchbase, etc.). AI systems use these to cross-reference your identity.
- knowsAbout: An array of strings describing your organization's areas of expertise or core topics.
FAQPage node
FAQPage schema is one of the highest-value schema types for AI citation. It structures your questions and answers in a format AI systems can read directly without parsing prose content. If your page has a FAQ section, mark it up.
WebPage node with speakable
The WebPage node signals the canonical URL, name, and modification date of a page. The speakable property, using SpeakableSpecification, tells AI systems which sections of your page contain the most important content. Point it at CSS selectors for your hero and key content sections.
Working example: complete @graph
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@graph": [
{
"@type": "Organization",
"@id": "https://yoursite.com/#organization",
"name": "Your Company Name",
"url": "https://yoursite.com/",
"sameAs": [
"https://www.linkedin.com/company/your-company/",
"https://github.com/yourorg"
],
"knowsAbout": ["your topic", "another topic", "core service area"]
},
{
"@type": "WebPage",
"@id": "https://yoursite.com/page#webpage",
"url": "https://yoursite.com/page",
"name": "Page Title | Site Name",
"description": "A clear description of what this page covers.",
"dateModified": "2026-04-13",
"speakable": {
"@type": "SpeakableSpecification",
"cssSelector": [".hero-content", ".main-content"]
}
},
{
"@type": "FAQPage",
"@id": "https://yoursite.com/page#faq",
"mainEntity": [
{
"@type": "Question",
"name": "Your first question?",
"acceptedAnswer": {
"@type": "Answer",
"text": "A complete, accurate answer to the question."
}
}
]
}
]
}
</script>
Update dateModified whenever you make significant changes to a page. AI systems use this date to judge content freshness. Stale or missing dates reduce the likelihood of being cited in time-sensitive queries.
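A quick way to confirm your JSON-LD actually parses is to extract it with Python's standard-library HTML parser and run it through json.loads. A sketch (class and function names are ours):

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects and parses the contents of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self._collecting = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._collecting = True
            self._buf = []

    def handle_data(self, data):
        if self._collecting:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._collecting:
            self._collecting = False
            # Raises ValueError if the block is not valid JSON.
            self.blocks.append(json.loads("".join(self._buf)))

def extract_jsonld(html):
    parser = JSONLDExtractor()
    parser.feed(html)
    return parser.blocks
```

Feed it the raw HTML of any page; a ValueError means a script block contains malformed JSON that crawlers will also fail to read.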
05
Page structure basics.
Beyond robots.txt, llms.txt, and schema, AI crawlers evaluate basic page signals to decide whether a page is a reliable, well-structured source. These are not AI-specific requirements, but they matter for AI discoverability the same way they matter for SEO.
Single H1 heading
Every page should have exactly one <h1> tag. AI systems treat the H1 as the definitive title of the page. Zero H1s leaves the page without a clear title; multiple H1s create ambiguity about what the page is actually about.
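This is easy to verify with a few lines of stdlib Python (a sketch; the helper names are ours):

```python
from html.parser import HTMLParser

class H1Counter(HTMLParser):
    """Counts <h1> opening tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.count += 1

def count_h1(html):
    parser = H1Counter()
    parser.feed(html)
    return parser.count

print(count_h1("<html><body><h1>Title</h1><h2>Section</h2></body></html>"))  # 1
```

Anything other than 1 on a page you care about is worth fixing.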
Meta description (120 to 160 characters)
The meta description is used as the default summary when AI systems cite your page. Keep it between 120 and 160 characters. Write it as a complete sentence that accurately describes the page. Descriptions that are too short lack useful context; too long and they get truncated.
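In HTML, that is a single tag in the page head. The copy below is illustrative (a fictional company, written to land inside the 120-to-160 character window):

```html
<meta name="description" content="Acme builds deployment tooling for small engineering teams, including one-command rollbacks, zero-config preview environments, and audit logs.">
```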
Canonical URL
Add <link rel="canonical" href="https://yoursite.com/page"> to every page. This tells crawlers which URL is the authoritative version. Without it, content accessible at multiple URLs (with and without trailing slashes, with query strings, etc.) may be treated as duplicate pages.
Open Graph tags
Include og:title, og:description, and og:image on every page. These tags are used when AI systems generate link previews and citations in interfaces that display rich cards.
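Using the same placeholder site as the earlier examples (the image path is an assumption), the minimum set looks like this:

```html
<meta property="og:title" content="Page Title | Site Name">
<meta property="og:description" content="A clear description of what this page covers.">
<meta property="og:image" content="https://yoursite.com/og-image.png">
```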
HTTPS
Serve your site over HTTPS. HTTP-only pages are deprioritized by modern crawl infrastructure. Free SSL certificates are available through most hosting providers via Let's Encrypt.
Server-rendered content
Most AI crawlers do not execute JavaScript. If your page content is rendered client-side only (React, Vue, Angular without SSR), those crawlers see an empty or near-empty HTML shell. Key content must be present in the initial server-rendered HTML. Use server-side rendering or static site generation for pages you want AI systems to read.
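One way to spot the problem is to inspect the raw HTML your server returns, before any JavaScript runs, and measure how much visible text it contains. A rough heuristic sketch (the names and the 200-character threshold are our assumptions):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Accumulates visible text, skipping script, style, and noscript contents."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data.strip())

def visible_text_length(html):
    parser = TextCollector()
    parser.feed(html)
    return len(" ".join(chunk for chunk in parser.chunks if chunk))

def looks_like_empty_shell(html, threshold=200):
    """True when the server-rendered HTML carries almost no readable text."""
    return visible_text_length(html) < threshold
```

Fetch the page with curl or urllib (neither executes JavaScript) and feed the body to this check; a near-zero result means crawlers that skip JavaScript see the same empty shell.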
Response time
Aim for sub-2-second response times. Crawlers have strict timeout budgets. Pages that are slow to respond may be skipped entirely, particularly for supplementary resources like robots.txt and sitemaps. CDN caching and static file serving are the most reliable ways to hit this target.
Free audit tool
Check your current configuration.
Beaconly runs 35 checks against everything covered in this guide and shows you exactly what is missing, with specific fixes for each failed check.