Methodology
How Beaconly audits your site.
Beaconly fetches five resources from your domain and runs 41 checks across three tiers. Every check has a defined pass condition based on what AI crawlers actually require, not on proxy signals or best-guess heuristics.
Step 1
Five resources, fetched in parallel.
For every audit, Beaconly makes five outbound requests from a Cloudflare Worker running at the network edge. Each request has an 8-second timeout and follows redirects. All fetches use the User-Agent Beaconly Auditor/1.0 (+https://beaconly.orygn.tech), so requests are identifiable in your server logs. Response bodies are truncated at 512 KB. Private IP addresses and localhost are blocked.
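The fetch step above can be sketched as follows. This is an illustrative Python sketch, not Beaconly's Worker code: the list of five resources is inferred from the checks described below, the private-IP blocking step is omitted, and page_url stands in for the exact URL you submit.

```python
# Illustrative sketch of the audit fetch step (parameters from the text above:
# 8-second timeout, 512 KB body cap, fixed audit User-Agent).
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

USER_AGENT = "Beaconly Auditor/1.0 (+https://beaconly.orygn.tech)"
TIMEOUT_S = 8.0          # each request times out after 8 seconds
MAX_BODY = 512 * 1024    # response bodies are truncated at 512 KB

def fetch(url: str) -> bytes:
    """Fetch one resource with the audit User-Agent; redirects are followed."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=TIMEOUT_S) as resp:
        return resp.read(MAX_BODY)  # truncate at the 512 KB cap

def fetch_all(origin: str, page_url: str) -> dict[str, bytes]:
    """Fetch the five audited resources in parallel (assumed resource list:
    the submitted page plus the four well-known files the checks cover)."""
    urls = {
        "page": page_url,
        "robots": origin + "/robots.txt",
        "llms": origin + "/llms.txt",
        "llms_full": origin + "/llms-full.txt",
        "sitemap": origin + "/sitemap.xml",
    }
    with ThreadPoolExecutor(max_workers=5) as pool:
        futures = {name: pool.submit(fetch, url) for name, url in urls.items()}
        return {name: fut.result() for name, fut in futures.items()}
```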
Tier 1
AI Crawler Access
23 checks across robots.txt (16), llms.txt (5), and sitemap.xml (2). This tier determines whether AI bots can physically reach your content and whether you have given them a structured summary of what your site contains.
robots.txt checks
Wildcard does not count. A User-agent: * block does not grant explicit AI crawler permission. Each bot must appear by name in its own User-agent block, with either an explicit Allow: / or no Disallow: / directive.
- robots.txt found (Required): Passes if /robots.txt returns HTTP 200. AI crawlers check this file before crawling anything. A missing robots.txt means bots have no access rules to follow.
- robots.txt response time (Required): Passes if robots.txt responds in under 2,000 ms without timing out. Crawlers have tight timeout budgets; a slow robots.txt causes bots to abandon the crawl before reading your access rules.
- GPTBot allowed (Required): Passes if robots.txt contains an explicit User-agent: GPTBot block with Allow: / or no Disallow: /. GPTBot is how OpenAI crawls for ChatGPT training and responses.
- ChatGPT-User allowed (Required): Passes if robots.txt contains an explicit User-agent: ChatGPT-User block. This is the crawler used when ChatGPT browses the web in real time, separate from GPTBot.
- ClaudeBot allowed (Required): Passes if robots.txt contains an explicit User-agent: ClaudeBot block. ClaudeBot is Anthropic's crawler for Claude training and knowledge.
- PerplexityBot allowed (Required): Passes if robots.txt contains an explicit User-agent: PerplexityBot block. PerplexityBot is a retrieval crawler that converts at roughly 111 crawls per referral, making it the highest-urgency bot to allow.
- Google-Extended allowed (Required): Passes if robots.txt contains an explicit User-agent: Google-Extended block. This is separate from Googlebot and specifically controls access for Gemini and AI Overviews.
- Applebot-Extended allowed (Required): Passes if robots.txt contains an explicit User-agent: Applebot-Extended block. Controls Apple Intelligence access, separate from standard Applebot.
- Bytespider allowed (Required): Passes if robots.txt contains an explicit User-agent: Bytespider block. ByteDance's crawler, used for AI products including TikTok search and recommendations.
- CCBot allowed (Required): Passes if robots.txt contains an explicit User-agent: CCBot block. CCBot powers Common Crawl, the public dataset used across many AI training pipelines.
- cohere-ai allowed (Required): Passes if robots.txt contains an explicit User-agent: cohere-ai block. Cohere's crawler for enterprise AI applications and retrieval systems.
- bingbot allowed (Required): Passes if robots.txt contains an explicit User-agent: bingbot block. Microsoft's crawler for Bing, which powers Microsoft Copilot and Bing AI features.
- Meta-ExternalAgent allowed (Required): Passes if robots.txt contains an explicit User-agent: Meta-ExternalAgent block. Meta's primary AI training crawler, used for Meta AI across Facebook, Instagram, and WhatsApp.
- Amazonbot allowed (Required): Passes if robots.txt contains an explicit User-agent: Amazonbot block. Amazon's crawler, used for AI products including Alexa and Rufus.
- DuckAssistBot allowed (Required): Passes if robots.txt contains an explicit User-agent: DuckAssistBot block. DuckDuckGo's real-time crawler for AI-assisted answers that cite their sources.
- MistralAI-User allowed (Required): Passes if robots.txt contains an explicit User-agent: MistralAI-User block. Mistral's real-time retrieval crawler for cited responses in Le Chat.
Example: correct robots.txt format for AI bots
# Each bot needs its own named block. Wildcard does not apply.
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Bytespider
Allow: /
User-agent: CCBot
Allow: /
User-agent: cohere-ai
Allow: /
User-agent: bingbot
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: DuckAssistBot
Allow: /
User-agent: MistralAI-User
Allow: /
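The per-bot rule above can be sketched in a few lines of Python. Beaconly's actual parser is not published; this is a minimal illustration of the stated pass condition (a bot passes only if it appears by name, and its block either contains Allow: / or lacks Disallow: /).

```python
def parse_blocks(robots: str) -> dict[str, list[str]]:
    """Group robots.txt directives by the User-agent names that precede them."""
    blocks: dict[str, list[str]] = {}
    agents: list[str] = []
    last_was_agent = False
    for raw in robots.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            if not last_was_agent:
                agents = []  # a new group of User-agent lines starts a new block
            agents.append(value.lower())
            for a in agents:
                blocks.setdefault(a, [])
            last_was_agent = True
        else:
            last_was_agent = False
            for a in agents:
                blocks[a].append(f"{field.lower()}:{value}")
    return blocks

def bot_allowed(robots: str, bot: str) -> bool:
    """Explicit per-bot rule: the bot must appear by name, and its block
    must contain Allow: / or lack Disallow: /. A wildcard block (*) never counts."""
    rules = parse_blocks(robots).get(bot.lower())
    if rules is None:
        return False
    return "allow:/" in rules or "disallow:/" not in rules
```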
llms.txt checks
llms.txt is a Markdown file at your domain root that gives AI models a structured summary of your site. It follows a simple format: an H1 title, H2 sections organizing your content, and Markdown links pointing to important pages.
- llms.txt found (Required): Passes if /llms.txt returns HTTP 200.
- llms.txt has H1 title (Required): Passes if the first line of llms.txt matches # Title (a Markdown H1). This is the required opening of the llms.txt format and the primary label AI systems use to identify your site.
- llms.txt has H2 sections (Required): Passes if llms.txt contains at least one line matching ## Section (a Markdown H2). Sections organize your content into categories AI models use to navigate your site.
- llms.txt has inline links (Required): Passes if llms.txt contains at least one Markdown link in the format [Label](https://...). Links give AI systems specific URLs to reference, not just a description.
- llms-full.txt found (Optional): Passes if /llms-full.txt returns HTTP 200. This file can contain the full text of your most important pages, giving AI models richer context than the summary in llms.txt.
Example: minimal valid llms.txt
# Acme Corp
> B2B software for supply chain teams. We help manufacturers track inventory across facilities.
## Products
- [Acme Tracker](https://acmecorp.com/tracker): Real-time inventory tracking across warehouses.
- [Acme Reports](https://acmecorp.com/reports): Automated reporting and analytics dashboards.
## About
- [About us](https://acmecorp.com/about): Company background, team, and mission.
- [Contact](https://acmecorp.com/contact): Get in touch with our sales team.
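The three llms.txt format checks can be approximated with a few regular expressions. A Python sketch (the patterns are assumptions, not Beaconly's exact ones):

```python
import re

H1_RE = re.compile(r"^# \S")                           # first line: Markdown H1
H2_RE = re.compile(r"^## \S", re.M)                    # at least one H2 section
LINK_RE = re.compile(r"\[[^\]]+\]\(https?://[^)]+\)")  # inline Markdown link

def check_llms_txt(text: str) -> dict[str, bool]:
    """Run the three llms.txt format checks against the file's text."""
    first = text.lstrip("\n").splitlines()[0] if text.strip() else ""
    return {
        "has_h1_title": bool(H1_RE.match(first)),
        "has_h2_sections": bool(H2_RE.search(text)),
        "has_inline_links": bool(LINK_RE.search(text)),
    }
```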
Sitemap checks
- sitemap.xml found (Required): Passes if /sitemap.xml returns HTTP 200. Without a sitemap, AI crawlers must discover pages by following links, which often means missing content entirely.
- Sitemap has lastmod dates (Required): Passes if the sitemap XML contains at least one <lastmod> element. AI crawlers use lastmod to prioritize fresh content; without it, your pages cannot be ranked by recency.
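The lastmod check reduces to a namespace-agnostic scan of the sitemap XML. A minimal sketch (assumed logic):

```python
import xml.etree.ElementTree as ET

def sitemap_has_lastmod(xml_text: str) -> bool:
    """Pass if the sitemap contains at least one <lastmod> element,
    regardless of the XML namespace prefix on the tag."""
    root = ET.fromstring(xml_text)
    return any(el.tag.endswith("lastmod") for el in root.iter())
```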
Tier 2
Schema and Structured Data
8 checks. Beaconly extracts every <script type="application/ld+json"> block from your page HTML and flattens any @graph arrays into a single node list. Checks then run against specific node types and properties. Invalid JSON blocks are silently skipped.
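The extraction-and-flattening step might look like this in Python. This is a sketch: the regex-based tag scan and node handling are assumptions, not Beaconly's implementation.

```python
import json
import re

# Match <script type="application/ld+json"> ... </script> blocks in raw HTML.
SCRIPT_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.S | re.I,
)

def flatten_jsonld(html: str) -> list[dict]:
    """Collect every JSON-LD node into a flat list, expanding @graph arrays.
    Blocks that fail to parse as JSON are silently skipped."""
    nodes: list[dict] = []
    for block in SCRIPT_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # invalid JSON blocks are silently skipped
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            if "@graph" in item:
                nodes.extend(n for n in item["@graph"] if isinstance(n, dict))
            else:
                nodes.append(item)
    return nodes
```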
- JSON-LD found (Required): Passes if the page HTML contains at least one parseable <script type="application/ld+json"> block.
- Schema has @type Organization (Required): Passes if at least one node in the flattened schema graph has @type: "Organization". AI systems use this node to identify who operates the site and associate content with a brand.
- Schema has @id (Required): Passes if the Organization node has a non-empty @id string. The @id creates a globally unique, stable identifier that AI systems use to link your content across pages and datasets. Convention: https://yourdomain.com/#organization.
- Schema has sameAs (Required): Passes if the Organization node has a non-empty sameAs string or array. Link to your LinkedIn, GitHub, or other verified profiles; AI systems use sameAs to verify and enrich information about your organization.
- Schema has dateModified (Required): Passes if any schema node has a non-empty dateModified string. AI systems use this field to judge content freshness. Use ISO 8601 format: 2026-04-12.
- FAQPage schema present (Required): Passes if at least one node has @type: "FAQPage". FAQPage schema marks up questions and answers that AI systems use directly in answer generation; it is one of the highest-value schema types for AI citation.
- Speakable schema present (Required): Passes if any schema node has a non-null speakable property. SpeakableSpecification marks which page sections are best for AI to summarize; without it, AI systems must guess which content to prioritize.
- knowsAbout present (Required): Passes if the Organization node has a non-empty knowsAbout string or array. knowsAbout describes your organization's areas of expertise; AI systems use it to accurately describe what your organization does in cited responses.
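Example: JSON-LD covering most Tier 2 checks. The block below is illustrative, reusing the Acme example from the llms.txt section: the URLs, selectors, and property values are placeholders, and a FAQPage node would be added separately.

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://acmecorp.com/#organization",
      "name": "Acme Corp",
      "url": "https://acmecorp.com",
      "sameAs": [
        "https://www.linkedin.com/company/acmecorp",
        "https://github.com/acmecorp"
      ],
      "knowsAbout": ["supply chain software", "inventory tracking"]
    },
    {
      "@type": "WebPage",
      "dateModified": "2026-04-12",
      "speakable": {
        "@type": "SpeakableSpecification",
        "cssSelector": ["h1", ".summary"]
      }
    }
  ]
}
```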
Tier 3
Page Structure
10 checks. Analyzes the HTML of the exact URL you submitted. Beaconly does not execute JavaScript, so all checks run against the server-rendered HTML returned by the initial request.
- Single H1 heading (Required): Passes if the page HTML contains exactly one <h1> element. AI systems treat the H1 as the definitive page title: zero H1s leaves the page untitled, and multiple H1s create ambiguity.
- Meta description present (Required): Passes if the page contains a non-empty <meta name="description" content="..."> tag. AI systems use the meta description as the default summary when citing your page.
- Meta description length (Required): Passes if the meta description is between 120 and 160 characters. Descriptions in this range are most likely to be used as-is, without being truncated or regenerated by AI systems.
- Canonical URL set (Required): Passes if the page contains a non-empty <link rel="canonical" href="..."> tag. This tells AI crawlers which URL is the authoritative version of the page; without it, duplicate URL variants can dilute AI visibility.
- Open Graph title (Required): Passes if the page contains a non-empty <meta property="og:title"> tag. Used by AI systems when generating link previews and citations.
- Open Graph description (Required): Passes if the page contains a non-empty <meta property="og:description"> tag. Used as a fallback summary in AI-generated previews and social citations.
- Open Graph image (Required): Passes if the page contains a non-empty <meta property="og:image"> tag. AI interfaces and link previewers use this image when surfacing your content visually.
- HTTPS enabled (Required): Passes if the submitted URL begins with https://. AI crawlers and modern infrastructure deprioritize HTTP-only pages; HTTPS is a baseline signal for credibility.
- Page response time (Required): Passes if the page responds in under 2,000 ms without timing out. Slow pages are more likely to be skipped by crawlers operating under strict timeout budgets.
- Content without JavaScript (Required): Passes if script tag content makes up less than 60% of raw HTML bytes. AI crawlers do not execute JavaScript; pages with high script-to-HTML ratios likely deliver content only after client-side rendering, which crawlers cannot see.
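The script-ratio check can be sketched as a byte count over inline <script> contents. This is an assumed measurement; Beaconly's exact accounting may differ.

```python
import re

# Capture the contents of every <script>...</script> pair in raw HTML.
SCRIPT_BODY_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.S | re.I)

def script_ratio(html: str) -> float:
    """Fraction of raw HTML bytes taken up by inline <script> contents."""
    if not html:
        return 0.0
    script_bytes = sum(len(m.encode()) for m in SCRIPT_BODY_RE.findall(html))
    return script_bytes / len(html.encode())

def passes_js_check(html: str, threshold: float = 0.60) -> bool:
    """Pass if script content makes up less than 60% of raw HTML bytes."""
    return script_ratio(html) < threshold
```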
Scoring
How your label is determined.
Each tier gets one of three labels based on the required checks in that tier. Optional checks (currently only llms-full.txt) do not affect the label.
- All required checks in the tier pass: the tier is fully configured for AI discoverability according to the signals AI crawlers check.
- At least one required check passes, but not all: the tier is partially configured. Every failed required check includes a specific fix.
- Zero required checks pass: the tier has no working configuration for AI discoverability. This is common for sites that have never audited AI access separately from SEO.
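The labeling rule reduces to a small function. The label names below ("pass", "partial", "fail") are placeholders, since the text does not name the three labels:

```python
def tier_label(results: list[tuple[bool, bool]]) -> str:
    """Label a tier from its check results, given as (passed, required) pairs.
    Optional checks (required=False) are ignored, as the text specifies."""
    required = [passed for passed, required_flag in results if required_flag]
    if all(required):
        return "pass"     # every required check passed
    if any(required):
        return "partial"  # some, but not all, required checks passed
    return "fail"         # zero required checks passed
```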
Free audit tool
Run the audit on your site.
Beaconly runs all 41 checks in seconds and shows you exactly what is missing, with specific fixes for every failed check.
Audit your site free