Beaconly

Methodology

How Beaconly audits your site.

Beaconly fetches five resources from your domain and runs 41 checks across three tiers. Every check has a defined pass condition based on what AI crawlers actually require, not proxy signals or best-guess heuristics.

5 Resources fetched
41 Checks run
3 Tiers scored

Step 1

Five resources, fetched in parallel.

For every audit, Beaconly makes five outbound requests from a Cloudflare Worker running at the network edge. Each request has an 8-second timeout and follows redirects. All fetches use the User-Agent Beaconly Auditor/1.0 (+https://beaconly.orygn.tech) so requests are identifiable in your server logs. Response bodies are truncated at 512 KB. Private IP addresses and localhost are blocked.

/robots.txt: AI crawler permission rules. Fetched from domain root, not the submitted URL path.
/llms.txt: AI model context file. Structured summary of your site for language models.
/llms-full.txt: Extended AI context. Full-text version of key pages. Optional but checked.
/sitemap.xml: Page index for crawlers. Checked for existence and lastmod date entries.
[your URL]: The exact URL submitted. Used for all page structure and schema checks.
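The fetch rules above can be sketched as a pre-request guard. This is an illustrative Python sketch, not Beaconly's actual Worker code (which runs on Cloudflare's JavaScript runtime); the constant and function names are assumptions.

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Illustrative mirror of the fetch constraints described above.
USER_AGENT = "Beaconly Auditor/1.0 (+https://beaconly.orygn.tech)"
TIMEOUT_SECONDS = 8
MAX_BODY_BYTES = 512 * 1024  # response bodies truncated at 512 KB

def is_blocked(url: str) -> bool:
    """Reject localhost and private IP targets before fetching."""
    host = urlparse(url).hostname or ""
    if not host or host.lower() == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return True  # unresolvable hosts are treated as blocked
    return ip.is_private or ip.is_loopback or ip.is_link_local
```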

Tier 1

AI Crawler Access

23 checks across robots.txt, llms.txt, and sitemap.xml. This tier determines whether AI bots can physically reach your content and whether you have given them a structured summary of what your site contains.

robots.txt checks

Wildcard does not count. A User-agent: * block does not count as explicit AI crawler permission. Each bot must appear by name in its own User-agent block with either an explicit Allow: / or no Disallow: / directive.

  • robots.txt found Required

    Passes if /robots.txt returns HTTP 200. AI crawlers check this file before crawling anything. A missing robots.txt means bots have no access rules to follow.

  • robots.txt response time Required

    Passes if robots.txt responds in under 2,000 ms without timing out. Crawlers have tight timeout budgets. A slow robots.txt causes bots to abandon the crawl before reading your access rules.

  • GPTBot allowed Required

    Passes if robots.txt contains an explicit User-agent: GPTBot block with Allow: / or no Disallow: /. GPTBot is how OpenAI crawls for ChatGPT training and responses.

  • ChatGPT-User allowed Required

    Passes if robots.txt contains an explicit User-agent: ChatGPT-User block. This is the crawler used when ChatGPT browses the web in real time, separate from GPTBot.

  • ClaudeBot allowed Required

    Passes if robots.txt contains an explicit User-agent: ClaudeBot block. ClaudeBot is Anthropic's crawler for Claude training and knowledge.

  • PerplexityBot allowed Required

    Passes if robots.txt contains an explicit User-agent: PerplexityBot block. PerplexityBot is a retrieval crawler that sends real referral traffic, roughly one referral for every 111 crawls, making it one of the highest-priority bots to allow.

  • Google-Extended allowed Required

    Passes if robots.txt contains an explicit User-agent: Google-Extended block. This is separate from Googlebot and controls access to Gemini and AI Overviews specifically.

  • Applebot-Extended allowed Required

    Passes if robots.txt contains an explicit User-agent: Applebot-Extended block. Controls Apple Intelligence access, separate from standard Applebot.

  • Bytespider allowed Required

    Passes if robots.txt contains an explicit User-agent: Bytespider block. ByteDance's crawler, used for AI products including TikTok search and recommendations.

  • CCBot allowed Required

    Passes if robots.txt contains an explicit User-agent: CCBot block. CCBot powers Common Crawl, the public dataset used across many AI training pipelines.

  • cohere-ai allowed Required

    Passes if robots.txt contains an explicit User-agent: cohere-ai block. Cohere's crawler for enterprise AI applications and retrieval systems.

  • bingbot allowed Required

    Passes if robots.txt contains an explicit User-agent: bingbot block. Microsoft's crawler for Bing, which powers Microsoft Copilot and Bing AI features.

  • Meta-ExternalAgent allowed Required

    Passes if robots.txt contains an explicit User-agent: Meta-ExternalAgent block. Meta's primary AI training crawler, used for Meta AI across Facebook, Instagram, and WhatsApp.

  • Amazonbot allowed Required

    Passes if robots.txt contains an explicit User-agent: Amazonbot block. Amazon's crawler used for AI products including Alexa and Rufus.

  • DuckAssistBot allowed Required

    Passes if robots.txt contains an explicit User-agent: DuckAssistBot block. DuckDuckGo's real-time crawler for AI-assisted answers that cite their sources.

  • MistralAI-User allowed Required

    Passes if robots.txt contains an explicit User-agent: MistralAI-User block. Mistral's real-time retrieval crawler for Le Chat cited responses.

Example: correct robots.txt format for AI bots

# Each bot needs its own named block. Wildcard does not apply.

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: bingbot
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: MistralAI-User
Allow: /

llms.txt checks

llms.txt is a Markdown file at your domain root that gives AI models a structured summary of your site. It follows a simple format: an H1 title, H2 sections organizing your content, and Markdown links pointing to important pages.

  • llms.txt found Required

    Passes if /llms.txt returns HTTP 200.

  • llms.txt has H1 title Required

    Passes if the first line of llms.txt matches # Title (a Markdown H1). This is the required opening of the llms.txt format and the primary label AI systems use to identify your site.

  • llms.txt has H2 sections Required

    Passes if llms.txt contains at least one line matching ## Section (a Markdown H2). Sections organize your content into categories AI models use to navigate your site.

  • llms.txt has inline links Required

    Passes if llms.txt contains at least one Markdown link in the format [Label](https://...). Links give AI systems specific URLs to reference, not just a description.

  • llms-full.txt found Optional

    Passes if /llms-full.txt returns HTTP 200. This file can contain the full text of your most important pages, giving AI models richer context than the summary in llms.txt.
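The format checks above reduce to simple pattern tests. A minimal illustrative Python sketch (the function name and exact patterns are assumptions, not Beaconly's code):

```python
import re

# Illustrative pattern tests for the llms.txt format checks.
def llms_txt_checks(text: str) -> dict:
    lines = text.splitlines()
    first = lines[0] if lines else ""
    return {
        "h1_title": bool(re.match(r"^#\s+\S", first)),  # file opens with an H1
        "h2_sections": any(re.match(r"^##\s+\S", l) for l in lines),
        "inline_links": bool(re.search(r"\[[^\]]+\]\(https?://[^)]+\)", text)),
    }
```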

Example: minimal valid llms.txt

# Acme Corp

> B2B software for supply chain teams. We help manufacturers track inventory across facilities.

## Products

- [Acme Tracker](https://acmecorp.com/tracker): Real-time inventory tracking across warehouses.
- [Acme Reports](https://acmecorp.com/reports): Automated reporting and analytics dashboards.

## About

- [About us](https://acmecorp.com/about): Company background, team, and mission.
- [Contact](https://acmecorp.com/contact): Get in touch with our sales team.

Sitemap checks

  • sitemap.xml found Required

    Passes if /sitemap.xml returns HTTP 200. Without a sitemap, AI crawlers must discover pages by following links, which often means missing content entirely.

  • Sitemap has lastmod dates Required

    Passes if the sitemap XML contains at least one <lastmod> element. AI crawlers use lastmod to prioritize fresh content. Without it, your pages cannot be ranked by recency.
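Example: a minimal sitemap that would pass both checks. The URL and date are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-04-12</lastmod>
  </url>
</urlset>
```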

Tier 2

Schema and Structured Data

8 checks. Beaconly extracts all <script type="application/ld+json"> blocks from your page HTML and flattens any @graph arrays into a flat node list. Checks are then run against specific node types and properties. Invalid JSON blocks are silently skipped.
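The extraction step described above might look like this in outline. An illustrative Python sketch (Beaconly's actual implementation is not published; the function name and structure are assumptions):

```python
import json

# Sketch of the JSON-LD extraction step: parse each
# <script type="application/ld+json"> body, skip invalid JSON
# silently, and flatten any @graph arrays into one node list.
def flatten_nodes(script_bodies: list) -> list:
    nodes = []
    for raw in script_bodies:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON blocks are silently skipped
        for item in (data if isinstance(data, list) else [data]):
            if isinstance(item, dict):
                nodes.extend(item.get("@graph", [item]))
    return nodes
```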

  • JSON-LD found Required

    Passes if the page HTML contains at least one parseable <script type="application/ld+json"> block.

  • Schema has @type Organization Required

    Passes if at least one node in the flattened schema graph has @type: "Organization". AI systems use this node to identify who operates the site and associate content with a brand.

  • Schema has @id Required

    Passes if the Organization node has a non-empty @id string. The @id creates a globally unique, stable identifier that AI systems use to link your content across pages and datasets. Convention: https://yourdomain.com/#organization.

  • Schema has sameAs Required

    Passes if the Organization node has a non-empty sameAs string or array. Links to your LinkedIn, GitHub, or other verified profiles. AI systems use sameAs to verify and enrich information about your organization.

  • Schema has dateModified Required

    Passes if any schema node has a non-empty dateModified string. AI systems use this field to judge content freshness. Use ISO 8601 format: 2026-04-12.

  • FAQPage schema present Required

    Passes if at least one node has @type: "FAQPage". FAQPage schema marks up questions and answers that AI systems use directly in answer generation. It is one of the highest-value schema types for AI citation.

  • Speakable schema present Required

    Passes if any schema node has a non-null speakable property. SpeakableSpecification marks which page sections are best for AI to summarize. Without it, AI systems must guess which content to prioritize.

  • knowsAbout present Required

    Passes if the Organization node has a non-empty knowsAbout string or array. Describes your organization's areas of expertise. AI systems use it to accurately describe what your organization does in cited responses.
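Example: a JSON-LD block that would satisfy all eight checks in this tier. Every value, URL, and selector below is a placeholder, not a recommendation for your actual content.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://example.com/#organization",
      "name": "Example Co",
      "sameAs": ["https://www.linkedin.com/company/example"],
      "knowsAbout": ["supply chain software", "inventory tracking"]
    },
    {
      "@type": "WebPage",
      "dateModified": "2026-04-12",
      "speakable": {
        "@type": "SpeakableSpecification",
        "cssSelector": ["h1", ".summary"]
      }
    },
    {
      "@type": "FAQPage",
      "mainEntity": [
        {
          "@type": "Question",
          "name": "What does Example Co do?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "Real-time inventory tracking for manufacturers."
          }
        }
      ]
    }
  ]
}
</script>
```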

Tier 3

Page Structure

10 checks. Analyzes the HTML of the exact URL you submitted. Beaconly does not execute JavaScript, so all checks run against the server-rendered HTML returned by the initial request.

  • Single H1 heading Required

    Passes if the page HTML contains exactly one <h1> element. AI systems treat the H1 as the definitive page title. Zero H1s leaves the page untitled; multiple H1s create ambiguity.

  • Meta description present Required

    Passes if the page contains a non-empty <meta name="description" content="..."> tag. AI systems use the meta description as the default summary when citing your page.

  • Meta description length Required

    Passes if the meta description is between 120 and 160 characters. Descriptions in this range are most likely to be used as-is without being truncated or regenerated by AI systems.

  • Canonical URL set Required

    Passes if the page contains a non-empty <link rel="canonical" href="..."> tag. Tells AI crawlers which URL is the authoritative version of this page. Without it, duplicate URL variants can dilute AI visibility.

  • Open Graph title Required

    Passes if the page contains a non-empty <meta property="og:title"> tag. Used by AI systems when generating link previews and citations.

  • Open Graph description Required

    Passes if the page contains a non-empty <meta property="og:description"> tag. Used as a fallback summary in AI-generated previews and social citations.

  • Open Graph image Required

    Passes if the page contains a non-empty <meta property="og:image"> tag. AI interfaces and link previewers use this when surfacing your content visually.

  • HTTPS enabled Required

    Passes if the submitted URL begins with https://. AI crawlers and modern infrastructure deprioritize HTTP-only pages. HTTPS is a baseline signal for credibility.

  • Page response time Required

    Passes if the page responds in under 2,000 ms without timing out. Slow pages are more likely to be skipped by crawlers operating under strict timeout budgets.

  • Content without JavaScript Required

    Passes if script tag content makes up less than 60% of raw HTML bytes. AI crawlers do not execute JavaScript. Pages with high script-to-HTML ratios likely deliver content only after client-side rendering, which crawlers cannot see.
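The last check can be sketched as a byte-ratio test. An illustrative regex-based Python version (Beaconly's actual parsing may differ):

```python
import re

# Illustrative script-to-HTML ratio check: the page fails if inline
# <script> content makes up 60% or more of raw HTML bytes.
def passes_script_ratio(html: str, threshold: float = 0.60) -> bool:
    scripts = re.findall(r"(?is)<script\b[^>]*>(.*?)</script>", html)
    script_bytes = sum(len(s.encode("utf-8")) for s in scripts)
    total_bytes = len(html.encode("utf-8"))
    return total_bytes > 0 and script_bytes / total_bytes < threshold
```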

Scoring

How your label is determined.

Each tier gets one of three labels based on the required checks in that tier. Optional checks (only llms-full.txt currently) do not affect the label.

Configured

All required checks in the tier pass. This means the tier is fully configured for AI discoverability according to the signals AI crawlers check.

Partial

At least one required check passes but not all. The tier is partially configured. Every failed required check includes a specific fix.

Not Configured

Zero required checks pass. The tier has no working configuration for AI discoverability. Common for sites that have never audited AI access separately from SEO.
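The labeling rule above comes down to a few lines of logic. An illustrative sketch (the function name is an assumption):

```python
# Illustrative version of the tier labeling rule: only required
# checks count; optional checks never affect the label.
def tier_label(required_results: list) -> str:
    passed = sum(1 for r in required_results if r)
    if passed == len(required_results):
        return "Configured"
    if passed > 0:
        return "Partial"
    return "Not Configured"
```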

Free audit tool

Run the audit on your site.

Beaconly runs all 41 checks in seconds and shows you exactly what is missing, with specific fixes for every failed check.

Audit your site free