An AI engine cites a page when five things are true: it can retrieve the page, extract a self-contained answer from it, trust the claims, match the page to the buyer's actual question, and ideally corroborate it elsewhere. Most pages that fail to get cited fail at one specific stage — and because the engines build answers largely from companies' own pages, the failing stage is usually something the company can fix. Identifying which stage is failing is the whole game.
When a buyer asks a question, the engine doesn't pick a winner from a ranked list. It runs a sequence: it gathers candidate sources it can reach, reads them for an answer it can lift, judges which it trusts, assembles a response, and names some sources. A page can be excellent and still drop out at any one of those steps. Thinking in stages is useful because the fix depends entirely on where a page falls out. (For the technical detail of how retrieval-augmented generation works under the hood, see the RAG pipeline explained; this page is about what determines selection.)
Nothing downstream matters if the engine can't fetch the page. Retrievability means three things: the AI user agents (GPTBot, ClaudeBot, PerplexityBot, and the Google-Extended robots.txt token, which is a control token rather than a separate crawler) are allowed; the content is in server-rendered or static HTML rather than JavaScript that the parser may never execute; and, critically, the page is present in the index each engine reads. ChatGPT's web search leans on the Bing index, so a page strong in Google but thin in Bing can be unreachable to ChatGPT specifically. Retrievability is binary and comes first.
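The robots.txt part of that check can be sketched with Python's standard library. This is a minimal illustration, not a full retrievability audit: the sample robots.txt and the example.com URL are made up, and the check says nothing about JavaScript rendering or index presence.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally instead of fetched.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# User-agent tokens named in the text above.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def blocked_agents(robots_txt: str, url: str, agents=AI_AGENTS) -> list:
    """Return the agents that the given robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Illustrative URL only.
    print(blocked_agents(SAMPLE_ROBOTS_TXT, "https://example.com/pricing"))
```

An empty list means none of the listed agents is blocked by robots.txt; it does not mean the page is in the index a given engine reads.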
Engines favor content they can quote as a self-contained unit. A section that opens with a direct answer in its first two sentences, under a clear heading, is far easier to cite than the same fact buried mid-paragraph. Structure measurably helps: 68.7% of pages cited by AI engines use logical H1→H2→H3 hierarchy (ConvertMate, 2026), and controlled testing found direct-answer formatting and added statistics raised citation rates substantially (Princeton GEO, KDD 2024). Extractability is largely a formatting and clarity property — among the cheapest factors to fix.
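One cheap way to test the heading hierarchy is to scan a page for skipped levels. The sketch below assumes that "logical hierarchy" means exactly one H1 and no jump of more than one level downward (for example H1 straight to H3); that definition is an assumption made for illustration, not the one used in the cited studies.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects heading levels (1-6) in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # Matches h1..h6 only; tags such as "hr" or "head" are ignored.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_problems(html: str) -> list:
    """Return human-readable problems with the heading outline."""
    collector = HeadingCollector()
    collector.feed(html)
    levels = collector.levels
    problems = []
    if levels.count(1) != 1:
        problems.append(f"expected exactly one h1, found {levels.count(1)}")
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            problems.append(f"h{prev} followed by h{cur} skips a level")
    return problems


if __name__ == "__main__":
    print(heading_problems("<h1>Guide</h1><h3>Pricing</h3>"))
```

The check covers only the outline. Whether each section opens with a direct answer still needs a human read.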
Engines weight sources they can stand behind. Specific, sourced, verifiable claims beat vague superlatives: "certified to ASME BPE with 21 CFR Part 11 records" is citable; "industry-leading quality" is not, because nothing in it can be checked. Named authors, dates, and references all raise trust, which is why precise, evidenced pages get cited over polished but unsubstantiated ones.
Query-language match is the strongest single lever. The match between a page's vocabulary and the buyer's actual search language is the citation signal that survives statistical controls for domain authority (Discovered Labs, 2026; β=+0.37). A buyer's prompt is also decomposed into sub-questions the engine answers piece by piece, so a page that covers the adjacent questions (cost, requirements, comparisons) is eligible for more of them. Speaking the buyer's language, not internal product language, is what lets a smaller site get cited over a larger one.
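A rough way to see the vocabulary gap is to measure how many of the buyer's query terms appear on a page at all. The sketch below is a simple token-overlap score with a small hand-picked stopword list. It is an illustrative stand-in, not the metric used in the Discovered Labs analysis, and the query and page snippets are invented.

```python
import re

# Small, hand-picked stopword list; an assumption for this sketch.
STOPWORDS = {"a", "an", "the", "for", "of", "to", "in", "is",
             "what", "how", "do", "i", "and", "or"}


def tokens(text: str) -> set:
    """Lowercased word tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}


def query_coverage(query: str, page_text: str) -> float:
    """Fraction of the query's terms that also appear in the page text."""
    query_terms = tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokens(page_text)) / len(query_terms)


if __name__ == "__main__":
    query = "how much does a sanitary pump cost"
    buyer_language = "Sanitary pump cost: how much does installation add?"
    internal_language = "Our hygienic fluid transfer solutions deliver value"
    print(query_coverage(query, buyer_language))
    print(query_coverage(query, internal_language))
```

Exact-token overlap misses synonyms and phrasing, so treat a low score as a prompt to reread the page against real buyer questions rather than as a verdict.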
When an engine can cross-check a claim or an entity across multiple independent sources, it cites with more confidence. Being referenced consistently across the web — under one coherent company name and description — strengthens this. Fragmented identity (the same firm appearing under two names) splits the signal and weakens it. Corroboration builds slowly, but consistency of identity is something a company can stop undermining immediately.
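Fragmented identity is easy to spot once the name variants are listed side by side. The sketch below normalizes company names (case, punctuation, common legal suffixes) and counts the distinct identities that remain. The suffix list and the sample mentions are assumptions for illustration; real mentions would come from wherever the company is listed or referenced.

```python
import re
from collections import Counter

# Common legal suffixes to ignore when comparing names (assumed list).
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "gmbh"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop legal suffixes."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def identity_variants(mentions: list) -> Counter:
    """Count mentions per normalized company name."""
    return Counter(normalize(m) for m in mentions)


if __name__ == "__main__":
    # Hypothetical mentions of one firm across the web.
    mentions = ["Acme Valves, Inc.", "ACME Valves", "Acme Flow Systems"]
    for name, count in identity_variants(mentions).items():
        print(name, count)
```

More than one key in the result suggests the firm is appearing under more than one identity, which is the split signal described above.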
Because each cited page passes all five and each absent page fails at least one, "why aren't we cited?" has a specific, locatable answer. The SIGNALS framework scores pages across the dimensions behind these factors — Structure, Grounding, Alignment, Substantiation and the rest — to isolate the failing stage rather than guessing. The fix for a retrievability failure (an index gap) is nothing like the fix for an alignment failure (wrong vocabulary), so naming the stage is what makes the work efficient.
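The stage-by-stage logic can be written down as a short diagnostic: walk the stages in order and report the first one a page fails. This is a sketch under stated assumptions. The pass/fail inputs are hypothetical and would come from whatever checks are actually run, and it does not reproduce how the SIGNALS framework scores pages.

```python
from typing import Dict, Optional

# Stage order follows the sequence described in this article.
STAGES = [
    "retrievability",
    "extractability",
    "trust",
    "query alignment",
    "corroboration",
]


def first_failing_stage(checks: Dict[str, bool]) -> Optional[str]:
    """Return the earliest stage that fails, or None if all stages pass.

    A stage missing from checks is treated as failing, so an
    unmeasured stage is never silently assumed to pass.
    """
    for stage in STAGES:
        if not checks.get(stage, False):
            return stage
    return None


if __name__ == "__main__":
    # Hypothetical results for one page.
    page_checks = {
        "retrievability": True,
        "extractability": True,
        "trust": True,
        "query alignment": False,
        "corroboration": True,
    }
    print(first_failing_stage(page_checks))
```

Reporting the earliest failure first mirrors the point that retrievability is the prerequisite: there is little value in tuning vocabulary on a page the engine cannot reach.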
AI engines retrieve reachable pages, extract self-contained answers, judge which sources they trust, match pages to the buyer's actual question, and prefer sources they can corroborate elsewhere. A page must clear all of these; most uncited pages fail at one specific stage.
When a competitor is cited and you are not, it is usually because their page clears a stage yours fails: most often query-language match (their page uses the buyer's words) or retrievability (they're in the index the engine reads and you aren't). It is rarely about which company is better; it's about which page is more legible to the engine.
The most important factor among reachable pages is vocabulary alignment, that is, matching the buyer's search language. It is the signal that holds up even after controlling for domain authority, which is why a smaller site can be cited over a larger one. Retrievability is the prerequisite; alignment is the strongest lever once a page can be reached.
Domain authority matters less for AI citation than in traditional SEO. Page-level vocabulary alignment outweighs domain authority as a citation predictor, so high authority does not protect a page whose language doesn't match the query, and low authority doesn't disqualify a well-aligned one.
You can improve your chances of being cited by making the right page the most retrievable, extractable, and query-aligned for a given question. The factors are largely on your own site, which is why citation is improvable rather than fixed.
A free PULSE assessment measures where you're cited and where you're invisible across the four engines, which points to the failing stage so the fix is targeted rather than guessed.
A free, PULSE-powered assessment maps where you're cited and where you're invisible across ChatGPT, Google AI Overview, Perplexity, and Claude — query by query.