
How Do AI Engines Choose Which Sources to Cite?

Victor Xu

Founder, SIGNALS · AI Visibility Intelligence

Updated June 2026

TL;DR

An AI engine cites a page when five things are true: it can retrieve the page, extract a self-contained answer from it, trust the claims, match the page to the buyer's actual question, and ideally corroborate it elsewhere. Most pages that fail to get cited fail at one specific stage — and because the engines build answers largely from companies' own pages, the failing stage is usually something the company can fix. Identifying which stage is failing is the whole game.

The citation pipeline, in plain terms

When a buyer asks a question, the engine doesn't pick a winner from a ranked list. It runs a sequence: it gathers candidate sources it can reach, reads them for an answer it can lift, judges which it trusts, assembles a response, and names some sources. A page can be excellent and still drop out at any one of those steps. Thinking in stages is useful because the fix depends entirely on where a page falls out. (For the technical detail of how retrieval-augmented generation works under the hood, see the RAG pipeline explained; this page is about what determines selection.)
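The stage-by-stage view can be sketched as a short routine that walks the five factors in order and reports the first one a page fails. This is a minimal illustration, not any engine's actual logic: the stage names come from this guide, while the `checks` mapping and the function name are hypothetical, and each pass/fail value would have to come from your own audit.

```python
from typing import Mapping, Optional

# Stage names follow the five factors in this guide, in pipeline order.
STAGES = [
    "retrievability",
    "extractability",
    "trust",
    "query_language_match",
    "corroboration",
]

def first_failing_stage(checks: Mapping[str, bool]) -> Optional[str]:
    """Return the earliest stage the page fails, or None if it clears all.

    `checks` is a hypothetical audit result: stage name -> passed?
    A stage missing from the mapping is treated as not yet passed.
    """
    for stage in STAGES:
        if not checks.get(stage, False):
            return stage
    return None

# Hypothetical audit: the page is reachable, quotable, and sourced,
# but written in internal vocabulary.
audit = {
    "retrievability": True,
    "extractability": True,
    "trust": True,
    "query_language_match": False,
    "corroboration": False,
}
failing = first_failing_stage(audit)  # "query_language_match"
```

Reporting only the earliest failure mirrors the point above: later stages are not worth fixing until the page survives the earlier ones.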

Factor 1 — Retrievability: can the engine reach the page at all?

Nothing downstream matters if the engine can't fetch the page. Three conditions must hold: robots.txt allows the AI crawlers (GPTBot, ClaudeBot, PerplexityBot) and does not block Google-Extended, which is a robots.txt control token rather than a separate crawler; the content is in server-rendered or static HTML rather than JavaScript that the parser may never execute; and, critically, the page is present in the index each engine reads. ChatGPT's web search leans on the Bing index, so a page strong in Google but thin in Bing can be unreachable to ChatGPT specifically. Retrievability is binary and comes first.
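The crawler-permission part of this check can be approximated with Python's standard `urllib.robotparser`. The sketch below parses a robots.txt body and reports whether each user-agent token named above may fetch a URL. The sample robots.txt and the example.com URL are made up for illustration, and the check says nothing about rendering or index presence, which need separate tests.

```python
from typing import Dict
from urllib.robotparser import RobotFileParser

# Tokens named in this guide. Google-Extended is a robots.txt control
# token, but it is matched in robots.txt the same way as a crawler name.
AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def ai_crawler_access(robots_txt: str, url: str) -> Dict[str, bool]:
    """Map each AI user-agent token to whether robots.txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_USER_AGENTS}

# Hypothetical robots.txt that blocks GPTBot and allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

access = ai_crawler_access(SAMPLE_ROBOTS, "https://example.com/pricing")
blocked = [agent for agent, allowed in access.items() if not allowed]
# blocked == ["GPTBot"]
```

To audit a live site, `RobotFileParser.set_url()` followed by `read()` fetches the real file instead of parsing a string.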

Factor 2 — Extractability: can the engine lift a clean answer?

Engines favor content they can quote as a self-contained unit. A section that opens with a direct answer in its first two sentences, under a clear heading, is far easier to cite than the same fact buried mid-paragraph. Structure measurably helps: 68.7% of pages cited by AI engines use logical H1→H2→H3 hierarchy (ConvertMate, 2026), and controlled testing found direct-answer formatting and added statistics raised citation rates substantially (Princeton GEO, KDD 2024). Extractability is largely a formatting and clarity property — among the cheapest factors to fix.
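One way to spot-check heading structure is to collect the heading levels from a page's HTML and confirm they never skip a level on the way down. The sketch below uses only the standard library. Its definition of "logical hierarchy" (exactly one H1, placed first, and no jump of more than one level deeper) is an assumption made for this example, not the criterion used in the cited study.

```python
from html.parser import HTMLParser
from typing import List

class HeadingCollector(HTMLParser):
    """Records the level (1-6) of every heading tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.levels: List[int] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "H2" arrives as "h2".
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))

def heading_levels(html: str) -> List[int]:
    collector = HeadingCollector()
    collector.feed(html)
    return collector.levels

def has_logical_hierarchy(levels: List[int]) -> bool:
    """Assumed rule: a single H1 comes first; going deeper never skips a level."""
    if not levels or levels[0] != 1 or levels.count(1) != 1:
        return False
    return all(nxt - cur <= 1 for cur, nxt in zip(levels, levels[1:]))

good = "<h1>Guide</h1><h2>Cost</h2><h3>Per unit</h3><h2>Requirements</h2>"
bad = "<h1>Guide</h1><h3>Per unit</h3>"  # jumps from H1 straight to H3

good_ok = has_logical_hierarchy(heading_levels(good))  # True
bad_ok = has_logical_hierarchy(heading_levels(bad))    # False
```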

Factor 3 — Trust: are the claims verifiable?

Engines weight sources they can stand behind. Specific, sourced, verifiable claims beat vague superlatives: "certified to ASME BPE with 21 CFR Part 11 records" is citable; "industry-leading quality" is not, because nothing in it can be checked. Named authors, dates, and references all raise trust. This is why precise, evidenced pages get cited over polished but unsubstantiated ones.

Factor 4 — Query-language match: does the page speak the buyer's words?

This is the strongest single lever. The match between a page's vocabulary and the buyer's actual search language is the citation signal that survives statistical controls for domain authority (Discovered Labs, 2026; β=+0.37). A buyer's prompt is also decomposed into sub-questions the engine answers piece by piece, so a page that covers the adjacent questions — cost, requirements, comparisons — is eligible for more of them. Speaking the buyer's language, not internal product language, is what lets a smaller site get cited over a larger one.
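A crude way to see the vocabulary gap is to measure how many of a buyer query's content words appear on the page. The sketch below is a simple term-overlap proxy and is not the metric behind the cited coefficient. The stop-word list, the function names, and both sample passages are assumptions made for illustration.

```python
import re
from typing import Set

# Small illustrative stop-word list; a real audit would use a fuller one.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "for", "to", "in", "is", "what", "how",
     "do", "does", "i", "and", "or"}
)

def terms(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS}

def query_coverage(page_text: str, query: str) -> float:
    """Fraction of the query's content words that appear in the page."""
    query_terms = terms(query)
    if not query_terms:
        return 0.0
    return len(query_terms & terms(page_text)) / len(query_terms)

query = "how much does a sanitary pump cost"

# Hypothetical page copy: internal product language vs. buyer language.
internal_copy = "Hygienic fluid transfer solutions for process industries"
buyer_copy = "Sanitary pump cost: what a sanitary pump costs and why"

internal_score = query_coverage(internal_copy, query)  # 0.0
buyer_score = query_coverage(buyer_copy, query)        # 0.75
```

Exact-token overlap misses synonyms and inflections ("costs" does not match "cost" here), so a low score is a prompt to read the page against real buyer queries, not a verdict.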

Factor 5 — Corroboration: is the source confirmed elsewhere?

When an engine can cross-check a claim or an entity across multiple independent sources, it cites with more confidence. Being referenced consistently across the web — under one coherent company name and description — strengthens this. Fragmented identity (the same firm appearing under two names) splits the signal and weakens it. Corroboration builds slowly, but consistency of identity is something a company can stop undermining immediately.
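Identity fragmentation can be made visible by tallying the names a company appears under across third-party mentions. The sketch below assumes the mentions have already been collected by hand or by another tool. The company names are invented, and the normalization (trimming and lowercasing only) is deliberately minimal.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

def name_consistency(mentions: Iterable[str]) -> Tuple[Optional[str], float]:
    """Return the most common normalized name and its share of all mentions."""
    counts = Counter(name.strip().lower() for name in mentions)
    if not counts:
        return None, 0.0
    dominant, hits = counts.most_common(1)[0]
    return dominant, hits / sum(counts.values())

# Hypothetical mentions gathered from directories and partner pages.
mentions = [
    "Acme Valves",
    "Acme Valves",
    "ACME Valves ",
    "Acme Flow Systems",  # the same firm under a second name
]

dominant_name, share = name_consistency(mentions)
# dominant_name == "acme valves", share == 0.75
```

A share well below 1.0 suggests the same firm is circulating under more than one name, which is the fragmentation described above.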

Why this maps to a fixable diagnosis

Because a cited page clears these stages and an absent page fails at least one, "why aren't we cited?" has a specific, locatable answer. The SIGNALS framework scores pages across the dimensions behind these factors, including Structure, Grounding, Alignment, and Substantiation, to isolate the failing stage rather than guessing. The fix for a retrievability failure (an index gap) is nothing like the fix for an alignment failure (wrong vocabulary), so naming the stage is what makes the work efficient.

Frequently asked questions

How do AI engines decide what to cite?

They retrieve reachable pages, extract self-contained answers, judge which sources they trust, match pages to the buyer's actual question, and prefer sources they can corroborate elsewhere. A page must clear all of these; most uncited pages fail at one specific stage.

Why does an AI engine cite my competitor instead of me?

Usually because your competitor's page clears a stage yours fails — most often query-language match (their page uses the buyer's words) or retrievability (they're in the index the engine reads and you aren't). It is rarely about which company is better; it's about which page is more legible to the engine.

What is the most important factor in getting cited?

Vocabulary alignment — matching the buyer's search language — is the signal that holds up even after controlling for domain authority, which is why a smaller site can be cited over a larger one. Retrievability is the prerequisite; alignment is the strongest lever among reachable pages.

Does domain authority decide AI citations?

Less than in traditional SEO. Page-level vocabulary alignment still predicts citation after controlling for domain authority (Discovered Labs, 2026), so high authority does not protect a page whose language doesn't match the query, and low authority doesn't disqualify a well-aligned one.

Can I influence which of my pages an engine cites?

Yes — by making the right page the most retrievable, extractable, and query-aligned for a given question. The factors are largely on your own site, which is why citation is improvable rather than fixed.

How do I find out which factor is failing for my pages?

A free PULSE assessment measures where you're cited and where you're invisible across the four engines, which points to the failing stage so the fix is targeted rather than guessed.

Related guides

The RAG pipeline explained — the technical mechanism behind these factors.
How to get cited by ChatGPT and how to appear in Google AI Overviews — applying the factors per engine.
AI citation checklist: 25 signals that matter.
The 7 SIGNALS dimensions — the scored version of these factors.