When someone asks ChatGPT a question, it doesn't just make up an answer. It runs a four-stage retrieval process — fetching web pages, extracting content, ranking them, and generating a response. Most websites fail at stage 2 or 3. By the time AI evaluates content quality, your page has already been eliminated.
Understanding this pipeline is the foundation of Answer Engine Optimization (AEO). Once you see where your pages fail, the fixes are obvious. This is the model that drives the SIGNALS framework.
RAG stands for Retrieval-Augmented Generation. It's the architecture that lets ChatGPT (with web browsing), Claude, Perplexity, and Google AI Overviews answer questions using real web content rather than training data alone.
The basic idea: instead of relying only on what the model learned during training, the AI retrieves fresh web pages relevant to the query, extracts the useful content, and generates a response grounded in that content. The pages it retrieved become the citations.
This is why AEO is fundamentally different from SEO. Google asks "which pages does the web trust?" RAG asks "which pages best answer this specific query right now?" Different question, different algorithm, different signals.
Every RAG-based AI system runs some version of this pipeline. The specific implementation varies between ChatGPT, Perplexity, and Google AI Overviews, but the four-stage structure is consistent. This breakdown is based on AgentGEO's 2026 pipeline failure analysis.
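To make the four stages concrete, here is a toy sketch of the pipeline in Python. It is an illustration under loud assumptions, not how any named engine is implemented: the "web" is a small in-memory dict, parsing is a crude tag strip, ranking is plain word overlap, and the names `Page`, `retrieve`, `parse`, `rank`, `generate`, and `answer` are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    html: str           # raw HTML as fetched, no JavaScript executed
    text: str = ""      # filled in by the parsing stage
    score: float = 0.0  # filled in by the ranking stage


def retrieve(corpus):
    """Stage 1: fetch candidate pages (here: read them from a dict)."""
    return [Page(url=url, html=html) for url, html in corpus.items()]


def parse(pages):
    """Stage 2: extract text; pages with nothing extractable are dropped."""
    kept = []
    for page in pages:
        page.text = re.sub(r"<[^>]+>", " ", page.html).strip()
        if page.text:
            kept.append(page)
    return kept


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank(query, pages):
    """Stage 3: sort by the share of query words found on the page."""
    query_tokens = tokens(query)
    for page in pages:
        page.score = len(query_tokens & tokens(page.text)) / max(len(query_tokens), 1)
    return sorted(pages, key=lambda p: p.score, reverse=True)


def generate(ranked, top_k=1):
    """Stage 4: build a response from the top pages and cite them."""
    cited = ranked[:top_k]
    return {
        "answer": " ".join(p.text for p in cited),
        "citations": [p.url for p in cited],
    }


def answer(query, corpus):
    return generate(rank(query, parse(retrieve(corpus))))


# Made-up pages on a placeholder domain.
CORPUS = {
    "https://example.com/spa": '<div id="root"></div>',
    "https://example.com/guide": "<h1>How to fix a leaking tap</h1>"
                                 "<p>Turn off the water, then replace the washer.</p>",
    "https://example.com/about": "<p>We are a plumbing company founded in 1990.</p>",
}
QUERY = "how to fix a leaking tap"
result = answer(QUERY, CORPUS)
```

In this toy run the empty SPA shell is dropped at parsing, the "about" page loses at ranking, and only the guide reaches generation, so it is the only page cited.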
The AI system sends a crawler to fetch pages that might answer the query. For most publicly accessible websites, this stage isn't the problem — crawlers can reach the page.
Check your robots.txt (served at /robots.txt on your own domain, as in signalscite.com/robots.txt) and look for Disallow rules targeting GPTBot, ClaudeBot, or anthropic-ai.
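Retrieval isn't one pipeline; it's three. Each AI engine reads from a different search index, and being in one does not put you in the others: Claude runs its web search on Brave; ChatGPT crawls with its own bot (OAI-SearchBot) and falls back to Bing, which Microsoft Copilot also uses; Gemini and Google AI Overviews read Google's index.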
The practical consequence: optimizing for Google alone leaves you invisible to Claude and to ChatGPT's primary path. "Getting indexed" means getting indexed in all three — and explicitly allowing the AI crawlers (OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended) in robots.txt and at your CDN/WAF. Roughly a third of business sites block at least one by accident through a default Cloudflare rule.
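The robots.txt part of that check can be sketched with Python's standard `urllib.robotparser`. The agent list below simply combines the tokens named in this article, and `blocked_ai_crawlers` and `SAMPLE_ROBOTS_TXT` are hypothetical names for illustration. The sketch only evaluates robots.txt rules; it cannot see a block applied at the CDN or WAF.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens mentioned in this article.
AI_CRAWLERS = [
    "GPTBot",
    "OAI-SearchBot",
    "ClaudeBot",
    "anthropic-ai",
    "PerplexityBot",
    "Google-Extended",
]


def blocked_ai_crawlers(robots_txt, url="https://example.com/"):
    """Return the AI crawler tokens that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_CRAWLERS if not parser.can_fetch(agent, url)]


# A made-up robots.txt that blocks one AI crawler and allows everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

blocked = blocked_ai_crawlers(SAMPLE_ROBOTS_TXT)
```

To run this against a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file over HTTP instead of parsing a string.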
Once fetched, the content must be extractable. The AI system reads the raw HTML and tries to identify: what is the main content, what is the heading structure, what are the key claims. Pages that make this hard — through JavaScript rendering, poor structure, or content buried in dynamic elements — get partially or fully eliminated here.
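One rough way to see what a non-rendering crawler gets is to read the raw HTML without executing any JavaScript. The sketch below uses Python's standard `html.parser`; the `ContentProbe` class, the `probe` helper, and both sample documents are hypothetical, and real extractors are certainly more sophisticated than a word count plus a heading list.

```python
from html.parser import HTMLParser


class ContentProbe(HTMLParser):
    """Collects headings and counts visible words in raw HTML."""

    SKIP = {"script", "style"}        # contents are code, not page copy
    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.headings = []            # list of (tag, text) pairs
        self.words = 0
        self._skip_depth = 0
        self._open_heading = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._open_heading = tag
            self.headings.append((tag, ""))

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == self._open_heading:
            self._open_heading = None

    def handle_data(self, data):
        if self._skip_depth:
            return
        self.words += len(data.split())
        if self._open_heading:
            tag, text = self.headings[-1]
            self.headings[-1] = (tag, (text + " " + data).strip())


def probe(html):
    parser = ContentProbe()
    parser.feed(html)
    parser.close()
    return {"headings": parser.headings, "words": parser.words}


# A JavaScript shell versus a server-rendered page (both made up).
SPA_SHELL = ('<html><body><div id="root"></div>'
             '<script>window.__APP__ = {"title": "Pricing"}</script></body></html>')
SERVER_RENDERED = ("<html><body><h1>Pricing</h1>"
                   "<p>Plans start at 10 dollars per month.</p></body></html>")

spa_report = probe(SPA_SHELL)
ssr_report = probe(SERVER_RENDERED)
```

The shell yields no headings and no visible words, while the server-rendered page exposes both its heading and its copy.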
Your page is now competing against every other retrieved page for the same query. The AI system ranks them by semantic relevance — how well the page's vocabulary matches the query's intent and sub-queries. This is where most pages lose, not because of content quality, but because of language mismatch.
Discovered Labs (2026) found that vocabulary alignment is the only page-level signal with a documented causal effect independent of domain authority — β=+0.37. A small, well-aligned page consistently beats a major brand's page if its language better mirrors how buyers actually search.
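Vocabulary alignment can be approximated with a bag-of-words cosine similarity between the query and the page text. This is only a lexical proxy under stated assumptions: production rankers most likely use learned embeddings, the sketch does not reproduce the β=+0.37 finding, and the query and page copy below are invented for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercased word counts, ignoring punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    counts_a, counts_b = term_counts(a), term_counts(b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm = (math.sqrt(sum(v * v for v in counts_a.values()))
            * math.sqrt(sum(v * v for v in counts_b.values())))
    return dot / norm if norm else 0.0


# Hypothetical buyer query and two hypothetical pages selling the same product.
QUERY = "best invoicing software for freelancers"
INTERNAL_LANGUAGE = "Our synergistic billing platform empowers independent professionals"
BUYER_LANGUAGE = "Invoicing software for freelancers: send invoices and track payments"

internal_score = cosine_similarity(QUERY, INTERNAL_LANGUAGE)
buyer_score = cosine_similarity(QUERY, BUYER_LANGUAGE)
```

The page written in internal language shares no words with the query and scores zero, while the page that mirrors the buyer's wording scores higher even though both describe the same product.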
Your page made it to the generation stage — the AI is now writing its response using your content. But it needs sentences that can be quoted directly, without surrounding context. Vague, relative, or conditional statements get skipped. Specific, sourced, standalone claims get cited.
Princeton GEO (2024) found that adding sourced statistics increased citation frequency by 41%, and adding named expert quotes increased it by 28%. These aren't content quality improvements — they're citation unit improvements. They give the AI something to quote.
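As a rough illustration of what a "citation unit" looks like, the sketch below flags sentences that contain a figure and do not open with a word that leans on earlier context. Both rules are assumptions invented for this example, not a documented selection rule of any AI system, and `OPENERS_NEEDING_CONTEXT`, `is_quotable`, and `quotable_sentences` are hypothetical names.

```python
import re

# Assumed heuristic: sentences opening with these words usually need prior context.
OPENERS_NEEDING_CONTEXT = {"this", "that", "these", "those", "it", "they", "however", "also"}


def split_sentences(text):
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_quotable(sentence):
    """True if the sentence carries a figure and can stand on its own."""
    first_word = re.match(r"[A-Za-z]+", sentence)
    opener = first_word.group(0).lower() if first_word else ""
    has_figure = re.search(r"\d", sentence) is not None
    return has_figure and opener not in OPENERS_NEEDING_CONTEXT


def quotable_sentences(text):
    return [s for s in split_sentences(text) if is_quotable(s)]


SAMPLE = (
    "Princeton GEO (2024) found that adding sourced statistics increased "
    "citation frequency by 41%. This makes a big difference. "
    "They improved it by 28%. Results may vary depending on your situation."
)

units = quotable_sentences(SAMPLE)
```

Only the first sentence survives: the second and fourth carry no figure, and the third has one but opens with "They", so it cannot be quoted without the sentence before it.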
Each of the 7 SIGNALS dimensions targets a specific stage of the RAG pipeline. This is why fixing the right dimension matters — there's no point improving Generation-stage signals if your page is failing at Parsing.
The most common mistake in AEO is optimizing the wrong stage. Businesses spend weeks improving their content quality (Generation stage) while their pages are failing at Parsing because they're built on a JavaScript SPA that renders an empty shell to crawlers.
SIGNALS diagnoses which stage is failing first, then generates fixes in priority order. There's no point adding sourced statistics (Generation fix) to a page that AI systems can't extract content from (Parsing failure).
The correct order: fix Retrieval (accessibility) → fix Parsing (structure, SSR) → fix Ranking (vocabulary, FAQ, headings) → fix Generation (statistics, quotable claims). Most pages need work at Ranking — that's where vocabulary alignment lives, and it's the highest-leverage fix available.
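The fix order above amounts to "work on the earliest failing stage first". A minimal sketch, assuming each stage has already been reduced to a pass/fail result by some other check; `first_failing_stage` and the `checks` dict are hypothetical and do not describe how SIGNALS itself is implemented.

```python
# Pipeline stages in the order a page passes through them.
STAGES = ["Retrieval", "Parsing", "Ranking", "Generation"]


def first_failing_stage(checks):
    """Return the earliest stage whose check failed, or None if all pass.

    checks maps a stage name to True (pass) or False (fail); a missing
    stage is treated as a failure so nothing is skipped silently.
    """
    for stage in STAGES:
        if not checks.get(stage, False):
            return stage
    return None


# A page that is reachable but renders an empty shell to crawlers.
spa_page = {"Retrieval": True, "Parsing": False, "Ranking": False, "Generation": True}
next_fix = first_failing_stage(spa_page)
```

For this page the function returns Parsing, so Ranking work waits until the content can be extracted.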
RAG is the four-stage process AI systems use to answer questions with real web content. Stage 1: Retrieval (fetch relevant pages). Stage 2: Parsing (extract clean content from HTML). Stage 3: Ranking (sort by relevance to the query). Stage 4: Generation (write a response from top-ranked content, citing sources). Most websites fail at Parsing or Ranking — before content quality is ever evaluated.
Usually structural issues, not content quality. The most common failures: JavaScript-only rendering (crawlers see an empty page), poor heading structure (content hierarchy can't be extracted), and vocabulary mismatch (internal language instead of buyer search terms). Pages that fix these structural signals consistently outperform better-written pages that don't.
Google's PageRank rewards backlinks and domain authority. RAG rewards structural clarity, vocabulary alignment, and quotable standalone content. A page can rank #1 on Google and fail the RAG pipeline entirely. The overlap between top-10 Google rankings and AI citations fell from 75% in mid-2025 to 17-38% in early 2026, so the two have largely decoupled.
ChatGPT (with web browsing enabled), Claude, Perplexity, Google AI Overviews, Microsoft Copilot, and Gemini all use RAG or similar retrieval architectures. The specific implementation varies but the four-stage structure — and the signals that predict citation — are consistent across all of them.
Fix in order: first ensure your pages are server-rendered and accessible (Retrieval/Parsing), then fix vocabulary alignment (Ranking). Most pages that pass Retrieval and Parsing still fail at Ranking because of vocabulary mismatch — this is where Alignment's 35% weight in SIGNALS comes from. It's the highest-leverage fix for most businesses.
SIGNALS audits each of the four pipeline stages for any URL, identifies the primary failure point, and generates fixes in priority order. It scores all 7 dimensions that map to pipeline stages — Structure and Language for Parsing/Ranking, Alignment and Intent for Ranking, Grounding and Substantiation for Generation. Free for 1 page, results in under 60 seconds.
Three different indexes. Claude runs its web search on Brave. ChatGPT crawls with its own bot (OAI-SearchBot) and falls back to Bing, which Microsoft Copilot also uses. Gemini and Google AI Overviews read Google's index. Being indexed in one does not put you in the others — ranking #1 on Google does nothing for Claude, which reads Brave. Getting cited across AI means getting indexed in all three, and allowing each engine's crawler in robots.txt.
SIGNALS audits all 4 pipeline stages for any page, identifies exactly where it fails, and generates fixes in the right order. Free for 1 page — no account required.