Technical explainer · How AI citation works

The RAG Pipeline Explained: How ChatGPT, Claude, and Perplexity Decide What to Cite

When someone asks ChatGPT a question, it doesn't just make up an answer. It runs a four-stage pipeline: fetching web pages, extracting their content, ranking the candidate pages, and generating a response. Most websites fail at stage 2 or 3, so by the time the AI evaluates content quality, the page has already been eliminated.

Understanding this pipeline is the foundation of Answer Engine Optimization (AEO). Once you see where your pages fail, the fixes are obvious. This is the model that drives the SIGNALS framework.

Victor Xu

Founder, SIGNALS · AI Visibility Intelligence

Updated June 2026

What is RAG?

RAG stands for Retrieval-Augmented Generation. It's the architecture that powers how ChatGPT (with web browsing), Claude, Perplexity, and Google AI Overviews answer questions using real web content rather than just training data.

The basic idea: instead of relying only on what the model learned during training, the AI retrieves fresh web pages relevant to the query, extracts the useful content, and generates a response grounded in that content. The pages it retrieved become the citations.
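To make the flow concrete, here is a toy sketch of the retrieve, extract, rank, and generate loop. Everything in it is an assumption for illustration: the in-memory TOY_WEB dictionary stands in for real fetching, tag stripping stands in for real extraction, and word overlap stands in for whatever relevance model a production engine uses.

```python
import re

# Hypothetical stand-in for the web: URL -> raw HTML.
TOY_WEB = {
    "https://example.com/pricing":
        "<h1>How much does VR training cost?</h1>"
        "<p>VR training cost depends on scope.</p>",
    "https://example.com/about":
        "<h1>About us</h1><p>We are a small team.</p>",
}


def tokens(text):
    """Lowercased word set used for the toy relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query):
    """Stage 1: fetch candidate pages (here, everything in the toy web)."""
    return dict(TOY_WEB)


def parse(html):
    """Stage 2: strip tags to get plain text (a crude stand-in for extraction)."""
    return re.sub(r"<[^>]+>", " ", html).strip()


def rank(query, docs):
    """Stage 3: order pages by how many query words they contain."""
    q = tokens(query)
    return sorted(docs, key=lambda url: len(q & tokens(docs[url])), reverse=True)


def generate(docs, ranked, top_k=1):
    """Stage 4: build a response from the top pages and list them as citations."""
    cited = ranked[:top_k]
    return {"answer": " ".join(docs[url] for url in cited), "citations": cited}


def answer_query(query):
    docs = {url: parse(html) for url, html in retrieve(query).items()}
    return generate(docs, rank(query, docs))


result = answer_query("how much does VR training cost")
print(result["citations"])  # ['https://example.com/pricing']
```

The page that is cited is simply the page that survived every earlier stage, which is the point the rest of this explainer builds on.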

This is why AEO is fundamentally different from SEO. Google asks "which pages does the web trust?" RAG asks "which pages best answer this specific query right now?" Different question, different algorithm, different signals.

The four stages — where pages succeed and fail

Every RAG-based AI system runs some version of this pipeline. The specific implementation varies between ChatGPT, Perplexity, and Google AI Overviews, but the four-stage structure is consistent. The stage-by-stage breakdown below is based on AgentGEO's 2026 pipeline failure analysis.

Retrieval

Can the AI crawler access and fetch this page?

✓ Most pages pass this stage

The AI system sends a crawler to fetch pages that might answer the query. For most publicly accessible websites, this stage isn't the problem — crawlers can reach the page.

Pages that fail here

→ Pages blocked by robots.txt for AI crawlers (GPTBot, ClaudeBot, PerplexityBot)
→ Pages behind login walls or paywalls
→ Pages with server errors or extremely slow load times
→ Pages that require session cookies to render

Check your robots.txt, served at the root of your domain (for example, signalscite.com/robots.txt), and look for Disallow rules targeting GPTBot, ClaudeBot, or anthropic-ai.
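That check can be scripted with Python's standard library. The sketch below parses a robots.txt body and reports which AI crawler tokens are blocked for a given URL. The sample rules and the example.com URL are made up for illustration, and the crawler list is only the set of tokens named in this article, not an exhaustive registry.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens mentioned in this article; extend as needed.
AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "anthropic-ai", "PerplexityBot"]


def blocked_crawlers(robots_txt, url, agents=AI_CRAWLERS):
    """Return the crawler tokens that robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Illustrative robots.txt: blocks GPTBot, allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_crawlers(SAMPLE_ROBOTS, "https://example.com/pricing"))  # ['GPTBot']
```

To test a live site instead of a string, RobotFileParser also offers set_url() followed by read(). Note that this only covers robots.txt; a block applied at the CDN or firewall will not show up here.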

Retrieval isn't one pipeline — it's three. Each AI engine reads from a different search index, and being in one does not put you in the others:

→ ChatGPT crawls with its own bot (OAI-SearchBot) and falls back to Bing. If your page isn't in Bing's index, that fallback fails. Microsoft Copilot uses Bing too, so one index gates two surfaces.
→ Claude runs its web search on Brave Search. Anthropic added Brave to its subprocessor list in March 2025, and independent analysis found Claude's citations match Brave's top results around 87% of the time. If you're not in Brave's index, Claude can't surface you. There's no Brave webmaster dashboard; you submit URLs one at a time at search.brave.com/submit-url.
→ Gemini and Google AI Overviews sit on Google's index.

The practical consequence: optimizing for Google alone leaves you invisible to Claude and to ChatGPT's primary path. "Getting indexed" means getting indexed in all three — and explicitly allowing the AI crawlers (OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended) in robots.txt and at your CDN/WAF. Roughly a third of business sites block at least one by accident through a default Cloudflare rule.
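The engine-to-index relationships described above fit in a small lookup table. This is a simplified sketch: it treats Bing as ChatGPT's gate and ignores its own crawler path, and the function names are hypothetical helpers, not part of any tool.

```python
# Which search index each AI surface depends on, per the list above.
ENGINE_INDEX = {
    "ChatGPT": "Bing",
    "Microsoft Copilot": "Bing",
    "Claude": "Brave",
    "Gemini": "Google",
    "Google AI Overviews": "Google",
}


def reachable_engines(indexed_in):
    """Engines that can surface a page, given the indexes it appears in."""
    indexed = set(indexed_in)
    return sorted(e for e, index in ENGINE_INDEX.items() if index in indexed)


def missing_indexes(indexed_in):
    """Indexes the page still needs to get into."""
    return sorted(set(ENGINE_INDEX.values()) - set(indexed_in))


print(reachable_engines(["Google"]))  # ['Gemini', 'Google AI Overviews']
print(missing_indexes(["Google"]))    # ['Bing', 'Brave']
```

A page indexed only by Google reaches two of the five surfaces in this table, which is the gap the paragraph above describes.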

↓

Parsing

Can the AI extract clean, structured content from the HTML?

✗ Major failure point — especially for JS-heavy sites

Once fetched, the content must be extractable. The AI system reads the raw HTML and tries to identify the main content, the heading structure, and the key claims. Pages that make this hard, whether through JavaScript rendering, poor structure, or content buried in dynamic elements, get partially or fully eliminated here.
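A rough way to see what a non-rendering crawler sees is to count the visible words in the raw HTML, ignoring scripts and styles. The sketch below does that with the standard library. The 50-word threshold and the function names are arbitrary assumptions for illustration, not a documented cutoff used by any engine.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects words outside <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def visible_word_count(html):
    """Number of words present in the page source without running JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(parser.words)


def looks_like_empty_shell(html, min_words=50):
    """Heuristic: very little visible text suggests a client-rendered shell."""
    return visible_word_count(html) < min_words


spa_shell = (
    '<html><head><title>App</title></head><body><div id="root"></div>'
    '<script>window.__DATA__ = {"a": 1};</script></body></html>'
)
print(looks_like_empty_shell(spa_shell))  # True
```

Running this against the HTML returned by a plain HTTP request (not the browser's rendered DOM) gives a quick first signal of whether server-side rendering is needed.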

Pages that fail here

→ React/Vue/Angular SPAs without server-side rendering — crawlers get an empty shell
→ Content inside JavaScript-rendered tabs, modals, or accordions
→ No clear heading hierarchy — multiple H1s, skipped levels
→ Key content inside iframes or embedded widgets
→ Heavy navigation and footer content overwhelming the main content
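The heading problems in the list above (multiple H1s, skipped levels) are easy to detect mechanically. This is a minimal sketch using the standard library; the function name and the wording of the issue messages are assumptions, and it only looks at heading tags in the raw HTML.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Records the level of every h1..h6 element in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_issues(html):
    """Return a list of human-readable heading hierarchy problems."""
    collector = HeadingCollector()
    collector.feed(html)
    levels = collector.levels

    issues = []
    h1_count = levels.count(1)
    if h1_count == 0:
        issues.append("no H1")
    elif h1_count > 1:
        issues.append(f"{h1_count} H1 elements")

    # A heading may go at most one level deeper than the one before it.
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            issues.append(f"skipped level: H{prev} -> H{cur}")
    return issues


print(heading_issues("<h1>A</h1><h1>B</h1><h4>C</h4>"))
# ['2 H1 elements', 'skipped level: H1 -> H4']
```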

Pages that pass here

→ Server-rendered HTML with real content in the page source
→ Clean H1→H2→H3 hierarchy that maps to content sections
→ Main content in semantic HTML elements (p, article, section)
→ FAQPage or Article JSON-LD schema giving explicit structure signals
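For the JSON-LD item in the list above, here is a small sketch that builds schema.org FAQPage markup from question and answer pairs. The helper names are hypothetical; the property names (mainEntity, Question, acceptedAnswer, Answer) follow the schema.org FAQPage vocabulary.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def as_script_tag(data):
    """Serialize to a JSON-LD script element for the page's HTML."""
    # Escape "</" so answer text can never close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


faq = faq_jsonld([
    ("What is RAG?", "RAG stands for Retrieval-Augmented Generation."),
])
print(as_script_tag(faq))
```

The output belongs in the server-rendered page source, so that it is present without JavaScript execution.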

↓

Ranking

Does this page rank above competitors in the retrieved set?

✗ The most common failure point — where vocabulary alignment decides

Your page is now competing against every other retrieved page for the same query. The AI system ranks them by semantic relevance — how well the page's vocabulary matches the query's intent and sub-queries. This is where most pages lose, not because of content quality, but because of language mismatch.

Discovered Labs (2026) found that vocabulary alignment is the only page-level signal with a documented causal effect independent of domain authority (β=+0.37). A small, well-aligned page can beat a major brand's page if its language better mirrors how buyers actually search.
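As a rough illustration of vocabulary alignment, the sketch below scores a page by the share of each buyer query's terms that appear in the page text, averaged across queries. This is a toy lexical proxy under stated assumptions: the stopword list is arbitrary, and it is neither the metric Discovered Labs measured nor how any AI engine actually ranks.

```python
import re

# Small, arbitrary stopword list for the example.
STOPWORDS = {"a", "an", "the", "is", "of", "to", "for", "in", "and",
             "how", "what", "does", "do"}


def terms(text):
    """Lowercased content words in text."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def alignment(page_text, queries):
    """Mean fraction of each query's terms that the page contains (0.0 to 1.0)."""
    page_terms = terms(page_text)
    scores = []
    for query in queries:
        query_terms = terms(query)
        if query_terms:
            scores.append(len(query_terms & page_terms) / len(query_terms))
    return sum(scores) / len(scores) if scores else 0.0


buyer_queries = ["how much does vr training cost", "vr training pricing"]
buyer_language = "VR training cost and pricing for enterprise teams"
internal_language = "Our immersive learning solutions portfolio"

print(alignment(buyer_language, buyer_queries))     # 0.875
print(alignment(internal_language, buyer_queries))  # 0.0
```

Both sample pages could describe the same product; only the first uses the words buyers type, which is the mismatch this stage punishes.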

Pages that fail here

→ Internal product terminology instead of buyer search language
→ H2s as category labels ("Our Services") instead of buyer questions ("How much does X cost?")
→ Opening paragraph doesn't directly answer the query
→ Page covers only the primary intent, missing adjacent sub-queries
→ No FAQ section covering query decompositions

Pages that pass here

→ Title and H1 mirror exact buyer search phrases
→ Opening paragraph answers the query directly in 2–3 sentences
→ H2s phrased as buyer questions covering multiple sub-intents
→ FAQ section addressing every decomposition of the main query

↓

Generation

Is the content quotable as a standalone unit?

⚠ Often the last mile failure

Your page made it to the generation stage — the AI is now writing its response using your content. But it needs sentences that can be quoted directly, without surrounding context. Vague, relative, or conditional statements get skipped. Specific, sourced, standalone claims get cited.

Princeton GEO (2024) found that adding sourced statistics increased citation frequency by 41%, and adding named expert quotes increased it by 28%. These aren't content quality improvements — they're citation unit improvements. They give the AI something to quote.

Content that fails here

→ "Our approach has been shown to significantly improve outcomes" — vague, unverifiable
→ "As mentioned in the previous section..." — requires context
→ "Results may vary depending on your situation" — conditional, non-quotable
→ "We're the industry leader in..." — unsourced superlative

Content that gets cited

→ "Enterprise VR training reduces learning time by 40% vs. classroom instruction (PwC, 2023)" — specific, sourced, standalone
→ "The average enterprise VR project takes 12–20 weeks from discovery to deployment" — factual, quotable
→ "Among the top AR development firms, Treeview specializes in enterprise HoloLens and Meta Quest builds" — verifiable, entity-rich
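The patterns in the two lists above can be turned into a simple sentence-level check. This is a heuristic sketch only: the phrase lists and regular expressions are assumptions drawn from the examples in this section, and it does not detect entity richness, so treat the output as hints rather than a verdict.

```python
import re

# Phrases taken from the failing examples above; extend for your own content.
CONTEXT_PHRASES = ("as mentioned", "as discussed", "the previous section", "see above")
VAGUE_PHRASES = ("may vary", "depending on", "significantly", "industry leader")

HAS_FIGURE = re.compile(r"\d")
# A parenthetical ending in a year, such as "(PwC, 2023)".
HAS_SOURCE = re.compile(r"\([^()]*\b(?:19|20)\d{2}\)")


def citation_signals(sentence):
    """Return simple quotability signals for one sentence."""
    lowered = sentence.lower()
    return {
        "needs_context": any(p in lowered for p in CONTEXT_PHRASES),
        "vague_or_conditional": any(p in lowered for p in VAGUE_PHRASES),
        "has_figure": bool(HAS_FIGURE.search(sentence)),
        "has_named_source": bool(HAS_SOURCE.search(sentence)),
    }


good = ("Enterprise VR training reduces learning time by 40% "
        "vs. classroom instruction (PwC, 2023)")
weak = "Results may vary depending on your situation"

print(citation_signals(good))
print(citation_signals(weak))
```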

How the SIGNALS framework maps to the pipeline

Each of the 7 SIGNALS dimensions targets a specific stage of the RAG pipeline. This is why fixing the right dimension matters — there's no point improving Generation-stage signals if your page is failing at Parsing.

Stage 1
Retrieval

Newness (5%): visible timestamps help crawlers assess recency. Accessibility itself has no dedicated dimension because retrieval failures are rare and binary: either the page is accessible or it isn't.

Stage 2
Parsing

Structure (12%) — heading hierarchy, self-contained sections, semantic HTML. The primary parsing signal. JSON-LD schema also helps here by giving explicit structural instructions.

Stage 3
Ranking

Alignment (35%) · Language (10%) · Intent (10%) — the three vocabulary signals. Alignment measures overall lexical overlap with buyer queries. Language focuses on title and opening paragraph specifically. Intent measures coverage of adjacent sub-queries.

Stage 4
Generation

Grounding (13%) · Substantiation (15%) — the citation unit signals. Grounding measures sourced statistics and expert quotes. Substantiation measures third-party validation that makes claims credible enough to cite.
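The weights above sum to 100, and grouping them by stage shows how heavily the framework leans on Ranking. The sketch below reproduces that arithmetic. The weights and stage mapping come from this section; combining per-dimension scores as a linear weighted sum is an assumption made for illustration, since the actual scoring formula is not described here.

```python
# Dimension weights (percent) as listed above.
WEIGHTS = {
    "Structure": 12,
    "Intent": 10,
    "Grounding": 13,
    "Newness": 5,
    "Alignment": 35,
    "Language": 10,
    "Substantiation": 15,
}

# Pipeline stage each dimension targets, as listed above.
STAGE_OF = {
    "Newness": "Retrieval",
    "Structure": "Parsing",
    "Alignment": "Ranking",
    "Language": "Ranking",
    "Intent": "Ranking",
    "Grounding": "Generation",
    "Substantiation": "Generation",
}


def stage_weights():
    """Total weight carried by each pipeline stage."""
    totals = {}
    for dimension, weight in WEIGHTS.items():
        stage = STAGE_OF[dimension]
        totals[stage] = totals.get(stage, 0) + weight
    return totals


def weighted_score(scores):
    """Assumed linear combination: scores are 0.0 to 1.0, result is 0 to 100."""
    return sum(WEIGHTS[d] * scores.get(d, 0.0) for d in WEIGHTS)


print(stage_weights())
# {'Parsing': 12, 'Ranking': 55, 'Generation': 28, 'Retrieval': 5}
```

Under this reading, the three Ranking dimensions account for 55 of the 100 points, more than the other three stages combined.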

The practical implication: fix in pipeline order

The most common mistake in AEO is optimizing the wrong stage. Businesses spend weeks improving their content quality (Generation stage) while their pages are failing at Parsing because they're built on a JavaScript SPA that renders an empty shell to crawlers.

The SIGNALS framework diagnoses which stage is failing first, so optimization happens in priority order. There's no point adding sourced statistics (Generation fix) to a page that AI systems can't extract content from (Parsing failure).

The correct order: fix Retrieval (accessibility) → fix Parsing (structure, SSR) → fix Ranking (vocabulary, FAQ, headings) → fix Generation (statistics, quotable claims). Most pages need work at Ranking — that's where vocabulary alignment lives, and it's the highest-leverage fix available.
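The fix-in-order rule is simple enough to express as code. This sketch assumes you already have a pass or fail result per stage from some audit; the function names are hypothetical.

```python
# Pipeline order; earlier stages must pass before later fixes matter.
STAGES = ["Retrieval", "Parsing", "Ranking", "Generation"]


def first_failure(results):
    """Return the earliest failing stage, or None if every stage passes."""
    for stage in STAGES:
        if not results.get(stage, False):
            return stage
    return None


def fix_plan(results):
    """All failing stages, in the order they should be fixed."""
    return [stage for stage in STAGES if not results.get(stage, False)]


audit = {"Retrieval": True, "Parsing": False, "Ranking": False, "Generation": True}
print(first_failure(audit))  # Parsing
print(fix_plan(audit))       # ['Parsing', 'Ranking']
```

A stage missing from the results is treated as failing, so an incomplete audit never skips ahead to a later stage.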

Frequently asked questions

What is RAG (Retrieval-Augmented Generation)?

RAG is the four-stage process AI systems use to answer questions with real web content. Stage 1: Retrieval (fetch relevant pages). Stage 2: Parsing (extract clean content from HTML). Stage 3: Ranking (sort by relevance to the query). Stage 4: Generation (write a response from top-ranked content, citing sources). Most websites fail at Parsing or Ranking — before content quality is ever evaluated.

Why do most websites fail the RAG pipeline?

Usually structural issues, not content quality. The most common failures: JavaScript-only rendering (crawlers see an empty page), poor heading structure (content hierarchy can't be extracted), and vocabulary mismatch (internal language instead of buyer search terms). Pages that fix these structural signals consistently outperform better-written pages that don't.

How is RAG different from Google's algorithm?

Google's PageRank rewards backlinks and domain authority. RAG rewards structural clarity, vocabulary alignment, and quotable standalone content. A page can rank #1 on Google and fail the RAG pipeline entirely. The overlap between top-10 Google rankings and AI citations fell from 75% in mid-2025 to 17–38% in early 2026, so the two have largely decoupled.

Which AI systems use RAG?

ChatGPT (with web browsing enabled), Claude, Perplexity, Google AI Overviews, Microsoft Copilot, and Gemini all use RAG or similar retrieval architectures. The specific implementation varies but the four-stage structure — and the signals that predict citation — are consistent across all of them.

What is the most important RAG pipeline stage to fix?

Fix in order: first ensure your pages are server-rendered and accessible (Retrieval/Parsing), then fix vocabulary alignment (Ranking). Most pages that pass Retrieval and Parsing still fail at Ranking because of vocabulary mismatch — this is where Alignment's 35% weight in SIGNALS comes from. It's the highest-leverage fix for most businesses.

How does the SIGNALS framework diagnose RAG pipeline failures?

The SIGNALS framework assesses each of the four pipeline stages, identifies the primary failure point, and prioritizes fixes accordingly. It maps all 7 dimensions to pipeline stages: Newness to Retrieval; Structure to Parsing; Alignment, Language, and Intent to Ranking; Grounding and Substantiation to Generation. To see where your own pages stand, request a free visibility assessment.

Which search index does each AI engine use?

Three different indexes. Claude runs its web search on Brave. ChatGPT crawls with its own bot (OAI-SearchBot) and falls back to Bing, which Microsoft Copilot also uses. Gemini and Google AI Overviews read Google's index. Being indexed in one does not put you in the others — ranking #1 on Google does nothing for Claude, which reads Brave. Getting cited across AI means getting indexed in all three, and allowing each engine's crawler in robots.txt.


SIGNALS · A BlackSig Systems company