How LLMs Decide Which Pages to Cite in Results

Most AEO audits I see measure the wrong layer.

They scan your page for schema, headings, semantic structure, and alt text, and a citation-readiness score pops out. The number looks scientific. The report is color-coded. The recommendations are tidy.

The number is mostly a retrieval-stage signal. Citation happens two stages later, at synthesis, and that’s where most B2B sites lose.

I’ve watched companies spend a quarter rebuilding their content for “AI readability” (better headings, FAQ schema on every page, summary blocks at the top) and still not get cited. Their pages got retrieved. They got re-ranked. Then the model wrote the answer and pulled three other sources instead.

This piece walks through what actually happens between a user typing a prompt into ChatGPT, Claude, Perplexity, or Gemini, and your domain showing up (or not) in the answer. Four stages. Different signals at each one. Most tools score the first two and ignore the third and fourth.

If you’ve read the breakdown in The 36 AI Search Visibility Factors, this is the mechanic-level layer underneath. That piece tells you what to check. This one tells you why each check matters in the pipeline.

Stage 1: Query understanding

When someone types a prompt, the LLM doesn’t search for that string. It rewrites it.

A prompt like “What’s the best RAG platform for a Series B B2B SaaS company?” becomes, internally, a set of intermediate representations: an embedding (a vector that captures meaning), an inferred intent class (comparison, decision stage), and often a small set of expanded sub-queries the model thinks will return useful context. ChatGPT and Perplexity both do this expansion openly. You can sometimes see the searches they fired off. Claude does it too, less visibly. Gemini hands the expansion to Google’s search infrastructure.

Why this matters for citation: your content has to be retrievable against the expanded query, not the user’s original phrasing. If a prospect asks for “RAG platform for Series B B2B SaaS” and the model expands that to “retrieval-augmented generation enterprise vendor comparison” plus “RAG implementation cost mid-market” plus “vector database production deployment,” your page only enters the candidate pool if it matches one of those rewrites.

This is why writing for the exact long-tail phrase you found in keyword research often misses. The model isn’t matching strings. It’s matching semantic intent.

The practical test: pick a prompt you want to be cited for. Run it in ChatGPT or Perplexity with browsing on. Look at the sub-queries the model actually issued. Those are the queries your content needs to be retrievable against. If your page is built around terminology that doesn’t appear in any of the expansions, you’ve already lost the round.
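
Once you have copied down the sub-queries, a rough offline version of this test is to check how many of each sub-query's terms appear on your page. This is a minimal sketch, assuming plain lowercase word matching as a crude stand-in for the semantic matching a real engine does. The function names, the stopword list, and the sample page text are all hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "for", "of", "to", "in", "and", "or", "vs"}

def terms(text: str) -> set[str]:
    """Lowercase word tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def subquery_coverage(page_text: str, sub_queries: list[str]) -> dict[str, float]:
    """For each sub-query, the fraction of its terms that appear on the page."""
    page_terms = terms(page_text)
    coverage = {}
    for query in sub_queries:
        query_terms = terms(query)
        hits = query_terms & page_terms
        coverage[query] = len(hits) / len(query_terms) if query_terms else 0.0
    return coverage

# Hypothetical page text; the sub-queries are the examples used earlier in this piece.
page = "Our RAG platform compares enterprise vendors and covers implementation cost."
observed = [
    "retrieval-augmented generation enterprise vendor comparison",
    "RAG implementation cost mid-market",
]
for query, share in subquery_coverage(page, observed).items():
    print(f"{share:.0%}  {query}")
```

Exact-word matching understates semantic matches ("vendor" does not match "vendors" here), so treat a low score as a reason to reread the page against that sub-query, not as a verdict.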

Stage 2: Retrieval

The expanded queries hit one or more retrieval indexes.

For ChatGPT and Copilot, that’s primarily Bing. For Perplexity, it’s their own crawler plus Bing. For Gemini and Google AI Overviews, it’s Google’s main index plus a separate AI-specific path. For Claude, it depends on whether the conversation triggered a web search. When it does, Claude uses a Brave-powered index plus its own retrieval.

Two retrieval methods run in parallel: keyword matching (does the page contain the exact terms?) and vector retrieval (is the page’s meaning close to the query’s meaning?). Modern engines blend both. Pure keyword retrieval was how SEO worked through 2022. Pure vector retrieval would mean any topically related page could surface. The blend keeps the system anchored to actual term presence while letting semantically related content compete.
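
The blend can be sketched as a weighted sum of a keyword score and a vector similarity. This is an illustration only: the engines do not publish their scoring, so the `alpha` weight, the function names, and the hand-made two-dimensional vectors standing in for real embeddings are all assumptions.

```python
import math
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; a stand-in for a real text analyzer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_score(query: str, passage: str) -> float:
    """Fraction of distinct query terms that appear verbatim in the passage."""
    q_terms = tokenize(query)
    return len(q_terms & tokenize(passage)) / len(q_terms) if q_terms else 0.0

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hybrid_score(query, passage, query_vec, passage_vec, alpha=0.5) -> float:
    """alpha weights the keyword match; (1 - alpha) weights the vector similarity."""
    return alpha * keyword_score(query, passage) + (1 - alpha) * cosine(query_vec, passage_vec)

query = "RAG vendor comparison"
exact = "A comparison of RAG vendor pricing"
related = "Choosing a retrieval-augmented generation provider"

# Hand-made 2-d vectors standing in for real embeddings (assumed values).
query_vec, exact_vec, related_vec = [1.0, 0.0], [1.0, 0.0], [0.8, 0.6]

print(hybrid_score(query, exact, query_vec, exact_vec))
print(hybrid_score(query, related, query_vec, related_vec))
```

The second passage shares no terms with the query, so its whole score comes from the vector half. That is the sense in which semantically related content can compete while exact term presence still earns a premium.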

What this means for your page:

The page has to be in the index. If GPTBot, ClaudeBot, PerplexityBot, or GoogleOther can’t crawl it, you’re out before retrieval begins. Robots.txt is the gate. I still see B2B sites disallowing AI bots by default.
The page has to be chunked usefully. Engines retrieve at the passage or chunk level, not the full-page level. A 4,000-word article that buries its definition of a key concept in paragraph 23 may have that paragraph retrieved while the rest is ignored. That’s fine, except paragraph 23 is usually surrounded by context that helps it make sense. The chunk that gets retrieved should stand on its own.
The page has to use the entities the model expects. Entity coverage matters more than keyword density. If you’re writing about RAG, the page should mention vector databases, embedding models, chunking strategies, retrieval, and answer synthesis. Not because you’re keyword-stuffing, but because the model uses entity co-occurrence as a relevance signal.
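
The crawlability gate is the easiest of the three to check yourself. A minimal sketch using Python's standard-library robots.txt parser follows. The bot list simply repeats the user agents named above, so confirm the current tokens in each vendor's documentation. The sample robots.txt and the URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens as named in this article; verify against each vendor's docs.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "GoogleOther"]

def blocked_bots(robots_txt: str, url: str, bots=AI_BOTS) -> list[str]:
    """Return the bots that this robots.txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]

# Hypothetical robots.txt that blocks one AI crawler and allows everyone else.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_bots(sample, "https://example.com/blog/rag-pricing"))
```

To check a live site, fetch its robots.txt yourself and pass the text in. Keeping the fetch separate makes the check easy to run against a saved copy.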

If a page passes Stage 2, it’s a candidate. Candidate ≠ cited. Most of the citation gap I see at B2B sites is between this stage and the next two.

Stage 3: Re-ranking

The retrieval stage usually returns more candidates than the model can use. ChatGPT’s browsing tool often pulls 8-15 sources. Perplexity’s pipeline retrieves dozens. The re-ranker decides which ones make it through.

Re-ranking applies authority and quality signals on top of relevance. These aren’t published, but the patterns are visible:

Domain authority. Pages on domains with stronger external signals (incoming links, mentions, citations in training data, traffic volume, and age) rank higher than pages on weak domains. This is roughly equivalent to traditional SEO domain authority but with different inputs. LLMs care about whether the domain is referenced by other authoritative content, not just whether it has backlinks.

Source freshness. For most prompts, recency is a positive signal. For evergreen topics, less so. Perplexity weights freshness heavily. Its pipeline assumes the user wants current information unless context suggests otherwise. This is why a piece I published in March can outrank a six-year-old industry reference for the same query on Perplexity but lose to that same reference on Claude.

Source diversity. The re-ranker tries not to surface ten pages from the same domain. If three of your pages got retrieved, usually one survives. Which one? The most “topically central”: the page that most directly answers the expanded query, not the deepest sub-topic page.

Format match. If the user asks a comparison question, the re-ranker prefers comparison-formatted content. If they asked “how to,” it prefers procedural content. A great explainer article can lose to a worse comparison table when the prompt was a comparison prompt.

A page can be the most relevant result in the index and still get cut here because three other pages have stronger domain signals, fresher dates, or better format matches. This is where a lot of optimization investment hits a ceiling. You can write the best page on a topic. The re-ranker will still favor the page from the better-known domain unless your domain is also strong.
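
The source-diversity behavior described above can be pictured as a one-per-domain filter applied after scoring. This is a sketch under stated assumptions, not a description of any engine's actual re-ranker: the scores are invented, and "keep the single best page per domain" is a simplification of "usually one survives."

```python
from urllib.parse import urlparse

def one_per_domain(candidates: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Keep only the highest-scoring URL for each domain, best first."""
    best: dict[str, tuple[str, float]] = {}
    for url, score in candidates:
        domain = urlparse(url).netloc.lower()
        if domain not in best or score > best[domain][1]:
            best[domain] = (url, score)
    return sorted(best.values(), key=lambda item: item[1], reverse=True)

# Hypothetical candidates: (url, relevance score against the expanded query).
candidates = [
    ("https://vendor.example/rag-overview", 0.91),
    ("https://vendor.example/rag/chunking-deep-dive", 0.84),
    ("https://analyst.example/rag-report", 0.88),
]
for url, score in one_per_domain(candidates):
    print(score, url)
```

In this toy run the deep-dive page is dropped even though it scored well, because a more central page from the same domain scored higher.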

Stage 4: Answer synthesis

This is the stage that decides what’s cited.

After re-ranking, the model has 4-8 sources it’s allowed to use. It doesn’t have to use all of them. It reads each one, drafts an answer in its head, and picks the sources whose content actually shows up in the answer.

Three things decide which sources get cited at synthesis:

Quotability. Does the source contain a sentence, statistic, or framing that the model wants to use? Pages built around concrete claims, named numbers, and specific assertions get cited more than pages built around abstractions. A page that says “RAG implementations under 50,000 documents typically cost $40-80K to build and $2-6K/month to operate” is more quotable than a page that says “RAG implementations vary in cost depending on scope.” The model has something to pull from in the first case.

Coverage. Does the source cover the specific facet of the question the model is answering? If the answer needs to address pricing, the cited source will be the one that addressed pricing, even if another source had a stronger general treatment of the topic.

Position. Cited sources show up in two different ways. Inline citations appear at the end of specific sentences in the body of the answer, attributing claims to a source. “See also” or “Sources” lists at the bottom of the answer have far less weight. A reader following an inline citation is acting on a specific claim. A reader scrolling to the source list at the bottom is doing reference checking. The first kind of citation drives traffic and trust. The second mostly doesn’t.
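
For a crude self-check on quotability, you can measure what share of a page's sentences contain at least one number. This is only a heuristic sketch: a digit is a weak proxy for a "concrete claim," the sentence splitter is naive, and nothing here reflects how a model actually selects sources. The function names are hypothetical; the two sample sentences are the ones quoted above.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive split on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def numeric_sentence_share(text: str) -> float:
    """Fraction of sentences that contain at least one digit."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(bool(re.search(r"\d", s)) for s in sentences) / len(sentences)

concrete = (
    "RAG implementations under 50,000 documents typically cost $40-80K "
    "to build and $2-6K/month to operate."
)
vague = "RAG implementations vary in cost depending on scope."

print(numeric_sentence_share(concrete))
print(numeric_sentence_share(vague))
```

A low share does not mean a page is unquotable, but it is a cheap way to find sections that make no specific assertion at all.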

This is the stage your AEO measurement should focus on. Not whether you’re retrievable. Whether you’re cited at synthesis.

B2B AI Visibility: Why Your Website Is Invisible to ChatGPT walks through the synthesis-stage gap in more detail. The short version: if you’re optimizing for stages 1 and 2 only, you’ll improve your retrieval rate without improving your citation rate, and you won’t be able to tell.

Source authority weighting: what LLMs actually trust

There’s a rough hierarchy in how LLMs weight sources during re-ranking and synthesis. It looks like this:

1. The brand’s own domain. When the model is answering a question about a specific company, product, or proprietary methodology, the brand’s own content is weighted heavily, assuming the page exists, is retrievable, and addresses the question. This is why publishing to your own blog matters more than publishing the same content to Medium for branded queries.
2. Authoritative third-party content. Industry publications, research firms, and well-cited analysts. A Gartner mention, a Forrester write-up, or a piece in MIT Technology Review or The Information all carry weight at the re-ranker. Less prestigious B2B publications still help, but less. This tier is where most of your “citation acquisition” work pays off if you’re a B2B vendor.
3. Aggregators and listicles. G2, Capterra, Software Advice, and “best X tools 2026” roundups. These get cited frequently but with lower weight per citation, and they tend to surface in comparison-intent queries. They’re useful for getting into consideration sets, but they rarely drive deep authority.
4. User-generated content. Reddit threads, Stack Overflow, Quora, and forums. Cited heavily on some platforms (Perplexity, Google AI Overviews historically) and less on others. The weighting here moves fast. Reddit lost most of its citation share on Perplexity in early 2026. I wrote about that pattern in Reddit Lost 86% of Its Citation Share on Perplexity in Three Months. This tier is unstable.
5. Social posts and short form. LinkedIn posts, X threads, and YouTube descriptions. Almost never cited as primary sources in answers. They occasionally surface in “see also” lists.

The implication: if you’re a B2B vendor and your owned content (1) is thin, your citation rate caps out at whatever (2) and (3) say about you. That’s why I push hard on blog publishing as the canonical surface. Medium republishing extends reach but doesn’t substitute for blog presence on your own domain.

The Revenue Experts AI Citation Audit Method covers how we measure each tier and where most B2B sites are actually losing.

Citation reproducibility: the metric most audits skip

If you run the same prompt in the same LLM twice in a row, you usually don’t get the same answer. The citations vary, too.

Most AEO audits I see run a prompt once, screenshot the result, count whether the brand showed up, and move on. That single measurement is unreliable. The LLM has stochastic elements in retrieval ranking and synthesis. A source that appears in run 1 may not appear in run 2.

Citation reproducibility is the rate at which a source appears across multiple runs of the same prompt. A source cited in 5 out of 5 runs has high reproducibility. A source cited in 1 out of 5 has low reproducibility. That citation is essentially noise.
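
The definition reduces to a single ratio. A minimal sketch, assuming each run has already been reduced to the set of domains it cited; the function name and sample domains are hypothetical.

```python
def reproducibility_rate(runs: list[set[str]], domain: str) -> float:
    """Share of runs whose cited-domain set includes `domain`."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(domain in cited for cited in runs) / len(runs)

# Five hypothetical runs of the same prompt in the same LLM.
runs = [
    {"acme.example", "rival.example"},
    {"acme.example"},
    {"rival.example", "news.example"},
    {"acme.example", "news.example"},
    {"acme.example", "rival.example"},
]
print(reproducibility_rate(runs, "acme.example"))
print(reproducibility_rate(runs, "news.example"))
```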

Reproducibility matters because it tells you something the single-run measurement can’t: whether your page is systematically surfacing or just lucky. A competitor’s page that appears once in five runs is not a threat. A page that appears five out of five times has effectively locked the slot.

In the Citation Audit Method, every prompt runs at least three times per LLM. Across four LLMs (ChatGPT, Claude, Perplexity, Gemini), that’s twelve measurements per prompt. Twelve measurements per prompt across a 50-prompt category sweep is 600 observations. That’s enough to distinguish signal from noise.
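
The run counts above can be laid out as an explicit measurement plan, which also gives you a checklist to execute against. A small sketch; the prompt labels are placeholders and the function name is hypothetical.

```python
from itertools import product

LLMS = ["ChatGPT", "Claude", "Perplexity", "Gemini"]
RUNS_PER_LLM = 3  # the minimum stated above

def measurement_plan(prompts, llms=LLMS, runs=RUNS_PER_LLM):
    """Every (prompt, llm, run_number) combination that needs to be executed."""
    return list(product(prompts, llms, range(1, runs + 1)))

prompts = [f"prompt-{i}" for i in range(1, 51)]  # placeholder labels for a 50-prompt sweep
plan = measurement_plan(prompts)

print(len(plan) // len(prompts))  # measurements per prompt: 4 LLMs x 3 runs = 12
print(len(plan))                  # total observations: 50 x 12 = 600
```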

If your AEO tool shows a single run per prompt per LLM, the scores it produces are mostly noise on the borderline. Pages that look like they’re winning may not be. Pages that look like they’re losing may actually be partial winners.

Run your own check: pick a prompt where you got cited in the AI Overviews or ChatGPT answer last week. Re-run it three times today. Count how many runs cite you. If it’s three out of three, you have a defensible position. If it’s one out of three, you don’t have a position. You got lucky once.

What to measure instead

Five metrics actually correlate with citation outcomes. Most AEO tools track at most one or two.

Citation reproducibility rate. Of N runs of the same prompt, what percentage cite the source? Cohorts: high (>70%), medium (30-70%), low (<30%). Measured across LLMs.

Inline vs. see-also position. Of cited appearances, what percentage are inline citations in the body of the answer vs. appearances in source lists? Inline citations correlate with measured referral traffic from AI sources at roughly 8-12x the rate of see-also appearances in my sample.

Multi-LLM coverage. Of the four major LLMs (ChatGPT, Claude, Perplexity, and Gemini), how many surface the brand for the target prompt? A brand cited in one LLM only is in a much weaker position than a brand cited in three or four. Concentration risk is real.

Competitor co-citation rate. When the brand is cited, which competitors are also cited in the same answer? If you’re always cited alongside the same three competitors, you’re in a defined consideration set. If you’re cited in isolation, you may be the default, or you may be in a small pool. This is information you can act on for positioning.

Open territory rate. What percentage of prompts in the category return no brand citations at all: only generic explainers, aggregators, or news sources? Open territory is where new entrants have the highest upside. These are the prompts to target first.
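
All five metrics can be computed from the same log of runs. The sketch below assumes a hypothetical record shape (one record per answer, listing which domains were cited inline and which appeared only in the source list). It pools reproducibility across LLMs for each prompt, which is one of several reasonable ways to aggregate. Names and sample data are illustrative.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    """One answer from one LLM for one prompt (hypothetical record shape)."""
    prompt: str
    llm: str
    inline: frozenset[str]        # domains cited inline in the answer body
    sources_only: frozenset[str]  # domains that appear only in the source list

    @property
    def cited(self) -> frozenset[str]:
        return self.inline | self.sources_only

def cohort(rate: float) -> str:
    """Buckets from the text: high (>70%), medium (30-70%), low (<30%)."""
    if rate > 0.70:
        return "high"
    return "medium" if rate >= 0.30 else "low"

def brand_metrics(runs: list[Run], brand: str, tracked: set[str]) -> dict:
    """Five metrics for `brand`; `tracked` is the full brand set, including `brand`."""
    if not runs:
        raise ValueError("need at least one run")
    cited_runs = [r for r in runs if brand in r.cited]

    # 1. Reproducibility per prompt, pooled across LLMs and runs.
    hits = defaultdict(list)
    for r in runs:
        hits[r.prompt].append(brand in r.cited)
    reproducibility = {p: sum(h) / len(h) for p, h in hits.items()}

    # 2. Share of the brand's cited appearances that are inline.
    inline_share = (
        sum(brand in r.inline for r in cited_runs) / len(cited_runs) if cited_runs else None
    )

    # 3. LLMs that surfaced the brand at least once.
    llm_coverage = sorted({r.llm for r in cited_runs})

    # 4. How many answers each other tracked brand shares with this one.
    co_cited = Counter()
    for r in cited_runs:
        co_cited.update(r.cited & (tracked - {brand}))

    # 5. Share of prompts where no tracked brand is cited in any run.
    prompts = {r.prompt for r in runs}
    branded = {r.prompt for r in runs if r.cited & tracked}
    open_territory = len(prompts - branded) / len(prompts)

    return {
        "reproducibility": reproducibility,
        "cohorts": {p: cohort(v) for p, v in reproducibility.items()},
        "inline_share": inline_share,
        "llm_coverage": llm_coverage,
        "co_cited": dict(co_cited),
        "open_territory": open_territory,
    }

runs = [
    Run("best rag platform", "ChatGPT", frozenset({"acme.example"}), frozenset({"rival.example"})),
    Run("best rag platform", "ChatGPT", frozenset({"rival.example"}), frozenset()),
    Run("best rag platform", "Perplexity", frozenset(), frozenset({"acme.example"})),
    Run("what is rag", "Claude", frozenset({"wiki.example"}), frozenset()),
]
metrics = brand_metrics(runs, "acme.example", {"acme.example", "rival.example"})
for name, value in metrics.items():
    print(name, value)
```

With only four records the numbers mean nothing, but the shape is the point: every metric falls out of the same run log, so the cost is in collecting the runs, not in the arithmetic.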

None of these metrics show up on a free AEO score. They require running the actual prompts multiple times across multiple LLMs, with a defined category and a defined competitor set.

The free check vs. the actual measurement

There are two reasonable starting points for understanding your AEO position.

The first is technical readiness. If your site isn’t crawlable by AI bots, isn’t structured for chunk-level retrieval, or has no authority signals on the page, you can’t be cited regardless of how well your category content is written. This is a baseline check, not an outcome measurement. The free 60-second AI Visibility Audit at Revenue Experts runs this. It checks your robots.txt against the major AI crawlers, scores citation readiness across five categories, and outputs a prioritized fix list. If your readiness score is under 60, fix that first.

The second is citation measurement. This is what the methodology in this article is built around. You design 50 prompts across research, comparison, and decision intent, sized to your category, your competitors, and your buyer’s actual question patterns. You run each prompt three times across four LLMs. You score the outputs. You build the citation map. The $497 audit packages this with a 5-7 day turnaround and a per-prompt diagnosis explaining why you’re not being cited where you’re not.

The free audit answers, “Is my site technically ready to be cited?” The paid audit answers, “In my actual category, am I being cited, and against whom?”

You don’t run both at the same time. Run the free one first, fix anything obvious, then come back for the actual citation map once your technical baseline is solid.

For a deeper read on what the technical readiness layer actually contains, The 36 AI Search Visibility Factors walks through the full checklist. And if you’re sizing whether RAG infrastructure is the next layer for your stack, RAG: Build or Buy? A Cost Framework for B2B Leaders covers the math.

What this means for your AEO program

If you take one thing from this walkthrough, make it this: stop optimizing only for the layers your tools measure.

Schema markup, heading structure, semantic clarity, and chunk-level coherence all help. They get you into the candidate pool. They don’t decide whether you get cited. Citation gets decided at synthesis, and synthesis weights quotability, coverage, and source authority. None of which are visible in a 60-second scan.

The teams I see winning at AEO in 2026 are running multi-run citation tests against a defined prompt set every four to six weeks and treating the resulting reproducibility scores as the actual KPI. They’re not chasing readiness scores. They’re chasing answers to “Are we systematically surfacing for the queries that matter?”

Get the weekly read

The Revenue Signal goes out every Thursday morning. One signal, one named B2B build, one move you can run before the next issue arrives. Drawn from real data and real company examples, written for the contract signer who approves AEO and RAG budgets but doesn’t have time to chase every benchmark.

This week’s issue takes the citation reproducibility framework above and walks through the named cases from the 2026 AI Visibility Benchmark—including how Clio reached 89 out of 100 across 1,400 buyer-intent prompts.

Elizabeta Kuzevska is Co-Founder and Fractional AI Search and RAG Strategy Advisor at Revenue Experts AI. She publishes weekly on AEO and RAG strategy for B2B SaaS companies and writes The Revenue Signal newsletter every Thursday.