Retrieval inside an LLM is still retrieval
Open Perplexity. Type "why does the moon have phases". Hit enter.
The answer comes back as a couple of paragraphs of synthesised prose. Above the prose there’s a row of source chips: numbered citations, each linked to a real page, ordered by the model’s estimate of how much each contributed. Underneath, in some UIs, you can expand "Sources" and see the underlying retrieval set - a dozen or so URLs with titles and a snippet.
I want to spend a moment on that Sources panel, because if you’ve been building search infrastructure for any length of time, what you’re looking at is extremely familiar.
The pipeline, named
Here, in order, is what almost every "AI answer" surface does - Perplexity, ChatGPT with web access, Claude with search, Google’s AI Overview, the lot:
1. Take the user’s prompt and turn it into one or more retrieval queries. Sometimes that’s a verbatim pass-through; usually it’s a rewrite or decomposition into multiple queries. (Agentic search loops do this multiple times.)
2. Run those queries against an index to get a candidate set of documents or passages. The index is some flavour of lexical-plus-vector - sparse keyword matching for one signal, dense embedding similarity for another, hybrid scoring across both.
3. Rerank the candidates using a stronger but more expensive model - a cross-encoder, a late-interaction scorer, or sometimes a small reranking LLM.
4. Truncate to the top-K passages that fit in the eventual context window. Usually small - 5 to 20 passages, well below what the model could technically accept.
5. Pass those passages to a large language model, along with the original prompt and an instruction to answer using them as evidence and cite them as it goes.
6. Render the model’s output with the citation markers turned into clickable chips that link back to the retrieved passages.
That’s it. That’s the architecture inside the box that you can’t see.
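To make the shape concrete, here is a minimal sketch of the six steps in plain Python. Everything in it is an assumption made for illustration: the toy `CORPUS`, the word-overlap scoring that stands in for a real lexical-plus-vector index, the stopword-aware `rerank`, and the `generate` stub that stands in for the LLM call. It is not how any particular product implements its pipeline, only the order in which the stages hand work to each other.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str
    text: str


# Hypothetical toy corpus standing in for a real index.
CORPUS = [
    Passage("moon-phases", "The moon has phases because we see different portions of its sunlit half as it orbits Earth."),
    Passage("tides", "Tides are caused mainly by the gravitational pull of the moon on the oceans."),
    Passage("eclipses", "A lunar eclipse happens when Earth's shadow falls on the moon."),
    Passage("bread", "Sourdough bread rises because wild yeast produces carbon dioxide."),
]

STOPWORDS = {"a", "the", "of", "on", "why", "does", "have", "has", "is", "are"}


def words(text: str) -> set[str]:
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


def rewrite(prompt: str) -> list[str]:
    # Step 1: verbatim pass-through here; real systems usually rewrite or decompose.
    return [prompt]


def retrieve(queries: list[str], limit: int = 3) -> list[Passage]:
    # Step 2: cheap and broad. Raw word overlap stands in for lexical + vector search.
    scored = [(max(len(words(q) & words(p.text)) for q in queries), p) for p in CORPUS]
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:limit]]


def rerank(prompt: str, candidates: list[Passage]) -> list[Passage]:
    # Step 3: a pretend "stronger" scorer that ignores stopwords.
    content = words(prompt) - STOPWORDS
    return sorted(candidates, key=lambda p: len(content & words(p.text)), reverse=True)


def truncate(passages: list[Passage], top_k: int = 2, word_budget: int = 40) -> list[Passage]:
    # Step 4: keep the best passages that fit the (word-count) budget.
    kept, used = [], 0
    for p in passages[:top_k]:
        n = len(p.text.split())
        if used + n > word_budget:
            break
        kept.append(p)
        used += n
    return kept


def generate(prompt: str, passages: list[Passage]) -> str:
    # Step 5: hypothetical stand-in for the LLM call; it just quotes the top passage.
    return f"{passages[0].text} [1]" if passages else "No supporting passages found."


def render(answer: str, passages: list[Passage]) -> str:
    # Step 6: turn [n] markers into references to the retrieved passages.
    def chip(match: re.Match) -> str:
        n = int(match.group(1))
        return f"[{n}: {passages[n - 1].doc_id}]"

    return re.sub(r"\[(\d+)\]", chip, answer)


def answer_question(prompt: str) -> str:
    passages = truncate(rerank(prompt, retrieve(rewrite(prompt))))
    return render(generate(prompt, passages), passages)


print(answer_question("why does the moon have phases"))
```

Note that `generate` only ever sees what `truncate` lets through. If the right passage never makes it out of `retrieve`, nothing downstream can cite it.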
If you’ve been reading the search engineering literature for the last decade, every single one of those steps has a name and a body of work behind it. Steps 2 and 3 are retrieval and reranking - the classical two-stage IR pipeline that goes back to TREC. Step 4 is truncation under budget, which is what every search system has done since memory cost more than zero. Step 5 is where the new part lives - but step 5 is downstream of everything else and depends on what arrives there.
The new bit is just the wrapper. The mechanics underneath are the same mechanics.
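The "hybrid scoring across both" part of the retrieval step is worth a small illustration. One common way to combine a lexical ranking with a dense ranking is reciprocal rank fusion, which only needs the rank positions and not comparable scores. I'm not claiming any of the products above use it; this is a sketch of the general technique, with hypothetical document ids and the conventional constant `k = 60`.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank) per doc."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# Hypothetical outputs of a keyword index and an embedding index for one query.
lexical = ["doc_a", "doc_b", "doc_c"]
dense = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([lexical, dense])
print(fused)
```

Documents that both retrievers rank well (`doc_a`, `doc_c`) rise above documents that only one of them found, which is the whole point of running two signals.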
Each step is something the search community already argues about
I’ve spent the last fortnight writing about pieces of this pipeline as if they were independent concerns. Looking back, they were never independent - they were the same architecture under different names.
- The TurboQuant / Hadamard rotation work is about how to compress the dense-vector half of step 2 so you can store more of it on less hardware. It’s a step-2 optimisation.
- Late interaction is about doing step 3 better - preserving token-level signal so the reranker has more to work with than a pooled vector. ColBERT and its descendants are step-3 technology.
- The DiskBBQ filtered-search work is about making restrictive filters at step 2 cheap enough to be worth running in the first place.
- The xAI two-tower retrieval architecture is the same shape - bi-encoder for step 2, transformer for step 3 - just trained on user/post interactions instead of query/document relevance.
- And the whole "agentic search" frame is about iterating steps 1–4 - a planner reissuing queries, refining, and re-retrieving until it has enough context for step 5.
When you put them in a row like that, they stop looking like distinct research directions and start looking like different operating-point choices on the same pipeline. Which corner of the cost/quality/latency surface do you want to live in for each stage?
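As one example of what an operating-point choice looks like in code, here is a sketch of the difference between scoring with a single pooled vector and scoring with late interaction. The MaxSim rule (for each query token, take its best-matching document token, then sum) is the scoring used by ColBERT-style models; the two-dimensional "embeddings" below are made up so the arithmetic is easy to follow.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Collapse per-token vectors into one vector by averaging each dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pooled_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    # Single-vector scoring: one dot product between the two pooled vectors.
    return dot(mean_pool(query_tokens), mean_pool(doc_tokens))


def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    # Late interaction: each query token keeps its best match among the doc tokens.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Made-up token embeddings, for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_exact = [[1.0, 0.0], [0.0, 1.0]]   # has a token matching each query token
doc_blurry = [[0.5, 0.5], [0.5, 0.5]]  # same average, no sharp per-token match

print(pooled_score(query, doc_exact), pooled_score(query, doc_blurry))
print(maxsim_score(query, doc_exact), maxsim_score(query, doc_blurry))
```

The pooled scores for the two documents are identical because averaging throws the token-level structure away. MaxSim separates them, at the cost of storing and comparing one vector per token instead of one per passage.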
The pattern, again
I keep finding myself drawing the same shape. A cheap, broad, approximate stage in front of an expensive, narrow, exact stage. Retrieval in front of reranking. Reranking in front of generation. Generation in front of citation rendering. The pattern goes all the way down because at every layer of an AI answer there’s something that’s cheap to call a lot, and something you don’t want to call unless you have to.
The frontier LLM is the most expensive single component in this stack by a wide margin - a few dollars per million tokens at the high end, with serious latency budgets attached. Everything in front of it exists to make sure it only sees the candidates worth seeing. The corpus, the lexical index, the embedding index, the rotation, the quantisation, the reranker - all of that is "the cheap thing in front of the expensive thing", applied at this layer, with very large stakes attached because the expensive thing is now an LLM rather than a database read.
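The pattern is small enough to write down generically. The sketch below is an assumption-laden toy, not a model of any real system: the "cheap" scorer only looks at a coarse bucket, the "expensive" scorer is exact, and a counter records how often each one is called.

```python
from typing import Callable, Iterable


def cascade(
    items: Iterable[int],
    cheap: Callable[[int], float],
    expensive: Callable[[int], float],
    keep: int,
) -> list[int]:
    """Score every item with `cheap`, then only the top `keep` with `expensive`."""
    shortlist = sorted(items, key=cheap, reverse=True)[:keep]
    return sorted(shortlist, key=expensive, reverse=True)


calls = {"cheap": 0, "expensive": 0}


def cheap_score(x: int) -> float:
    # Approximate: only knows which bucket of ten the item falls in.
    calls["cheap"] += 1
    return -abs(x // 10 - 50)


def expensive_score(x: int) -> float:
    # Exact: distance to the (hypothetical) ideal item, 503.
    calls["expensive"] += 1
    return -abs(x - 503)


best = cascade(range(1000), cheap_score, expensive_score, keep=10)
print(best[0], calls)
```

The expensive scorer runs on ten items instead of a thousand and still finds the best one, but only because the cheap stage kept the right bucket. If the cheap stage had dropped it, the expensive stage would have had no way to recover.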
The retrieval engineering that was niche and unglamorous a few years ago turns out to be the load-bearing wall of the entire AI-answer surface. It’s the same engineering. It just matters more now, because what’s downstream of it is more expensive and more visible.
Why this matters for the series
I started this short series by writing about how my own search habits have shifted away from a list-of-links and towards a paragraph-with-citations. That shift is real, and it changes everything about who clicks what, what gets read, and what businesses can build on top of consumer search.
But it doesn’t change the engineering that produces the paragraph. If anything, the new world is more dependent on getting retrieval right than the old one was. When a search engine showed you ten links, an imperfect retrieval ranking gave you a #4 result that you could still find if you scrolled. When an LLM gives you one paragraph with three citations, an imperfect retrieval ranking is the difference between being mentioned and not being mentioned at all.
Which means the next thing worth thinking about, after "the SERP has changed", is: who is on the other end of the citation slot? How is that decision made? What signals push a document into the candidate set, into the top-K, into the citation chip?
That’s the question for the next post.
