xAI algorithm through a search lens
1. Two-tower’s quiet dominance
The Phoenix retrieval model - the bit that finds out-of-network posts for your feed from a global corpus - is a two-tower bi-encoder. User embeddings on one side, post embeddings on the other, dot product to score. The mini model they ship is 256-dimensional, 2 transformer layers, 4 attention heads. The full archive is ~3 GB.
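The scoring path is simple enough to sketch in a few lines. This is a toy with random stand-in vectors, not the trained Phoenix encoders - it only exists to show why one 256-dim dot product per post is so cheap: retrieval over the whole corpus is a single matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 256  # the mini model's embedding width

# One user-tower output and a corpus of post-tower outputs.
# In the real system both come from trained transformers; here
# they are random stand-ins just to show the scoring shape.
user_vec = rng.standard_normal(DIM)
post_vecs = rng.standard_normal((10_000, DIM))

# Retrieval is one matrix-vector product: a dot product per post.
scores = post_vecs @ user_vec

# Top-K candidates handed to the ranking stage.
K = 100
top_k = np.argsort(scores)[-K:][::-1]
print(top_k.shape)  # (100,)
```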
Read those numbers next to what the search community has been debating recently. Late interaction’s 180 vectors per document. TurboQuant’s 3.5-bits-per-channel quantisation bounds. Hadamard rotations, anisotropic loss, MUVERA. All real, all interesting, all worth the column inches they’ve received.
And the highest-traffic feed in the world ships plain dot product on 256-dim vectors.
There’s something humbling in that. At Twitter-scale traffic, the operational cost of every byte and every microsecond is paid out of a budget that academic benchmarks don’t price in. 256 dimensions is the operating point that makes retrieval economically viable when you’re serving billions of feed requests a day. Late interaction at ~180 vectors per document is a non-starter at that scale - you’d be storing trillions of vectors, not billions, and your re-ranker would never see the candidate set in time.
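Rough back-of-envelope arithmetic makes the gap concrete. The corpus size and float16 storage below are my assumptions for illustration, not figures from the release:

```python
# Back-of-envelope storage for the two retrieval styles.
# ASSUMPTIONS: corpus size and float16 channels are illustrative,
# not numbers from the xAI release.
DIM = 256
BYTES_PER_CHANNEL = 2             # float16
POSTS = 2_000_000_000             # assume a few billion candidate posts

single_vector_bytes = POSTS * DIM * BYTES_PER_CHANNEL
late_interaction_bytes = POSTS * 180 * DIM * BYTES_PER_CHANNEL  # ~180 vectors/doc

print(f"vectors stored:   {POSTS:,} vs {POSTS * 180:,}")
print(f"single vector:    {single_vector_bytes / 1e12:.1f} TB")
print(f"late interaction: {late_interaction_bytes / 1e12:.1f} TB")
```

The multiplier is the whole story: every per-document cost in the index gets paid 180 times over, before you've improved ranking quality at all.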
The lesson isn’t “fancy retrieval is wrong.” It’s that the operating points are wildly different. The papers I’ve been reading optimise for quality at fixed cost. The production system optimises for cost at acceptable quality. Both are valid, and the gap between them tells you something about which problem your team is actually solving.
2. The retrieve/rank split survived the bitter lesson
Here’s the line from the README that stopped me:
“We have eliminated every single hand-engineered feature and most heuristics from the system. The Grok-based transformer does all the heavy lifting by understanding your engagement history.”
That’s the bitter lesson made manifest in a production system. Two decades of recsys research built on careful feature engineering - affinity scores, decay functions, social graph signals, content embeddings - and the latest iteration of one of the world’s biggest feeds has thrown them all in the bin in favour of letting a transformer eat the raw engagement history.
The pipeline still has a separate retrieval stage (two-tower, dot product, top-K candidates) before the ranking stage (Phoenix transformer predicting P(like), P(reply), P(repost) and so on). Features are gone. The retrieve/rank split is not.
This is the pattern I keep drawing lately. A cheap, broad, approximate filter in front of an expensive, narrow, exact scorer. The cheap thing is allowed to be wrong - it just has to be cheap enough and recall-y enough that the expensive thing only gets the candidates worth examining. xAI’s system is the same shape as a classical search pipeline. Bi-encoder for retrieval, transformer for ranking. Inverted index for retrieval, learned ranker for ranking. Two-tower for retrieval, cross-encoder for ranking. All variations on the same separation-of-concerns instinct.
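The shape fits in a screenful of numpy. Both models are random stand-ins, and the three engagement heads with their combination weights are invented for illustration - the point is only the division of labour between the cheap stage and the expensive one:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, CORPUS, K = 256, 50_000, 200

user = rng.standard_normal(DIM)
posts = rng.standard_normal((CORPUS, DIM))

def retrieve(user_vec, post_vecs, k):
    """Cheap, broad, approximate: one dot product per post.
    Allowed to be wrong, as long as recall is good enough."""
    scores = post_vecs @ user_vec
    return np.argpartition(scores, -k)[-k:]

def rank(candidate_ids, user_vec, post_vecs):
    """Expensive, narrow, 'exact': a toy stand-in for the ranking
    transformer predicting P(like), P(reply), P(repost) per candidate.
    Head weights here are invented, not from the release."""
    x = post_vecs[candidate_ids] @ user_vec                    # (k,)
    heads = 1 / (1 + np.exp(-np.outer(x, [1.0, 0.5, 0.2])))   # (k, 3)
    combined = heads @ np.array([1.0, 2.0, 1.5])               # weighted engagement score
    order = np.argsort(combined)[::-1]
    return candidate_ids[order]

feed = rank(retrieve(user, posts, K), user, posts)
print(len(feed))  # 200
```

The expensive scorer only ever sees K candidates, never the corpus - which is exactly why the cheap stage being merely "recall-y enough" is acceptable.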
You can replace the features with a transformer. You can replace the candidate set with embeddings. You can replace the ranker with a model that has no idea what a "like" is until it’s seen a few billion of them. You don’t replace the shape. The pattern, once again, goes all the way down.
3. Recsys and search are converging architecturally
Recsys and search have always been deeply linked. In search we started actively discussing this link way back when Doug Turnbull and John Berryman wrote "Relevant Search" (which still holds up today, and which argued that search is simply a recommender that accepts free-text input).
Look at the Phoenix pipeline as a series of stages:
Query Hydrators → Candidate Sources → Hydrators → Filters → Scorers → Selector → Post-Selection Filters
Now look at a modern enterprise search pipeline:
Query understanding → Candidate retrieval (lexical + dense) → Filters (ACL, freshness, locale) → Re-ranking (cross-encoder or learned ranker) → Top-K → Final filters (safety, dedup)
These are the same diagram with different node labels. The "query" in the xAI system isn’t a string you typed - it’s your hydrated engagement history, attached to a request - but everything downstream of that handover is structurally identical to a search system. You retrieve candidates, you enrich them, you remove the ones you can’t show, you score what’s left, you pick the best, you double-check the picks.
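The shared staged shape is easy to express as code. Everything below is a toy - the stage names follow the two lists above, but the bodies are invented placeholders:

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    query: dict                 # a string for search, engagement history for recsys
    candidates: list = field(default_factory=list)

def hydrate(req: Request) -> Request:
    # Query understanding / query hydration: enrich the request.
    req.query["enriched"] = True
    return req

def retrieve(req: Request) -> Request:
    # Candidate sources: cheap and broad. Toy candidates stand in
    # for the output of a two-tower or lexical retriever.
    req.candidates = [{"id": i, "safe": i % 3 != 0} for i in range(10)]
    return req

def filter_stage(req: Request) -> Request:
    # Filters: remove what you can't show (ACLs, safety, dedup...).
    req.candidates = [c for c in req.candidates if c["safe"]]
    return req

def score(req: Request) -> Request:
    # Scorers: the expensive model goes here; a toy score stands in.
    for c in req.candidates:
        c["score"] = c["id"] * 0.1
    return req

def select(req: Request) -> Request:
    # Selector: keep the top few for the feed / results page.
    req.candidates = sorted(req.candidates, key=lambda c: -c["score"])[:3]
    return req

PIPELINE = [hydrate, retrieve, filter_stage, score, select]

req = Request(query={"history": ["liked:123"]})
for stage in PIPELINE:
    req = stage(req)
print([c["id"] for c in req.candidates])  # [8, 7, 5]
```

Swap the `query` dict for a text string and the retriever for an inverted index, and nothing else in the loop changes - which is the convergence argument in one line of code.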
The two disciplines used to be quite separate. Search people thought about TF-IDF, query parsers, relevance tuning. Recsys people thought about collaborative filtering, matrix factorisation, click-through curves. Twenty years on, the engineering scaffolding underneath both is the same scaffolding. Retrieve a candidate set with something cheap; score it with something expensive; filter aggressively at every stage. The fact that one starts from a string and the other from a behavioural fingerprint is a difference at the input boundary, not a difference of architecture.
I’ve had this conversation a few times recently with engineers moving between search and recsys roles. The vocabulary differs. The shape doesn’t. If you’ve built one, you can build the other - the transferable skill isn’t "I know how to do search" or "I know how to do recsys", it’s "I know how to design a retrieve-then-rank system at scale and reason about where to spend the latency budget."
A closing thought
The release reads, to a search practitioner, less like a recsys novelty and more like a confirmation of patterns we already use. Two-tower is alive and well. Retrieve/rank survives the bitter lesson. Recsys and search are the same pipeline with different inputs.
The interesting question isn’t whether your stack should look like this. It probably already does, or it should. The interesting question is which boundary in that pipeline is the right place to spend your team’s next engineering quarter - better retrieval, better re-ranking, better filtering, better understanding of what the "query" actually is. xAI’s release reminds me that those decisions are independent. You can change one without changing the others. That’s the value of the pattern surviving: it gives you a stable scaffold to iterate against.
