A primer on late interaction

Search relevance has been quietly bifurcating for years. On one side, classical sparse retrieval - BM25, the lexical workhorse that still runs production at most companies that have ever bothered to measure. On the other, dense retrieval - bi-encoder models that embed queries and documents into a single shared vector space and compare them by cosine similarity. Each has obvious strengths and obvious weaknesses. Sparse retrieval is fast and transparent, but brittle to lexical mismatch. Dense retrieval is robust to paraphrase but compresses every passage to a single vector, which throws away most of the structure of the text.

There’s a third regime that doesn’t get talked about enough outside research circles, and I think it’s overdue for wider attention: late interaction.

Three ways to compare a query to a document

The cleanest framing I’ve seen poses it as a question of when the query meets the document:

  • No interaction (single-vector bi-encoders): encode each side independently into one vector, then compare. Pre-computation is trivial; you can store a billion document vectors and answer queries in milliseconds. The cost is information loss - a 768-dimensional pooled vector is a summary, and you can’t un-summarise.
  • Early interaction (cross-encoders): feed the query and document together into a transformer that does full cross-attention. The model sees everything; the matching can be arbitrarily nuanced. The cost is that you have to run the model fresh for every (query, candidate) pair. That’s untenable beyond rerank-the-top-100 scenarios.
  • Late interaction: encode each side independently - like a bi-encoder - but keep per-token embeddings instead of a single pooled vector. At query time, compare them at the token level. You get most of the cross-encoder’s nuance while keeping the bi-encoder’s ability to pre-compute document representations offline (a sketch follows this list).
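
To make the offline/online split concrete, here’s a minimal sketch in Python. The toy encoder below just maps each whitespace token to a stable random unit vector - a stand-in for a real contextual model, not any library’s API - but it shows where the expensive work sits in each regime.

```python
import numpy as np

DIM = 128

def toy_token_encoder(text: str) -> np.ndarray:
    """Toy stand-in for a contextual encoder: one stable unit vector per whitespace token."""
    vecs = []
    for tok in text.lower().split():
        seed = int.from_bytes(tok.encode("utf-8"), "little") % (2**32)
        vecs.append(np.random.default_rng(seed).normal(size=DIM))
    mat = np.vstack(vecs)
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

corpus = ["late interaction keeps one embedding per token",
          "bm25 remains a strong lexical baseline"]

# Offline work (shared by bi-encoders and late interaction): encode every document once.
doc_pooled = [toy_token_encoder(d).mean(axis=0) for d in corpus]   # one vector per document
doc_tokens = [toy_token_encoder(d) for d in corpus]                # one matrix per document

query = "token level matching"
q_pooled = toy_token_encoder(query).mean(axis=0)
q_tokens = toy_token_encoder(query)

# No interaction: compare pooled vectors - cheap, but the pooling already discarded detail.
bi_scores = [float(q_pooled @ d) for d in doc_pooled]

# Late interaction: compare token matrices - still cheap at query time, detail preserved.
li_scores = [float((q_tokens @ d.T).max(axis=1).sum()) for d in doc_tokens]

# Early interaction has no offline/online split to exploit: the cross-encoder itself
# must run on every (query, document) pair at query time, so nothing useful is cached.
```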

The canonical implementation is ColBERT - “Contextualised Late Interaction over BERT” - published by Omar Khattab and Matei Zaharia at SIGIR 2020.

How ColBERT actually scores a document

The mechanics are surprisingly clean.

  1. Tokenise the query and the document. The query gets a [Q] marker token and is padded with [MASK] tokens to a fixed length (typically 32). The document gets a [D] marker and runs to its natural length (typically capped at ~180 tokens).
  2. Encode both sides through the same BERT model. Each token produces a 768-dimensional contextualised embedding. ColBERT projects those down to 128 dimensions to save space.
  3. Score using the MaxSim operator. For each query token, compute its similarity (dot product) with every document token, and keep only the maximum. Then sum those maxima across the query tokens. Formally:

Score(q, d) = Σᵢ maxⱼ qᵢ · dⱼ

That’s it. The whole scoring function in one line. The intuition is that each query token gets to vote for its best match in the document, and the final relevance is the sum of those votes. Tokens that don’t find a strong match contribute little. Tokens that do - the rare, content-bearing ones - dominate the score.
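
In code, the whole operator fits in a couple of lines. A minimal NumPy sketch, assuming the encoder has already produced L2-normalised per-token embeddings; the 32×128 and 180×128 shapes mirror the typical query and document sizes above, and the random inputs simply stand in for real encoder output.

```python
import numpy as np

def maxsim(q_emb: np.ndarray, d_emb: np.ndarray) -> float:
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum those maxima over the query tokens.

    q_emb: (num_query_tokens, dim) L2-normalised query token embeddings
    d_emb: (num_doc_tokens, dim)   L2-normalised document token embeddings
    """
    sims = q_emb @ d_emb.T                 # (|q|, |d|) token-to-token dot products
    return float(sims.max(axis=1).sum())   # max over j (doc tokens), sum over i (query tokens)

# Toy inputs in place of a real encoder's output.
rng = np.random.default_rng(0)
q = rng.normal(size=(32, 128));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(180, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```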

There’s something architecturally elegant about it. You preserve the full token-level signal until the very last step, and the matching itself is just a max and a sum - cheap operations that vectorise beautifully on modern hardware.

Why this matters for search

The headline number from the original paper still impresses me. At re-ranking depth k=10, ColBERT uses approximately 180x fewer FLOPs than a cross-encoder BERT ranker. At k=1000 it’s roughly 13,900x fewer. Despite that, ColBERT’s Recall@50 on MS MARCO exceeds BM25’s Recall@1000 - it recovers more relevant passages in 50 results than BM25 does in a thousand.

In production terms, the practical benefits compound:

  • Robustness to paraphrase, like a dense retriever. The model has learned semantic similarity from a giant corpus.
  • Token-level precision, like a sparse retriever. Important query tokens still have to find their match somewhere in the document - they can’t be diluted by averaging.
  • Better out-of-domain performance than single-vector models. A pooled vector is a learned summary tuned to the training distribution, and that summary is where single-vector models tend to break under domain shift; MaxSim has no pooling step to go stale.
  • Explainability. You can ask “which document tokens did each query token match against?” and get a meaningful answer. That’s something single-vector dense retrieval doesn’t give you.

The last point is underrated. Search teams spend an enormous amount of time debugging “why did this document score so highly?” With late interaction, the MaxSim alignment is a built-in explanation.

The catch: storage

There’s always a catch. With ColBERT it’s storage.

A single-vector encoder produces one 768-dimensional float vector per document - about 3 KB. ColBERT produces one 128-dimensional vector per token, so a full-length 180-token passage costs roughly 90 KB at float32 - thirty times more. Scale that to MS MARCO (8.8M passages) - where real passages average well under the cap - and the original ColBERT index weighs in at 154 GB.
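
The back-of-the-envelope arithmetic, for anyone who wants to check it. The per-document figures are the float32 worst case; the corpus-level average is inferred from the published index size, and the point about reduced-precision storage is my reading of those numbers rather than a figure quoted here.

```python
# Per-document storage at float32, worst case.
single_vector = 768 * 4          # 3,072 bytes  ≈ 3 KB for one pooled vector
colbert_doc   = 180 * 128 * 4    # 92,160 bytes ≈ 90 KB for a full-length passage (~30x more)

# The corpus-level figure implies a much smaller per-passage average, because real
# passages fall well short of the 180-token cap and the embeddings are stored
# below full float32 precision.
avg_bytes_per_passage = 154e9 / 8.8e6   # ≈ 17,500 bytes per passage
```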

ColBERTv2, published a year later by the same group, made this problem tractable. Two techniques:

  • Residual compression: cluster the token embeddings into a small set of centroids kept at full precision, then store each token as the ID of its nearest centroid plus the residual (vector − centroid) quantised to 1 or 2 bits per dimension (sketched after this list). The MS MARCO index drops from 154 GB to 16 GB at 1 bit, or 25 GB at 2 bits. That’s a 6–10x reduction with minimal accuracy loss.
  • Denoised supervision: train on cross-encoder distillation signals and mined hard negatives. The result is a model that’s not just smaller but materially more accurate than the original ColBERT.
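
Here’s the residual-compression sketch promised above - a toy version for illustration only. ColBERTv2’s actual scheme is more careful about how the residual buckets are chosen and packs the quantised bits tightly; this just shows the centroid-plus-coarse-residual structure.

```python
import numpy as np

def compress(tokens: np.ndarray, centroids: np.ndarray, bits: int = 1):
    """Toy residual compression: nearest-centroid ID plus a coarsely quantised residual.

    tokens:    (n, dim) token embeddings to store
    centroids: (k, dim) cluster centres learned offline (e.g. with k-means)
    """
    # Assign each token embedding to its nearest centroid.
    dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)
    residual = tokens - centroids[ids]
    # Quantise each residual value into 2**bits uniform buckets over the observed range.
    lo = float(residual.min())
    scale = max(float(residual.max()) - lo, 1e-9)
    levels = 2 ** bits
    codes = np.round((residual - lo) / scale * (levels - 1)).astype(np.uint8)
    return ids, codes, (lo, scale)

def decompress(ids, codes, bounds, centroids, bits: int = 1):
    """Approximate reconstruction: centroid plus dequantised residual."""
    lo, scale = bounds
    levels = 2 ** bits
    residual = codes.astype(np.float32) / (levels - 1) * scale + lo
    return centroids[ids] + residual
```

Packed, 128 dimensions at 1 bit is 16 bytes of residual per token plus a small centroid ID, versus hundreds of bytes of floats - which is where the headline index shrinkage comes from.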

PLAID followed in 2022 - an indexing engine built around ColBERTv2 that uses centroid pruning to skip most documents at retrieval time, bringing latency into the same ballpark as a bi-encoder.
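
A simplified sketch of the pruning idea, reusing the centroid setup from the compression sketch above. This is not PLAID’s actual multi-stage pipeline - which also ranks candidates with centroid-level approximations of MaxSim before decompressing anything - just the core trick of using the small centroid table to skip most documents.

```python
import numpy as np

def shortlist_by_centroids(q_emb, doc_centroid_ids, centroids, n_probe: int = 4):
    """Toy centroid pruning: keep only documents with at least one token assigned
    to a centroid that scores highly against some query token.

    q_emb:            (|q|, dim) query token embeddings
    doc_centroid_ids: list of 1-D int arrays - the centroid ID of each stored doc token
    centroids:        (k, dim) shared centroid table from the compressed index
    """
    # Score every query token against every centroid; k is small, so this is cheap.
    q_to_centroid = q_emb @ centroids.T                      # (|q|, k)
    # The n_probe best centroids for each query token form the probe set.
    top = np.argsort(-q_to_centroid, axis=1)[:, :n_probe]
    probed = set(top.ravel().tolist())
    # Only the surviving documents go on to full (decompressed) MaxSim scoring.
    return [i for i, ids in enumerate(doc_centroid_ids) if probed & set(ids.tolist())]
```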

The storage tax is real, but it’s gone from “absurd” to “tractable” in the space of two papers. For most production workloads where retrieval quality matters more than disk cost, that’s a trade-off worth making.

Beyond text

The frame is generalising. ColPali, released in 2024, applies late interaction to visual content - encoding document images with a vision-language model and matching query token embeddings against the resulting image-patch embeddings. The result is searchable PDFs without OCR, charts retrievable by description, slides findable by content. Same MaxSim, different encoder. The architectural pattern travels.

Why I keep coming back to it

Most teams I talk to are using single-vector dense retrieval today, often hybrid-stacked with BM25. That works. But there’s a quality ceiling baked into the pooling step - when you compress a passage to one vector, you’ve made a permanent decision about what to throw away. Late interaction defers that decision until the query is in front of you, and the resulting quality lift on out-of-domain queries is consistent enough that I’d reach for it for any retrieval problem where the corpus is reasonably stable and disk cost isn’t the binding constraint.

There are no silver bullets in search, and the trade-offs here are real: a storage tax that persists even after compression, and serving infrastructure that is heavier than a flat vector index. But as a tool in the retrieval toolkit, late interaction is undersold - and the gap between research and production deployment is narrowing fast.