SMART: late interaction without retraining

A new paper out of UW–Madison and Korea University this week - Your Embedding Model is SMARTer Than You Think - makes a claim I want to spend a few hundred words on, because it cleanly removes a barrier I'd treated as harder than it actually was.

The claim, said plainly: the per-token signal late interaction needs is already sitting inside your existing single-vector embedder. You can wire it in at inference for a small boost, or train a tiny adapter on top to unlock most of what a dedicated multi-vector model would give you.

The mechanism is small. Take an off-the-shelf single-vector embedder - the kind that produces one pooled vector per query and per document. Run a query through it as normal. But instead of throwing away the per-token hidden states from the final layer, keep them. Do the same for documents. At scoring time, compute a ColBERT-style MaxSim over those hidden states - for each query token, take the best similarity it finds among the document's tokens, then sum those maxima over the query - and add the result to the standard single-vector dot product:

Score(q, d) = single_vector(q, d) + MaxSim(hidden(q), hidden(d))
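
As a minimal sketch of that scoring rule: the function names maxsim and smart_score are mine, not the paper's, the tiny hand-written vectors stand in for real model outputs, and the plain unweighted sum simply mirrors the formula above (the released code may normalise or weight the two terms differently).

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """ColBERT-style MaxSim: for each query token, take its best match
    among the document tokens, then sum those maxima over the query."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def smart_score(q_pooled: Vector, d_pooled: Vector,
                q_tokens: List[Vector], d_tokens: List[Vector]) -> float:
    """Pooled dot product plus MaxSim over per-token hidden states.
    Unweighted sum, as in the formula above; any weighting is an assumption."""
    return dot(q_pooled, d_pooled) + maxsim(q_tokens, d_tokens)


# Hypothetical toy inputs: 2-d vectors instead of real hidden states.
q_pooled = [1.0, 0.0]
d_pooled = [0.6, 0.8]
q_tokens = [[1.0, 0.0], [0.0, 1.0]]
d_tokens = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]

score = smart_score(q_pooled, d_pooled, q_tokens, d_tokens)
print(round(score, 6))
```

In a real stack the token lists would be the final-layer hidden states you already compute on every forward pass, and the MaxSim term would typically be applied as a rerank over the top candidates from the pooled search.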

No retraining. No new parameters. No new model family. The authors call this SMART - Single-to-Multi Adaptation for Retrieval Transformers - and the inference-only version is a small drop-in for an existing retrieval stack; code and weights are on GitHub.

Why this matters as a continuation of the thread

When I wrote the primer on late interaction last year, the catch I dwelled on was storage. ColBERT-style scoring needs per-token vectors, which makes the index ~30x bigger than a single-vector one. ColBERTv2 and PLAID made that tractable; the TurboQuant and late-interaction follow-up pieces earlier this month were about whether modern quantisation pushes the storage tax low enough to deploy at scale.

What I underweighted in all of those is that late interaction had a second deployment barrier, the model swap, and the two have different shapes. Storage is an ongoing cost, proportional to your corpus. The model swap is a one-time engineering project. I'd been treating the ongoing cost as the binding constraint, but on reflection the one-time cost is a tougher sell for more teams than I'd given it credit for. To use late interaction at all, you needed a model trained for it: ColBERT, ColBERTv2, ColPali, jina-embeddings-v4. Different training objective, different checkpoint, different pipeline. If your team had already deployed a single-vector embedder and tuned a corpus around it, adopting late interaction meant adopting a whole new model family and re-encoding everything.

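SMART removes the model-swap part of that barrier. The model you have already works. You just have to read out of it differently, which does mean one more encoding pass over the corpus to keep the per-token states you were previously discarding.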

The insight underneath

The bit I find most satisfying isn't the trick - it's the reason the authors give for why it works. Contrastive training of a single-vector embedder, they argue, also organises the underlying token-level hidden states, through gradient flow. The pooling operation isn't a hard wall; backprop flows through it. When you train the model to put right answers near queries in the pooled embedding, you're implicitly teaching the underlying tokens to align too - because the pooled vector is just a function of them.
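
To make the gradient-flow point concrete, here is the simplest case I can write down, assuming mean pooling over n final-layer token states h_1, ..., h_n and a loss L defined only on the pooled vector p. Mean pooling is my illustrative assumption; the models in the paper may pool differently (from a last token, say), in which case the gradient reaches the other tokens through attention instead.

```latex
p = \frac{1}{n}\sum_{i=1}^{n} h_i
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial h_i}
  = \left(\frac{\partial p}{\partial h_i}\right)^{\!\top}
    \frac{\partial \mathcal{L}}{\partial p}
  = \frac{1}{n}\,\frac{\partial \mathcal{L}}{\partial p},
\qquad i = 1, \dots, n
```

Every token state gets a non-zero share of the contrastive gradient, so the layers that produce each h_i are being updated by a loss that never mentions tokens directly.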

It's a "the information was always there" result. The per-token hidden states have been computed on every forward pass, used to produce the pooled vector, and then discarded. SMART says: they're not noise. They've been trained - indirectly, via the pooled loss - enough to be useful when you tap them.

That's an architectural observation as much as an algorithmic one, and it generalises the way good observations do: the obvious interface of a model isn't always the most useful one, and the intermediate state of a system trained for X often carries enough signal for an adjacent X′ if you bother to look.

The numbers, with the right caveats

Headline gains from inference-only SMART on the MMEB-V2 benchmark:

  • VLM2Vec-V2.0: +2.54% average
  • Qwen3-VL-Embedding-2B: +0.90%
  • Qwen3-VL-Embedding-8B (the SoTA model): +0.51%

Modest, and clearly diminishing on the strongest model. Worth being honest: this isn't a quality revolution at the top of the leaderboard. It's free retrieval quality on top of what you already have, with the most headroom for teams using smaller or off-the-shelf embedders.

The lightweight-adapter results are more striking. With ~2 hours of training on a single GPU node, Qwen3-VL-Embedding-2B + a SMART adapter scores 81.25 on visual-document retrieval - edging out jina-embeddings-v4, an actual dedicated multi-vector model, at 80.91. With a fraction of the compute of training a new multi-vector model, you can match or beat one using a single-vector base.

And there's a toy benchmark worth noting, with a caveat baked in. On a controlled "code-marker binding" task, designed specifically to need local token-level matching rather than pooled gist, the pooled single-vector score gets 31.9% and MaxSim over the same model's hidden states gets 56.8%. The local information was there, and the pooling was throwing it away - though predictably, this is exactly the kind of task late interaction is built to win, so treat it as supporting evidence rather than a general claim.
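
To show the kind of failure that task is probing, here is a hand-built toy of my own. It is not the paper's code-marker benchmark, and every vector in it is made up: one document contains an exact marker token buried among filler, the other contains only tokens that vaguely resemble the marker. Mean pooling (an assumption, chosen for simplicity) dilutes the exact match; MaxSim does not.

```python
from typing import List

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens: List[Vec]) -> Vec:
    """Average token vectors dimension-wise into one pooled vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def maxsim(q_tokens: List[Vec], d_tokens: List[Vec]) -> float:
    """Sum over query tokens of the best dot product among document tokens."""
    return sum(max(dot(q, d) for d in d_tokens) for q in q_tokens)


# Made-up 3-d "hidden states"; all names and values are hypothetical.
marker = [1.0, 0.0, 0.0]      # the token the query is looking for
filler = [0.0, 1.0, 0.0]      # unrelated content
lookalike = [0.6, 0.0, 0.8]   # partially similar to the marker

query = [marker]
docs = {
    "with_marker": [marker] + [filler] * 4,  # exact match, heavily diluted
    "lookalike": [lookalike] * 5,            # no exact match anywhere
}

q_pooled = mean_pool(query)
pooled_scores = {name: dot(q_pooled, mean_pool(toks)) for name, toks in docs.items()}
maxsim_scores = {name: maxsim(query, toks) for name, toks in docs.items()}

pooled_winner = max(pooled_scores, key=pooled_scores.get)
maxsim_winner = max(maxsim_scores, key=maxsim_scores.get)
print(pooled_winner, maxsim_winner)  # prints: lookalike with_marker
```

The pooled score prefers the document that is uniformly "sort of similar"; MaxSim finds the one token that matches exactly. Real hidden states are contextual and far messier than this, so read it as intuition for the mechanism, not as a reproduction of the reported numbers.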

A few other caveats the paper itself flags. The authors' deepest experiments are on visual-document retrieval; generalisation to text-only retrieval is acknowledged but not deeply tested. The method is for dense retrieval - "not beneficial for global tasks like classification". I'd want to see SMART on standard BEIR-style text benchmarks before treating the inference-only gains as a general claim.

What this doesn't fix

Storage. Late interaction's storage tax doesn't go away with SMART - you still have to store every document's per-token hidden states and have them available at scoring time. The deployment-friction half of the problem (which I now think was the bigger half) is solved; the index-size half is unchanged. Pair SMART with one of the modern quantisation stacks - BBQ, OSQ, TurboQuant, RaBitQ - and you've got a credible production architecture.
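
A back-of-envelope sketch of why that pairing matters. Every number here is a hypothetical I picked for round arithmetic (corpus size, hidden width, tokens per document), I'm assuming the hidden states are stored at full model width, and "1 bit per dimension" is a stand-in for binary-style quantisation in general: real schemes such as the ones named above keep extra correction terms that this ignores.

```python
def index_bytes(num_docs: int, vectors_per_doc: int, dim: int, bits_per_dim: int) -> int:
    """Raw vector payload in bytes, ignoring index structures and metadata."""
    total_bits = num_docs * vectors_per_doc * dim * bits_per_dim
    return total_bits // 8


# Hypothetical corpus and model shape, not taken from the paper.
NUM_DOCS = 1_000_000
DIM = 1024             # assumed hidden width, stored unprojected
TOKENS_PER_DOC = 128   # assumed average tokens kept per document

single_fp32 = index_bytes(NUM_DOCS, 1, DIM, 32)               # one pooled vector per doc
multi_fp32 = index_bytes(NUM_DOCS, TOKENS_PER_DOC, DIM, 32)   # per-token, float32
multi_1bit = index_bytes(NUM_DOCS, TOKENS_PER_DOC, DIM, 1)    # per-token, 1 bit per dim

GB = 1_000_000_000
print(f"single-vector fp32: {single_fp32 / GB:.1f} GB")
print(f"per-token fp32:     {multi_fp32 / GB:.1f} GB ({multi_fp32 // single_fp32}x)")
print(f"per-token 1-bit:    {multi_1bit / GB:.1f} GB ({multi_1bit // single_fp32}x)")
```

Under these assumptions the unquantised per-token index is 128x the single-vector one, and 1-bit quantisation pulls that back to 4x. The multiplier is larger than the ~30x I quoted for ColBERT because ColBERT-style models project tokens down to a much smaller dimension first; whether you'd do the same with SMART's hidden states is a design choice this sketch leaves open.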

The pattern, again

I keep finding the same shape in the things I write about. DiskBBQ's restrictive-filter optimisation: a small cheap structure tells the search where not to look. Elasticsearch 9.4's vector lookup: don't ship the data out, the cluster can fetch its own. SID-1: train the retrieval loop end-to-end rather than stitching it together by hand. Each time the move is the same - stop accepting the obvious interface; look at what's actually there.

SMART is that instinct applied one layer deeper, to the model architecture itself. The pooled embedding isn't the only thing your embedder produces. It's just the only thing we've been treating as the output.

The primer closed on "the gap between research and production deployment is narrowing fast." A year on, that gap just got considerably smaller. If you've been intrigued by late interaction but the model-swap was the blocker, that blocker is gone - you already have the model. The storage and serving questions still need answering, but the bar to experimenting just dropped a lot.