The pattern goes all the way down
Last week I wrote about how agentic search looks like a complete break from classical IR, but the architectural instincts underneath are old: cheap specialist in front of expensive generalist, planner in front of frontier model, caches in front of databases. Same separation-of-concerns moves we’ve made for decades, just with new kinds of components.
Earlier today, Ben Trent (who is a total G in the search world) posted "Faster restrictive filters in DiskBBQ", and it made me realise the pattern doesn’t stop at the system boundary. It goes all the way down.
DiskBBQ is Elasticsearch’s partition-based vector index - vectors are clustered around centroids, and a query first finds nearby centroids, then scans vectors within them. It works well for unfiltered ANN queries. But the moment you add a restrictive filter - show me semantically similar videos, but only ones I have permission to view, in English, uploaded this year - most of the work the index does is wasted. You score centroids that contain no matching documents. You load posting lists that get filtered down to almost nothing. You decode document IDs you’re going to throw away.
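To make that wasted work concrete, here’s a minimal sketch of the naive path - illustrative Python with invented names (filtered_ann_naive, posting_lists, allowed_ids), not Elasticsearch’s actual code:

```python
import numpy as np

def filtered_ann_naive(query, centroids, posting_lists, vectors, allowed_ids, k=10):
    """Partition-based ANN with a post-hoc filter: score every nearby
    centroid, decode every posting list, then throw most of it away."""
    # 1. Score ALL centroids, even ones whose clusters hold no matching docs.
    centroid_scores = centroids @ query
    nearest = np.argsort(-centroid_scores)[:32]   # visit the closest clusters

    candidates = []
    for c in nearest:
        # 2. Decode the whole posting list for this cluster...
        for doc_id in posting_lists[c]:
            # 3. ...only to discover most doc_ids fail the filter.
            if doc_id in allowed_ids:
                candidates.append((float(vectors[doc_id] @ query), doc_id))

    # 4. With a hyper-restrictive filter, almost everything above was wasted.
    return sorted(candidates, reverse=True)[:k]
```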
The enhancement Ben describes is delightfully obvious in hindsight: maintain a small doc_id → centroid_ord mapping. When the filter is restrictive enough, intersect it with that mapping first, find the actual centroids that contain matching documents, and only score those. They’ve tuned a threshold - an average of ~1.25 matching docs per cluster - for when to switch into this “eager” mode. The payoff is roughly an order-of-magnitude latency improvement on hyper-restrictive filters - around 3.6ms in the nightly benchmarks.
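A sketch of the eager path under the same illustrative assumptions (my own names and simplifications, not the DiskBBQ implementation): intersect the filter with the mapping first, and only ever touch clusters that can return a match.

```python
def filtered_ann_eager(query, posting_lists, vectors, doc_to_centroid,
                       allowed_ids, k=10):
    """Eager mode: the doc_id -> centroid_ord mapping tells us exactly
    which clusters contain matching documents, so only those get visited."""
    # 1. Near-free intersection of the filter with the mapping.
    live_clusters = {doc_to_centroid[doc_id] for doc_id in allowed_ids}
    # (a real index would still score these few centroids to order the visit)

    candidates = []
    for c in live_clusters:
        # 2. Only posting lists with at least one matching doc get decoded.
        for doc_id in posting_lists[c]:
            if doc_id in allowed_ids:
                candidates.append((float(vectors[doc_id] @ query), doc_id))

    return sorted(candidates, reverse=True)[:k]
```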
I read the post for the engineering. I want to write about it from the lens of the pattern it represents, and something I’ve been thinking about a lot this week...
A small, cheap structure in front of an expensive scan
That’s all the new mapping is. It’s a secondary index. It doesn’t do the search - the vector index still does the search. It tells the search where not to look. The expensive thing (centroid scoring, posting-list decode, ANN comparison) only runs against the candidates that survive a near-free intersection.
That’s the same move I described last week. Waldo is a small planner in front of a frontier model. A cache is a small in-memory store in front of a database. A queue worker is a deferred consumer in front of a synchronous handler. A bloom filter is a probabilistic structure in front of a disk read. A secondary index is a sorted side-table in front of a heap scan.
Every one of these is the same architectural sentence: put something cheap that knows when not to call the expensive thing.
I appreciate this is an obvious statement and not at all revolutionary thinking - but as engineers we often over-engineer solutions even when these simple truths hold.
It’s fractal
This is what makes the lens useful. It’s not a single insight about one layer of the stack - it’s a pattern that shows up at every layer, and once you see it, you start spotting where the next one should go.
- At the system level: planner-in-front-of-frontier-model. (Waldo.)
- At the index level: secondary-index-in-front-of-vector-scan. (DiskBBQ’s new mapping.)
- At the query level: filter pushdown - apply cheap predicates first so the expensive predicate sees fewer rows. (Query planners since the 1980s.)
- At the data level: quantisation - score against a one-byte approximation, fall back to full precision only for the candidates that look promising. (BBQ, PQ, OPQ, every modern vector index.)
- At the I/O level: the page cache - RAM in front of disk. (Operating systems since the 1970s.)
Five layers, same instinct. The cheap thing is different at each layer - sometimes it’s a model, sometimes a side-table, sometimes a one-byte approximation - but the shape is identical. Put something quick and good-enough in front of something slow and expensive, and find the threshold where the cost of asking the cheap thing is less than the savings of not asking the expensive one.
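Take the quantisation layer as a concrete example. A hedged sketch (illustrative Python; the byte encoding here is a toy, not BBQ’s actual scheme): score everything against a cheap one-byte approximation, then pay full precision only for the shortlist.

```python
import numpy as np

def two_phase_search(query, vectors_f32, vectors_u8, scale, k=10, rerank_factor=5):
    """Cheap pass over byte-quantised vectors, expensive pass over the
    handful of candidates that looked promising."""
    # Phase 1: approximate scores against the 1-byte representation.
    approx = (vectors_u8.astype(np.float32) * scale) @ query
    shortlist = np.argsort(-approx)[: k * rerank_factor]

    # Phase 2: full-precision scores, but only for the shortlist.
    exact = vectors_f32[shortlist] @ query
    order = np.argsort(-exact)[:k]
    return shortlist[order], exact[order]
```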
Why 1.25?
I also enjoy the “actually-not-arbitrary” threshold... ~1.25 matching docs per cluster. Below it, do the eager intersection. Above it, fall back to the standard search path. It came from benchmarks, but might change 🤷‍♂️
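In code the decision is tiny - this is my paraphrase of the shape of it, not the actual heuristic:

```python
EAGER_THRESHOLD = 1.25  # avg matching docs per cluster, per the post

def choose_path(num_matching_docs, num_clusters):
    """Restrictive enough that the eager intersection pays for itself?
    If so, go eager; otherwise take the standard search path."""
    avg_matches_per_cluster = num_matching_docs / num_clusters
    return "eager" if avg_matches_per_cluster <= EAGER_THRESHOLD else "standard"
```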
For a nerd like me, this is exactly the conversation relational databases have been having since the 1980s - query planners weighing the cost of an index scan against a sequential scan based on estimated selectivity. The selectivity number is a magic number too, derived from histograms, sampled statistics, and a lot of “we tried it and this worked best”. Vector search is rediscovering the same calculus, with the same kinds of heuristics, in the same kinds of places.
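The shape of that trade-off, in toy form (not any real planner’s cost model, just the calculus it encodes):

```python
def pick_scan(selectivity, total_rows, cost_per_index_lookup=4.0, cost_per_seq_row=1.0):
    """Classic planner trade-off: an index scan touches few rows but pays
    more per row; a sequential scan touches every row cheaply."""
    index_cost = selectivity * total_rows * cost_per_index_lookup
    seq_cost = total_rows * cost_per_seq_row
    return "index scan" if index_cost < seq_cost else "seq scan"
```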
This is the most reassuring thing about working on AI search systems right now: the planning problems aren’t new. They’re constantly being restated at every layer of the stack. We have decades of work on cost-based optimisation, cardinality estimation, adaptive replanning - and we’re going to need all of it again: for vectors, for tool selection in agentic loops, for routing between cheap and expensive models.
What this means for design
If you’re building or operating a search stack right now, the takeaway isn’t even “use DiskBBQ” or “wait for ES 9.4” (but you should TOTALLY consider this). It’s simpler than that: when you look at any part of your system, ask where the cheap thing is, where the expensive thing is, and what separates them.
If those two things sit at the same layer - if your hot path is doing both the trivial filtering and the expensive scoring in one breath - you have an optimisation opportunity. The pattern says: pull them apart. Put a small, cheap structure between them. Measure, and find the threshold.
The pattern goes all the way down because the principle is general: information is cheap, computation is expensive, and the design problem is figuring out the smallest amount of information that lets you skip the most computation.
Everything has changed in search. Nothing has changed in search. The new components are doing what the old ones did, at every level, all the way down.
Perhaps I’m just viewing everything through my own little lens of recent thought, but as retrieval systems continue to ship huge advancements with every release, my engineering brain gets so excited: there is so much left to discover and so much more to optimise. I love waking up each day and living in this reality.
