SID-1: Train the loop, keep the index

SID-1 is a Qwen3-14B fine-tuned with reinforcement learning for one job - agentic retrieval. Given a question and a set of search tools, return the documents needed to answer it, ordered by relevance. On the SID team’s custom benchmark, SID-1 (4x) hits 0.84 recall against 0.78 for GPT-5.1 (high) and 0.64 for Sonnet 4.5, and it is ~24x faster than GPT-5.1 (high) and ~374x cheaper than Sonnet 4.5 when self-hosted. The headline numbers are striking. The reasoning behind the choices is what I want to write about.

1. The middle of the pipeline collapses

A typical production retrieval stack today is a chain of hand-designed pieces. An embedder. A reranker. Often a query classifier or rewriter. Maybe a sparse component for hybrid scoring. Each piece tuned by hand, each one with a knob someone owns, each one replaceable independently. This is the world I described in "Everything has changed in search, nothing has changed in search" - the separation-of-concerns scaffolding we’ve built around the expensive thing.

SID-1’s first move is to throw out the middle of that chain and ask a different question: what if you just trained a model to do the whole loop? Not the embedding alone, not the reranker alone, but the iterative process - search, read excerpts, reason about what you’ve found, decide whether to search again, eventually report a ranked list.
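To make the shape of that loop concrete, here is a minimal sketch in Python. It is not SID-1's implementation: the `Step` type, `retrieval_loop`, the toy corpus, and the rule-based `toy_policy` are all hypothetical stand-ins. In the real system the policy would be the trained model and the search function a real backend. The only thing the sketch is meant to show is the control flow: search, read, decide whether to search again, report a ranked list.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Hit = Tuple[str, str]  # (doc_id, excerpt)


@dataclass
class Step:
    """What the policy decides after reading everything seen so far."""
    next_query: Optional[str]   # None means "stop searching and report"
    ranked_doc_ids: List[str]   # current best ranking, most relevant first


def retrieval_loop(
    question: str,
    search: Callable[[str], List[Hit]],
    policy: Callable[[str, List[Hit]], Step],
    max_turns: int = 8,
) -> List[str]:
    """Search, read, decide, repeat; return the final ranked doc ids."""
    seen: List[Hit] = []
    step = policy(question, seen)
    for _ in range(max_turns):
        if step.next_query is None:
            break
        seen.extend(search(step.next_query))
        step = policy(question, seen)
    return step.ranked_doc_ids


# Toy stand-ins (assumptions for illustration only).
CORPUS = {
    "d1": "American Beauty is a 1999 film directed by Sam Mendes.",
    "d2": "Sam Mendes also directed the film Skyfall.",
    "d3": "Skyfall stars Daniel Craig as James Bond.",
}


def toy_search(query: str) -> List[Hit]:
    """Case-insensitive substring match standing in for a real index."""
    q = query.lower()
    return [(doc_id, text) for doc_id, text in CORPUS.items() if q in text.lower()]


def toy_policy(question: str, seen: List[Hit]) -> Step:
    """Hand-written rules standing in for the trained model."""
    ranked = list(dict.fromkeys(doc_id for doc_id, _ in seen))  # dedupe, keep order
    if not seen:
        return Step("American Beauty", ranked)
    if len(ranked) < 2:
        return Step("Sam Mendes", ranked)
    return Step(None, ranked)


result = retrieval_loop(
    "Which director links American Beauty and Skyfall?", toy_search, toy_policy
)
print(result)
```

The point of the design is that everything inside `toy_policy` (query choice, stopping rule, ranking) is what gets learned, while `toy_search` stays an ordinary index call.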

The standard embed-plus-rerank pipeline on their benchmark gets 0.45 recall. SID-1 (4x) gets 0.84. The team is direct about why: "replacing human design with compute is fundamentally what drives SID-1 to outperform these mechanisms."

It’s worth being precise about what "human design" means in that sentence. The thing that went away is the intermediate engineering - the reranker, the query rewriter, the glue between them. The task itself, the reward shape, the available search tools, and the choice of a 14B base model are all hand-designed by the SID team. So this isn’t quite Sutton’s bitter lesson (generality scaled with compute eating domain priors). It’s the narrower move I’ve been writing about: middle-of-the-pipeline hand engineering collapses into one trained component, and the architectural boundaries on either side of it stay put.

The xAI piece had the same shape at the recsys layer: engineering that lived within a stage went away. Here it’s engineering that lived between stages. Either way, the retrieve/rank split survives. SID-1 still calls a vector search backend; the model is the planner-and-ranker, the index is the index. That boundary holds, exactly as it did for Waldo, exactly as it did for the xAI pipeline. The pattern goes all the way down, and the boundary between storing information and computing over information is too economically sharp to dissolve.

One thing worth pushing on before you take the 0.45 → 0.84 number at face value: the 0.45 is from an off-the-shelf chain (Qwen3-Embedding-0.6B plus MXBAI Rerank Large v2), not a pipeline a team had spent months tuning for their specific corpora. A well-loved production stack would close some of that gap. But the direction of the result - that the trained-loop approach has more headroom than the stitched-chain approach - is the thing I’d bet on, even if the magnitude moves.

2. NDCG-as-reward, then deliberately bent

The reward section is the part of the paper I keep coming back to. Of all the metrics they could have used - recall, precision, F1, answer correctness judged by an LLM - they picked NDCG with binary relevance, and they wrote a paragraph explaining why.

Their argument is the IR field’s argument, almost word for word. Recall is hackable: the recall-optimal policy is to report every document you ever saw. Precision is hackable the other way: report only your single most confident guess. F1 balances the two but is indifferent to ordering. NDCG cares about which documents you found and where you placed them in the ranking, and it’s been the standard ranking metric in IR for two decades.
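The argument is easy to check on a toy case. The sketch below uses the textbook definitions of recall, precision, F1, and binary-relevance NDCG (untruncated, log2 discount); it is not the team's reward code, and the document ids are made up. It assumes each reported list contains unique ids.

```python
import math
from typing import List, Set


def recall(reported: List[str], relevant: Set[str]) -> float:
    return len(set(reported) & relevant) / len(relevant)


def precision(reported: List[str], relevant: Set[str]) -> float:
    return len(set(reported) & relevant) / len(reported) if reported else 0.0


def f1(reported: List[str], relevant: Set[str]) -> float:
    p, r = precision(reported, relevant), recall(reported, relevant)
    return 2 * p * r / (p + r) if p + r else 0.0


def ndcg(reported: List[str], relevant: Set[str]) -> float:
    """Binary-relevance NDCG with no rank cutoff."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(reported) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(len(relevant)))
    return dcg / ideal


relevant = {"a", "b"}

dump_everything = ["x", "y", "z", "a", "b"]   # report every document seen
single_guess = ["a"]                          # report one confident guess
good_order = ["a", "b", "x"]                  # relevant documents first
bad_order = ["x", "a", "b"]                   # same set, worse order

print("recall of dump-everything:", recall(dump_everything, relevant))   # 1.0
print("precision of single guess:", precision(single_guess, relevant))   # 1.0
print("F1 good vs bad order:", f1(good_order, relevant), f1(bad_order, relevant))
print("NDCG good vs bad order:", ndcg(good_order, relevant), ndcg(bad_order, relevant))
```

Recall is maxed out by the dump, precision by the single guess, F1 cannot tell the two orderings apart, and NDCG is the only one of the four that separates them.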

Then they bend it. The team is explicit that they train the model to slightly overreport on the grounds that omission costs a downstream system more than a few spurious documents do. You can see the result in the precision column of their headline chart: SID-1 (1x) precision is 0.16, SID-1 (4x) drops to 0.10. The frontier agentic baselines sit much higher - Sonnet 4.5 at 0.35, GPT-5.1 (high) at 0.54. NDCG-with-binary-relevance gets them ranking-aware partial credit; the recall bias gets layered on top. The reward isn’t just NDCG, and the resulting model isn’t a precision champion - it’s a recall champion that ranks the documents it finds.

I find this more interesting than a clean "they used NDCG and won" story. They reached for the right primitive from the IR literature, understood it well enough to know what it doesn’t optimise for, and then deliberately tilted the training signal toward the failure mode they preferred. That’s not a team rediscovering the field. That’s a team that knows the field and is making opinionated choices inside it.

When I wrote about the harness being mostly retrieval two days ago, the thing I most wanted applied-AI teams to internalise was that the IR field already had the apparatus - judged sets, graded relevance, NDCG, the whole methodology. SID-1’s reward function is that apparatus, picked up and dropped into an RL loop, by people who knew exactly why each choice was the right one - including which ones to break.

The data section continues the pattern. They distinguish between the target documents you happen to have (T_q) and the theoretical ideal target set (G_q), enumerate the ways your training dataset can be wrong, and demonstrate empirically that training on noisy public benchmarks - HotpotQA’s "Which actor does American Beauty and American Beauty have in common?" with the soundtrack listed alongside the film as a correct answer - causes the model to overreport documents indefinitely. This is noisy-judgement reasoning straight out of the IR evaluation literature, applied to RL training data instead of a relevance assessor pool, reaching the same conclusions the field already reached.
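One plausible way to see why a noisy target set pushes toward overreporting is to score the same two reports against both T_q and G_q. This is a toy illustration under my own assumptions, not the paper's experiment: the document ids are invented, and the reward is plain untruncated binary-relevance NDCG rather than the team's actual reward.

```python
import math
from typing import List, Set


def ndcg(reported: List[str], targets: Set[str]) -> float:
    """Binary-relevance NDCG with no rank cutoff (assumed reward for this sketch)."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(reported) if d in targets)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(len(targets)))
    return dcg / ideal


G_q = {"film"}                 # ideal target set: only the film answers the question
T_q = {"film", "soundtrack"}   # labelled target set: includes a spurious document

careful = ["film"]                   # reports exactly the ideal set
overreport = ["film", "soundtrack"]  # also reports the spurious document

print("reward vs T_q:", ndcg(careful, T_q), ndcg(overreport, T_q))
print("reward vs G_q:", ndcg(careful, G_q), ndcg(overreport, G_q))
```

Against the noisy labels the careful report is strictly penalised, while against the ideal labels the two tie. A policy trained on T_q therefore only ever sees a reason to report more.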

This is what the manual looks like when it’s been read.

3. The harness, collapsed into one trained model

Last week I argued the harness is mostly retrieval - that tool selection, context engineering, guardrails, and verification are all retrieval-shaped tasks dressed in new vocabulary. SID-1 is the next step on the same line: the agentic-retrieval loop - decompose the query, pick a search call, read the results, decide whether to continue - collapsed into one set of trained weights.

The bit I most want to draw attention to is composability. SID-1 returns documents, not answers. The team is explicit that this is the point: decouple retrieval competence from synthesis so the model slots into a larger system as a subagent. This is Waldo with sharper teeth. Cheap specialist in front of expensive generalist, the pattern from a couple of weeks back, pushed one layer in - the cheap specialist is no longer a small reasoning model handing context to GPT-5, it’s a fine-tuned 14B retrieval model handing documents to whatever you’re already using for synthesis.

The operating-point note from the xAI piece lands here too. SID-1 self-hosted on SF Compute is ~$0.0006 per question. Sonnet 4.5 doing the same job agentically is ~$0.41 per question. That’s ~374x. The right comparison isn’t "should I use SID-1 or GPT-Nano" - both are cheap. It’s "should the retrieval loop inside my product be a 14B specialist that beats every frontier model at its specific job, or an API call to a generalist that does it worse, slower, and orders of magnitude more expensively?"

Two pieces of small print on that ~374x, though. The first is from the paper itself: "SID-1 pricing will almost certainly be higher than this, given we would cease to exist if they were lower." $0.0006 is what it costs to run Qwen3-14B on SF Compute, not what SID AI will charge you. The second is the category mismatch - a hosted API call has zero ops burden, self-hosting a 14B model has a lot of it. The honest comparison against the Sonnet 4.5 API isn’t the SF Compute cost of inference; it’s whatever SID-1’s hosted offering ends up priced at. Until that exists, the cost story is directionally right but quantitatively soft.

The shape of the answer is still the same: a specialised model trained for the retrieval loop, used as a subagent, is a structurally different proposition to renting a frontier generalist by the token.

Caveats worth holding

This is one paper, from one team, evaluated on their own benchmark, with roughly half the questions custom-designed by the same people who built the model. The HotpotQA and SciFact numbers are saturated for everyone, so the differentiator is the bespoke 100-question set. I’d want to see the same evaluation harness run by an independent team before I treat 0.84 as gospel. Their honesty about the noisiness of public retrieval benchmarks is genuine and the public IR community needs to do better here - but that doesn’t mean their replacement set is calibrated correctly either.

I’d also note that SID-1 (1x) at 0.77 recall and GPT-5.1 (high) at 0.78 are - on this benchmark, at this moment - essentially tied. The 0.84 number requires four parallel rollouts whose ranked lists are merged with reciprocal rank fusion. Still impressive at the cost, but it isn’t "single 14B model beats GPT-5.1 outright at retrieval." It’s "four parallel rollouts of a 14B model, cheap enough to run in parallel, beat GPT-5.1 at retrieval." Different sentence, also interesting.
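For readers who haven't met it, reciprocal rank fusion merges several ranked lists by summing 1 / (k + rank) for each document across the lists. Below is a minimal sketch of the standard formulation. The constant k = 60 is the conventional default from the IR literature, not a value confirmed for SID-1, and the four example rollouts are invented.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Four hypothetical rollouts for the same question.
rollouts = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
    ["e", "b", "a"],
]

fused = reciprocal_rank_fusion(rollouts)
print(fused)
```

Note that the fused list is the union of everything any rollout reported, so it is longer than any single rollout's list. That is consistent with the precision drop from the 1x to the 4x configuration mentioned earlier, though the paper's exact fusion settings may differ.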

What I take from it

Three threads I’ve been pulling on recently keep meeting in the same place.

When middle-of-the-pipeline hand engineering goes, the architectural boundaries on either side of it stay put. SID-1 is that move arriving for retrieval pipelines: the embedder-plus-reranker chain is the obvious thing the trained model replaces; the retrieve/rank split survives.

The IR field has the methodology that applied-AI teams need, and the teams who notice this win quickly. SID-1 picked up NDCG, binary relevance, and judged-set noise analysis - and bent them where they thought the defaults didn’t fit. The result is a model that doesn’t make the common mistakes, by people who knew enough about the common mistakes to choose which ones to make on purpose.

And the harness has one more layer than it looked like. I described the harness as mostly retrieval - candidate generation, ranking, filtering, evaluation. SID-1 says the loop over those operations is itself a learnable thing now: not orchestration code in your application, but trained weights that someone hands you as a subagent.

If you’re building agentic search today, the question is shifting. It’s not "which embedder, which reranker, which prompt for query rewriting." It’s "should this entire stage of my product be a model whose only job is to do this stage?"

Increasingly, the answer is yes. And the thing handing it to you isn’t a frontier lab. It’s a retrieval lab. This fills me with joy ❤️
