The harness is mostly retrieval

Laurie Voss published "The end of fine-tuning", and it really resonated with me.

The argument, briefly: OpenAI shutting down self-serve fine-tuning didn’t kill anything - it revealed a split that had already happened. A tiny number of frontier companies still train continuously against production. Everyone else has quietly moved their iteration off the model weights entirely, into what we've all started calling the harness - the prompts, the tools, the retrieval logic, the guardrails, the verification gates. Agent = Model + Harness. You stop changing the model. You change everything around it.

I think that’s correct, and I think it matters. But one of those parts is much bigger than the list makes it look.

Voss lists retrieval as a component of the harness - one item alongside prompts, tools, guardrails and verification. One of five, give or take. I don’t think it’s one of five. I think the harness is mostly retrieval, and the other names are retrieval wearing different hats.

Anyone who has worked with me will already be rolling their eyes... I always say "Everything is search and search is everything", so let me apply this mantra here:

Tool selection - deciding which of your fifty connected tools to call for a given request - is routing. Routing is a retrieval problem: take a query, score a set of candidates, pick the top few. Teams are already building this with embeddings over tool descriptions, because that is, plainly, retrieval.
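To make the routing claim concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tool names and descriptions are made up, and the "embedding" is just a bag of word counts standing in for a real embedding model. The shape is the point: embed the query, score every candidate, keep the top few.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': lowercase word counts. A stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def route(query, tools, k=2):
    """Score every tool description against the query and return the top-k tool names."""
    q = embed(query)
    scored = [(cosine(q, embed(desc)), name) for name, desc in tools.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, name breaks ties
    return [name for score, name in scored[:k] if score > 0]

# Hypothetical tool registry, invented for this example.
TOOLS = {
    "search_flights": "find and book airline flights between two cities",
    "get_weather": "current weather forecast for a city",
    "send_email": "send an email message to a contact",
}

print(route("what is the weather forecast in Paris", TOOLS, k=1))  # ['get_weather']
```

Swap the word counts for real embeddings and the dictionary for fifty tools and nothing about the structure changes, which is rather the argument.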

Deciding what goes in the context window is the most retrieval-shaped task in the whole stack. You have more candidate material than fits. You score it for relevance, rank it, and truncate to a budget. That isn’t a new thing called "context engineering" - it’s a search engine’s results page with a token limit where the ten blue links used to be.
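The "results page with a token limit" can be sketched in a few lines. This is an illustration under stated assumptions, not anyone’s production code: the relevance scores are taken as given, tokens are counted by splitting on whitespace rather than with a real tokenizer, and the packing is a simple greedy pass.

```python
def pack_context(candidates, budget, count_tokens=lambda text: len(text.split())):
    """Rank (score, text) candidates by score and keep those that fit the token budget.

    count_tokens is an assumption: whitespace word count stands in for a real tokenizer.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    chosen, used = [], 0
    for score, text in ranked:
        cost = count_tokens(text)
        if used + cost > budget:
            continue  # too big for the remaining budget; a smaller, lower-ranked item may still fit
        chosen.append(text)
        used += cost
    return chosen

# Made-up candidates: (relevance score, passage text).
passages = [(0.9, "a b c"), (0.5, "d e f g"), (0.7, "h i")]
print(pack_context(passages, budget=5))  # ['a b c', 'h i']
```

Whether to skip an oversized item or stop at the first one that does not fit is a design choice; skipping uses more of the budget, stopping keeps the ranking strict.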

Guardrails - keeping out the inputs and outputs you don’t want - are filters. Every search system in the world runs filters: permissions, freshness, locale, safety. Same operation, same slot in the pipeline.
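As a rough sketch of "same operation, same slot", here is a guardrail written the way a search engineer would write a filter chain. The Doc fields, the group names and the blocked-term check are all hypothetical, and a real safety filter would be far more than a substring match.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    allowed_groups: frozenset  # who may see this document
    age_days: int              # how old it is

def permitted(user_groups):
    """Permission filter: keep docs shared with at least one of the user's groups."""
    return lambda d: bool(d.allowed_groups & user_groups)

def fresh(max_age_days):
    """Freshness filter: keep docs no older than max_age_days."""
    return lambda d: d.age_days <= max_age_days

def no_blocked_terms(blocked):
    """Crude safety filter: drop docs containing any blocked term (illustrative only)."""
    return lambda d: not any(term in d.text.lower() for term in blocked)

def apply_filters(docs, filters):
    """A doc survives only if every filter accepts it."""
    return [d for d in docs if all(f(d) for f in filters)]

docs = [
    Doc("Quarterly report", frozenset({"finance"}), 10),
    Doc("Old memo", frozenset({"finance"}), 400),
    Doc("HR secret salary list", frozenset({"hr"}), 5),
]
kept = apply_filters(docs, [permitted({"finance"}), fresh(90), no_blocked_terms(["secret"])])
print([d.text for d in kept])  # ['Quarterly report']
```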

Verification - judging whether an answer is actually good - is the one reframe I’ll qualify. Some of it is plain correctness: did the code compile, did the call return a 200. That part isn’t retrieval, and I won’t pretend it is. But the fuzzier half - is this answer relevant, is it good enough, did it actually meet the need - is relevance assessment, and deciding that at scale is the oldest problem information retrieval has. There are decades of method behind it: judged sets, graded relevance, nDCG, the whole apparatus.
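For anyone who has not met nDCG, here is a small sketch of it. It assumes graded relevance labels (say 0 to 3) already exist for each returned item in ranked order, uses the linear-gain form of DCG (an exponential-gain variant is also common), and computes the ideal ordering only from the items that were returned.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each grade is discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains, k=None):
    """DCG of the ranking as returned, divided by the DCG of the best possible ordering."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1]))  # 1.0: already in the ideal order
print(ndcg([1, 2, 3]))  # below 1.0: the best item was ranked last
```

The judged set is the expensive part; once the grades exist, the metric itself is a few lines.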

Strip the new vocabulary and the harness resolves into something old: candidate generation, ranking, filtering, and evaluation against real traffic. That is a retrieval system - and every one of those operations has been studied, named and argued over under that banner since the 1970s.

So here is what "the end of fine-tuning" actually means, said plainly. It doesn’t mean a new discipline has appeared that everyone must now invent. It means a discipline that predates most of the people now doing applied AI - information retrieval - just became the main event. The work moved from a place most product teams could never reach (model weights, GPU clusters, ML PhDs) to a place they can: the retrieval system wrapped around the model.

That’s the genuinely good news here. The quieter problem is that a great many teams are rebuilding information retrieval (IR) from scratch without realising the field is there to draw on - not through any fault of theirs; nobody pointed them at it. Spend an hour reading RAG posts about the perfect chunk size - 256 tokens or 512, fixed windows or semantic splits - and you are watching a field re-derive passage retrieval, a question IR was already deep into in the 1990s. The same teams are rediscovering that you need a judged evaluation set, that BM25 is a stubbornly strong baseline, that ranking and filtering are different operations that fail in different ways. All of it is already written down - much of it before the people rediscovering it started their careers. That’s not a criticism. It’s an invitation.
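Part of why BM25 is such a stubborn baseline is how little there is to it. The sketch below is a minimal version under a few assumptions: a naive regex tokenizer, the commonly used defaults k1=1.5 and b=0.75, and the non-negative IDF variant with the added 1 inside the log. The example documents are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase alphanumeric runs, no stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avgdl = sum(len(toks) for toks in tokenized) / n               # average document length
    df = Counter(term for toks in tokenized for term in set(toks))  # document frequency
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalised by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "dogs chase cats", "quantum computing basics"]
scores = bm25_scores("cat mat", docs)
print(max(range(len(docs)), key=scores.__getitem__))  # 0
```

Note that "cats" does not match "cat" here. That is the tokenizer’s fault, not BM25’s, and stemming is another problem the field wrote down long ago.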

I’ve been circling this for a while - a few weeks ago I argued that agentic search only looks like a clean break from classical retrieval, and that the architecture underneath is the one we already had. If iteration has moved to the harness, and the harness is mostly retrieval, then the most valuable person on an applied-AI team is no longer the one who can fine-tune. It’s the one who can build, measure and tune a retrieval system - and they’re sitting on decades of method they may not realise they have.

None of this contradicts Laurie. It’s his observation, pushed one step on with my obvious existing bias. He says: stop iterating the weights, start iterating the harness. I’d add only this - the harness has an older name, the name is retrieval, and the manual already exists. I encourage any engineer to go and read it, and fall in love with it like I did. Everything is search and search is everything ❤️