Internet Search Is Not a Naive Information Retrieval Problem

"During RL training, we employ a curriculum-based rollout strategy that
incrementally degrades the quality of generated documents, progressively eliciting the model’s reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZEROSEARCH effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it."
https://arxiv.org/pdf/2505.04588
The research demonstrates something interesting about language models' ability to simulate search behavior in controlled conditions. But claiming equivalence to a "real search engine" is like saying you've built a military defense system because your soldiers performed well in peacetime maneuvers. The real test isn't whether it works when nobody's trying to break it—it's whether it works when half the internet is trying to game it for profit.
To illustrate, imagine a small corpus with two documents:
- Mr. Fox is great.
- Mr. Fox is not great.
If the search term is "Mr. Fox," then, from the perspective of semantic relevance, the two documents are tied. Instead, to build a more useful ranking, you need some signal of consumer demand, such as prevailing attitudes toward Mr. Fox and perceptions of his trustworthiness, that presumably reflects consumer utility.
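To make the tie concrete, here's a minimal sketch (my own toy code, not from the paper): any scorer that depends only on matches between the document and the query terms assigns both documents the same score.

```python
def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace, dropping periods."""
    return text.lower().replace(".", " ").split()

def relevance(query: str, doc: str) -> int:
    """Crude relevance proxy: count query terms that appear in the document."""
    doc_terms = set(tokenize(doc))
    return sum(term in doc_terms for term in tokenize(query))

docs = ["Mr. Fox is great.", "Mr. Fox is not great."]
for doc in docs:
    print(f"{doc!r} -> {relevance('Mr. Fox', doc)}")
# Both documents score 2: the query alone cannot break the tie.
```

An embedding-based scorer fares little better here; "great" and "not great" are known to sit close together in many embedding spaces, so the tie largely persists.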
Now, imagine I use GenAI to flood the Internet with 100,000 pages praising Mr. Fox. These aren't crude spam pages—they're well-written articles with proper grammar, coherent arguments, and seemingly legitimate citations. Each page offers minor variations on the same theme: "Mr. Fox is innovative," "Mr. Fox shows exceptional leadership," "Studies confirm Mr. Fox's approach is effective."
From a pure information retrieval perspective, a language model examining this corpus would find overwhelming "evidence" that Mr. Fox is great. LLMs have no built-in mechanism to recognize that these pages are artificial unless we model signals such as "all 100,000 pages appeared within the same week" or "none have meaningful engagement from real users," as sketched below.
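Here is a hedged sketch of what modeling such signals might look like. The field names (published_at, engagement), the one-week window, and the weights are all hypothetical illustrations, not any production algorithm:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Page:
    text: str
    published_at: datetime  # publication or first-crawl timestamp
    engagement: int         # real-user interactions observed (hypothetical signal)

def trust_weight(page: Page, corpus: list[Page]) -> float:
    """Downweight pages that arrive in a coordinated burst or show no engagement."""
    # Count corpus pages published within a week of this one (always >= 1,
    # since the page counts itself, so no division by zero).
    same_week = sum(
        abs((p.published_at - page.published_at).days) < 7 for p in corpus
    )
    # Split one unit of weight across the burst: 100,000 near-simultaneous
    # pages get ~1e-5 each, while a lone organic page keeps a weight near 1.
    burst_penalty = 1.0 / same_week
    # Pages no real user has ever engaged with are heavily discounted.
    engagement_boost = 1.0 if page.engagement > 0 else 0.1
    return burst_penalty * engagement_boost
```

A ranker could then score each document by its semantic relevance multiplied by trust_weight, so the entire 100,000-page flood contributes roughly as much as a single organic page.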
A traditional search engine must solve this problem daily. When a new movie, product, or politician appears, millions of dollars might be spent generating artificial positive content. Without robust manipulation resistance, the algorithm would surface whatever content was most plentiful or most aggressively optimized.
Real search engines don't primarily compete on finding relevant documents.* They compete on resisting manipulation. The moment Google's algorithm became valuable, an entire industry emerged dedicated to gaming it. Every ranking factor becomes a target for optimization, spam, and abuse. Search engines spend enormous resources not just on relevance, but on detecting artificial link schemes, content farms, cloaked pages, and sophisticated manipulation tactics that evolve daily.
The point is broader still. Search engines also incentivize the supply of good content, e.g., by figuring out ways to pay for it (ads) and by helping customers find it. Absent incentives for good content and penalties for bad content, the equilibrium is a race to the bottom.**
*The search engine space hasn't been very competitive, and investment in manipulation resistance has likely declined. Product search is another good example, where we seem to have stopped caring about data quality.
**Naive LLM-based search engines can still benefit from user feedback to incentivize certain kinds of generations.