- The paper introduces ZeroSearch, a reinforcement learning framework that simulates search engine behavior using fine-tuned LLMs to reduce API costs and control document quality.
- It employs a curriculum-based rollout that incrementally increases document noise to train the policy model in robust reasoning under varying information quality.
- Experiments on diverse QA datasets show that ZeroSearch outperforms baselines, including methods trained with real search engines, while offering improved training stability.
ZeroSearch is a reinforcement learning (RL) framework designed to improve the information search capabilities of LLMs without the need to interact with real-world search engines. The paper addresses two main challenges faced by previous RL approaches that used live search: the uncontrolled quality of retrieved documents, which introduces noise and instability into training, and the high API costs associated with frequent search requests needed for RL rollouts.
The core idea behind ZeroSearch is to leverage the knowledge already embedded in LLMs from their pretraining to simulate the behavior of a search engine. Instead of calling an external API, the LLM generates the search results itself. This simulation provides crucial advantages: it incurs zero API costs, enabling more extensive RL training rollouts, and it allows for explicit control over the quality of the generated documents, addressing the instability issue.
The ZeroSearch framework involves several key components:
- Search Simulation Tuning: A smaller LLM undergoes supervised fine-tuning (SFT) to act as a simulated search engine, using trajectories collected from interactions with a real search engine. Crucially, the SFT process trains the simulation LLM to generate documents that are either "useful" (containing relevant information for the question) or "noisy" (containing irrelevant information). This distinction is achieved by adjusting simple keywords in the prompt provided to the simulation LLM during generation, as shown in the prompt template below (the paper's Table 2). The input question and its ground truth answer are also included in the prompt during SFT to broaden the simulation LLM's knowledge.
```
You are the Google search engine.
Given a query, you need to generate five [useful / noisy] documents for the query.
The user is trying to answer the question: [question] whose answer is [ground truth].
Each document should contain about 30 words, and these documents should contain [useful / noisy] information.
Query: [query]
[Useful / Noisy] Output:
```
This SFT process helps the simulation LLM mimic the style of real search results and provides the mechanism to control document quality.
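To make the quality-control mechanism concrete, here is a minimal sketch of how the simulation prompt could be assembled with the useful/noisy keyword toggled. The `build_sim_prompt` helper and its signature are illustrative assumptions rather than the paper's released code; only the prompt text itself comes from the Table 2 template above.

```python
# Illustrative only: the helper name and signature are assumptions,
# but the prompt text mirrors the template from the paper's Table 2.

SIM_PROMPT = (
    "You are the Google search engine.\n"
    "Given a query, you need to generate five {quality} documents for the query.\n"
    "The user is trying to answer the question: {question} whose answer is {answer}.\n"
    "Each document should contain about 30 words, and these documents should "
    "contain {quality} information.\n"
    "Query: {query}\n"
    "{quality_cap} Output:"
)

def build_sim_prompt(query: str, question: str, answer: str, noisy: bool) -> str:
    """Toggle the useful/noisy keyword to control the quality of generated documents."""
    quality = "noisy" if noisy else "useful"
    return SIM_PROMPT.format(quality=quality, quality_cap=quality.capitalize(),
                             question=question, answer=answer, query=query)
```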
- Curriculum-Based Rollout: During the RL training of the policy model, the quality of the documents generated by the simulation LLM is progressively degraded. This is implemented using a probability function $p_i = p_s + \frac{b^{i/m} - 1}{b - 1}(p_e - p_s)$, where $p_i$ is the probability of generating noisy documents at step $i$, $p_s$ is the starting noise probability, $p_e$ is the ending noise probability, $m$ is the total number of training steps, and $b$ is an exponential base (default 4). As training progresses ($i$ increases), $p_i$ increases, meaning the policy model is exposed to more challenging scenarios with lower-quality search results. This curriculum helps the policy model first learn basic interaction patterns and then develop more robust reasoning strategies to handle noisy information.
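The schedule itself is a few lines of code. The sketch below directly implements the formula above; the particular values of $p_s$, $p_e$, and $m$ in the example are illustrative, not taken from the paper.

```python
# Curriculum noise schedule: a direct implementation of
# p_i = p_s + (b**(i/m) - 1) / (b - 1) * (p_e - p_s).

def noise_probability(i: int, m: int, p_s: float, p_e: float, b: float = 4.0) -> float:
    """Probability of requesting noisy documents from the simulation LLM at step i."""
    return p_s + (b ** (i / m) - 1) / (b - 1) * (p_e - p_s)

# With the default base b = 4, the noise level rises slowly early in training
# and ramps up quickly toward the end (example values only).
for step in (0, 50, 100, 150, 200):
    print(step, round(noise_probability(step, m=200, p_s=0.1, p_e=0.9), 3))
```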
- Training Template: The policy LLM interacts with the simulated search engine following a structured multi-turn template inspired by previous work like Search-R1 (2503.09516). This template guides the model through three stages: <think>...</think> for internal reasoning, <search>...</search> to issue a query, and <answer>...</answer> to provide the final response. This structure makes the model's decision-making process more transparent. A sketch of the corresponding rollout loop follows the template below.
```
Answer the given question.
You must conduct reasoning inside <think> and </think> first every time you get new information.
After reasoning, if you find you lack some knowledge, you can call a search engine by <search> query </search>, and it will return the top searched results between <retrieved_docs> and </retrieved_docs>.
You can search as many times as you want.
If you find no further external knowledge needed, you can directly provide the answer inside <answer> and </answer> without detailed illustrations. For example, <answer> Beijing </answer>. Question:
```
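Putting the template together with the simulated search engine, a rollout loop might look like the sketch below. `policy_generate` and `simulate_search` are placeholder callables standing in for the policy LLM and the simulation LLM; none of these names come from the paper's code.

```python
import re

def rollout(question: str, template: str, policy_generate, simulate_search,
            max_turns: int = 4) -> str:
    """One multi-turn episode following the <think>/<search>/<answer> template.

    policy_generate(context) -> str : continues the policy LLM's generation,
        stopping after </search> or </answer>.
    simulate_search(query) -> str   : returns documents from the simulation LLM.
    """
    context = template + question
    for _ in range(max_turns):
        completion = policy_generate(context)
        context += completion
        answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if answer:
            return answer.group(1).strip()
        query = re.search(r"<search>(.*?)</search>", completion, re.DOTALL)
        if query:
            docs = simulate_search(query.group(1).strip())
            context += f"\n<retrieved_docs>{docs}</retrieved_docs>\n"
    return ""  # no final answer produced within the turn budget
```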
- Reward Design: The reward signal used for RL training is based solely on the F1 score between the model's final answer and the ground truth, computed as $r_\phi(x, y) = \frac{2 \times IN}{PN + RN}$, where $IN$ is the number of overlapping words, $PN$ is the number of words in the prediction, and $RN$ is the number of words in the ground truth. This metric was chosen over exact match (EM) to prevent reward hacking (where the model generates excessively long answers to increase the chance of including the correct words) while still rewarding factual correctness.
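Because the reward is a plain word-level F1, it can be computed in a few lines. The sketch below follows the formula above; lowercasing and whitespace tokenization are illustrative choices rather than details reported in the paper.

```python
from collections import Counter

def f1_reward(prediction: str, ground_truth: str) -> float:
    """Word-level F1 between prediction and ground truth: 2 * IN / (PN + RN)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())  # IN
    return 2 * overlap / (len(pred_tokens) + len(gold_tokens))             # PN + RN

# A partially correct answer earns partial reward:
print(f1_reward("the city of Beijing", "Beijing"))  # 0.4
```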
- Training Algorithm & Stability: ZeroSearch is compatible with various RL algorithms like PPO (1707.06347) and GRPO (2402.03300). To ensure training stability, especially because retrieved documents are external to the policy model's direct control, the framework incorporates a loss masking mechanism: gradients are computed only with respect to the policy model's own generated tokens, not the tokens within the simulated search results (<retrieved_docs>...</retrieved_docs>). Experiments show that applying this loss masking significantly improves performance and stability. A minimal sketch of the masking appears below.
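The sketch below illustrates the masking idea, assuming per-token log-probabilities and advantage estimates are already available; the PPO/GRPO objective is simplified to a plain policy-gradient surrogate purely for illustration, and the tensor layout is an assumption.

```python
import torch

def masked_policy_loss(token_logprobs: torch.Tensor,
                       advantages: torch.Tensor,
                       retrieved_mask: torch.Tensor) -> torch.Tensor:
    """Policy loss that ignores tokens inside <retrieved_docs>...</retrieved_docs>.

    token_logprobs : (batch, seq) log-probs of sampled tokens under the policy
    advantages     : (batch, seq) per-token advantage estimates
    retrieved_mask : (batch, seq) 1 for tokens produced by the simulation LLM, 0 otherwise
    """
    policy_mask = 1.0 - retrieved_mask.float()        # keep only the policy's own tokens
    per_token = -token_logprobs * advantages          # simplified policy-gradient surrogate
    return (per_token * policy_mask).sum() / policy_mask.sum().clamp(min=1.0)
```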
Implementation considerations include the need for GPU infrastructure to run the simulation LLM during training rollouts. While this replaces expensive API calls, it introduces GPU costs. The paper discusses that a 3B simulation LLM is sufficient to effectively train the policy model, a 7B simulation LLM achieves performance comparable to Google Search, and a 14B simulation LLM can even surpass it. This offers flexibility in balancing simulation quality and GPU resource usage. The cost analysis table shows that even with GPU costs, ZeroSearch is substantially cheaper for a typical RL training run compared to using a real search API like SerpAPI for thousands of queries. Sharing a simulation server across multiple training tasks is suggested as a way to further optimize GPU utilization.
Experimental results across various single-hop and multi-hop QA datasets (NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle) using Qwen and LLaMA models (3B, 7B, 14B, Base, and Instruct variants) demonstrate that ZeroSearch consistently outperforms baselines, including prompt-based methods, RAG variants, and even the Search-R1 method, which uses a real search engine. The framework generalizes well across model types and sizes, and the curriculum (easy-to-hard) strategy proves more effective than a reversed curriculum.