Agentic Task Synthesis Pipeline
- The paper introduces a pipeline that synthesizes QA pairs via injection and fuzzing, integrating them into asynchronous RL for agentic AI training.
- The methodology employs a parallel actor–learner architecture and rigorous filtering to achieve significant F1 improvements on multi-hop QA and web search benchmarks.
- The pipeline is significant for enabling robust long-horizon reasoning and effective tool-use capabilities through scalable data generation and curriculum scheduling.
A large-scale agentic task synthesis pipeline is a systematic framework for the automated creation, verification, and utilization of complex interaction data tailored for training and evaluating agentic AI systems. Such pipelines are designed to produce extensive datasets containing multi-step problem-solving episodes, tool-use traces, and challenging, verifiable trajectories, enabling agents—especially LLMs—to develop robust, long-horizon reasoning and tool-interaction capabilities in diverse environments. The pipeline described in "Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL" (Gao et al., 11 Aug 2025) exemplifies current best practices in scaling both data generation and RL training for LLM-based search agents.
1. Stages of the Agentic Task Synthesis Pipeline
The pipeline is organized into three principal stages: QA dataset construction, asynchronous RL training, and rigorous evaluation.
- QA Dataset Construction
- Seeds are collected from open-source multi-hop datasets such as HotpotQA and 2WikiMultiHopQA.
- Hard samples are selected by requiring ≥2 search turns and <50% zero-shot accuracy from a preliminary agent (a filtering sketch follows the flow diagram below).
- A prompt-based LLM agent expands ~14,000 seeds into 134,000 QA pairs, with 25.6k demanding tool use.
- After filtering, the final mix forms 35,000 training QAs (16,000 open-source, 19,000 synthetic).
- Asynchronous RL Training
- Agents interact with an environment via search engines and browsers; webpage content is summarized on-the-fly.
- RL is performed using Group Relative Policy Optimization (GRPO), with sparse, end-of-episode rewards.
- The architecture is fully asynchronous with decoupled actor–learner roles, parallel tool calls, and efficient trajectory batching.
- Evaluation & Analysis
- Agents are evaluated on standard single- and multi-hop QA (with local retrieval), web search tasks, and challenging long-horizon benchmarks (GAIA, xBench-DeepSearch, Frames).
- Ablation studies assess the impact of turn limits, dataset scale, and asynchronous versus batch training.
The data and training flow can be diagrammed as:
Seed QAs → Open-Source Filter → Synthetic QA Generation → Combined QA Dataset → Asynchronous RL Training → Trained Agent → Evaluation
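The hard-sample criterion from the dataset-construction stage (≥2 search turns and <50% zero-shot accuracy) can be expressed as a simple filter. The following is a minimal sketch under that reading; the `SeedQA` fields and the idea of precomputed turn counts and accuracies are illustrative assumptions, not the released tooling.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SeedQA:
    question: str
    answer: str
    min_search_turns: int      # search turns a preliminary agent needed to solve the seed
    zero_shot_accuracy: float  # fraction answered correctly without any tool use (0.0-1.0)

def filter_hard_seeds(seeds: List[SeedQA],
                      min_turns: int = 2,
                      max_zero_shot_acc: float = 0.5) -> List[SeedQA]:
    """Keep only seeds that require multi-turn search and are not easily solvable zero-shot."""
    return [
        s for s in seeds
        if s.min_search_turns >= min_turns and s.zero_shot_accuracy < max_zero_shot_acc
    ]
```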
2. Prompt-Based QA Synthesis Mechanisms
Architecture and Generation
A single LLM agent (typically Qwen2.5-32B) is prompted to synthesize more challenging QA pairs through two mechanisms, sketched in code after the list below:
- Injection: Incorporate new factual clauses (snippets from Wikipedia) about entities in the seed question.
- Fuzzing: Replace concrete terms in the question with ambiguous phrases (e.g., exact dates become "early 1930s" or specific names are generalized).
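A minimal sketch of the two expansion operations, assuming a generic `llm(prompt)` text-completion callable; the prompt wording here is illustrative and not the paper's actual prompts.

```python
def inject(llm, question: str, answer: str, snippet: str) -> str:
    """Injection: weave a new factual clause (e.g., a Wikipedia snippet) into the question
    so that answering now also requires resolving the added fact."""
    prompt = (
        "Rewrite the question so that it additionally depends on the fact below, "
        "while keeping the same final answer.\n"
        f"Fact: {snippet}\nQuestion: {question}\nAnswer: {answer}\nRewritten question:"
    )
    return llm(prompt).strip()

def fuzz(llm, question: str) -> str:
    """Fuzzing: replace concrete terms with vaguer phrases (e.g., '1932' becomes
    'the early 1930s'), forcing extra disambiguation steps at answer time."""
    prompt = (
        "Rewrite the question, replacing exact dates, names, or numbers with "
        "ambiguous but still uniquely resolvable descriptions.\n"
        f"Question: {question}\nRewritten question:"
    )
    return llm(prompt).strip()
```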
Multi-Stage Filtering
The QA generation undergoes several quality-control stages, which compose into the filtering cascade sketched after this list:
- Basic Quality Checks: A second LLM verifies clarity and factual dependency.
- Difficulty Measurement: Questions are rejected if the agent answers them correctly >25% of the time in zero-shot.
- Uniqueness Check: Fuzzed questions that admit alternative correct answers are detected and discarded.
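A minimal sketch of the three-gate cascade. The verifier and agent are injected as callables (`passes_quality_check`, `zero_shot_answer`, `has_unique_answer` are hypothetical stand-ins for the second-LLM checks and the preliminary agent); only the 25% difficulty cutoff comes from the text above.

```python
from typing import Callable, List

def filter_synthetic_qas(candidates: List,
                         passes_quality_check: Callable,
                         zero_shot_answer: Callable[[str], str],
                         has_unique_answer: Callable,
                         n_trials: int = 8) -> List:
    """Apply the three quality-control stages described above, in order."""
    kept = []
    for qa in candidates:
        # Stage 1: basic quality check by a second LLM (clarity and factual dependency).
        if not passes_quality_check(qa):
            continue
        # Stage 2: difficulty; reject if answered correctly >25% of the time zero-shot.
        accuracy = sum(zero_shot_answer(qa.question) == qa.answer
                       for _ in range(n_trials)) / n_trials
        if accuracy > 0.25:
            continue
        # Stage 3: uniqueness; discard fuzzed questions with alternative correct answers.
        if not has_unique_answer(qa):
            continue
        kept.append(qa)
    return kept
```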
Dataset Statistics
- Average injections per seed: 6.3
- Average fuzzes per seed: 3.2
- Selected subset: 25,624 QAs require ≥1 tool turn
- Supporting-fact histogram peaks at 3–4 facts
- Zero-tool accuracy of QwQ-32B drops from roughly 80% at 0–2 fuzzing operations to roughly 20% at ≥5.
3. Asynchronous RL Training Framework
System Architecture
The framework leverages a distributed actor–learner paradigm:
- Actors: Each worker independently pulls the latest policy parameters, interacts with the environment (up to 128 turns for large models), interfaces with search and browse tools via REST, and pushes trajectory data into a central queue.
- Learner: Consumes batches of trajectories (G=16), computes gradients via GRPO, and updates central parameters.
Pseudocode
    # Actor process (one of many, running asynchronously)
    while True:
        θ_local = pull_parameters()
        τ = roll_out_episode(env, π_θ_local, max_turns)
        push_to_queue(τ)

    # Learner process
    while True:
        batch = sample_queue(G)
        loss = compute_GRPO_loss(batch, θ)
        θ = θ - α * ∇_θ loss      # gradient step on the GRPO loss
        push_parameters(θ)
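For a more concrete picture of the decoupling, here is a minimal, self-contained Python sketch of the actor–learner pattern using threads and a shared queue. It uses dummy rollouts and a dummy parameter update in place of real model calls, so all names are illustrative rather than taken from the released code.

```python
import queue
import random
import threading
import time

trajectory_queue: "queue.Queue[list]" = queue.Queue(maxsize=256)
params = {"version": 0}
params_lock = threading.Lock()

def actor(actor_id: int, max_turns: int = 8) -> None:
    """Each actor pulls the latest parameter version, rolls out one episode, and enqueues it."""
    while True:
        with params_lock:
            local_version = params["version"]
        trajectory = [f"actor{actor_id}-v{local_version}-turn{t}"
                      for t in range(random.randint(1, max_turns))]
        trajectory_queue.put(trajectory)  # blocks only if the queue is full, never the learner

def learner(group_size: int = 16) -> None:
    """The learner consumes groups of trajectories and updates the central parameters."""
    while True:
        batch = [trajectory_queue.get() for _ in range(group_size)]
        time.sleep(0.01)  # stand-in for computing the GRPO gradient step on `batch`
        with params_lock:
            params["version"] += 1  # actors pick up the new version asynchronously

if __name__ == "__main__":
    for i in range(4):
        threading.Thread(target=actor, args=(i,), daemon=True).start()
    threading.Thread(target=learner, daemon=True).start()
    time.sleep(1.0)
    print("parameter version after 1s:", params["version"])
```

The key property mirrored here is that slow rollouts never stall the learner and a slow update never stalls the actors; staleness is bounded only by how often actors re-pull parameters.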
Objective and Loss
The optimization maximizes the expected return

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]$$

using the GRPO surrogate

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\tau_i|} \sum_{t=1}^{|\tau_i|} \min\Big( r_{i,t}(\theta)\, \hat{A}_i,\; \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) \right],$$

where $r_{i,t}(\theta) = \pi_\theta(a_{i,t} \mid s_{i,t}) / \pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid s_{i,t})$ is the token-level importance ratio and $\hat{A}_i = \big(R_i - \mathrm{mean}(R_{1:G})\big) / \mathrm{std}(R_{1:G})$ is the group-normalized advantage across the $G$ trajectories sampled for the same question.
Hyperparameters:
- Group size G = 16; maximum turns = 32 (7B/14B) or 128 (32B); together with the learning rate, discount factor, PPO-style clip ratio ε, and entropy bonus coefficient of the GRPO update.
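As a concrete illustration of the surrogate above, the following NumPy sketch computes group-normalized advantages and the clipped objective for one group of trajectories; the shapes, helper names, and toy data are illustrative, not the released implementation.

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, returns, clip_eps: float = 0.2) -> float:
    """Clipped GRPO surrogate for one group of G trajectories on the same question.

    logp_new, logp_old : lists of 1-D arrays with per-token log-probs (new/old policy).
    returns            : array of shape (G,), one sparse end-of-episode reward per trajectory.
    """
    returns = np.asarray(returns, dtype=np.float64)
    # Group-normalized advantage: one scalar per trajectory.
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)

    per_traj = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        ratio = np.exp(lp_new - lp_old)                          # token-level importance ratios
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        per_traj.append(np.minimum(ratio * a, clipped * a).mean())  # average over tokens
    return float(np.mean(per_traj))  # maximize this quantity (negate it for a loss)

# Toy usage: G = 4 trajectories with random token log-probs and sparse 0/1 rewards.
rng = np.random.default_rng(0)
lp_old = [rng.normal(-1.0, 0.1, size=rng.integers(5, 12)) for _ in range(4)]
lp_new = [lp + rng.normal(0.0, 0.05, size=lp.shape) for lp in lp_old]
print(grpo_surrogate(lp_new, lp_old, returns=[1, 0, 0, 1]))
```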
4. Integration of QA Synthesis and RL Training
Environment for Each QA Task
- Each QA defines an MDP: the state is the question together with the interaction history accumulated so far; actions include "think" tokens and <search>/<browse> tool calls; the episode terminates on <answer>.
- Rewards: sparse, end-of-episode answer scores for the trained LLMs, with $0/1$ correctness judged by an LLM-as-Judge for QwQ (a minimal environment sketch follows this list).
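The per-QA environment can be pictured as a thin wrapper that parses the agent's emitted tags and returns observations until <answer> terminates the episode. The sketch below is a simplified, hypothetical rendering of that loop; the tag names match the text above, while `agent`, `search`, `browse`, and the exact-match reward are stand-ins (the paper judges QwQ with an LLM-as-Judge instead).

```python
import re

TAG = re.compile(r"<(search|browse|answer)>(.*?)</\1>", re.DOTALL)

def run_episode(agent, question: str, gold_answer: str, search, browse, max_turns: int) -> float:
    """Minimal QA environment loop: think/tool-call turns until <answer> or the turn limit."""
    context = question
    for _ in range(max_turns):
        action = agent.act(context)               # free-form text with "think" tokens plus one tag
        match = TAG.search(action)
        if match is None:
            continue                              # pure "think" turn, no environment change
        kind, payload = match.group(1), match.group(2).strip()
        if kind == "answer":
            # Sparse terminal reward; exact match is a stand-in for the paper's scoring.
            return 1.0 if payload.lower() == gold_answer.lower() else 0.0
        observation = search(payload) if kind == "search" else browse(payload)
        context += f"\n{action}\n<observation>{observation}</observation>"
    return 0.0                                    # no answer within the turn limit
```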
Curriculum and Difficulty Scheduling
- Training starts with easier open-source filtered QAs.
- Synthetic QAs are injected gradually: 40% in stage two, 80% in the final stage.
- QAs whose grouped trajectories show no reward variance are dynamically filtered out so that the group-relative advantages stay informative (see the scheduling sketch below).
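A minimal sketch of the curriculum mix and variance-based filtering, assuming a three-stage schedule (0%, 40%, 80% synthetic) consistent with the ratios above; the stage boundaries and helper names are illustrative assumptions.

```python
import random
import statistics

SYNTHETIC_RATIO_BY_STAGE = {1: 0.0, 2: 0.4, 3: 0.8}   # fraction of synthetic QAs per stage

def sample_training_qa(stage: int, open_source_qas, synthetic_qas):
    """Draw the next training question according to the stage's synthetic/open-source mix."""
    ratio = SYNTHETIC_RATIO_BY_STAGE[stage]
    pool = synthetic_qas if random.random() < ratio else open_source_qas
    return random.choice(pool)

def keep_for_update(group_rewards) -> bool:
    """Drop QA groups whose rewards have (near-)zero variance: they yield zero GRPO advantages."""
    return statistics.pvariance(group_rewards) > 1e-8
```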
5. Evaluation, Ablation, and Key Results
Benchmarks and Metrics
- Standard QA (retrieval): HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle, NQ, TriviaQA, PopQA.
- Web-based QA: GAIA, xBench-DeepSearch, Frames.
- Metrics: word-level F1, plus LLM-as-Judge Avg@4 and Pass@4 computed over four independent runs (approximate metric sketches follow this list).
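For reference, here is a minimal sketch of word-level F1 and the Avg@4/Pass@4 aggregation; answer normalization (casing, articles, punctuation) varies between benchmarks, so treat this as an approximation rather than the official scorers.

```python
from collections import Counter
from typing import List, Tuple

def word_f1(prediction: str, gold: str) -> float:
    """Word-level F1 between a predicted and a gold answer (simple whitespace tokenization)."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def avg_and_pass_at_k(run_scores: List[float], threshold: float = 0.5) -> Tuple[float, float]:
    """Avg@k is the mean score over k runs; Pass@k is 1 if any run clears the threshold."""
    avg_at_k = sum(run_scores) / len(run_scores)
    pass_at_k = float(any(s >= threshold for s in run_scores))
    return avg_at_k, pass_at_k

print(word_f1("the early 1930s", "early 1930s"))    # 0.8
print(avg_and_pass_at_k([1.0, 0.0, 0.0, 1.0]))       # (0.5, 1.0)
```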
Quantitative Results
- Local-RAG QA (7B): F1 +3.7 over Search-R1 7B; (14B): F1 +4.1 vs Search-R1 14B
- Web QA (14B): Avg F1 61.5 vs SimpleDeepSearcher-32B 58.4
- GAIA: Avg@4 52.8 vs best open-source 48.1 (+4.7); Pass@4 70.1 vs 67.0
- xBench-DeepSearch: Avg@4 42.1 vs 40.3; Pass@4 68.0 vs 65.0
- During training, agents learn long-horizon strategies exceeding 40 tool calls and more than 150,000 generated tokens.
Ablation Studies
- Accuracy increases as the turn limit is raised from T=4 to T=32.
- Asynchronous training achieves 3× higher throughput (600 vs. 200 episodes/hr).
- 35k mixed QAs outperform 16k open-source QAs alone by +10% F1.
6. Implementation Guidelines and Open-Source Artifacts
Code Organization
- /data/: JSONL files for filtered open-source and synthetic QA
- /synthesis/: LLM prompts, injection/fuzz wrappers
- /rl/: actor/learner logic, GRPO implementations
- /tools/: search engine, browser REST stubs, cache
- /eval/: local-RAG and web evaluation, LLM-as-Judge integration
Dependencies and Setup
- Python 3.10, PyTorch 2.0, Transformers ≥4.30
- Redis/RabbitMQ for queue
- FastAPI for search/browser wrappers (a minimal wrapper stub is sketched after this list)
- GPUs (H100 or A100); training Web-QwQ required ≈7,600 GPU-hours
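As an illustration of the tool-server layer, here is a minimal FastAPI stub in the spirit of /tools/; the endpoint paths, the in-memory cache, the canned responses, and the `tool_server` module name in the run command are assumptions for demonstration, not the repository's actual interface.

```python
# Minimal search/browse tool-server sketch (run with: uvicorn tool_server:app --port 8000)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="toy-agentic-tool-server")
_cache: dict[str, str] = {}          # naive in-memory cache of previous tool results

class ToolRequest(BaseModel):
    query: str

@app.post("/search")
def search(req: ToolRequest) -> dict:
    """Return (cached) search results for a query; a real server would call a search engine."""
    if req.query not in _cache:
        _cache[req.query] = f"[stub results for: {req.query}]"
    return {"results": _cache[req.query]}

@app.post("/browse")
def browse(req: ToolRequest) -> dict:
    """Fetch and summarize a web page; here we simply echo a stub summary."""
    return {"summary": f"[stub page summary for: {req.query}]"}
```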
Reproducibility Workflow
- Install dependencies
- Download seed QAs and open-source data
- Launch tool servers
- Run data synthesis
- Start actor processes (N copies)
- Start learner process
- Run evaluation after checkpointing
All code, models, prompts, and synthetic datasets are openly available at https://github.com/inclusionAI/ASearcher.
7. Context and Significance
The large-scale agentic task synthesis pipeline outlined here addresses the bottlenecks in search-agent training by combining principled data generation, quality filtering, curriculum difficulty scheduling, and scalable asynchronous RL. Empirically, this enables open-source agents to reach expert-level performance in long-horizon web search, with greater data diversity and difficulty and reduced reliance on external LLMs. The modular, reproducible design, together with open benchmarks and datasets, provides a foundation for advancing agentic intelligence in both academic and industrial research contexts (Gao et al., 11 Aug 2025).