Agentic Task Synthesis Pipeline
- The paper introduces a pipeline that synthesizes QA pairs via injection and fuzzing, integrating them into asynchronous RL for agentic AI training.
- The methodology employs a parallel actor–learner architecture and rigorous filtering to achieve significant F1 improvements on multi-hop QA and web search benchmarks.
- The pipeline is significant for enabling robust long-horizon reasoning and effective tool-use capabilities through scalable data generation and curriculum scheduling.
A large-scale agentic task synthesis pipeline is a systematic framework for the automated creation, verification, and utilization of complex interaction data tailored for training and evaluating agentic AI systems. Such pipelines are designed to produce extensive datasets containing multi-step problem-solving episodes, tool-use traces, and challenging, verifiable trajectories, enabling agents—especially LLMs—to develop robust, long-horizon reasoning and tool-interaction capabilities in diverse environments. The pipeline described in "Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL" (Gao et al., 11 Aug 2025) exemplifies current best practices in scaling both data generation and RL training for LLM-based search agents.
1. Stages of the Agentic Task Synthesis Pipeline
The pipeline is organized into three principal stages: QA dataset construction, asynchronous RL training, and rigorous evaluation.
- QA Dataset Construction
- Seeds are collected from open-source multi-hop datasets such as HotpotQA and 2WikiMultiHopQA.
- Hard samples are selected by requiring ≥2 search turns and <50% zero-shot accuracy from a preliminary agent (a filtering sketch follows the flow diagram below).
- A prompt-based LLM agent expands ~14,000 seeds into 134,000 QA pairs, with 25.6k demanding tool use.
- After filtering, the final mix forms 35,000 training QAs (16,000 open-source, 19,000 synthetic).
- Asynchronous RL Training
- Agents interact with an environment via search engines and browsers; webpage content is summarized on-the-fly.
- RL is performed using Group Relative Policy Optimization (GRPO), with sparse, end-of-episode rewards.
- The architecture is fully asynchronous with decoupled actor–learner roles, parallel tool calls, and efficient trajectory batching.
- Evaluation & Analysis
- Agents are evaluated on standard single- and multi-hop QA (with local retrieval), web search tasks, and challenging long-horizon benchmarks (GAIA, xBench-DeepSearch, Frames).
- Ablation studies assess the impact of turn limits, dataset scale, and asynchronous versus batch training.
The data and training flow can be diagrammed as:
Seed QAs → Open-Source Filter → Synthetic QA Generation → Combined QA Dataset → Asynchronous RL Training → Trained Agent → Evaluation
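The hard-sample criterion from the dataset-construction stage (≥2 search turns and <50% zero-shot accuracy) can be expressed as a simple filter. The following is a minimal sketch under that reading; the `SeedQA` fields and the idea of precomputed turn counts and accuracies are illustrative assumptions, not the released tooling.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SeedQA:
    question: str
    answer: str
    min_search_turns: int      # search turns a preliminary agent needed to solve the seed
    zero_shot_accuracy: float  # fraction answered correctly without any tool use (0.0-1.0)

def filter_hard_seeds(seeds: List[SeedQA],
                      min_turns: int = 2,
                      max_zero_shot_acc: float = 0.5) -> List[SeedQA]:
    """Keep only seeds that require multi-turn search and are not easily solvable zero-shot."""
    return [
        s for s in seeds
        if s.min_search_turns >= min_turns and s.zero_shot_accuracy < max_zero_shot_acc
    ]
```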
2. Prompt-Based QA Synthesis Mechanisms
Architecture and Generation
A single LLM agent (typically Qwen2.5-32B) is prompted to synthesize more challenging QA pairs through two mechanisms, sketched in code after the list below:
- Injection: Incorporate new factual clauses (snippets from Wikipedia) about entities in the seed question.
- Fuzzing: Replace concrete terms in the question with ambiguous phrases (e.g., exact dates become "early 1930s" or specific names are generalized).
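A minimal sketch of the two expansion operations, assuming a generic `llm(prompt)` text-completion callable; the prompt wording here is illustrative and not the paper's actual prompts.

```python
def inject(llm, question: str, answer: str, snippet: str) -> str:
    """Injection: weave a new factual clause (e.g., a Wikipedia snippet) into the question
    so that answering now also requires resolving the added fact."""
    prompt = (
        "Rewrite the question so that it additionally depends on the fact below, "
        "while keeping the same final answer.\n"
        f"Fact: {snippet}\nQuestion: {question}\nAnswer: {answer}\nRewritten question:"
    )
    return llm(prompt).strip()

def fuzz(llm, question: str) -> str:
    """Fuzzing: replace concrete terms with vaguer phrases (e.g., '1932' becomes
    'the early 1930s'), forcing extra disambiguation steps at answer time."""
    prompt = (
        "Rewrite the question, replacing exact dates, names, or numbers with "
        "ambiguous but still uniquely resolvable descriptions.\n"
        f"Question: {question}\nRewritten question:"
    )
    return llm(prompt).strip()
```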
Multi-Stage Filtering
The QA generation undergoes several quality-control stages, which compose into the filtering cascade sketched after this list:
- Basic Quality Checks: A second LLM verifies clarity and factual dependency.
- Difficulty Measurement: Questions are rejected if the agent answers them correctly >25% of the time in zero-shot.
- Uniqueness Check: Fuzzed questions that admit alternative correct answers are detected and discarded.
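A minimal sketch of the three-gate cascade. The verifier and agent are injected as callables (`passes_quality_check`, `zero_shot_answer`, `has_unique_answer` are hypothetical stand-ins for the second-LLM checks and the preliminary agent); only the 25% difficulty cutoff comes from the text above.

```python
from typing import Callable, List

def filter_synthetic_qas(candidates: List,
                         passes_quality_check: Callable,
                         zero_shot_answer: Callable[[str], str],
                         has_unique_answer: Callable,
                         n_trials: int = 8) -> List:
    """Apply the three quality-control stages described above, in order."""
    kept = []
    for qa in candidates:
        # Stage 1: basic quality check by a second LLM (clarity and factual dependency).
        if not passes_quality_check(qa):
            continue
        # Stage 2: difficulty; reject if answered correctly >25% of the time zero-shot.
        accuracy = sum(zero_shot_answer(qa.question) == qa.answer
                       for _ in range(n_trials)) / n_trials
        if accuracy > 0.25:
            continue
        # Stage 3: uniqueness; discard fuzzed questions with alternative correct answers.
        if not has_unique_answer(qa):
            continue
        kept.append(qa)
    return kept
```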
Dataset Statistics
- Average injections per seed: 6.3
- Average fuzzes per seed: 3.2
- Selected subset: 25,624 QAs require ≥1 tool turn
- Supporting-fact histogram peaks at 3–4 facts
- Zero-tool accuracy of QwQ-32B drops from roughly 80% at 0–2 fuzzing operations to roughly 20% at ≥5.
3. Asynchronous RL Training Framework
System Architecture
The framework leverages a distributed actor–learner paradigm:
- Actors: Each worker independently pulls the latest policy parameters, interacts with the environment (up to 128 turns for large models), interfaces with search and browse tools via REST, and pushes trajectory data into a central queue.
- Learner: Consumes batches of trajectories (G=16), computes gradients via GRPO, and updates central parameters.
Pseudocode
    # Actor process (one of many, running asynchronously)
    while True:
        θ_local = pull_parameters()
        τ = roll_out_episode(env, π_θ_local, max_turns)
        push_to_queue(τ)

    # Learner process
    while True:
        batch = sample_queue(G)
        loss = compute_GRPO_loss(batch, θ)
        θ = θ - α * ∇_θ loss      # gradient step on the GRPO loss
        push_parameters(θ)
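For a more concrete picture of the decoupling, here is a minimal, self-contained Python sketch of the actor–learner pattern using threads and a shared queue. It uses dummy rollouts and a dummy parameter update in place of real model calls, so all names are illustrative rather than taken from the released code.

```python
import queue
import random
import threading
import time

trajectory_queue: "queue.Queue[list]" = queue.Queue(maxsize=256)
params = {"version": 0}
params_lock = threading.Lock()

def actor(actor_id: int, max_turns: int = 8) -> None:
    """Each actor pulls the latest parameter version, rolls out one episode, and enqueues it."""
    while True:
        with params_lock:
            local_version = params["version"]
        trajectory = [f"actor{actor_id}-v{local_version}-turn{t}"
                      for t in range(random.randint(1, max_turns))]
        trajectory_queue.put(trajectory)  # blocks only if the queue is full, never the learner

def learner(group_size: int = 16) -> None:
    """The learner consumes groups of trajectories and updates the central parameters."""
    while True:
        batch = [trajectory_queue.get() for _ in range(group_size)]
        time.sleep(0.01)  # stand-in for computing the GRPO gradient step on `batch`
        with params_lock:
            params["version"] += 1  # actors pick up the new version asynchronously

if __name__ == "__main__":
    for i in range(4):
        threading.Thread(target=actor, args=(i,), daemon=True).start()
    threading.Thread(target=learner, daemon=True).start()
    time.sleep(1.0)
    print("parameter version after 1s:", params["version"])
```

The key property mirrored here is that slow rollouts never stall the learner and a slow update never stalls the actors; staleness is bounded only by how often actors re-pull parameters.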
Objective and Loss
The optimization maximizes the expected return

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]$$

using the GRPO surrogate

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\tau_i|} \sum_{t=1}^{|\tau_i|} \min\Big( r_{i,t}(\theta)\, \hat{A}_i,\; \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) \right],$$

where $r_{i,t}(\theta) = \pi_\theta(a_{i,t} \mid s_{i,t}) / \pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid s_{i,t})$ is the token-level importance ratio and $\hat{A}_i = \big(R_i - \mathrm{mean}(R_{1:G})\big) / \mathrm{std}(R_{1:G})$ is the group-normalized advantage across the $G$ trajectories sampled for the same question.
Hyperparameters:
- Group size G = 16; maximum turns = 32 (7B/14B) or 128 (32B); together with the learning rate, discount factor, PPO-style clip ratio ε, and entropy bonus coefficient of the GRPO update.
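As a concrete illustration of the surrogate above, the following NumPy sketch computes group-normalized advantages and the clipped objective for one group of trajectories; the shapes, helper names, and toy data are illustrative, not the released implementation.

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, returns, clip_eps: float = 0.2) -> float:
    """Clipped GRPO surrogate for one group of G trajectories on the same question.

    logp_new, logp_old : lists of 1-D arrays with per-token log-probs (new/old policy).
    returns            : array of shape (G,), one sparse end-of-episode reward per trajectory.
    """
    returns = np.asarray(returns, dtype=np.float64)
    # Group-normalized advantage: one scalar per trajectory.
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)

    per_traj = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        ratio = np.exp(lp_new - lp_old)                          # token-level importance ratios
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        per_traj.append(np.minimum(ratio * a, clipped * a).mean())  # average over tokens
    return float(np.mean(per_traj))  # maximize this quantity (negate it for a loss)

# Toy usage: G = 4 trajectories with random token log-probs and sparse 0/1 rewards.
rng = np.random.default_rng(0)
lp_old = [rng.normal(-1.0, 0.1, size=rng.integers(5, 12)) for _ in range(4)]
lp_new = [lp + rng.normal(0.0, 0.05, size=lp.shape) for lp in lp_old]
print(grpo_surrogate(lp_new, lp_old, returns=[1, 0, 0, 1]))
```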
4. Integration of QA Synthesis and RL Training
Environment for Each QA Task
- Each QA defines an MDP: the state is the question together with the interaction history accumulated so far; actions include "think" tokens and <search>/<browse> tool calls; the episode terminates on <answer>.
- Rewards: sparse, end-of-episode answer scores for the trained LLMs, with $0/1$ correctness judged by an LLM-as-Judge for QwQ (a minimal environment sketch follows this list).
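The per-QA environment can be pictured as a thin wrapper that parses the agent's emitted tags and returns observations until <answer> terminates the episode. The sketch below is a simplified, hypothetical rendering of that loop; the tag names match the text above, while `agent`, `search`, `browse`, and the exact-match reward are stand-ins (the paper judges QwQ with an LLM-as-Judge instead).

```python
import re

TAG = re.compile(r"<(search|browse|answer)>(.*?)</\1>", re.DOTALL)

def run_episode(agent, question: str, gold_answer: str, search, browse, max_turns: int) -> float:
    """Minimal QA environment loop: think/tool-call turns until <answer> or the turn limit."""
    context = question
    for _ in range(max_turns):
        action = agent.act(context)               # free-form text with "think" tokens plus one tag
        match = TAG.search(action)
        if match is None:
            continue                              # pure "think" turn, no environment change
        kind, payload = match.group(1), match.group(2).strip()
        if kind == "answer":
            # Sparse terminal reward; exact match is a stand-in for the paper's scoring.
            return 1.0 if payload.lower() == gold_answer.lower() else 0.0
        observation = search(payload) if kind == "search" else browse(payload)
        context += f"\n{action}\n<observation>{observation}</observation>"
    return 0.0                                    # no answer within the turn limit
```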
Curriculum and Difficulty Scheduling
- Training starts with easier open-source filtered QAs.
- Synthetic QAs are injected gradually: 40% in stage two, 80% in the final stage.
- QAs whose grouped trajectories show no reward variance are dynamically filtered out so that the group-relative advantages stay informative (see the scheduling sketch below).
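A minimal sketch of the curriculum mix and variance-based filtering, assuming a three-stage schedule (0%, 40%, 80% synthetic) consistent with the ratios above; the stage boundaries and helper names are illustrative assumptions.

```python
import random
import statistics

SYNTHETIC_RATIO_BY_STAGE = {1: 0.0, 2: 0.4, 3: 0.8}   # fraction of synthetic QAs per stage

def sample_training_qa(stage: int, open_source_qas, synthetic_qas):
    """Draw the next training question according to the stage's synthetic/open-source mix."""
    ratio = SYNTHETIC_RATIO_BY_STAGE[stage]
    pool = synthetic_qas if random.random() < ratio else open_source_qas
    return random.choice(pool)

def keep_for_update(group_rewards) -> bool:
    """Drop QA groups whose rewards have (near-)zero variance: they yield zero GRPO advantages."""
    return statistics.pvariance(group_rewards) > 1e-8
```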
5. Evaluation, Ablation, and Key Results
Benchmarks and Metrics
- Standard QA (retrieval): HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle, NQ, TriviaQA, PopQA.
- Web-based QA: GAIA, xBench-DeepSearch, Frames.
- Metrics: word-level F1, plus LLM-as-Judge Avg@4 and Pass@4 computed over four independent runs (approximate metric sketches follow this list).
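For reference, here is a minimal sketch of word-level F1 and the Avg@4/Pass@4 aggregation; answer normalization (casing, articles, punctuation) varies between benchmarks, so treat this as an approximation rather than the official scorers.

```python
from collections import Counter
from typing import List, Tuple

def word_f1(prediction: str, gold: str) -> float:
    """Word-level F1 between a predicted and a gold answer (simple whitespace tokenization)."""
    pred_tokens, gold_tokens = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def avg_and_pass_at_k(run_scores: List[float], threshold: float = 0.5) -> Tuple[float, float]:
    """Avg@k is the mean score over k runs; Pass@k is 1 if any run clears the threshold."""
    avg_at_k = sum(run_scores) / len(run_scores)
    pass_at_k = float(any(s >= threshold for s in run_scores))
    return avg_at_k, pass_at_k

print(word_f1("the early 1930s", "early 1930s"))    # 0.8
print(avg_and_pass_at_k([1.0, 0.0, 0.0, 1.0]))       # (0.5, 1.0)
```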
Quantitative Results
- Local-RAG QA (7B): F1 +3.7 over Search-R1 7B; (14B): F1 +4.1 vs Search-R1 14B
- Web QA (14B): Avg F1 61.5 vs SimpleDeepSearcher-32B 58.4
- GAIA: Avg@4 52.8 vs best open-source 48.1 (+4.7); Pass@4 70.1 vs 67.0
- xBench-DeepSearch: Avg@4 42.1 vs 40.3; Pass@4 68.0 vs 65.0
- During training, agents learn long-horizon strategies exceeding 40 tool calls and more than 150,000 generated tokens.
Ablation Studies
- Accuracy increases as the turn limit is raised from T=4 to T=32.
- Asynchronous training achieves 3× higher throughput (600 vs. 200 episodes/hr).
- 35k mixed QAs outperform 16k open-source QAs alone by +10% F1.
6. Implementation Guidelines and Open-Source Artifacts
Code Organization
- /data/: JSONL files for filtered open-source and synthetic QA
- /synthesis/: LLM prompts, injection/fuzz wrappers
- /rl/: actor/learner logic, GRPO implementations
- /tools/: search engine, browser REST stubs, cache
- /eval/: local-RAG and web evaluation, LLM-as-Judge integration
Dependencies and Setup
- Python 3.10, PyTorch 2.0, Transformers ≥4.30
- Redis/RabbitMQ for queue
- FastAPI for search/browser wrappers (a minimal wrapper stub is sketched after this list)
- GPUs (H100 or A100); training Web-QwQ required ≈7,600 GPU-hours
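As an illustration of the tool-server layer, here is a minimal FastAPI stub in the spirit of /tools/; the endpoint paths, the in-memory cache, the canned responses, and the `tool_server` module name in the run command are assumptions for demonstration, not the repository's actual interface.

```python
# Minimal search/browse tool-server sketch (run with: uvicorn tool_server:app --port 8000)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="toy-agentic-tool-server")
_cache: dict[str, str] = {}          # naive in-memory cache of previous tool results

class ToolRequest(BaseModel):
    query: str

@app.post("/search")
def search(req: ToolRequest) -> dict:
    """Return (cached) search results for a query; a real server would call a search engine."""
    if req.query not in _cache:
        _cache[req.query] = f"[stub results for: {req.query}]"
    return {"results": _cache[req.query]}

@app.post("/browse")
def browse(req: ToolRequest) -> dict:
    """Fetch and summarize a web page; here we simply echo a stub summary."""
    return {"summary": f"[stub page summary for: {req.query}]"}
```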
Reproducibility Workflow
- Install dependencies
- Download seed QAs and open-source data
- Launch tool servers
- Run data synthesis
- Start actor processes (N copies)
- Start learner process
- Run evaluation after checkpointing
All code, models, prompts, and synthetic datasets are openly available at https://github.com/inclusionAI/ASearcher.
7. Context and Significance
The large-scale agentic task synthesis pipeline outlined here addresses the bottlenecks in search-agent training by combining principled data generation, quality filtering, curriculum difficulty scheduling, and scalable asynchronous RL. Empirically, this enables open-source agents to reach expert-level performance in long-horizon web search, with greater data diversity and difficulty and reduced reliance on external LLMs. The modular, reproducible design, together with open benchmarks and datasets, provides a foundation for advancing agentic intelligence in both academic and industrial research contexts (Gao et al., 11 Aug 2025).