ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Published 29 Jan 2026 in cs.CL | (2601.21558v1)

Abstract: LLMs are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented LLM agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.

Abstract PDF Upgrade to Chat

Summary

The paper introduces an automated synthesis pipeline for multi-turn agent trajectories, integrating SFT and RL protocols to improve tool-based reasoning.
It employs graph-based random walks and QA-based decompositions to generate rule-verifiable simulation environments, ensuring semantic realism.
Experimental results across benchmarks demonstrate ASTRA’s efficacy in enhancing multi-step planning and stable, efficient agent training.

ASTRA: Automated Synthesis of Agentic Trajectories and Reinforcement Arenas

Motivation and Problem Statement

Tool-augmented LLM agents address increasingly complex interactive tasks requiring API invocation, long-horizon planning, and multi-step decision making. However, current agent training protocols are constrained by manual annotation, simulations that lack rule verifiability, exclusive reliance on either SFT or RL, and instability in long-horizon learning. ASTRA introduces a fully automated, end-to-end framework that integrates scalable multi-turn trajectory synthesis and verifiable environment generation to explicitly target these limitations.

Automated Multi-Turn Data and Environment Synthesis

Tool-Chain-Based Trajectory Synthesis Pipeline

ASTRA's SFT methodology leverages large-scale MCP server registries and tool emulators to construct diverse and executable multi-turn data. The pipeline normalizes disparate tool schemas into a unified format, filters for actionable documents, and employs graph-based random walks for tool-chain creation, verifying dependencies and task coherence via LLM. Controlled augmentation (paraphrase, complexity, persona) and a multifaceted task scoring regime ensure both semantic realism and tool-use necessity.

Figure 1: Pipeline for multi-turn tool-chain trajectory synthesis including document normalization, chain construction, and augmentation.

After chain validation, SFT trajectories are realized by hybrid execution across both real and simulated tool endpoints, emulating real-world unreliability and logging outputs for fine-tuning.

QA-Based Verifiable Environment Synthesis

For RL, ASTRA systematically decomposes semantic reasoning via QA instance synthesis, extracting structured multi-hop subgoals and representing dependency topology as an explicit DAG. Samples are filtered for tool-use steps, atomicity, rationality, and task completeness, generating code-executable Python environments for each validated sub-question and merging functionally equivalent sub-environments to constrain the action space.

Figure 2: Framework for QA-based environment synthesis, converting decomposed reasoning traces into independent, rule-verifiable simulation environments.

This process yields training instances paired with isolated rule-verifiable simulators, facilitating trajectory-level online RL.

Unified Agentic Training Protocol

Supervised Fine-Tuning (SFT)

Fine-tuning is performed on Qwen3-14B and Qwen3-32B using high-frequency checkpointing with decoupled weight/training-state snapshots for observability and efficiency. Context parallelism supports long-horizon training, and the pipeline enables models to internalize structured tool use and multi-turn conventions.

Reinforcement Learning (RL)

ASTRA frames policy learning as online RL over instance-specific, sandboxed environments. The agent executes tool sequences (Figure 3) until a turn or sequence cap is reached, with rollouts collected for trajectory-level optimization using the GRPO objective (KL and entropy terms omitted for stability). Adaptive batch filling (Figure 4) ensures that batches only contain rollouts with informative reward variance, mitigating collapse in advantage estimates and promoting stable training.

Figure 3: Online multi-turn rollout procedure for agentic RL over isolated environments.

Figure 4: Adaptive batch filling scheme guarantees a batch of non-degenerate, learning-active trajectories.

Agent Robustness via Irrelevant Tool Mixing

Tool discrimination is induced by augmenting each instance with controlled sets of semantically diverse, task-irrelevant tools sampled according to embedding-based similarity bands. This augmentation is critical for promoting negative tool judgment and robustness under expansive tool inventories.

Figure 5: Ablation of irrelevant-tool mixing reveals necessity for discrimination pressure and negative tool judgment.

Trajectory-Level F1 Reward Design

ASTRA proposes a trajectory-level F1-style reward, jointly optimizing sub-task recall and precision of tool invocation. Comparative ablations (Figure 6, 8) demonstrate that recall-only rewards drive overuse and instability, whereas precision-only rewards induce brittleness; F1 harmonic mean produces balanced exploration and stable optimization.

Figure 6: Reward design ablation compares recall-only, precision-only, and F1 reward effects on training and performance.

Figure 7: Dialogue turn distribution evolution for different reward configurations, showing that F1 design yields well-behaved, efficient interactions.

Experimental Results

Extensive evaluation on agentic benchmarks—BFCL v3 Multi-Turn, $\tau^2$ -Bench, and ACEBench—at both 14B and 32B scales shows that ASTRA-trained models consistently outperform open-source baselines and are competitive with closed-source systems, especially in multi-step and long-context settings. RL-driven gains are confirmed across all metrics, while performance on non-agentic reasoning (AIME) remains undiminished, evidencing robust core logical competence.

Figure 8: Model performance comparison on BFCL v3 Multi-Turn with ASTRA attaining competitive accuracy versus closed and open-source alternatives.

Training Stage Analysis

Stage-wise analysis demonstrates that SFT delivers a step-change in multi-turn adaptation and output conciseness, while RL augments this foundation, broadening exploration over the semantic/compositional topology and optimizing task-oriented, efficient behavior. Output statistics show SFT compresses response length, RL converges to efficient intermediate lengths, and interaction depth remains stable across protocols.

Large-Scale Data and Environment Coverage

ASTRA synthesizes 54,885 multi-turn SFT samples covering 6,765 unique tools across 41 domains (distribution shown in Figures 9, 11). RL environments are built from 6,596 instances, averaging 4.37 reasoning hops with predominance in real estate, e-commerce, and healthcare and domain/task parallelism reflecting real application diversity (Figures 12, 13).

Figure 9: Distribution of messages per SFT sample, illustrating multi-turn complexity.

Figure 10: Top 20 domain distribution in the RL dataset, representing extensive domain coverage.

Figure 11: RL scenario type distribution with a focus on parallel and multi-hop complexity.

Figure 12: Distribution of tool calls per RL sample, emphasizing frequent parallel tool use.

Implications and Future Directions

ASTRA establishes an open-source paradigm for fully automated agentic training, minimizing manual bottlenecks and enabling scalable synthesis of both agentic trajectories and rule-verifiable RL environments. The duality between static tool topology in SFT and compositional semantic topology in RL offers flexibility for diverse downstream applications and robust adaptation to complex, evolving interactive scenarios.

Practically, ASTRA's pipeline enables generation of executable environments per scenario, empowering iterative agent improvement and deployment across real-world domains. Theoretically, it affirms the importance of dense trajectory-level credit assignment, discrimination signals, and compositional environment construction for the advancement of grounded agentic LLMs.

Future work may extend ASTRA to interactive, user-driven online RL with evolving intents and feedback, and incorporate cost-aware automation in environment synthesis, including topology refinement and maximal reuse of validated semantic structures prior to instantiation.

Conclusion

ASTRA delivers a comprehensive, scalable framework for training tool-augmented LLM agents through fully automated trajectory and environment synthesis with integrated SFT and RL protocols. Empirical results validate agentic performance, efficiency, and robust reasoning. By open-sourcing its pipelines and models, ASTRA facilitates reproducibility and accelerates research toward adaptive, generalizable, and verifiable agentic intelligence in open environments (2601.21558).

Markdown