Don't Just Fine-tune the Agent, Tune the Environment (2510.10197v1)

Published 11 Oct 2025 in cs.AI

Abstract: LLM agents show great promise for complex, multi-turn tool-use tasks, but their development is often hampered by the extreme scarcity of high-quality training data. Supervised fine-tuning (SFT) on synthetic data leads to overfitting, whereas standard reinforcement learning (RL) struggles with a critical cold-start problem and training instability. To address these challenges, we introduce Environment Tuning, a novel training paradigm that enables agents to learn complex behaviors directly from problem instances without relying on pre-collected expert trajectories. Environment Tuning orchestrates this learning process through a structured curriculum, actionable environment augmentation that provides corrective feedback, and fine-grained progress rewards to ensure stable and efficient exploration. Using only 400 problem instances from Berkeley Function-Calling Leaderboard (BFCL) benchmark, our method not only achieves competitive in-distribution performance against strong baselines but also demonstrates superior out-of-distribution generalization, overcoming the performance collapse common to SFT-based approaches. Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration, paving the way for training more robust and data-efficient agents.

Summary

  • The paper introduces Environment Tuning, which leverages a structured curriculum, actionable feedback, and fine-grained progress rewards to train multi-turn agents with as few as 400 samples.
  • Empirical results highlight significant improvements, with models like Qwen2.5-7B-Instruct jumping from 7.00% to 36.92% accuracy and robust generalization on out-of-distribution tasks.
  • Ablation studies confirm that each component—curriculum stages, environment augmentation, and precise reward structuring—is essential for stable learning and to prevent training collapse.

Environment Tuning: A Paradigm Shift for Data-Efficient Multi-Turn Agent Training

Motivation and Problem Setting

The paper addresses the challenge of training LLM-based agents for complex, multi-turn tool-use tasks under severe data scarcity. Existing paradigms—Supervised Fine-Tuning (SFT) on synthetic trajectories and standard Reinforcement Learning (RL)—are shown to be inadequate. SFT leads to overfitting and poor generalization, while RL suffers from cold-start issues and instability due to sparse, uninformative feedback and long-horizon credit assignment problems.

Figure 1: Limitations of SFT and RL for multi-turn tool-use, and the actionable feedback advantage of Environment Tuning.

The core research question is how to enable robust, generalizable agent learning from a minimal set of problem instances, without reliance on expert demonstrations or large-scale synthetic data.

Environment Tuning Framework

The proposed Environment Tuning paradigm orchestrates agent learning through three tightly integrated mechanisms:

  1. Structured Curriculum: A four-stage progression that decomposes the learning problem, starting from syntactic correctness and culminating in full task generalization.
  2. Actionable Environment Augmentation: The environment is engineered to provide fine-grained, pedagogical feedback upon failure, transforming ambiguous errors into actionable lessons.
  3. Fine-Grained Progress Rewards: Dense, turn-level rewards replace sparse, binary signals, enabling effective credit assignment and stable policy optimization.

    Figure 2: The Environment Tuning module implements a four-stage curriculum, dynamically configuring reward, feedback, and data splits to enable stable, efficient learning.

Curriculum Stages

  • Stage 1: Isolates syntactic and schematic correctness, using format and tool-call rewards to ensure the agent can produce valid, parseable actions.
  • Stage 2: Introduces task-oriented reasoning with Progress Reward and augmented feedback, focusing on foundational multi-turn capabilities.
  • Stage 3: Exposes the agent to the full spectrum of complex scenarios (e.g., missing parameters/functions, long context), maintaining dense rewards and augmented feedback.
  • Stage 4: Disables augmentation to align with evaluation conditions, forcing the agent to generalize without scaffolding.

Stage transitions are governed by validation accuracy plateaus and gradient norm stability, preventing premature progression and mitigating RL instability.
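
The sketch below makes this staging logic concrete. The stage names, reward labels, data splits, and the plateau and gradient-norm thresholds are illustrative assumptions, not values reported by the paper; only the overall structure (four stages, augmentation toggled off at the end, transitions gated on validation plateau plus gradient stability) follows the description above.

```python
# Minimal sketch of a staged curriculum controller, assuming illustrative
# stage configurations and transition thresholds (not the paper's exact values).
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class StageConfig:
    name: str
    reward: str              # which reward signal is active in this stage
    augmented_feedback: bool  # whether actionable hints are returned on failure
    data_split: str

STAGES = [
    StageConfig("syntax",         reward="format+tool_call", augmented_feedback=False, data_split="basic"),
    StageConfig("foundational",   reward="progress",         augmented_feedback=True,  data_split="core_multi_turn"),
    StageConfig("complex",        reward="progress",         augmented_feedback=True,  data_split="full"),
    StageConfig("generalization", reward="progress",         augmented_feedback=False, data_split="full"),
]

@dataclass
class CurriculumController:
    patience: int = 5             # plateau window in validation checks (assumed)
    plateau_eps: float = 0.005    # minimum accuracy gain counted as progress (assumed)
    grad_norm_cap: float = 1.0    # "stable" gradient-norm bound (assumed)
    stage_idx: int = 0
    val_history: list = field(default_factory=list)
    grad_history: list = field(default_factory=list)

    def update(self, val_accuracy: float, grad_norm: float) -> StageConfig:
        """Record the latest metrics and advance the stage when both gates open."""
        self.val_history.append(val_accuracy)
        self.grad_history.append(grad_norm)
        if self._plateaued() and self._gradients_stable() and self.stage_idx < len(STAGES) - 1:
            self.stage_idx += 1
            self.val_history.clear()
            self.grad_history.clear()
        return STAGES[self.stage_idx]

    def _plateaued(self) -> bool:
        h = self.val_history
        return len(h) > self.patience and max(h[-self.patience:]) - h[-self.patience - 1] < self.plateau_eps

    def _gradients_stable(self) -> bool:
        recent = self.grad_history[-self.patience:]
        return len(recent) == self.patience and mean(recent) < self.grad_norm_cap
```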

Actionable Environment Augmentation

Standard environments return generic or cryptic errors, impeding exploration. The augmented environment provides explicit, context-sensitive hints (e.g., "Invalid airport code[s]:..." or "Paths are not allowed. Specify only file/directory name..."), enabling the agent to infer dependencies and operational constraints through interaction rather than memorization.

Figure 3: Multi-turn tool-use scenarios from BFCL V3, illustrating the diversity of reasoning and recovery skills required.
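
As an illustration of this augmentation, the sketch below wraps a base tool environment so that failed calls return context-sensitive hints, with a switch for disabling the hints in the final curriculum stage. The `AugmentedToolEnv` wrapper, error patterns, and hint strings are hypothetical; the paper's actual feedback is domain-specific and richer.

```python
# Illustrative environment wrapper that rewrites cryptic tool errors into
# actionable hints. Error patterns and hint text are assumptions for the sketch.
import re

HINT_RULES = [
    # (pattern matched against the raw error, hint returned to the agent)
    (re.compile(r"invalid airport code", re.I),
     "Invalid airport code(s). Use three-letter IATA codes such as 'SFO' or 'JFK'."),
    (re.compile(r"path.*not allowed", re.I),
     "Paths are not allowed. Specify only the file or directory name in the current directory."),
    (re.compile(r"missing required parameter", re.I),
     "A required parameter is missing. Ask the user for it or inspect the tool schema before retrying."),
]

class AugmentedToolEnv:
    """Wraps a base tool environment and appends pedagogical hints to failures."""

    def __init__(self, base_env, augmentation_enabled: bool = True):
        self.base_env = base_env
        self.augmentation_enabled = augmentation_enabled  # switched off in the final stage

    def step(self, tool_call):
        observation, success = self.base_env.step(tool_call)
        if success or not self.augmentation_enabled:
            return observation, success
        for pattern, hint in HINT_RULES:
            if pattern.search(str(observation)):
                return f"{observation}\nHint: {hint}", success
        return observation, success
```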

Fine-Grained Progress Reward

The reward at each turn is computed as the product of state and execution correctness, averaged over the episode. This dense signal allows the agent to learn from partial successes and supports efficient exploration in long-horizon, multi-turn settings.
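
A minimal formalization of this description, assuming binary per-turn state- and execution-correctness signals supplied by the benchmark environment:

```python
# Turn-level progress reward as described above: the per-turn reward is the
# product of state correctness and execution correctness, and the episode
# reward is the mean over turns. The 0/1 checks are assumed to come from the
# benchmark environment.
def progress_reward(state_correct: list[bool], exec_correct: list[bool]) -> float:
    """Dense episode reward in [0, 1] from per-turn correctness signals."""
    assert len(state_correct) == len(exec_correct) and state_correct
    per_turn = [float(s and e) for s, e in zip(state_correct, exec_correct)]
    return sum(per_turn) / len(per_turn)

# Example: 3 of 4 turns reach the correct state, and 2 of those also execute correctly.
# progress_reward([True, True, True, False], [True, False, True, False])  # -> 0.5
```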

Empirical Results

In-Distribution Performance

On the BFCL V3 multi-turn benchmark, Environment Tuning delivers substantial improvements across both base and SFT-tuned models, using only 400 training samples:

  • Qwen2.5-7B-Instruct: 7.00% → 36.92% (+29.92%)
  • Llama-3.1-8B-Instruct: 5.48% → 28.25% (+22.77%)
  • watt-tool-8B: 35.74% → 54.34% (+18.50%)
  • ToolACE-2-Llama-3.1-8B: 37.99% → 47.18% (+9.19%)

These results surpass several strong open-source and proprietary baselines, demonstrating the method's data efficiency and architecture-agnostic applicability.

Out-of-Distribution Generalization

On OOD tasks (BFCL V4 Web Search/Memory, ACEBench Agent), SFT baselines exhibit severe performance collapse (e.g., xLAM-2: 70.50% → 5.00% on Web Search), while Environment Tuning consistently improves or maintains generalization. For example:

  • Llama-3.1-8B-Instruct: 1.00% → 15.00% (Web Search)
  • ToolACE-2-Llama-3.1-8B: 9.00% → 14.00% (Web Search), 8.34% → 15.00% (ACEBench Agent)

This demonstrates that environment-driven exploration, rather than trajectory imitation, is critical for robust generalization.

Ablation and Training Dynamics

Ablation studies isolate the contributions of each component:

  • Actionable Environment Augmentation: Yields >20% improvement on ambiguous tasks (Missing Parameters/Functions), and stabilizes learning curves.
  • Fine-Grained Progress Reward: Essential for complex splits; binary rewards lead to training failure.
  • Structured Curriculum: Direct RL (single-stage) leads to rapid gradient explosion and performance collapse after ~70 steps, while the curriculum ensures stable, monotonic improvement.

Figure 4: Ablation study of environment augmentation, showing its critical role in stability and performance.

Figure 5: Training collapse in single-stage RL, with accuracy dropping as gradient norm explodes.

Figure 6: A high KL loss coefficient (β = 0.1) is necessary to maintain policy entropy and prevent collapse.

Implementation Considerations

  • Algorithm: Group-Relative Policy Optimization (GRPO), a PPO-style update adapted with decoupled clipping and a substantial KL penalty (β = 0.1) for entropy preservation (see the sketch after this list).
  • Data: Only 400 problem instances from BFCL V3; no expert demonstrations or large-scale synthetic data.
  • Environment Engineering: Requires domain-specific augmentation to provide actionable feedback, but does not require explicit dependency graphs or handcrafted trajectories.
  • Resource Requirements: The method is compatible with 7B–8B parameter models and can be applied to both base and SFT-tuned architectures.
  • Scalability: The curriculum and augmentation principles are extensible to larger models and more complex, multi-modal environments.
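
The sketch below illustrates a GRPO-style objective with group-relative advantages, decoupled (asymmetric) clipping, and an explicit KL penalty toward a reference policy, matching the ingredients listed above. The clip ranges, the k3 KL estimator, and the sequence-level log-probability shapes are assumptions for the sketch, not the paper's exact implementation.

```python
# Sketch of a GRPO-style loss with decoupled clipping and a KL penalty.
# Hyperparameters (eps_low, eps_high) and tensor shapes are illustrative.
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              eps_low=0.2, eps_high=0.28, beta=0.1):
    """
    logp_new, logp_old, logp_ref: (G,) summed log-probs of G sampled rollouts
    rewards: (G,) scalar episode rewards (e.g. progress rewards) for the group
    """
    # Group-relative advantage: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped policy-gradient term with decoupled lower/upper clip ranges.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    pg_loss = -torch.min(ratio * adv, clipped * adv).mean()

    # KL penalty toward the reference policy (k3 estimator, common in GRPO code).
    log_ratio_ref = logp_ref - logp_new
    kl = (torch.exp(log_ratio_ref) - log_ratio_ref - 1.0).mean()

    return pg_loss + beta * kl
```

The relatively large β keeps the policy close to the reference model, which is consistent with the paper's observation (Figure 6) that a high KL coefficient is needed to maintain policy entropy and avoid collapse.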

Theoretical and Practical Implications

The results challenge the prevailing paradigm of trajectory-based SFT for agentic tasks, demonstrating that environment-centric, curriculum-driven RL can achieve superior data efficiency and generalization. The approach decouples agent learning from the need for expert demonstrations, instead leveraging environment engineering to provide rich, pedagogical feedback.

Practically, this enables the development of robust, adaptable agents in domains where high-quality demonstrations are prohibitively expensive or unavailable. Theoretically, it suggests that the design of the training environment—specifically, the informativeness and granularity of feedback—can be as critical as the choice of model architecture or optimization algorithm.

Future Directions

  • Automated Curriculum and Feedback Generation: Developing methods to automatically construct curricula and generate actionable feedback, reducing the manual engineering burden.
  • Extension to Multi-Modal and Real-World Environments: Applying Environment Tuning to settings involving vision, speech, or physical actuation.
  • Integration with Model-Based RL: Leveraging learned environment models to further enhance sample efficiency and generalization.
  • Robustness and Safety: Investigating the impact of environment augmentation on agent robustness to adversarial or unexpected scenarios.

Conclusion

Environment Tuning represents a principled, empirically validated approach for training multi-turn, tool-augmented agents under extreme data scarcity. By shifting the focus from imitation of static trajectories to environment-driven exploration, and by leveraging curriculum learning, actionable feedback, and dense rewards, the method achieves both stable learning and strong generalization. This paradigm has significant implications for the design of future agentic systems, particularly in resource-constrained or rapidly evolving domains.
