- The paper identifies state information richness and planning complexity as key predictors of robust cross-domain generalization in RL-trained LLM agents.
- It demonstrates that lightweight state randomization improves out-of-domain success rates by 5.7% to 42.5%, enhancing overall robustness.
- The study shows that enforcing explicit step-by-step reasoning during RL training helps maintain performance when deployed in new, structurally distinct environments.
Paying Less Generalization Tax: Cross-Domain Generalization in RL Training for LLM Agents
The paper "Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents" (2601.18217) investigates the challenge of agentic post-training for generalist LLM agents when the eventual deployment domains are unknown and potentially disjoint from training environments. RL-based agentic fine-tuning remains constrained by available domain simulators, which are both expensive to create and inherently narrow in coverage. Despite strong in-domain performance after RL, substantial drops are observed in out-of-domain (OOD) scenarios—characterizing a "generalization tax." The study asks: which environment properties and training strategies most effectively preserve agentic capability across unseen, structurally distinct domains?
Analytical Framework and Environments
The analysis spans four agentic RL environments: WebShop (web navigation), Sokoban (grid-based planning), ALFWorld (household tasks), and SciWorld (complex laboratory reasoning), which are chosen specifically for coverage across different axes of state representation and planning logic. RL training and evaluation are performed using Llama-3.1-8B-Instruct, initialized from either a base RL checkpoint or an SFT-warmed policy incorporating multiple domains (V1 and V2 checkpoints). Success rates (pass@1) are measured in-domain and OOD, and an OOD ranking score is computed over cross-domain evaluations.
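The two headline metrics can be sketched in a few lines. The paper's exact OOD ranking formula is not reproduced here, so the sketch below assumes a simple average-rank score over cross-domain pass@1 results; the function names `pass_at_1` and `ood_ranking_score` are illustrative, not from the paper:

```python
from statistics import mean

def pass_at_1(successes: list[bool]) -> float:
    """Fraction of tasks solved on the first (and only) attempt."""
    return sum(successes) / len(successes)

def ood_ranking_score(results: dict[str, dict[str, float]], checkpoint: str) -> float:
    """Average rank of `checkpoint` across evaluation domains, where rank 1
    is the best pass@1 in that domain (lower is better).
    `results[domain][ckpt]` holds each checkpoint's pass@1 in that domain."""
    ranks = []
    for domain, scores in results.items():
        ordered = sorted(scores, key=scores.get, reverse=True)
        ranks.append(ordered.index(checkpoint) + 1)
    return mean(ranks)
```

Averaging ranks rather than raw success rates keeps domains with very different difficulty levels comparable when aggregating across cross-domain evaluations.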
Key Findings: Environmental Determinants of Generalization
Two environment-side factors exhibit strong correlation (and, via augmentation, causal linkage) with cross-domain generalization:
- State Information Richness: quantified by observation character count. Environments that force the agent to process denser or noisier input foster more robust policies, because dense states demand active filtering for task relevance and discourage shortcut exploitation.
- Planning Complexity: estimated by average trajectory length. High-complexity domains require agents to orchestrate multi-step planning chains, yielding policies that generalize reasoning structure rather than shallow heuristics.
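Both proxies are cheap to compute from logged rollouts. As a sketch, assuming each trajectory is stored as a list of (observation, action) pairs (the helper names here are illustrative, not from the paper):

```python
def state_info_richness(trajectories: list[list[tuple[str, str]]]) -> float:
    """Mean observation length in characters across all steps: a proxy for
    how much text the agent must filter per decision."""
    observations = [obs for traj in trajectories for (obs, _action) in traj]
    return sum(len(obs) for obs in observations) / len(observations)

def planning_complexity(trajectories: list[list[tuple[str, str]]]) -> float:
    """Mean trajectory length in steps: a proxy for how long the action
    chains the agent must plan are."""
    return sum(len(traj) for traj in trajectories) / len(trajectories)
```

Character count and step count are crude proxies, but they require no environment-specific instrumentation, which is what makes the correlation analysis portable across all four domains.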
Contrary to prevailing expectations, domain realism and semantic similarity are not predictive: Sokoban, an abstract puzzle, delivers better generalization than ALFWorld, a realistic household simulation. These findings do not simply reflect knowledge acquisition during RL (specialization does not inevitably degrade OOD performance); rather, the intrinsic structure of the training environment is the dominant factor.
To move from correlation to causal validation, the study introduces a low-overhead state randomization method: injecting controlled quantities of goal-irrelevant, distracting textual content into the agent's observations during RL training. This increases the perceived information load without altering the task, compelling the agent toward more robust filtering and generalizable perception. Across both checkpoints and all environments, state augmentation improves OOD success rates by 5.7–42.5% (except at the most extreme noise levels), providing a scalable mechanism for preserving robustness. The intervention is domain-agnostic and requires no changes to task structure or simulator design.
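A minimal sketch of such an observation wrapper follows, assuming a hypothetical pool of distractor sentences; the paper's actual distractor source, injection positions, and noise schedule are not specified here:

```python
import random

# Hypothetical pool of goal-irrelevant snippets (not from the paper).
DISTRACTORS = [
    "A delivery truck idles outside.",
    "The wall calendar still shows last month.",
    "Faint chatter is audible from another room.",
]

def randomize_state(observation: str, noise_level: int, rng: random.Random) -> str:
    """Inject `noise_level` goal-irrelevant sentences into the observation.
    The underlying task is unchanged; only the perceived information load
    grows, forcing the policy to filter for task-relevant content."""
    noise = rng.choices(DISTRACTORS, k=noise_level)
    parts = [observation] + noise
    rng.shuffle(parts)  # distractors may land before or after the real state
    return " ".join(parts)
```

Because the wrapper only rewrites the observation string, it can sit between any text-based simulator and the policy without touching reward logic, which is what makes the intervention domain-agnostic.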
Modeling Choices Impacting OOD Retention
Two salient design choices further modulate generalization:
- SFT Warmup/Mid-Training: Incorporating data from multiple domains via supervised demonstrations (SFT before RL) sharply reduces forgetting of those domains during subsequent RL fine-tuning, but at the price of near-complete erasure of capability in domains absent from the SFT mix. Retention in covered domains is robust, while OOD performance outside the SFT mix can be substantially worse.
- Step-by-Step Explicit Reasoning: Enabling explicit reasoning as part of the RL policy—requiring agents to articulate stepwise plans in their outputs—has marginal impact on in-domain performance but is critical for OOD generalization. Policies trained reactively (i.e., direct action prediction without reasoning traces) frequently collapse in unseen domains despite comparable or higher in-domain effectiveness.
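One common way to enforce this output format, sketched below under the assumption that the policy emits a free-form reasoning trace followed by a single tagged action line (the exact tags and prompt wording used in the paper are not specified here), is to parse only the action for execution while the full trace remains in the training context:

```python
import re

# Hypothetical instruction appended to the agent prompt (not from the paper).
REASONING_INSTRUCTION = (
    "Think step by step about the current state and your plan, "
    "then output exactly one line starting with 'Action:'."
)

def parse_action(model_output: str) -> str:
    """Extract the executable action, discarding the reasoning trace.
    Raises ValueError if no tagged action line is present."""
    match = re.search(r"^Action:\s*(.+)$", model_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no action line found in model output")
    return match.group(1).strip()
```

Under this scheme the environment only ever sees the parsed action, while the RL objective optimizes the whole reasoning-plus-action output, which is what distinguishes it from the reactive direct-action baseline.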
Theoretical and Practical Implications
The findings advocate a shift in how agentic RL training for LLMs is designed: strategically selecting or synthesizing domains with high information richness and planning complexity benefits generalization more than maximizing realism or semantic overlap. Lightweight state augmentation can further bridge gaps when domain diversity is constrained. Conversely, routine SFT mid-training must be carefully controlled to avoid over-specialization and forgetting, with a wide data mix preferred when deployment targets are uncertain. Step-by-step reasoning emerges as a necessary property for domain-agnostic transfer of agentic capability.
Theoretically, these results reinforce the view that generalization in RL for LLM agents is tightly coupled to environmental structure and perception complexity rather than surface similarity or raw experience volume. They suggest that optimizing over denser, more complex state-action spaces pushes the agent toward process-invariant policies, reducing overfitting and boosting OOD robustness.
Future Directions
Potential future work includes scaling to more diverse and challenging domains, formalizing environmental complexity measures, automating state augmentation strategies, and integrating adaptive SFT schedules. A mechanistic understanding of why state and perception richness, rather than realism or semantics, dominates generalization remains open. Extensions to larger scales, and to the interaction between explicit reasoning chains and knowledge-retention strategies, are also warranted.
Conclusion
The paper elucidates that to "pay less generalization tax," agentic RL training for LLMs should prioritize environments with higher state information richness and planning complexity, incorporate explicit reasoning, and utilize lightweight state randomization. These interventions and selection strategies provide practical and theoretically grounded guidance for robust, generalist agentic post-training, especially when the eventual test environments are unknown.