- The paper presents a unified RL model, CrossAgent, that dynamically selects among heterogeneous action spaces.
- It employs a three-stage pipeline of cold-start SFT, single-turn RL (STRL), and multi-turn RL (MTRL) to achieve context-aware action selection and improved sample efficiency.
- Empirical results in Minecraft tasks demonstrate high success rates and strong out-of-distribution generalization.
A Unified Agentic Model for Dynamic Action-Space Selection via Reinforcement Learning
Introduction
The challenge of scaling generalist agents to complex, open-ended environments lies in the diversity of skill representations and the need to transition fluently between different interaction interfaces. Conventional approaches anchor agents to a single, fixed action abstraction such as atomic movements, GUI events, API calls, or natural language commands. This design bottleneck fundamentally restricts adaptability and hinders generalization across heterogeneous environments.
"Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning" (2512.09706) proposes CrossAgent, a unified agentic model that autonomously learns to select and switch among heterogeneous action spaces—ranging from low-level primitive controls to high-level abstract commands—at each step in a trajectory. The model is instantiated and evaluated in the Minecraft environment, encompassing both embodied control and GUI-based interaction, and utilizes a comprehensive reinforcement learning pipeline.
Figure 1: CrossAgent dynamically switches between different action spaces to adapt to complex task requirements, eschewing static, fixed interaction modes.
CrossAgent Framework and Action-Space Design
CrossAgent is formulated as a policy over a union of N heterogeneous action spaces $\mathcal{A} = \bigcup_{x=1}^{N} \mathcal{A}_x$, with each subspace corresponding to a unique control interface (e.g., motion primitives, spatial grounding, keyboard and mouse events). At each timestep, the agent jointly chooses an action space and an action within that space, optimizing a composite objective that balances task rewards and context-dependent execution costs.
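To make the composite objective concrete, one plausible formalization (assumed here for illustration; the paper does not spell out the exact cost term) is a discounted return penalized by a context-dependent execution cost for the chosen subspace:

$$
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\,\bigl(r(s_t, a_t) - c(x_t, a_t \mid s_t)\bigr)\right], \qquad (x_t, a_t) \sim \pi(\cdot \mid s_t),\ a_t \in \mathcal{A}_{x_t},
$$

where $x_t$ indexes the selected action space and $c$ charges, for instance, more for slow high-level calls or for unreliable low-level sequences in a given context.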
The paper systematically categorizes action abstraction levels prevalent in existing agentic models:
- Raw actions: Direct low-level interface controls (keyboard/mouse).
- Motion actions: Parameterized movement or navigation primitives.
- Grounding actions: Perceptual grounding of actions to spatial scene coordinates or objects.
- Language actions: High-level, semantic task descriptions.
- Latent actions: Abstract, learned sub-behaviors emergent from offline or self-supervised data.
The central design innovation is that action-space selection is not hardwired but induced as a learnable component, enabling in-context adaptation and interface blending over task phases.
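As a minimal sketch of how such a union can be represented (class names, fields, and the tag format are illustrative assumptions, not the paper's implementation; latent actions are omitted for brevity), the policy can emit a subspace tag followed by a payload that is parsed into the corresponding action type:

```python
from dataclasses import dataclass
from typing import Union

# Illustrative tagged union over heterogeneous subspaces (assumed names).
@dataclass
class RawAction:        # low-level keyboard/mouse events
    keys: list[str]
    mouse_dx: float
    mouse_dy: float

@dataclass
class MotionAction:     # parameterized movement/navigation primitive
    primitive: str
    target_xyz: tuple[float, float, float]

@dataclass
class GroundingAction:  # action grounded to a scene coordinate or object
    verb: str
    pixel_xy: tuple[int, int]

@dataclass
class LanguageAction:   # high-level semantic command
    instruction: str

Action = Union[RawAction, MotionAction, GroundingAction, LanguageAction]

def decode_action(model_output: str) -> Action:
    """Parse a '<space>: <payload>' string emitted by the policy into one subspace.

    A real decoder would follow the model's structured output format; this
    only sketches the dispatch that makes space selection a learnable choice.
    """
    space, _, payload = model_output.partition(":")
    space, payload = space.strip().lower(), payload.strip()
    if space == "raw":
        return RawAction(keys=payload.split(), mouse_dx=0.0, mouse_dy=0.0)
    if space == "motion":
        return MotionAction(primitive=payload, target_xyz=(0.0, 0.0, 0.0))
    if space == "grounding":
        return GroundingAction(verb=payload, pixel_xy=(0, 0))
    return LanguageAction(instruction=payload)
```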
Reinforcement Learning Training Pipeline
The CrossAgent training pipeline comprises three progressive optimization stages:
Figure 2: The CrossAgent training regimen incrementally builds adaptive multi-action-space capabilities via Cold-Start SFT, STRL, and MTRL.
Stage 1: Cold-Start Supervised Fine-Tuning (SFT)
A vision-LLM (Qwen2-VL-7B-Instruct) is first fine-tuned on a balanced, composite action-space dataset derived from Minecraft online tasks. The base model thereby acquires the syntactic grounding to decode and emit valid actions across all interface modalities, but does not yet perform strategic action-space selection.
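A hedged sketch of how a single cold-start demonstration might be serialized (the chat template, field names, and `<space>` tag are assumptions; the paper only specifies a balanced, composite action-space dataset):

```python
def build_sft_example(image_path: str, instruction: str,
                      action_space: str, action_text: str) -> dict:
    """Serialize one demonstration into a chat-style SFT record.

    The assistant target is the action prefixed by its subspace tag, so the
    model learns to emit both the space choice and a syntactically valid action.
    """
    return {
        "messages": [
            {"role": "user",
             "content": [
                 {"type": "image", "image": image_path},
                 {"type": "text", "text": instruction},
             ]},
            {"role": "assistant",
             "content": f"<{action_space}> {action_text}"},
        ]
    }

# Balanced sampling across subspaces keeps the cold-start model fluent in
# every interface without biasing its later RL-driven preferences.
example = build_sft_example("frame_0041.png", "chop the nearest oak log",
                            "grounding", "attack (412, 236)")
```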
Stage 2: Single-Turn Reinforcement Learning (STRL)
Here, the model learns to discriminate and prefer action subspaces based on immediate context. Group Relative Policy Optimization (GRPO) is applied on one-step tasks with a reward function agnostic to surface form, evaluating only the underlying environmental effect of the executed action. This yields a policy that stochastically explores action spaces and begins to develop local context-aware action-space preferences.
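A minimal sketch of the group-relative credit assignment at the core of GRPO for single-turn rollouts (the clipping constant and per-prompt grouping follow GRPO's standard formulation; the surface-form-agnostic reward itself is supplied by the environment and is not reproduced here):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward against its rollout group.

    rewards: shape (G,), one scalar per rollout sampled for the same prompt.
    Because the reward scores only the environmental effect, rollouts that
    achieve the same effect through different action spaces score identically.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                     advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective over a group of single-turn rollouts."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```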
The inclusion of STRL is empirically crucial: models trained with STRL converge faster and reach higher asymptotic performance in downstream multi-turn RL, as demonstrated in the ablation study.
Figure 3: STRL-augmented CrossAgent exhibits marked sample efficiency and improved convergence in subsequent multi-turn RL.
Stage 3: Multi-Turn Reinforcement Learning (MTRL)
The final stage involves episodic trajectory optimization via GRPO, using sparse, long-horizon rewards tied directly to eventual task completion. To prevent mode collapse, the policy is initialized with the action-space selection behavior distilled from STRL. The model ultimately learns to orchestrate granular and abstract actions to maximize long-horizon success rates and resource efficiency.
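One plausible way to handle the sparse episodic signal under group-based optimization (an assumption consistent with the description above, not the paper's exact recipe) is to normalize whole-trajectory success within each rollout group and broadcast the resulting advantage to every step of that trajectory:

```python
import torch

def trajectory_advantages(success: torch.Tensor, lengths: list[int],
                          eps: float = 1e-6) -> list[torch.Tensor]:
    """Sparse, long-horizon credit assignment for a group of G trajectories.

    success: shape (G,), 1.0 if the episode completed the task, else 0.0.
    lengths: number of decision steps in each trajectory.
    Returns one advantage tensor per trajectory, with the group-normalized
    return repeated across all of its steps.
    """
    adv = (success - success.mean()) / (success.std() + eps)
    return [adv[i].expand(lengths[i]) for i in range(len(lengths))]

# Example: four rollouts of the same task, two succeed; every step of the
# successful trajectories receives a uniformly positive advantage.
advs = trajectory_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]), [37, 52, 41, 60])
```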
Empirical Analysis and Results
CrossAgent is evaluated on the OpenHA benchmark suite of over 800 Minecraft tasks spanning navigation, combat, and GUI-based crafting. The experiments place strong emphasis on task diversity, requiring robust phase-specific adaptation due to heterogeneous interface demands.
Key empirical findings:
- STRL importance: Removing the STRL phase leads to a notable drop in out-of-distribution (OOD) generalization, especially on tasks requiring non-trivial interface adaptation.
Dynamic Action-Space Adaptation: Qualitative Analysis
Case studies further illustrate that CrossAgent’s switching policy is intrinsically context- and phase-aware rather than random. In the Kill Sheep task, the agent uses motion primitives for terrain traversal, spatial grounding for entity targeting, and atomic actions for high-frequency attacks, adapting its interface choice seamlessly as the task progresses.
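A small analysis helper in the spirit of Figure 5 (phase boundaries and space tags are hypothetical) tallies which subspace the agent chose within each phase of a trajectory:

```python
from collections import Counter

def action_space_usage(trajectory: list[tuple[int, str]],
                       phase_starts: list[int]) -> list[Counter]:
    """Count action-space choices per task phase.

    trajectory: (timestep, space_tag) pairs, e.g. (12, "motion").
    phase_starts: timesteps at which each phase begins, e.g. [0, 40, 75].
    """
    counters = [Counter() for _ in phase_starts]
    for t, space in trajectory:
        # Assign the step to the last phase whose start does not exceed t.
        phase = max(i for i, start in enumerate(phase_starts) if start <= t)
        counters[phase][space] += 1
    return counters

# For "Kill Sheep": traversal dominated by motion primitives, targeting by
# grounding actions, combat by raw high-frequency attacks.
```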
Figure 5: The distribution of chosen action spaces shifts dynamically across task phases, reflecting in-context adaptation.
Figure 6: Example action sequences show semantically coherent interface transitions, e.g., from raw movement to GUI operations.
Implications and Forward Directions
CrossAgent’s learning-based action-space arbitration addresses the critical bottleneck of static interface design in embodied agent architectures. Theoretical implications include:
- Learned hierarchy: The policy implicitly reasons over hierarchies of abstraction, aligning with optimal option discovery perspectives.
- Robustness across distribution: Unified multi-space policies mitigate overfitting and mode collapse, outperforming domain-specialized experts.
- Efficient RL adaptation: The two-stage RL (STRL+MTRL) protocol enhances both sample efficiency and robustness, suggesting that step-level and trajectory-level policy regularization is valuable in open-world RL.
Practical avenues for future investigation include sim-to-real transfer for physical robotic systems—where interface switching also reflects reliability and safety constraints—and reduction in sample complexity for online RL by further exploiting offline or curriculum-based paradigms.
Conclusion
This work demonstrates that learning the arbitration over heterogeneous action spaces is tractable, empirically valuable, and vital for scalable, generalist agent architectures. By integrating flexible action-space selection into the reinforcement learning objective, embodied agents achieve stronger generalization and robustness. The CrossAgent framework establishes a clear foundation for unified agentic models in open-world settings, with direct implications for the design of future multi-modal, multi-interface autonomous systems.