Task-Aligned Whole-Body Policy Learning

Updated 3 April 2026

The paper presents a unified MDP formulation that integrates whole-body state and action spaces to enable synergistic control of locomotion and manipulation.
It introduces kinematic and physical model integration into reward shaping, which quantitatively improves pose tracking and expands manipulation reach.
Multi-critic architectures combined with curriculum training resolve conflicting task demands, resulting in robust, emergent whole-body behaviors and effective sim-to-real transfer.

Task-aligned whole-body policy learning refers to a class of algorithms, architectures, and training methodologies explicitly designed to produce unified control policies for high-DoF robotic systems—often with multiple functionally distinct subsystems—such that all relevant body parts are synergistically coordinated to directly maximize task success. This paradigm contrasts with hierarchical or decoupled schemes, which separately optimize locomotion, manipulation, and other sub-functions, often at the expense of global coordination. Recent research formalizes task alignment as the integration of physical, geometric, and task-specific priors into the learning problem, leading to robust, efficient, and dexterous controllers capable of executing challenging whole-body behaviors in both simulation and on diverse physical hardware.

1. Unified MDP Problem Formulation and Whole-Body State/Action Spaces

Task-aligned whole-body policy learning is grounded in a Markov Decision Process (MDP) that encapsulates the intertwined state, action, and transition spaces for all relevant robot subsystems. For instance, for a quadruped + manipulator platform, the state vector concatenates body commands and state (desired velocities, roll/pitch, twist), noisy proprioceptive leg and arm joint states, end-effector pose, and previous actions, yielding $s_t \in \mathbb{R}^{69}$ or higher (Hou et al., 6 Jul 2025). The action space is the full set of target joint positions for all limbs (e.g., $a_t \in \mathbb{R}^{18}$ for 12 leg + 6 arm joints).
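As a rough illustration of the unified state and action layout described above, the following sketch concatenates command, proprioceptive, and previous-action components into one flat observation and splits an 18-dimensional action into leg and arm targets. The class name, field names, and per-field sizes are hypothetical and are not meant to reproduce the exact 69-dimensional layout of the cited work; only the 12 + 6 joint split comes from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_LEG_JOINTS = 12  # from the text: 12 leg joints
NUM_ARM_JOINTS = 6   # from the text: 6 arm joints
NUM_ACTIONS = NUM_LEG_JOINTS + NUM_ARM_JOINTS  # 18 joint position targets


@dataclass
class WholeBodyObservation:
    """Hypothetical container; the field sizes are illustrative assumptions."""
    base_command: List[float]      # e.g. desired base velocities and roll/pitch
    ee_pose_command: List[float]   # e.g. desired end-effector position + quaternion
    joint_positions: List[float]   # noisy leg and arm joint positions
    joint_velocities: List[float]  # noisy leg and arm joint velocities
    previous_action: List[float]   # last joint targets sent to the robot

    def flatten(self) -> List[float]:
        """Concatenate every component into the single state vector s_t."""
        return (self.base_command + self.ee_pose_command
                + self.joint_positions + self.joint_velocities
                + self.previous_action)


def split_action(action: List[float]) -> Tuple[List[float], List[float]]:
    """Split a whole-body action a_t into (leg targets, arm targets)."""
    if len(action) != NUM_ACTIONS:
        raise ValueError(f"expected {NUM_ACTIONS} joint targets, got {len(action)}")
    return action[:NUM_LEG_JOINTS], action[NUM_LEG_JOINTS:]
```

Because a single policy consumes the flattened vector and emits all 18 targets at once, no interface between a "locomotion module" and a "manipulation module" has to be designed by hand.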

This comprehensive formalism enables the policy to reason jointly about locomotion and manipulation, capturing the dependencies between them without the constraints of module boundaries. Transitions are defined implicitly by high-fidelity physics engines (e.g., RaiSim) that simulate coupled dynamics, actuator models, and contacts. Such unification forms the basis for leveraging reinforcement learning (RL) to discover emergent synergistic behaviors.

2. Kinematic and Physical Model Integration for RL Guidance

A defining feature of the task-aligned paradigm is the explicit injection of kinematic or physical priors into the learning loop to bias exploration towards globally effective solutions and to mitigate local optima. Hou et al. (6 Jul 2025) integrate the manipulator's forward kinematics into the RL reward. The feasibility of a given body and arm configuration is quantified via workspace checks, either as a distance in SE(3) between desired and reachable end-effector poses or, more robustly, via binary feasibility predicates derived from inverse kinematics (IK) solutions.

This reward shaping directly guides the policy to exploit body posture (e.g., torso reorientation) to enhance manipulator workspace, thus enabling behaviors such as whole-body pitching to reach low/high targets while maintaining stability and tracking velocity commands within acceptable tolerances. The result is a measurable expansion of the reachable manipulation volume (e.g., ≈34% increase without degrading velocity tracking beyond 3%), along with improvements in pose tracking error (PE ≤ 0.087 m vs. 0.14 m for baselines) and orientation error (RE ≤ 0.18 rad) (Hou et al., 6 Jul 2025).
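The effect of a feasibility reward on body posture can be illustrated with a deliberately simplified toy model. The sketch below replaces the real 6-DoF arm and IK solver with a planar two-link arm whose reachable set is known in closed form (an annulus around the shoulder). The geometry, sign convention, link lengths, and the exponential shaping are all assumptions made for illustration and are not the formulation used by Hou et al.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (x, z) in a vertical plane, metres


def shoulder_position(base_height: float, base_pitch: float,
                      mount_offset: float) -> Point:
    """Shoulder location when the arm is mounted mount_offset ahead of the
    base centre. Assumed convention: positive pitch tips the nose down."""
    x = mount_offset * math.cos(base_pitch)
    z = base_height - mount_offset * math.sin(base_pitch)
    return x, z


def reach_gap(target: Point, shoulder: Point, l1: float, l2: float) -> float:
    """Distance by which the target lies outside the two-link workspace
    (0.0 means an IK solution exists for this toy arm)."""
    d = math.hypot(target[0] - shoulder[0], target[1] - shoulder[1])
    return max(0.0, d - (l1 + l2), abs(l1 - l2) - d)


def feasibility_reward(target: Point, base_height: float, base_pitch: float,
                       mount_offset: float = 0.3, l1: float = 0.35,
                       l2: float = 0.35, scale: float = 0.1,
                       binary: bool = False) -> float:
    """Binary predicate (1.0 if reachable) or a dense exponential variant."""
    shoulder = shoulder_position(base_height, base_pitch, mount_offset)
    gap = reach_gap(target, shoulder, l1, l2)
    if binary:
        return 1.0 if gap == 0.0 else 0.0
    return math.exp(-gap / scale)
```

In this toy setting, a low target at `(0.85, 0.0)` is out of reach with a level base at height 0.5 m but becomes reachable once the base pitches forward by 0.5 rad, so a policy maximizing this term is pushed towards the kind of torso reorientation described above.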

3. Multi-Critic and Decoupled Reward Learning to Resolve Task Conflicts

Locomotion and manipulation often impose conflicting requirements on global posture (e.g., horizontal base for efficient walking vs. pitched/tilted base for maximized end-effector reach), making a single scalar reward insufficient. Multi-critic architectures explicitly decouple the learning signal for each task component (Vijayan et al., 11 Jul 2025). Separate critic networks are trained for locomotion and manipulation reward terms, yielding individual advantage estimates (e.g., $\hat{A}_{\text{loc}}$, $\hat{A}_{\text{man}}$), which are then combined, typically via normalization and equal weighting, for updating the policy via PPO.
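The normalize-then-combine step can be sketched as follows. This is a minimal illustration assuming per-batch zero-mean, unit-variance normalization and equal weights by default; the function names and the dictionary interface are hypothetical, and the exact normalization used in the cited work may differ.

```python
import statistics
from typing import Dict, List, Optional


def normalize(advantages: List[float], eps: float = 1e-8) -> List[float]:
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


def combine_advantages(per_task: Dict[str, List[float]],
                       weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Combine per-critic advantage estimates into one policy-gradient signal."""
    names = list(per_task)
    lengths = {len(per_task[n]) for n in names}
    if len(lengths) != 1:
        raise ValueError("all advantage sequences must have the same length")
    if weights is None:
        # Equal weighting across critics, as described in the text.
        weights = {n: 1.0 / len(names) for n in names}
    normalized = {n: normalize(per_task[n]) for n in names}
    horizon = lengths.pop()
    return [sum(weights[n] * normalized[n][t] for n in names)
            for t in range(horizon)]
```

Normalizing each stream before summing keeps a reward term with a large numeric scale (here the hypothetical manipulation advantages in the hundreds) from drowning out the other critic's signal.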

Additionally, task-aligned policies leverage velocity-aware reward formulations. For manipulation, twist-based objectives directly penalize the deviation between desired and actual end-effector twists (velocity and angular velocity), allowing for smooth trajectory tracking and dynamic compensation during locomotion (Vijayan et al., 11 Jul 2025). This framework enables whole-body policies that—without explicit posture constraints—learn to compromise and produce emergent behaviors where, for example, base tilting is used just enough to optimize workspace while maintaining efficient gaits.
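A twist-based tracking term might look like the sketch below. The text only states that deviations between desired and actual end-effector twists are penalized; the exponential kernel, the separate linear and angular terms, and the `sigma_lin` and `sigma_ang` values are assumptions chosen because this shape is common in legged-robot RL rewards.

```python
import math
from typing import Sequence


def twist_tracking_reward(desired_twist: Sequence[float],
                          actual_twist: Sequence[float],
                          sigma_lin: float = 0.25,
                          sigma_ang: float = 0.5) -> float:
    """Reward in (0, 2]; twists are ordered (vx, vy, vz, wx, wy, wz)."""
    if len(desired_twist) != 6 or len(actual_twist) != 6:
        raise ValueError("a twist must have 6 components")
    # Squared error of the linear velocity part.
    lin_err_sq = sum((d - a) ** 2
                     for d, a in zip(desired_twist[:3], actual_twist[:3]))
    # Squared error of the angular velocity part.
    ang_err_sq = sum((d - a) ** 2
                     for d, a in zip(desired_twist[3:], actual_twist[3:]))
    return math.exp(-lin_err_sq / sigma_lin) + math.exp(-ang_err_sq / sigma_ang)
```

Perfect tracking yields the maximum value of 2.0, and the reward decays smoothly as either the linear or the angular error grows.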

Empirical results show end-effector RMSE near 0.017 m and orientation errors around 1.8°, maintaining precision whether standing or walking and enabling “chicken-head” stabilization under base motion. The same multi-critic policy generalizes robustly to different gait patterns when contact-scheduling reward heads are included.

4. Architecture, Training Algorithms, and Curriculum Methods

Task alignment is further enforced through architectural and training-algorithm choices:

Asymmetric Actor-Critic: The actor (deployed on the robot) is limited to noisy, onboard proprioception, which keeps the policy transferable to real hardware. In contrast, the critic (used only during training in simulation) has access to privileged signals such as exact contact forces and Jacobians, which improves value and advantage estimation (Hou et al., 6 Jul 2025).
Network Architectures: Typically, deep MLPs (e.g., three fully connected layers of 256 units with ReLU) are used for both actor and critic. Policies output joint targets and log-standard deviations for each controllable DoF.
Implicit or Explicit Curriculum: Early training phases weight the kinematic or feasibility-guidance reward more heavily to ensure rapid expansion of reachability and avoidance of bad local optima. Later, task-specific and regularization terms take prominence, shaping fine-grained behavior (Hou et al., 6 Jul 2025).
Training Procedure: Policies are typically optimized with proximal policy optimization (PPO), using generalized advantage estimation (GAE). The full training cycle involves collecting rollouts in massively parallel simulation, computing shaped rewards (including feasibility terms), calculating advantages, and updating both policy and value networks using clipped objectives.
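The two numerical cores of the training procedure above, GAE and the PPO clipped surrogate, can be sketched in plain Python. This is a simplified illustration for a single rollout without episode terminations; the default `gamma`, `lam`, and `clip_eps` values are common choices rather than the settings of the cited papers.

```python
import math
from typing import List


def gae(rewards: List[float], values: List[float], last_value: float,
        gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation for one rollout (no 'done' handling)."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clip_loss(new_log_probs: List[float], old_log_probs: List[float],
                  advantages: List[float], clip_eps: float = 0.2) -> float:
    """Negative clipped surrogate objective, averaged over samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the two surrogate values.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

In an asymmetric actor-critic setup, `values` would come from the privileged critic while `new_log_probs` would come from the proprioception-only actor; the two functions themselves are unchanged.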

These choices yield sample-efficient, robust controllers that can be directly deployed on hardware platforms such as the DeepRobotics X20 with a Unitree Z1 manipulator (Hou et al., 6 Jul 2025).

5. Reward Design: Balancing Task Alignment, Regularization, and Physical Realism

Success in whole-body policy learning hinges on the design of multi-term reward functions that balance task accomplishment, physical realism, and regularization:

Task Terms: Encapsulate direct goals for locomotion (velocity and angular rate tracking) and manipulation (pose or twist tracking for the end-effector). Weightings are chosen to balance sub-tasks according to application needs (e.g., $w_v=0.5$, $w_\omega=0.3$, $w_p=0.6$, $w_k=0.16$).
Feasibility/Kinematic Rewards: Provide dense signals for physical achievability (e.g., reward for any torso posture from which an IK solution exists).
Regularization: Penalize high torque, torque transients, joint limit violations, collision events, and foot slippage. These are crucial for sim-to-real transfer by minimizing unphysical behaviors.
Curricula: Reward weights are adapted over the course of training to focus the policy on the most difficult or underexplored aspects. For instance, the feasibility reward dominates initially; as policy competency increases, regularization and high-fidelity task alignment become more prominent.
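The interplay of the terms listed above can be sketched as a weighted sum with a simple schedule. The weight values are the examples quoted in the text, but mapping $w_v$, $w_\omega$, $w_p$, and $w_k$ to linear velocity, angular rate, end-effector pose, and kinematic feasibility is an assumption, as are the linear ramp, `ramp_iters`, and `feas_boost`.

```python
from typing import Dict

# Example weights from the text; the term each one applies to is assumed.
WEIGHTS: Dict[str, float] = {
    "lin_vel": 0.5,       # w_v
    "ang_vel": 0.3,       # w_omega
    "ee_pose": 0.6,       # w_p
    "feasibility": 0.16,  # w_k
}


def curriculum_scales(iteration: int, ramp_iters: int = 1000,
                      feas_boost: float = 4.0) -> Dict[str, float]:
    """Hypothetical linear schedule: feasibility is boosted early and decays
    to its nominal weight, while regularization ramps up from zero."""
    progress = min(max(iteration / ramp_iters, 0.0), 1.0)
    return {
        "feasibility": feas_boost + (1.0 - feas_boost) * progress,
        "regularization": progress,
    }


def total_reward(task_terms: Dict[str, float], reg_penalty: float,
                 iteration: int) -> float:
    """Weighted task terms minus a scheduled regularization penalty."""
    scales = curriculum_scales(iteration)
    reward = 0.0
    for name, weight in WEIGHTS.items():
        scale = scales["feasibility"] if name == "feasibility" else 1.0
        reward += weight * scale * task_terms[name]
    return reward - scales["regularization"] * reg_penalty
```

With every task term equal to 1.0 and a regularization penalty of 0.5, this schedule gives a total of 2.04 at iteration 0 (feasibility dominant, no penalty) and 1.06 once the ramp has finished.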

The combined objective enables the emergence of sophisticated behaviors, such as smooth body tilting during arm extension, robust base tracking, and precise manipulation even under dual-task requirements.

6. Experimental Evaluation, Real-World Transfer, and Emergent Behaviors

Experimental setups span both large-scale simulation (e.g., 2048–5000 parallel environments) and real-robot platforms. Tasks include velocity and pose tracking, large-range manipulation, and complex behaviors (e.g., ribbon waving, wide-area filming, kneel-and-grasp, cart pushing) (Hou et al., 6 Jul 2025).

Key evaluation metrics include:

Inverse Kinematics (IK) Solution Rate: Fraction of planner targets kinematically reachable, directly assessing workspace expansion.
Velocity Tracking Error (LVTE, AVTE): Quantifies locomotion precision.
Pose Tracking Error (PE, RE): Measures end-effector accuracy in position (ℝ³) and orientation (SO(3)).
Comparative Performance: Unified, task-aligned policies are reported to outperform decoupled or separate-module baselines in workspace coverage, tracking precision, and robustness across the key metrics above.
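The first and third metrics above can be computed as in the following sketch. The function names are hypothetical; the orientation error is taken as the geodesic angle on SO(3) between unit quaternions in `(w, x, y, z)` order, which is one standard choice and may differ from the exact definition of RE in the cited papers.

```python
import math
from typing import Sequence


def ik_solution_rate(reachable_flags: Sequence[bool]) -> float:
    """Fraction of commanded targets for which an IK solution was found."""
    if not reachable_flags:
        raise ValueError("need at least one target")
    return sum(1 for flag in reachable_flags if flag) / len(reachable_flags)


def position_error(p_desired: Sequence[float],
                   p_actual: Sequence[float]) -> float:
    """Euclidean end-effector position error in metres."""
    return math.dist(p_desired, p_actual)


def orientation_error(q_desired: Sequence[float],
                      q_actual: Sequence[float]) -> float:
    """Geodesic angle in radians between two unit quaternions (w, x, y, z)."""
    dot = abs(sum(a * b for a, b in zip(q_desired, q_actual)))
    # Clamp to guard against rounding slightly above 1.0.
    return 2.0 * math.acos(min(1.0, dot))
```

Taking the absolute value of the quaternion dot product accounts for the fact that `q` and `-q` represent the same rotation, so the returned angle always lies in `[0, pi]`.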

Importantly, emergent whole-body behaviors are observed, such as synergistic body-arm actions to reach challenging poses, adaptive compromise in posture across conflicting tasks, and generalization to unseen command distributions—further validating the efficacy of task-aligned formulations.

7. Relationships to Broader Literature and Extensions

Task-aligned whole-body policy learning subsumes and extends prior approaches by:

Rejecting naive task decomposition: Avoids manual partitioning of action space and sub-objective assignment, which risks loss of synergy and limits performance on coupled tasks (Hu et al., 2023; Fu et al., 2022).
Emphasizing automatic discovery of action–reward causality: Causal policy gradient frameworks use statistical measures to align each action dimension to its directly affected reward term, yielding variance reduction and faster convergence (Hu et al., 2023).
Providing a scaffold for transferable, generalizable, and robust controllers: Task alignment enables high success rates in out-of-distribution conditions and on real hardware (Hou et al., 6 Jul 2025; Vijayan et al., 11 Jul 2025).

Recent research further explores alternatives such as adversarial optimization between body subsystems (Shi et al., 19 Apr 2025), sequential contact-phase decomposition for complex whole-body tasks (Zhang et al., 2024), and curriculum or multi-expert architectures to span large behavior sets (Yang et al., 22 Dec 2025).

References:

"Efficient Learning of A Unified Policy For Whole-body Manipulation and Locomotion Skills" (Hou et al., 6 Jul 2025)
"Multi-critic Learning for Whole-body End-effector Twist Tracking" (Vijayan et al., 11 Jul 2025)
"Causal Policy Gradient for Whole-Body Mobile Manipulation" (Hu et al., 2023)
"Deep Whole-Body Control: Learning a Unified Policy for Manipulation and Locomotion" (Fu et al., 2022)
"Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning" (Shi et al., 19 Apr 2025)
"WoCoCo: Learning Whole-Body Humanoid Control with Sequential Contacts" (Zhang et al., 2024)
"EGM: Efficiently Learning General Motion Tracking Policy for High Dynamic Humanoid Whole-Body Control" (Yang et al., 22 Dec 2025)