
Humanoid Control via Reinforcement Learning

Updated 20 December 2025
  • Humanoid controllers trained via reinforcement learning synthesize sensorimotor policies that manage high-dimensional, underactuated robot dynamics.
  • They employ hierarchical, hybrid, and end-to-end architectures to decouple planning from actuation, enabling versatile locomotion, manipulation, and robust disturbance rejection.
  • Integration of on-policy and off-policy RL techniques with strategies like domain randomization and reward shaping facilitates effective sim-to-real transfer and enhanced stability.

Humanoid control via reinforcement learning is a control-synthesis paradigm that leverages reinforcement learning (RL) to generate sensorimotor policies for complex, high-dimensional humanoid robots. The resulting controllers target tasks such as dynamic locomotion, push recovery, whole-body manipulation, human-like motion tracking, compliance under external force, and zero-shot generalization. Modern approaches architecturally decouple high-level planning from low-level actuation, using hybrid models, hierarchical planning, multi-agent specialization, or end-to-end learning frameworks. Unlike classic model-based controllers that rely on analytic models with limited adaptability, RL-based humanoid controllers can learn robust responses in the presence of model inaccuracies, environmental uncertainty, and unmodeled disturbances, often exceeding traditional approaches in versatility and resilience.

1. Algorithmic Architectures and Control Structure

Contemporary RL-based humanoid controllers range from hierarchical and hybrid designs to unified end-to-end architectures, with selection contingent on hardware constraints, task complexity, and physical model fidelity.

  • Hierarchical/planner-based approaches: Early architectures couple a dynamic locomotion planner (e.g., a phase-space planner or prismatic inverted pendulum model) with an RL policy that selects step timing and foot placement, providing goal-level or trajectory inputs to a whole-body locomotion controller (WBLC) that produces joint torques under prioritized multi-task QP control. These methods exploit analytic models for tractability and real-time guarantees, while RL augments robustness to disturbances (Kim et al., 2017).
  • Dual-/multi-agent decompositions: Recent work splits the control space into limbs (e.g., separate RL policies for lower-body locomotion and upper-body manipulation) coordinated through shared observations, centralized training, and coupled rewards. This modularization manages the curse of dimensionality and improves composite performance in challenging environments, such as those with strong force interactions or compliance requirements (Dong et al., 25 Nov 2025, He et al., 29 Sep 2025, Lee et al., 5 Jul 2025).
  • Residual and hybrid controllers: Hybrid LMC fuses a model-based LQR with an ensemble of neural Soft Actor-Critic (SAC) policies in a residual structure for balancing wheeled humanoids, relying on the stabilizing model-based feedback early in training and learning residual corrections as experience accumulates (Baek et al., 2022). Residual RL overlays learned corrections on open-loop reference trajectories to inject adaptability and human-like agility (Zhang et al., 5 Feb 2025); a minimal residual-control sketch appears after this list.
  • End-to-end policies: Direct mapping from raw or preprocessed proprioceptive history (sometimes with exteroceptive/task cues) to joint commands using MLPs, transformers, or structured state-space models (e.g., Mamba) is increasingly feasible with large-scale simulation and high-frequency actuation (Radosavovic et al., 2023, Wang et al., 22 Sep 2025).
  • Foundation model and zero-shot control: Unsupervised RL with policy regularization toward coverage of behavioral datasets can train latent-parametrized policies suited to zero-shot whole-body control, motion imitation, and downstream goal/reward tasks without retraining (Tirinzoni et al., 15 Apr 2025).
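
The residual structure referenced above can be summarized in a few lines of Python. This is a minimal, hedged sketch: the names (`reference_action`, `ResidualPolicy`, `control_step`), the gait frequency, and the blending schedule are illustrative assumptions, not the interface of any cited system.

```python
# Minimal sketch of a residual control structure (illustrative only; names and
# constants are assumptions, not any paper's API).
import numpy as np


def reference_action(t: float, n_joints: int) -> np.ndarray:
    """Open-loop reference, e.g. a precomputed trajectory or model-based command."""
    phase = 2.0 * np.pi * 1.5 * t  # assumed 1.5 Hz gait phase
    return 0.2 * np.sin(phase) * np.ones(n_joints)


class ResidualPolicy:
    """Stand-in for a learned network mapping observations to bounded residuals."""

    def __init__(self, n_joints: int, residual_scale: float = 0.1):
        self.n_joints = n_joints
        self.residual_scale = residual_scale

    def __call__(self, obs: np.ndarray) -> np.ndarray:
        # A trained network would go here; zeros keep the sketch runnable.
        return np.zeros(self.n_joints)


def control_step(t, obs, policy, blend, n_joints=23):
    """Final command = reference + blended, clipped learned residual."""
    a_ref = reference_action(t, n_joints)
    a_res = np.clip(policy(obs), -policy.residual_scale, policy.residual_scale)
    return a_ref + blend * a_res


if __name__ == "__main__":
    policy = ResidualPolicy(n_joints=23)
    q_target = control_step(t=0.1, obs=np.zeros(64), policy=policy, blend=0.5)
    print(q_target.shape)  # (23,) joint-position targets for a downstream PD loop
```

Clipping the residual and ramping the blend factor keeps the model-based term dominant early, which is the property the hybrid and residual designs above exploit.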

2. Core RL Algorithms and Training Procedures

State-of-the-art humanoid RL controllers employ both on-policy algorithms (e.g., PPO-style actor-critic methods) and off-policy algorithms (e.g., Soft Actor-Critic), with modifications for scalability and training stability.

Training procedures typically leverage massive GPU-parallel simulation, curriculum scheduling (e.g., over disturbance magnitude or reference velocity), domain randomization (physical parameters, external forces, delays), and reward normalization to aid convergence and sim-to-real robustness (Dong et al., 25 Nov 2025, Zhang et al., 11 Mar 2025, Radosavovic et al., 2023).
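
As a concrete illustration of curriculum scheduling combined with domain randomization, the sketch below samples per-environment physical parameters and grows the push-force magnitude once a reward threshold is cleared. All ranges, thresholds, and field names are assumptions chosen for the example, not values from the cited papers.

```python
# Illustrative curriculum + domain randomization scheduling; ranges are assumptions.
import random
from dataclasses import dataclass


@dataclass
class DomainRandomization:
    friction: tuple = (0.4, 1.2)         # ground friction coefficient range
    mass_scale: tuple = (0.9, 1.1)       # link-mass multiplier range
    motor_delay_ms: tuple = (0.0, 20.0)  # actuation delay range
    push_force_n: float = 0.0            # curriculum-controlled push magnitude

    def sample(self) -> dict:
        return {
            "friction": random.uniform(*self.friction),
            "mass_scale": random.uniform(*self.mass_scale),
            "motor_delay_ms": random.uniform(*self.motor_delay_ms),
            "push_force_n": random.uniform(0.0, self.push_force_n),
        }


def update_curriculum(dr: DomainRandomization, mean_episode_reward: float,
                      threshold: float = 0.8, max_push: float = 200.0) -> None:
    """Increase disturbance magnitude once the policy clears a reward threshold."""
    if mean_episode_reward > threshold:
        dr.push_force_n = min(dr.push_force_n + 10.0, max_push)


if __name__ == "__main__":
    dr = DomainRandomization()
    for epoch in range(5):
        params = dr.sample()  # one sample per parallel environment in practice
        update_curriculum(dr, mean_episode_reward=0.85)
    print(params, dr.push_force_n)
```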

3. State, Action, and Observation Design

Careful selection of policy inputs and outputs is crucial given the high DoF count and underactuated dynamics of humanoids.

  • State (observation) vectors: These typically contain high-dimensional proprioceptive measurements (joint angles, velocities), floating-base orientation/velocity, foot contact flags, ground reaction forces, delayed action histories, projected gravity vectors, and task/goal labels (e.g., desired velocity, footstep targets, or latent context) (Wang et al., 22 Sep 2025, Dong et al., 25 Nov 2025, Lee et al., 5 Jul 2025). Multi-agent and dual-level architectures partition observation vectors by agent, with shared global and local features.
  • Action spaces: Most policies parameterize actions as target joint positions or residuals, suitable for tracking by high-frequency PD/impedance/torque controllers. Some advanced designs output feedforward torques, task-space stiffness, or compliance modulation coefficients for dynamic force regulation (He et al., 29 Sep 2025, Atamuradov, 15 Nov 2025).
  • Observation normalization and augmentation: Training routines ubiquitously normalize observations online to stabilize gradient flow. Augmented features (history windows, reference trajectories, contact flags, task embeddings) enable richer behavior and adaptability (Wang et al., 22 Sep 2025, Radosavovic et al., 2023).
  • Privileged information in critics: Asymmetric actor-critic frameworks expose non-observable simulation variables (e.g., external force, spring–damper states, exact inertias) to the critic only, boosting sample efficiency without violating execution constraints (Dong et al., 25 Nov 2025); a minimal sketch of this actor/critic split appears after this list.
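
The sketch below illustrates the observation-design points above: a running normalizer, an actor observation built from deployable proprioceptive signals, and a critic observation that appends privileged simulator state. Feature names, dimensions, and the Welford-style normalizer are illustrative assumptions.

```python
# Observation assembly with online normalization and an asymmetric actor/critic
# input split; feature names and dimensions are illustrative assumptions.
import numpy as np


class RunningNorm:
    """Welford-style running mean/variance for online observation normalization."""

    def __init__(self, dim: int, eps: float = 1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps

    def update(self, x: np.ndarray) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x: np.ndarray) -> np.ndarray:
        return (x - self.mean) / np.sqrt(self.var + 1e-8)


def actor_obs(joint_pos, joint_vel, base_ang_vel, projected_gravity, command):
    """Deployable observations only (proprioception + task command)."""
    return np.concatenate([joint_pos, joint_vel, base_ang_vel, projected_gravity, command])


def critic_obs(actor_observation, external_force, true_friction):
    """Critic additionally receives privileged simulator state (never used on hardware)."""
    return np.concatenate([actor_observation, external_force, [true_friction]])


if __name__ == "__main__":
    n_joints = 23
    norm = RunningNorm(dim=2 * n_joints + 3 + 3 + 3)
    o_a = actor_obs(np.zeros(n_joints), np.zeros(n_joints),
                    np.zeros(3), np.array([0.0, 0.0, -1.0]), np.array([0.5, 0.0, 0.0]))
    norm.update(o_a)
    o_c = critic_obs(norm.normalize(o_a), external_force=np.zeros(3), true_friction=0.8)
    print(o_a.shape, o_c.shape)
```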

4. Reward Function Engineering and Physical Evaluation

Reward function design varies by task—ranging from analytic or model-inspired shaping to latent adversarial regularization.

  • Task/goal tracking: Standard terms include exponential-kernel tracking rewards that penalize deviation of CoM/base velocity, position, and orientation, as well as explicit terms for foot placement, force smoothness, swing-leg clearance, and joint-level posture (Zhang et al., 5 Feb 2025, Ding et al., 10 May 2025, Radosavovic et al., 2023, Wang et al., 22 Sep 2025); a composite-reward sketch follows this list.
  • Imitation and style: Bounded-residual and motion-tracking policies maximize style-weighted similarity between robot state and reference human motion, augmented by root pose and contact pattern synchronization (Yan et al., 14 May 2024, Zhang et al., 5 Feb 2025).
  • Stability and safety: Terms for minimal control effort, power consumption, torque change, and deviation from stable postures are employed to regularize exploration and limit unphysical behavior (Wang et al., 22 Sep 2025, Dong et al., 25 Nov 2025).
  • Compliance and interaction: For tasks under significant external force, explicit force-tracking, compliance modulation, and disturbance rejection rewards facilitate robust whole-body behavior (Dong et al., 25 Nov 2025, He et al., 29 Sep 2025).
  • Physical or theoretical certificates: CLF-guided rewards replace heuristic shaping, penalizing violations of certified stability conditions (e.g., required Lyapunov-function decrease rates), and yield interpretable, provably stable behavior across domains (stance, flight) (Olkin et al., 23 Sep 2025).
  • Unsupervised regularization: Behavioral foundation models exploit representation learning and adversarial regularization to align unsupervised RL exploration with motion-capture datasets for broad zero-shot coverage (Tirinzoni et al., 15 Apr 2025).
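
The composite-reward sketch referenced above combines an exponential tracking kernel with effort and smoothness regularizers. The weights, kernel widths, and the nominal base height are illustrative assumptions, not values from any cited work.

```python
# Minimal composite reward with exponential tracking kernels and regularizers;
# all weights and scales are illustrative assumptions.
import numpy as np


def tracking_term(error: np.ndarray, sigma: float) -> float:
    """Exponential kernel: 1 at zero error, decaying with squared error."""
    return float(np.exp(-np.sum(np.square(error)) / sigma**2))


def composite_reward(base_vel, cmd_vel, base_height, torques, prev_action, action):
    r = 0.0
    r += 1.0 * tracking_term(base_vel[:2] - cmd_vel[:2], sigma=0.25)      # planar velocity tracking
    r += 0.5 * tracking_term(np.array([base_height - 0.95]), sigma=0.05)  # nominal base height (assumed 0.95 m)
    r += -1e-4 * float(np.sum(np.square(torques)))                        # control effort penalty
    r += -0.01 * float(np.sum(np.square(action - prev_action)))           # action-rate / smoothness penalty
    return r


if __name__ == "__main__":
    n_joints = 23
    r = composite_reward(
        base_vel=np.array([0.48, 0.02, 0.0]),
        cmd_vel=np.array([0.5, 0.0, 0.0]),
        base_height=0.94,
        torques=np.zeros(n_joints),
        prev_action=np.zeros(n_joints),
        action=np.zeros(n_joints),
    )
    print(round(r, 3))
```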

Physical evaluation commonly involves disturbance robustness (push magnitude/direction statistics), tracking errors (root/joint/force), smoothness/energy consumption, and (where feasible) deployment to hardware for sim-to-real validation (Dong et al., 25 Nov 2025, Zhang et al., 11 Mar 2025, Radosavovic et al., 2023, Zhang et al., 5 Feb 2025).
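
A hedged sketch of how such evaluation metrics might be computed from logged rollouts is given below; the log format and the specific metrics (push-recovery rate, velocity-tracking RMSE, mean mechanical power as an energy proxy) are assumptions for illustration.

```python
# Common evaluation metrics computed from logged rollouts; log layout is assumed.
import numpy as np


def push_recovery_rate(fell_flags) -> float:
    """Fraction of perturbed episodes that did not end in a fall."""
    fell = np.asarray(fell_flags, dtype=bool)
    return float(1.0 - fell.mean())


def velocity_tracking_rmse(base_vel_log, cmd_vel_log) -> float:
    err = np.asarray(base_vel_log) - np.asarray(cmd_vel_log)
    return float(np.sqrt(np.mean(np.sum(err**2, axis=-1))))


def mean_mechanical_power(torque_log, joint_vel_log) -> float:
    """Proxy for energy consumption: mean |tau . qdot| over the rollout."""
    power = np.abs(np.sum(np.asarray(torque_log) * np.asarray(joint_vel_log), axis=-1))
    return float(power.mean())


if __name__ == "__main__":
    T, n_joints = 500, 23
    print(push_recovery_rate([False, False, True, False]))                 # 0.75
    print(velocity_tracking_rmse(np.zeros((T, 3)), np.full((T, 3), 0.1)))  # ~0.173
    print(mean_mechanical_power(np.ones((T, n_joints)), np.ones((T, n_joints))))  # 23.0
```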

5. Real-World Deployment and Sim-to-Real Transfer

Transferring RL-trained controllers from simulation to hardware is a central challenge. It is addressed through complementary mechanisms: domain randomization over physical parameters, external forces, and delays; curriculum scheduling of disturbance magnitudes; observation normalization; and asymmetric training that confines privileged simulator state to the critic so that deployed policies rely only on signals measurable onboard (see Sections 2 and 3).
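
As one concrete illustration, the wrapper below applies two mechanisms commonly used for sim-to-real robustness, observation noise and actuation delay, around a simulator exposing a step(action) -> obs interface. The wrapper, the DummyEnv stand-in, and all noise/delay values are assumptions for the sketch, not any paper's implementation.

```python
# Thin wrapper adding sensor noise and actuation delay; the step interface is assumed.
from collections import deque
import numpy as np


class Sim2RealWrapper:
    """Wraps a simulator exposing step(action) -> obs, adding noise and delay."""

    def __init__(self, env, obs_noise_std=0.01, delay_steps=2):
        self.env = env
        self.obs_noise_std = obs_noise_std
        # Hold the last few actions so the simulator executes a delayed command.
        self.action_buffer = deque(maxlen=delay_steps + 1)

    def step(self, action: np.ndarray) -> np.ndarray:
        self.action_buffer.append(action)
        delayed = self.action_buffer[0]  # oldest buffered action
        obs = self.env.step(delayed)
        return obs + np.random.normal(0.0, self.obs_noise_std, size=obs.shape)


class DummyEnv:
    """Stand-in simulator for the sketch: echoes the action as the observation."""

    def step(self, action: np.ndarray) -> np.ndarray:
        return np.asarray(action, dtype=float)


if __name__ == "__main__":
    env = Sim2RealWrapper(DummyEnv(), obs_noise_std=0.01, delay_steps=2)
    for t in range(4):
        obs = env.step(np.full(23, float(t)))
    print(obs[:3])  # reflects the delayed action, plus sensor noise
```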

6. Applications, Evaluations, and Research Advances

RL-based controllers have realized new capabilities beyond traditional methods:

  • Dynamic and complex locomotion: Real-time multi-contact walking, running with flight phases, rough terrain traversal, omnidirectional steering, and fast push recovery surpass ZMP-based and pure analytic planners in versatility (Kim et al., 2017, Olkin et al., 23 Sep 2025, Zhang et al., 11 Mar 2025).
  • Whole-body loco-manipulation: Coupled walking and object manipulation under strong interaction forces (e.g., rope pulls, platform suspension) are possible via dual-agent force-adaptive RL and compliant control (Dong et al., 25 Nov 2025, He et al., 29 Sep 2025).
  • Motion imitation and multi-robot generalization: Single RL policies can simultaneously imitate human movements over thousands of diverse tasks and generalize across heterogeneous robot morphologies without per-task reward retuning (Yan et al., 14 May 2024, Tirinzoni et al., 15 Apr 2025).
  • Emergent and interpretable behaviors: Careful reward design and appropriate architecture yield limb-level strategies such as human-like arm swing (for angular-momentum damping), knee locking for passive stability, underactuated toe/heel push-off, and adaptive compliance (Lee et al., 5 Jul 2025, Yang et al., 2018).
  • Data efficiency and accelerated learning: Self-imitative RL, distributional value critics, massive parallelization, and curriculum-based progression have reduced sample complexity by 2–3× and cut wall-clock training times for high-DoF control to under three hours (Seo et al., 28 May 2025, Zhuang et al., 24 Feb 2025).

Principal limitations center on the sim-to-real gap (especially for perception-limited tasks), force/torque observability, model inaccuracies, and the coupling assumptions made when policies are modularized hierarchically. Research is ongoing in end-to-end perception-action integration, onboard-only sensor architectures, automatic hierarchy learning, curriculum optimization, and enforcement of physical certificates (CLF, passivity) (Olkin et al., 23 Sep 2025, Dong et al., 25 Nov 2025, Yan et al., 14 May 2024, Tirinzoni et al., 15 Apr 2025).
