Sim-to-Real Humanoid Locomotion

Updated 3 December 2025
  • Sim-to-real humanoid locomotion is the systematic transfer of simulation-trained control policies to physical robots, ensuring robust performance despite the reality gap.
  • It employs precise actuator modeling, domain randomization with noise injection, and curriculum learning to mitigate discrepancies between simulated and real-world environments.
  • Advanced policy innovations such as history embedding, contrastive representation learning, and explicit state estimation enable zero-shot transfer and high-fidelity hardware performance.

Sim-to-real humanoid locomotion refers to the systematic transfer of control policies, typically obtained via simulation-based reinforcement learning or optimization, onto physical humanoid robot hardware. The goal is robust, adaptive, high-performance locomotion across diverse real-world environments despite the significant "reality gap" between simulated physics and physical actuation, sensing, and contact phenomena. The field addresses challenges arising from model inaccuracy, actuator and transmission nonlinearity, partial observability, and the safety constraints unique to high-dimensional, underactuated humanoid robots.

1. Reality Gap and Actuator Modeling in Humanoid Sim-to-Real

A fundamental obstacle for sim-to-real transfer of bipedal walking is the reality gap, particularly exacerbated by actuator characteristics (gear reduction, backdrivability, friction), hardware constraints (sensor limitations), and contact uncertainties.

In gear-driven humanoids, where high reduction-ratio gears introduce pronounced non-backdrivable compliance, neglecting actuator and transmission losses in simulation causes major policy failures on hardware. To mitigate this, precise actuator modeling is critical. For example, "Sim-to-Real Transfer of Compliant Bipedal Locomotion on Torque Sensor-Less Gear-Driven Humanoid" (Masuda et al., 2022) introduced a detailed dynamic model of current-controlled DC motors:

$$\tau_m = K_t I, \qquad I = \frac{V_{\rm pwm} - V_{\text{back-emf}}}{R_{\rm ter}}, \qquad V_{\text{back-emf}} = K_t \dot{q}$$

and a Directional Transmission Efficiency (DTE) gear/actuator model capturing asymmetric losses and backdrivability:

$$\tau_{\rm brake}(\tau_m, \tau_a) = \begin{cases} -(1 - \eta_{\rm fw})\,\tau_m, & \text{if } \mathrm{sign}(\eta_{\rm fw}\tau_m + \tau_a) = \mathrm{sign}(\tau_m) \\ -(1 - \eta_{\rm bw})\,\tau_a, & \text{if } \mathrm{sign}(\tau_m + \eta_{\rm bw}\tau_a) = \mathrm{sign}(\tau_a) \\ -(\tau_m + \tau_a), & \text{otherwise} \end{cases}$$

together with a Stribeck friction model:

$$\tau_{\rm fric} = f_c + \exp(-|\dot{q}| / \dot{q}_{\rm static})\,(f_s - f_c) + k_v |\dot{q}|$$

This modeling, paired with a two-stage system identification methodology that uses both excitation trials and failed walking trials for reward-distribution alignment (via the Wasserstein-1 distance), enables hardware-stabilized, torque-sensorless, compliant walking, even on uneven surfaces and under impulsive disturbances, where a baseline relying on domain randomization alone fails (0/3 transfers).
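To make the interaction of these three terms concrete, here is a minimal NumPy sketch of a per-joint torque model built from the equations above. All numeric constants are illustrative placeholders, not identified parameters of the actual hardware, and the convention that friction opposes joint velocity is an assumption of this sketch.

```python
import numpy as np

# Placeholder constants for illustration; real values come from SysID.
K_T = 0.021        # motor torque constant [Nm/A]
R_TER = 0.84       # terminal resistance [ohm]
ETA_FW = 0.7       # forward transmission efficiency (motor drives joint)
ETA_BW = 0.5       # backward transmission efficiency (joint drives motor)
F_S, F_C = 0.30, 0.20   # static and Coulomb friction torques [Nm]
K_V = 0.01         # viscous friction coefficient [Nm*s/rad]
QD_STATIC = 0.1    # Stribeck velocity scale [rad/s]

def motor_torque(v_pwm: float, qd: float) -> float:
    """Current-controlled DC motor: tau_m = K_t * I with back-EMF drop."""
    v_back_emf = K_T * qd
    return K_T * (v_pwm - v_back_emf) / R_TER

def brake_torque(tau_m: float, tau_a: float) -> float:
    """Directional Transmission Efficiency (DTE): asymmetric gear losses."""
    if np.sign(ETA_FW * tau_m + tau_a) == np.sign(tau_m):
        return -(1.0 - ETA_FW) * tau_m   # forward-driven: motor drives load
    if np.sign(tau_m + ETA_BW * tau_a) == np.sign(tau_a):
        return -(1.0 - ETA_BW) * tau_a   # back-driven: load drives motor
    return -(tau_m + tau_a)              # locked: net torque is cancelled

def stribeck_friction(qd: float) -> float:
    """Stribeck friction magnitude, as in the formula above."""
    return F_C + np.exp(-abs(qd) / QD_STATIC) * (F_S - F_C) + K_V * abs(qd)

def joint_torque(v_pwm: float, qd: float, tau_a: float) -> float:
    """Net joint torque: motor + DTE braking - velocity-opposing friction."""
    tau_m = motor_torque(v_pwm, qd)
    return tau_m + brake_torque(tau_m, tau_a) - np.sign(qd) * stribeck_friction(qd)
```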

2. Domain Randomization, Noise Injection, and Curriculum Strategies

Domain randomization (DR) remains a centerpiece of many sim-to-real pipelines: it broadens the set of simulated environments by perturbing physical parameters (mass, friction, link inertia) and injecting sensing/actuation noise, forcing policies to become robust to unmodeled discrepancies (Gu et al., 8 Apr 2024, Wang et al., 9 Mar 2024, Wang et al., 18 Jun 2025). However, DR is often necessary but not sufficient for sim-to-real transfer, especially in the presence of hardware-dependent actuation effects and complex task requirements.

Advanced alternatives include nonparametric noise injection. The "Joint Torque Space Perturbation Injection" framework (Cha et al., 9 Apr 2025) samples an episode-specific, state-dependent torque-space perturbation $\tau_\phi(s_{\text{priv},t})$ given by a zero-bias, randomly initialized MLP, augmenting the simulator with a much richer variety of disturbance functions than hand-tuned $p_{\rm DR}$ parameter sampling. Empirically, policies trained via such perturbations demonstrated quantifiably superior hardware robustness, surviving unseen actuator stiffness or contact compliance that universally crashed DR policies.
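As a rough sketch of the idea (not the paper's exact architecture), the perturbation network below is a zero-bias MLP whose weights are redrawn at every episode reset; the hidden size, initialization scale, and output scaling are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EpisodicTorquePerturbation(nn.Module):
    """Zero-bias MLP tau_phi(s_priv) whose weights are redrawn each episode.

    Hidden size, init scale, and output scaling are illustrative assumptions.
    """
    def __init__(self, priv_dim: int, num_joints: int, hidden: int = 64,
                 scale: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(priv_dim, hidden, bias=False), nn.Tanh(),
            nn.Linear(hidden, num_joints, bias=False),
        )
        self.scale = scale
        self.reset()

    @torch.no_grad()
    def reset(self) -> None:
        # Redrawing the weights at every episode start yields a different
        # state-dependent disturbance *function* per episode, a richer
        # family than resampling scalar DR parameters.
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, std=m.weight.shape[1] ** -0.5)

    @torch.no_grad()
    def forward(self, s_priv: torch.Tensor) -> torch.Tensor:
        return self.scale * self.net(s_priv)

# Simulator step: tau_applied = tau_policy + perturb(s_priv_t)
```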

Another critical approach is curriculum learning: e.g., "Robust Humanoid Walking on Compliant and Uneven Terrain" (Singh et al., 18 Apr 2025) phases training by starting on flat, rigid floors and incrementally introducing terrain variability (randomized compliance, step heights, bump placement) and robot-dynamics perturbations. This multi-stage curriculum prevents brittle overfitting and ensures the emergence of adaptable, aperiodic walking policies transferable to the HRP-5P hardware; a minimal schedule sketch follows.
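The sketch below assumes a hypothetical TerrainStage config with placeholder numbers; the actual stages and advancement triggers in (Singh et al., 18 Apr 2025) differ.

```python
from dataclasses import dataclass

@dataclass
class TerrainStage:
    name: str
    stiffness_range: tuple   # ground-compliance randomization bounds [N/m]
    step_height_max: float   # max terrain step height [m]
    dynamics_noise: float    # scale of mass/friction perturbations

# Three illustrative stages, from rigid flat ground to soft rough terrain.
CURRICULUM = [
    TerrainStage("flat_rigid",  (1e6, 1e6), 0.00, 0.0),
    TerrainStage("mild_uneven", (1e4, 1e6), 0.05, 0.5),
    TerrainStage("soft_rough",  (1e3, 1e6), 0.12, 1.0),
]

def stage_for(progress: float) -> TerrainStage:
    """Map a training-progress fraction in [0, 1] to a stage. Practical
    curricula usually advance on tracked success rates instead."""
    idx = min(int(progress * len(CURRICULUM)), len(CURRICULUM) - 1)
    return CURRICULUM[idx]
```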

3. Representational and Policy Learning Innovations

Modern sim-to-real locomotion leverages advanced neural policy architectures to encode history, infer latent state, and compensate for partial observability.

History embedding is critical; causal transformers and LSTM architectures allow in-context adaptation to actuator and environment parameters by reading off hidden latent states from observation/action histories (Radosavovic et al., 2023, Radosavovic et al., 29 Feb 2024). For example, the transformer-based policy of (Radosavovic et al., 2023) deployed on Digit used histories of length 16, mapping the interleaved sequence $[\ldots, o_{t-1}, a_{t-2}, o_t, a_{t-1}]$ to the action $a_t$, which enabled zero-shot transfer across terrains, handling impulsive disturbances and acting robustly under massive domain randomization.
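A compact PyTorch sketch of this interface follows; the dimensions, depth, and head counts are illustrative assumptions, not the deployed Digit architecture.

```python
import torch
import torch.nn as nn

class HistoryPolicy(nn.Module):
    """Causal-transformer policy over an interleaved (o, a) history.

    Sketch of the interface only; hyperparameters are illustrative."""
    def __init__(self, obs_dim: int, act_dim: int, d_model: int = 128):
        super().__init__()
        self.embed_obs = nn.Linear(obs_dim, d_model)
        self.embed_act = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, obs_hist: torch.Tensor, act_hist: torch.Tensor):
        # obs_hist: (B, T, obs_dim); act_hist: (B, T, act_dim), where the
        # action at index t is a_{t-1}, aligned with observation o_t.
        # Interleave to [..., o_{t-1}, a_{t-2}, o_t, a_{t-1}].
        tokens = torch.stack(
            (self.embed_obs(obs_hist), self.embed_act(act_hist)), dim=2
        ).flatten(1, 2)                                   # (B, 2T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.shape[1]).to(tokens.device)
        h = self.encoder(tokens, mask=mask)
        return self.head(h[:, -1])                        # action a_t

# Usage: policy(obs_hist, act_hist) with T = 16, matching the paper's window.
```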

Contrastive representation learning further distills privileged simulator state into deployable actor latents. In (Lu et al., 16 Sep 2025), the critic is furnished with terrain height-maps and privileged physical knowledge, while the actor is trained to produce latent states (via GRU) that are contrastively aligned with the critic's feature embeddings. This "distilled awareness" enables the actor to proactively modulate an adaptive gait clock using only proprioceptive input, yielding robust traversal of 30 cm steps and 26.5° slopes zero-shot on hardware—benchmarking well above domain randomized or fixed-clock baselines.
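One way to express such an alignment objective is an InfoNCE-style loss over matched actor/critic pairs within a batch. This is a generic sketch; the exact objective and temperature used in (Lu et al., 16 Sep 2025) may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(actor_latents: torch.Tensor,
                               critic_features: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss pulling the actor's proprioceptive latent toward
    the critic's privileged embedding of the same transition.

    Both inputs: (B, D), with row i of each tensor taken from the same
    timestep; other rows in the batch serve as negatives."""
    z_a = F.normalize(actor_latents, dim=-1)
    z_c = F.normalize(critic_features, dim=-1)
    logits = z_a @ z_c.t() / temperature         # (B, B) similarity matrix
    targets = torch.arange(z_a.shape[0], device=z_a.device)
    # Matching pairs lie on the diagonal and act as positives.
    return F.cross_entropy(logits, targets)
```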

Explicit learned state estimation is another axis of progress. Quantitative saliency analysis in (Wang et al., 9 Mar 2024) demonstrates that explicit base linear velocity estimation—derived from short windows of proprioception—provides the dominant signal (>80% saliency) for sim-to-real performance, with local heightmap estimation further boosting transfer reliability on unstructured terrains.
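A minimal version of such an explicit estimator is sketched below, assuming a flattened proprioception window and supervised training against simulator ground truth; the window length and layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class BaseVelocityEstimator(nn.Module):
    """Regress base linear velocity from a short proprioception window.

    Sketch of the explicit-estimation idea; sizes are illustrative."""
    def __init__(self, proprio_dim: int, window: int = 10, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (B, window * proprio_dim)
            nn.Linear(window * proprio_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 3),                # v_x, v_y, v_z of the base
        )

    def forward(self, proprio_window: torch.Tensor) -> torch.Tensor:
        # proprio_window: (B, window, proprio_dim) of joint pos/vel, IMU, etc.
        return self.net(proprio_window)

# Training: loss = F.mse_loss(estimator(window), v_base_sim); at deployment
# the estimate is concatenated to the policy observation.
```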

4. Reward Design, Constraints, and Safety

Reward function design determines the learned policy’s priorities and is a primary tuning knob for balancing locomotor performance, stability, and transferability.

Hierarchical reward compositions are common, mixing velocity and orientation tracking, center-of-mass stabilization, energy efficiency, smoothness, and contact regularity (Cha et al., 9 Apr 2025, Gu et al., 8 Apr 2024). Humanoid-Gym (Gu et al., 8 Apr 2024) uses Gaussian kernels on velocity/orientation errors plus multi-term regularization. More advanced pipelines integrate comfort and safety, e.g., via Control Barrier Function (CBF) costs within a Constrained Markov Decision Process (CMDP) framework (Wang et al., 11 Aug 2025). Here, CBF-derived instantaneous costs enforce collision-free proximity and joint/torque safety constraints during both simulation and hardware deployment. This approach produces demonstrably better navigation around static/dynamic obstacles and greater comfort in human-robot interaction settings.
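The sketch below illustrates both ingredients: Gaussian-kernel tracking terms with regularization penalties, and a discrete-time CBF violation cost usable as a CMDP constraint. The weights, kernel widths, and the specific barrier construction are assumptions for illustration, not the published values.

```python
import numpy as np

def gaussian_kernel(err, sigma: float) -> float:
    """exp(-||err||^2 / sigma^2) tracking kernel."""
    err = np.atleast_1d(np.asarray(err, dtype=float))
    return float(np.exp(-np.dot(err, err) / sigma ** 2))

def locomotion_reward(v_err, yaw_err, tau, action_delta) -> float:
    """Hierarchical reward mixing tracking and regularization terms."""
    r_track = 1.0 * gaussian_kernel(v_err, sigma=0.25)           # velocity
    r_orient = 0.5 * gaussian_kernel(yaw_err, sigma=0.25)        # orientation
    r_energy = -1e-4 * float(np.sum(np.square(tau)))             # torque
    r_smooth = -1e-2 * float(np.sum(np.square(action_delta)))    # action rate
    return r_track + r_orient + r_energy + r_smooth

def cbf_cost(h: float, h_next: float, alpha: float = 0.1) -> float:
    """Instantaneous CMDP cost from a discrete-time control barrier function
    h(x) >= 0 (e.g., signed distance to an obstacle): penalize violations
    of h_next >= (1 - alpha) * h. One common construction; the published
    cost may differ."""
    return max(0.0, (1.0 - alpha) * h - h_next)
```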

Smoothness and actuator-friendly policies are critical for sim-to-real. "Lipschitz-Constrained Policies" (LCP) (Chen et al., 15 Oct 2024) enforce action smoothness via a differentiable gradient penalty on the policy, eliminating the requirement for non-differentiable filters or additional reward shaping for smoothness, and yielding robust real-hardware deployment with low mechanical wear and energy consumption.
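In spirit, such a penalty differentiates the policy output with respect to its input and penalizes the gradient norm. The sketch below uses the summed action outputs as a cheap surrogate for the full policy Jacobian, which is an assumption of this illustration rather than the paper's exact formulation.

```python
import torch

def lipschitz_penalty(policy, obs: torch.Tensor, weight: float = 1e-3):
    """Differentiable gradient penalty encouraging smooth actions.

    Penalizes the gradient of the summed action outputs w.r.t. the
    observation; the weight is an illustrative placeholder."""
    obs = obs.clone().requires_grad_(True)
    actions = policy(obs)                 # assumes a deterministic head
    grad = torch.autograd.grad(actions.sum(), obs, create_graph=True)[0]
    return weight * grad.pow(2).sum(dim=-1).mean()

# Training objective: total_loss = rl_loss + lipschitz_penalty(policy, obs)
```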

5. System Identification, Perception, and Sensing

System identification (SysID) remains vital for narrowing the reality gap in hardware-deployable policies. In (Masuda et al., 2022), a two-stage identification pipeline—first matching motion trajectories in a simple task, then aligning simulated and real-world cumulative reward distributions using Wasserstein distance on failed and successful trials—enables robust transfer, capturing fine-grained actuation phenomena not amenable to randomization alone.
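The distribution-alignment stage can be written compactly with SciPy's empirical Wasserstein-1 distance; `rollout_fn` and the optimizer choice below are hypothetical stand-ins for the paper's pipeline.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.optimize import minimize

def sim_real_gap(sim_params, real_returns, rollout_fn) -> float:
    """Wasserstein-1 distance between simulated and real episodic-return
    distributions. rollout_fn (hypothetical) runs the simulator with the
    candidate parameters and returns per-episode cumulative rewards;
    real_returns pools failed and successful hardware trials."""
    sim_returns = np.asarray(rollout_fn(sim_params))
    return wasserstein_distance(sim_returns, np.asarray(real_returns))

# Stage 2 of the pipeline: tune simulator parameters until the two return
# distributions match (the optimizer choice here is an assumption):
# result = minimize(lambda p: sim_real_gap(p, real_returns, rollout_fn),
#                   x0=initial_params, method="Nelder-Mead")
```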

When traditional SysID is infeasible or high-fidelity actuation models are lacking, leveraging richer on-robot sensors can compensate. "Learning Bipedal Locomotion on Gear-Driven Humanoid Robot Using Foot-Mounted IMUs" (Katayama et al., 1 Apr 2025) bypasses the need for gear/friction identification by providing the policy with left/right foot IMU signals, enabling it to infer contact and rapidly stabilize on perturbed and deformable terrain. Zero-shot transfer experiments show marked improvement in capability and reliability over policies using only base-mounted IMU or kinematics.

Exteroceptive perception now figures prominently in advanced sim-to-real pipelines. VIRAL (He et al., 19 Nov 2025) demonstrates the first visual loco-manipulation policy: a large-scale teacher-student RL distillation setup with vision-based students (DINOv3 RGB backbone) achieves robust zero-shot manipulation and walking across a spectrum of lighting, material, and camera-extrinsics variation, enabled by extensive visual domain randomization and explicit real-to-sim hand/camera alignment.

6. Quantitative Performance and Hardware Validation

Benchmarks across diverse platforms and methods converge on several consistent findings:

  • Precise actuator/transmission modeling and reward-distribution-based SysID yield 100% hardware transfer success (balancing, walking on uneven bricks) for the ROBOTIS-OP3, with trajectory errors and reward distributions tightly matched to simulation (Masuda et al., 2022).
  • Proprioception-only latent state models enable robust transfer to dynamic, aperiodic walking on compliant/uneven terrains in HRP-5P and other full-sized humanoids across both indoor and rough outdoor terrains (Singh et al., 18 Apr 2025).
  • Denoising world models (Gu et al., 26 Aug 2024) yield 100% zero-shot real-world success across slopes, stair climbing, and complex irregular terrain, outperforming both standard PPO and passive-ankle configurations on XBot-S hardware.
  • Visual loco-manipulation policies deployed on the Unitree G1 achieved 91.5% uninterrupted multi-cycle success, with mean loop times matching expert teleoperation and generalizing to novel objects/camera/lighting settings (He et al., 19 Nov 2025).
  • Fast off-policy RL pipelines achieve full locomotor competence in 15 minutes of training and transfer robustly to G1 and Booster T1 robots under heavy domain randomization; controllers withstand 50 N pushes, walk on rough terrain, and perform agile whole-body tasks (Seo et al., 1 Dec 2025).

7. Challenges, Limitations, and Future Directions

Despite major advances, important open challenges and frontiers remain:

  • Actuator and contact modeling: While nonparametric noise/perturbation injection (Cha et al., 9 Apr 2025) enhances robustness, hardware-specific identification for new actuation models and terrain types remains partially manual; catastrophic sim-to-real mismatch can still occur in extreme dynamics.
  • Sensory limitation and partial observability: Extended context and contrastively-learned latent states (Radosavovic et al., 2023, Lu et al., 16 Sep 2025, Wang et al., 9 Mar 2024) are effective, but further advances in onboard adaptive estimation, especially for non-proprioceptive cues, are needed.
  • Integration of rich exteroception: Seamless fusion of vision and tactile data with history-based or transformer policies is nascent, as shown in VIRAL (He et al., 19 Nov 2025), but computational cost and real-time inference latency remain limiting.
  • Hierarchical and compositional skills: Loco-manipulation pipelines such as DemoHLM (Fu et al., 13 Oct 2025) and box-carrying controllers (Dao et al., 2023) show the advantage of composing task-level and low-level skills through modular learning and transfer mechanisms.
  • Safety and human-robot interaction: Comfort-aware RL with explicit proxemics and dynamic safety constraints (Wang et al., 11 Aug 2025) demonstrates robust collision avoidance in joint navigation/manipulation contexts; extending such constraints to high-speed and high-power behaviors is ongoing.

A plausible implication is that future sim-to-real pipelines will increasingly couple large-scale data-driven modeling (world models, vision-based policies, trajectory transformers) with hybrid, safety-guaranteed constrained learning and multi-modal sensory integration, while new forms of automated SysID, online adaptation, and simulation-fidelity scaling will augment, rather than replace, domain randomization as the backbone of robust humanoid control.
