Sim-to-Real Compliant Bipedal Locomotion
- The paper demonstrates that targeted domain randomization combined with precise system identification significantly narrows the simulation-to-reality performance gap.
- It outlines model-centric methods such as detailed calibration of robot dynamics and compliant contact modeling to enhance simulator fidelity.
- Policy robustness is achieved through adaptive reinforcement learning and compliance-aware control, improving stability and performance on uneven terrain.
Sim-to-real transfer of compliant bipedal locomotion refers to the challenge of developing locomotion control policies in simulation that can be executed on physical bipedal robots exhibiting mechanical compliance—such as spring–damper elements, high-gear-ratio transmissions, or compliant feet—without loss of performance or stability due to the reality gap. This field encompasses analytical model-based, optimization-based, and deep reinforcement learning (RL) methods, and it addresses discrepancies arising from imperfect modeling of robot dynamics, actuation, contact, estimation, and unmodeled environmental factors.
1. Origins and Sources of the Sim-to-Real Gap
The sim-to-real gap for compliant bipedal robots is primarily attributed to mismatches between simulation and reality in four key areas:
- Robot Dynamics and Joint Compliance: Small errors in link mass or inertia (5–10 %) can destabilize gait cycles. Neglecting actuator, transmission, or drivetrain compliance (e.g., series spring constants) introduces phase lags of 5–20 ms and can trigger oscillations or instabilities in real hardware (Bao et al., 9 Nov 2025).
- Contact Modeling and Friction: Compliant feet or ground contacts are often modeled as penalty springs, $F_n = k_p \delta + k_d \dot{\delta}$ (with penetration depth $\delta$), but the real stiffness $k_p$ can be off by 20–50 %. An incorrect friction coefficient $\mu$ can alter slip frequency by an order of magnitude (Bao et al., 9 Nov 2025).
- State Estimation and Sensing: Simulated sensors have perfect accuracy and zero latency, while real systems are affected by sensor noise, bias, and timing jitter (delays of 5–15 ms), which degrades closed-loop performance.
- Numerical Integration and Solver Fidelity: The physics integrator and contact solver (e.g., penalty vs. complementarity, timestep size) induce artifacts that do not exist in reality, leading to discrepancies in impact and energy dissipation mechanisms.
Collectively, these gaps result in a 10–30 % performance discrepancy across step length, COM height, ground-reaction force (GRF) timing, and other critical metrics if left unaddressed (Bao et al., 9 Nov 2025).
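The penalty-spring contact model above can be sketched in a few lines; the stiffness and damping values here are illustrative placeholders, not identified parameters from any of the cited systems:

```python
def contact_force(penetration, penetration_rate, k_p=2.0e4, k_d=150.0):
    """Penalty-spring normal contact force F_n = k_p * d + k_d * d_dot.

    Returns zero when there is no penetration (d <= 0) and clamps the
    result to be non-adhesive (F_n >= 0), the usual conventions for
    penalty-based contact in physics simulators.
    """
    if penetration <= 0.0:
        return 0.0
    f = k_p * penetration + k_d * penetration_rate
    return max(f, 0.0)
```

A 20–50 % error in `k_p`, as noted above, directly rescales the stance-phase force profile, which is why contact stiffness is a prime target for calibration.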
2. Model-Centric Methods for Simulator Fidelity
Reducing the sim-to-real gap at the modeling level involves multiple strategies:
2.1 System Identification and Parameter Calibration
Compliant locomotion models must accurately specify:
- Link inertias and masses (balance tests, CAD models)
- Actuator dynamics (step-response experiments, e.g., for torque and back-EMF constants of DC motors)
- Series elasticity or closed-chain compliance (deflection, frequency sweep analyses)
- Contact spring/damper parameters (drop tests, force-indentation experiments)
- Friction and backlash (triaxial actuation analysis)
- Sensor noise and delay (system input–output experiments).
Model parameters are then optimized via cost functions of the form $\theta^* = \arg\min_{\theta} \mathcal{L}(\theta)$, where $\mathcal{L}$ is a tracking loss (e.g., squared or Wasserstein distance over rollouts) (Bao et al., 9 Nov 2025, Masuda et al., 2022, Yu et al., 2019).
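The calibration loop can be sketched as follows; `rollout_sim` and the one-dimensional grid search stand in for a full simulator and optimizer and are assumptions for illustration:

```python
def tracking_loss(theta, rollout_sim, real_traj):
    """Squared tracking loss L(theta): sum of squared errors between a
    simulated rollout under parameters theta and the real trajectory."""
    sim_traj = rollout_sim(theta)
    return sum((s - r) ** 2 for s, r in zip(sim_traj, real_traj))

def calibrate(rollout_sim, real_traj, grid):
    """theta* = argmin_theta L(theta), here over a 1-D candidate grid.
    In practice a gradient-free or gradient-based optimizer replaces
    the grid, and theta is a vector of inertial/contact parameters."""
    return min(grid, key=lambda th: tracking_loss(th, rollout_sim, real_traj))
```

For example, if the real data were generated by a linear model with gain 1.3, `calibrate` recovers the nearest grid point to 1.3.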
2.2 Compliance and Contact Modeling
Realistic simulation of compliant bipeds uses spring–damper ground and joint models, friction cones, and, for gear-driven actuation, transmission models with directional efficiency, as detailed for ROBOTIS-OP3 in (Masuda et al., 2022). This ensures the simulation can reproduce the backdrivability effects of high-gear-ratio systems.
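A minimal sketch of a direction-dependent transmission efficiency model, under the common assumption that losses scale torque by η when the motor drives the load and by 1/η when the load backdrives the motor (the gear ratio and efficiencies below are hypothetical, not the ROBOTIS-OP3 values):

```python
def output_torque(motor_torque, joint_velocity, gear_ratio=200.0,
                  eta_fwd=0.7, eta_bwd=0.5):
    """Joint torque through a geared transmission with directional
    efficiency. When reflected torque and joint velocity have the same
    sign, power flows motor -> load and torque is attenuated by eta_fwd;
    otherwise the load backdrives the motor and friction losses resist
    it, modeled as division by eta_bwd."""
    tau = gear_ratio * motor_torque
    driving = tau * joint_velocity >= 0.0
    return tau * eta_fwd if driving else tau / eta_bwd
```

The asymmetry between the two branches is exactly what makes high-gear-ratio joints feel much stiffer when backdriven, an effect a symmetric efficiency model cannot capture.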
2.3 Numerical Integration Practices
- Adoption of semi-implicit schemes (e.g., Newmark, symplectic Euler) with millisecond-scale timesteps
- Tight penetration tolerances in contact solvers
- Single-threaded deterministic scheduling to prevent run-to-run non-determinism
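A symplectic (semi-implicit) Euler step for a single degree of freedom, as a minimal illustration of the integration practice above:

```python
def semi_implicit_euler_step(q, v, force_fn, mass, dt=1e-3):
    """One symplectic (semi-implicit) Euler step: update velocity with
    the current force first, then position with the *new* velocity.
    This ordering gives much better energy behavior than explicit Euler
    for stiff spring-damper contact forces."""
    v_new = v + dt * force_fn(q, v) / mass
    q_new = q + dt * v_new
    return q_new, v_new
```

With a stiff spring force and a 1 ms step, the scheme stays bounded where explicit Euler would slowly inject energy into the oscillation.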
Model-based control architectures (e.g., HZD+inverse dynamics (Reher et al., 2020), MPC (Chen et al., 24 Sep 2024)) benefit from embedding compliance and realistic constraints into both planning and control pipelines.
3. Policy Robustness and Hardening via Learning
3.1 Domain Randomization
Robust RL policies are trained across a wide ensemble of simulated models by sampling:
- Inertial and compliance parameters: link masses and inertias, spring/damper constants sampled over plausible compliance ranges
- Contact and friction: contact stiffness/damping and friction coefficients
- Sensor noise and latency: encoder/IMU noise, millisecond-scale delays
- Control loop frequency, output filtering, actuation delay, and PD gains.
A key finding is that targeted randomization (e.g., back-EMF, friction, compliance) is more effective than indiscriminately randomizing all parameters (Singh et al., 2023, Masuda et al., 2022). Some pipelines (Duan et al., 2020) incorporate adversarial domain randomization, e.g., cyclic optimization of an adversary network that learns to perturb inputs or physics.
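A targeted randomization set can be sampled once per training episode as below; the ranges are illustrative and loosely follow the representative values tabulated later in this section:

```python
import random

# Targeted randomization ranges (mostly fractions of nominal).
# Illustrative values only -- not taken verbatim from any cited paper.
RANDOMIZATION = {
    "link_mass_scale":        (0.9, 1.1),
    "joint_damping_scale":    (0.5, 1.5),
    "spring_stiffness_scale": (0.7, 1.3),
    "friction_coeff":         (0.4, 1.0),
    "sensor_delay_ms":        (0.0, 15.0),
}

def sample_domain(rng=random):
    """Draw one randomized simulation domain for an RL training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION.items()}
```

Keeping the dictionary small and focused on dominant gap sources (compliance, friction, latency) is the "targeted" part: each added axis of randomization widens the task distribution the policy must cover.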
3.2 Reward Functions and Control Structure
- Periodic rewards with phase gating and mirror symmetry regularization (Siekmann et al., 2020, Siekmann et al., 2021, Maslennikov et al., 14 Jul 2025)
- Multi-objective losses: e.g., minimizing attitude, position, and control effort (Chen et al., 24 Sep 2024)
- Hierarchical/two-timescale control: task-space policies regulating feet setpoints with low-level model-based or impedance inverse dynamics (Duan et al., 2020)
- Direct torque-control for robot/task-agnostic compliance (Kim et al., 2023)
- Constraints-as-Terminations for underactuated platforms (Roux et al., 4 Aug 2025)
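A phase-gated periodic reward in the spirit of (Siekmann et al., 2020) can be sketched as follows; the clock convention and tanh shaping are simplifying assumptions, not the exact published reward:

```python
import math

def phase_gated_reward(phase, left_grf, right_grf, w=1.0):
    """Periodic reward sketch: a clock variable phase in [0, 1) gates
    which foot *should* be in stance. Ground-reaction force (GRF) on
    the stance foot is rewarded, GRF on the swing foot penalized, so
    the policy is shaped toward an alternating gait without reference
    trajectories."""
    stance_left = phase < 0.5  # first half of the cycle: left stance
    stance, swing = (left_grf, right_grf) if stance_left else (right_grf, left_grf)
    return w * (math.tanh(stance) - math.tanh(swing))
```

Mirror-symmetry regularization then ties the two half-cycles together, so left-stance and right-stance behavior share structure.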
3.3 Network Architecture
- Feedforward policies (MLP) with sufficient history or explicit current feedback are often sufficient (Singh et al., 2023, Singh et al., 18 Apr 2025)
- Recurrent policies (LSTM) are more robust when trained with dynamics randomization and can perform implicit online system identification (Siekmann et al., 2020)
- Symmetry-aware losses and weight decay improve stability for closed-chain or highly coupled systems (Maslennikov et al., 14 Jul 2025).
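Observation-history stacking for a feedforward policy, the simple alternative to recurrence mentioned above (the dimensions and horizon are arbitrary):

```python
from collections import deque

class HistoryObservation:
    """Fixed-length observation history for a feedforward (MLP) policy.
    Stacking recent frames lets the network infer dynamics context
    (an implicit system identification) without recurrent state."""

    def __init__(self, obs_dim, horizon=5):
        self.buf = deque([[0.0] * obs_dim] * horizon, maxlen=horizon)

    def push(self, obs):
        """Append the newest observation and return the flattened
        oldest-to-newest history as one policy input vector."""
        self.buf.append(list(obs))
        return [x for frame in self.buf for x in frame]
```

The trade-off against an LSTM is a fixed, bounded memory and trivially cheap on-hardware inference.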
Table: Representative Dynamics Randomization Sets
| Parameter | Range | References |
|---|---|---|
| Link mass | [0.9, 1.1] × nominal | [(Bao et al., 9 Nov 2025), ...] |
| Joint damping | [0.5, 1.5] × nominal | (Siekmann et al., 2020) |
| Spring stiffness | [0.7, 1.3] × nominal | (Dao et al., 2022, Maslennikov et al., 14 Jul 2025) |
| Friction coefficient | [0.4, 1.0] | [(Bao et al., 9 Nov 2025), ...] |
| Back-EMF constant | [5, 40] | (Singh et al., 2023) |
4. Practical Workflows and Algorithmic Pipelines
A widely adopted pipeline for compliant bipedal sim-to-real transfer is as follows (Bao et al., 9 Nov 2025, Yu et al., 2019, Masuda et al., 2022):
- Offline system identification: Calibrate inertial, actuation, contact, and sensing parameters via model-based experiments and optimization.
- Residual dynamics learning: Optionally fit neural corrections to capture unmodeled effects.
- Robust policy training: Employ domain randomization, curriculum schedules, and secondary optimization (e.g., DiffTune (Chen et al., 24 Sep 2024)) to auto-tune control gains or policy parameters.
- Zero-shot validation: Deploy on real hardware, monitoring velocity tracking, task success, energy, and failure rates.
- Online adaptation: Use explicit system identification (e.g., EKF, RLS) or implicit context inference (e.g., LSTM, transformer memory) to update the control policy during or after deployment.
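The online-adaptation step can be illustrated with a scalar recursive-least-squares (RLS) update, one of the explicit system-identification options listed above (the scalar model is a simplification; real estimators track parameter vectors):

```python
def rls_update(theta, P, x, y, lam=0.99):
    """One recursive-least-squares step for a scalar model y = theta * x
    with forgetting factor lam. P is the (scalar) covariance; smaller P
    means higher confidence in theta. Returns updated (theta, P)."""
    k = P * x / (lam + x * P * x)       # Kalman-style gain
    theta = theta + k * (y - theta * x)  # correct by prediction error
    P = (P - k * x * P) / lam            # shrink covariance, with forgetting
    return theta, P
```

On noiseless data the estimate converges to the true parameter within a handful of samples, which is what makes RLS attractive for in-the-loop adaptation at control rates.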
For gear-driven or high-gear-ratio compliant platforms, two-stage system identification (pre- and post-policy learning) coupled with low-dimensional policy adaptation (latent variable–conditioned policy) is particularly effective (Yu et al., 2019, Masuda et al., 2022).
5. Experimental Benchmarks and Quantitative Outcomes
Experimental evaluation commonly includes:
- Velocity tracking error (RMSE between commanded and measured velocities)
- Ground reaction force (GRF) profiles vs. simulation
- Step length, maximum speed before failure
- Robustness to dynamic disturbances (push-recovery impulse, slip, terrain transitions)
- Energetic cost (Cost of Transport, mean actuator torques)
- Pass/fail rates under randomization or adversarial schedules
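Two of the metrics above reduce to one-liners; the constants in the comments are made up for illustration:

```python
import math

def velocity_rmse(commanded, measured):
    """RMSE between commanded and measured velocity traces."""
    assert len(commanded) == len(measured)
    n = len(commanded)
    return math.sqrt(sum((c - m) ** 2 for c, m in zip(commanded, measured)) / n)

def cost_of_transport(energy_j, mass_kg, distance_m, g=9.81):
    """Dimensionless cost of transport: E / (m * g * d).
    E.g., 981 J spent moving a 10 kg robot 10 m gives CoT = 1.0."""
    return energy_j / (mass_kg * g * distance_m)
```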
Representative results demonstrate:
| Metric | Sim-only | +DR/randomization | +DR+Online Adapt |
|---|---|---|---|
| Joint-angle error (rad) | 0.020 | 0.012 | 0.010 |
| Push recovery success (%) | 65 % | 92 % | 98 % |
| Terrain (carpet) success (%) | 40 % | 88 % | 95 % |
| Walking on uneven terrain (max step) | ~2 cm | ~5–7 cm | ~7–10 cm |
For torque-policy transfer on human-scale humanoids, zero-shot transfer is achieved without per-task gain tuning, robustness is improved to unexpected ground contacts, and the intrinsic compliance of torque output directly mitigates sim-to-real discrepancies (Kim et al., 2023).
6. Impact of Compliance and Actuation Mode on Transfer
Explicit modeling and exploitation of compliance—whether via variable PD gains (simulated compliance), task-space impedance control, or direct torque control—enable policies to absorb unmodeled ground or actuator effects and maintain stability. Torque-control architectures (especially in series-elastic, backdrivable robots) reduce sensitivity to model error and filter high-frequency contact impulses, resulting in improved sim-to-real performance even in the absence of precise mechanical parameter tuning (Kim et al., 2023, Masuda et al., 2022, Duan et al., 2020).
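A joint-impedance (variable-PD) control law is the simplest form of the simulated compliance discussed above; the gains here are hypothetical:

```python
def impedance_torque(q, qd, q_des, qd_des, kp, kd, tau_ff=0.0):
    """Joint-impedance control law:
        tau = kp * (q_des - q) + kd * (qd_des - qd) + tau_ff.
    Lower kp/kd yields a softer joint that absorbs unmodeled contact
    impulses instead of fighting them; a feedforward term tau_ff can
    carry model-based gravity or inverse-dynamics compensation."""
    return kp * (q_des - q) + kd * (qd_des - qd) + tau_ff
```

Exposing `kp`/`kd` (or `tau_ff` directly, in the torque-control limit) to the policy is one concrete way the architectures above trade tracking stiffness for robustness to model error.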
For robots with closed-chain kinematics or non-serial-chain dynamics (e.g., TopA), explicit inclusion of constraint dynamics and frictional couplings, coupled with symmetry-aware training and adversarial disturbance curricula, yields superior sim-to-real outcomes compared to serial-chain simplifications (Maslennikov et al., 14 Jul 2025).
7. Outlook and Best Practices
Successful sim-to-real transfer of compliant bipedal locomotion controllers leverages a combination of high-fidelity simulation, targeted randomization of dominant gap sources, hierarchical/hybrid policy architectures (high-level RL + low-level model-based/impedance or torque control), and adaptive or context-aware learning structures. Best practices include:
- Identifying and modeling the largest sources of gap (typically compliance, friction, and actuation)
- Including actual hardware feedback in policy observations where feasible (e.g., current feedback)
- Employing targeted, not global, randomization to keep the training tractable
- Preferring simple network architectures for on-hardware inference where memory is not strictly needed
- Validating controllers first in high-fidelity simulation before hardware deployment
Systematic application of these methods has enabled robust, zero- or near-zero-tuning sim-to-real transfer for highly dynamic, compliant bipedal maneuvers, spanning periodic gaits, omni-directional walking, stair and uneven terrain traversal, load-carrying, and highly dynamic jumps (Singh et al., 18 Apr 2025, Batke et al., 2022, Siekmann et al., 2021, Reher et al., 2020, Dao et al., 2022, Rodriguez et al., 2021, Singh et al., 2023, Maslennikov et al., 14 Jul 2025, Chen et al., 24 Sep 2024, Duan et al., 2023, Bao et al., 9 Nov 2025, Roux et al., 4 Aug 2025, Siekmann et al., 2020, Masuda et al., 2022, Yu et al., 2019).