- The paper presents Booster Gym, an open-source framework that streamlines simulation training and real-world deployment for humanoid locomotion.
- It employs an asymmetric actor-critic architecture with PPO and advanced domain randomization to overcome sim-to-real challenges.
- Experimental results on the Booster T1 robot demonstrate robust omnidirectional walking, terrain adaptability, and rapid recovery from disturbances.
This paper introduces Booster Gym (2506.15132), an open-source, end-to-end reinforcement learning framework designed specifically for humanoid robot locomotion. The framework aims to simplify the process of training motion policies in simulation and deploying them successfully onto real-world robots, addressing the persistent challenge of the sim-to-real gap.
Booster Gym provides a complete pipeline covering training, testing, and deployment. It combines Proximal Policy Optimization (PPO) training, robust domain randomization, structured reward function design, and practical solutions to robot-specific mechanical challenges, such as the parallel structures in the ankles. The framework is validated on the Booster T1 humanoid robot, demonstrating omnidirectional walking, disturbance resistance, and terrain adaptability via zero-shot transfer from simulation.
Framework Architecture and Implementation:
The framework utilizes an asymmetric actor-critic (AAC) architecture, where the actor policy $\pi_\theta(a_t \mid o_t)$ operates on partial observations $o_t$ (available on the real robot), while the critic value function $V_\theta(s_t)$ uses the full state $s_t$ (available only in simulation). PPO is used for training.
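To make the asymmetry concrete, here is a minimal PyTorch sketch of the two networks. The MLP sizes, observation dimensions, and joint count are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not from the paper).
NUM_ACTOR_OBS = 47    # proprioception + last action + commands + gait phase
NUM_CRITIC_OBS = 66   # actor obs + privileged state (mass, CoM, base vel, ...)
NUM_ACTIONS = 23      # joint position offsets (hypothetical joint count)

def mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ELU(),
        nn.Linear(256, 128), nn.ELU(),
        nn.Linear(128, out_dim),
    )

# Actor sees only observations available on the real robot.
actor = mlp(NUM_ACTOR_OBS, NUM_ACTIONS)
# Critic additionally sees privileged, simulation-only state.
critic = mlp(NUM_CRITIC_OBS, 1)

obs = torch.zeros(4096, NUM_ACTOR_OBS)          # per-env actor observations
privileged = torch.zeros(4096, NUM_CRITIC_OBS)  # full state, simulation only
action_mean = actor(obs)     # mean of the Gaussian policy used by PPO
value = critic(privileged)   # value estimate for advantage computation
```

The key consequence is that only the actor needs to be deployed: the critic's privileged inputs do not exist on the real robot.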
- Observation Space: The actor observation $o_t$ includes proprioceptive data (base angular velocity, gravity vector, joint positions and velocities), the previous action $a_{t-1}$, desired velocity commands $(v_x, v_y, \omega_{\text{yaw}})$, and a gait cycle signal $(\cos(2\pi f t), \sin(2\pi f t))$. The critic receives all actor observations plus privileged information such as body mass, CoM, base linear velocity, base height, and push force/torque (Table I).
- Action Space: The policy outputs joint position offsets $a_t$. Desired joint positions are computed as $q_{\text{des}} = q_0 + a_t$, where $q_0$ are the default joint positions. These targets are sent to the robot's motors, whose internal PD controller computes torque commands $\tau_{\text{des}} = k_p (q_{\text{des}} - q) - k_d \dot{q}$. The policy runs at 50 Hz, while the motor PD controller runs at a higher frequency for stability (a minimal sketch of this control path follows the list).
- Reward Function: The reward is a weighted sum of components (Table II) designed to encourage desired behaviors (a simplified reward sketch also follows the list). Key components include:
- Tracking Rewards: For following linear and angular velocity commands. Curriculum learning is used to gradually increase command difficulty.
- Gait Rewards: Encouraging periodic leg movement, using foot height relative to the ground (due to simulation limitations) rather than contact forces.
- Regularization Rewards: Penalizing undesirable states like excessive torso tilt, high torque, power consumption, joint velocity/acceleration, base acceleration, action rate, joint limits, and collisions. Penalties for foot slip, foot yaw/roll alignment, and foot distance are also included.
- Survival Reward: A small constant reward for staying alive.
- Episode Design: Episodes last up to 1500 steps (30 s at the 50 Hz policy rate) but terminate early if the robot falls, detected via low base height or excessive base velocity. Rewards are structured to avoid incentivizing early termination.
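To make the action-to-torque path concrete, the sketch below holds one 50 Hz policy target while a faster inner PD loop tracks it. The gains, the 500 Hz inner rate, and the joint count are illustrative assumptions; only the 50 Hz policy rate and the PD law come from the paper.

```python
import numpy as np

KP, KD = 40.0, 1.0          # PD gains (assumed values)
POLICY_HZ = 50              # policy rate (from the paper)
MOTOR_HZ = 500              # inner PD-loop rate (assumed)
DECIMATION = MOTOR_HZ // POLICY_HZ

q0 = np.zeros(23)           # default joint positions (hypothetical joint count)

def pd_torque(q_des, q, qd):
    # tau_des = kp * (q_des - q) - kd * qd
    return KP * (q_des - q) - KD * qd

def control_step(action, read_joint_state, apply_torque):
    """One 50 Hz policy step: hold q_des while the faster PD loop tracks it."""
    q_des = q0 + action                  # q_des = q0 + a_t
    for _ in range(DECIMATION):          # 10 inner PD updates per policy step
        q, qd = read_joint_state()       # current joint positions/velocities
        apply_torque(pd_torque(q_des, q, qd))
```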
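The reward structure can be illustrated as follows. The exponential tracking kernel, the weights, and the termination thresholds are common choices in legged-gym-style codebases and are assumptions here; Table II in the paper defines the actual terms and weights.

```python
import torch

# Illustrative weights (assumed); the paper's Table II defines the real ones.
W = {"track_lin": 1.0, "track_ang": 0.5, "torque": -1e-4,
     "action_rate": -0.01, "survival": 0.2}

def compute_reward(cmd_vel, base_vel, cmd_yaw, yaw_rate, tau, act, prev_act):
    # Tracking rewards: exponential kernel on squared command error (assumed form).
    r_lin = torch.exp(-torch.sum((cmd_vel - base_vel) ** 2, dim=-1) / 0.25)
    r_ang = torch.exp(-((cmd_yaw - yaw_rate) ** 2) / 0.25)
    # Regularization penalties discourage high torque and jittery actions.
    p_tau = torch.sum(tau ** 2, dim=-1)
    p_rate = torch.sum((act - prev_act) ** 2, dim=-1)
    return (W["track_lin"] * r_lin + W["track_ang"] * r_ang
            + W["torque"] * p_tau + W["action_rate"] * p_rate
            + W["survival"])  # small constant survival bonus

def should_terminate(base_height, base_vel, min_h=0.45, max_v=10.0):
    # Early termination on falls: low base height or excessive velocity
    # (thresholds are assumptions).
    return (base_height < min_h) | (base_vel.norm(dim=-1) > max_v)
```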
Bridging the Sim-to-Real Gap:
Domain randomization is crucial for robust sim-to-real transfer. Booster Gym randomizes the following parameter groups during training (a sampling sketch follows the list):
- Robot Dynamics: Mass and center of mass (CoM) of each link; noise is also added to observations.
- Actuators: Joint stiffness, damping, friction, and communication delays (0-20 ms sensor-to-actuator latency, a range informed by real-world measurements; Fig 9 shows 9-12 ms on hardware).
- Environment: Different terrains, contact properties (friction, compliance, restitution), and random external disturbances (kicks/pushes).
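A minimal sketch of how such per-environment randomization is typically sampled at episode reset. All ranges except the 0-20 ms delay (stated in the paper) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def sample_randomization():
    """Sample one environment's physics parameters at episode reset.
    Ranges are illustrative assumptions, not the paper's values."""
    return {
        "mass_scale": rng.uniform(0.9, 1.1),          # per-link mass scaling
        "com_offset_m": rng.uniform(-0.02, 0.02, 3),  # CoM shift per link
        "joint_friction": rng.uniform(0.0, 0.1),
        "kp_scale": rng.uniform(0.9, 1.1),            # actuator stiffness
        "kd_scale": rng.uniform(0.9, 1.1),            # actuator damping
        "action_delay_ms": rng.uniform(0.0, 20.0),    # latency range from the paper
        "ground_friction": rng.uniform(0.4, 1.2),
        "restitution": rng.uniform(0.0, 0.4),
    }

def maybe_push(base_vel, push_prob=0.01, max_dv=0.5):
    # Random external disturbance: occasionally kick the base velocity.
    if rng.random() < push_prob:
        base_vel += rng.uniform(-max_dv, max_dv, 3)
    return base_vel
```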
Deployment Implementation:
The framework includes a Python-based deployment system (a simplified control loop is sketched after this list):
- Trained policies are exported in a JIT-compiled format for efficient execution on the robot's onboard CPU.
- The policy runs at 50 Hz, taking real-time sensor data and outputting desired joint positions.
- The robot's existing motor PD controller handles the low-level torque generation.
- The Booster Robotics SDK, built on DDS middleware, provides a unified API for interacting with the robot, abstracting the hardware interfaces.
- A series-parallel conversion module within the SDK handles parallel kinematic structures such as the ankles: policies are trained on a virtual serial structure for compatibility with GPU simulators, and the module converts their outputs into control signals for the physical parallel mechanism using kinematics and a PD controller.
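A simplified deployment loop under these constraints might look as follows. TorchScript is a standard choice for JIT-compiled PyTorch policies (the paper only states "JIT-compiled format"), and `build_observation` and `send_joint_targets` are hypothetical placeholders, not the actual Booster Robotics SDK API.

```python
import time
import torch

# JIT-exported policy running on the onboard CPU (TorchScript assumed).
policy = torch.jit.load("policy.pt")
policy.eval()

DT = 1.0 / 50.0                 # 50 Hz policy rate (from the paper)
q0 = torch.zeros(23)            # default joint positions (placeholder values)

def run(robot, build_observation):
    # `robot` and `build_observation` are hypothetical placeholders for the
    # SDK handle and an observation-assembly helper.
    prev_action = torch.zeros(23)
    while True:
        t0 = time.monotonic()
        obs = build_observation(robot, prev_action)   # assemble o_t from sensors
        with torch.no_grad():
            action = policy(obs.unsqueeze(0)).squeeze(0)
        # Send desired joint positions; the motors' internal PD loop produces
        # torques, and the SDK's series-parallel module remaps ankle targets.
        robot.send_joint_targets((q0 + action).tolist())
        prev_action = action
        time.sleep(max(0.0, DT - (time.monotonic() - t0)))
```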
Experimental Results:
Validation on the Booster T1 robot demonstrates the framework's effectiveness:
- Omnidirectional Walking: Policies trained with Booster Gym enable stable forward, backward, sideways, and rotational walking, as well as combinations of these movements (Fig 4).
- Terrain Adaptation: The robot successfully walks on diverse surfaces including grass, slopes (up to 10 degrees), and uneven terrain without explicit terrain sensing (Fig 5, Fig 6).
- Disturbance Resistance: The robot shows robustness against external impacts (e.g., a 10 kg ball) and continuous forces, recovering stability quickly (Fig 7).
- Sim-to-Real Transfer: Comparison of joint trajectories across Isaac Gym, MuJoCo (used for cross-simulation testing), and the real robot shows that domain randomization effectively bridges the sim-to-real gap, enabling successful zero-shot transfer (Fig 8). MuJoCo is highlighted as a lightweight platform for validating policies before deployment on hardware.
Booster Gym is released as an open-source resource with the goal of accelerating humanoid robot development by providing a ready-to-use pipeline for RL-based locomotion. It draws inspiration from existing open-source projects like IsaacGymEnvs [makoviychuk2021isaac], legged_gym [rudin2022learning], rsl_rl, and humanoid-gym [gu2024humanoid].