
KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills (2506.12851v1)

Published 15 Jun 2025 in cs.RO and cs.AI

Abstract: Humanoid robots are promising to acquire various skills by imitating human behaviors. However, existing algorithms are only capable of tracking smooth, low-speed human motions, even with delicate reward and curriculum design. This paper presents a physics-based humanoid control framework, aiming to master highly-dynamic human behaviors such as Kungfu and dancing through multi-steps motion processing and adaptive motion tracking. For motion processing, we design a pipeline to extract, filter out, correct, and retarget motions, while ensuring compliance with physical constraints to the maximum extent. For motion imitation, we formulate a bi-level optimization problem to dynamically adjust the tracking accuracy tolerance based on the current tracking error, creating an adaptive curriculum mechanism. We further construct an asymmetric actor-critic framework for policy training. In experiments, we train whole-body control policies to imitate a set of highly-dynamic motions. Our method achieves significantly lower tracking errors than existing approaches and is successfully deployed on the Unitree G1 robot, demonstrating stable and expressive behaviors. The project page is https://kungfu-bot.github.io.

Summary

  • The paper presents a two-stage framework combining physics-based motion processing and adaptive RL for transferring dynamic human motions to humanoid robots.
  • It leverages a detailed motion pipeline with filtering, contact-aware correction, and retargeting to convert raw human motion data into physically feasible trajectories.
  • Experiments demonstrate that the adaptive tracking mechanism and sim-to-real transfer enable superior performance on the Unitree G1 in executing complex skills.

This paper introduces KungfuBot, a physics-based humanoid whole-body control framework (PBHC) designed for learning and executing highly-dynamic skills like Kungfu and dancing on a physical robot. The core idea is to leverage motion capture (MoCap) data or videos of human performance as reference trajectories for a reinforcement learning (RL) policy to imitate. PBHC tackles the challenge of transferring human motions, which are often physically infeasible for robots, and the difficulty of accurately tracking agile movements through a two-stage approach: robust motion processing and adaptive motion tracking within an RL framework.

The first stage, Motion Processing Pipeline, is crucial for translating human motion data into physically achievable reference trajectories for a humanoid robot like the Unitree G1. This pipeline involves several steps:

  1. Motion Estimation from Videos: The process starts by extracting SMPL-format human motion parameters (body shape, joint rotations, global translation) from monocular videos using a method like GVHMR [shen2024gvhmr]. This provides an initial kinematic representation of the human performance.
  2. Physics-based Motion Filtering: Recognizing that raw or reconstructed human motions may violate robot physics, the framework introduces filtering based on physical principles. A key metric used is the projected distance between the Center of Mass (CoM) and Center of Pressure (CoP) on the ground, as suggested by previous work [IPMAN]. Motions are filtered if they exhibit prolonged instability or fail stability criteria at the beginning and end. This practical step prunes infeasible motions early in the pipeline.
    # Physics-based motion filtering (simplified)
    import numpy as np

    def is_frame_stable(smpl_params, epsilon_stab):
        # estimate_com_cop_projection is a placeholder for the CoM/CoP estimator
        # built on the SMPL body model (not shown here)
        com_proj, cop_proj = estimate_com_cop_projection(smpl_params)
        return np.linalg.norm(com_proj - cop_proj) < epsilon_stab

    def filter_motion_sequence(smpl_sequence, epsilon_stab, epsilon_N):
        stable = [is_frame_stable(p, epsilon_stab) for p in smpl_sequence]
        # Reject motions that are unstable at the start or the end
        if not stable[0] or not stable[-1]:
            return False
        # Reject motions whose longest run of unstable frames is too long
        max_gap = gap = 0
        for frame_is_stable in stable:
            gap = 0 if frame_is_stable else gap + 1
            max_gap = max(max_gap, gap)
        return max_gap < epsilon_N
    • The paper provides the specific thresholds used for filtering, e.g., $\epsilon_{\mathrm{stab}} = 0.1$ and $\epsilon_{\mathrm{N}} = 100$.
  3. Contact-aware Motion Correction: To improve realism and physical feasibility, the pipeline estimates foot-ground contact masks from ankle velocity and height (a zero-velocity assumption). Floating artifacts are corrected by applying a vertical offset to the global translation based on the lowest vertex z-coordinate when contact is detected, and an Exponential Moving Average (EMA) then smooths the corrected motion to prevent jitter; a simplified sketch of this step follows the list.
  4. Motion Retargeting: Finally, the processed SMPL motions are retargeted to the specific kinematics and joint limits of the target robot (Unitree G1) using an inverse kinematics (IK) method, formulated as a differentiable optimization problem that matches end-effector trajectories while respecting robot constraints; a minimal gradient-based sketch also appears below.
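
The contact-aware correction in step 3 can be illustrated with the following simplified sketch. The helper names, thresholds, and the EMA coefficient are illustrative assumptions, not the paper's code:

    import numpy as np

    def estimate_contact_mask(ankle_pos, fps, vel_thresh=0.3, height_thresh=0.08):
        # Zero-velocity heuristic: a foot is in contact when the ankle moves slowly
        # and stays near the ground (thresholds are illustrative)
        vel = np.linalg.norm(np.diff(ankle_pos, axis=0), axis=-1) * fps
        vel = np.concatenate([vel[:1], vel])              # pad to the original length
        return (vel < vel_thresh) & (ankle_pos[:, 2] < height_thresh)

    def correct_and_smooth(root_trans, lowest_vertex_z, contact_mask, ema_alpha=0.9):
        # Remove floating artifacts on contact frames, then EMA-smooth the result
        corrected = root_trans.copy()
        # push the body so the lowest contacting vertex sits on the ground plane
        corrected[contact_mask, 2] -= lowest_vertex_z[contact_mask]
        smoothed = corrected.copy()
        for t in range(1, len(smoothed)):                 # EMA to suppress per-frame jitter
            smoothed[t] = ema_alpha * smoothed[t - 1] + (1 - ema_alpha) * corrected[t]
        return smoothed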

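For the retargeting in step 4, a minimal gradient-descent IK sketch is given below, assuming a hypothetical differentiable forward-kinematics function fk_fn for the G1; the paper's IK formulation may include additional cost terms:

    import torch

    def retarget_frame(fk_fn, q_init, target_ee_pos, q_lower, q_upper,
                       iters=200, lr=0.05):
        # Match end-effector positions by gradient descent while clamping joint
        # angles to the robot's limits; fk_fn is an assumed differentiable FK
        q = q_init.detach().clone().requires_grad_(True)
        opt = torch.optim.Adam([q], lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            ee_pos = fk_fn(q)                              # (num_end_effectors, 3)
            loss = ((ee_pos - target_ee_pos) ** 2).sum()   # end-effector tracking cost
            loss.backward()
            opt.step()
            with torch.no_grad():
                q.clamp_(q_lower, q_upper)                 # project onto joint limits
        return q.detach()
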
The second stage focuses on Adaptive Motion Tracking within an asymmetric actor-critic RL framework. This stage addresses the difficulty of tracking highly dynamic motions where precise tracking may not always be physically feasible or desirable for smooth control.

  1. Exponential Form Tracking Reward: Task-specific rewards (joint position, body position, etc.) are defined in the exponential form $r(x) = \exp(-x/\sigma)$, where $x$ is the tracking error and $\sigma$ is the tracking factor controlling error tolerance. This form is preferred for its boundedness and intuitive weighting.
  2. Optimal Tracking Factor Insight: The paper formulates the choice of $\sigma$ as a bi-level optimization problem in which $\sigma$ should ideally minimize the accumulated tracking error of the converged policy (restated after this list). This analysis suggests that the optimal $\sigma$ is related to the average optimal tracking error.
  3. Adaptive Mechanism: Since the optimal error is unknown a priori and varies with the motion, a dynamic adaptive mechanism is proposed. It maintains an Exponential Moving Average $\hat{x}$ of the current tracking error during training as an online estimate of the expected error. The tracking factor $\sigma$ for each reward term is then iteratively updated to $\min(\sigma, \hat{x})$, creating a closed loop in which improved tracking (lower $\hat{x}$) tightens the tolerance $\sigma$. This allows the policy to progressively refine tracking precision and converge to a suitable $\sigma$ for each motion.
    # Adaptive tracking-factor update (simplified)
    import math

    def update_tracking_factor(sigma, ema_error, current_error, alpha=0.05):
        # EMA of the current tracking error: online estimate of the expected error
        # (alpha is an illustrative smoothing coefficient)
        ema_error = alpha * current_error + (1 - alpha) * ema_error
        # Non-increasing update: better tracking tightens the tolerance
        sigma = min(sigma, ema_error)
        # Exponential-form tracking reward with the updated tolerance
        reward = math.exp(-current_error / sigma)
        return sigma, ema_error, reward

    # Called once per training iteration for each tracked reward term;
    # sigma is initialized to a relatively large value (see below).
    • The paper initializes $\sigma$ to a relatively large value and enforces a non-increasing update rule to maintain stability.
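
A hedged restatement of the bi-level structure from point 2 above, with $x_t(\pi)$ denoting the per-step tracking error under policy $\pi$ (notation simplified; the paper's exact objective may differ in detail):

    \sigma^{\ast} = \arg\min_{\sigma}\; \mathbb{E}\Big[\textstyle\sum_{t} x_t\big(\pi^{\ast}_{\sigma}\big)\Big]
    \quad \text{s.t.} \quad
    \pi^{\ast}_{\sigma} = \arg\max_{\pi}\; \mathbb{E}\Big[\textstyle\sum_{t} \gamma^{t}\, \exp\!\big(-x_t(\pi)/\sigma\big)\Big]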

The RL Training Framework details the policy learning:

  • Asymmetric Actor-Critic: The actor takes proprioceptive state and time phase as input. The critic receives augmented observation including reference states and randomized physical parameters (privileged information) to aid value estimation.
  • Reward Vectorization: Rewards are treated as a vector, and the critic has multiple output heads, one for each reward component, allowing more precise value estimation than a single scalar value function (see the sketch after this list).
  • Reference State Initialization (RSI): Episodes can start at random time phases of the reference motion, improving training efficiency by exposing the agent to diverse states.
  • Sim-to-Real Transfer: Domain randomization (varying friction, mass, CoM offset, PD gains, control delay, external pushes) is used in the simulator (IsaacGym/MuJoCo) to train a robust policy that can transfer zero-shot to the real Unitree G1 robot.
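
To make reward vectorization concrete, here is a minimal multi-head critic sketch (not the authors' implementation), assuming PyTorch; the privileged-observation layout and layer sizes are illustrative:

    import torch
    import torch.nn as nn

    class MultiHeadCritic(nn.Module):
        # Critic with one value head per reward component (reward vectorization),
        # fed the augmented/privileged observation
        def __init__(self, privileged_obs_dim, num_reward_terms, hidden_dim=512):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(privileged_obs_dim, hidden_dim), nn.ELU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ELU(),
            )
            # one scalar value estimate per reward term
            self.value_heads = nn.Linear(hidden_dim, num_reward_terms)

        def forward(self, privileged_obs):
            # returns a vector of values, shape (batch, num_reward_terms)
            return self.value_heads(self.trunk(privileged_obs))

Per-term returns and advantages can then be computed against the matching head and combined (e.g., as a weighted sum) for the policy update; the exact aggregation used in the paper is not detailed here.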

Implementation and Experiments:

The framework is implemented and evaluated in simulation (IsaacGym/MuJoCo) and on a real Unitree G1 robot. A dataset of highly-dynamic motions is curated from videos (using the processing pipeline) and existing datasets (AMASS, LAFAN), categorized by difficulty (easy, medium, hard).

Experiments demonstrate:

  • Motion Filtering Effectiveness: Motions deemed stable by the physics-based filter consistently achieve higher episode length ratios during training compared to rejected motions, validating the filter's ability to identify trackable sequences.
  • Superior Tracking Performance: PBHC significantly outperforms baseline methods (OmniH2O, ExBody2) in simulation across various tracking metrics (position, velocity, acceleration errors) and difficulty levels (Table 1). The adaptive mechanism is key here, adjusting the reward tolerance appropriately.
  • Adaptive Mechanism Impact: An ablation study confirms that the adaptive tracking factor mechanism consistently yields near-optimal tracking performance across different motions, whereas policies trained with fixed tracking factors vary in performance depending on the specific motion (Figure 1, Table 4).
  • Real-World Deployment: Policies trained in simulation transfer successfully to the real Unitree G1 robot, demonstrating stable and expressive execution of complex skills like martial arts and dancing (Figure 2, Figure 3). Quantitative metrics for a Tai Chi motion show close agreement between simulation and real-world performance (Table 2).

Limitations:

Current limitations include the lack of environment awareness (no terrain adaptation or obstacle avoidance) and training individual policies per motion rather than a generalist policy for a repertoire of skills.

Overall, PBHC provides a practical framework for enabling humanoid robots to learn and perform complex, dynamic motions by addressing the crucial steps of motion data processing, adapting reinforcement learning rewards, and utilizing robust sim-to-real transfer techniques.
