- The paper presents an end-to-end learning framework that directly maps natural language to whole-body control, overcoming the drawbacks of decoupled motion-generation methods.
- It employs a two-stage process: a teacher policy trained with reinforcement learning to track retargeted MoCap motions, and a CVAE-based student trained via behavior cloning.
- Evaluations in simulation and on a Unitree G1 humanoid demonstrate smooth transitions, robust performance under disturbances, and agile execution in the real world.
This paper, "LangWBC: Language-directed Humanoid Whole-Body Control via End-to-end Learning" (arXiv:2504.21738), presents a novel framework for enabling humanoid robots to execute complex and agile whole-body motions directly from natural language commands. The core challenge addressed is the gap between linguistic understanding and physical action for high-dimensional, dynamic systems like humanoids, along with the difficulty of ensuring that generated motions are physically feasible and robust in the real world.
Prior approaches often decouple this problem into high-level kinematic motion generation and low-level tracking control. While successful in generating diverse motions, these methods frequently produce kinematically plausible but physically impossible trajectories (e.g., floating bodies, unstable poses) that are hard for a tracking controller to follow robustly in the real world. Additionally, these methods struggle with smooth transitions between different behaviors and flexible duration control. End-to-end learning approaches, which map language directly to actions, are promising but have been less explored for dynamic humanoid control due to their complexity and challenges in real-world robustness.
LangWBC proposes an end-to-end solution that uses a single neural network to interpret language commands and directly output low-level control actions. The framework leverages a two-stage training process:
- Motion-Tracking Teacher Policy: A teacher policy is trained with reinforcement learning (PPO) in simulation to track a diverse set of physically plausible motions derived from retargeted motion capture (MoCap) data. This teacher policy learns to execute dynamic behaviors while maintaining balance and robustness. It utilizes privileged information (such as friction, mass, and external forces) available in simulation to learn robust skills. A motion curriculum, starting with easier motions and progressing to harder ones, improves training efficiency. A symmetry loss is also incorporated to encourage balanced movements and reduce training sample complexity (a minimal sketch of such a loss appears after this list).
- Implementation Details: The teacher policy is an MLP taking the current robot state (proprioceptive + privileged) and future reference keypoint/joint positions as input, outputting desired joint positions for PD controllers. It runs at 50 Hz. Domain randomization is applied to enhance sim-to-real transfer. Motion retargeting is performed using a Levenberg-Marquardt-based IK optimization.
- Language-Directed Student Policy: A student policy, built upon a Conditional Variational Autoencoder (CVAE), is trained via behavior cloning (DAgger) to imitate the teacher's actions. Crucially, the student policy receives natural language commands (encoded by a CLIP text encoder) and a history of proprioceptive observations (joint positions/velocities, base velocities/angular velocities, projected gravity) as input, without the privileged information or explicit reference trajectories used by the teacher. The CVAE structure maps the joint input (language + history) to a structured latent space and then decodes a control action (see the architecture sketch after this list).
- Implementation Details: The student CVAE encoder maps the concatenated CLIP text embedding and historical proprioceptive observations to the parameters (mean and diagonal variance) of a latent Gaussian distribution. The decoder takes the sampled latent vector and the current observation and outputs the action. During inference, the mean of the latent distribution is used deterministically. An MLP is used for both the encoder and the decoder. Training uses a DAgger-like iterative process in which the student's observations are used to query the teacher, and the resulting state-action pairs are used for behavior cloning. A relative-tracking objective (minimizing displacement error rather than absolute position error) is used during student training to mitigate error accumulation and maintain consistency with teacher demonstrations. The student policy also runs at 50 Hz and is trained with the same domain randomization as the teacher.
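The paper only names the symmetry loss; a minimal PyTorch-style sketch of one common formulation is shown below, assuming robot-specific mirroring matrices `M_obs` and `M_act` (left/right index swaps and sign flips) that are not specified in the paper. The idea is that the policy's action for a mirrored observation should match the mirrored action for the original observation.

```python
import torch

def symmetry_loss(policy, obs, M_obs, M_act):
    """Encourage left/right-symmetric behavior (illustrative sketch).

    policy : callable mapping observations to actions (e.g., the teacher MLP)
    obs    : (batch, obs_dim) observations
    M_obs  : (obs_dim, obs_dim) observation mirroring matrix (assumed)
    M_act  : (act_dim, act_dim) action mirroring matrix (assumed)
    """
    act = policy(obs)                     # actions for the original observations
    act_mirror = policy(obs @ M_obs.T)    # actions for the mirrored observations
    # Penalize the gap between "act on mirrored obs" and "mirror of original act"
    return torch.mean((act_mirror - act @ M_act.T) ** 2)
```

In practice this term would be added with a small weight to the PPO objective alongside the tracking rewards.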
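To make the student architecture concrete, here is a hedged sketch of a CVAE policy along the lines described above; all layer widths and dimensions (CLIP text embedding size, history length, action size) are illustrative placeholders rather than the paper's values.

```python
import torch
import torch.nn as nn

class CVAEStudent(nn.Module):
    """Illustrative CVAE student: (language, proprioceptive history) -> action."""

    def __init__(self, text_dim=512, hist_dim=1050, obs_dim=70,
                 latent_dim=32, act_dim=23, hidden=512):
        super().__init__()
        # Encoder: CLIP text embedding + observation history -> latent Gaussian params
        self.encoder = nn.Sequential(
            nn.Linear(text_dim + hist_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * latent_dim),      # mean and log-variance
        )
        # Decoder: latent sample + current observation -> desired joint positions
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, text_emb, hist, obs, deterministic=False):
        mu, logvar = self.encoder(torch.cat([text_emb, hist], dim=-1)).chunk(2, dim=-1)
        # Reparameterized sample during training; the mean is used at inference
        z = mu if deterministic else mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        action = self.decoder(torch.cat([z, obs], dim=-1))
        return action, mu, logvar
```

During DAgger training, the behavior-cloning loss on the teacher's actions would be combined with the usual KL term on (mu, logvar) to keep the latent space structured.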
The CVAE architecture is a key contribution, creating a structured latent space that aligns language semantics and physical actions. This enables several capabilities:
- Diversity and Generalization: The latent space captures a joint distribution of language and actions, allowing the policy to generate a wide range of motions conditioned on diverse text inputs. The structured nature helps the policy generalize to unseen text commands that are semantically similar to training data, outperforming an MLP-based student on unfamiliar commands.
- Smooth Transitions: By operating within a continuous latent space, the policy can smoothly transition between different behaviors without discrete switching logic or resets, even between agile motions like running and limb movements.
- Latent Space Interpolation: Interpolating between latent codes corresponding to different motions allows the generation of novel, blended behaviors not explicitly present in the training data, such as a diagonal walk produced by interpolating between forward walk and side shuffle.
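As a rough illustration of this interpolation mechanism (building on the hypothetical `CVAEStudent` sketch above, with placeholder names), two commands can be encoded to their latent means, blended, and decoded against the current observation:

```python
import torch

@torch.no_grad()
def blended_action(student, text_emb_a, text_emb_b, hist, obs, alpha=0.5):
    """Decode an action from a blend of two commands' latent codes
    (e.g., "walk forward" and "shuffle sideways"). Illustrative sketch."""
    mu_a, _ = student.encoder(torch.cat([text_emb_a, hist], dim=-1)).chunk(2, dim=-1)
    mu_b, _ = student.encoder(torch.cat([text_emb_b, hist], dim=-1)).chunk(2, dim=-1)
    z = (1.0 - alpha) * mu_a + alpha * mu_b   # linear interpolation in latent space
    return student.decoder(torch.cat([z, obs], dim=-1))
```

Sweeping `alpha` from 0 to 1 would trace a continuum of behaviors between the two commands, which is how a blended motion such as the diagonal walk can arise.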
The framework is evaluated extensively through simulations and real-world experiments on a Unitree G1 humanoid robot. Results demonstrate the robot's ability to execute diverse motions (walking, turning, waving, clapping) robustly under external disturbances (kicks, pushes) in zero-shot sim-to-real transfer. Ablation studies confirm that the CVAE architecture, the symmetry loss in teacher training, and the relative-tracking objective for the student all contribute positively to motion quality, stability, and imitation accuracy.
The paper also shows how LangWBC can be integrated with an LLM. The LLM acts as a high-level planner, decomposing abstract instructions or social scenarios into a sequence of primitive language commands that LangWBC can execute. The student CVAE's ability to generalize to semantically similar commands is beneficial in this setup, since the LLM may not generate commands identical to the training data.
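A hedged sketch of this integration is below; the prompt, model name, and primitive command list are assumptions for illustration, not details from the paper. The idea is simply that the LLM returns a sequence of short commands, each of which is then CLIP-encoded and executed by the student policy.

```python
# Hypothetical high-level planner; prompt, model, and primitives are illustrative.
from openai import OpenAI

PRIMITIVES_HINT = "walk forward, turn left, turn right, wave, clap, stop"

def plan_commands(instruction: str) -> list[str]:
    """Ask an LLM to decompose an abstract instruction into primitive commands."""
    client = OpenAI()
    prompt = (
        f"Decompose the instruction '{instruction}' into a short, ordered list of "
        f"simple motion commands similar to: {PRIMITIVES_HINT}. "
        "Return one command per line."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.strip("- ").strip()
            for line in reply.choices[0].message.content.splitlines()
            if line.strip()]
```

Because the returned phrases will rarely match the training annotations word for word, the CVAE's generalization to semantically similar commands is what makes this pipeline workable.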
Practical Implementation Considerations:
- Computational Requirements: Training requires significant simulation resources for RL and DAgger. Inference on hardware is shown to run on an AMD Ryzen 9 CPU, suggesting feasibility on reasonable onboard compute (see the control-loop sketch after this list).
- Data Requirements: Relies on a diverse MoCap dataset with text annotations (like HumanML3D) for teacher training.
- Robot Hardware: Demonstrated on a Unitree G1 humanoid. The approach requires a robot capable of dynamic whole-body control and accurate proprioceptive sensing.
- Sim-to-Real Gap: Addressed through domain randomization in simulation and the use of only proprioceptive inputs (without privileged information) for the student policy. While demonstrated as zero-shot, the authors note that a more expressive generative model (like diffusion) might further improve sim-to-real transfer by capturing more variance from domain randomization.
- Limitations: The current demonstration is limited to dozens of language-conditioned motions due to computational constraints in training. The focus is primarily on locomotion and upper-body movements, lacking integration with vision for tasks requiring interaction with the environment.
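Putting the deployment numbers together (50 Hz control, proprioception-only observations, deterministic latent), a minimal control-loop sketch might look like the following; `read_proprioception` and `send_pd_targets` are hypothetical hardware hooks, and the history length is a placeholder.

```python
import time
import torch

def run_policy(student, text_emb, read_proprioception, send_pd_targets,
               hist_len=15, dt=0.02):
    """Illustrative 50 Hz loop: keep a proprioceptive history, query the
    student deterministically, and stream desired joint positions."""
    history = []
    while True:
        t0 = time.time()
        obs = read_proprioception()   # (1, obs_dim) tensor: joint pos/vel, base vel, gravity
        history.append(obs)
        history = history[-hist_len:]
        if len(history) == hist_len:
            hist = torch.cat(history, dim=-1)
            with torch.no_grad():
                action, _, _ = student(text_emb, hist, obs, deterministic=True)
            send_pd_targets(action)   # PD targets for the joint controllers
        time.sleep(max(0.0, dt - (time.time() - t0)))
```

Such a loop is light enough that CPU-only inference, as reported on the Ryzen 9, is plausible.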
In conclusion, LangWBC represents a significant step towards intuitive, language-controlled humanoid robots by proposing an end-to-end learning framework that directly maps language to robust, dynamic whole-body actions, laying groundwork for potential foundation models in humanoid control.