OmniXtreme: Scalable Humanoid Motion Tracking
- OmniXtreme is a two-stage framework for scalable, high-fidelity tracking of diverse humanoid motions, decoupling general motor-skill learning from physical-skill refinement.
- The framework employs flow-matching pretraining with DAgger-style distillation to mitigate gradient interference and enhance performance on simulation benchmarks and real-world tests.
- A lightweight residual RL stage refines the base policy under actuation-aware constraints, ensuring robust execution of extreme motions on physical humanoid platforms.
OmniXtreme is a two-stage framework for scalable, high-fidelity humanoid motion tracking designed to address what its authors term the “generality barrier” in multi-motion policy learning and the sim-to-real actuation constraints that impede high-dynamic behaviors on physical humanoids. It targets the setting in which a single unified policy must track a large and heterogeneous motion library (including walking, dancing, flips, martial arts, and other high-difficulty clips) without the collapse in tracking fidelity that commonly accompanies scale. The framework decouples general motor-skill acquisition from physical-skill refinement by combining flow-matching generative pretraining with actuation-aware residual reinforcement learning, and is evaluated on simulation benchmarks and a Unitree G1 humanoid platform (Wang et al., 27 Feb 2026).
1. Problem setting and the “generality barrier”
OmniXtreme is motivated by a failure mode observed in high-dynamic humanoid control: as motion libraries grow to include hundreds of heterogeneous clips, conventional multi-motion RL controllers incur severe gradient interference and conservative averaging, causing fidelity to collapse on the hardest motions. The paper characterizes this as a long-standing fidelity–scalability trade-off: a controller can often either scale to diverse motion data or preserve high-fidelity tracking on extreme motions, but not both simultaneously (Wang et al., 27 Feb 2026).
The framework identifies two compounding bottlenecks. The first is the learning bottleneck in scaling multi-motion optimization. The second is the physical executability bottleneck that arises when policies trained under simplified actuator models are deployed on hardware. In the reported formulation, joint limits and naive torque caps are insufficient because they fail to capture velocity-dependent losses, regenerative braking, and power limits; under high-dynamic motions, small deviations can then cascade into hardware failure (Wang et al., 27 Feb 2026).
This decomposition is central to the method’s structure. OmniXtreme explicitly separates general motor-skill learning from sim-to-real physical-skill refinement. A plausible implication is that the framework treats policy generalization and hardware feasibility as related but distinct optimization problems rather than as a single end-to-end RL objective.
2. Motion representation and flow-matching pretraining
Let $\mathcal{M} = \{m_i\}$ denote the motion dataset, where each $m_i$ is a reference motion clip retargeted to the Unitree G1 humanoid. Observations are written as $o = (s, c, h)$, with $s$ denoting proprioception, $c$ denoting the commanded reference (the 6D torso orientation error together with the joint targets), and $h$ denoting the proprioceptive history over a fixed window of past steps (Wang et al., 27 Feb 2026).
For each motion $m_i$, an expert policy $\pi_i$ is trained via PPO imitation. The objective is then to distill the set of experts $\{\pi_i\}$ into a single flow-based policy $\pi_\theta$. Rather than using a conventional unified PPO controller, OmniXtreme uses a flow-matching objective intended to scale representation capacity while avoiding the interference-intensive gradients associated with multi-motion RL (Wang et al., 27 Feb 2026).
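The pretraining stage injects Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ into an expert action $a$ and defines a probability path between data and noise. In the standard linear-interpolation form that matches this description, the path is
$$a_t = (1 - t)\,a + t\,\epsilon, \qquad t \in [0, 1].$$
A time-indexed velocity field $v_\theta(a_t, t, o)$ is learned such that
$$v_\theta(a_t, t, o) \approx \frac{\mathrm{d}a_t}{\mathrm{d}t} = \epsilon - a.$$
Equivalently, the model minimizes the denoising loss
$$\mathcal{L}(\theta) = \mathbb{E}_{(o, a),\,\epsilon,\,t}\left[\lVert v_\theta(a_t, t, o) - (\epsilon - a) \rVert^2\right].$$
At inference, sampling begins from $a_1 \sim \mathcal{N}(0, I)$ and integrates backward with Euler steps:
$$a_{t - \Delta t} = a_t - \Delta t\, v_\theta(a_t, t, o).$$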
These equations define the policy as a conditional generative model over actions rather than a direct action regressor (Wang et al., 27 Feb 2026).
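The pretraining objective and sampler can be illustrated with a small NumPy sketch. It assumes a linear path with the expert action at flow time 0 and pure noise at flow time 1, which is one standard flow-matching choice and may differ from the paper's exact parameterization. The function names, the uniform time sampling, and the toy `oracle` velocity field are all hypothetical and chosen only for illustration.

```python
import numpy as np

def noised_action(a, eps, t):
    """Point on the assumed linear path: expert action at t=0, noise at t=1."""
    return (1.0 - t) * a + t * eps

def target_velocity(a, eps):
    """Time derivative of the linear path, used as the regression target."""
    return eps - a

def flow_matching_loss(v_fn, obs, a, rng):
    """Monte Carlo estimate of the flow-matching loss on one batch of actions."""
    eps = rng.standard_normal(a.shape)
    t = rng.uniform(0.0, 1.0, size=(a.shape[0], 1))  # assumed uniform flow time
    a_t = noised_action(a, eps, t)
    err = v_fn(a_t, t, obs) - target_velocity(a, eps)
    return float(np.mean(np.sum(err ** 2, axis=-1)))

def euler_sample(v_fn, obs, a_init, n_steps=10):
    """Integrate from t=1 (noise) back to t=0 (action) with fixed Euler steps."""
    a_t = np.array(a_init, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        a_t = a_t - dt * v_fn(a_t, t, obs)
    return a_t

# Toy check: if every expert action equals a_star, the exact velocity field
# on this path is (a_t - a_star) / t, so sampling should recover a_star.
a_star = np.array([0.3, -0.7])
oracle = lambda a_t, t, obs: (a_t - a_star) / t
rng = np.random.default_rng(0)
sample = euler_sample(oracle, obs=None, a_init=rng.standard_normal(2))
print(sample)
```

In a real system `v_fn` would be the learned network conditioned on the observation; the oracle here only makes the sampler's behavior easy to verify by hand.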
The paper attributes the benefit of this stage to two design choices: decoupling from RL gradients and using a generative objective. In the reported interpretation, this avoids the destructive gradient interference that plagues large multi-motion PPO. Fidelity-preserving noise and moderate domain-randomization are also included to instill base sim-to-real robustness without collapsing motion expressivity (Wang et al., 27 Feb 2026).
3. Architecture and training workflow
The policy architecture tokenizes the proprioception, the commanded reference, and each history step into embeddings using small MLPs, then processes them with a Transformer encoder followed by a large MLP head that implements the velocity field $v_\theta$. The use of a high-capacity architecture is presented as a mechanism for scaling representation capacity while preserving motion-specific detail across diverse datasets (Wang et al., 27 Feb 2026).
The training workflow is organized around a DAgger-style distillation loop. In the reported Algorithm 1, the system maintains a replay buffer $\mathcal{D}$, repeatedly samples a motion $m_i$, rolls out the current policy $\pi_\theta$ on that motion to collect observations $o$, labels those observations with expert actions $a^* = \pi_i(o)$, and stores the $(o, a^*)$ pairs in $\mathcal{D}$. Gradient steps then sample $(o, a^*) \sim \mathcal{D}$, draw noise $\epsilon$ and a flow time $t$, form the noised action $a_t$, compute the predicted velocity $v_\theta(a_t, t, o)$, and update $\theta$ by the flow-matching loss (Wang et al., 27 Feb 2026).
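The data-aggregation pattern of this loop can be sketched on a toy problem. The example below is an illustration under strong simplifying assumptions, not the paper's Algorithm 1: the "robot" is a scalar state, each motion is a constant target, the experts are hand-written proportional controllers, the student is a linear least-squares regressor rather than a flow model, and motions are visited round-robin instead of being sampled. All names are hypothetical.

```python
import numpy as np

def features(x, target):
    """Toy observation: current state, commanded target, and a bias term."""
    return np.array([x, target, 1.0])

def dagger_distill(targets, n_iters=12, horizon=30, seed=0):
    rng = np.random.default_rng(seed)
    expert = lambda x, target: 2.0 * (target - x)  # toy per-motion expert
    w = np.zeros(3)                                # student: action = w @ features
    buf_obs, buf_act = [], []                      # aggregated dataset D
    for it in range(n_iters):
        target = targets[it % len(targets)]        # round-robin stands in for sampling
        x = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            obs = features(x, target)
            buf_obs.append(obs)                    # state visited by the student
            buf_act.append(expert(x, target))      # label supplied by the expert
            x = x + 0.1 * float(w @ obs)           # the student's action drives the rollout
        # Refit the student on everything collected so far.
        w, *_ = np.linalg.lstsq(np.array(buf_obs), np.array(buf_act), rcond=None)
    return w

w = dagger_distill(targets=[-1.0, 0.5, 2.0])
print(w)
```

The key property is that states come from the student's own rollouts while labels come from the experts, so the student is corrected on the distribution it actually visits. In this toy case the experts share one linear law, so the fitted weights approach `[-2, 2, 0]`.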
The paper’s architectural overview separates the system into three operational phases: flow-matching pretraining via DAgger distillation, post-training residual RL refinement under actuation-aware constraints, and onboard inference. On hardware, onboard inference is performed with TensorRT acceleration at 50 Hz with end-to-end latency of approximately 10 ms (Wang et al., 27 Feb 2026).
This decomposition suggests a modular control stack. A plausible implication is that the base flow policy can be regarded as a generalized motion prior, while the post-training stage acts as a low-dimensional adaptation layer for physical execution.
4. Actuation-aware residual refinement
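After pretraining, OmniXtreme freezes the base flow policy $\pi_\theta$ and learns a lightweight residual MLP $\pi_\phi$ via PPO. The total command is the sum of the base action and the residual correction,
$$a = a^{\text{base}} + a^{\text{res}},$$
where $a^{\text{base}}$ is produced by the frozen base policy and $a^{\text{res}}$ by the residual policy. The residual policy's inputs include robot proprioception and the commanded reference, while the critic has access to privileged simulator state. This stage is described as a lightweight refinement layer intended to ensure physical executability without re-optimizing the full high-capacity base model (Wang et al., 27 Feb 2026).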
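A minimal sketch of the residual composition follows. It assumes a plain additive combination with an optional scale factor, and it uses a zero-initialized linear map as a stand-in for the residual MLP; the scale factor, the zero initialization, and the class and function names are illustrative assumptions rather than details confirmed by the source.

```python
import numpy as np

class LinearResidual:
    """Tiny stand-in for the residual network; zero weights mean no correction."""
    def __init__(self, obs_dim, act_dim):
        self.W = np.zeros((act_dim, obs_dim))

    def __call__(self, obs):
        return self.W @ np.asarray(obs, dtype=float)

def combined_action(base_action, residual_action, residual_scale=1.0):
    """Total command: frozen base action plus a (scaled) residual correction."""
    return np.asarray(base_action, dtype=float) + residual_scale * np.asarray(residual_action)

obs = np.array([0.2, -0.1, 0.4])
base_action = np.array([0.1, -0.2])        # would come from the frozen base policy
residual = LinearResidual(obs_dim=3, act_dim=2)
action = combined_action(base_action, residual(obs))
print(action)
```

With zero residual weights the combined command equals the base action, so training of the residual starts from the pretrained behavior and only learns corrections.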
The refinement stage is built around three actuation-aware components.
First, it imposes a torque-speed operating envelope. Given the instantaneous joint velocity $\dot{q}$ and the commanded torque $\tau_{\text{cmd}}$, the method defines a velocity-dependent torque bound, clips the commanded torque to the resulting envelope, and applies nonlinear friction losses. Together, these terms encode velocity-dependent actuation degradation rather than relying on fixed torque limits (Wang et al., 27 Feb 2026).
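To make the idea of a torque-speed envelope concrete, the sketch below uses a generic textbook motor model, not the paper's equations: full torque is available up to a knee speed, the available torque then falls linearly to zero at a maximum speed when the motor drives along the direction of motion, and a Coulomb-plus-viscous friction term is subtracted afterwards. Every parameter value and function name here is a hypothetical placeholder.

```python
import numpy as np

def torque_limit(qd, tau_peak, qd_knee, qd_max):
    """Torque magnitude available when driving along the direction of motion."""
    frac = (qd_max - np.abs(qd)) / (qd_max - qd_knee)
    return tau_peak * np.clip(frac, 0.0, 1.0)

def apply_envelope(tau_cmd, qd, tau_peak=100.0, qd_knee=10.0, qd_max=20.0):
    """Clip commanded torque to a speed-dependent envelope (assumed model)."""
    motoring = tau_cmd * qd > 0.0  # torque and velocity share a sign
    limit = np.where(motoring, torque_limit(qd, tau_peak, qd_knee, qd_max), tau_peak)
    return np.clip(tau_cmd, -limit, limit)

def apply_friction(tau, qd, tau_coulomb=1.0, viscous=0.1, v_smooth=0.1):
    """Subtract smoothed Coulomb friction plus viscous friction."""
    return tau - (tau_coulomb * np.tanh(qd / v_smooth) + viscous * qd)

tau_cmd = np.array([100.0, 100.0, 100.0, -100.0])
qd = np.array([0.0, 15.0, 25.0, 15.0])
tau_clipped = apply_envelope(tau_cmd, qd)
tau_out = apply_friction(tau_clipped, qd)
print(tau_clipped)  # [100, 50, 0, -100]
```

The example shows the qualitative effect: the same 100 N·m command is halved at 15 rad/s, unavailable at 25 rad/s, and left unclipped when it opposes the motion.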
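Second, the method introduces a power-safe regularization term that penalizes excessive negative power on knee joints. For each such joint $j$, the mechanical power is
$$P_j = \tau_j\,\dot{q}_j,$$
and the regularizer penalizes excessive negative values of $P_j$, accumulated over the knee joints.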
This term is aimed at regenerative-braking regimes that are especially relevant in aggressive landings and decelerations (Wang et al., 27 Feb 2026).
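A minimal sketch of such a penalty is shown below. It assumes a hinge on negative mechanical power beyond a safe margin, summed over knee joints; the hinge form, the `p_safe` margin, and the joint indices are illustrative assumptions, since the exact functional form is not given here.

```python
import numpy as np

def negative_power_penalty(tau, qd, knee_idx, p_safe=0.0):
    """Sum of negative mechanical power beyond a safe margin on the knee joints."""
    power = tau[knee_idx] * qd[knee_idx]         # P_j = tau_j * qd_j
    excess = np.maximum(0.0, -power - p_safe)    # only regenerative power past the margin
    return float(np.sum(excess))

tau = np.array([10.0, -20.0, 5.0])   # joint torques (N·m)
qd = np.array([2.0, 3.0, -1.0])      # joint velocities (rad/s)
penalty = negative_power_penalty(tau, qd, knee_idx=[1, 2], p_safe=10.0)
print(penalty)  # 50.0
```

Here joint 1 absorbs 60 W, which exceeds the 10 W margin by 50 W, while joint 2 absorbs only 5 W and contributes nothing; joint 0 is not a knee and is ignored.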
Third, the framework applies aggressive domain randomization with enlarged ranges for initial pose, contact friction, action delay, and surface perturbations, together with termination thresholds relaxed by a constant factor. The residual RL loop then rolls out the combined policy (frozen base plus residual), clips torques using the actuation-aware envelope before each simulation step, and updates the residual parameters with PPO using motion-tracking rewards plus the power penalty until convergence (Wang et al., 27 Feb 2026).
Ablation results reported in the paper show that motor constraints, aggressive domain randomization, and the power penalty are each critical for different failure modes. The paper does not reduce these failure modes to a single cause, which indicates that sim-to-real robustness is distributed across multiple physical effects rather than dominated by one actuator-modeling term (Wang et al., 27 Feb 2026).
5. Experimental configuration and quantitative performance
The reported simulation evaluation uses two motion libraries. LaFAN1 contains approximately 80 motions and serves as a standard benchmark. XtremeMotion contains approximately 60 curated extreme flips, acrobatics, b-boying, and martial arts clips. Diversity is measured by kinematic complexity, including maximum joint angular velocity, angular acceleration, jerk, center-of-mass vertical velocity, airborne ratio, and contact-switch frequency (Wang et al., 27 Feb 2026).
Evaluation metrics are MPJPE in millimeters, velocity error ($E_{\text{vel}}$) in millimeters per frame, acceleration error ($E_{\text{acc}}$) in millimeters per frame squared, and success rate under the same termination thresholds used during training. Unseen-motion generalization is assessed using 1,000 held-out clips from AMASS that were retargeted and excluded from training (Wang et al., 27 Feb 2026).
On the combined LaFAN1 plus XtremeMotion benchmark, from-scratch RL achieves MPJPE 47.95, velocity error ($E_{\text{vel}}$) 10.03, acceleration error ($E_{\text{acc}}$) 3.27, and success rate 82.95%. A Specialist→Unified MLP achieves 33.35, 6.70, 2.11, and 94.91%, respectively. OmniXtreme pretraining alone achieves 32.65, 6.34, 2.04, and 97.17%. OmniXtreme with post-training refinement achieves 30.93, 6.19, 2.13, and 98.54% (Wang et al., 27 Feb 2026).
On XtremeMotion alone, from-scratch RL yields MPJPE 54.19, velocity error ($E_{\text{vel}}$) 14.04, acceleration error ($E_{\text{acc}}$) 4.04, and success rate 79.45%. Specialist→Unified MLP yields 43.43, 11.38, 2.51, and 89.22%. OmniXtreme pretraining yields 37.11, 10.46, 2.39, and 95.16%. OmniXtreme with post-training yields 36.17, 9.94, 2.58, and 95.64% (Wang et al., 27 Feb 2026).
On unseen motions, from-scratch RL records MPJPE 56.87 and success rate 85.29%, Specialist→Unified MLP records 58.94 and 85.95%, OmniXtreme pretraining records 56.25 and 89.23%, and OmniXtreme with post-training records 56.05 and 89.54% (Wang et al., 27 Feb 2026).
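These results are reported as evidence that the flow-matching stage improves fidelity and success under large motion diversity, while the refinement stage contributes a further, smaller gain in MPJPE and success rate in both in-distribution and unseen settings. The gain is not uniform across metrics: the acceleration-level error (reported in millimeters per frame squared) rises slightly after post-training, from 2.04 to 2.13 on the combined benchmark and from 2.39 to 2.58 on XtremeMotion. This suggests that the two stages are complementary rather than redundant: the pretraining stage primarily addresses policy generality, whereas the post-training stage primarily addresses executability.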
6. Real-robot deployment, interpretation, and limitations
The physical testbed is a Unitree G1 humanoid with an Orin NX onboard computer. Inference runs at 50 Hz through TensorRT, and PD control maps actions to joint torques (Wang et al., 27 Feb 2026). The paper reports successful execution of multiple extreme motions by a single unified policy, including flips, handsprings, acrobatics, breakdance, and martial arts sequences.
The real-robot evaluation reports the following success rates: flips, 96.4% over 55 attempts across 7 motions; handsprings, 88.6% over 35 attempts across 5 motions; acrobatics, 80.0% over 15 attempts across 4 motions; breakdance, 86.4% over 22 attempts across 5 motions; and martial arts, 93.3% over 30 attempts across 3 motions. Across 24 motions and 157 attempts, the total success rate is 91.1% (Wang et al., 27 Feb 2026).
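The aggregate figure can be cross-checked from the per-category numbers. The short script below infers each category's success count by rounding the reported rate times the number of attempts; these counts are inferred from the rounded percentages rather than stated in the source.

```python
# (category, reported success rate in %, attempts, motions) as stated in the text
reported = [
    ("flips", 96.4, 55, 7),
    ("handsprings", 88.6, 35, 5),
    ("acrobatics", 80.0, 15, 4),
    ("breakdance", 86.4, 22, 5),
    ("martial arts", 93.3, 30, 3),
]

# Inferred success counts: nearest integer to rate * attempts.
successes = {name: round(rate * n / 100) for name, rate, n, _ in reported}
total_attempts = sum(n for _, _, n, _ in reported)
total_motions = sum(m for _, _, _, m in reported)
total_successes = sum(successes.values())
overall = round(100 * total_successes / total_attempts, 1)

print(successes)
print(total_motions, total_attempts, total_successes, overall)
```

The inferred counts (53, 31, 12, 19, and 28) sum to 143 successes over 157 attempts across 24 motions, which rounds to the reported 91.1%.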
In the paper’s interpretation, these outcomes indicate that OmniXtreme “breaks” the fidelity–scalability trade-off in high-dynamic humanoid control by maintaining high-fidelity tracking across diverse, high-difficulty datasets and transferring multiple extreme motions to hardware with a unified policy (Wang et al., 27 Feb 2026). A cautious reading is that the framework shifts the trade-off frontier rather than eliminating all constraints: the reported discussion still acknowledges residual hardware-bound failure modes.
The stated limitations are specific. The residual policy may not fully exploit the base model’s capacity, and extreme landing failures still expose unmodeled hardware limits, including battery voltage effects and current spiking (Wang et al., 27 Feb 2026). Future directions named in the paper include joint fine-tuning of the base policy under actuation constraints, such as direct flow-policy RL, richer power-system modeling, and scaling to visuomotor or interactive tasks; the paper cites Yi et al. in this context (Yi et al., 2 Feb 2026).
OmniXtreme therefore occupies a specific position in the humanoid-control literature: it is not merely a larger tracking policy, but a framework that partitions the problem into scalable motion representation and physically grounded actuation refinement. The central claim is not that high-capacity policies alone solve high-dynamic humanoid control, but that capacity must be paired with an optimization objective that avoids multi-motion interference and with a refinement stage that respects actuator-level operating constraints (Wang et al., 27 Feb 2026).