
OmniXtreme: Scalable Humanoid Motion Tracking

Updated 4 July 2026
  • OmniXtreme is a two-stage framework for scalable, high-fidelity tracking of diverse humanoid motions, decoupling general motor-skill learning from physical-skill refinement.
  • The framework employs flow-matching pretraining with DAgger-style distillation to mitigate gradient interference and enhance performance on simulation benchmarks and real-world tests.
  • A lightweight residual RL stage refines the base policy under actuation-aware constraints, ensuring robust execution of extreme motions on physical humanoid platforms.

OmniXtreme is a two-stage framework for scalable, high-fidelity humanoid motion tracking designed to address what its authors term the “generality barrier” in multi-motion policy learning and the sim-to-real actuation constraints that impede high-dynamic behaviors on physical humanoids. It targets the setting in which a single unified policy must track a large and heterogeneous motion library (including walking, dancing, flips, martial arts, and other high-difficulty clips) without the collapse in tracking fidelity that commonly accompanies scale. The framework decouples general motor-skill acquisition from physical-skill refinement by combining flow-matching generative pretraining with actuation-aware residual reinforcement learning, and is evaluated on simulation benchmarks and a Unitree G1 humanoid platform (Wang et al., 27 Feb 2026).

1. Problem setting and the “generality barrier”

OmniXtreme is motivated by a failure mode observed in high-dynamic humanoid control: as motion libraries grow to include hundreds of heterogeneous clips, conventional multi-motion RL controllers incur severe gradient interference and conservative averaging, causing fidelity to collapse on the hardest motions. The paper characterizes this as a long-standing fidelity–scalability trade-off: a controller can often either scale to diverse motion data or preserve high-fidelity tracking on extreme motions, but not both simultaneously (Wang et al., 27 Feb 2026).

The framework identifies two compounding bottlenecks. The first is the learning bottleneck in scaling multi-motion optimization. The second is the physical executability bottleneck that arises when policies trained under simplified actuator models are deployed on hardware. In the reported formulation, joint limits and naive torque caps are insufficient because they fail to capture velocity-dependent losses, regenerative braking, and power limits; under high-dynamic motions, small deviations can then cascade into hardware failure (Wang et al., 27 Feb 2026).

This decomposition is central to the method’s structure. OmniXtreme explicitly separates general motor-skill learning from sim-to-real physical-skill refinement. A plausible implication is that the framework treats policy generalization and hardware feasibility as related but distinct optimization problems rather than as a single end-to-end RL objective.

2. Motion representation and flow-matching pretraining

The motion dataset is denoted $M = \{m_i\}_{i=1}^M$, where each $m_i$ is a reference motion clip retargeted to the Unitree G1 humanoid. Observations are written as $o = (p, c, h)$, with $p$ consisting of proprioception $(q, \dot{q}, \text{base } \omega, \text{last action})$, $c$ denoting the commanded reference as 6D torso orientation error together with target joint positions and velocities $q^{ref}, \dot{q}^{ref}$, and $h$ denoting proprioceptive history over the past $K$ steps (Wang et al., 27 Feb 2026).

For each motion $m_i$, an expert policy is trained via PPO imitation. The objective is then to distill the resulting set of experts into a single flow-based policy. Rather than using a conventional unified PPO controller, OmniXtreme uses a flow-matching objective intended to scale representation capacity while avoiding the interference-intensive gradients associated with multi-motion RL (Wang et al., 27 Feb 2026).

The pretraining stage injects Gaussian noise and defines a probability path between noise samples and expert actions, indexed by a flow time. A time-indexed velocity field is learned to transport samples along this path; equivalently, the model minimizes a denoising loss. At inference, sampling begins from Gaussian noise and integrates backward with Euler steps. This formulation defines the policy as a conditional generative model over actions rather than a direct action regressor (Wang et al., 27 Feb 2026).
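As a rough illustration of this kind of objective, the sketch below uses one common linear-path convention, $x_t = (1 - t)\,a + t\,\epsilon$ with regression target $\epsilon - a$, which is consistent with starting from noise and integrating backward. The convention, the function names, and the closed-form `oracle_velocity` (valid only in the toy case where every expert action equals one fixed value) are assumptions made for illustration, not the paper's exact formulation.

```python
import random

def noised_action(action, noise, t):
    """Point on the assumed linear path: t=0 is the expert action, t=1 is pure noise."""
    return [(1.0 - t) * a + t * e for a, e in zip(action, noise)]

def target_velocity(action, noise):
    """Time derivative of the linear path, used as the regression target."""
    return [e - a for a, e in zip(action, noise)]

def flow_matching_loss(velocity_fn, action, rng):
    """Single-sample mean squared error between predicted and target velocity."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.uniform(0.05, 1.0)  # keep t away from 0 so the toy oracle stays well defined
    x_t = noised_action(action, noise, t)
    pred = velocity_fn(x_t, t)
    target = target_velocity(action, noise)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(action)

def euler_sample(velocity_fn, dim, n_steps, rng):
    """Start from Gaussian noise at t=1 and take Euler steps backward to t=0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = (n_steps - k) / n_steps
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check (hypothetical): if the only expert action is EXPERT,
# the ideal velocity field under this path is (x - EXPERT) / t.
EXPERT = [0.3, -0.7]

def oracle_velocity(x, t):
    return [(xi - ai) / t for xi, ai in zip(x, EXPERT)]

rng = random.Random(0)
sample = euler_sample(oracle_velocity, dim=2, n_steps=10, rng=rng)
loss = flow_matching_loss(oracle_velocity, EXPERT, rng)
```

In this toy case the oracle field attains zero loss and the Euler sampler lands on the expert action. A learned network conditioned on the observation would take the place of `oracle_velocity` in practice.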

The paper attributes the benefit of this stage to two design choices: decoupling from RL gradients and using a generative objective. In the reported interpretation, this avoids the destructive gradient interference that plagues large multi-motion PPO. Fidelity-preserving noise and moderate domain-randomization are also included to instill base sim-to-real robustness without collapsing motion expressivity (Wang et al., 27 Feb 2026).

3. Architecture and training workflow

The policy architecture tokenizes $p$, $c$, and each history step into embeddings using small MLPs, then processes them with a Transformer encoder followed by a large MLP head that implements the flow policy. The use of a high-capacity architecture is presented as a mechanism for scaling representation capacity while preserving motion-specific detail across diverse datasets (Wang et al., 27 Feb 2026).

The training workflow is organized around a DAgger-style distillation loop. In the reported Algorithm 1, the system maintains a replay buffer, repeatedly samples a motion from the library, rolls out the current flow policy on that motion to collect observations, labels those observations with expert actions, and stores the resulting observation-action pairs in the buffer. Gradient steps then sample a batch from the buffer, draw Gaussian noise and a flow time, form the corresponding point on the probability path, and update the policy parameters with the flow-matching loss (Wang et al., 27 Feb 2026).
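The data-collection pattern of such a loop can be sketched on a toy problem. This is a minimal sketch under strong assumptions: a hypothetical 1-D system, a single linear "expert", and a least-squares refit standing in for the flow-matching gradient step. None of the names or dynamics come from the paper. It illustrates only the DAgger idea that states are visited by the student while labels come from the expert.

```python
import random

def expert_action(obs):
    """Hypothetical specialist: proportional controller toward zero."""
    return -0.5 * obs

def step(obs, action):
    """Hypothetical 1-D dynamics: the action shifts the state directly."""
    return obs + action

def rollout_and_label(student_gain, horizon, rng, buffer):
    """Roll out the student, but store the expert's action for each visited state."""
    obs = rng.uniform(-1.0, 1.0)
    for _ in range(horizon):
        buffer.append((obs, expert_action(obs)))   # expert label on a student-visited state
        obs = step(obs, student_gain * obs)        # the student drives the rollout
    return buffer

def fit_gain(buffer):
    """Least-squares fit of action = gain * obs over the aggregated buffer."""
    num = sum(o * a for o, a in buffer)
    den = sum(o * o for o, _ in buffer)
    return num / den if den > 0.0 else 0.0

def dagger(n_iters=5, horizon=20, seed=0):
    rng = random.Random(seed)
    buffer, gain = [], 0.0           # the student starts as a do-nothing policy
    for _ in range(n_iters):
        rollout_and_label(gain, horizon, rng, buffer)
        gain = fit_gain(buffer)      # stand-in for the flow-matching update
    return gain, len(buffer)

learned_gain, n_samples = dagger()
```

Because the buffer is aggregated across iterations, the student is always trained on the states its own earlier versions reached, which is the property that distinguishes this scheme from plain behavior cloning on expert rollouts.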

The paper’s architectural overview separates the system into three operational phases: flow-matching pretraining via DAgger distillation, post-training residual RL refinement under actuation-aware constraints, and onboard inference. On hardware, onboard inference is performed with TensorRT acceleration at 50 Hz with end-to-end latency of approximately 10 ms (Wang et al., 27 Feb 2026).

This decomposition suggests a modular control stack. A plausible implication is that the base flow policy can be regarded as a generalized motion prior, while the post-training stage acts as a low-dimensional adaptation layer for physical execution.

4. Actuation-aware residual refinement

After pretraining, OmniXtreme freezes the base flow policy and learns a lightweight residual MLP via PPO. The total command combines the frozen base policy's action with the residual output. The residual policy's observations include robot proprioception and the commanded reference, while the critic has access to privileged simulator state. This stage is described as a lightweight refinement layer that aims to ensure physical executability without re-optimizing the full high-capacity base model (Wang et al., 27 Feb 2026).

The refinement stage is built around three actuation-aware components.

First, it imposes a torque-speed operating envelope. Given the instantaneous joint velocity and the commanded torque, the method defines velocity-dependent torque bounds, clips the commanded torque to that envelope, and then applies nonlinear friction losses. This model explicitly encodes velocity-dependent actuation degradation rather than relying on fixed torque limits (Wang et al., 27 Feb 2026).
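To make the idea of a torque-speed envelope concrete, here is a minimal sketch using a generic actuator model: full torque below a knee speed, linear derating to zero at a maximum speed, and a smoothed Coulomb plus viscous friction loss. The functional forms, the symmetric clipping, and every parameter value are assumptions chosen for illustration; they are not the paper's equations or the Unitree G1's actual motor data.

```python
import math

def torque_limit(qd, tau_peak, v_knee, v_max):
    """Assumed envelope: full torque below v_knee, linear derating to zero at v_max."""
    speed = abs(qd)
    if speed <= v_knee:
        return tau_peak
    if speed >= v_max:
        return 0.0
    return tau_peak * (v_max - speed) / (v_max - v_knee)

def clip_to_envelope(tau_cmd, qd, tau_peak, v_knee, v_max):
    """Clip the commanded torque to the speed-dependent bound (symmetric here)."""
    limit = torque_limit(qd, tau_peak, v_knee, v_max)
    return max(-limit, min(limit, tau_cmd))

def friction_torque(qd, coulomb, viscous, v_smooth=0.1):
    """Assumed loss model: smoothed Coulomb friction plus a viscous term."""
    return coulomb * math.tanh(qd / v_smooth) + viscous * qd

def delivered_torque(tau_cmd, qd, tau_peak=100.0, v_knee=10.0, v_max=20.0,
                     coulomb=1.0, viscous=0.1):
    """Torque reaching the joint after clipping and friction losses."""
    clipped = clip_to_envelope(tau_cmd, qd, tau_peak, v_knee, v_max)
    return clipped - friction_torque(qd, coulomb, viscous)
```

With these hypothetical numbers, a command of 80 N·m at 15 rad/s is cut to 50 N·m before friction is subtracted, whereas a fixed torque cap of 100 N·m would have passed it through unchanged. Real actuators are typically asymmetric between motoring and braking quadrants, which this sketch ignores.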

Second, the method introduces a power-safe regularization term: mechanical power is computed for each joint, and excessive negative power on the knee joints is penalized. This term is aimed at regenerative-braking regimes that are especially relevant in aggressive landings and decelerations (Wang et al., 27 Feb 2026).
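A penalty of this kind can be sketched as follows, assuming mechanical power is the product of joint torque and joint velocity and that the penalty is a squared hinge on absorbed power above a tolerance. The hinge form, the tolerance `p_allow`, and the function names are hypothetical; the paper's exact regularizer may differ.

```python
def joint_power(tau, qd):
    """Mechanical power at a joint; negative values mean the joint absorbs energy."""
    return tau * qd

def power_safe_penalty(taus, qds, knee_indices, p_allow):
    """Sum of squared excess absorbed power over the knee joints.

    p_allow is a hypothetical tolerance (in watts) on absorbed power.
    """
    penalty = 0.0
    for j in knee_indices:
        absorbed = -joint_power(taus[j], qds[j])   # positive when the joint is braking
        excess = max(0.0, absorbed - p_allow)
        penalty += excess ** 2
    return penalty
```

For example, a knee producing -20 N·m at 2 rad/s absorbs 40 W; with a 30 W tolerance the excess is 10 W and the penalty is 100. A joint doing positive work contributes nothing.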

Third, the framework applies aggressive domain randomization, with enlarged ranges for initial pose, contact friction, action delay, and surface perturbations, together with termination thresholds relaxed by a constant factor. The residual RL loop then rolls out the combined policy, clips torques using the actuation-aware model before each simulation step, and updates the residual policy with PPO using motion-tracking rewards plus the power-safe penalty until convergence (Wang et al., 27 Feb 2026).

Ablation results reported in the paper show that motor constraints, aggressive domain randomization, and the power penalty are each critical for different failure modes. The paper does not reduce these failure modes to a single cause, which indicates that sim-to-real robustness is distributed across multiple physical effects rather than dominated by one actuator-modeling term (Wang et al., 27 Feb 2026).

5. Experimental configuration and quantitative performance

The reported simulation evaluation uses two motion libraries. LaFAN1 contains approximately 80 motions and serves as a standard benchmark. XtremeMotion contains approximately 60 curated extreme flips, acrobatics, b-boying, and martial arts clips. Diversity is measured by kinematic complexity, including maximum joint angular velocity, angular acceleration, jerk, center-of-mass vertical velocity, airborne ratio, and contact-switch frequency (Wang et al., 27 Feb 2026).

Evaluation metrics are MPJPE in millimeters, velocity error in millimeters per frame, acceleration error in millimeters per frame squared, and success rate under the same termination thresholds used during training. Unseen-motion generalization is assessed using 1,000 held-out clips from AMASS that were retargeted and excluded from training (Wang et al., 27 Feb 2026).
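For readers unfamiliar with these metrics, the sketch below computes them under commonly used definitions: MPJPE as the mean Euclidean distance between tracked and reference joint positions, and the velocity and acceleration errors as the same distance applied to first and second frame-to-frame differences. These definitions and the trajectory layout `[frame][joint][xyz]` are assumptions; the paper may normalize or align the data differently.

```python
import math

def _dist(a, b):
    """Euclidean distance between two 3-D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def _diff(traj):
    """Frame-to-frame difference of a [frame][joint][xyz] trajectory."""
    return [[[c1 - c0 for c0, c1 in zip(j0, j1)] for j0, j1 in zip(f0, f1)]
            for f0, f1 in zip(traj[:-1], traj[1:])]

def mean_joint_error(pred, ref):
    """Mean Euclidean distance over all frames and joints."""
    dists = [_dist(p, r) for fp, fr in zip(pred, ref) for p, r in zip(fp, fr)]
    return sum(dists) / len(dists)

def tracking_metrics(pred, ref):
    """Return (MPJPE, velocity error, acceleration error) in the input's units.

    Requires at least three frames so that second differences exist.
    """
    vel_p, vel_r = _diff(pred), _diff(ref)
    acc_p, acc_r = _diff(vel_p), _diff(vel_r)
    return (mean_joint_error(pred, ref),
            mean_joint_error(vel_p, vel_r),
            mean_joint_error(acc_p, acc_r))
```

A constant positional offset yields a nonzero MPJPE but zero velocity and acceleration error, which is why the three metrics are reported together: the latter two capture jitter and timing mismatches that a position-only metric can hide.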

On the combined LaFAN1 plus XtremeMotion benchmark, from-scratch RL achieves MPJPE 47.95, velocity error 10.03, acceleration error 3.27, and success rate 82.95%. A Specialist→Unified MLP achieves 33.35, 6.70, 2.11, and 94.91%, respectively. OmniXtreme pretraining alone achieves 32.65, 6.34, 2.04, and 97.17%. OmniXtreme with post-training refinement achieves 30.93, 6.19, 2.13, and 98.54% (Wang et al., 27 Feb 2026).

On XtremeMotion alone, from-scratch RL yields MPJPE 54.19, velocity error 14.04, acceleration error 4.04, and success rate 79.45%. Specialist→Unified MLP yields 43.43, 11.38, 2.51, and 89.22%. OmniXtreme pretraining yields 37.11, 10.46, 2.39, and 95.16%. OmniXtreme with post-training yields 36.17, 9.94, 2.58, and 95.64% (Wang et al., 27 Feb 2026).

On unseen motions, from-scratch RL records MPJPE 56.87 and success rate 85.29%, Specialist→Unified MLP records 58.94 and 85.95%, OmniXtreme pretraining records 56.25 and 89.23%, and OmniXtreme with post-training records 56.05 and 89.54% (Wang et al., 27 Feb 2026).
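The MPJPE figures quoted above can be turned into relative changes with simple arithmetic. The snippet only restates reported numbers; the helper and variable names are illustrative.

```python
def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline

# MPJPE in millimeters, as quoted in the text.
mpjpe_combined = {"from_scratch_rl": 47.95, "specialist_to_unified_mlp": 33.35,
                  "omnixtreme_pretrain": 32.65, "omnixtreme_post": 30.93}
mpjpe_xtreme = {"from_scratch_rl": 54.19, "specialist_to_unified_mlp": 43.43,
                "omnixtreme_pretrain": 37.11, "omnixtreme_post": 36.17}

vs_scratch_combined = relative_reduction(
    mpjpe_combined["from_scratch_rl"], mpjpe_combined["omnixtreme_post"])            # about 35.5
vs_mlp_combined = relative_reduction(
    mpjpe_combined["specialist_to_unified_mlp"], mpjpe_combined["omnixtreme_post"])  # about 7.3
vs_scratch_xtreme = relative_reduction(
    mpjpe_xtreme["from_scratch_rl"], mpjpe_xtreme["omnixtreme_post"])                # about 33.3
vs_mlp_xtreme = relative_reduction(
    mpjpe_xtreme["specialist_to_unified_mlp"], mpjpe_xtreme["omnixtreme_post"])      # about 16.7

# Success-rate gain over from-scratch RL on the combined benchmark, in percentage points.
success_gain_points = 98.54 - 82.95                                                  # about 15.59
```

Read this way, the margin over the distilled MLP baseline is noticeably larger on XtremeMotion (about 17%) than on the combined benchmark (about 7%), in line with the claim that the gain concentrates on the hardest motions.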

These results are reported as evidence that the flow-matching stage improves fidelity and success under large motion diversity, while the refinement stage contributes an additional gain in both in-distribution and unseen settings. This suggests that the two stages are complementary rather than redundant: the pretraining stage primarily addresses policy generality, whereas the post-training stage primarily addresses executability.

6. Real-robot deployment, interpretation, and limitations

The physical testbed is a Unitree G1 humanoid with an Orin NX onboard computer. Inference runs at 50 Hz through TensorRT, and PD control maps actions to joint torques (Wang et al., 27 Feb 2026). The paper reports successful execution of multiple extreme motions by a single unified policy, including flips, handsprings, acrobatics, breakdance, and martial arts sequences.

The real-robot evaluation reports the following success rates: flips, 96.4% over 55 attempts across 7 motions; handsprings, 88.6% over 35 attempts across 5 motions; acrobatics, 80.0% over 15 attempts across 4 motions; breakdance, 86.4% over 22 attempts across 5 motions; and martial arts, 93.3% over 30 attempts across 3 motions. Across 24 motions and 157 attempts, the total success rate is 91.1% (Wang et al., 27 Feb 2026).
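The per-category figures are consistent with the reported totals, as a quick check shows. The success counts below are inferred by rounding each percentage against its attempt count, so they are implied values rather than numbers stated in the text.

```python
# (category, success rate in %, attempts, motions) as quoted in the text.
reported = [
    ("flips", 96.4, 55, 7),
    ("handsprings", 88.6, 35, 5),
    ("acrobatics", 80.0, 15, 4),
    ("breakdance", 86.4, 22, 5),
    ("martial arts", 93.3, 30, 3),
]

def implied_successes(rate_percent, attempts):
    """Nearest whole number of successes consistent with a rounded percentage."""
    return round(rate_percent * attempts / 100.0)

total_attempts = sum(n for _, _, n, _ in reported)
total_motions = sum(m for _, _, _, m in reported)
total_successes = sum(implied_successes(r, n) for _, r, n, _ in reported)
overall_rate = 100.0 * total_successes / total_attempts
```

The categories sum to 157 attempts over 24 motions, and the implied 143 successes give an overall rate that rounds to the reported 91.1%.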

In the paper’s interpretation, these outcomes indicate that OmniXtreme “breaks” the fidelity–scalability trade-off in high-dynamic humanoid control by maintaining high-fidelity tracking across diverse, high-difficulty datasets and transferring multiple extreme motions to hardware with a unified policy (Wang et al., 27 Feb 2026). A cautious reading is that the framework shifts the trade-off frontier rather than eliminating all constraints: the reported discussion still acknowledges residual hardware-bound failure modes.

The stated limitations are specific. The residual policy may not fully exploit the base model’s capacity, and extreme landing failures still expose unmodeled hardware limits, including battery voltage effects and current spiking (Wang et al., 27 Feb 2026). Future directions named in the paper include joint fine-tuning of the base policy under actuation constraints (for example, direct flow-policy RL), richer power-system modeling, and scaling to visuomotor or interactive tasks; the paper cites Yi et al. in this context (Yi et al., 2 Feb 2026).

OmniXtreme therefore occupies a specific position in the humanoid-control literature: it is not merely a larger tracking policy, but a framework that partitions the problem into scalable motion representation and physically grounded actuation refinement. The central claim is not that high-capacity policies alone solve high-dynamic humanoid control, but that capacity must be paired with an optimization objective that avoids multi-motion interference and with a refinement stage that respects actuator-level operating constraints (Wang et al., 27 Feb 2026).

