JT-SPI: Robust Joint Torque Perturbation
- JT-SPI is a sim-to-real robustness methodology that injects state-conditioned perturbations, generated by a randomly re-initialized neural network, into joint torques to address nonlinear actuator and contact-force mismatches.
- The approach leverages an MLP as a universal function approximator to generate bounded perturbations, vastly expanding the diversity of simulated dynamical discrepancies compared to traditional domain randomization.
- Empirical results on high-DOF legged robots demonstrate that JT-SPI significantly improves stability and sim-to-real transfer under challenging unmodeled disturbances.
Joint Torque Space Perturbation Injection (JT-SPI) is a sim-to-real robustness methodology in which, during simulation-time policy training, state-dependent perturbations generated by a neural network with randomly re-sampled weights are injected directly into the joint-torque inputs of the simulator's forward dynamics. JT-SPI recasts the sim-to-real "reality gap" as an unknown nonlinear mapping from nominal torques to realized torques, and exposes the control policy to a much richer family of actuator and contact-force mismatches than can be achieved by randomizing a standard, finite set of simulation parameters. The resulting policies are more robust to complex and previously unseen reality gaps, which facilitates transfer of motor skills from simulation to hardware for high-DOF legged robots (Cha et al., 9 Apr 2025).
1. Formal Definition and Underlying Principles
JT-SPI addresses the reality gap by modeling the mapping from commanded joint torques to actual joint torques as an unknown, potentially nonlinear and state-dependent function. Traditional domain randomization typically randomizes a fixed, finite set of simulator parameters (such as link masses or friction coefficients). JT-SPI, by contrast, introduces perturbations in the joint torque space that are state-conditioned and drawn from a wide functional class, parameterized via a universal function approximator (specifically, a multi-layer perceptron, or MLP).
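In simulation, the torque applied at the joints is the sum of the policy's nominal torque and an injected perturbation:

$$\tau_t = \tau_{\pi,t} + \tau_{\phi,t}$$

JT-SPI draws the perturbation from a function class (implemented as an MLP) whose weights $\phi$ are re-sampled each episode. At timestep $t$:
- Policy output (nominal torque): $\tau_{\pi,t} = \tau_{\text{limit}} \odot a_t$, with $a_t = \pi_\theta(o_t)$
- JT-SPI perturbation: $\tau_{\phi,t} = \sigma_{\text{lim}} \tanh\big(\mathrm{MLP}(\hat{o}^{\text{priv}}_t; \phi)\big)$

where $\sigma_{\text{lim}}$ limits the maximum perturbation magnitude, the inputs $\hat{o}^{\text{priv}}_t$ are normalized privileged full-state observations, and $\phi$ is re-sampled per episode from XavierUniform.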
The policy observes only the partial state $o_t$, while the perturbation generator has access to the normalized privileged full state $\hat{o}^{\text{priv}}_t$. This exposes the policy to a diverse, high-dimensional set of actuator and contact deviations.
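The perturbation generator described above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the privileged-observation size `OBS_DIM`, and the 20 Nm bound used in the example call are assumptions, while the bias-free two-layer 256-unit ReLU structure, the tanh output, and the Xavier-uniform sampling follow the description in the text.

```python
import numpy as np

def xavier_uniform(rng, fan_in, fan_out):
    """Xavier/Glorot uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def sample_perturbation_weights(rng, obs_dim, act_dim, hidden=256):
    """Draw a fresh set of weights phi (intended to be called once per episode).

    Only weight matrices are created: there are no bias terms.
    """
    sizes = [obs_dim, hidden, hidden, act_dim]
    return [xavier_uniform(rng, m, n) for m, n in zip(sizes[:-1], sizes[1:])]

def perturbation_torque(phi, obs_priv_normalized, sigma_lim):
    """tau_phi = sigma_lim * tanh(MLP(o_hat; phi)) with ReLU hidden layers."""
    h = obs_priv_normalized
    for w in phi[:-1]:
        h = np.maximum(h @ w, 0.0)          # bias-free ReLU layer
    return sigma_lim * np.tanh(h @ phi[-1])  # bounded in [-sigma_lim, sigma_lim]

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM = 48, 12  # OBS_DIM is a hypothetical size; 12 matches the action dimension
phi = sample_perturbation_weights(rng, OBS_DIM, ACT_DIM)

obs = rng.standard_normal(OBS_DIM)  # stand-in for a normalized privileged observation
tau_phi = perturbation_torque(phi, obs, sigma_lim=20.0)
tau_zero = perturbation_torque(phi, np.zeros(OBS_DIM), sigma_lim=20.0)
```

Because every layer is bias-free and both ReLU and tanh map zero to zero, `tau_zero` is exactly the zero vector, and the tanh scaling keeps every entry of `tau_phi` within the chosen bound.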
2. Algorithmic Implementation
JT-SPI is integrated into on-policy, high-throughput simulation frameworks (e.g., IsaacGym) using parallel rollouts. A high-level pseudocode is:
```
Initialize policy parameters θ, critic parameters, discriminator, etc.
for iteration = 1 … N_updates:
    for env = 1 … N_parallel:
        # Episode start
        sample φ_env ~ Xavier            # MLP weights for perturbation
        s_env = s0
        step = 0
        while not done and step < max_steps:
            o_env = partial_observation(s_env)
            a_env = π_θ(o_env)           # action in [-1,1]^12
            τ_π = τ_limit ⊙ a_env        # nominal torques
            o_priv = privileged_observation(s_env)
            ô_priv = o_priv / running_std
            τ_φ = σ_lim * tanh(MLP(ô_priv; φ_env))
            (s_next, r, done) = simulator.step(s_env, τ_π + τ_φ)
            store_transition(s_env, o_env, a_env, τ_φ, r, s_next, done)
            s_env = s_next
            step += 1
    # Policy optimization (PPO+AMP+gradpen)
```
Key scheduling aspects include re-sampling only at episode boundaries and perturbing 50% of rollouts to maintain learning stability. The perturbation MLP comprises two 256-unit ReLU hidden layers, a final tanh layer for bounded outputs, and uses zero bias in all layers to ensure zero input yields zero perturbation.
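One way to realize the 50% scheduling is a per-environment mask that zeroes the perturbation in the unperturbed rollouts. The sketch below is an assumption about how this could be done: the helper names are hypothetical, and whether the perturbed environments are fixed or redrawn at each episode is not specified in the source (a fixed split is used here for simplicity).

```python
import numpy as np

def make_perturbation_mask(n_envs, fraction=0.5):
    """Boolean mask marking which parallel environments receive a perturbation.

    Assumption: the first `fraction` of environments are perturbed.
    """
    n_perturbed = int(round(n_envs * fraction))
    mask = np.zeros(n_envs, dtype=bool)
    mask[:n_perturbed] = True
    return mask

def applied_torque(tau_pi, tau_phi, mask):
    """Torque sent to the simulator: nominal torque plus masked perturbation."""
    return tau_pi + mask[:, None] * tau_phi

N_ENVS, N_JOINTS = 8, 12
rng = np.random.default_rng(0)
tau_pi = rng.uniform(-100.0, 100.0, size=(N_ENVS, N_JOINTS))  # illustrative nominal torques
tau_phi = rng.uniform(-20.0, 20.0, size=(N_ENVS, N_JOINTS))   # illustrative perturbations
mask = make_perturbation_mask(N_ENVS)
tau = applied_torque(tau_pi, tau_phi, mask)
```

Rows of `tau` belonging to unperturbed environments equal the nominal torques, so half of the collected experience still comes from the unmodified simulator.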
3. Experimental Setup and Comparative Analysis
JT-SPI was validated using the TOCABI humanoid platform (100 kg, floating-base, high-ratio harmonic drives with gear ratio 100), controlled at 125 Hz. Training utilized PPO with Adversarial Motion Prior (AMP) imitation and a gradient penalty regularizer.
Comparison was made to:
- Domain Randomization (DR): Randomizes parameters such as terrain friction, link masses, center-of-mass offsets, armature inertia, damping, motor constant, latency, random pushes, and observation noise within prescribed ranges.
- ERFI baseline (Campanaro et al.): Applies untargeted additive torque noise at each joint.
JT-SPI differs fundamentally by generating state-dependent, potentially highly nonlinear perturbations rather than white-noise (ERFI) or a finite augmented parameter set (DR).
Perturbation and DR method characteristics:
| Method | Perturbation Type | State Dependence | Functional Family |
|---|---|---|---|
| DR | Parameter randomization | No | Predefined |
| ERFI | Untargeted torque noise | No | White noise |
| JT-SPI | Bounded, MLP-generated perturbation | Yes | Universal approx. |
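The "State Dependence" column can be illustrated with a small comparison. The code below is a simplified stand-in, not the ERFI or JT-SPI implementation: the white-noise function only mirrors the table's description of untargeted noise, the single bias-free layer `W` replaces the full MLP, and all sizes and bounds are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA_LIM = 20.0              # illustrative torque bound in Nm
N_JOINTS, STATE_DIM = 12, 6   # STATE_DIM is a hypothetical state size

def white_noise_torque(rng):
    """State-independent noise: a new random draw on every call."""
    return rng.uniform(-SIGMA_LIM, SIGMA_LIM, size=N_JOINTS)

# Fixed for the whole episode; a single bias-free layer stands in for the MLP.
W = rng.uniform(-0.5, 0.5, size=(STATE_DIM, N_JOINTS))

def state_conditioned_torque(state):
    """State-dependent perturbation: a deterministic function of the state."""
    return SIGMA_LIM * np.tanh(state @ W)

state = rng.standard_normal(STATE_DIM)

# Visiting the same state twice within one episode:
noise_a, noise_b = white_noise_torque(rng), white_noise_torque(rng)
cond_a, cond_b = state_conditioned_torque(state), state_conditioned_torque(state)
```

Within an episode the state-conditioned perturbation is repeatable for a repeated state (`cond_a` equals `cond_b`), so it behaves like a consistent model error, whereas the white-noise draws differ from call to call.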
4. Empirical Evaluation and Scenario Results
JT-SPI was evaluated on velocity-commanded humanoid walking, with forward velocity commands in m/s, yaw-rate commands in rad/s, and zero lateral command. Key performance metrics included forward velocity tracking error (mean, variance), lateral/yaw tracking, and success or failure at maintaining balance.
Test conditions covered:
- Nominal simulation: All methods exhibited similar velocity tracking and gait quality.
- Unseen actuator stiffness (250 Nm/rad, not seen during training): JT-SPI and ERFI succeeded; DR failed.
- Unseen contact compliance (MuJoCo solref time constant 0.2 s, soft ground): Only JT-SPI succeeded robustly; DR and ERFI failed (the robot fell).
- Sim-to-real transfer (lab, uneven/slippery floor): JT-SPI succeeded in all seeds, DR in 2/3, ERFI in none.
No additional ablations on fraction of perturbed environments or MLP size were reported. These results indicate a substantial robustness improvement of JT-SPI to both actuator and contact-variation reality gaps, with successful transfer under challenging real-world perturbations.
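The forward-velocity tracking metric listed above (mean and variance of the error) could be computed as follows. The function name and the sample values are made up for illustration only; they are not measurements from the paper.

```python
import numpy as np

def tracking_error_stats(commanded, measured):
    """Mean and variance of the signed tracking error (measured - commanded)."""
    error = np.asarray(measured, dtype=float) - np.asarray(commanded, dtype=float)
    return float(np.mean(error)), float(np.var(error))

# Hypothetical log of four control steps (values are illustrative only).
commanded = [0.5, 0.5, 0.5, 0.5]  # commanded forward velocity, m/s
measured = [0.4, 0.5, 0.6, 0.5]   # measured forward velocity, m/s
mean_err, var_err = tracking_error_stats(commanded, measured)
```

A near-zero mean with a small variance indicates unbiased, steady tracking; a fall would typically be recorded separately as a binary success or failure.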
5. Hyperparameters and Training Guidelines
Recommended settings and procedures include:
- MLP architecture: Two hidden layers of 256 ReLU units each; tanh output layer; all layers have zero bias (ensuring zero-motion state yields zero perturbation).
- Observation normalization: Normalize privileged observations by running standard deviation only, omitting mean subtraction.
- Perturbation magnitude (σ_lim): Start at 20 Nm, gradually increase to 50 Nm for joints. For base-force perturbations, use up to 80 N. Adjust based on observed gait stability in nominal simulation.
- Fraction of perturbed rollouts: 50% perturbation recommended to balance robustness and learning stability.
- Perturbation sampling: Sample new MLP weights once per episode; re-sampling at every timestep is discouraged because it produces high-frequency noise that harms policy learning.
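Two of the settings above, std-only observation normalization and the gradual increase of σ_lim, can be sketched as follows. Both pieces are assumptions about one reasonable realization: the class and function names are hypothetical, the running statistics use a standard batched variance update, and the linear ramp is only one possible reading of "gradually increase".

```python
import numpy as np

class RunningStd:
    """Tracks a running standard deviation and divides by it (no mean subtraction)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = np.zeros(dim)  # needed to compute the variance, not used to center
        self.m2 = np.zeros(dim)    # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, batch):
        batch = np.atleast_2d(batch)
        n = batch.shape[0]
        batch_mean = batch.mean(axis=0)
        batch_m2 = ((batch - batch_mean) ** 2).sum(axis=0)
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total
        self.m2 = self.m2 + batch_m2 + delta**2 * self.count * n / total
        self.count = total

    def std(self):
        return np.sqrt(self.m2 / max(self.count, 1))

    def normalize(self, obs):
        # Division only: a zero observation stays zero after normalization.
        return obs / (self.std() + self.eps)

def sigma_lim_schedule(progress, start=20.0, end=50.0):
    """Linear ramp of the joint-torque bound (Nm) over training progress in [0, 1]."""
    progress = float(np.clip(progress, 0.0, 1.0))
    return start + (end - start) * progress

rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, size=(10000, 3))  # synthetic privileged observations
normalizer = RunningStd(dim=3)
normalizer.update(data[:4000])
normalizer.update(data[4000:])
zero_out = normalizer.normalize(np.zeros(3))
```

Omitting mean subtraction matters for the bias-free MLP: a zero privileged observation stays zero after normalization, so it still produces zero perturbation.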
The design rationale rests on the Universal Approximation Theorem: the MLP class can approximate any continuous mapping from the (normalized) state to a torque perturbation, and the tanh output scaled by $\sigma_{\text{lim}}$ keeps that perturbation bounded. By re-sampling the weights $\phi$ each episode, the policy is forced to generalize over a large family of such mappings. Empirically, the resulting policies withstand larger unmodeled disturbances, supporting robust sim-to-real transfer (Cha et al., 9 Apr 2025).
6. Distinguishing Characteristics and Theoretical Implications
The JT-SPI approach enables policies to encounter a vastly richer set of actuator/force deviations compared to approaches based solely on predefined parameter randomization. State-dependent perturbations allow simulation of complex, context-sensitive discrepancies (e.g., nonuniform friction, actuator nonlinearities) which fixed-parameter or white-noise models cannot represent.
A plausible implication is that as simulator fidelity and robot actuation complexity increase, robust sim-to-real transfer will demand perturbation schemes with sufficient expressiveness, such as the universal function class used in JT-SPI. The recommendations to perturb only half of the rollouts and to re-sample at the episode level rather than the step level highlight important scheduler design considerations, although no ablation over the perturbed fraction was reported. JT-SPI thereby contributes a generalizable framework for bridging reality gaps in modern high-DOF robotic locomotion (Cha et al., 9 Apr 2025).