Force Randomization & Smoothness Rewards in RL
- The paper demonstrates that incorporating a force randomization curriculum significantly enhances disturbance rejection, reducing tracking error by up to 65%.
- The trajectory smoothness reward explicitly minimizes joint acceleration and jerk, yielding a 45% improvement in motion fluidity while maintaining high tracking accuracy.
- The integrated approach uses PPO with carefully tuned weights to balance tracking, energy, and smoothness, resulting in improved robustness and natural motion in humanoid teleoperation.
Force randomization and trajectory smoothness rewards are complementary components in reinforcement learning (RL) frameworks for neural control of humanoid robots, particularly in the context of adaptive teleoperation. These mechanisms, introduced in "Learning Adaptive Neural Teleoperation for Humanoid Robots: From Inverse Kinematics to End-to-End Control" (Atamuradov, 15 Nov 2025), are integral to achieving robustness against environmental disturbances and generating natural, fluid movement in RL-trained policies that directly map virtual reality (VR) controller inputs to robot joint commands. The force randomization curriculum serves as a disturbance induction method to enable robustness, while trajectory smoothness rewards explicitly encourage policies to minimize jerky or abrupt actuator profiles.
1. Mathematical Formulations
In RL fine-tuning, the per-timestep reward combines tracking, smoothness, and energy terms; the smoothness reward, the force-randomization curriculum, and the overall composition are defined as follows:
- Trajectory Smoothness Reward:
$r_t^{\text{smooth}} = -\|\ddot{q}_t\|^2 - \lambda_{\mathrm{jerk}}\|\dddot{q}_t\|^2$
Here, $\ddot{q}_t$ is the vector of joint accelerations, $\dddot{q}_t$ the joint jerks, and $\lambda_{\mathrm{jerk}}$ weights the jerk penalty relative to the acceleration penalty. Both quantities are computed by finite differences over consecutive actions, with no further normalization.
- Force Randomization (Disturbance Curriculum):
At each simulation step, an external Cartesian force is applied to each end-effector:
$F_t^{\text{ext}} \sim \mathcal{U}(-\alpha F_{\max},\, \alpha F_{\max})$ (per Cartesian axis)
Here, $\mathcal{U}$ denotes the uniform distribution sampled independently per axis, $F_{\max}$ is the task-defined force magnitude (e.g., 40 N for door opening), and $\alpha$ is a curriculum parameter linearly ramped from 0 to 1 during the first half of fine-tuning.
- Overall Reward Composition:
Tracking is penalized via Cartesian and rotational tracking error; energy is penalized as $r_t^{\text{energy}} = -\|\tau_t\|^2$, where $\tau_t$ denotes joint torques. The per-timestep reward is the weighted sum $r_t = w_{\text{track}}\, r_t^{\text{track}} + w_{\text{smooth}}\, r_t^{\text{smooth}} + w_{\text{energy}}\, r_t^{\text{energy}}$, where the weights $w_{\text{track}}$, $w_{\text{smooth}}$, $w_{\text{energy}}$ scale each contribution and are typically normalized so the terms are comparable in magnitude.
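As a concrete illustration, the following minimal sketch assembles these terms in Python; the helper name `reward`, the placeholder weight values, and the quadratic form of the tracking penalty are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def reward(q_ddot, q_dddot, tau, pos_err, rot_err,
           w_track=1.0, w_energy=0.01, w_smooth=0.1, lambda_jerk=0.1):
    """Per-timestep reward combining tracking, energy, and smoothness terms.

    q_ddot, q_dddot : joint acceleration / jerk vectors (finite-differenced)
    tau             : joint torques
    pos_err, rot_err: Cartesian and rotational tracking errors
    Weight values are placeholders, not the tuned values from the paper.
    """
    r_track = -(np.linalg.norm(pos_err) ** 2 + np.linalg.norm(rot_err) ** 2)  # assumed quadratic penalty
    r_energy = -np.linalg.norm(tau) ** 2
    r_smooth = -(np.linalg.norm(q_ddot) ** 2
                 + lambda_jerk * np.linalg.norm(q_dddot) ** 2)
    return w_track * r_track + w_energy * r_energy + w_smooth * r_smooth
```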
2. Mechanism and Curriculum of Force Randomization
Force randomization operates as a disturbance curriculum in which external forces, applied at the collision geometry of the robot's end-effectors, expose the RL policy to a spectrum of perturbations. The magnitude of these disturbances is governed by the curriculum parameter $\alpha$, which ramps from $0$ to $1$ during RL fine-tuning to progressively increase difficulty. Torso perturbations are also introduced to foster whole-body robustness.
During each 100 Hz simulation step, independent random forces are sampled for each Cartesian axis. For instance, in door opening, $F_{\max}$ is set to match realistic interaction forces (e.g., 40 N), and for pick-and-place it is set to roughly half the manipulated object's weight. This systematic exposure encourages the policy to develop implicit disturbance-compensation strategies through recurrent mechanisms (such as an LSTM hidden state).
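A minimal sketch of this sampling scheme follows; the linear 0-to-1 ramp over the first half of fine-tuning and the 40 N door-opening value come from the description above, while the function names, step counts, and exact API are assumptions.

```python
import numpy as np

def curriculum_alpha(step, total_steps):
    """Linearly ramp alpha from 0 to 1 over the first half of fine-tuning."""
    return min(1.0, step / (0.5 * total_steps))

def sample_external_force(alpha, f_max, rng):
    """Sample an independent uniform force per Cartesian axis in [-alpha*f_max, alpha*f_max]."""
    return rng.uniform(-alpha * f_max, alpha * f_max, size=3)

# Example: door-opening task with F_max = 40 N, applied at every 100 Hz simulation step.
rng = np.random.default_rng(0)
alpha = curriculum_alpha(step=200_000, total_steps=1_000_000)  # hypothetical step counts
force = sample_external_force(alpha, f_max=40.0, rng=rng)      # applied to each end-effector
```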
3. Trajectory Smoothness Quantification and Penalty
Trajectory smoothness is penalized by explicitly minimizing both joint acceleration ($\ddot{q}_t$) and jerk ($\dddot{q}_t$). Using both penalties, with the jerk term weighted by $\lambda_{\mathrm{jerk}}$, ensures motion profiles are not only energy efficient but also free of the high-frequency, abrupt changes typical of classical inverse kinematics and PD-based controllers. The values are computed via finite differencing over consecutive action outputs; no further normalization is performed since all joints share comparable units. The choice of $\lambda_{\mathrm{jerk}}$ is task-dependent and selected via grid search, reflecting the trade-off between motion speed and smoothness.
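The finite differencing can be sketched as follows, assuming a 100 Hz control step and a short history of joint-position targets; the buffer layout and variable names are assumptions.

```python
import numpy as np

DT = 0.01  # 100 Hz control step

def accel_and_jerk(q_hist):
    """Finite-difference joint acceleration and jerk from the last four
    joint-position targets q_{t-3..t} (shape: 4 x n_joints)."""
    q3, q2, q1, q0 = q_hist[-4], q_hist[-3], q_hist[-2], q_hist[-1]
    acc_prev = (q1 - 2.0 * q2 + q3) / DT**2   # acceleration at t-1
    acc      = (q0 - 2.0 * q1 + q2) / DT**2   # acceleration at t
    jerk     = (acc - acc_prev) / DT          # jerk at t
    return acc, jerk
```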
4. Integration of Reward Terms and Hyperparameter Selection
The reward terms $r_t^{\text{track}}$, $r_t^{\text{smooth}}$, and $r_t^{\text{energy}}$ are linearly combined, each modulated by its weight $w_{\text{track}}$, $w_{\text{smooth}}$, or $w_{\text{energy}}$. These weights are tuned so that each term contributes comparably to the optimization target of the PPO algorithm, typically via normalization or grid search. The smoothness penalty weight, force magnitude $F_{\max}$, force curriculum schedule $\alpha$, rotation-tracking penalty weight, and jerk weight $\lambda_{\mathrm{jerk}}$ are all systematically tuned, either by hand or via small-scale search, to optimize the trade-off between tracking accuracy and motion naturalness on held-out tasks.
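A hypothetical small-scale grid search over these hyperparameters might look like the sketch below; the candidate values, the `evaluate` stub, and the scalarized selection metric are all illustrative assumptions rather than the paper's procedure.

```python
import itertools

# Candidate values are illustrative, not the paper's tuned ranges.
grid = {
    "w_smooth":    [0.05, 0.1, 0.2],
    "lambda_jerk": [0.05, 0.1, 0.5],
    "f_max":       [20.0, 40.0],
}

def evaluate(config):
    """Placeholder: in practice, fine-tune with PPO under `config` and measure
    tracking error and mean joint acceleration on held-out tasks."""
    return 2.5, 6.0  # dummy (cm, rad/s^2) values so the sketch runs end-to-end

best, best_score = None, float("inf")
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    track_err, mean_accel = evaluate(config)
    score = track_err + 0.1 * mean_accel  # hypothetical scalarization of the trade-off
    if score < best_score:
        best, best_score = config, score
```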
5. Empirical Effects: Ablation and Metrics
Ablation studies isolate the significance of each component on tasks such as door opening:
| Variant | Track Error (cm) | Smoothness (mean $\|\ddot{q}\|$, rad/s²) | Success Rate (%) |
|---|---|---|---|
| Full (force + smoothness) | 2.3 | 5.8 | 89 |
| Without force curriculum | 3.8 | 6.1 | 76 |
| Without smoothness reward | 2.5 | 11.3 | 88 |
The force curriculum leads to a 13 percentage-point success rate gain under disturbance compared to its removal, and a 65% reduction in tracking error. The smoothness reward nearly halves mean joint acceleration compared to the baseline, eliminating abrupt, jerky motions without significantly affecting raw tracking accuracy (Atamuradov, 15 Nov 2025).
6. Interpretative Insights and Significance
By enforcing the necessity to succeed under randomized force perturbations, the force randomization curriculum conditions the policy to develop implicit disturbance observers and compensation strategies, operationalized through recurrent state (LSTM) modulation of feed-forward torques. This yields markedly superior robustness and a 52% reduction in force-induced tracking error versus the conventional IK+PD baseline.
The explicit smoothness penalty steers policy learning away from “jerky” behaviors characteristic of traditional solvers, instead producing fluid motion that is both more energy-efficient and more natural. Removing the smoothness reward leads to a dramatic increase (nearly double) in joint acceleration, signifying the necessity of explicit regularization for motion naturalness.
The coordinated use of force randomization and trajectory smoothness rewards, tightly integrated within a PPO objective, results in humanoid teleoperation policies that achieve lower tracking error (34%), substantially smoother actions (45%), robust disturbance rejection, and high real-world user preference (87%). Sim-to-real transfer is maintained with minimal performance degradation, underscoring the practical viability of this approach for adaptable and resilient humanoid control (Atamuradov, 15 Nov 2025).