CHIP: Humanoid Control via Hindsight Perturbation
- The paper’s main contribution is introducing a plug-and-play compliance module using hindsight perturbation to unify agile locomotion and safe manipulation in RL-based humanoid control.
- It employs random force perturbations and a compliance coefficient to adjust reference goals, enabling variable end-effector stiffness in dynamic tasks.
- Empirical results show enhanced tracking accuracy, improved compliance under force, and successful multi-robot collaboration across diverse manipulation tasks.
Compliant Humanoid Control through hIndsight Perturbation (CHIP) is an adaptive control paradigm designed to endow reinforcement-learning-based humanoid motion-tracking controllers with task-adaptive compliance while preserving agile, reference-driven performance. CHIP is structured as a plug-and-play module layered onto existing keypoint-tracking frameworks, enabling variable end-effector stiffness and compliant interaction for forceful manipulation. Unlike traditional reward-tuning or model-based approaches, CHIP combines hindsight goal correction (termed "hindsight perturbation") with a compliance coefficient to unify agile locomotion and safe, compliant manipulation in a single policy (Chen et al., 16 Dec 2025).
1. Control Problem and Motivation
Humanoid robots, particularly those trained with high-gain whole-body motion tracking, excel at dynamic tasks such as running, backflipping, and crawling. However, these controllers typically lack controlled compliance in contact-rich settings, rendering them excessively stiff and at risk of instability or hardware damage during manipulation tasks involving significant external forces. The principal challenge is to achieve:
- Sufficient force application for effective manipulation (e.g., door opening, pushing carts),
- Safe yielding or compliance under unexpected contact to avoid force spikes,
- Uninterrupted whole-body motion tracking of dynamic and variable trajectories.
Classical task-space impedance control, standard in fixed-base robotic arms, models end-effector compliance using a spring-damper law. However, this is not directly compatible with RL-based humanoid controllers that treat any deviation from the reference as tracking error, producing problematic force reactions during contact events (Chen et al., 16 Dec 2025).
2. Hindsight Perturbation Principle
CHIP's central innovation is the use of random end-effector force perturbations injected during training, coupled with a novel hindsight adjustment of the policy's goal observation. Specifically, rather than editing demonstration trajectories to simulate compliant responses, CHIP subtracts the expected spring deflection induced by the perturbation from the reference goal presented to the policy:

$\bm g_{\mathrm{hind}, t} = \bm g_{\mathrm{ref}, t} - c\, \bm f_{\mathrm{ext}, t},$

where $c$ is the compliance coefficient and $\bm f_{\mathrm{ext}, t}$ is the applied perturbation.
The reward, however, is still computed with respect to the original reference:

$r_t = r(s_t, \bm g_{\mathrm{ref}, t}).$
This construction ensures the policy experiences goals consistent with compliant deflection, promoting the emergence of force-aware, compliant behaviors without modification of demonstration data or reward structure. The approach enables the robot to "imitate" dataset motions as they would appear under compliance, facilitating natural behavioral synthesis. A single unified policy is trained that can infer external forces through proprioceptive drift accumulated in policy history (Chen et al., 16 Dec 2025).
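To make the construction concrete, below is a minimal sketch of the hindsight goal correction in Python; the function names and array shapes are illustrative rather than the paper's interface, and the sign convention follows the deflection equation above.

```python
import numpy as np

def hindsight_goal(g_ref: np.ndarray, f_ext: np.ndarray, c: float) -> np.ndarray:
    """Shift the reference keypoint goal by the expected spring deflection.

    g_ref : reference keypoint goal (here treated as a position vector)
    f_ext : force perturbation applied at the corresponding keypoint
    c     : scalar compliance coefficient (c = 0 recovers rigid tracking)
    """
    return g_ref - c * f_ext  # goal the policy observes during training

def tracking_reward(reward_fn, state, g_ref: np.ndarray) -> float:
    """Reward is evaluated against the ORIGINAL reference, not the shifted goal."""
    return reward_fn(state, g_ref)
```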
3. Mathematical Formulation and Control Architecture
3.1 Dynamics and Notation
The underlying robot is characterized by the standard rigid-body equations

$M(\bm q)\,\ddot{\bm q} + \bm h(\bm q, \dot{\bm q}) = \bm\tau + J(\bm q)^{\top} \bm f_{\mathrm{ext}},$

with joint configuration $\bm q$, bias forces $\bm h$ (Coriolis, centrifugal, gravity), joint torques $\bm\tau$, Jacobian $J(\bm q)$, and external end-effector force $\bm f_{\mathrm{ext}}$. The reference $\bm g_{\mathrm{ref}}$ typically denotes the 6D poses of salient keypoints (head, wrists).
3.2 Task-Space Impedance Control
The internal tracking controller utilizes a task-space impedance law of spring-damper form:

$\bm f = K_p\,(\bm x_{\mathrm{ref}} - \bm x) + K_d\,(\dot{\bm x}_{\mathrm{ref}} - \dot{\bm x}).$

In CHIP, the task-space stiffness is parameterized by an overall scalar compliance coefficient $c \ge 0$, where $c = 0$ corresponds to rigid tracking and larger $c$ yields a softer end effector. The commanded value of $c$ can be changed at every step, thereby controlling effective stiffness continuously during inference.
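A toy rendering of this spring-damper law with a scalar compliance coefficient is sketched below; the mapping $K_p = (1/c)\,I$ and the damping default are assumptions made for illustration, not the paper's exact parameterization.

```python
import numpy as np

def impedance_force(x_ref, x, xdot_ref, xdot, c, kd=20.0, c_min=1e-4):
    """Task-space spring-damper force with stiffness set by a scalar compliance.

    Assumes Kp = (1 / max(c, c_min)) * I, so c -> 0 approaches rigid tracking
    and larger c yields a softer end effector; kd and c_min are placeholders.
    """
    kp = 1.0 / max(c, c_min)
    spring = kp * (np.asarray(x_ref) - np.asarray(x))
    damper = kd * (np.asarray(xdot_ref) - np.asarray(xdot))
    return spring + damper
```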
3.3 Policy Input, Reward, and Training Protocol
The policy receives as input the proprioceptive state, a history window, past actions, the compliance coefficient $c$, and the hindsight-corrected goal $\bm g_{\mathrm{hind}, t}$. The reinforcement learning objective (PPO) is

$\mathbb{E} \left[ \sum_{t} \gamma^t\, r(s_t, \bm g_{\mathrm{ref}, t}) \right].$

Training involves sampling random perturbations (bounded random force magnitude, duration 1--3 s), issuing the adjusted goal, and executing trajectories in simulation. Privileged information (the true external force $\bm f_{\mathrm{ext}}$) is provided to the critic during training to accelerate force adaptation (Chen et al., 16 Dec 2025).
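The following condensed sketch shows how one training rollout step could combine these ingredients; all names and the force-magnitude bound are placeholders, while the 1--3 s perturbation window and the privileged critic input follow the description above.

```python
import numpy as np

def sample_perturbation(rng: np.random.Generator, f_max: float,
                        dt: float, horizon: int) -> np.ndarray:
    """Random end-effector push: random direction, magnitude up to f_max
    (a placeholder bound), held over a randomly placed 1-3 s window."""
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    start = int(rng.integers(0, horizon))
    length = int(rng.uniform(1.0, 3.0) / dt)
    forces = np.zeros((horizon, 3))
    forces[start:start + length] = rng.uniform(0.0, f_max) * direction
    return forces

def rollout_step(actor, critic, reward_fn, state, g_ref, f_ext, c):
    """One rollout step: the actor sees the hindsight goal, the critic is
    additionally given the true (privileged) force, and the reward is
    evaluated against the original reference."""
    g_hind = g_ref - c * f_ext                        # hindsight correction
    actor_obs = np.concatenate([state, [c], g_hind])
    critic_obs = np.concatenate([actor_obs, f_ext])   # privileged input
    action = actor(actor_obs)
    value = critic(critic_obs)
    reward = reward_fn(state, g_ref)                  # original reference
    return action, value, reward
```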
4. CHIP Algorithm: Implementation Details
CHIP interfaces with existing tracking architectures (e.g., SONIC, OmniH2O) via two additional policy inputs: the compliance coefficient $c$ and the corrected goal $\bm g_{\mathrm{hind}}$. The core algorithm can be summarized as:
- Training phase: Reference trajectory selection, perturbation sampling and hindsight-corrected goal computation, action sampling via PPO, reward evaluation with standard tracking terms (root, link, 3-point errors), and policy update.
- Inference phase: No perturbation or privileged input is required; the trained policy outputs control commands in response to proprioceptive input, the reference trajectory, and a user-specified or contextually inferred compliance coefficient $c$.
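The sketch below illustrates what the plug-and-play interface could look like at inference time; the wrapper class and argument names are hypothetical, the point being that only a compliance scalar and the goal channel are added on top of an existing tracking policy.

```python
import numpy as np

class CHIPWrapper:
    """Hypothetical wrapper around an existing keypoint-tracking policy."""

    def __init__(self, base_policy):
        self.base_policy = base_policy  # e.g., a SONIC/OmniH2O-style tracker

    def act(self, proprio: np.ndarray, g_ref: np.ndarray, c: float) -> np.ndarray:
        # At inference there is no perturbation and no privileged force;
        # the goal channel simply carries the reference, and c is supplied
        # by the user or an upstream module (e.g., a VLA policy).
        obs = np.concatenate([proprio, np.atleast_1d(c), g_ref])
        return self.base_policy(obs)

# Example: keep tracking stiff for agile motion, then soften for a wiping contact.
# action = chip.act(proprio, g_ref, c=0.0)    # rigid tracking
# action = chip.act(proprio, g_ref, c=0.05)   # compliant (value illustrative)
```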
The network input expansion is approximately 5%. The per-step overhead is negligible (2 ms on Jetson NX with TensorRT). Training utilizes 64 GPUs, 4096 environments per GPU, and four days of simulated time (Chen et al., 16 Dec 2025).
5. Empirical Performance and Use Cases
Experiments demonstrate that CHIP achieves:
- Tracking Accuracy: In zero-force (compliance-off) scenarios, CHIP incurs a global pose error of 0.08 m, outperforming FALCON (0.09 m) and a no-perturbation baseline (0.11 m). Local errors (0.02 m) are essentially identical across methods, indicating that compliance training does not degrade tracking (Chen et al., 16 Dec 2025).
- Compliance Modulation: Under a 20 N push, end-effector displacement scales linearly with $c$, in sharp contrast to the weak sensitivity achievable with reward-tuning methods.
- Multi-Robot Collaboration: Success rate in dual-robot grasp and lift (boxes, spheres) is 80% with CHIP, versus 5% (always-stiff) and 40% (no-perturbation).
- Teleoperation Scenarios: Unified controller demonstrates context-adaptive transitions (high-force door opening, compliant wiping, bimanual tasks with asymmetric compliance); on-the-fly tuning is supported.
- Vision–Language–Action (VLA): In VLA-trained tasks, autonomous large whiteboard wiping achieves 60% success, and bimanual wipe+hold on small whiteboard achieves 80%, based on 400 and 200 demonstrations, respectively.
6. Related Approaches and Contrasts
Other compliance adaptation strategies generally require explicit detection of force events and online gain switching. For example, one approach (Gao et al., 2020) uses a Bi-GRU-based predictor to model expected force profiles from kinematic data; anomalies trigger an adaptive impedance controller that rapidly reduces stiffness during unexpected contact and gradually restores high stiffness on task resumption. While achieving sub-newton force accuracy and <70 ms anomaly response, such approaches require explicit contact sensing, anomaly thresholds, and mode-switching logic.
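For contrast, the gain-switching pattern that such detection-based methods rely on can be caricatured as follows; this is a generic sketch of anomaly-triggered stiffness scheduling, not the specific controller of (Gao et al., 2020).

```python
def switched_stiffness(k_prev: float, anomaly: bool,
                       k_high: float = 400.0, k_low: float = 50.0,
                       recover_rate: float = 0.02) -> float:
    """Drop stiffness immediately when an unexpected-contact anomaly fires,
    then ramp back toward the nominal high gain once the anomaly clears.
    All gains and rates are illustrative placeholders."""
    if anomaly:
        return k_low                                     # rapid compliance
    return min(k_high, k_prev + recover_rate * k_high)   # gradual restoration
```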
By contrast, CHIP's approach is model- and sensor-agnostic: compliance emerges implicitly via hindsight perturbation, without reward modification, data augmentation, or explicit mode switching. This sidesteps the need for real-time anomaly detection or force thresholding while maintaining performance across diverse tasks (Chen et al., 16 Dec 2025).
7. Insights, Limitations, and Extensions
CHIP excels at tasks requiring rapid transitions between stiff and compliant interaction modes—such as simultaneously pushing and wiping surfaces, or switching between collaboration and solo tasks. Integration into existing keypoint tracking controllers is minimal, requiring fewer than 50 additional lines of code.
Limitations include the use of a scalar compliance coefficient, which restricts stiffness modulation to a single isotropic value rather than per-direction gains. Full 6×6 impedance adaptation would require wrist force/torque sensing. Failure modes arise when grasping very low objects (collisions with the knees) or during vision-based tracking loss in occluded teleoperation; addressing these would require lower-body reference augmentation and improved vision modules.
In summary, CHIP provides a scalable, principled framework for equipping RL-based whole-body humanoid controllers with adaptive compliance, operationalized through the injection of hindsight-corrected goals under randomized training perturbations—enabling robust, generalist manipulation without compromising nominal agility (Chen et al., 16 Dec 2025).