Feel the Force: Contact-Driven Learning from Humans (2506.01944v1)

Published 2 Jun 2025 in cs.RO and cs.AI

Abstract: Controlling fine-grained forces during manipulation remains a core challenge in robotics. While robot policies learned from robot-collected data or simulation show promise, they struggle to generalize across the diverse range of real-world interactions. Learning directly from humans offers a scalable solution, enabling demonstrators to perform skills in their natural embodiment and in everyday environments. However, visual demonstrations alone lack the information needed to infer precise contact forces. We present FeelTheForce (FTF): a robot learning system that models human tactile behavior to learn force-sensitive manipulation. Using a tactile glove to measure contact forces and a vision-based model to estimate hand pose, we train a closed-loop policy that continuously predicts the forces needed for manipulation. This policy is re-targeted to a Franka Panda robot with tactile gripper sensors using shared visual and action representations. At execution, a PD controller modulates gripper closure to track predicted forces, enabling precise, force-aware control. Our approach grounds robust low-level force control in scalable human supervision, achieving a 77% success rate across 5 force-sensitive manipulation tasks. Code and videos are available at https://feel-the-force-ftf.github.io.

Authors (8)
  1. Ademi Adeniji (6 papers)
  2. Zhuoran Chen (3 papers)
  3. Vincent Liu (33 papers)
  4. Venkatesh Pattabiraman (6 papers)
  5. Raunaq Bhirangi (10 papers)
  6. Siddhant Haldar (15 papers)
  7. Pieter Abbeel (372 papers)
  8. Lerrel Pinto (81 papers)

Summary

Controlling fine-grained forces during manipulation is a significant challenge in robotics, particularly when dealing with real-world variability and delicate objects. Traditional methods often rely on extensive robot-collected data or simulation, which struggle to generalize, or teleoperation, which can be difficult and expensive to scale. Learning directly from humans offers a compelling alternative, as humans naturally exhibit sophisticated force control in everyday tasks. However, extracting precise contact force information solely from visual human demonstrations is difficult.

The paper "Feel the Force: Contact-Driven Learning from Humans" (Adeniji et al., 2 Jun 2025 ) introduces FEELTHEFORCE (FTF), a robotic learning system designed to address this challenge by learning force-sensitive manipulation directly from human tactile behavior. FTF uses a tactile glove to capture contact forces alongside vision-based hand pose estimation from human demonstrations. This data is used to train a closed-loop policy that predicts desired hand trajectories and critical contact forces. At execution time, this policy is transferred to a robot equipped with tactile gripper sensors, and a low-level PD controller actively modulates the robot's gripper closure to track the predicted forces.

Here's a breakdown of FTF's implementation and practical application:

1. Data Acquisition

  • Human Demonstrations: Data is collected by having a human wear a custom ergonomic tactile glove and perform manipulation tasks naturally. Two calibrated RealSense cameras record the scene (hand and environment) from different viewpoints.
  • Tactile Glove: Inspired by AnySkin (Bhirangi et al., 12 Sep 2024), the glove uses magnetometer-based sensors primarily on the underside of the thumb. These sensors capture 3D force vectors. For FTF, the norm of the center magnetometer's force vector is aggregated over time to match the camera frame rate (30 fps), providing a continuous force reading.
  • Force-to-Newton Mapping: A mapping between the sensor norm and applied force (in Newtons) is calibrated by pressing the sensor on a weighing scale in various contact modes (Figure 6). This allows the raw sensor output to be translated into a force value; a sketch of such a calibration fit appears after this list.
  • Robot Hardware: A Franka Panda robot with custom 3D-printed gripper tips is used for deployment. These tips have mounts for AnySkin tactile sensors on one fingertip, mirroring the human glove setup.
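
To make the calibration concrete, here is a minimal sketch that fits a polynomial from sensor norms to scale readings and averages raw samples down to the 30 fps camera rate. The placeholder data, the quadratic model, and the function names are illustrative assumptions; the paper only states that the mapping is calibrated against a weighing scale.

```python
import numpy as np

# Placeholder calibration pairs from pressing the glove sensor on a weighing
# scale in various contact modes (cf. Figure 6); values are illustrative.
sensor_norms = np.array([12.0, 55.0, 130.0, 240.0, 410.0])  # raw magnetometer norms
scale_newtons = np.array([0.5, 2.0, 5.0, 10.0, 18.0])       # scale readings (N)

# Fit a quadratic norm-to-Newton mapping (the model order is an assumption).
coeffs = np.polyfit(sensor_norms, scale_newtons, deg=2)

def norm_to_newtons(norm: float) -> float:
    """Map a raw sensor norm to an estimated contact force in Newtons."""
    return float(np.polyval(coeffs, norm))

def aggregate_to_camera_rate(norms: np.ndarray, sensor_hz: int, camera_fps: int = 30) -> np.ndarray:
    """Average consecutive sensor samples so force readings align with camera frames."""
    window = sensor_hz // camera_fps              # sensor samples per camera frame
    usable = (len(norms) // window) * window      # drop the trailing partial window
    return norms[:usable].reshape(-1, window).mean(axis=1)
```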

2. Embodiment Agnostic Scene Representation

To bridge the gap between human and robot embodiments, FTF converts observations into a unified point-based representation:

  • Human-to-Robot Transfer: MediaPipe (Lugaresi et al., 2019) extracts 2D keypoints from human hand images, and triangulation from the two camera views lifts them to 3D (see the triangulation sketch after this list). The robot's end-effector position is computed as the midpoint between the thumb and index fingertips, and its orientation is derived from a rigid transform between the initial and current hand poses. N robot keypoints are then defined relative to this computed robot pose using predefined rigid transformations, yielding a point-based representation of the robot's state.
  • Scene Keypoints: Task-relevant objects are represented by sparse 3D keypoints. These are initially annotated by a human on one frame; DIFT (Tang et al., 2023) semantically propagates the annotations to the first frames of other demonstrations, and Co-Tracker (Karaev et al., 2023) tracks the points through each demonstration sequence, handling occlusions. Triangulation provides the 3D object keypoints in the robot's base frame. Relying on pre-trained vision models allows generalization to novel object instances at inference time.
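
A minimal sketch of the two-view lifting step is shown below, using OpenCV's `triangulatePoints` with calibrated projection matrices. The variable names and the availability of the projection matrices `P1` and `P2` are assumptions; the fingertip indices follow MediaPipe's hand-landmark convention.

```python
import numpy as np
import cv2

def triangulate_keypoints(P1, P2, pts_cam1, pts_cam2):
    """Triangulate matched 2D keypoints from two calibrated views into 3D.

    P1, P2: 3x4 camera projection matrices (intrinsics @ extrinsics).
    pts_cam1, pts_cam2: (N, 2) pixel coordinates of the same keypoints.
    Returns an (N, 3) array of points in the common (robot base) frame.
    """
    pts_h = cv2.triangulatePoints(
        P1, P2,
        pts_cam1.T.astype(np.float64),  # OpenCV expects 2xN arrays
        pts_cam2.T.astype(np.float64),
    )                                   # 4xN homogeneous coordinates
    return (pts_h[:3] / pts_h[3]).T     # dehomogenize to Nx3

def end_effector_position(hand_kps_3d, thumb_tip=4, index_tip=8):
    """End-effector position as the midpoint of the thumb and index
    fingertips (MediaPipe hand landmarks 4 and 8)."""
    return 0.5 * (hand_kps_3d[thumb_tip] + hand_kps_3d[index_tip])
```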

3. Policy Learning

  • Architecture: A Transformer policy (Haldar et al., 11 Jun 2024; Haldar et al., 27 Feb 2025) is used.
  • Inputs: The policy takes a history of observations as input: the tracked 3D robot points, 3D object points, the binarized robot gripper state (open/closed), and the continuous force value measured by the tactile glove. Gripper state and force are repeated to match the dimensionality of the point tracks.
  • Output: The policy predicts future trajectories for the robot points, the future robot gripper state (binary or continuous), and future gripper force predictions.
  • Training: The policy is trained with a mean squared error (MSE) loss between predicted and demonstrated values. Action chunking with exponential temporal averaging smooths the predicted trajectories. A minimal sketch of this setup follows the list.
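
The sketch below shows one way to assemble these inputs and train such a policy with MSE in PyTorch. The dimensions, tokenization, and layer sizes are illustrative assumptions rather than the paper's architecture, which builds on the point-policy models cited above.

```python
import torch
import torch.nn as nn

class FTFPolicy(nn.Module):
    """Illustrative point-based policy: a history of observations ->
    an H-step chunk of robot points, gripper state, and force."""

    def __init__(self, n_points=9, horizon=10, d_model=256):
        super().__init__()
        # Per-timestep input: N 3D robot points + N 3D object points, with the
        # scalar gripper state and force each repeated to the point-track width.
        obs_dim = 2 * n_points * 3 + 2 * n_points
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Each action: robot points + gripper state + gripper force.
        self.act_dim = n_points * 3 + 1 + 1
        self.horizon = horizon
        self.head = nn.Linear(d_model, horizon * self.act_dim)

    def forward(self, obs_history):                # (B, T, obs_dim)
        z = self.encoder(self.embed(obs_history))  # (B, T, d_model)
        out = self.head(z[:, -1])                  # decode from the latest token
        return out.view(-1, self.horizon, self.act_dim)

# One MSE training step on a dummy batch (shapes follow the assumptions above).
policy = FTFPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
obs = torch.randn(8, 5, 2 * 9 * 3 + 2 * 9)   # (batch, history length, obs_dim)
target = torch.randn(8, 10, 9 * 3 + 2)       # demonstrated action chunks
loss = nn.functional.mse_loss(policy(obs), target)
opt.zero_grad(); loss.backward(); opt.step()
```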

4. Inference and Force Control

The trained policy is deployed on the robot in a closed-loop manner.

  • Robot Pose Calculation: Predicted robot keypoints are mapped back to a robot pose (position and orientation) using rigid-body geometry; a rigid-alignment sketch follows this list.
  • PD Force Controller: A critical component is the PD controller that runs at inference time, within each policy step, to achieve the predicted forces.
    • When the policy predicts a desired force $\hat{F}_t$ for time step $t$, the controller adjusts the target gripper closure.
    • The gripper closure update $\Delta g_t$ at an inner-loop timestep $\tau$ (distinct from the policy frequency) is computed from the difference between the predicted force $\hat{F}_t$ and the force $F^f_t$ currently measured by the robot's tactile sensor:

      $$\Delta g_t = k \cdot (\hat{F}_t - F^f_t) \cdot \tau$$

      where $k$ is a proportional gain. The total gripper closure is updated iteratively: $g_{t+1} = g_t + \Delta g_t$.

    • The controller iterates until the measured force is sufficiently close to the predicted force, i.e., until $|\hat{F}_t - F^f_t| \leq \epsilon$.

    • Once the force converges, the robot executes the predicted end-effector pose, gripper state, and force for step $t$, reads the next state $s_{t+1}$, and the policy predicts the action chunk for step $t+1$.

    • The robot actions (pose and gripper state) are executed at approximately 6 Hz, while the PD controller runs at a higher frequency (e.g., the tactile sensor frequency) to track force within each policy step.
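
Recovering the pose from predicted keypoints amounts to rigidly aligning the predefined keypoint offsets with the policy's predicted points. A standard Kabsch/SVD alignment, sketched below under assumed variable names, is one way to implement the "rigid-body geometry" step; the paper does not specify the exact solver.

```python
import numpy as np

def pose_from_keypoints(canonical_pts, predicted_pts):
    """Find (R, t) such that R @ canonical_pts[i] + t ~= predicted_pts[i].

    canonical_pts: (N, 3) keypoint offsets defined in the end-effector frame.
    predicted_pts: (N, 3) keypoints predicted by the policy in the base frame.
    Standard Kabsch alignment via SVD.
    """
    c0 = canonical_pts.mean(axis=0)
    p0 = predicted_pts.mean(axis=0)
    H = (canonical_pts - c0).T @ (predicted_pts - p0)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T            # rotation (det = +1)
    t = p0 - R @ c0                                    # translation
    return R, t
```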

Here's a pseudocode representation of the inference process:

```
// Algorithm 2: FTF Policy Inference
Obtain initial object keypoints (DIFT propagation from an annotated frame)

for each time step t in rollout:
  Compute action chunk (â_t, ..., â_{t+H}) using Transformer policy T(a | state_t)
  Apply temporal aggregation to get the action for step t: â_t
  Parse action â_t into predicted robot points P̂_t, gripper state ĝ_t, and force F̂_t

  Convert predicted robot points P̂_t to a robot pose Rpose_t

  if ĝ_t > closure_threshold:
    // Algorithm 1: FORCEFEEDBACK_GRIPPER_CONTROL(F̂_t)
    Initialize gripper_closure <- current robot gripper closure
    repeat:
      Read current force F^f_t from the robot tactile sensor
      // Proportional update toward the predicted force
      delta_gripper = k * (F̂_t - F^f_t) * tau    // tau: inner-loop timestep
      gripper_closure = gripper_closure + delta_gripper
      Execute gripper closure command
    until |F̂_t - F^f_t| <= epsilon

    // After the force converges, execute the policy action for this step
    Execute robot pose Rpose_t
    Execute converged gripper_closure
  else if ĝ_t < open_threshold:
    Execute robot pose Rpose_t
    Open gripper
  else:
    Execute robot pose Rpose_t    // maintain current gripper state

  Read next state state_{t+1} (Co-Tracker for object points, robot sensors for proprioception)
end for
```
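
For concreteness, here is a minimal Python sketch of the inner force-tracking loop (Algorithm 1). The interfaces `read_force` and `set_gripper_closure`, along with the gain, tolerance, timestep, and iteration cap, are illustrative assumptions rather than the paper's implementation.

```python
import time

K_P = 0.002        # proportional gain, an assumed value
EPSILON = 0.1      # force tolerance in Newtons, an assumed value
TAU = 1.0 / 100.0  # inner-loop timestep, e.g. the tactile sensor period
MAX_ITERS = 200    # safety cap so the loop cannot spin forever

def force_feedback_gripper_control(f_target, read_force, set_gripper_closure, g_init):
    """Track a predicted force f_target by adjusting gripper closure.

    read_force():           returns the current tactile force F^f_t in Newtons.
    set_gripper_closure(g): commands the gripper to closure g.
    Both are assumed robot-interface callables, not a published API.
    """
    g = g_init
    for _ in range(MAX_ITERS):
        f_measured = read_force()
        if abs(f_target - f_measured) <= EPSILON:
            break  # measured force is close enough to the prediction
        # Proportional update: delta_g = k * (F_hat - F_f) * tau
        g += K_P * (f_target - f_measured) * TAU
        set_gripper_closure(g)
        time.sleep(TAU)  # run at the inner-loop (sensor) frequency
    return g  # converged closure, held for the rest of this policy step
```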

5. Experiments and Evaluation

The paper evaluates FTF on a Franka Panda robot in real-world tabletop tasks requiring force sensitivity:

  • Tasks: Place soft bread on plate (avoid crushing), unstack single plastic cup (isolate one cup), place egg in pot (avoid crushing), place bag of chips on plate (avoid crushing), twist and lift bottle cap (apply correct force to unscrew).
  • Baselines: FTF was compared against point-policy and P3-PO (Levy et al., 9 Dec 2024) baselines that use passive force inputs, different gripper action spaces (binary vs. continuous), and training on either human demonstrations or robot teleoperation data.
  • Results (Tables 1 and 2): FTF achieved significantly higher success rates (77% on average across tasks) than the baselines, particularly on tasks requiring precise force control (e.g., unstacking one cup, handling deformable objects). Baselines using passive force inputs, or simply mapping continuous human finger separation to robot gripper closure, struggled. FTF also generally outperformed baselines trained on robot teleoperation data, pointing to the quality and naturalness of human-collected tactile data.
  • Robustness (Table 5): FTF showed robustness to adversarial disturbances during execution (e.g., holding down a bag of chips), maintaining a 67% success rate in the perturbed "Place bag of chips" task. This indicates that the active force feedback control helps the robot adapt to unexpected tactile interactions.
  • Force Input Necessity (Table 3): An ablation showed that FTF could still perform effectively without explicit force input to the Transformer policy, implying that the model can infer required forces from visual and proprioceptive state and that the PD control loop is key to achieving them.

6. Practical Implications and Limitations

FTF offers a practical approach to endowing robots with force-aware manipulation skills by leveraging readily available human expertise. Learning from natural human demonstrations is more scalable than traditional robot teleoperation, especially for diverse real-world scenarios. The decoupling of learning (predicting force) and execution (tracking force via PD control) is crucial for robustness to embodiment differences and test-time perturbations.

Potential applications include manufacturing tasks requiring precise force, handling delicate or deformable objects in logistics or healthcare, and enabling robots to perform household chores that rely on fine-grained touch.

Current limitations noted by the authors include:

  • Aggregating shear and normal forces into a single norm loses directional information, which might be necessary for more complex dexterous tasks.
  • The data collection relies on a fixed, calibrated camera setup, limiting "in-the-wild" data collection. Using egocentric cameras and stereo triangulation could be a future direction.

Implementing FTF would require:

  • Developing or acquiring suitable tactile sensors for both human (glove) and robot (gripper).
  • Setting up a calibrated multi-camera system for 3D tracking.
  • Implementing the point-based representation pipeline (using libraries like MediaPipe, DIFT, Co-Tracker, and custom triangulation/retargeting code).
  • Training a Transformer policy architecture (using frameworks like PyTorch or TensorFlow).
  • Implementing the real-time inference loop, including the inner PD control loop for force tracking. This requires careful synchronization between policy execution frequency and the high-frequency force feedback loop.
  • Calibration procedures for sensors, cameras, and the force-to-Newton mapping.

Computational requirements would involve running the vision pipelines (MediaPipe, DIFT, Co-Tracker) and the Transformer policy inference in real time, along with the high-frequency PD control loop. The vision processing and Transformer inference might require GPU acceleration. Calibration is a one-time effort but needs precision. Deploying the PD controller requires low-latency communication with the robot's joint and end-effector controllers.
