
V-HOP: Visuo-Haptic 6D Object Pose Tracking (2502.17434v1)

Published 24 Feb 2025 in cs.RO, cs.AI, and cs.CV

Abstract: Humans naturally integrate vision and haptics for robust object perception during manipulation. The loss of either modality significantly degrades performance. Inspired by this multisensory integration, prior object pose estimation research has attempted to combine visual and haptic/tactile feedback. Although these works demonstrate improvements in controlled environments or synthetic datasets, they often underperform vision-only approaches in real-world settings due to poor generalization across diverse grippers, sensor layouts, or sim-to-real environments. Furthermore, they typically estimate the object pose for each frame independently, resulting in less coherent tracking over sequences in real-world deployments. To address these limitations, we introduce a novel unified haptic representation that effectively handles multiple gripper embodiments. Building on this representation, we introduce a new visuo-haptic transformer-based object pose tracker that seamlessly integrates visual and haptic input. We validate our framework in our dataset and the Feelsight dataset, demonstrating significant performance improvement on challenging sequences. Notably, our method achieves superior generalization and robustness across novel embodiments, objects, and sensor types (both taxel-based and vision-based tactile sensors). In real-world experiments, we demonstrate that our approach outperforms state-of-the-art visual trackers by a large margin. We further show that we can achieve precise manipulation tasks by incorporating our real-time object tracking result into motion plans, underscoring the advantages of visuo-haptic perception. Our model and dataset will be made open source upon acceptance of the paper. Project website: https://lhy.xyz/projects/v-hop/


Summary

  • The paper introduces V-HOP, a novel visuo-haptic transformer model that fuses visual and tactile data for accurate 6D object pose tracking.
  • It employs a unified haptic representation that converts tactile sensor data into point clouds, integrating both taxel- and vision-based signals across diverse gripper embodiments.
  • Extensive experiments demonstrate V-HOP’s superior performance, achieving up to 5% improvement in key metrics and 10x faster tracking than prior methods.

The paper "V-HOP: Visuo-Haptic 6D Object Pose Tracking" introduces a method, V-HOP, for 6D object pose tracking that integrates visual and haptic information. It addresses limitations in existing approaches that often underperform in real-world settings due to poor generalization across different grippers, sensor layouts, or sim-to-real environments, and which typically estimate object pose independently for each frame. The key contributions are a novel unified haptic representation and a visuo-haptic transformer-based object pose tracker.

The unified haptic representation handles multiple gripper embodiments by considering both tactile and kinesthetic information in the form of a point cloud. This representation spans both taxel-based and vision-based sensors. The visuo-haptic transformer leverages the robust visual prior captured by the visual foundation model while incorporating haptics. The method is validated on a custom dataset and the Feelsight dataset, and demonstrates generalization and robustness across novel embodiments, objects, and sensor types. Real-world experiments show that V-HOP outperforms state-of-the-art visual trackers and achieves precise manipulation tasks when its real-time object tracking result is incorporated into motion plans.

The problem is defined as estimating the object pose $\widehat{\mathbf{T}}_i$ at each timestep $i$, given a CAD model $\mathcal{M}_o$, a sequence of RGB-D images $\mathcal{O} = \{\mathbf{O}_i\}_{i=1}^t$, and an initial 6D pose $\mathbf{T}_0 = (\mathbf{R}_0, \mathbf{t}_0) \in \mathrm{SE}(3)$. Additional inputs include a gripper description in Unified Robot Description Format (URDF), gripper joint positions $\mathbf{j} = \{j_1, j_2, \dots, j_{\mathrm{DoF}}\}$, and tactile sensor data $\mathcal{S}$, including the positions $\mathcal{S}_p$ and readings $\mathcal{S}_r$ of the tactile sensors.
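For concreteness, these per-timestep inputs could be bundled as in the following Python sketch; the container and field names are hypothetical, simply mirroring the notation above rather than the released V-HOP code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrackerInputs:
    """Illustrative container for the per-timestep inputs; names are hypothetical."""
    rgbd: np.ndarray             # (H, W, 4) RGB-D observation O_i
    object_model: np.ndarray     # (n_o, 3) points sampled from the CAD model M_o
    prev_pose: np.ndarray        # (4, 4) previous pose estimate T_{i-1} in SE(3)
    joint_positions: np.ndarray  # (DoF,) gripper joint positions j
    taxel_positions: np.ndarray  # (n_t, 3) tactile sensor positions S_p
    taxel_readings: np.ndarray   # (n_t,) tactile sensor readings S_r
```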

Haptic representation involves converting tactile sensor data into point clouds, with tactile sensors broadly classified as taxel-based or vision-based. For taxel-based sensors, the tactile data $\mathcal{S} = \{s_i\}_{i=1}^{n_t}$ encapsulates $n_t$ taxels, where $s_i$ denotes an individual taxel. The tactile data consists of $\mathcal{S} = (\mathcal{S}_p, \mathcal{S}_r)$, where $\mathcal{S}_p$ represents positions defined in the gripper frame and transformed into the camera frame, and $\mathcal{S}_r$ captures contact values. Readings are commonly binarized into contact or no-contact states based on a threshold $\tau$. The set of taxels in contact is $\mathcal{S}_c = \{s_i \in \mathcal{S} \mid \mathcal{S}_r(s_i) > \tau\}$, and the corresponding tactile point cloud is $\mathcal{S}_{p,c} = \{\mathcal{S}_p(s_i) \mid s_i \in \mathcal{S}_c\}$. For vision-based sensors, the tactile data is $\mathcal{S} = (\mathcal{S}_p, \mathcal{S}_I)$, where $\mathcal{S}_p$ represents sensor positions in the camera frame, as in the taxel-based case, and $\mathcal{S}_I$ captures contact states as regular RGB images. A tactile depth estimation model converts $\mathcal{S}_I$ into a tactile point cloud $\mathcal{S}_{p,c}$.
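As a minimal sketch of the taxel-based case, the thresholding step and the resulting contact point cloud $\mathcal{S}_{p,c}$ might look as follows; the threshold value and array shapes are assumptions.

```python
import numpy as np

def tactile_contact_points(taxel_positions, taxel_readings, tau=0.1):
    """Binarize taxel readings against a threshold tau and return the tactile
    point cloud S_{p,c} of taxels currently in contact (sketch only)."""
    positions = np.asarray(taxel_positions)  # (n_t, 3), camera frame
    readings = np.asarray(taxel_readings)    # (n_t,) contact values
    in_contact = readings > tau              # S_c: taxels whose reading exceeds tau
    return positions[in_contact]             # S_{p,c}: positions of contacting taxels
```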

The method, V-HOP, fuses visual and haptic modalities to achieve accurate 6D object pose tracking. It uses hand and object representations, following the render-and-compare paradigm. Tactile signals represent only cutaneous stimulation, whereas haptic sensing combines tactile and kinesthetic feedback. A novel haptic representation integrates tactile signals and hand posture in a unified point cloud. Using the URDF definition and joint positions $\mathbf{j}$, the hand mesh $\mathcal{M}_h$ is generated through forward kinematics, and surface normals are computed. The mesh is then downsampled to produce a 9-D hand point cloud $\mathcal{P}_h = \{\mathbf{p}_i\}_{i=1}^{n_h}$, where each point $\mathbf{p}_i = (x_i, y_i, z_i, n_{ix}, n_{iy}, n_{iz}, \mathbf{c}) \in \mathbb{R}^9$.

Here, $x_i, y_i, z_i$ are the 3-D coordinates of the point; $n_{ix}, n_{iy}, n_{iz}$ are the components of the 3-D surface normal; and $\mathbf{c} \in \mathbb{R}^3$ is a one-hot encoded point label:

  • $[1, 0, 0]$: Hand point in contact
  • $[0, 1, 0]$: Hand point not in contact
  • $[0, 0, 1]$: Object point

To obtain the contact state of each point, the tactile point cloud $\mathcal{S}_{p,c}$ is mapped onto the downsampled hand point cloud $\mathcal{P}_h$. For each point in $\mathcal{S}_{p,c}$, its neighboring points in $\mathcal{P}_h$ within a radius $r$ are found and labeled as "in contact", while all others are labeled as "not in contact". The object model point cloud is denoted $\mathcal{P}_\Phi = \{\mathbf{q}_i\}_{i=1}^{n_o}$, where each point $\mathbf{q}_i$ follows the same 9-D definition, with $\mathbf{c} = [0, 0, 1]$ for all object points. At each timestep $i > 0$, the model point cloud is transformed into a hypothesized point cloud $\mathcal{P}_o = \{\mathbf{q}'_i\}_{i=1}^{n_o}$ according to the pose from the previous timestep $\mathbf{T}_{i-1}$. The hand point cloud $\mathcal{P}_h$ and the hypothesized object point cloud $\mathcal{P}_o$ are then fused into a hand-object point cloud $\mathcal{P} = \mathcal{P}_h \cup \mathcal{P}_o$.
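A compact sketch of this construction, assuming NumPy/SciPy conventions and an illustrative radius $r$, could look like the following: it labels hand points via a radius search around the tactile contacts, transforms the object model by the previous pose, and concatenates the two clouds.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_hand_object_cloud(hand_xyz, hand_normals, obj_xyz, obj_normals,
                            tactile_pts, prev_pose, r=0.01):
    """Sketch of the unified 9-D hand-object point cloud P = P_h ∪ P_o.
    Each point is (x, y, z, nx, ny, nz, c) with c a 3-D one-hot label.
    The radius r and array conventions are assumptions."""
    # Label hand points within radius r of any contacting taxel as "in contact".
    contact = np.zeros(len(hand_xyz), dtype=bool)
    if len(tactile_pts) > 0:
        tree = cKDTree(hand_xyz)
        for idx in tree.query_ball_point(tactile_pts, r):
            contact[idx] = True
    hand_label = np.where(contact[:, None],
                          np.array([1.0, 0.0, 0.0]),   # hand point in contact
                          np.array([0.0, 1.0, 0.0]))   # hand point not in contact
    P_h = np.hstack([hand_xyz, hand_normals, hand_label])

    # Transform the object model by the previous pose T_{i-1} to form the hypothesis P_o.
    R, t = prev_pose[:3, :3], prev_pose[:3, 3]
    obj_xyz_h = obj_xyz @ R.T + t
    obj_normals_h = obj_normals @ R.T
    obj_label = np.tile([0.0, 0.0, 1.0], (len(obj_xyz), 1))  # object points
    P_o = np.hstack([obj_xyz_h, obj_normals_h, obj_label])

    return np.vstack([P_h, P_o])   # fused hand-object point cloud P
```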

The network design involves a visual branch and a haptic branch. The visual branch uses a visual encoder $f_v$ to transform the RGB-D observation into visual embeddings $\mathbf{Z}_v = f_v(\mathbf{O})$. The haptic branch encodes the hand-object point cloud $\mathcal{P}$ with a haptic encoder $f_h$, yielding a haptic embedding $\mathbf{Z}_h = f_h(\mathcal{P})$. The visual encoder $f_v$ is frozen during training, and PointNet++ serves as the haptic encoder $f_h$. The visual embedding $\mathbf{Z}_v$ and haptic embedding $\mathbf{Z}_h$ are fed into Transformer encoders, which are fine-tuned along with the haptic encoder $f_h$. The model estimates 3-D translation and 3-D rotation using two output heads.
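A minimal PyTorch-style sketch of this two-branch design is given below; the specific modules, token shapes, and dimensions are assumptions standing in for the actual visual foundation model and PointNet++ encoder used in the paper.

```python
import torch
import torch.nn as nn

class VisuoHapticTracker(nn.Module):
    """Minimal sketch of the two-branch design: a frozen visual encoder,
    a point-cloud (haptic) encoder, Transformer fusion, and two pose heads.
    Module choices and dimensions are assumptions, not the paper's exact code."""
    def __init__(self, visual_encoder, haptic_encoder, d_model=256):
        super().__init__()
        self.visual_encoder = visual_encoder          # frozen visual foundation model
        for p in self.visual_encoder.parameters():
            p.requires_grad = False
        self.haptic_encoder = haptic_encoder          # e.g. a PointNet++-style encoder
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.rot_head = nn.Linear(d_model, 4)         # quaternion output
        self.trans_head = nn.Linear(d_model, 3)       # translation output

    def forward(self, rgbd, hand_object_cloud):
        z_v = self.visual_encoder(rgbd)               # (B, N_v, d_model) visual tokens
        z_h = self.haptic_encoder(hand_object_cloud)  # (B, N_h, d_model) haptic tokens
        tokens = self.fusion(torch.cat([z_v, z_h], dim=1))
        pooled = tokens.mean(dim=1)
        return self.rot_head(pooled), self.trans_head(pooled)
```

Freezing the visual encoder preserves the robust visual prior of the foundation model, so only the haptic encoder and the fusion Transformer are fine-tuned, matching the training setup described above.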

The model is trained by adding noise $(\mathbf{R}_\epsilon, \mathbf{t}_\epsilon)$ to the ground-truth pose $\mathbf{T} = (\mathbf{R}, \mathbf{t})$ to create the hypothesis pose $\widetilde{\mathbf{T}} = (\widetilde{\mathbf{R}}, \widetilde{\mathbf{t}})$. The model estimates the relative pose $\Delta\widehat{\mathbf{T}} = (\Delta\widehat{\mathbf{R}}, \Delta\widehat{\mathbf{t}})$ between the pose hypothesis and the observation. The model is optimized using the $L_2$ loss:

$\mathcal{L}_\mathbf{T} = \Vert \Delta\widehat{\mathbf{R}} - \mathbf{R}_\epsilon \Vert_2 + \Vert \Delta\widehat{\mathbf{t}} - \mathbf{t}_\epsilon \Vert_2$, where quaternion representations are used for rotations. The estimated pose $\widehat{\mathbf{T}} = (\widehat{\mathbf{R}}, \widehat{\mathbf{t}})$ is recovered as $\widehat{\mathbf{R}} = \Delta\widehat{\mathbf{R}} \cdot \widetilde{\mathbf{R}}$ and $\widehat{\mathbf{t}} = \Delta\widehat{\mathbf{t}} + \widetilde{\mathbf{t}}$ (a sketch of this loss follows the symbol list below).

  • $\mathcal{L}_\mathbf{T}$: The $L_2$ loss for pose estimation
  • $\Delta\widehat{\mathbf{R}}$: The estimated relative rotation
  • $\mathbf{R}_\epsilon$: The rotation noise
  • $\Delta\widehat{\mathbf{t}}$: The estimated relative translation
  • $\mathbf{t}_\epsilon$: The translation noise
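Assuming batched quaternion tensors for the rotation terms and 3-vectors for the translations (and omitting quaternion sign handling), the $L_2$ pose loss above reduces to a few lines:

```python
import torch

def pose_loss(delta_R_hat, delta_t_hat, R_eps, t_eps):
    """L2 pose loss from the text: rotations compared as quaternions,
    translations as 3-vectors. Inputs are batched tensors; quaternion
    normalization/sign handling is omitted for brevity (an assumption)."""
    rot_term = torch.norm(delta_R_hat - R_eps, dim=-1)    # (B,) quaternion difference
    trans_term = torch.norm(delta_t_hat - t_eps, dim=-1)  # (B,) translation difference
    return (rot_term + trans_term).mean()
```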

An attractive loss $\mathcal{L}_a$ and a penetration loss $\mathcal{L}_p$ are incorporated to encourage the object to make contact with the tactile point cloud $\mathcal{S}_{p,c}$ and to avoid penetrating the hand point cloud $\mathcal{P}_h$. The overall loss is $\mathcal{L} = \mathcal{L}_\mathbf{T} + \alpha \mathcal{L}_a + \beta \mathcal{L}_p$ (one possible formulation is sketched after the symbol list below).

  • $\mathcal{L}$: The overall loss function
  • $\mathcal{L}_\mathbf{T}$: The $L_2$ loss for pose estimation
  • $\mathcal{L}_a$: The attractive loss
  • $\mathcal{L}_p$: The penetration loss
  • $\alpha$: The weight of the attractive loss
  • $\beta$: The weight of the penetration loss
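The exact forms of $\mathcal{L}_a$ and $\mathcal{L}_p$ are not spelled out in this summary, so the following is only one plausible formulation under stated assumptions: the attractive term pulls the hypothesized object surface toward the contacting taxels, and the penetration term penalizes object points that fall behind the nearest hand point along its outward normal.

```python
import torch

def attractive_loss(obj_pts, tactile_pts):
    """Mean distance from each tactile contact point to its nearest object point.
    (One plausible formulation; the paper's exact definition may differ.)"""
    d = torch.cdist(tactile_pts, obj_pts)          # (n_c, n_o) pairwise distances
    return d.min(dim=1).values.mean()

def penetration_loss(obj_pts, hand_pts, hand_normals):
    """Penalize object points behind the nearest hand point, approximated by the
    signed distance along that hand point's outward normal (an assumption)."""
    d = torch.cdist(obj_pts, hand_pts)             # (n_o, n_h)
    nn_idx = d.argmin(dim=1)
    offset = obj_pts - hand_pts[nn_idx]            # vector from nearest hand point
    signed = (offset * hand_normals[nn_idx]).sum(dim=1)
    return torch.clamp(-signed, min=0.0).mean()    # only points behind the surface

def total_loss(l_pose, l_a, l_p, alpha=1.0, beta=1.0):
    """Overall loss L = L_T + alpha * L_a + beta * L_p; weights are placeholders."""
    return l_pose + alpha * l_a + beta * l_p
```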

A multi-embodied dataset was created using NVIDIA Isaac Sim, comprising approximately 1,550,000 images collected across eight grippers and thirteen objects. Performance is evaluated using the area under the curve (AUC) of ADD and ADD-S, and ADD(-S)-0.1d. V-HOP is compared against FoundationPose and ViTa. V-HOP consistently outperforms ViTa and FoundationPose on most objects with respect to ADD and across all objects in terms of ADD-S. On average, it delivers an improvement of 4% in ADD and 5% in ADD-S compared to FoundationPose.
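The ADD and ADD-S metrics referenced here are the standard ones: ADD averages distances between corresponding model points under the estimated and ground-truth poses, while ADD-S uses the closest-point distance to accommodate symmetric objects. A short reference sketch (not taken from the paper's evaluation code):

```python
import numpy as np
from scipy.spatial import cKDTree

def add_metric(model_pts, T_est, T_gt):
    """ADD: mean distance between corresponding model points transformed by the
    estimated and ground-truth poses (both 4x4 homogeneous transforms)."""
    p_est = model_pts @ T_est[:3, :3].T + T_est[:3, 3]
    p_gt = model_pts @ T_gt[:3, :3].T + T_gt[:3, 3]
    return np.linalg.norm(p_est - p_gt, axis=1).mean()

def add_s_metric(model_pts, T_est, T_gt):
    """ADD-S: symmetric variant using the closest-point distance."""
    p_est = model_pts @ T_est[:3, :3].T + T_est[:3, 3]
    p_gt = model_pts @ T_gt[:3, :3].T + T_gt[:3, 3]
    d, _ = cKDTree(p_gt).query(p_est)
    return d.mean()
```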

Ablation studies on input modalities show that both visual and tactile input are crucial. The performance of V-HOP and FoundationPose is evaluated across varying occlusion ratios. V-HOP consistently outperforms FoundationPose in both ADD and ADD-S metrics under different levels of occlusion. Benchmarking against NeuralFeels using the Feelsight dataset shows that V-HOP achieves a 32% lower ADD-S error compared to NeuralFeels and is approximately 10 times faster.

Sim-to-real transfer experiments are performed on a robot platform with dual Franka Research 3 arms and Barrett BH8-282 hands. In pose tracking experiments, V-HOP maintains stable object tracking throughout the trajectory, while FoundationPose often loses track. In bimanual handover experiments, V-HOP achieves a 40% higher average task success rate than FoundationPose. The Can-in-Mug task demonstrates that V-HOP delivers more stable tracking and a higher overall success rate. Studies of how visual and haptic inputs contribute to the final prediction suggest that when the gripper is not in contact with an object, the model relies predominantly on visual input; as the gripper establishes contact and occlusion becomes more severe, the model increasingly shifts its reliance toward haptic input.
