
V-HOP: Visuo-Haptic 6D Object Pose Tracking (2502.17434v1)

Published 24 Feb 2025 in cs.RO, cs.AI, and cs.CV

Abstract: Humans naturally integrate vision and haptics for robust object perception during manipulation. The loss of either modality significantly degrades performance. Inspired by this multisensory integration, prior object pose estimation research has attempted to combine visual and haptic/tactile feedback. Although these works demonstrate improvements in controlled environments or synthetic datasets, they often underperform vision-only approaches in real-world settings due to poor generalization across diverse grippers, sensor layouts, or sim-to-real environments. Furthermore, they typically estimate the object pose for each frame independently, resulting in less coherent tracking over sequences in real-world deployments. To address these limitations, we introduce a novel unified haptic representation that effectively handles multiple gripper embodiments. Building on this representation, we introduce a new visuo-haptic transformer-based object pose tracker that seamlessly integrates visual and haptic input. We validate our framework in our dataset and the Feelsight dataset, demonstrating significant performance improvement on challenging sequences. Notably, our method achieves superior generalization and robustness across novel embodiments, objects, and sensor types (both taxel-based and vision-based tactile sensors). In real-world experiments, we demonstrate that our approach outperforms state-of-the-art visual trackers by a large margin. We further show that we can achieve precise manipulation tasks by incorporating our real-time object tracking result into motion plans, underscoring the advantages of visuo-haptic perception. Our model and dataset will be made open source upon acceptance of the paper. Project website: https://lhy.xyz/projects/v-hop/


Summary

  • The paper introduces V-HOP, a novel visuo-haptic transformer model that fuses visual and tactile data for accurate 6D object pose tracking.
  • It employs a unified haptic representation that converts tactile sensor data into point clouds, integrating both taxel- and vision-based signals across diverse gripper embodiments.
  • Extensive experiments demonstrate V-HOP’s superior performance, achieving up to 5% improvement in key metrics and 10x faster tracking than prior methods.

The paper "V-HOP: Visuo-Haptic 6D Object Pose Tracking" introduces a method, V-HOP, for 6D object pose tracking that integrates visual and haptic information. It addresses limitations in existing approaches that often underperform in real-world settings due to poor generalization across different grippers, sensor layouts, or sim-to-real environments, and which typically estimate object pose independently for each frame. The key contributions are a novel unified haptic representation and a visuo-haptic transformer-based object pose tracker.

The unified haptic representation handles multiple gripper embodiments by considering both tactile and kinesthetic information in the form of a point cloud. This representation spans both taxel-based and vision-based sensors. The visuo-haptic transformer leverages the robust visual prior captured by the visual foundation model while incorporating haptics. The method is validated on a custom dataset and the Feelsight dataset, and demonstrates generalization and robustness across novel embodiments, objects, and sensor types. Real-world experiments show that V-HOP outperforms state-of-the-art visual trackers and achieves precise manipulation tasks when its real-time object tracking result is incorporated into motion plans.

The problem is defined as estimating the object pose $\widehat{\mathbf{T}}_i$ at each timestep $i$, given a CAD model $\mathcal{M}_o$, a sequence of RGB-D images $\mathcal{O} = \{\mathbf{O}_i\}_{i=1}^t$, and an initial 6D pose $\mathbf{T}_0 = (\mathbf{R}_0, \mathbf{t}_0) \in \mathrm{SE}(3)$. Additional inputs include a gripper description in Unified Robot Description Format (URDF), gripper joint positions $\mathbf{j} = \{j_1, j_2, \dots, j_{\mathrm{DoF}}\}$, and tactile sensor data $\mathcal{S}$, including the positions $\mathcal{S}_p$ and readings $\mathcal{S}_r$ of the tactile sensors.
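For concreteness, these per-timestep inputs could be bundled as in the following Python sketch; the container and field names are hypothetical, simply mirroring the notation above rather than the released V-HOP code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrackerInputs:
    """Illustrative container for the per-timestep inputs; names are hypothetical."""
    rgbd: np.ndarray             # (H, W, 4) RGB-D observation O_i
    object_model: np.ndarray     # (n_o, 3) points sampled from the CAD model M_o
    prev_pose: np.ndarray        # (4, 4) previous pose estimate T_{i-1} in SE(3)
    joint_positions: np.ndarray  # (DoF,) gripper joint positions j
    taxel_positions: np.ndarray  # (n_t, 3) tactile sensor positions S_p
    taxel_readings: np.ndarray   # (n_t,) tactile sensor readings S_r
```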

Haptic representation involves converting tactile sensor data into point clouds, with tactile sensors broadly classified as taxel-based or vision-based. For taxel-based sensors, the tactile data $\mathcal{S} = \{s_i\}_{i=1}^{n_t}$ encapsulates $n_t$ taxels, where $s_i$ denotes an individual taxel. The tactile data consists of $\mathcal{S} = (\mathcal{S}_p, \mathcal{S}_r)$, where $\mathcal{S}_p$ represents positions defined in the gripper frame and transformed into the camera frame, and $\mathcal{S}_r$ captures contact values. Readings are commonly binarized into contact or no-contact states based on a threshold $\tau$. The set of taxels in contact is $\mathcal{S}_c = \{s_i \in \mathcal{S} \mid \mathcal{S}_r(s_i) > \tau\}$, and the corresponding tactile point cloud is $\mathcal{S}_{p,c} = \{\mathcal{S}_p(s_i) \mid s_i \in \mathcal{S}_c\}$. For vision-based sensors, the tactile data is $\mathcal{S} = (\mathcal{S}_p, \mathcal{S}_I)$, where $\mathcal{S}_p$ represents sensor positions in the camera frame, as in the taxel-based case, and $\mathcal{S}_I$ captures contact states as regular RGB images. A tactile depth estimation model converts $\mathcal{S}_I$ into a tactile point cloud $\mathcal{S}_{p,c}$.
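As a minimal sketch of the taxel-based case, the thresholding step and the resulting contact point cloud $\mathcal{S}_{p,c}$ might look as follows; the threshold value and array shapes are assumptions.

```python
import numpy as np

def tactile_contact_points(taxel_positions, taxel_readings, tau=0.1):
    """Binarize taxel readings against a threshold tau and return the tactile
    point cloud S_{p,c} of taxels currently in contact (sketch only)."""
    positions = np.asarray(taxel_positions)  # (n_t, 3), camera frame
    readings = np.asarray(taxel_readings)    # (n_t,) contact values
    in_contact = readings > tau              # S_c: taxels whose reading exceeds tau
    return positions[in_contact]             # S_{p,c}: positions of contacting taxels
```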

The method, V-HOP, fuses visual and haptic modalities to achieve accurate 6D object pose tracking. It uses hand and object representations, following the render-and-compare paradigm. Tactile signals represent only cutaneous stimulation, whereas haptic sensing combines tactile and kinesthetic feedback. A novel haptic representation integrates tactile signals and hand posture in a unified point cloud. Using the URDF definition and joint positions $\mathbf{j}$, the hand mesh $\mathcal{M}_h$ is generated through forward kinematics, and surface normals are computed. The mesh is then downsampled to produce a 9-D hand point cloud $\mathcal{P}_h = \{\mathbf{p}_i\}_{i=1}^{n_h}$, where each point $\mathbf{p}_i = (x_i, y_i, z_i, n_{ix}, n_{iy}, n_{iz}, \mathbf{c}) \in \mathbb{R}^9$.

Here, $x_i, y_i, z_i$ are the 3-D coordinates of the point; $n_{ix}, n_{iy}, n_{iz}$ are the components of the 3-D surface normal; and $\mathbf{c} \in \mathbb{R}^3$ is a one-hot encoded point label:

  • $[1, 0, 0]$: Hand point in contact
  • $[0, 1, 0]$: Hand point not in contact
  • $[0, 0, 1]$: Object point

To obtain the contact state of each point, the tactile point cloud $\mathcal{S}_{p,c}$ is mapped onto the downsampled hand point cloud $\mathcal{P}_h$. For each point in $\mathcal{S}_{p,c}$, its neighboring points in $\mathcal{P}_h$ within a radius $r$ are found and labeled as "in contact", while all others are labeled as "not in contact". The object model point cloud is denoted $\mathcal{P}_\Phi = \{\mathbf{q}_i\}_{i=1}^{n_o}$, where each point $\mathbf{q}_i$ follows the same 9-D definition, with $\mathbf{c} = [0, 0, 1]$ for all object points. At each timestep $i > 0$, the model point cloud is transformed into a hypothesized point cloud $\mathcal{P}_o = \{\mathbf{q}'_i\}_{i=1}^{n_o}$ according to the pose from the previous timestep $\mathbf{T}_{i-1}$. The hand point cloud $\mathcal{P}_h$ and the hypothesized object point cloud $\mathcal{P}_o$ are then fused into a hand-object point cloud $\mathcal{P} = \mathcal{P}_h \cup \mathcal{P}_o$.
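A compact sketch of this construction, assuming NumPy/SciPy conventions and an illustrative radius $r$, could look like the following: it labels hand points via a radius search around the tactile contacts, transforms the object model by the previous pose, and concatenates the two clouds.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_hand_object_cloud(hand_xyz, hand_normals, obj_xyz, obj_normals,
                            tactile_pts, prev_pose, r=0.01):
    """Sketch of the unified 9-D hand-object point cloud P = P_h ∪ P_o.
    Each point is (x, y, z, nx, ny, nz, c) with c a 3-D one-hot label.
    The radius r and array conventions are assumptions."""
    # Label hand points within radius r of any contacting taxel as "in contact".
    contact = np.zeros(len(hand_xyz), dtype=bool)
    if len(tactile_pts) > 0:
        tree = cKDTree(hand_xyz)
        for idx in tree.query_ball_point(tactile_pts, r):
            contact[idx] = True
    hand_label = np.where(contact[:, None],
                          np.array([1.0, 0.0, 0.0]),   # hand point in contact
                          np.array([0.0, 1.0, 0.0]))   # hand point not in contact
    P_h = np.hstack([hand_xyz, hand_normals, hand_label])

    # Transform the object model by the previous pose T_{i-1} to form the hypothesis P_o.
    R, t = prev_pose[:3, :3], prev_pose[:3, 3]
    obj_xyz_h = obj_xyz @ R.T + t
    obj_normals_h = obj_normals @ R.T
    obj_label = np.tile([0.0, 0.0, 1.0], (len(obj_xyz), 1))  # object points
    P_o = np.hstack([obj_xyz_h, obj_normals_h, obj_label])

    return np.vstack([P_h, P_o])   # fused hand-object point cloud P
```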

The network design involves a visual branch and a haptic branch. The visual branch uses a visual encoder $f_v$ to transform the RGB-D observation into visual embeddings $\mathbf{Z}_v = f_v(\mathbf{O})$. The haptic branch encodes the hand-object point cloud $\mathcal{P}$ with a haptic encoder $f_h$, yielding a haptic embedding $\mathbf{Z}_h = f_h(\mathcal{P})$. The visual encoder $f_v$ is frozen during training, and PointNet++ serves as the haptic encoder $f_h$. The visual embedding $\mathbf{Z}_v$ and haptic embedding $\mathbf{Z}_h$ are fed into Transformer encoders, which are fine-tuned along with the haptic encoder $f_h$. The model estimates 3-D translation and 3-D rotation using two output heads.
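A minimal PyTorch-style sketch of this two-branch design is given below; the specific modules, token shapes, and dimensions are assumptions standing in for the actual visual foundation model and PointNet++ encoder used in the paper.

```python
import torch
import torch.nn as nn

class VisuoHapticTracker(nn.Module):
    """Minimal sketch of the two-branch design: a frozen visual encoder,
    a point-cloud (haptic) encoder, Transformer fusion, and two pose heads.
    Module choices and dimensions are assumptions, not the paper's exact code."""
    def __init__(self, visual_encoder, haptic_encoder, d_model=256):
        super().__init__()
        self.visual_encoder = visual_encoder          # frozen visual foundation model
        for p in self.visual_encoder.parameters():
            p.requires_grad = False
        self.haptic_encoder = haptic_encoder          # e.g. a PointNet++-style encoder
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.rot_head = nn.Linear(d_model, 4)         # quaternion output
        self.trans_head = nn.Linear(d_model, 3)       # translation output

    def forward(self, rgbd, hand_object_cloud):
        z_v = self.visual_encoder(rgbd)               # (B, N_v, d_model) visual tokens
        z_h = self.haptic_encoder(hand_object_cloud)  # (B, N_h, d_model) haptic tokens
        tokens = self.fusion(torch.cat([z_v, z_h], dim=1))
        pooled = tokens.mean(dim=1)
        return self.rot_head(pooled), self.trans_head(pooled)
```

Freezing the visual encoder preserves the robust visual prior of the foundation model, so only the haptic encoder and the fusion Transformer are fine-tuned, matching the training setup described above.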

The model is trained by adding noise $(\mathbf{R}_\epsilon, \mathbf{t}_\epsilon)$ to the ground-truth pose $\mathbf{T} = (\mathbf{R}, \mathbf{t})$ to create the hypothesis pose $\widetilde{\mathbf{T}} = (\widetilde{\mathbf{R}}, \widetilde{\mathbf{t}})$. The model estimates the relative pose $\Delta\widehat{\mathbf{T}} = (\Delta\widehat{\mathbf{R}}, \Delta\widehat{\mathbf{t}})$ between the pose hypothesis and the observation. The model is optimized using the $L_2$ loss:

$\mathcal{L}_\mathbf{T} = \Vert \Delta\widehat{\mathbf{R}} - \mathbf{R}_\epsilon \Vert_2 + \Vert \Delta\widehat{\mathbf{t}} - \mathbf{t}_\epsilon \Vert_2$, where quaternion representations are used for rotations. The estimated pose $\widehat{\mathbf{T}} = (\widehat{\mathbf{R}}, \widehat{\mathbf{t}})$ is recovered as $\widehat{\mathbf{R}} = \Delta\widehat{\mathbf{R}} \cdot \widetilde{\mathbf{R}}$ and $\widehat{\mathbf{t}} = \Delta\widehat{\mathbf{t}} + \widetilde{\mathbf{t}}$ (a sketch of this loss follows the symbol list below).

  • $\mathcal{L}_\mathbf{T}$: The $L_2$ loss for pose estimation
  • $\Delta\widehat{\mathbf{R}}$: The estimated relative rotation
  • $\mathbf{R}_\epsilon$: The rotation noise
  • $\Delta\widehat{\mathbf{t}}$: The estimated relative translation
  • $\mathbf{t}_\epsilon$: The translation noise
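Assuming batched quaternion tensors for the rotation terms and 3-vectors for the translations (and omitting quaternion sign handling), the $L_2$ pose loss above reduces to a few lines:

```python
import torch

def pose_loss(delta_R_hat, delta_t_hat, R_eps, t_eps):
    """L2 pose loss from the text: rotations compared as quaternions,
    translations as 3-vectors. Inputs are batched tensors; quaternion
    normalization/sign handling is omitted for brevity (an assumption)."""
    rot_term = torch.norm(delta_R_hat - R_eps, dim=-1)    # (B,) quaternion difference
    trans_term = torch.norm(delta_t_hat - t_eps, dim=-1)  # (B,) translation difference
    return (rot_term + trans_term).mean()
```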

An attractive loss $\mathcal{L}_a$ and a penetration loss $\mathcal{L}_p$ are incorporated to encourage the object to make contact with the tactile point cloud $\mathcal{S}_{p,c}$ and to avoid penetrating the hand point cloud $\mathcal{P}_h$. The overall loss is $\mathcal{L} = \mathcal{L}_\mathbf{T} + \alpha \mathcal{L}_a + \beta \mathcal{L}_p$ (one possible formulation is sketched after the symbol list below).

  • $\mathcal{L}$: The overall loss function
  • $\mathcal{L}_\mathbf{T}$: The $L_2$ loss for pose estimation
  • $\mathcal{L}_a$: The attractive loss
  • $\mathcal{L}_p$: The penetration loss
  • $\alpha$: The weight of the attractive loss
  • $\beta$: The weight of the penetration loss
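The exact forms of $\mathcal{L}_a$ and $\mathcal{L}_p$ are not spelled out in this summary, so the following is only one plausible formulation under stated assumptions: the attractive term pulls the hypothesized object surface toward the contacting taxels, and the penetration term penalizes object points that fall behind the nearest hand point along its outward normal.

```python
import torch

def attractive_loss(obj_pts, tactile_pts):
    """Mean distance from each tactile contact point to its nearest object point.
    (One plausible formulation; the paper's exact definition may differ.)"""
    d = torch.cdist(tactile_pts, obj_pts)          # (n_c, n_o) pairwise distances
    return d.min(dim=1).values.mean()

def penetration_loss(obj_pts, hand_pts, hand_normals):
    """Penalize object points behind the nearest hand point, approximated by the
    signed distance along that hand point's outward normal (an assumption)."""
    d = torch.cdist(obj_pts, hand_pts)             # (n_o, n_h)
    nn_idx = d.argmin(dim=1)
    offset = obj_pts - hand_pts[nn_idx]            # vector from nearest hand point
    signed = (offset * hand_normals[nn_idx]).sum(dim=1)
    return torch.clamp(-signed, min=0.0).mean()    # only points behind the surface

def total_loss(l_pose, l_a, l_p, alpha=1.0, beta=1.0):
    """Overall loss L = L_T + alpha * L_a + beta * L_p; weights are placeholders."""
    return l_pose + alpha * l_a + beta * l_p
```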

A multi-embodied dataset was created using NVIDIA Isaac Sim, comprising approximately 1,550,000 images collected across eight grippers and thirteen objects. Performance is evaluated using the area under the curve (AUC) of ADD and ADD-S, and ADD(-S)-0.1d. V-HOP is compared against FoundationPose and ViTa. V-HOP consistently outperforms ViTa and FoundationPose on most objects with respect to ADD and across all objects in terms of ADD-S. On average, it delivers an improvement of 4% in ADD and 5% in ADD-S compared to FoundationPose.
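The ADD and ADD-S metrics referenced here are the standard ones: ADD averages distances between corresponding model points under the estimated and ground-truth poses, while ADD-S uses the closest-point distance to accommodate symmetric objects. A short reference sketch (not taken from the paper's evaluation code):

```python
import numpy as np
from scipy.spatial import cKDTree

def add_metric(model_pts, T_est, T_gt):
    """ADD: mean distance between corresponding model points transformed by the
    estimated and ground-truth poses (both 4x4 homogeneous transforms)."""
    p_est = model_pts @ T_est[:3, :3].T + T_est[:3, 3]
    p_gt = model_pts @ T_gt[:3, :3].T + T_gt[:3, 3]
    return np.linalg.norm(p_est - p_gt, axis=1).mean()

def add_s_metric(model_pts, T_est, T_gt):
    """ADD-S: symmetric variant using the closest-point distance."""
    p_est = model_pts @ T_est[:3, :3].T + T_est[:3, 3]
    p_gt = model_pts @ T_gt[:3, :3].T + T_gt[:3, 3]
    d, _ = cKDTree(p_gt).query(p_est)
    return d.mean()
```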

Ablation studies on input modalities show that both visual and tactile input are crucial. The performance of V-HOP and FoundationPose is evaluated across varying occlusion ratios. V-HOP consistently outperforms FoundationPose in both ADD and ADD-S metrics under different levels of occlusion. Benchmarking against NeuralFeels using the Feelsight dataset shows that V-HOP achieves a 32% lower ADD-S error compared to NeuralFeels and is approximately 10 times faster.

Sim-to-real transfer experiments are performed on a robot platform with dual Franka Research 3 arms and Barrett BH8-282 hands. In pose tracking experiments, V-HOP maintains stable object tracking throughout the trajectory, while FoundationPose often loses track. In bimanual handover experiments, V-HOP achieves a 40% higher average task success rate than FoundationPose. The Can-in-Mug task demonstrates that V-HOP delivers more stable tracking and a higher overall success rate. Studies of how visual and haptic inputs contribute to the final prediction suggest that when the gripper is not in contact with an object, the model relies predominantly on visual input; as the gripper establishes contact and occlusion becomes more severe, the model increasingly shifts its reliance toward haptic input.
