- The paper introduces a dual-stream approach that fuses DINOv2 and PointNet++ to accurately process RGB and depth data for 6DoF pose estimation.
- The model outperforms state-of-the-art methods on the Occluded LineMOD benchmark using the ADD(-S) metric, demonstrating high resilience to occlusions and textureless environments.
- The innovative fusion of visual and geometric features paves the way for advancements in robotics, augmented reality, and object recognition applications.
VLM6D: Robust 6DoF Pose Estimation Using RGB-D Data
Introduction
The paper "VLM6D: VLM based 6Dof Pose Estimation based on RGB-D Images" presents a novel approach that leverages Vision-Language Models (VLMs) to address the challenges of 6DoF object pose estimation from RGB-D images. Conventional methods often struggle in real-world scenarios because they generalize poorly from synthetic datasets, degrading under varying lighting conditions, occlusions, and textureless surfaces. The proposed VLM6D model overcomes these limitations with a dual-stream architecture composed of two specialized encoders: DINOv2 for RGB data and PointNet++ for depth data, enabling robust and precise pose estimation.
Methodology
VLM6D employs a dual-stream design in which each stream processes a different input modality independently. The RGB modality is handled by DINOv2, a Vision Transformer pre-trained on a large collection of unlabeled images, which makes its features robust to texture and lighting variations. The encoder decomposes the RGB input into non-overlapping patches, projects each patch into a high-dimensional embedding, and processes the resulting tokens through transformer layers to distill a robust representation.
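The patchify-and-project step can be sketched in a few lines of NumPy. This is an illustrative approximation, not the paper's code: the patch size (14) and embedding dimension (384) are assumptions matching a small DINOv2 variant, and the projection weights are random here, whereas the real model uses learned weights.

```python
import numpy as np

def patch_embed(image, patch_size=14, embed_dim=384, rng=None):
    """Split an RGB image into non-overlapping patches and linearly
    project each patch into a high-dimensional embedding, as a
    ViT-style encoder does before its transformer layers."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    # Rearrange (H, W, C) -> (num_patches, patch_size * patch_size * C)
    patches = (image
               .reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch_size * patch_size * C))
    # Learned projection in the real model; random weights for illustration
    W_proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
    return patches @ W_proj  # one embedding vector per patch

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (256, 384): a 16x16 grid of patches, each a 384-d token
```

Each token then attends to every other token in the transformer layers, which is what lets the representation remain stable when local texture or lighting changes.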
Conversely, the depth stream leverages PointNet++, which operates directly on 3D point clouds and captures geometric features even under substantial occlusion. Its sampling and feature-aggregation layers capture both local and global geometric structure, making it well suited to sparse and fragmented input data.
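The sampling step inside a PointNet++ set-abstraction layer is typically farthest point sampling. The sketch below shows that step only, in NumPy; it is a minimal illustration of the idea, not the paper's implementation.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedy farthest-point sampling: starting from point 0, repeatedly
    pick the point farthest from all points selected so far. This gives
    even spatial coverage, which helps when the cloud is sparse or
    fragmented by occlusion."""
    n = points.shape[0]
    selected = [0]
    dist = np.full(n, np.inf)  # distance to the nearest selected point
    for _ in range(k - 1):
        dist = np.minimum(dist,
                          np.linalg.norm(points - points[selected[-1]], axis=1))
        selected.append(int(np.argmax(dist)))
    return np.array(selected)

# Pick 2 centroids from a small cloud; the farthest point is chosen second
idx = farthest_point_sampling(
    np.array([[0., 0, 0], [10, 0, 0], [0, 10, 0], [1, 1, 0]]), 2)
```

Around each sampled centroid, PointNet++ then groups neighboring points and aggregates their features with a small shared network, repeating the process at coarser scales to build the local-to-global hierarchy described above.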
The two streams are fused by concatenating their feature vectors and passing the result through a sequence of fully connected layers with ReLU activations and dropout regularization. The design concludes with multi-task prediction heads dedicated to rotation, translation, confidence scoring, and object classification.
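A minimal NumPy sketch of this fusion-and-heads stage follows. All dimensions, the quaternion rotation parameterization, the 8-way class head, and the random weights are illustrative assumptions; the paper does not specify these details, and dropout is omitted as it is disabled at inference.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fuse_and_predict(rgb_feat, geo_feat, rng=None):
    """Concatenate the RGB and geometry stream features, pass them
    through a ReLU MLP, then branch into multi-task heads."""
    rng = rng or np.random.default_rng(0)
    x = np.concatenate([rgb_feat, geo_feat])      # fused feature vector
    W1 = rng.standard_normal((x.size, 256)) * 0.02
    h = relu(x @ W1)                              # shared hidden layer
    return {
        # quaternion parameterization assumed for rotation
        "rotation":     h @ (rng.standard_normal((256, 4)) * 0.02),
        "translation":  h @ (rng.standard_normal((256, 3)) * 0.02),
        "confidence":   h @ (rng.standard_normal((256, 1)) * 0.02),
        # 8 object classes assumed for illustration
        "class_logits": h @ (rng.standard_normal((256, 8)) * 0.02),
    }

out = fuse_and_predict(np.ones(384), np.ones(256))
```

Sharing the fused representation across all four heads lets the rotation and translation predictions benefit from the same appearance and geometry cues, while the confidence head can learn to flag unreliable estimates.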
Experimental Results
On the Occluded LineMOD (LMO) benchmark, a notoriously challenging dataset, VLM6D is evaluated against existing state-of-the-art methods using the ADD(-S) metric. The model consistently outperforms prior solutions, particularly under substantial occlusion and on textureless or reflective surfaces. DINOv2 proves crucial for resilience to appearance variations, while PointNet++ reinforces the model's geometric reasoning, together ensuring high precision and accuracy.
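For readers unfamiliar with the evaluation metric, a standard NumPy implementation of ADD and its symmetric variant ADD-S is sketched below. This is the commonly used definition from the pose-estimation literature, not code from the paper; `model_pts` is the object's 3D model point set, and poses are given as rotation matrices `R` plus translations `t`.

```python
import numpy as np

def add_metric(R_pred, t_pred, R_gt, t_gt, model_pts):
    """ADD: average distance between corresponding model points
    transformed by the predicted and ground-truth poses."""
    p = model_pts @ R_pred.T + t_pred
    g = model_pts @ R_gt.T + t_gt
    return np.mean(np.linalg.norm(p - g, axis=1))

def add_s_metric(R_pred, t_pred, R_gt, t_gt, model_pts):
    """ADD-S: for symmetric objects, match each predicted point to its
    closest ground-truth point before averaging, so that poses
    equivalent under the symmetry are not penalized."""
    p = model_pts @ R_pred.T + t_pred
    g = model_pts @ R_gt.T + t_gt
    d = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=2)
    return np.mean(d.min(axis=1))

pts = np.array([[1., 0, 0], [0, 1, 0], [0, 0, 1]])
I, t0 = np.eye(3), np.zeros(3)
err = add_metric(I, np.array([0., 0, 2]), I, t0, pts)  # 2.0: pure 2-unit offset
```

A pose is typically counted as correct when this error falls below a fraction (commonly 10%) of the object's diameter, and the benchmark reports the percentage of correct poses.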
Implications and Future Directions
VLM6D represents a significant advancement in 6DoF pose estimation, primarily due to the innovative integration of dual modalities—combining the visual comprehensiveness of VLMs with the geometric robustness of point cloud processing. This architecture paves the way for further exploration into exploiting self-supervised learning models in combination with 3D geometric computations to enhance object recognition and interaction accuracy in robotics and augmented reality environments.
Future research could focus on refining the multi-task prediction architecture and extending its capability to more diverse datasets covering a wider range of real-world scenarios. Additionally, improving the model's efficiency and scalability for real-time applications remains a promising direction.
Conclusion
VLM6D effectively demonstrates a synergistic approach to 6DoF pose estimation by deploying complementary streams for visual and geometric data, yielding significant improvements in robustness and accuracy. This dual-stream paradigm may serve as a foundational framework for future work aimed at tackling intricate pose estimation challenges in complex environments.