- The paper introduces SGPose, a novel framework that estimates 6D object poses from sparse monocular views without relying on CAD models.
- It uses Gaussian-based techniques, including geometric-aware depth rendering and synthetic view warping, to generate dense 2D-3D correspondences from limited inputs.
- Experiments on the LM and LM-O (Occlusion LineMOD) datasets demonstrate that SGPose outperforms existing methods, achieving competitive results from as few as ten views.
Object Gaussian for Monocular 6D Pose Estimation from Sparse Views
The paper "Object Gaussian for Monocular 6D Pose Estimation from Sparse Views" introduces SGPose, a novel framework that leverages Gaussian-based methods to estimate object pose from sparse monocular views. The framework removes the dependency on CAD models and extends recent advances in 3D Gaussian Splatting (3DGS), improving few-view reconstruction.
Overview
Monocular object pose estimation is crucial in tasks involving human-object interaction, such as robotic manipulation, augmented reality, and autonomous driving. Traditional approaches typically require CAD models of the target objects, which limits their applicability in real-world settings. Recent research has shifted toward category-level pose estimation to reduce this dependency, but such methods usually require extra depth information and falter under varying object appearances.
SGPose mitigates these challenges by reconstructing the object and estimating its pose from as few as ten views. Starting from a random cuboid initialization, SGPose forgoes both Structure-from-Motion (SfM) derived geometry and CAD models. Instead, it regresses dense 2D-3D correspondences directly from the sparse inputs, using geometric-consistent depth supervision and online synthetic view warping to improve performance. Experiments indicate that SGPose outperforms existing methods, particularly on challenging benchmarks such as LM-O.
Methodology
SGPose introduces several key innovations:
- Geometric-aware Depth Rendering: The framework formulates Gaussian primitives as elliptic disks to compute depth rendering efficiently. The alpha-blended depth map aggregates the depth values across Gaussian primitives, guiding synthetic view rendering and pruning efforts.
- Sparse View Object Reconstruction: Given only ten views, SGPose produces a high-quality geometric-aware depth representation, using geometric-consistency constraints to supervise the object-centric 3D reconstruction.
- Synthetic View Warping and Online Pruning: To combat overfitting under sparse views, SGPose warps synthetic views online. The geometric-consistent constraints ensure that the rendered depth accurately represents the object, which in turn enables reliable online pruning that suppresses artifacts such as floaters and background collapse.
- Dense 2D-3D Correspondences: By transforming the rendering depth into 3D points in camera coordinates and subsequently mapping them to world coordinates, SGPose produces dense 2D-3D correspondence maps essential for the monocular pose estimation task.
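The depth-to-correspondence pipeline above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function names, array shapes, and the simple front-to-back compositing are assumptions made here for clarity.

```python
import numpy as np

def alpha_blend_depth(depths, alphas):
    """Alpha-composite per-primitive depths into one depth value per ray.

    depths, alphas: (N,) arrays for the N Gaussians hit by a pixel's ray,
    sorted front to back. Uses standard volumetric compositing weights:
    w_i = alpha_i * prod_{j<i}(1 - alpha_j).
    """
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = alphas * transmittance
    return np.sum(weights * depths)

def depth_to_world_points(depth_map, K, cam_to_world):
    """Unproject a rendered depth map into 3D world points, yielding a
    dense 2D-3D correspondence for every pixel.

    depth_map: (H, W) rendered depth; K: (3, 3) camera intrinsics;
    cam_to_world: (4, 4) camera-to-world pose.
    """
    h, w = depth_map.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                 # ray directions in camera frame
    pts_cam = rays * depth_map.reshape(-1, 1)       # scale by rendered depth
    pts_hom = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    pts_world = (pts_hom @ cam_to_world.T)[:, :3]   # map camera -> world coordinates
    return pts_world.reshape(h, w, 3)
```

The resulting (H, W, 3) map pairs each 2D pixel with a 3D world point, which is exactly the form of supervision a correspondence-based pose estimator consumes.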
The loss function integrates image rendering, view warping, and geometric-consistent depth terms to maintain high fidelity in object reconstruction from sparse inputs. Notably, the method does not require CAD models, making it highly adaptable.
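As a rough sketch of how such a combined objective might look, the snippet below sums the three terms named above; the L1 form of each term and the weight values are placeholder assumptions, not taken from the paper.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between two arrays."""
    return np.mean(np.abs(pred - target))

def total_loss(rendered, gt_image, warped, synth_target, depth, depth_ref,
               w_warp=0.5, w_depth=0.05):
    """Combine the three supervision signals described in the text.

    NOTE: the L1 penalties and the weights w_warp / w_depth are
    illustrative placeholders, not the paper's actual formulation.
    """
    l_render = l1_loss(rendered, gt_image)   # image rendering term
    l_warp = l1_loss(warped, synth_target)   # synthetic view warping term
    l_depth = l1_loss(depth, depth_ref)      # geometric-consistent depth term
    return l_render + w_warp * l_warp + w_depth * l_depth
```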
Results
Experiments demonstrate that SGPose achieves significant performance gains over both CAD-based and CAD-free methods, even when trained on far fewer images.
- LM Dataset: On the LM dataset, SGPose achieves performance on par with state-of-the-art methods using the Proj@5pix metric. Particularly noteworthy is its competitive performance using only ten views, indicating its efficiency and flexibility.
- LM-O (Occlusion LineMOD) Dataset: On the more challenging LM-O dataset, SGPose renders occlusion-rich synthetic views for training, which significantly improves pose estimation accuracy under heavy occlusion, often outperforming even the best CAD-based methods.
The use of occluded images in training substantially enhances the model's ability to handle real-world scenarios involving heavy occlusions. This adaptation is a testament to its robust design and effective handling of sparse input conditions.
Implications
SGPose has both practical and theoretical implications. Practically, it offers a framework that can more readily be deployed in real-world applications where obtaining detailed CAD models is impractical. Theoretically, it pushes the boundaries of Gaussian-based methods and monocular pose estimation by demonstrating how sparse inputs can yield robust 3D reconstructions.
Speculative Developments: Future work could involve refining the training process to reduce overall time, enabling real-time online reconstruction and pose estimation. Additionally, expanding the framework to handle a broader range of object categories and integrating real-time depth sensors could further enhance its applicability and performance.
Conclusion
SGPose represents a significant advancement in monocular object pose estimation, effectively replacing the need for CAD models with a Gaussian-based approach capable of handling sparse views. By leveraging geometric-aware depth rendering, synthetic view warping, and online pruning, the framework achieves high accuracy in both standard and occlusion-heavy datasets, setting a new benchmark for CAD-free pose estimation methods.