GS2POSE: 6D Pose Estimation with 3D Gaussian Splatting
- The paper introduces GS2POSE, a 6D object pose estimation framework that integrates 3D Gaussian Splatting with a differentiable, iterative pose regression algorithm.
- It employs coarse pose estimation via a NOCS-based Pose-Unet and refines the pose using Lie algebra-driven updates that minimize reprojection error.
- GS2POSE adapts to textureless objects and variable illumination, achieving improved accuracy on benchmarks like T-LESS, LineMod-Occlusion, and LineMod.
GS2POSE is a two-stage 6D object pose estimation framework that integrates 3D Gaussian Splatting (3DGS) with an iterative, differentiable rendering-based pose regression algorithm, designed to improve accuracy and robustness for textureless objects and under significant illumination variation. The method eliminates the reliance on high-quality CAD models, requiring only segmented RGBD images, and adapts to environmental factors such as lighting. The core innovation lies in combining 3DGS for generative modeling and differentiable rendering with a Bundle Adjustment–inspired regression scheme, using Lie algebra to enable pose-differentiable updates and color adaptivity. GS2POSE achieves accuracy improvements over prior methods on standard benchmarks such as T-LESS, LineMod-Occlusion, and LineMod, highlighting its potential for industrial and other challenging vision applications (Li et al., 19 Oct 2025).
1. Motivation and Problem Context
Traditional 6D object pose estimation pipelines usually establish correspondences between 2D image features and 3D model features, often employing CAD models as the 3D representation. This strategy fails for textureless objects and renders the system brittle to variable or extreme illumination, since the lack of distinctive local features inhibits robust matching, and appearance-based cues become unreliable. The objective of GS2POSE is to overcome these limitations by leveraging 3DGS, which offers explicit, continuous scene modeling with per-Gaussian adaptivity, and to develop a pose regression scheme that does not require richly textured objects or photometric constancy.
2. System Architecture and Methodological Overview
The GS2POSE architecture consists of two key stages:
- Coarse Pose Estimation: The first stage uses a specialized convolutional network, Pose-Unet, to map the input RGB image and its object mask to a Normalized Object Coordinate Space (NOCS) map. This NOCS image is a per-pixel prediction of object surface coordinates in a canonical frame, trained to mimic reference views rendered from the 3DGS model. The backbone is a ResNet50 encoder enhanced by a Multi-Scale Feature Enhancement Module (MFEM) and polarized self-attention to robustly extract structural and contour information. A PnP algorithm, combined with RANSAC, solves for a coarse object-to-camera transformation using sufficiently bright (i.e., confidently predicted) NOCS pixels as 2D–3D correspondences; a minimal sketch of this step follows the list.
- Pose Refinement Stage: The second stage, called GS-refiner, implements an iterative, differentiable pipeline to minimize the reprojection error between the rendered image from the 3DGS model (under the current pose hypothesis) and the observed RGBD input. The 3DGS model describes the object as a set of oriented Gaussians, each parameterized by a mean position $\mu_i$, an anisotropic covariance $\Sigma_i$, an opacity $\alpha_i$, and spherical harmonic coefficients (for appearance/color). The density function for each Gaussian is $G_i(\mathbf{x}) = \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mu_i)^{\top}\Sigma_i^{-1}(\mathbf{x}-\mu_i)\right)$. Rendering is fully differentiable, allowing gradients to be computed with respect to both the pose and the Gaussian parameters.
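The coarse stage can be illustrated with a short sketch. Below is a minimal example of recovering a pose from a predicted NOCS map using OpenCV's PnP + RANSAC; it is not the authors' implementation, and the helper name `coarse_pose_from_nocs`, the validity threshold, and the `object_extent` scaling (centering the NOCS cube at the origin) are illustrative assumptions.

```python
import cv2
import numpy as np

def coarse_pose_from_nocs(nocs_map, mask, K, object_extent, min_value=0.05):
    """Recover a coarse object-to-camera pose from a predicted NOCS map.

    nocs_map:      HxWx3 array in [0, 1] with per-pixel canonical coordinates.
    mask:          HxW boolean object mask.
    K:             3x3 camera intrinsic matrix.
    object_extent: scale (assumed) mapping the unit NOCS cube to metric size.
    """
    # Keep only confidently predicted (non-dark) NOCS pixels inside the mask.
    valid = mask & (nocs_map.max(axis=2) > min_value)
    ys, xs = np.nonzero(valid)

    # 2D image points and their predicted 3D coordinates in the object frame.
    pts_2d = np.stack([xs, ys], axis=1).astype(np.float64)
    pts_3d = ((nocs_map[ys, xs] - 0.5) * object_extent).astype(np.float64)

    # Robust PnP: RANSAC rejects outlier correspondences from noisy NOCS pixels.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, K.astype(np.float64), distCoeffs=None,
        iterationsCount=200, reprojectionError=3.0, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP failed on NOCS correspondences")

    R, _ = cv2.Rodrigues(rvec)           # axis-angle -> rotation matrix
    return R, tvec.reshape(3)            # coarse object-to-camera pose
```

Given the camera intrinsics and the Pose-Unet outputs, the returned rotation and translation serve as the initialization handed to the GS-refiner stage.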
3. Iterative Pose Regression and Lie Algebraic Updates
The pose refinement procedure models pose increments in the Lie algebra $\mathfrak{se}(3)$, decomposing the update into camera and object refiners:
- Camera Refiner: Translates and rotates the camera in the world/object frame, using left-multiplicative perturbations. The new pose is expressed as $T' = \exp(\delta\boldsymbol{\xi}^{\wedge})\,T$, such that object points map to the camera frame as $\mathbf{p}_c = T'\,\mathbf{p}_o$.
- Object Refiner: Rotates the object in the camera frame by a right-multiplicative perturbation $\exp(\delta\boldsymbol{\phi}^{\wedge})$, updating object point positions as $\mathbf{p}_c = T\,\exp(\delta\boldsymbol{\phi}^{\wedge})\,\mathbf{p}_o$.
This dual approach enables tight control over translation and rotation error. The optimization objective is a composite loss $\mathcal{L} = \lambda_{1}\,\mathcal{L}_{1} + \lambda_{\mathrm{SSIM}}\,\mathcal{L}_{\mathrm{SSIM}} + \lambda_{d}\,\mathcal{L}_{\mathrm{depth}}$, where $\mathcal{L}_{1}$ is the L1 loss over pixels, $\mathcal{L}_{\mathrm{SSIM}}$ is the structural similarity (SSIM) term, and $\mathcal{L}_{\mathrm{depth}}$ is the depth consistency term computed via point cloud alignment (GS-ICP) between the rendered and observed scenes.
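As a rough illustration of the refinement loop, the sketch below optimizes a single $\mathfrak{se}(3)$ increment with the left-multiplicative (camera-refiner) update and the composite loss; the right-multiplicative object update is analogous. The callables `render_fn`, `ssim_fn`, and `depth_fn` stand in for the differentiable 3DGS rasterizer, an SSIM implementation, and the GS-ICP depth term, and the weights and step counts are illustrative assumptions rather than the paper's settings.

```python
import torch

def hat(v):
    """Skew-symmetric matrix of a 3-vector (built with stack to stay differentiable)."""
    zero = v.new_zeros(())
    return torch.stack([
        torch.stack([zero, -v[2],  v[1]]),
        torch.stack([ v[2], zero, -v[0]]),
        torch.stack([-v[1],  v[0], zero]),
    ])

def so3_exp(phi):
    """Rodrigues formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = phi.norm().clamp(min=1e-8)
    A = hat(phi / theta)
    I = torch.eye(3, dtype=phi.dtype, device=phi.device)
    return I + torch.sin(theta) * A + (1.0 - torch.cos(theta)) * (A @ A)

def refine_pose(T_init, rgb_obs, depth_obs, render_fn, ssim_fn, depth_fn,
                steps=100, lr=1e-2, w_l1=0.8, w_ssim=0.2, w_depth=1.0):
    """Iteratively refine a coarse pose by differentiable rendering.

    T_init:    4x4 coarse object-to-camera transform from the PnP stage.
    render_fn: pose -> (rgb, depth), the differentiable 3DGS renderer.
    ssim_fn:   (rgb_a, rgb_b) -> scalar SSIM in [0, 1].
    depth_fn:  (depth_rendered, depth_observed) -> scalar depth-consistency loss.
    """
    T = torch.as_tensor(T_init, dtype=torch.float32)
    xi = torch.zeros(6, requires_grad=True)        # [translation (3), rotation (3)]
    opt = torch.optim.Adam([xi], lr=lr)

    def apply_increment(xi, T):
        delta = torch.eye(4, dtype=T.dtype)
        delta[:3, :3] = so3_exp(xi[3:])
        delta[:3, 3] = xi[:3]
        return delta @ T                           # left-multiplicative update

    for _ in range(steps):
        T_new = apply_increment(xi, T)
        rgb_ren, depth_ren = render_fn(T_new)      # differentiable rendering
        loss = (w_l1 * (rgb_ren - rgb_obs).abs().mean()
                + w_ssim * (1.0 - ssim_fn(rgb_ren, rgb_obs))
                + w_depth * depth_fn(depth_ren, depth_obs))
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        return apply_increment(xi, T)              # refined object-to-camera pose
```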
4. Adaptation to Illumination via Color Parameter Updates
To address changing illumination, GS2POSE selectively updates the color (appearance) parameters of the 3DGS model during refinement. Colors are modeled as spherical harmonics, with coefficients $c_{lm}$ in the expansion $c(\mathbf{d}) = \sum_{l=0}^{L}\sum_{m=-l}^{l} c_{lm}\,Y_{lm}(\mathbf{d})$, where $Y_{lm}$ are the spherical harmonic basis functions and $\mathbf{d}$ is the viewing direction. During optimization, only the color coefficients, i.e., the higher-order spherical harmonic parameters and their rotations, are adjusted, while the spatial parameters of the Gaussians remain locked. This separation allows the appearance of the rendered object to adapt to the current lighting and mild occlusion conditions, improving robustness in real-world scenarios with variable ambient lighting.
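A brief sketch of this selective appearance update, assuming a PyTorch-style parameter layout: only the spherical harmonic color coefficients are exposed to the optimizer while the geometric parameters stay frozen. The `GaussianModel` container and its field names are illustrative, not the paper's data structures.

```python
import torch

class GaussianModel:
    """Minimal stand-in for a 3DGS parameter set (field names are illustrative)."""
    def __init__(self, n, sh_degree=3):
        n_coeffs = (sh_degree + 1) ** 2               # e.g. 16 SH coefficients per channel
        self.means = torch.randn(n, 3)                # spatial parameters (to stay frozen)
        self.log_scales = torch.randn(n, 3)
        self.rotations = torch.randn(n, 4)
        self.opacities = torch.randn(n, 1)
        self.sh_coeffs = torch.randn(n, n_coeffs, 3)  # appearance parameters (to adapt)

def color_only_optimizer(model, lr=5e-3):
    """Lock spatial Gaussian parameters and optimize only the SH color coefficients."""
    for t in (model.means, model.log_scales, model.rotations, model.opacities):
        t.requires_grad_(False)                       # geometry stays locked
    model.sh_coeffs.requires_grad_(True)              # colors adapt to current lighting
    return torch.optim.Adam([model.sh_coeffs], lr=lr)
```

With this split, the photometric terms of the refinement loss backpropagate only into `sh_coeffs`, so the rendered appearance tracks the scene illumination without perturbing the geometry that the pose estimate relies on.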
5. Depth Correction and Use of Contour Information
Depth blur, which is a common outcome of imperfect sensor data or inaccurate NOCS extraction, is addressed via GS-ICP point cloud alignment. A target point set is obtained from the observed RGBD frame, and a ray-projected point set is generated from the 3DGS model under the current pose. An ICP algorithm iteratively refines the transformation until the depth alignment reaches minimal error, providing a corrected z-position that is especially critical for small or thin objects where pixelwise depth cues are unreliable. Additionally, the training of the Pose-Unet explicitly focuses on contour and color rather than fine-grained texture features, which are inherently absent in the textureless regime.
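The depth correction can be sketched with a generic point-to-point ICP; here Open3D's registration routine stands in for the GS-ICP step described in the paper, and the correspondence distance is an illustrative assumption.

```python
import numpy as np
import open3d as o3d

def depth_correction_icp(rendered_pts, observed_pts, max_corr_dist=0.01):
    """Align the 3DGS-rendered point cloud to the observed RGBD point cloud.

    rendered_pts: (N, 3) points ray-projected from the 3DGS model under the current pose.
    observed_pts: (M, 3) points back-projected from the observed RGBD frame.
    Returns a 4x4 correction transform; its z-translation carries the depth
    correction that matters most for small or thin objects.
    """
    src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(rendered_pts))
    tgt = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(observed_pts))
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_corr_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation
```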
6. Experimental Evaluation
Empirical analysis on three standard benchmarks confirms the effectiveness of GS2POSE:
| Dataset | GS2POSE Accuracy Improvement | Benchmark Metric |
|---|---|---|
| T-LESS | +1.4% | ADD(S) Success Rate |
| LineMod-Occlusion | +2.8% | ADD(S) |
| LineMod | +2.5% | ADD(S) |
GS2POSE is particularly effective on small or contour-dominated objects, outperforming SSD-6D, Pix2Pose, DPOD, CDPN, and NeRF-Pose (Li et al., 19 Oct 2025). The gains are attributable to its hybrid geometric-appearance modeling and iterative refinement combining depth, contour, and color cues.
7. Implications, Limitations, and Future Directions
The central implication of GS2POSE is its ability to move beyond CAD-model-based, texture-dependent paradigms for pose estimation, making real-time, robust estimation feasible even in challenging industrial or robotics settings with adverse lighting or minimally structured surfaces. The explicit, adaptive modeling of both geometry and color in 3DGS lends itself to further extensions, such as faster convergence, generalization to unseen objects, and support for category-level (not just instance-level) pose estimation.
The current method is limited by its reliance on iterative optimization, which incurs computational overhead and restricts throughput, as well as by its instance-level focus. Future work is expected to address one-shot pose estimation, stronger generalization to novel categories, and increased efficiency. Additionally, harmonizing GS2POSE with large-scale pretrained models or foundation representations could further enhance its adaptability and domain transfer capabilities.
In summary, GS2POSE advances 6D object pose estimation for textureless or variably illuminated objects by combining a coarse NOCS-based estimator with an iterative, Lie algebra-driven pose regression pipeline using differentiable 3D Gaussian Splatting. It demonstrates measurable improvements over previous methods on public datasets, robust handling of depth and lighting artifacts, and flexibility suitable for realistic deployment scenarios (Li et al., 19 Oct 2025).