- The paper introduces a novel topological prior that encodes SE(3)-invariant global context for robust category-level 6D pose estimation.
- It leverages a Hybrid Graph Fusion module combining transformer-based topological aggregation with local geometric descriptors to boost accuracy.
- Empirical results on REAL275 and CAMERA25 datasets demonstrate significant performance gains over state-of-the-art methods under challenging occlusion and intra-class variation.
Summary of THE-Pose: Topological Prior with Hybrid Graph Fusion for Estimating Category-Level 6D Object Pose
Introduction and Motivation
The paper addresses the persistent problem of category-level 6D object pose estimation, where the objective is to determine the 3D location, rotation, and size of objects from a single RGB-D input. Prior instance-level approaches rely on object-specific CAD models, making them unsuitable for novel objects and significant intra-class variation. Category-level methods, by contrast, aim to generalize across instances but struggle to capture both global semantic context and local geometric structure, especially in the presence of complex shape deformations and occlusion.
3D graph convolution (3D-GC), the traditional backbone choice, captures only localized geometric features and struggles to model global context, leading to degraded performance for category-level regression. Existing shape and semantic priors, derived either from mean shape models or pretrained foundation models, introduce confounding visual information and are ill-suited for category-level consistency, exhibiting performance drops under high intra-class variation and texture change.
Methodology
Topological Prior Design
THE-Pose introduces the concept of a ‘topological prior,’ learned via category-level surface embedding, that encodes only the minimal geometric and topological information essential for robust global context. Crucially, this prior is engineered to be SE(3)-invariant, ensuring that global geometric consistency is preserved regardless of the object's orientation or translation.
Unlike mean-shape priors, this representation exploits the mapping established by a retrained SurfEmb backbone, generalizing from instance-level to category-level through training on extensive synthetic data. The resulting embedding is both topologically and geometrically consistent across the category, and robust to color, texture, and irrelevant semantic confounders.
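To make the SE(3)-invariance property concrete, the toy sketch below (not the paper's actual embedding; `pairwise_distance_feature` is a hypothetical stand-in) computes a descriptor from pairwise point distances, which are unchanged under any rigid rotation and translation:

```python
import numpy as np

def pairwise_distance_feature(points: np.ndarray) -> np.ndarray:
    """Toy SE(3)-invariant descriptor: sorted pairwise distances.

    Distances between points are unchanged by any rotation R and
    translation t, so the descriptor is identical for x and x @ R.T + t.
    """
    diffs = points[:, None, :] - points[None, :, :]   # (N, N, 3)
    dists = np.linalg.norm(diffs, axis=-1)            # (N, N)
    return np.sort(dists[np.triu_indices(len(points), k=1)])

rng = np.random.default_rng(0)
pts = rng.normal(size=(32, 3))

# Random rigid transform: rotation from a QR factorization, plus translation.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R = q * np.sign(np.linalg.det(q))                     # force det(R) = +1
t = rng.normal(size=3)
pts_moved = pts @ R.T + t

assert np.allclose(pairwise_distance_feature(pts),
                   pairwise_distance_feature(pts_moved), atol=1e-8)
```

The actual learned surface embedding is far richer, but the invariance check itself (feature before transform equals feature after) is the property the paper's prior is designed to satisfy.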
Hybrid Graph Fusion (HGF) Module
The extracted topological feature maps are processed in a dedicated Topological Global Context (TGC) aggregator, which uses a transformer (STViT) to capture both fine-grained and global contextual features efficiently. These are then hybridized with local geometric descriptors in the Hybrid Graph Fusion module.
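The core mechanism a transformer contributes here is that every token attends to every other token, so each output feature mixes in global context. The minimal single-head self-attention sketch below illustrates this idea generically; it is not STViT, and all names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d_k=16, seed=3):
    """Single-head self-attention with random (untrained) projections.

    Each output row is a weighted mixture over ALL input tokens, which is
    why transformer aggregation captures global context that a local
    graph convolution cannot.
    """
    rng = np.random.default_rng(seed)
    d = tokens.shape[1]
    Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d_k)) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))            # (N, N) mixing weights
    return attn @ V                                   # (N, d_k) global features

tokens = np.random.default_rng(4).normal(size=(32, 16))
out = self_attention(tokens)
assert out.shape == (32, 16)
```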
Hybrid receptive fields in HGF blend both point-level (microscopic) and feature-level (macroscopic) distances, controlled via an adaptive hyperparameter α that weights the contribution of pure geometry versus topological context depending on the scenario. Each fusion layer follows a two-path design: one path applies positional encoding for direct size/translation awareness, while the other applies weighted hybrid graph convolution; together they maintain robustness to local noise while ensuring global consistency. Hierarchical stacking and pooling further enrich the representation.
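The hybrid receptive field idea can be sketched as a k-nearest-neighbor graph built under an α-weighted combination of 3D and feature-space distances. This is a minimal illustration under assumed conventions (`hybrid_knn` and its normalization scheme are hypothetical, not the paper's exact formulation):

```python
import numpy as np

def hybrid_knn(xyz, feats, k=8, alpha=0.7):
    """Select k neighbors per point under an alpha-weighted hybrid distance.

    alpha -> 1 emphasizes pure 3D geometry (point-level distance);
    alpha -> 0 emphasizes feature space (topological/contextual distance).
    Both distance fields are normalized so the blend is scale-free.
    """
    def sq_dists(x):
        d = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
        return d / (d.max() + 1e-12)                  # normalize to [0, 1]

    d_hybrid = alpha * sq_dists(xyz) + (1.0 - alpha) * sq_dists(feats)
    np.fill_diagonal(d_hybrid, np.inf)                # exclude self-loops
    return np.argsort(d_hybrid, axis=1)[:, :k]        # (N, k) neighbor indices

rng = np.random.default_rng(1)
xyz = rng.normal(size=(64, 3))
feats = rng.normal(size=(64, 16))
idx = hybrid_knn(xyz, feats, k=8, alpha=0.7)
assert idx.shape == (64, 8)
```

Graph convolution would then aggregate over these neighbor indices; at α = 0.7–0.8 (the range the ablations favor), geometry dominates but feature-space affinity still reshapes each point's receptive field.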
Pose and Size Regression
The final fused descriptor concatenates TGC and HGF outputs and feeds directly into parallel MLP heads that decouple the regression of rotation, translation, and scale. The regression framework follows the guidelines of recent decoupled rotation strategies. Training integrates both pose regression losses and symmetry-aware point cloud reconstruction, but inference is “encoder only” for run-time efficiency.
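The decoupled-head design can be sketched as three independent heads reading the same fused descriptor, with rotation recovered from a continuous 6D representation (a common choice in decoupled rotation strategies; the head shapes and `FUSED_DIM` here are assumptions, not the paper's exact architecture):

```python
import numpy as np

rng = np.random.default_rng(2)

def linear_head(in_dim, out_dim):
    """One hypothetical regression head (a single linear layer for brevity)."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return lambda x: x @ W + b

FUSED_DIM = 256                          # assumed descriptor width
rot_head = linear_head(FUSED_DIM, 6)     # 6D rotation rep (two raw columns)
trans_head = linear_head(FUSED_DIM, 3)   # translation
scale_head = linear_head(FUSED_DIM, 3)   # per-axis size

def gram_schmidt_rotation(r6):
    """Map the unconstrained 6D output to a valid rotation matrix."""
    a, b = r6[:3], r6[3:]
    x = a / np.linalg.norm(a)
    y = b - (b @ x) * x
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)                   # right-handed third column
    return np.stack([x, y, z], axis=1)

fused = rng.normal(size=FUSED_DIM)       # stand-in for the TGC+HGF descriptor
R = gram_schmidt_rotation(rot_head(fused))
t, s = trans_head(fused), scale_head(fused)
assert np.allclose(R.T @ R, np.eye(3), atol=1e-8)
```

Because the heads share no weights, errors in one quantity (e.g., scale on an elongated object) do not directly corrupt the others, which is the main motivation for decoupling.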
Experimental Evaluation
Datasets and Evaluation Metrics
THE-Pose is evaluated on both REAL275 (real-world, six categories, significant intra-class variation) and CAMERA25 (synthetic, large-scale, high variability) datasets, following the prevailing protocol. Key metrics are mAP at multiple IoU thresholds (50%, 75%), and thresholds on rotation (5°, 10°) and translation (2 cm, 5 cm).
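The rotation/translation threshold metrics reduce to a simple per-pose test: geodesic rotation error below the angle threshold and Euclidean translation error below the distance threshold. A minimal sketch of that criterion (symmetric objects additionally require a minimum over the symmetry group, which is omitted here):

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_within(R_pred, t_pred, R_gt, t_gt, deg=5.0, cm=2.0):
    """True if a predicted pose passes e.g. the 5 degree, 2 cm criterion."""
    return bool(rotation_error_deg(R_pred, R_gt) <= deg
                and np.linalg.norm(t_pred - t_gt) * 100.0 <= cm)  # m -> cm

# Example: ~2.9 degree rotation about z plus a 1 cm translation offset
# passes the 5 degree, 2 cm test.
theta = 0.05
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
assert pose_within(Rz, np.array([0.01, 0.0, 0.0]), np.eye(3), np.zeros(3))
```

mAP at each threshold is then the fraction of (correctly detected) instances passing this test, aggregated per category.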
SOTA Comparison
THE-Pose demonstrates consistent improvements across all key metrics and categories in head-to-head comparisons with state-of-the-art baselines, including prior- and keypoint-based, semantic, and competitive 3D-GC pipelines like HS-Pose and SecondPose. On REAL275, THE-Pose reports a 35.8% performance lift over the 3D-GC baseline HS-Pose and a 7.2% improvement over the previous best across all principal metrics. On CAMERA25, it achieves the highest scores across the board, with enhancements of up to 4.3% on strict pose metrics over AG-Pose.
Ablation Studies and Analysis
Ablation experiments isolate the contributions of the TGC, HGF, and positional encoding modules. Integrating both TGC and HGF into the 3D-GC baseline delivers synergistic gains, e.g., an increase from 46.5% to 56.4% on the strictest 5°, 2 cm metric. Positional encoding further improves fine-grained accuracy, and careful tuning of the hybrid distance weighting factor (α = 0.7–0.8) yields optimal performance, balancing local and global sensitivity.
Evaluation under controlled occlusion demonstrates that performance degradation remains modest even at up to 40% occlusion, substantiating the robustness claimed for the hybrid fusion architecture.
Per-category results further corroborate the method’s global applicability, with the largest category-level improvements on objects subject to high intra-class variation and symmetry ambiguities.
Theoretical and Practical Implications
THE-Pose demonstrates that the direct, adaptive fusion of SE(3)-invariant topological context with point-based local geometric reasoning significantly advances category-level 6D pose estimation. The proposed topological prior outperforms both mean-shape priors and general semantic priors from vision foundation models, particularly in scenarios involving partial occlusion, texture changes, and symmetry ambiguities.
The hybrid graph fusion framework provides a robust, extensible template for future combinatorial architectures, in which both image-derived contextual and point cloud-based geometric representations can be exploited adaptively for more effective scene and object-level reasoning.
Future Work
Prospective directions include tighter integration with tracking and instance segmentation paradigms, exploring online adaptation of the weighting parameter for changing environments, and scaling to more granular or cross-category open-set regimes. Given the performance under occlusion, extensions into multi-view or temporal fusion for heightened robustness are promising.
Conclusion
THE-Pose sets a new benchmark for category-level 6D pose estimation by harmonizing global SE(3)-invariant topological priors with local point cloud geometry via a hierarchical hybrid graph fusion module. The approach achieves strong empirical improvements over prior category-level SOTA, generalizes robustly across object classes, and remains resilient to severe occlusion and intra-class variation. This architecture offers an effective strategy for bridging the gap between pure geometry and contextual semantics in pose estimation.
For full details and reproducibility, code and models are made available by the authors (2512.10251).