Beyond 'Templates': Category-Agnostic Object Pose, Size, and Shape Estimation from a Single View

Published 13 Oct 2025 in cs.CV | (2510.11687v1)

Abstract: Estimating an object's 6D pose, size, and shape from visual input is a fundamental problem in computer vision, with critical applications in robotic grasping and manipulation. Existing methods either rely on object-specific priors such as CAD models or templates, or suffer from limited generalization across categories due to pose-shape entanglement and multi-stage pipelines. In this work, we propose a unified, category-agnostic framework that simultaneously predicts 6D pose, size, and dense shape from a single RGB-D image, without requiring templates, CAD models, or category labels at test time. Our model fuses dense 2D features from vision foundation models with partial 3D point clouds using a Transformer encoder enhanced by a Mixture-of-Experts, and employs parallel decoders for pose-size estimation and shape reconstruction, achieving real-time inference at 28 FPS. Trained solely on synthetic data from 149 categories in the SOPE dataset, our framework is evaluated on four diverse benchmarks SOPE, ROPE, ObjaversePose, and HANDAL, spanning over 300 categories. It achieves state-of-the-art accuracy on seen categories while demonstrating remarkably strong zero-shot generalization to unseen real-world objects, establishing a new standard for open-set 6D understanding in robotics and embodied AI.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a unified, regression-based framework that estimates an object's 6D pose, size, and dense shape from a single RGB-D image without relying on templates or CAD models.
It leverages dense semantic features from a vision foundation model fused with segmented 3D point clouds, using a Transformer with Mixture-of-Experts to achieve real-time inference at 28 FPS.
Experiments on synthetic and real-world datasets demonstrate state-of-the-art performance, robust generalization to unseen categories, and superior Chamfer-L1 shape reconstruction.

Category-Agnostic 6D Object Pose, Size, and Shape Estimation from a Single View

Introduction and Motivation

This work addresses the longstanding challenge of estimating an object's 6D pose, size, and shape from a single RGB-D image in a category-agnostic manner, without reliance on templates, CAD models, or category labels at inference. The proposed framework is motivated by the need for scalable, real-time, and generalizable solutions in robotics and embodied AI, where open-set environments and novel object categories are the norm. Prior approaches either depend on object-specific priors or suffer from limited generalization due to pose–shape entanglement and multi-stage pipelines. This paper introduces a unified, regression-based architecture that fuses dense 2D semantic features from vision foundation models with partial 3D point clouds, processed via a Transformer encoder augmented by a Mixture-of-Experts (MoE) module. The model jointly predicts 6D pose, size, and dense shape in a single forward pass, achieving real-time inference at 28 FPS.

Figure 1: Results on diverse domain datasets using the end-to-end regression-based framework, demonstrating strong generalization to unseen object categories across multiple real-world domains.

Methodology

Feature Extraction and Fusion

The framework leverages RADIOv2.5, a vision foundation model, to extract dense, SE(3)-consistent semantic features from RGB input. These features are fused point-wise with 3D coordinates from the segmented point cloud, following DenseFusion principles. The fused representation is processed by a DGCNN encoder, yielding keypoint-aware object tokens.

Transformer Encoding with Mixture-of-Experts

Object tokens are passed through a stack of Transformer blocks, where standard feed-forward layers are replaced by MoE modules. Each MoE layer conditionally activates a subset of expert networks, enabling scalable specialization across diverse shape distributions. This design enhances modeling capacity and inference efficiency, supporting robust generalization to novel object categories.

Multi-Task Decoding Heads

The global object representation is decoded via parallel branches:

Pose and Size Estimation: Direct regression of rotation (using a continuous 6D representation for stability), translation, and scale.
Shape Reconstruction: Coarse shape prediction via MLP, followed by confidence-guided fusion with the input point cloud and local folding for dense reconstruction.
Figure 2: Framework overview, illustrating the fusion of cropped RGB image features and segmented point cloud, followed by DGCNN, Transformer+MoE, and parallel decoders for pose–size and shape.

Loss Functions

The training objective integrates:

Chamfer Distance (L1) for coarse and dense shape reconstruction.
Smooth L1 loss for pose and size regression, with symmetry-aware rotation loss to handle ambiguous ground truths.
All loss coefficients are set to 1, yielding stable training without the need for manual balancing.

Experimental Results

Benchmarks and Metrics

The model is trained exclusively on synthetic SOPE data (149 categories) and evaluated on SOPE, ROPE, ObjaversePose, and HANDAL, spanning 300+ categories and synthetic-to-real transfer. Metrics include AUC of 3D bounding box IoU, VUS (joint rotation/translation thresholds), mean rotation/translation errors, and Chamfer-L1 for shape fidelity.

Quantitative Performance

SOPE (Synthetic, Seen Categories): Achieves 56.4 AUC@IoU25, outperforming GenPose++ (50.1), with lower mean rotation/translation errors and higher VUS.
ROPE (Real, Seen Categories): 44.9 AUC@IoU25, surpassing all baselines, including reference-based methods.
ObjaversePose (Synthetic, Unseen Categories): 42.2 AUC@IoU25 at 0% occlusion, nearly double GenPose++ (21.3), with robust performance under severe occlusion.
HANDAL (Real, Unseen Categories): 33.0 AUC@IoU25, outperforming Any6D and GenPose++ in reference-free settings.
Figure 3: Qualitative results on ROPE, comparing input, ground-truth, GenPose++ predictions, and the proposed model’s outputs for pose and shape.

Shape Reconstruction

On SOPE, the model achieves a Chamfer-L1 error of 5.93×10⁻³, significantly lower than AdaPointr (24.41×10⁻³) and Pointr (29.87×10⁻³), demonstrating superior shape completion.

Ablation Studies

RGB–Depth Fusion: Removing RADIO features degrades performance, confirming the necessity of dense semantic cues.
MoE Module: Standard Transformer feed-forward layers yield lower accuracy and slower inference, validating expert specialization.
Shape Supervision: Eliminating the shape branch reduces pose accuracy and slows convergence.
Coarse-to-Fine and Selection: Direct dense prediction or omitting point selection increases Chamfer-L1 and reduces AUC.
Figure 4: Qualitative examples from ROPE, highlighting robust shape and pose predictions under challenging conditions.

Implementation Details

Backbone: Frozen RADIOv2.5-l for RGB features; DGCNN for point cloud encoding.
Transformer: Geometry-aware attention in the initial layer; MoE with 2-in-8 expert activation for efficiency.
Decoders: MLPs for pose, size, and shape; local folding for dense point cloud generation.
Training: 50 epochs, batch size 128, AdamW optimizer, LambdaLR scheduler, AMP, and DDP on 4× RTX 4080 GPUs.

Practical and Theoretical Implications

The proposed framework demonstrates that category-agnostic, reference-free 6D estimation is feasible and effective, even when trained solely on synthetic data. The integration of foundation model features and MoE scaling enables strong generalization and real-time inference, making the approach suitable for deployment in robotics, AR, and embodied AI systems. The unified pipeline avoids error accumulation typical of multi-stage methods and supports robust shape completion under occlusion and domain shift.

Limitations and Future Directions

Performance is bounded by the diversity of training categories and may degrade on long-tail or atypical shapes. Fine-grained details and articulated/deformable objects are not fully addressed. Future work should focus on scaling to richer corpora (e.g., ObjaverseXL), advancing articulation/deformation modeling, and extending toward open-world, task-driven 6D understanding.

Conclusion

This paper presents a unified, category-agnostic framework for joint 6D pose, size, and shape estimation from a single RGB-D image, achieving state-of-the-art accuracy and strong generalization to unseen categories. The approach eliminates the need for templates, CAD models, or category labels at inference, supporting scalable and robust deployment in real-world applications. The results establish a new standard for open-set 6D understanding in robotics and embodied AI.

Markdown