
PoseCNN: 6D Pose Estimation

Updated 14 January 2026
  • PoseCNN is a convolutional neural network architecture designed for robust 6D pose estimation from monocular RGB input, effectively addressing clutter, occlusion, and object symmetries.
  • It employs a dual-stage design with a shared VGG16-style feature extractor and three specialized prediction heads for segmentation, translation, and quaternion-based rotation regression.
  • Innovative loss functions, including SLoss to handle symmetric object ambiguities, and extensive evaluation on the YCB-Video dataset demonstrate its state-of-the-art performance.

PoseCNN is a convolutional neural network architecture developed for 6D object pose estimation from monocular RGB input, targeting robustness against clutter and occlusion and explicit treatment of object symmetries. It introduced a decoupled treatment of translation and rotation, novel loss functions suited for symmetric objects, and a large-scale real-world RGB-D dataset, the YCB-Video dataset, for evaluation. Extensions such as ConvPoseCNN2 further densify orientation prediction and introduce feature-refinement and aggregation strategies, yielding efficiency and accuracy benefits (Xiang et al., 2017, Periyasamy et al., 2022).

1. Architectural Principles

PoseCNN's architecture is organized into two principal stages: a shared feature extractor and three parallel, task-specific heads for semantic segmentation, object center localization and translation, and rotation regression.

The shared backbone is a VGG16-style network generating feature maps at two resolutions, denoted $F_{1/8}\in\mathbb{R}^{H/8\times W/8\times 512}$ and $F_{1/16}\in\mathbb{R}^{H/16\times W/16\times 512}$, which serve as input to the specialized branches. The three prediction heads operate as follows:

  1. Semantic Segmentation Head: Uses convolutional and deconvolutional layers to produce per-pixel category probabilities, yielding a mask for each object.
  2. Translation Head: Regresses, for each pixel assigned to a given object, the 2D direction vector to the object's projected center and its depth from the camera. At inference, Hough voting accumulates these per-pixel predictions, and peaks in the resulting score map locate object centers. Predicted depths for inlier pixels yield $T_z$, and the full object translation $T=(T_x,T_y,T_z)^\top$ is recovered via the pinhole camera model, as sketched below.
  3. Rotation Head: Pools features from the region defined by inlier object pixels, and a sequence of fully connected layers regresses a quaternion $\hat q$ representing the object's 3D rotation. Normalization onto the unit 3-sphere $S^3$ is enforced to ensure valid rotations.

In summary, PoseCNN exploits pixelwise cues for both segmentation and geometry, and combines them via voting and region pooling to produce robust 6D pose hypotheses (Xiang et al., 2017).
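
To make the translation recovery concrete, here is a minimal sketch, assuming the Hough-voted center pixel $(c_x, c_y)$, the inlier-averaged depth $T_z$, and known camera intrinsics $(f_x, f_y, p_x, p_y)$; the function name and the example intrinsics are illustrative, not taken from the PoseCNN code.

```python
import numpy as np

def backproject_center(cx, cy, Tz, fx, fy, px, py):
    """Recover T = (Tx, Ty, Tz) from the voted object center (cx, cy) and the
    predicted depth Tz using the pinhole projection
        cx = fx * Tx / Tz + px,   cy = fy * Ty / Tz + py.
    """
    Tx = (cx - px) * Tz / fx
    Ty = (cy - py) * Tz / fy
    return np.array([Tx, Ty, Tz])

# Example call with illustrative 640x480 intrinsics.
T = backproject_center(cx=350.0, cy=260.0, Tz=0.82,
                       fx=1066.8, fy=1067.5, px=312.9, py=241.3)
```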

2. Loss Functions and Symmetry Handling

PoseCNN decomposes the total loss into three major components:

$$L_{\text{total}} = \lambda_{\text{seg}}L_{\text{seg}} + \lambda_{\text{center}}L_{\text{center}} + \lambda_{\text{rot}}L_{\text{rot}}$$

  • Semantic Loss: Per-pixel cross-entropy,

$$L_{\text{seg}} = -\frac{1}{|P|} \sum_{p\in P} \sum_{k=1}^{n} y_{p,k} \log \hat{y}_{p,k}$$

  • Center and Depth Regression: Smoothed L1 loss for direction and normalized depth per pixel,

$$L_{\text{center}} = \frac{1}{|P|} \sum_{p\in P} \left[ \text{smooth}_{L_1}(\hat n_x-n_x) + \text{smooth}_{L_1}(\hat n_y-n_y) + \text{smooth}_{L_1}(\hat T_z-T_z)\right]$$

  • Rotation Losses: For 3D orientation, two options exist:

    • Pose Loss (PLoss): Squared distance between the model points transformed by the predicted and ground-truth rotations,

    $$\text{PLoss}(\tilde{q},q) = \frac{1}{2m}\sum_{x\in M} \|R(\tilde{q})x - R(q)x\|^2$$

    • ShapeMatch Loss (SLoss): Modified to account for object symmetries by matching each predicted point to its closest point on the ground-truth-transformed 3D shape,

    $$\text{SLoss}(\tilde{q},q) = \frac{1}{2m}\sum_{x_1\in M} \min_{x_2\in M} \|R(\tilde{q})x_1 - R(q)x_2\|^2$$

SLoss drives the network toward any equivalent orientation consistent with physical symmetry, addressing a key challenge in 6D object pose estimation (Xiang et al., 2017).
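
As a minimal PyTorch-style sketch of the two rotation losses above (assuming unit quaternions in $(w, x, y, z)$ order and a matrix `points` holding the $m$ model points; the helper names are illustrative):

```python
import torch

def quat_to_matrix(q):
    """Convert a unit quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = q
    return torch.stack([
        torch.stack([1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)]),
        torch.stack([2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)]),
        torch.stack([2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)]),
    ])

def ploss(q_pred, q_gt, points):
    """PLoss: average squared distance between corresponding transformed model points."""
    p_pred = points @ quat_to_matrix(q_pred).T
    p_gt = points @ quat_to_matrix(q_gt).T
    return ((p_pred - p_gt) ** 2).sum() / (2 * points.shape[0])

def sloss(q_pred, q_gt, points):
    """SLoss: match each predicted point to its closest ground-truth point, so any
    symmetry-equivalent orientation incurs (near-)zero loss."""
    p_pred = points @ quat_to_matrix(q_pred).T
    p_gt = points @ quat_to_matrix(q_gt).T
    d2 = torch.cdist(p_pred, p_gt) ** 2  # (m, m) pairwise squared distances
    return d2.min(dim=1).values.sum() / (2 * points.shape[0])
```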

3. Quaternion-Based Rotation Regression

Rotations are represented with quaternions $q=(q_w,q_x,q_y,q_z)^\top$, with explicit normalization at inference: $\hat{q} \leftarrow \hat{q}/\|\hat{q}\|$. Rotation regression employs either PLoss or SLoss. Networks trained with SLoss exhibit pronounced error peaks at the symmetry angles, indicating correct handling of ambiguous orientation assignments for symmetric objects.
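
The error peaks mentioned above are typically measured as the geodesic angle between the predicted and ground-truth rotations; one common way to compute it from unit quaternions (a sketch, not code from the paper) is:

```python
import numpy as np

def rotation_angle_deg(q_pred, q_gt):
    """Geodesic angle (in degrees) between two unit quaternions.
    The absolute value handles the q / -q double cover of SO(3)."""
    dot = np.clip(np.abs(np.dot(q_pred, q_gt)), 0.0, 1.0)
    return np.degrees(2.0 * np.arccos(dot))

# For an object with a 180-degree symmetry, predicting the symmetry-equivalent
# orientation still registers a ~180-degree "error" under this metric, which is
# why SLoss-trained networks show peaks at the symmetry angles.
```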

An ablation study demonstrates that SLoss improves ADD-S AUC by 12–18 percentage points for symmetric objects versus PLoss, which penalizes the model for predicting alternative but valid symmetric orientations (Xiang et al., 2017).

4. Evaluation Datasets and Metrics

The YCB-Video dataset, introduced with PoseCNN, is a large-scale RGB-D collection comprising 92 videos (640×480, 30 Hz) of 21 household objects, totaling 133,827 frames and annotated for 6D object pose. Annotation quality is assured via SDF-model alignment and global bundle adjustment after manual initialization. This dataset presents a significantly larger and more challenging benchmark than LINEMOD due to real clutter and occlusion levels (Xiang et al., 2017).

Evaluation employs:

  • ADD: Average Euclidean distance between model points transformed by the ground-truth versus the predicted pose (see the sketch after this list).
  • ADD-S: Average distance from each transformed model point to its closest point under the ground-truth pose, used for symmetric objects.
  • AUC: Area under the accuracy-threshold curve for ADD/ADD-S.
  • Reprojection error: Percentage of poses whose 2D reprojection error is within 2 px, reported on OccludedLINEMOD.
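
A NumPy sketch of ADD, ADD-S, and the AUC computation under the 10 cm threshold used below; the function names and the threshold sampling are assumptions for illustration:

```python
import numpy as np

def add(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding transformed model points."""
    p_gt = points @ R_gt.T + t_gt
    p_pred = points @ R_pred.T + t_pred
    return np.linalg.norm(p_pred - p_gt, axis=1).mean()

def add_s(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean closest-point distance, invariant to object symmetries."""
    p_gt = points @ R_gt.T + t_gt
    p_pred = points @ R_pred.T + t_pred
    d = np.linalg.norm(p_pred[:, None, :] - p_gt[None, :, :], axis=2)
    return d.min(axis=1).mean()

def auc(distances, max_threshold=0.10, steps=200):
    """Area under the accuracy-vs-threshold curve, thresholds up to 10 cm."""
    thresholds = np.linspace(0.0, max_threshold, steps)
    accuracy = [(distances < t).mean() for t in thresholds]
    return np.trapz(accuracy, thresholds) / max_threshold
```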

5. Experimental Results

Quantitative results on YCB-Video and OccludedLINEMOD demonstrate that PoseCNN achieves state-of-the-art performance in monocular 6D pose estimation with significant robustness to occlusion and symmetry:

| Method | AUC (ADD/ADD-S), YCB-Video | Pose accuracy, OccludedLINEMOD |
| --- | --- | --- |
| PoseCNN (RGB only) | 0.45 | – |
| PoseCNN + ICP (RGB-D) | 0.68 | – |
| Coord-Reg RANSAC (baseline) | 0.25 | – |
| BB8 (2017) | – | 47% (ADD ≤ 0.1d) |
| ICP-RANSAC (2015) | – | 55% |
| PoseCNN (PLoss) + ICP | – | 76% |
| PoseCNN (SLoss) + ICP | – | 83% |

ADD/ADD-S scores are computed up to a 10 cm threshold and averaged over all test objects. ICP-based depth refinement yields further accuracy improvements over color-only predictions.

SLoss leads to accurate predictions for symmetric or ambiguous shapes (e.g., the wood block and the clamp), whereas PLoss distributes errors uniformly over all symmetrically equivalent orientations.

Qualitative results highlight PoseCNN's ability to localize object centers and rotations even under severe occlusion or for textureless items (Xiang et al., 2017).

6. Extensions: Fully Convolutional Orientation, Aggregation, and Refinement

ConvPoseCNN2 extends the original by replacing the RoI-pooled, fully-connected rotation head with a fully-convolutional branch producing per-pixel quaternion predictions (Periyasamy et al., 2022). Dense pose predictions are aggregated via:

  • Weighted Quaternion Averaging: Principal eigenvector of $\sum_i w_i q_i q_i^\top$ (see the sketch after this list).
  • Weighted RANSAC Clustering: Angular clustering of quaternions based on confidence.
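
A minimal NumPy sketch of the eigenvector-based weighted average referenced in the list (names are illustrative). Note that the outer product $q_i q_i^\top$ is invariant to the sign of $q_i$, so the antipodal $q/{-q}$ ambiguity needs no special handling:

```python
import numpy as np

def weighted_quaternion_average(quats, weights):
    """quats: (n, 4) unit quaternions; weights: (n,) non-negative confidences.
    Returns the unit quaternion given by the principal eigenvector of
    sum_i w_i q_i q_i^T."""
    M = np.einsum('i,ij,ik->jk', weights, quats, quats)  # 4x4 accumulator
    eigvals, eigvecs = np.linalg.eigh(M)                 # ascending eigenvalues
    q_avg = eigvecs[:, -1]                               # principal eigenvector
    return q_avg / np.linalg.norm(q_avg)
```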

A feature refinement module applies residual updates between VGG backbone and task heads, improving 6D accuracy and reducing model size and training time. Key quantitative differences on YCB-Video include:

| Model | AUC_P | AUC_S | Rot_P | Rot_S | Trans [m] | IoU | Model Size | Train it/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PoseCNN (reimpl.) | 53.29 | 78.31 | 69.00 | 90.49 | 0.0465 | 0.807 | 1.1 GiB | 1.18 |
| ConvPoseCNN2 (L2) | 57.42 | 79.26 | 74.53 | 91.56 | 0.0411 | 0.804 | 309 MiB | 2.09 |

Pre-prediction residual refinement (T=3 iterations) yields a peak ADD-S AUC of 80.6%. RANSAC-based aggregation offers better symmetry handling at the cost of higher runtime (PoseCNN: 141.7 ms/frame; ConvPoseCNN2 + W-RANSAC: 564 ms/frame) (Periyasamy et al., 2022).
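
The weighted RANSAC clustering is described only at a high level here; the following is one plausible reading of it (confidence-proportional hypothesis sampling and an angular inlier test with an assumed threshold), a sketch rather than the authors' implementation:

```python
import numpy as np

def quat_angle(q1, q2):
    """Geodesic angle between two unit quaternions (sign-invariant)."""
    return 2.0 * np.arccos(np.clip(np.abs(q1 @ q2), 0.0, 1.0))

def weighted_ransac_quaternion(quats, weights, iters=50, inlier_angle=np.radians(15)):
    """Sample hypotheses with probability proportional to confidence, score each
    by the total weight of quaternions within an angular threshold, and return
    the weighted eigenvector average of the best inlier set. The iteration count
    and angular threshold are assumed values."""
    probs = weights / weights.sum()
    rng = np.random.default_rng(0)
    best_inliers, best_score = None, -1.0
    for idx in rng.choice(len(quats), size=iters, p=probs):
        hyp = quats[idx]
        inliers = np.array([quat_angle(hyp, q) < inlier_angle for q in quats])
        score = weights[inliers].sum()
        if score > best_score:
            best_score, best_inliers = score, inliers
    q_in, w_in = quats[best_inliers], weights[best_inliers]
    M = np.einsum('i,ij,ik->jk', w_in, q_in, q_in)
    q_avg = np.linalg.eigh(M)[1][:, -1]
    return q_avg / np.linalg.norm(q_avg)
```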

Failures are concentrated on highly symmetric or textureless objects, for which symmetry-induced ambiguities or lack of local appearance cues lead to uncertainty.

7. Training Regimen and Implementation

PoseCNN is implemented in TensorFlow with a custom CUDA Hough voting layer, and its convolutional and fully connected layers are initialized from VGG16 weights. The full procedure incorporates 80,000 synthetic images for augmentation (objects rendered on random backgrounds under diverse illumination). Optimization employs SGD with momentum 0.9, a learning rate of $10^{-3}$ decayed at fixed steps, and balanced loss weights $\lambda_{\text{seg}}=1.0$, $\lambda_{\text{center}}=1.0$, $\lambda_{\text{rot}}=0.3$, with batch size 16 distributed over two GPUs (Xiang et al., 2017).
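
As a hedged illustration of how the reported hyperparameters fit together (a sketch under the stated settings, not the released TensorFlow code; PyTorch is used here for brevity):

```python
import torch

# Loss weights reported for PoseCNN training.
LAMBDA_SEG, LAMBDA_CENTER, LAMBDA_ROT = 1.0, 1.0, 0.3

def total_loss(seg_loss, center_loss, rot_loss):
    """Weighted sum of the three head losses."""
    return (LAMBDA_SEG * seg_loss
            + LAMBDA_CENTER * center_loss
            + LAMBDA_ROT * rot_loss)

def make_optimizer(parameters):
    """SGD with momentum 0.9 and initial learning rate 1e-3; the learning rate
    is decayed at fixed steps (the exact schedule is not reproduced here)."""
    return torch.optim.SGD(parameters, lr=1e-3, momentum=0.9)
```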

ConvPoseCNN2 introduces the aggregated dense orientation head and feature refinement in a memory- and speed-optimized manner, training at ≈2 iterations/s with a 309 MiB model, compared to ≈1.18 iterations/s and 1.1 GiB for PoseCNN. Loss variants and aggregation methods are tuned via ablation for optimal AUC-P and AUC-S on YCB-Video (Periyasamy et al., 2022).
