
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation (1908.07433v1)

Published 20 Aug 2019 in cs.CV

Abstract: Estimating the 6D pose of objects using only RGB images remains challenging because of problems such as occlusion and symmetries. It is also difficult to construct 3D models with precise texture without expert knowledge or specialized scanning devices. To address these problems, we propose a novel pose estimation method, Pix2Pose, that predicts the 3D coordinates of each object pixel without textured models. An auto-encoder architecture is designed to estimate the 3D coordinates and expected errors per pixel. These pixel-wise predictions are then used in multiple stages to form 2D-3D correspondences to directly compute poses with the PnP algorithm with RANSAC iterations. Our method is robust to occlusion by leveraging recent achievements in generative adversarial training to precisely recover occluded parts. Furthermore, a novel loss function, the transformer loss, is proposed to handle symmetric objects by guiding predictions to the closest symmetric pose. Evaluations on three different benchmark datasets containing symmetric and occluded objects show our method outperforms the state of the art using only RGB images.

Authors (3)
  1. Kiru Park (4 papers)
  2. Timothy Patten (13 papers)
  3. Markus Vincze (46 papers)
Citations (417)

Summary

This paper introduces Pix2Pose, a method for estimating the 6D pose (3D rotation and 3D translation) of known objects from a single RGB image. The core idea is to train a neural network to directly regress the 3D coordinates of each visible object pixel in the object's canonical coordinate frame. This approach avoids the need for textured 3D models during training, addressing a common limitation where high-quality textured models are unavailable or difficult to create, such as with industrial CAD models.

Key Features and Implementation Details

  1. Pixel-Wise Coordinate Regression:
    • Instead of predicting pose parameters directly or matching features, Pix2Pose predicts a "coordinate image" $I_{3D}$ in which each pixel $(u, v)$ belonging to the object contains the normalized $(x, y, z)$ coordinates of that point on the object's surface.
    • The ground-truth coordinate images for training are generated by rendering the object's 3D model from the known pose, with the RGB color of each vertex/pixel mapped directly from its normalized $(x, y, z)$ coordinates (e.g., $x \to R$, $y \to G$, $z \to B$); a minimal sketch of this mapping follows this item. See Figure 1 in the paper.
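
To make the ground-truth generation concrete, the following is a minimal sketch of the coordinate-to-color mapping, assuming the object's vertices are available as an (N, 3) array in the canonical frame; the function name and the exact normalization convention are illustrative, and rendering these colors under the known pose is left to an external renderer.

# Illustrative mapping from normalized object coordinates to RGB vertex colors
# (names and the normalization convention are assumptions, not the paper's code).
import numpy as np

def vertex_colors_from_coordinates(vertices):
    """Map object-frame (x, y, z) vertex coordinates to RGB colors in [0, 255]."""
    extent = np.maximum(np.abs(vertices).max(axis=0), 1e-9)  # per-axis half-size
    normalized = vertices / extent                            # each axis in [-1, 1]
    colors = (normalized + 1.0) / 2.0                         # x -> R, y -> G, z -> B
    return (colors * 255).astype(np.uint8)
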
  2. Network Architecture:
    • An auto-encoder architecture, similar to U-Net, is used. It takes a 128x128 RGB image patch (cropped around the detected object) as input.
    • The encoder uses convolutional layers to extract features.
    • Skip connections are used between the encoder and decoder layers to preserve fine spatial details, improving accuracy near boundaries.
    • The decoder uses deconvolutional layers to reconstruct the spatial resolution.
    • The network has two output heads:
      • One outputs the 3-channel coordinate image $I_{3D}$ (using tanh activation).
      • The other outputs a 1-channel error prediction map $I_e$ (using sigmoid activation), estimating the expected L1 error of each pixel's coordinate prediction.

# Simplified architecture sketch (Keras functional API). Filter counts and
# normalization choices below are illustrative assumptions, not the paper's
# exact configuration.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # 5x5 convolution, stride 2: halves the spatial resolution
    x = layers.Conv2D(filters, 5, strides=2, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU()(x)

def deconv_block(x, filters):
    # 5x5 transposed convolution, stride 2: doubles the spatial resolution
    x = layers.Conv2DTranspose(filters, 5, strides=2, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def build_pix2pose_network(input_shape=(128, 128, 3)):
    inputs = layers.Input(shape=input_shape)

    # Encoder path (convolutional layers)
    e1 = conv_block(inputs, 64)   # 128 -> 64
    e2 = conv_block(e1, 128)      # 64 -> 32
    e3 = conv_block(e2, 256)      # 32 -> 16
    e4 = conv_block(e3, 512)      # 16 -> 8

    # Bottleneck (fully connected layers)
    b = layers.Flatten()(e4)
    b = layers.Dense(256)(b)
    b = layers.Dense(8 * 8 * 512)(b)      # project back to spatial dimensions
    b = layers.Reshape((8, 8, 512))(b)

    # Decoder path (transposed convolutions + skip connections)
    d1 = layers.Concatenate()([deconv_block(b, 256), e3])   # 8 -> 16, skip from e3
    d2 = layers.Concatenate()([deconv_block(d1, 128), e2])  # 16 -> 32, skip from e2
    d3 = layers.Concatenate()([deconv_block(d2, 64), e1])   # 32 -> 64, skip from e1
    d4 = deconv_block(d3, 64)                                # 64 -> 128

    # Output heads
    coord_output = layers.Conv2D(3, 5, padding='same', activation='tanh',
                                 name='I_3D')(d4)   # normalized (x, y, z) per pixel
    error_output = layers.Conv2D(1, 5, padding='same', activation='sigmoid',
                                 name='I_e')(d4)    # expected per-pixel L1 error
    return tf.keras.Model(inputs, [coord_output, error_output], name='pix2pose')
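
A minimal usage sketch under the same assumptions (the paper trains one network per object; the input is a batch of 128x128 crops from the detector):

model = build_pix2pose_network()
patches = tf.random.uniform((4, 128, 128, 3))   # stand-in for detector crops
coords, errors = model(patches)                 # shapes (4, 128, 128, 3) and (4, 128, 128, 1)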

  3. Training and Loss Functions:
    • Data Augmentation: Training uses real images with augmentations: objects are pasted onto random backgrounds (COCO dataset), color jittering, blurring boundaries, simulated occlusion (removing parts of the object), and random in-plane rotations.
    • Reconstruction Loss ($\mathcal{L}_r$): A basic L1 loss between the predicted coordinate image $I_{3D}$ and the ground truth $I_{gt}$. Errors on object pixels are weighted higher (by a factor $\beta$) than errors on background pixels.
    • Transformer Loss ($\mathcal{L}_{3D}$): A novel loss for handling symmetric objects. For objects with known discrete symmetries (e.g., a 180-degree rotation for a box), the loss computes the reconstruction error between the prediction $I_{3D}$ and the ground truth transformed into each possible symmetric pose. The minimum error among these poses is used for backpropagation.

      $$\mathcal{L}_{3D} = \min_{p \in \mathrm{sym}} \mathcal{L}_r\left(I_{3D},\, R_p I_{gt}\right)$$

      where $\mathrm{sym}$ is the set of symmetry transformations (including the identity) and $R_p$ is the 3x3 rotation matrix for symmetry $p$. This guides the network towards the closest valid symmetric pose without needing predefined view limits.

    • Error Prediction Loss ($\mathcal{L}_e$): An L2 loss that trains the network to predict the actual pixel-wise L1 coordinate error: $\mathcal{L}_e = \frac{1}{n} \sum_{i} \left\| I_e^{i} - \min\left(\|I_{3D}^i - I_{gt}^i\|_1,\, 1\right) \right\|_2^2$.

    • GAN Loss ($\mathcal{L}_{GAN}$): A generative adversarial network (GAN) framework is optionally used. A discriminator tries to distinguish real (rendered) coordinate images from predicted ones, which encourages the generator (the Pix2Pose network) to produce more realistic coordinate maps and, in particular, to "inpaint" coordinates for occluded regions realistically.

    • Combined Loss: The final objective combines these losses: $\mathcal{L}_{total} = \mathcal{L}_{GAN} + \lambda_1 \mathcal{L}_{3D} + \lambda_2 \mathcal{L}_e$. The ablation study shows that GAN training significantly improves robustness to occlusion. A sketch of the transformer and error-prediction losses follows below.
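
As a concrete illustration of the two object-specific losses, here is a hedged TensorFlow sketch, assuming coordinate images are laid out as (batch, H, W, 3), that the symmetry transformations are supplied as a stack of 3x3 rotation matrices including the identity, and that the object-pixel weighting of $\mathcal{L}_r$ is folded in via a binary mask; function names and the value of $\beta$ are illustrative.

import tensorflow as tf

def transformer_loss(pred_coords, gt_coords, sym_rotations, obj_mask, beta=3.0):
    """Minimum-over-symmetries weighted L1 reconstruction loss (illustrative sketch).

    pred_coords, gt_coords: (B, H, W, 3) coordinate images.
    sym_rotations: (S, 3, 3) rotation matrices, identity included.
    obj_mask: (B, H, W, 1) binary mask of object pixels.
    """
    # Weight object pixels higher than background pixels (factor beta)
    weights = 1.0 + (beta - 1.0) * obj_mask                       # (B, H, W, 1)

    per_sym_errors = []
    for R in tf.unstack(sym_rotations):                           # iterate symmetries
        # Rotate the ground-truth coordinates into the symmetric pose
        gt_rot = tf.einsum('ij,bhwj->bhwi', R, gt_coords)
        l1 = tf.reduce_mean(weights * tf.abs(pred_coords - gt_rot), axis=[1, 2, 3])
        per_sym_errors.append(l1)                                  # (B,)

    # Back-propagate only through the closest symmetric pose
    return tf.reduce_mean(tf.reduce_min(tf.stack(per_sym_errors, axis=0), axis=0))

def error_prediction_loss(pred_error, pred_coords, gt_coords):
    """L2 loss between the predicted error map and the actual (clipped) L1 error."""
    target = tf.minimum(tf.reduce_sum(tf.abs(pred_coords - gt_coords), axis=-1,
                                      keepdims=True), 1.0)        # clip target to [0, 1]
    return tf.reduce_mean(tf.square(pred_error - target))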

  4. Two-Stage Pose Prediction:
    • Input: An initial bounding box from a 2D object detector (e.g., Faster R-CNN, RetinaNet). The crop is padded (1.5x larger) to include context and potentially occluded areas.
    • Stage 1 (Mask Refinement & Bbox Adjustment):

      1. Run the network on the initial padded crop.
      2. Generate a refined mask: keep pixels where the predicted coordinate magnitude $\|I_{3D}\|_2$ is non-zero AND the predicted error $I_e$ is below an outlier threshold $\theta_o$.
      3. Calculate the centroid of this refined mask.
      4. Create a new, tighter bounding box centered on this centroid.
      5. Generate a refined input image by cropping with the new box and masking out pixels not in the refined mask (setting them to black/zero).
    • Stage 2 (Final Pose Estimation):

      1. Run the network again on the refined input image from Stage 1.
      2. Generate 2D-3D correspondences: for each pixel $(u, v)$ where the predicted coordinate $I_{3D}(u, v)$ is non-zero and the predicted error $I_e(u, v)$ is below an inlier threshold $\theta_i$, create a correspondence between the 2D pixel coordinate $(u, v)$ and the predicted 3D coordinate $I_{3D}(u, v)$.
      3. Solve for the 6D pose using the Perspective-n-Point (PnP) algorithm (specifically EPnP) with RANSAC over these 2D-3D correspondences; RANSAC rejects remaining outlier correspondences based on reprojection error (threshold $\theta_{re}$). A minimal sketch of this step appears below.
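
To illustrate Stage 2, below is a minimal OpenCV-based sketch of recovering the pose from the network outputs, assuming coords and errors are single-image predictions already mapped back to metric object coordinates, K is the 3x3 camera intrinsic matrix, and the two thresholds are illustrative rather than the paper's tuned values.

import numpy as np
import cv2

def estimate_pose(coords, errors, K, inlier_thresh=0.1, reproj_thresh=3.0):
    """Recover (R, t) from pixel-wise 3D coordinate predictions (illustrative sketch).

    coords: (H, W, 3) predicted object-frame coordinates (metric units).
    errors: (H, W, 1) predicted per-pixel error.
    K: (3, 3) camera intrinsic matrix.
    """
    # Select reliable pixels: non-zero coordinate prediction and low predicted error
    valid = (np.linalg.norm(coords, axis=-1) > 0) & (errors[..., 0] < inlier_thresh)
    vs, us = np.nonzero(valid)

    image_points = np.stack([us, vs], axis=-1).astype(np.float64)   # 2D (u, v)
    object_points = coords[vs, us].astype(np.float64)               # matching 3D points

    # EPnP inside a RANSAC loop rejects remaining outliers by reprojection error
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points, image_points, K, None,
        reprojectionError=reproj_thresh, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                                      # rotation matrix
    return R, tvec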

  5. Practical Advantages:

    • No Textured Models: Works with untextured CAD models or geometry-only scans.
    • Occlusion Robustness: The GAN training helps predict coordinates for occluded parts. The PnP+RANSAC step uses only reliable visible points.
    • Symmetry Handling: The transformer loss provides a principled way to handle discrete symmetries without complex viewpoint handling.
    • Efficiency: No rendering is required during inference, making it relatively fast (reported 25-45ms per object region, plus detection time).

Evaluation and Results

  • Evaluated on LineMOD, LineMOD Occlusion, and T-Less datasets.
  • Metrics: ADD(-S) for LineMOD, VSD for T-Less.
  • Outperforms state-of-the-art RGB-only methods significantly on all datasets, especially on the challenging T-Less dataset which features texture-less and symmetric industrial objects.
  • Ablation studies confirm the benefits of the transformer loss for symmetry, GAN training for occlusion, and the two-stage refinement process. The method also shows robustness to using less precise 3D models (convex hulls).

Limitations and Future Work

  • Performance can degrade for poses not well represented in the training data or augmentation.
  • Failures can occur with severe occlusion or poor initial 2D detections.
  • Future work includes improving data augmentation strategies and generalizing the approach to handle intra-class variations.

In summary, Pix2Pose offers a practical and effective approach for 6D pose estimation from RGB, notable for its ability to work without textured models, handle symmetries robustly via the transformer loss, and manage occlusion through GAN-based coordinate prediction and a two-stage refinement pipeline feeding into PnP+RANSAC.
