
Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation (1908.07433v1)

Published 20 Aug 2019 in cs.CV

Abstract: Estimating the 6D pose of objects using only RGB images remains challenging because of problems such as occlusion and symmetries. It is also difficult to construct 3D models with precise texture without expert knowledge or specialized scanning devices. To address these problems, we propose a novel pose estimation method, Pix2Pose, that predicts the 3D coordinates of each object pixel without textured models. An auto-encoder architecture is designed to estimate the 3D coordinates and expected errors per pixel. These pixel-wise predictions are then used in multiple stages to form 2D-3D correspondences to directly compute poses with the PnP algorithm with RANSAC iterations. Our method is robust to occlusion by leveraging recent achievements in generative adversarial training to precisely recover occluded parts. Furthermore, a novel loss function, the transformer loss, is proposed to handle symmetric objects by guiding predictions to the closest symmetric pose. Evaluations on three different benchmark datasets containing symmetric and occluded objects show our method outperforms the state of the art using only RGB images.

Authors (3)
  1. Kiru Park (4 papers)
  2. Timothy Patten (13 papers)
  3. Markus Vincze (46 papers)
Citations (417)

Summary

This paper introduces Pix2Pose, a method for estimating the 6D pose (3D rotation and 3D translation) of known objects from a single RGB image. The core idea is to train a neural network to directly regress the 3D coordinates of each visible object pixel in the object's canonical coordinate frame. This approach avoids the need for textured 3D models during training, addressing a common limitation where high-quality textured models are unavailable or difficult to create, such as with industrial CAD models.

Key Features and Implementation Details

  1. Pixel-Wise Coordinate Regression:
    • Instead of predicting pose parameters directly or matching features, Pix2Pose predicts a "coordinate image" $I_{3D}$ in which each pixel $(u, v)$ belonging to the object contains the normalized $(x, y, z)$ coordinates of that point on the object's surface.
    • The ground-truth coordinate images for training are generated by rendering the object's 3D model from the known pose, with the RGB color of each vertex/pixel mapped directly from its normalized $(x, y, z)$ coordinates (e.g., $x \to R$, $y \to G$, $z \to B$); a minimal sketch of this mapping follows this item. See Figure 1 in the paper.
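
To make the ground-truth generation concrete, the following is a minimal sketch of the coordinate-to-color mapping, assuming the object's vertices are available as an (N, 3) array in the canonical frame; the function name and the exact normalization convention are illustrative, and rendering these colors under the known pose is left to an external renderer.

# Illustrative mapping from normalized object coordinates to RGB vertex colors
# (names and the normalization convention are assumptions, not the paper's code).
import numpy as np

def vertex_colors_from_coordinates(vertices):
    """Map object-frame (x, y, z) vertex coordinates to RGB colors in [0, 255]."""
    extent = np.maximum(np.abs(vertices).max(axis=0), 1e-9)  # per-axis half-size
    normalized = vertices / extent                            # each axis in [-1, 1]
    colors = (normalized + 1.0) / 2.0                         # x -> R, y -> G, z -> B
    return (colors * 255).astype(np.uint8)
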
  2. Network Architecture:
    • An auto-encoder architecture, similar to U-Net, is used. It takes a 128x128 RGB image patch (cropped around the detected object) as input.
    • The encoder uses convolutional layers to extract features.
    • Skip connections are used between the encoder and decoder layers to preserve fine spatial details, improving accuracy near boundaries.
    • The decoder uses deconvolutional layers to reconstruct the spatial resolution.
    • The network has two output heads:
      • One outputs the 3-channel coordinate image $I_{3D}$ (using tanh activation).
      • The other outputs a 1-channel error prediction map $I_e$ (using sigmoid activation), estimating the expected L1 error of each pixel's coordinate prediction.

# Simplified architecture sketch (Keras functional API). Filter counts and
# normalization choices below are illustrative assumptions, not the paper's
# exact configuration.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # 5x5 convolution, stride 2: halves the spatial resolution
    x = layers.Conv2D(filters, 5, strides=2, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU()(x)

def deconv_block(x, filters):
    # 5x5 transposed convolution, stride 2: doubles the spatial resolution
    x = layers.Conv2DTranspose(filters, 5, strides=2, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def build_pix2pose_network(input_shape=(128, 128, 3)):
    inputs = layers.Input(shape=input_shape)

    # Encoder path (convolutional layers)
    e1 = conv_block(inputs, 64)   # 128 -> 64
    e2 = conv_block(e1, 128)      # 64 -> 32
    e3 = conv_block(e2, 256)      # 32 -> 16
    e4 = conv_block(e3, 512)      # 16 -> 8

    # Bottleneck (fully connected layers)
    b = layers.Flatten()(e4)
    b = layers.Dense(256)(b)
    b = layers.Dense(8 * 8 * 512)(b)      # project back to spatial dimensions
    b = layers.Reshape((8, 8, 512))(b)

    # Decoder path (transposed convolutions + skip connections)
    d1 = layers.Concatenate()([deconv_block(b, 256), e3])   # 8 -> 16, skip from e3
    d2 = layers.Concatenate()([deconv_block(d1, 128), e2])  # 16 -> 32, skip from e2
    d3 = layers.Concatenate()([deconv_block(d2, 64), e1])   # 32 -> 64, skip from e1
    d4 = deconv_block(d3, 64)                                # 64 -> 128

    # Output heads
    coord_output = layers.Conv2D(3, 5, padding='same', activation='tanh',
                                 name='I_3D')(d4)   # normalized (x, y, z) per pixel
    error_output = layers.Conv2D(1, 5, padding='same', activation='sigmoid',
                                 name='I_e')(d4)    # expected per-pixel L1 error
    return tf.keras.Model(inputs, [coord_output, error_output], name='pix2pose')
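
A minimal usage sketch under the same assumptions (the paper trains one network per object; the input is a batch of 128x128 crops from the detector):

model = build_pix2pose_network()
patches = tf.random.uniform((4, 128, 128, 3))   # stand-in for detector crops
coords, errors = model(patches)                 # shapes (4, 128, 128, 3) and (4, 128, 128, 1)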

  3. Training and Loss Functions:
    • Data Augmentation: Training uses real images with augmentations: objects are pasted onto random backgrounds (COCO dataset), color jittering, blurring boundaries, simulated occlusion (removing parts of the object), and random in-plane rotations.
    • Reconstruction Loss ($\mathcal{L}_r$): A basic L1 loss between the predicted coordinate image $I_{3D}$ and the ground truth $I_{gt}$. Errors on object pixels are weighted higher (by a factor $\beta$) than errors on background pixels.
    • Transformer Loss ($\mathcal{L}_{3D}$): A novel loss for handling symmetric objects. For objects with known discrete symmetries (e.g., a 180-degree rotation for a box), the loss computes the reconstruction error between the prediction $I_{3D}$ and the ground truth transformed into each possible symmetric pose. The minimum error among these poses is used for backpropagation.

      $$\mathcal{L}_{3D} = \min_{p \in \mathrm{sym}} \mathcal{L}_r\left(I_{3D},\, R_p I_{gt}\right)$$

      where $\mathrm{sym}$ is the set of symmetry transformations (including the identity) and $R_p$ is the 3x3 rotation matrix for symmetry $p$. This guides the network towards the closest valid symmetric pose without needing predefined view limits.

    • Error Prediction Loss ($\mathcal{L}_e$): An L2 loss that trains the network to predict the actual pixel-wise L1 coordinate error: $\mathcal{L}_e = \frac{1}{n} \sum_{i} \left\| I_e^{i} - \min\left(\|I_{3D}^i - I_{gt}^i\|_1,\, 1\right) \right\|_2^2$.

    • GAN Loss ($\mathcal{L}_{GAN}$): A generative adversarial network (GAN) framework is optionally used. A discriminator tries to distinguish real (rendered) coordinate images from predicted ones, which encourages the generator (the Pix2Pose network) to produce more realistic coordinate maps and, in particular, to "inpaint" coordinates for occluded regions realistically.

    • Combined Loss: The final objective combines these losses: $\mathcal{L}_{total} = \mathcal{L}_{GAN} + \lambda_1 \mathcal{L}_{3D} + \lambda_2 \mathcal{L}_e$. The ablation study shows that GAN training significantly improves robustness to occlusion. A sketch of the transformer and error-prediction losses follows below.
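
As a concrete illustration of the two object-specific losses, here is a hedged TensorFlow sketch, assuming coordinate images are laid out as (batch, H, W, 3), that the symmetry transformations are supplied as a stack of 3x3 rotation matrices including the identity, and that the object-pixel weighting of $\mathcal{L}_r$ is folded in via a binary mask; function names and the value of $\beta$ are illustrative.

import tensorflow as tf

def transformer_loss(pred_coords, gt_coords, sym_rotations, obj_mask, beta=3.0):
    """Minimum-over-symmetries weighted L1 reconstruction loss (illustrative sketch).

    pred_coords, gt_coords: (B, H, W, 3) coordinate images.
    sym_rotations: (S, 3, 3) rotation matrices, identity included.
    obj_mask: (B, H, W, 1) binary mask of object pixels.
    """
    # Weight object pixels higher than background pixels (factor beta)
    weights = 1.0 + (beta - 1.0) * obj_mask                       # (B, H, W, 1)

    per_sym_errors = []
    for R in tf.unstack(sym_rotations):                           # iterate symmetries
        # Rotate the ground-truth coordinates into the symmetric pose
        gt_rot = tf.einsum('ij,bhwj->bhwi', R, gt_coords)
        l1 = tf.reduce_mean(weights * tf.abs(pred_coords - gt_rot), axis=[1, 2, 3])
        per_sym_errors.append(l1)                                  # (B,)

    # Back-propagate only through the closest symmetric pose
    return tf.reduce_mean(tf.reduce_min(tf.stack(per_sym_errors, axis=0), axis=0))

def error_prediction_loss(pred_error, pred_coords, gt_coords):
    """L2 loss between the predicted error map and the actual (clipped) L1 error."""
    target = tf.minimum(tf.reduce_sum(tf.abs(pred_coords - gt_coords), axis=-1,
                                      keepdims=True), 1.0)        # clip target to [0, 1]
    return tf.reduce_mean(tf.square(pred_error - target))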

  4. Two-Stage Pose Prediction:
    • Input: An initial bounding box from a 2D object detector (e.g., Faster R-CNN, RetinaNet). The crop is padded (1.5x larger) to include context and potentially occluded areas.
    • Stage 1 (Mask Refinement & Bbox Adjustment):

      1. Run the network on the initial padded crop.
      2. Generate a refined mask: keep pixels where the predicted coordinate magnitude $\|I_{3D}\|_2$ is non-zero AND the predicted error $I_e$ is below an outlier threshold $\theta_o$.
      3. Calculate the centroid of this refined mask.
      4. Create a new, tighter bounding box centered on this centroid.
      5. Generate a refined input image by cropping with the new box and masking out pixels not in the refined mask (setting them to black/zero).
    • Stage 2 (Final Pose Estimation):

      1. Run the network again on the refined input image from Stage 1.
      2. Generate 2D-3D correspondences: for each pixel $(u, v)$ where the predicted coordinate $I_{3D}(u, v)$ is non-zero and the predicted error $I_e(u, v)$ is below an inlier threshold $\theta_i$, create a correspondence between the 2D pixel coordinate $(u, v)$ and the predicted 3D coordinate $I_{3D}(u, v)$.
      3. Solve for the 6D pose using the Perspective-n-Point (PnP) algorithm (specifically EPnP) with RANSAC over these 2D-3D correspondences; RANSAC rejects remaining outlier correspondences based on reprojection error (threshold $\theta_{re}$). A minimal sketch of this step appears below.
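
To illustrate Stage 2, below is a minimal OpenCV-based sketch of recovering the pose from the network outputs, assuming coords and errors are single-image predictions already mapped back to metric object coordinates, K is the 3x3 camera intrinsic matrix, and the two thresholds are illustrative rather than the paper's tuned values.

import numpy as np
import cv2

def estimate_pose(coords, errors, K, inlier_thresh=0.1, reproj_thresh=3.0):
    """Recover (R, t) from pixel-wise 3D coordinate predictions (illustrative sketch).

    coords: (H, W, 3) predicted object-frame coordinates (metric units).
    errors: (H, W, 1) predicted per-pixel error.
    K: (3, 3) camera intrinsic matrix.
    """
    # Select reliable pixels: non-zero coordinate prediction and low predicted error
    valid = (np.linalg.norm(coords, axis=-1) > 0) & (errors[..., 0] < inlier_thresh)
    vs, us = np.nonzero(valid)

    image_points = np.stack([us, vs], axis=-1).astype(np.float64)   # 2D (u, v)
    object_points = coords[vs, us].astype(np.float64)               # matching 3D points

    # EPnP inside a RANSAC loop rejects remaining outliers by reprojection error
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points, image_points, K, None,
        reprojectionError=reproj_thresh, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                                      # rotation matrix
    return R, tvec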

  5. Practical Advantages:

    • No Textured Models: Works with untextured CAD models or geometry-only scans.
    • Occlusion Robustness: The GAN training helps predict coordinates for occluded parts. The PnP+RANSAC step uses only reliable visible points.
    • Symmetry Handling: The transformer loss provides a principled way to handle discrete symmetries without complex viewpoint handling.
    • Efficiency: No rendering is required during inference, making it relatively fast (reported 25-45ms per object region, plus detection time).

Evaluation and Results

  • Evaluated on LineMOD, LineMOD Occlusion, and T-Less datasets.
  • Metrics: ADD(-S) for LineMOD, VSD for T-Less.
  • Outperforms state-of-the-art RGB-only methods significantly on all datasets, especially on the challenging T-Less dataset which features texture-less and symmetric industrial objects.
  • Ablation studies confirm the benefits of the transformer loss for symmetry, GAN training for occlusion, and the two-stage refinement process. The method also shows robustness to using less precise 3D models (convex hulls).

Limitations and Future Work

  • Performance can degrade for poses not well represented in the training data or augmentation.
  • Failures can occur with severe occlusion or poor initial 2D detections.
  • Future work includes improving data augmentation strategies and generalizing the approach to handle intra-class variations.

In summary, Pix2Pose offers a practical and effective approach for 6D pose estimation from RGB, notable for its ability to work without textured models, handle symmetries robustly via the transformer loss, and manage occlusion through GAN-based coordinate prediction and a two-stage refinement pipeline feeding into PnP+RANSAC.
