PanoTPS-Net: 3D Indoor Layout Estimation
- The paper introduces a CNN-based approach that formulates indoor room layout prediction as a learnable image-warping problem using a differentiable Thin Plate Spline transformation.
- It employs a modified Xception backbone with a TPS spatial transformation layer to capture complex, non-cuboid geometries through smooth deformation of a reference layout.
- The method matches or exceeds prior approaches on 3DIoU and 2DIoU across four benchmarks, handling both cuboid and irregular room structures.
PanoTPS-Net is a convolutional neural network (CNN) architecture for estimating the 3D layout of indoor rooms from a single 360° equirectangular panorama image via a differentiable Thin Plate Spline (TPS) transformation. The model formulates room layout prediction as a learnable image-warping problem, enabling robust generalization to both cuboid and non-cuboid room types. By leveraging the smoothness and flexibility of TPS, PanoTPS-Net bridges the gap between simple handcrafted reference layouts and the diverse structural complexity found in real-world environments (Ibrahem et al., 13 Oct 2025).
1. Problem Formulation and Motivation
The fundamental challenge addressed by PanoTPS-Net is the estimation of complete 3D room structure—including walls, floor, ceiling boundaries, and corner positions—from a single panoramic RGB image. Traditional approaches generally fall into two categories: edge (boundary) map prediction or direct regression of corner coordinates. These methods either impose restrictive Manhattan-world (rectilinear, cuboid) assumptions or struggle to generalize beyond basic geometry, leading to reduced accuracy in non-cuboid or irregular rooms.
The principal innovation of PanoTPS-Net is the framing of layout estimation as a spatial warping task. Starting from a simple reference layout (such as a canonical cuboid edge/corner map), the model predicts a TPS transformation that smoothly deforms the reference to match the target room shape in the panorama. TPS is selected for its capacity to satisfy precise control-point alignment while maintaining global smoothness by minimizing the bending energy of each coordinate function $f$ of the warp:
$$I[f] = \iint_{\mathbb{R}^2} \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) \, dx \, dy$$
This property enables the network to capture complex structural variations without the over-flexibility or instability of unconstrained transformation schemes.
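The interpolation property can be illustrated with a minimal NumPy sketch of a classical TPS fit. This is not the PanoTPS-Net pipeline, which regresses the TPS parameters with a CNN instead of solving a linear system; the function names `tps_kernel`, `fit_tps`, and `apply_tps` and the five sample points are assumptions made for illustration.

```python
import numpy as np

def tps_kernel(r):
    """Standard 2D TPS radial basis U(r) = r^2 log r, with U(0) = 0."""
    out = np.zeros_like(r, dtype=float)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def fit_tps(src, dst):
    """Solve for TPS weights W (N x 2) and affine part A (3 x 2) mapping src -> dst."""
    n = src.shape[0]
    K = tps_kernel(np.linalg.norm(src[:, None, :] - src[None, :, :], axis=-1))
    P = np.hstack([np.ones((n, 1)), src])      # rows are [1, x, y]
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = P
    L[n:, :n] = P.T                            # side conditions P^T W = 0
    rhs = np.vstack([dst, np.zeros((3, 2))])
    sol = np.linalg.solve(L, rhs)
    return sol[:n], sol[n:]

def apply_tps(points, src, W, A):
    """Evaluate the fitted warp at arbitrary 2D points."""
    U = tps_kernel(np.linalg.norm(points[:, None, :] - src[None, :, :], axis=-1))
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return P @ A + U @ W

# Hypothetical control points: unit-square corners plus the centre.
src = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
dst = src.copy()
dst[4] += [0.1, -0.05]                         # displace only the centre point
W, A = fit_tps(src, dst)
warped = apply_tps(src, src, W, A)             # passes through dst exactly
```

When the target points are a purely affine image of the source points, the fitted nonlinear weights vanish, which reflects the fact that affine maps have zero bending energy.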
2. Network Architecture
PanoTPS-Net employs a two-stage process comprising:
- CNN Feature Extractor: The model ingests a resized (1024×512) RGB panorama and computes latent feature embeddings using a modified Xception backbone ("MXception"), characterized by depth-wise separable convolutions for computational efficiency. After the convolutional stages, global average pooling reduces the spatial output to a feature vector, and a final fully connected layer regresses the parameters of the TPS transformation:
  - Nonlinear control-point offsets $w_i \in \mathbb{R}^2$, for $i = 1, \dots, N$ control points.
  - The linear affine part $A \in \mathbb{R}^{2 \times 3}$, typically initialized as the identity.
- TPS Spatial Transformation Layer: The predicted TPS parameters are applied to a regular source grid of control points $c_i$ over the reference map. The TPS deformation for a query coordinate $p = (x, y)$ is given by:
$$T(p) = A \begin{bmatrix} 1 \\ x \\ y \end{bmatrix} + \sum_{i=1}^{N} w_i \, U\!\left( \lVert p - c_i \rVert \right)$$
with the standard TPS kernel $U(r) = r^2 \log r$ and $w_i$ learned per control point. This layered design enables end-to-end differentiable image warping, which is integral to training via backpropagation.
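The transformation layer described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `regular_grid`, `tps_map`, and `warp_reference`, the normalized $[0, 1]^2$ coordinate range, the (2, 3) affine layout, and nearest-neighbour sampling are all choices made here for brevity. A trainable layer would need a differentiable sampler such as bilinear interpolation.

```python
import numpy as np

def regular_grid(n_side):
    """n_side x n_side control points on a regular grid in [0, 1]^2, shape (N, 2)."""
    t = np.linspace(0.0, 1.0, n_side)
    gx, gy = np.meshgrid(t, t)
    return np.stack([gx.ravel(), gy.ravel()], axis=1)

def tps_map(points, ctrl, affine, weights):
    """Affine part applied to [1, x, y] plus the sum of weighted TPS kernels."""
    d = np.linalg.norm(points[:, None, :] - ctrl[None, :, :], axis=-1)
    U = d ** 2 * np.log(np.where(d > 0, d, 1.0))   # U(0) = 0 without log(0)
    homog = np.hstack([np.ones((len(points), 1)), points])
    return homog @ affine.T + U @ weights

def warp_reference(ref, ctrl, affine, weights):
    """Backward-warp a 2D reference map with nearest-neighbour sampling."""
    h, w = ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    out_pts = np.stack([xs.ravel() / (w - 1), ys.ravel() / (h - 1)], axis=1)
    src = tps_map(out_pts, ctrl, affine, weights)
    sx = np.clip(np.rint(src[:, 0] * (w - 1)), 0, w - 1).astype(int)
    sy = np.clip(np.rint(src[:, 1] * (h - 1)), 0, h - 1).astype(int)
    return ref[sy, sx].reshape(h, w)

ctrl = regular_grid(4)                               # 16 control points
identity = np.array([[0.0, 1.0, 0.0],                # x' = x
                     [0.0, 0.0, 1.0]])               # y' = y
no_offsets = np.zeros((len(ctrl), 2))
ref = np.arange(32.0).reshape(4, 8)                  # toy reference map
same = warp_reference(ref, ctrl, identity, no_offsets)
```

With the affine part at the identity and all offsets at zero, the warp returns the reference map unchanged, which is why an identity initialization gives the network a neutral starting point.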
3. Learning Objective and Loss Functions
The model outputs two warped predictions:
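- A reference edge map $E$ (with RGB channels encoding semantic boundaries: wall-wall, wall-ceiling, wall-floor),
- A one-channel reference corner map $C$ (as a corner-location heatmap).
A pixelwise Huber loss $\mathcal{L}_\delta$ penalizes deviations between predictions and ground truth:
$$\mathcal{L}_\delta(r) = \begin{cases} \tfrac{1}{2} r^2, & |r| \le \delta \\ \delta \left( |r| - \tfrac{1}{2} \delta \right), & |r| > \delta \end{cases}$$
where $r$ is the per-pixel difference between a warped prediction and its ground truth.
The aggregate loss is a weighted sum:
$$\mathcal{L} = \lambda_E \, \mathcal{L}_{\text{edge}} + \lambda_C \, \mathcal{L}_{\text{corner}}$$
where $\mathcal{L}_{\text{edge}}$ and $\mathcal{L}_{\text{corner}}$ are the Huber losses on the edge and corner maps, with the best performance obtained when the weighting emphasizes the edge term. This dual-output formulation enforces both fine-grained boundary alignment and precise corner localization.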
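The dual-output objective can be sketched with a plain NumPy Huber loss. The function names, the threshold `delta=1.0`, and the default weights `w_edge` and `w_corner` are placeholders chosen for illustration; they are not the values used in the paper.

```python
import numpy as np

def huber(pred, target, delta=1.0):
    """Mean pixelwise Huber loss: quadratic near zero, linear beyond delta."""
    r = np.abs(pred - target)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return float(np.where(r <= delta, quadratic, linear).mean())

def layout_loss(edge_pred, edge_gt, corner_pred, corner_gt,
                w_edge=1.0, w_corner=1.0, delta=1.0):
    """Weighted sum of edge-map and corner-map Huber losses (weights are placeholders)."""
    return (w_edge * huber(edge_pred, edge_gt, delta)
            + w_corner * huber(corner_pred, corner_gt, delta))

# Toy check: one residual in the quadratic regime, one in the linear regime.
value = huber(np.array([0.5, 3.0]), np.zeros(2))   # mean of 0.125 and 2.5
```

The linear branch limits the influence of large per-pixel residuals, while the quadratic branch keeps gradients small near the optimum.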
4. Training Procedure and Datasets
PanoTPS-Net is trained on a variety of public datasets encompassing both cuboid and non-cuboid layouts:
| Dataset | Panorama Count | Layout Type | Usage |
|---|---|---|---|
| PanoContext (PC) | 500 | Cuboid | Train/Test |
| Stanford-2D3D (S2D3D) | 571 | Cuboid | Train/Test |
| Matterport3DLayout (MP3D) | 2295 | Mixed | Train/Val/Test |
| Zillow Indoor (ZInD) | ~31,000 | Non-cuboid prevalent | Test |
Panoramas are resized to 1024×512 and normalized with ImageNet statistics. No geometric augmentation is performed aside from random horizontal flips. In non-cuboid settings, a corner map post-processing step splits merged corner blobs using a 75 px width threshold for accurate localization.
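One way such a blob-splitting step might look is sketched below. Only the 75 px width threshold comes from the text; collapsing the heatmap to a column profile, the activation threshold `score_thresh`, and splitting an over-wide blob into two halves are assumptions made for illustration, and the sketch ignores blobs that wrap around the panorama seam.

```python
import numpy as np

def corner_columns(heatmap, score_thresh=0.5, max_width=75):
    """Return x-coordinates of corner peaks from a (H, W) corner heatmap."""
    profile = heatmap.max(axis=0)              # strongest response per column
    active = profile > score_thresh
    peaks, x, width = [], 0, active.size
    while x < width:
        if not active[x]:
            x += 1
            continue
        start = x
        while x < width and active[x]:
            x += 1
        end = x                                # blob spans columns [start, end)
        if end - start > max_width:            # too wide: assume two merged corners
            mid = (start + end) // 2
            segments = [(start, mid), (mid, end)]
        else:
            segments = [(start, end)]
        for a, b in segments:
            peaks.append(a + int(np.argmax(profile[a:b])))
    return peaks

# Toy heatmap: one narrow blob and one 100 px wide blob with two peaks.
heat = np.zeros((512, 1024))
heat[256, 100:120] = 0.6
heat[256, 110] = 1.0
heat[256, 300:400] = 0.6
heat[256, 320] = 1.0
heat[256, 380] = 0.9
columns = corner_columns(heat)
```

In this toy input the narrow blob yields one corner and the wide blob is split into two, so three corner columns are returned.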
Optimization is performed in TensorFlow/Keras with Adam (learning rate 1e-4, weight decay 1e-6, batch size 8), using up to 500 epochs with early stopping. Pretrained Xception weights initialize the MXception backbone.
5. Evaluation Metrics and Comparative Performance
Performance is assessed via multiple criteria:
- 3D Intersection-over-Union (3DIoU): Measures volumetric overlap of predicted versus ground-truth cuboid layouts.
- 2D IoU: Evaluates planar overlap for more general, non-cuboid footprints.
- Corner Error (CE): Pixelwise distance between predicted and true corner positions.
- Pixel Error (PE): Per-pixel edge map difference.
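The overlap and corner metrics listed above can be approximated on rasterized floor plans, as in the following sketch. The mask-based formulation, the assumption that both layouts are extruded from a shared floor plane, and the assumption that predicted and ground-truth corners are already matched in order are simplifications made here; the benchmark evaluation code may differ, for example by normalizing the corner error.

```python
import numpy as np

def iou_2d(pred_mask, gt_mask):
    """Planar IoU of two boolean floor-plan masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter / union) if union else 0.0

def iou_3d(pred_mask, gt_mask, pred_height, gt_height):
    """Volumetric IoU for floor-plan masks extruded from a shared floor plane."""
    inter_area = np.logical_and(pred_mask, gt_mask).sum()
    inter = inter_area * min(pred_height, gt_height)
    union = pred_mask.sum() * pred_height + gt_mask.sum() * gt_height - inter
    return float(inter / union) if union else 0.0

def corner_error(pred, gt):
    """Mean Euclidean pixel distance between matched corners, shape (K, 2)."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

pred = np.zeros((10, 10), dtype=bool)
gt = np.zeros((10, 10), dtype=bool)
pred[0:4, 0:4] = True                          # 16 cells
gt[2:6, 0:4] = True                            # 16 cells, 8 shared with pred
planar = iou_2d(pred, gt)                      # 8 / 24
volumetric = iou_3d(pred, gt, 2.0, 3.0)        # 16 / 64
```

Because the 3D measure also penalizes a mismatch in room height, it can be lower than the planar IoU for the same footprint, as in this toy case.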
PanoTPS-Net achieves competitive or superior results compared to previous approaches:
| Dataset | 3DIoU (%) | 2DIoU (%) | Prior best (3DIoU/2DIoU) |
|---|---|---|---|
| PanoContext | 85.49 | – | ∼85.02 |
| Stanford-2D3D | 86.18 | – | ∼86.60 |
| Matterport3DLayout | 81.76 | 84.15 | ∼81.70 / 84.11 |
| Zillow Indoor (ZInD) | 91.98 | 90.05 | ∼91.94 / 90.13 |
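The margins over prior work are small (under 0.5 percentage points in every case), and the method trails the prior best slightly on Stanford-2D3D 3DIoU and ZInD 2DIoU. Overall, the results indicate that TPS warping is compatible with panoramic input and can handle complex indoor geometry.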
6. Qualitative Analysis
Visualization of TPS control-point deformation (cf. Figure 1 in the primary source) reveals that source grid points (yellow dots) are smoothly mapped to targets (orange), permitting substantial but controlled shape adaptation. Sample outputs (Figures 3 and 3-1) demonstrate that reference cuboid maps can be deformed to T-shaped, L-shaped, and other multi-corner configurations.
Bird's-eye and 3D reconstructions (Figures 5 and 6) indicate superior geometric fidelity for non-Manhattan rooms compared to methods such as LED2Net, LGT-Net, and DOPNet, which often impose strong rectilinear biases or miss irregular structures.
7. Ablation Studies and Analysis of TPS Role
A sequence of controlled experiments isolates key architectural and design choices:
- Backbone Selection: Off-the-shelf networks (ResNet50, InceptionV3, EfficientNet, ConvNeXt) either failed to converge or underperformed (3DIoU 30–75%). The MXception backbone reached 85.49% 3DIoU on PanoContext.
- Warping Outputs: Warping only edge or corner maps led to reduced accuracy or convergence failure. Warping both yielded best results (3DIoU 85.5% on PC, 81.8% on MP3D).
- Loss Weights: Emphasizing the edge loss was essential; corner-only supervision was too sparse.
- TPS Control Points: Optimal flexibility was achieved with a moderate control-point count (16 for simple layouts, 64 for complex). Too few points led to poor fit; too many induced over-flexibility and artifacts in cuboid rooms.
- Corner Post-processing: A threshold of 75 px for splitting merged corner blobs matched ground-truth counts most effectively.
These findings corroborate the importance of TPS-based spatial transformation and joint edge/corner warping for stable and generalizable layout estimation across diverse room geometries (Ibrahem et al., 13 Oct 2025).