
PanoTPS-Net: 3D Indoor Layout Estimation

Updated 15 June 2026
  • The paper introduces a CNN-based approach that formulates indoor room layout prediction as a learnable image-warping problem using a differentiable Thin Plate Spline transformation.
  • It employs a modified Xception backbone with a TPS spatial transformation layer to capture complex, non-cuboid geometries through smooth deformation of a reference layout.
  • The method achieves competitive or state-of-the-art 3DIoU and 2DIoU scores on four benchmarks, handling both cuboid and irregular room structures.

PanoTPS-Net is a convolutional neural network (CNN) architecture for estimating the 3D layout of indoor rooms from a single 360° equirectangular panorama image via a differentiable Thin Plate Spline (TPS) transformation. The model formulates room layout prediction as a learnable image-warping problem, enabling robust generalization to both cuboid and non-cuboid room types. By leveraging the smoothness and flexibility of TPS, PanoTPS-Net bridges the gap between simple handcrafted reference layouts and the diverse structural complexity found in real-world environments (Ibrahem et al., 13 Oct 2025).

1. Problem Formulation and Motivation

The fundamental challenge addressed by PanoTPS-Net is the estimation of complete 3D room structure—including walls, floor, ceiling boundaries, and corner positions—from a single panoramic RGB image. Traditional approaches generally fall into two categories: edge (boundary) map prediction or direct regression of corner coordinates. These methods either impose restrictive Manhattan-world (rectilinear, cuboid) assumptions or struggle to generalize beyond basic geometry, leading to reduced accuracy in non-cuboid or irregular rooms.

The principal innovation of PanoTPS-Net is the framing of layout estimation as a spatial warping task. Starting from a simple reference layout (such as a canonical cuboid edge/corner map), the model predicts a TPS transformation that smoothly deforms the reference to match the target room shape in the panorama. TPS is selected for its capacity to satisfy precise control-point alignment while maintaining global smoothness by minimizing bending energy:

$$E[U] = \sum_i \|U(x_i, y_i) - (x'_i, y'_i)\|^2 + \lambda \iint \left[ (U_{xx})^2 + 2(U_{xy})^2 + (U_{yy})^2 \right] dx \, dy$$

This property enables the network to capture complex structural variations without the over-flexibility or instability of unconstrained transformation schemes.
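
For intuition, the classical (non-learned) TPS fit that minimizes this energy for known control-point correspondences reduces to one linear solve. The NumPy sketch below illustrates that textbook construction only; it is not the PanoTPS-Net procedure, where the network regresses the parameters directly. The names `tps_kernel` and `fit_tps` are hypothetical, and `lam` stands in for the smoothing weight up to a kernel-dependent constant factor.

```python
import numpy as np

def tps_kernel(r):
    """Radial kernel K(r) = r^2 log r^2, with K(0) = 0 by continuity."""
    r2 = np.square(r)
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = r2[mask] * np.log(r2[mask])
    return out

def fit_tps(src, dst, lam=0.0):
    """Fit a TPS that moves src control points toward dst.

    src, dst: (N, 2) arrays of corresponding points.
    lam: smoothing weight; 0 gives exact interpolation.
    Returns (w, a): (N, 2) nonlinear weights and (3, 2) affine coefficients.
    """
    n = src.shape[0]
    dists = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=-1)
    K = tps_kernel(dists) + lam * np.eye(n)
    P = np.hstack([np.ones((n, 1)), src])            # rows are (1, x, y)
    system = np.vstack([
        np.hstack([K, P]),
        np.hstack([P.T, np.zeros((3, 3))]),          # side conditions P^T w = 0
    ])
    rhs = np.vstack([dst, np.zeros((3, 2))])
    sol = np.linalg.solve(system, rhs)
    return sol[:n], sol[n:]

# A 3x3 grid whose centre point is nudged; every other point stays put.
grid = np.linspace(0.0, 1.0, 3)
src = np.array([(x, y) for y in grid for x in grid])
dst = src.copy()
dst[4] += [0.1, -0.05]
w, a = fit_tps(src, dst)
```

With `lam=0` the fitted spline passes through every target exactly; raising `lam` trades control-point accuracy for a smoother deformation, which is the balance the energy above expresses.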

2. Network Architecture

PanoTPS-Net employs a two-stage process comprising:

  1. CNN Feature Extractor: The model ingests a resized (1024×512) RGB panorama and computes latent feature embeddings using a modified Xception backbone ("MXception"), characterized by depth-wise separable convolutions for computational efficiency. Post-convolution, global average pooling reduces the spatial output to a feature vector, with the final fully connected layer regressing the parameters of the TPS transformation:
    • Nonlinear control-point offsets $B \in \mathbb{R}^{N \times 2}$, for $N$ control points.
    • The linear affine part $A \in \mathbb{R}^{2 \times 3}$, typically initialized as the identity.
  2. TPS Spatial Transformation Layer: The predicted TPS parameters are applied to a regular source grid of control points $\{(x_i, y_i)\}_{i=1}^N$ over the reference map. The TPS deformation for a query coordinate $(x, y)$ is given by:

$$T(x, y) = A \begin{pmatrix} 1 \\ x \\ y \end{pmatrix} + \sum_{i=1}^N b_i K(\| (x, y)-(x_i, y_i) \|)$$

with the standard TPS kernel $K(r) = r^2 \log r^2$ and $b_i$ (the $i$-th row of $B$) learned per control point. This layered design enables end-to-end differentiable image warping, which is what allows the model to be trained by backpropagation.
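
The deformation formula can be evaluated directly for a batch of query coordinates. The following NumPy sketch assumes coordinates normalized to the unit square and a 4×4 control grid; the function names and the `A_IDENTITY` constant are illustrative, not taken from the paper's code. A differentiable layer would express the same arithmetic with framework tensors.

```python
import numpy as np

def tps_kernel(r):
    """K(r) = r^2 log r^2, defined as 0 at r = 0."""
    r2 = np.square(r)
    safe = np.where(r2 > 0, r2, 1.0)                 # avoid log(0)
    return np.where(r2 > 0, r2 * np.log(safe), 0.0)

def tps_transform(query, ctrl, A, B):
    """Apply T(x, y) = A (1, x, y)^T + sum_i b_i K(|(x, y) - (x_i, y_i)|).

    query: (M, 2) points to warp; ctrl: (N, 2) control points;
    A: (2, 3) affine part; B: (N, 2) nonlinear coefficients.
    Returns the (M, 2) warped coordinates.
    """
    ones = np.ones((query.shape[0], 1))
    affine = np.hstack([ones, query]) @ A.T                              # (M, 2)
    r = np.linalg.norm(query[:, None, :] - ctrl[None, :, :], axis=-1)    # (M, N)
    return affine + tps_kernel(r) @ B

# Identity affine part: x' = x and y' = y, with no translation.
A_IDENTITY = np.array([[0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])

grid = np.linspace(0.0, 1.0, 4)
ctrl = np.array([(x, y) for y in grid for x in grid])   # 16 control points
query = np.array([[0.25, 0.5], [0.9, 0.1]])
unchanged = tps_transform(query, ctrl, A_IDENTITY, np.zeros((16, 2)))
```

With an identity `A` and all-zero `B` the warp leaves every point where it was, which is why the identity is a natural starting point for the regressed parameters.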

3. Learning Objective and Loss Functions

The model outputs two warped predictions:

  • A reference edge map $\hat{E}$ (with RGB channels encoding semantic boundaries: wall-wall, wall-ceiling, wall-floor),
  • A one-channel reference corner map (a corner-location heatmap).

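A pixelwise Huber loss penalizes deviations between predictions and ground truth. In its standard form, for a per-pixel residual $a$ (prediction minus ground truth) and threshold $\delta$:

$$L_\delta(a) = \begin{cases} \tfrac{1}{2} a^2 & \text{if } |a| \le \delta, \\ \delta \left( |a| - \tfrac{1}{2} \delta \right) & \text{otherwise.} \end{cases}$$

The aggregate loss is a weighted sum of the edge-map and corner-map terms:

$$L = w_{\text{edge}} L_{\text{edge}} + w_{\text{corner}} L_{\text{corner}}$$

with the weights $w_{\text{edge}}$ and $w_{\text{corner}}$ selected empirically for best performance. This dual-output formulation enforces both fine-grained boundary alignment and precise corner localization.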
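
As a concrete illustration of the objective, the sketch below computes a mean pixelwise Huber loss for each output map and combines the two with scalar weights. The mean reduction, the threshold `delta=1.0`, and the default weights `w_edge` and `w_corner` are placeholder assumptions, not the values used by the authors.

```python
import numpy as np

def huber(pred, target, delta=1.0):
    """Mean pixelwise Huber loss (quadratic near zero, linear in the tails)."""
    a = np.abs(pred - target)
    quadratic = 0.5 * a ** 2
    linear = delta * (a - 0.5 * delta)
    return float(np.mean(np.where(a <= delta, quadratic, linear)))

def total_loss(edge_pred, edge_gt, corner_pred, corner_gt,
               w_edge=1.0, w_corner=1.0, delta=1.0):
    """Weighted sum of the edge-map and corner-map Huber losses.

    w_edge and w_corner are placeholders for the paper's loss weights.
    """
    return (w_edge * huber(edge_pred, edge_gt, delta)
            + w_corner * huber(corner_pred, corner_gt, delta))

# Toy example: a 3-channel edge map and a 1-channel corner map.
rng = np.random.default_rng(0)
edge_gt = rng.random((8, 16, 3))
corner_gt = rng.random((8, 16, 1))
loss = total_loss(edge_gt + 0.1, edge_gt, corner_gt, corner_gt)
```

Because the corner term is zero in the toy example, `loss` equals the edge term alone, i.e. `0.5 * 0.1**2` under the default weights.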

4. Training Procedure and Datasets

PanoTPS-Net is trained on a variety of public datasets encompassing both cuboid and non-cuboid layouts:

| Dataset | Panorama Count | Layout Type | Usage |
| --- | --- | --- | --- |
| PanoContext (PC) | 500 | Cuboid | Train/Test |
| Stanford-2D3D (S2D3D) | 571 | Cuboid | Train/Test |
| Matterport3DLayout (MP3D) | 2295 | Mixed | Train/Val/Test |
| Zillow Indoor (ZInD) | ~31,000 | Non-cuboid prevalent | Test |

Panoramas are resized to 1024×512 and normalized with ImageNet statistics. No geometric augmentation is performed aside from random horizontal flips. In non-cuboid settings, a corner map post-processing step splits merged corner blobs using a 75px width threshold for accurate localization.
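
The normalization and flip steps might look as follows. This is a minimal sketch that assumes the panorama has already been resized to 512×1024 (height × width), uses the commonly quoted ImageNet per-channel mean and standard deviation on a [0, 1] scale, and assumes the flip is applied jointly to the image and its target maps; the function names are hypothetical.

```python
import numpy as np

# Widely used ImageNet channel statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(image_uint8):
    """Scale an (H, W, 3) uint8 panorama to [0, 1], then standardize per channel."""
    x = image_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def random_hflip(image, edge_map, corner_map, rng, p=0.5):
    """Mirror the panorama and its target maps together with probability p."""
    if rng.random() < p:
        return image[:, ::-1], edge_map[:, ::-1], corner_map[:, ::-1]
    return image, edge_map, corner_map

rng = np.random.default_rng(0)
pano = rng.integers(0, 256, size=(512, 1024, 3), dtype=np.uint8)
edges = np.zeros((512, 1024, 3), dtype=np.float32)
corners = np.zeros((512, 1024, 1), dtype=np.float32)
x, e, c = random_hflip(normalize(pano), edges, corners, rng)
```

Flipping the targets with the image keeps the supervision aligned; flipping only the input would silently mislabel every mirrored sample.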

Optimization is performed in TensorFlow/Keras with Adam (learning rate 1e-4, weight decay 1e-6, batch size 8), using up to 500 epochs with early stopping. Pretrained Xception weights initialize the MXception backbone.

5. Evaluation Metrics and Comparative Performance

Performance is assessed via multiple criteria:

  • 3D Intersection-over-Union (3DIoU): Measures volumetric overlap of predicted versus ground-truth cuboid layouts.
  • 2D IoU: Evaluates planar overlap for more general, non-cuboid footprints.
  • Corner Error (CE): Pixelwise distance between predicted and true corner positions.
  • Pixel Error (PE): Per-pixel edge map difference.
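
These metrics can be approximated on rasterized layouts. The sketch below assumes each layout is available as a boolean floor-plan mask plus a scalar room height, that both rooms share a floor plane, and that predicted and ground-truth corners are already matched in order; benchmark implementations typically work on polygons and may normalize the corner error, so treat this as an approximation rather than the evaluation code behind the reported numbers.

```python
import numpy as np

def iou_2d(mask_a, mask_b):
    """IoU of two boolean floor-plan masks of equal shape."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter / union) if union > 0 else 0.0

def iou_3d(mask_a, height_a, mask_b, height_b):
    """IoU of two rooms modelled as footprints extruded from a shared floor."""
    inter_area = np.logical_and(mask_a, mask_b).sum()
    inter_vol = inter_area * min(height_a, height_b)
    union_vol = mask_a.sum() * height_a + mask_b.sum() * height_b - inter_vol
    return float(inter_vol / union_vol) if union_vol > 0 else 0.0

def corner_error(pred, gt):
    """Mean Euclidean pixel distance between matched (K, 2) corner arrays."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

# Two 4x4 footprints offset by two rows.
a = np.zeros((10, 10), dtype=bool)
b = np.zeros((10, 10), dtype=bool)
a[0:4, 0:4] = True
b[2:6, 0:4] = True
footprint_iou = iou_2d(a, b)
volume_iou = iou_3d(a, 2.0, b, 3.0)
```

The 3D score is lower than the 2D one in this example because the height mismatch adds volume to the union without adding any to the intersection.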

PanoTPS-Net achieves competitive or superior results compared to previous approaches:

| Dataset | 3DIoU (%) | 2DIoU (%) | Prior best (3DIoU / 2DIoU) |
| --- | --- | --- | --- |
| PanoContext | 85.49 | - | ∼85.02 |
| Stanford-2D3D | 86.18 | - | ∼86.60 |
| Matterport3DLayout | 81.76 | 84.15 | ∼81.70 / 84.11 |
| Zillow Indoor (ZInD) | 91.98 | 90.05 | ∼91.94 / 90.13 |

These outcomes underscore the compatibility of TPS with panoramic input and its ability to handle complex indoor geometry.

6. Qualitative Analysis

Visualization of TPS control-point deformation (cf. Figure 1 in the primary source) reveals that source grid points (yellow dots) are smoothly mapped to targets (orange), permitting substantial but controlled shape adaptation. Sample outputs (Figures 3 and 3-1) demonstrate that reference cuboid maps can be deformed to T-shaped, L-shaped, and other multi-corner configurations.

Bird's-eye and 3D reconstructions (Figures 5 and 6) indicate superior geometric fidelity for non-Manhattan rooms compared to methods such as LED2Net, LGT-Net, and DOPNet, which often impose strong rectilinear biases or miss irregular structures.

7. Ablation Studies and Analysis of TPS Role

A sequence of controlled experiments isolates key architectural and design choices:

  • Backbone Selection: Off-the-shelf networks (ResNet50, InceptionV3, EfficientNet, ConvNeXt) either failed to converge or underperformed (3DIoU 30–75%). The MXception backbone reached 85.49%.
  • Warping Outputs: Warping only edge or corner maps led to reduced accuracy or convergence failure. Warping both yielded best results (3DIoU 85.5% on PC, 81.8% on MP3D).
  • Loss Weights: Emphasizing the edge loss over the corner loss was essential; corner-only supervision was too sparse.
  • TPS Control Points: Optimal flexibility was achieved with a moderate control-point count (16 for simple layouts, 64 for complex). Too few points led to poor fit; too many induced over-flexibility and artifacts in cuboid rooms.
  • Corner Post-processing: A threshold of 75 px for splitting merged corner blobs matched ground-truth counts most effectively.
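
The source does not spell out how merged corner blobs are split, so the following is only one plausible reading of the 75 px rule, labeled here as an assumption: collapse the heatmap to a per-column profile, find runs of columns above a confidence threshold, and split any run wider than 75 px at its midpoint before taking the peak of each piece. The function name, the `thresh` value, and the midpoint split are all hypothetical, and panorama wrap-around is ignored.

```python
import numpy as np

def corner_columns(heatmap, thresh=0.5, max_width=75):
    """Return x-coordinates of corner peaks in an (H, W) heatmap.

    Runs of active columns wider than max_width are treated as two merged
    corners and split at the midpoint (a hypothetical rule, see lead-in).
    """
    profile = heatmap.max(axis=0)          # strongest response per column
    active = profile > thresh
    width = active.shape[0]
    corners = []
    x = 0
    while x < width:
        if not active[x]:
            x += 1
            continue
        start = x
        while x < width and active[x]:
            x += 1
        end = x                            # active run covers [start, end)
        if end - start > max_width:
            mid = (start + end) // 2
            segments = [(start, mid), (mid, end)]
        else:
            segments = [(start, end)]
        for s, e in segments:
            corners.append(s + int(np.argmax(profile[s:e])))
    return corners

# One narrow blob and one 100 px wide blob that hides two corners.
heat = np.zeros((64, 400))
heat[:, 50:61] = 0.6
heat[10, 55] = 1.0
heat[:, 200:300] = 0.6
heat[10, 220] = 1.0
heat[10, 280] = 0.9
peaks = corner_columns(heat)
```

In this toy heatmap the narrow blob yields a single corner while the wide blob is split into two, which is the behavior the threshold is meant to produce.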

These findings corroborate the importance of TPS-based spatial transformation and joint edge/corner warping for stable and generalizable layout estimation across diverse room geometries (Ibrahem et al., 13 Oct 2025).
