
3D Reconstruction from Transient Measurements with Time-Resolved Transformer (2510.09205v1)

Published 10 Oct 2025 in cs.CV and eess.IV

Abstract: Transient measurements, captured by time-resolved systems, are widely employed in photon-efficient reconstruction tasks, including line-of-sight (LOS) and non-line-of-sight (NLOS) imaging. However, challenges persist in their 3D reconstruction due to the low quantum efficiency of sensors and the high noise levels, particularly for long-range or complex scenes. To boost the 3D reconstruction performance in photon-efficient imaging, we propose a generic Time-Resolved Transformer (TRT) architecture. Different from existing transformers designed for high-dimensional data, TRT has two elaborate attention designs tailored for the spatio-temporal transient measurements. Specifically, the spatio-temporal self-attention encoders explore both local and global correlations within transient data by splitting or downsampling input features into different scales. Then, the spatio-temporal cross attention decoders integrate the local and global features in the token space, resulting in deep features with high representation capabilities. Building on TRT, we develop two task-specific embodiments: TRT-LOS for LOS imaging and TRT-NLOS for NLOS imaging. Extensive experiments demonstrate that both embodiments significantly outperform existing methods on synthetic data and real-world data captured by different imaging systems. In addition, we contribute a large-scale, high-resolution synthetic LOS dataset with various noise levels and capture a set of real-world NLOS measurements using a custom-built imaging system, enhancing the data diversity in this field. Code and datasets are available at https://github.com/Depth2World/TRT.

Summary

  • The paper's key contribution is the TRT architecture, which leverages spatio-temporal self- and cross-attention for robust 3D reconstruction.
  • It implements dedicated LOS and NLOS modes with advanced feature extraction, deep fusion, and denoising for precise transient imaging.
  • Experimental results on synthetic and real-world datasets show superior RMSE and fidelity, underscoring TRT's practical impact.

3D Reconstruction from Transient Measurements with Time-Resolved Transformer

Overview

The paper presents the Time-Resolved Transformer (TRT) architecture designed to enhance 3D reconstruction performance in photon-efficient imaging tasks, specifically for Line-of-Sight (LOS) and Non-Line-of-Sight (NLOS) imaging. The architecture leverages spatio-temporal self-attention (STSA) and spatio-temporal cross-attention (STCA) mechanisms to exploit local and global correlations inherent in transient data. The TRT architecture has been validated using both synthetic datasets and real-world data from custom-built imaging systems, demonstrating superior performance over existing methods.

Time-Resolved Transformer Architecture

The TRT employs two main attention mechanisms: STSA and STCA, which are designed to extract local and global features from transient data. The STSA encoder splits the input into patches for local feature extraction and downsamples for global feature extraction. These are processed along spatial and temporal dimensions to capture relevant correlations.

  • Spatio-Temporal Self-Attention (STSA): Extracts local correlations through a window-based approach and global correlations through full attention mechanisms.
  • Spatio-Temporal Cross-Attention (STCA): Integrates local and global features, enhancing feature representation before reconstruction.

    Figure 1: An overview of the proposed Time-Resolved Transformer architecture, illustrating the attention mechanisms.
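The local/global split behind STSA can be illustrated with a minimal NumPy sketch (not the paper's implementation): the local branch partitions the spatial plane into non-overlapping windows and attends within each window, while the global branch attends over a spatially downsampled token sequence. All shapes, window sizes, and function names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over token sequences.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def local_global_tokens(x, window=4, stride=4):
    """x: transient features of shape (H, W, T, C).

    Local branch: split the H x W plane into non-overlapping windows
    and group each window's spatio-temporal entries into one token set.
    Global branch: spatially downsample, then flatten into one sequence."""
    H, W, T, C = x.shape
    local = (x.reshape(H // window, window, W // window, window, T, C)
              .transpose(0, 2, 1, 3, 4, 5)
              .reshape(-1, window * window * T, C))
    glob = x[::stride, ::stride].reshape(1, -1, C)
    return local, glob

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16, 32))
local, glob = local_global_tokens(x)
local_out = attention(local, local, local)   # windowed self-attention
glob_out = attention(glob, glob, glob)       # full attention on coarse tokens
print(local_out.shape, glob_out.shape)       # (4, 256, 32) (1, 64, 32)
```

The windowed branch keeps attention cost bounded per window, while the coarse branch gives every token a view of the whole measurement; TRT's STCA then fuses the two streams in token space.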

Line-of-Sight Imaging Implementation

TRT-LOS is tailored for LOS reconstruction tasks using photon-efficient transient measurements. The architecture includes feature extraction, TRT blocks for transformation into deep features, and a deep-feature fusion module. A key aspect involves addressing interpolation artifacts using pixel shuffle operations for upsampling.

  • Feature Extraction Module: Downsamples input and applies a combination of interlaced convolutions to capture spatial and temporal information efficiently.
  • Deep Feature Fusion: Combines local and global deep features, improving the clarity and detail of reconstructed 3D volumes.

    Figure 2: The flowchart of the TRT-LOS framework, highlighting feature extraction and fusion processes.
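The pixel shuffle operation that TRT-LOS uses in place of fixed interpolation can be sketched as the standard depth-to-space rearrangement below; this is a generic NumPy illustration, not code from the paper, and the shapes are made up.

```python
import numpy as np

def pixel_shuffle_2d(x, r):
    """Depth-to-space: (C*r*r, H, W) -> (C, H*r, W*r).

    Channels are rearranged into spatial positions, so the upsampling
    pattern is produced by the preceding (learned) convolution rather
    than by a fixed interpolation kernel, avoiding its artifacts."""
    Crr, H, W = x.shape
    C = Crr // (r * r)
    return (x.reshape(C, r, r, H, W)
             .transpose(0, 3, 1, 4, 2)
             .reshape(C, H * r, W * r))

x = np.arange(8 * 3 * 3, dtype=float).reshape(8, 3, 3)
y = pixel_shuffle_2d(x, 2)
print(y.shape)  # (2, 6, 6)
```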

Non-Line-of-Sight Imaging Implementation

TRT-NLOS extends the TRT architecture to occluded or indirectly viewed scenes. A notable addition is a transient measurement denoiser that pre-processes the data to improve reconstruction quality.

  • Transient Measurement Denoiser: Uses lightweight convolutions to enhance the input quality, crucial for the challenging NLOS scenarios.
  • Shallow-Deep Feature Fusion: Integrates features from physics-based priors and deep features, ensuring accurate depth and intensity reconstructions.

    Figure 3: Flowchart of the TRT-NLOS, emphasizing denoising and feature integration components.
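To give a feel for denoising transient histograms, the sketch below applies a fixed low-pass filter along the time axis of a transient cube. The paper's denoiser is a learned lightweight convolutional module, so treat this purely as a hand-crafted stand-in with made-up shapes and noise levels.

```python
import numpy as np

def temporal_smooth(transient, kernel=(0.25, 0.5, 0.25)):
    """Hypothetical stand-in for the learned denoiser: a fixed low-pass
    convolution along the time axis of a (H, W, T) transient cube."""
    k = np.asarray(kernel)
    pad = len(k) // 2
    padded = np.pad(transient, ((0, 0), (0, 0), (pad, pad)), mode="edge")
    out = np.zeros(transient.shape, dtype=float)
    for i, w in enumerate(k):
        out += w * padded[:, :, i:i + transient.shape[2]]
    return out

rng = np.random.default_rng(1)
clean = np.zeros((4, 4, 32))
clean[:, :, 16] = 10.0                          # ideal single return peak
noisy = clean + rng.poisson(0.5, clean.shape)   # background/dark counts
den = temporal_smooth(noisy)
print(den.shape)  # (4, 4, 32)
```

In the low-photon NLOS regime, background counts can rival the signal, which is why cleaning the histograms before feature extraction pays off.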

Experimental Evaluation and Results

Extensive experiments were performed on both synthetic and real-world datasets:

  • Synthetic Datasets: TRT handles a wide range of signal-to-background ratio (SBR) conditions and complex scenes, significantly outperforming existing frameworks in RMSE and reconstruction fidelity.
  • Real-World Data: TRT remains robust across different scenarios and imaging systems, maintaining high reconstruction precision in practical long-range setups.

    Figure 4: Reconstructed results from synthetic test sets, illustrating depth and intensity recoveries under different SBR conditions.
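The RMSE metric used in these comparisons can be computed directly from predicted and ground-truth depth maps; the snippet below is a straightforward illustration with made-up arrays.

```python
import numpy as np

def rmse(pred_depth, gt_depth):
    """Root-mean-square error between predicted and ground-truth
    depth maps, the primary quantitative metric reported."""
    return float(np.sqrt(np.mean((pred_depth - gt_depth) ** 2)))

gt = np.full((4, 4), 2.0)
pred = gt + 0.1
print(rmse(pred, gt))  # ~0.1
```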

Implications and Future Work

The TRT architecture marks a substantial advance in reconstruction capability for both LOS and NLOS imaging by exploiting complex correlations within high-dimensional transient data. Future research could extend TRT to other spatio-temporal data domains and further optimize the attention mechanisms for greater efficiency. In addition, increasing dataset diversity and broadening applications could improve the model's adaptability and its deployment in real-world environments.

