TRT-NLOS: Transformer-Based NLOS Imaging
- TRT-NLOS is a transformer-based framework designed to reconstruct 3D hidden scenes from highly noisy, indirect photon measurements in non-line-of-sight imaging.
- It integrates a denoising head, shallow physics-based feature extraction, and both local and global spatio-temporal transformer blocks to resolve complex light transport challenges.
- Experiments on synthetic and real datasets show the highest PSNR and SSIM and the lowest RMSE and MAD among compared methods, highlighting its robustness and potential for practical NLOS applications.
Time-Resolved Transformer for Non-Line-of-Sight (TRT-NLOS) imaging refers to a transformer-based deep neural network architecture specifically designed for the 3D reconstruction of hidden scenes from highly noisy, indirect transient measurements, as encountered in non-line-of-sight (NLOS) imaging. The approach is characterized by specialized spatio-temporal attention mechanisms and physics-driven feature extraction that collectively address the extreme photon noise, complex indirect light transport, and ambiguity inherent in transient NLOS acquisition.
1. Architecture and Core Components
TRT-NLOS is an instantiation of the general Time-Resolved Transformer (TRT) backbone (Li et al., 10 Oct 2025), tailored for NLOS imaging scenarios. Its architectural pipeline comprises the following stages:
- Denoising head: Efficiently removes severe measurement noise from input transient data, which is essential for NLOS tasks due to the extremely weak and background-dominated multi-bounce signals.
- Shallow feature extraction: Combines 3D convolutions with a physics-informed transformation, such as an FK transform (or similar spectral domain mapping), to project the denoised transient measurements into a 3D space more amenable to spatial correlation analysis.
- Spatio-temporal transformer blocks:
  - Self-attention encoders, organized into two branches:
    - Local branch: Divides features into spatio-temporal windows and applies window-based self-attention, capturing fine-grained, short-range dependencies crucial for preserving local continuity and resolving small-scale geometric structure.
    - Global branch: Applies full self-attention after spatial and/or temporal downsampling, modeling long-range relationships and enforcing global consistency.
  - Cross-attention decoders: Integrate upsampled global and local features in the token space via sequential spatial and temporal cross-attention. These operations are implemented via matrix multiplications on appropriately reshaped token matrices (e.g., HW×D×C, D×HW×C).
- Feature fusion: Concatenates and fuses the deep features (local/global) with the shallow features using a series of 3D convolutions.
- Volume prediction: Outputs a 3D volumetric reconstruction V; the intensity image Ĩ = max_z(V) and depth map D̂ = argmax_z(V) are computed by maximum and arg-maximum projections.
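The final volume-to-image projections can be sketched in a few lines. This is a minimal NumPy illustration; the (D, H, W) axis ordering and the function name are assumptions for illustration, not the authors' code:

```python
import numpy as np

def project_volume(V):
    """Project a reconstructed volume V of shape (D, H, W) along the depth
    axis: maximum projection for intensity, arg-maximum for a depth map
    (in depth-bin units)."""
    intensity = V.max(axis=0)    # corresponds to I~ = max_z(V), shape (H, W)
    depth = V.argmax(axis=0)     # corresponds to D^ = argmax_z(V), shape (H, W)
    return intensity, depth

# Toy example: a single bright voxel at depth bin 3, pixel (1, 2).
V = np.zeros((8, 4, 4))
V[3, 1, 2] = 1.0
I, D = project_volume(V)
```

In practice the depth map would still be scaled by the bin-to-metric conversion (bin width times c/2 for round-trip timing), which is omitted here.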
Comparison to TRT-LOS: In TRT-LOS (for line-of-sight imaging), the denoising head is omitted because inputs have a higher SNR, and the shallow feature extractor skips the explicit physics-based transformation due to the direct, less ambiguous light transport. TRT-NLOS, by contrast, includes both to accommodate more severe NLOS challenges.
2. Spatio-Temporal Attention Mechanisms
The quantitative and qualitative success of TRT-NLOS arises from its unique attention mechanisms:
- Local self-attention: Enforces local smoothness and recovers high-frequency spatial and temporal structures. Partitioning the feature map into windows before sequentially applying spatial then temporal attention allows the network to disambiguate temporally overlapped returns (critical for resolving different NLOS path contributions).
- Global self-attention: After downsampling, this branch aggregates contextual information across broad regions—necessary for understanding the global scene layout, correcting large-scale ambiguities, and handling sparsity in photon-efficient measurements.
- Cross-attention: Enables the system to reconcile detailed local reconstructions with broader spatial context, thereby enhancing overall robustness to instrument noise, sparsity, and indirect path mixing.
In implementation, 1×1×1 convolutions map features to queries, keys, and values (Q/K/V); the features are then reshaped so that spatial and temporal attention can each be computed as matrix multiplications over the corresponding token layout.
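The sequential spatial/temporal attention over reshaped token matrices (HW×D×C and D×HW×C) can be illustrated with a simplified, self-contained sketch. This is not the authors' implementation: Q, K, and V are taken equal to the input rather than produced by learned 1×1×1 convolutions, and windowing, multi-head splitting, and downsampling are omitted:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over batched token matrices (N, T, C)."""
    C = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(C)   # (N, T, T)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                     # (N, T, C)

def spatio_temporal_attention(F):
    """F: features of shape (H, W, D, C). Attend first along the temporal
    axis (HW x D x C tokens: one sequence of D time bins per pixel), then
    along the spatial axis (D x HW x C tokens: one sequence of HW pixels
    per time bin)."""
    H, W, D, C = F.shape
    t = F.reshape(H * W, D, C)                       # HW x D x C
    t = attention(t, t, t)
    s = t.reshape(H, W, D, C).transpose(2, 0, 1, 3).reshape(D, H * W, C)
    s = attention(s, s, s)                           # D x HW x C
    return s.reshape(D, H, W, C).transpose(1, 2, 0, 3)

F = np.random.default_rng(0).normal(size=(4, 4, 8, 16))
out = spatio_temporal_attention(F)                   # same shape as F
```

The two reshapes make the same attention primitive act on temporal and spatial token sequences in turn, which is the mechanism the cross-attention decoders also rely on.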
3. Reconstruction Performance and Benchmarks
Experiments on both synthetic and real NLOS datasets validate the superiority of TRT-NLOS:
- Synthetic evaluation:
- Test sets: "Seen" (ShapeNet objects used in training), "Unseen" (ShapeNet novel categories; e.g., benches, sofas, cabinets)
- Metrics: Peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), root mean square error (RMSE), and mean absolute distance (MAD).
- Results: TRT-NLOS delivers the highest PSNR (≈24.15 dB) and SSIM (≈0.8610) on the Seen set, with substantial improvements in RMSE and MAD over the FBP, LCT, FK, RSD, LFE, I-K, and NLOST baselines. On Unseen data, TRT-NLOS continues to outperform alternative deep models by more than 1 dB PSNR while achieving lower RMSE/MAD.
- Real-world data evaluation:
- Tested on: Datasets from Lindell et al. and new data from a confocal NLOS imaging system (532 nm pulsed laser, SPAD detector, raster scanning).
- Findings: The model reconstructs both coarse shape and fine geometric details. The fusion of deep transformer features and shallow physics-based priors allows the network to generalize, preserving boundary sharpness and suppressing noise more effectively than other approaches.
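The evaluation metrics above follow their standard definitions; the paper's exact normalization choices (data range, foreground masking) are assumptions in this sketch:

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, data_range]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def rmse(pred, target):
    """Root mean square error."""
    return np.sqrt(np.mean((pred - target) ** 2))

def mad(pred_depth, target_depth):
    """Mean absolute distance between predicted and reference depth maps."""
    return np.mean(np.abs(pred_depth - target_depth))

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)
# Uniform 0.1 error on a [0, 1] image: MSE = 0.01, PSNR = 10*log10(1/0.01) = 20 dB
```

PSNR and SSIM are computed on the intensity projections, while RMSE and MAD are typically reported on intensity and depth respectively.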
4. Training Data and Generalization
- Synthetic Datasets: The authors constructed a large-scale, high-resolution dataset with diverse lighting and geometry at 256×256 spatial resolution with 512 temporal bins, covering a range of noise levels (including signal-to-background ratios, SBR, as low as 1:100), ensuring that TRT-NLOS learns robust features and handles challenging input statistics.
- Real-World Measurements: Demonstrate the network's ability to generalize across instrument defects and environmental variability, a critical benchmark for translational deployment.
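One common way to realize a target signal-to-background ratio such as 1:100 when synthesizing training histograms is to scale a uniform background so that total background counts relate to total signal counts by that ratio, then apply Poisson sampling. The following is a sketch under that assumed convention, not the authors' exact data pipeline:

```python
import numpy as np

def add_background(signal, sbr, rng):
    """Inject a uniform background level chosen so that
    total_background = total_signal / sbr, then draw Poisson counts."""
    total_signal = signal.sum()
    b = total_signal / (sbr * signal.size)   # per-bin background rate
    return rng.poisson(signal + b)

rng = np.random.default_rng(0)
clean = np.zeros(512)
clean[200] = 1000.0                          # one sharp transient return
noisy = add_background(clean, sbr=0.01, rng=rng)   # SBR = 1:100
```

At SBR 1:100 the background photons outnumber signal photons by two orders of magnitude, which is why the dedicated denoising head is placed at the front of the pipeline.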
5. Mathematical Formulations and Processing Framework
The forward NLOS imaging model used for both simulation and network supervision is:
$$\tau(x_l, x_s, t) = \int_{\Omega} \frac{\varphi\,\rho(x)\, f(x, x_l, x_s)}{\lVert x_l - x\rVert^{2}\,\lVert x - x_s\rVert^{2}}\; \delta\big(\lVert x_l - x\rVert + \lVert x - x_s\rVert - tc\big)\, dx,$$

where τ(x_l, x_s, t) is the time-resolved measurement for laser position x_l, wall point x_s, and time t; ρ is the albedo of hidden-surface point x ∈ Ω, f is the BRDF, φ is the laser power, δ is the Dirac delta, and the squared distances scale the light transport (radiometric falloff), with speed of light c.
The acquired histogram is modeled as

$$H(x_l, x_s, t) \sim \mathrm{Poisson}\big(\eta\,\tau(x_l, x_s, t) + B\big),$$

with quantum efficiency η and background count B; this model governs post-processing and noise simulation.
The volumetric reconstruction aggregates features as

$$V = \mathrm{FUS}\big(F_S,\; F_L^{\uparrow},\; F_G^{\uparrow}\big),$$

where F_S are the shallow features, F_L↑ and F_G↑ are the upsampled local and global deep features, FUS is implemented by a series of 3D convolutions, and the final outputs are extracted by Ĩ = max_z(V) and D̂ = argmax_z(V) for intensity and depth, respectively.
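A discretized version of the forward NLOS model can be sketched as follows: each hidden voxel contributes a delta at its round-trip time of flight, attenuated by albedo and squared-distance falloff. This sketch assumes the confocal simplification (x_l = x_s) and folds the BRDF f and laser power φ into the albedo; it illustrates the model rather than reproducing the authors' simulator:

```python
import numpy as np

def render_transient(points, albedo, wall_point, n_bins, bin_width, c=3e8):
    """Render one confocal transient tau(x_s, t) for a single wall point.
    points: (N, 3) hidden scene points; albedo: (N,) reflectances;
    bin_width: temporal bin width in seconds."""
    tau = np.zeros(n_bins)
    for p, rho in zip(points, albedo):
        d = np.linalg.norm(p - wall_point)            # one-way distance
        t_bin = int(round(2 * d / (c * bin_width)))   # round-trip time of flight
        if 0 <= t_bin < n_bins:
            tau[t_bin] += rho / d**4                  # r^2 falloff out and back
    return tau

wall = np.array([0.0, 0.0, 0.0])
pts = np.array([[0.0, 0.0, 1.0]])                     # one point 1 m from the wall
tau = render_transient(pts, np.array([1.0]), wall, n_bins=512, bin_width=16e-12)
```

Applying the histogram model (scaling by η, adding background B, and Poisson sampling) to such rendered transients yields the kind of noisy supervision data described in Section 4.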
6. Impact and Directions for Future Research
The architecture's strengths—integrating transformer-based attention with physics-prior feature mapping and robust denoising—directly address the extreme measurement conditions in practical NLOS imaging. Open directions include:
- Architectural optimization: Reducing computation and memory footprint for real-time deployment, such as on embedded edge devices.
- Physics-driven generalization: Extension to more complex indirect light transport regimes, e.g., subsurface scattering, or further integration of physically plausible priors within deep attention modules.
- Application domains: Transferring these methods to medical imaging, remote sensing, or other time-resolved inverse problems characterized by sparse acquisition and multi-modal photon transport.
- Dataset expansion: Further development of real-world challenging measurement datasets to drive innovation in robust, generalizable NLOS imaging models.
- Robustness and uncertainty quantification: Exploring probabilistic layers or Bayesian post-processing atop the transformer backbone to enable uncertainty-aware NLOS reconstructions.
7. Summary Table: Key TRT-NLOS Design Elements
| Module | Functionality | Motivation for NLOS |
|---|---|---|
| Denoising Head | Removes severe photon and environmental noise | NLOS data is highly noise-dominated |
| Physics-Based Feature Extraction | Applies domain transforms (e.g., FK) | Projects indirect transient data to geometric space |
| Local/Global Self-Attention | Captures fine and context-aware spatiotemporal relations | Resolves overlapping multipath, global context |
| Spatio-Temporal Cross-Attention | Integrates structures with context at multiple scales | Mitigates ambiguities from indirect propagation |
| Feature Fusion/3D Convolutions | Aggregates all levels for volume prediction | Enhances final depth/intensity estimation |
In summary, TRT-NLOS constitutes a comprehensive, transformer-based framework for photon-efficient, high-resolution NLOS 3D reconstruction, exhibiting superior robustness, accuracy, and generalization across simulation and real-world scenarios, facilitated by a blend of deep attention mechanisms and physics-informed modular design (Li et al., 10 Oct 2025).