
SwinUNETR Neural Network

Updated 22 November 2025
  • The SwinUNETR neural network is a transformer-enhanced, U-shaped architecture that captures rich voxel-level context for accurate 3D path-loss prediction.
  • It processes multimodal voxel inputs—including occupancy, reflection, transmission, and distance—using ITU-R physics-based parameters to generate dense environmental attenuation maps.
  • Benchmarks on unseen apartment data (test MAE of ~4.27 dB, 217 ms inference per sample) highlight its robust generalization and its advantage over traditional, manually annotated models.

A SwinUNETR-based neural network combines the Swin Transformer backbone with a U-shaped encoder–decoder architecture to enable efficient, context-aware voxel-level prediction for volumetric data. In the context of wireless propagation modeling, such as the SenseRay-3D framework, this architecture processes rich multimodal voxelized scene representations to infer 3D environmental path-loss directly from RGB-D sensor input, bypassing explicit geometry or manual material annotation (Zheng et al., 15 Nov 2025).

1. Architectural Foundation of SwinUNETR

SwinUNETR is an adaptation of the Swin Transformer architecture for U-Net style volumetric image segmentation. It integrates:

  • Hierarchical patch-based self-attention via the Swin Transformer blocks, leveraging local and shifted windowed attention for scalable, low-complexity feature extraction across spatial hierarchies.
  • Encoder–decoder topology with skip connections analogous to U-Net, preserving fine details by combining high-resolution spatial features from encoder stages with deep, semantic-rich decoder representations.
  • Patch embeddings where the input volumetric grid is partitioned into non-overlapping windows, with cross-window information combined through shifted windowing at successive layers for increased receptive field without incurring quadratic attention scaling.

Applied in SenseRay-3D, SwinUNETR is configured to process 224 × 224 × 64 voxel grids at 0.1 m voxel edge length, accommodating furnished apartment volumes up to ~22 m × 22 m × 6 m (Zheng et al., 15 Nov 2025).
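The windowed and shifted-window attention described above can be illustrated with a minimal NumPy sketch (a toy grid and window size, not the actual SwinUNETR implementation): a 3D feature grid is partitioned into non-overlapping windows, and cyclically shifting the grid before re-partitioning lets successive layers mix information across window boundaries.

```python
import numpy as np

def window_partition(x, ws):
    """Split a (D, H, W) grid into non-overlapping (ws, ws, ws) windows."""
    D, H, W = x.shape
    assert D % ws == 0 and H % ws == 0 and W % ws == 0
    x = x.reshape(D // ws, ws, H // ws, ws, W // ws, ws)
    # -> (num_windows, ws, ws, ws)
    return x.transpose(0, 2, 4, 1, 3, 5).reshape(-1, ws, ws, ws)

# Toy 3D feature grid (real SenseRay-3D grids are 224 x 224 x 64).
grid = np.arange(8 * 8 * 8).reshape(8, 8, 8)
ws = 4

windows = window_partition(grid, ws)             # regular windows
shifted = np.roll(grid, shift=(-ws // 2,) * 3, axis=(0, 1, 2))
shifted_windows = window_partition(shifted, ws)  # shifted windows

print(windows.shape)  # (8, 4, 4, 4): (8 // 4)^3 = 8 windows
```

Attention computed inside each window costs only O(ws³) per token rather than scaling with the full grid, which is what keeps the hierarchy tractable at 224 × 224 × 64.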

2. Data Representation and Input Modalities

The SwinUNETR-based neural network in SenseRay-3D utilizes a sensing-driven voxelized input representation, where each voxel is annotated with four channels:

| Channel | Description | Source |
|---|---|---|
| Occupancy $o(v)$ | Binary: free space vs. occupied | Fused RGB-D point cloud |
| Reflection coefficient $\rho(v)$ [dB] | ITU-R P.2040-derived Fresnel reflection | Semantic material mapping |
| Transmission coefficient $\tau(v)$ [dB] | ITU-R P.2040-derived Fresnel transmission | Semantic material mapping |
| Distance to Tx $d(t,v)$ [m] | Euclidean distance from transmitter to voxel center | Transmitter placement in voxel grid |

These four channels are computed through a pipeline that includes virtual RGB-D rendering in BlenderProc, multi-view point cloud fusion, semantic segmentation, and material-centric electromagnetic property extraction (including mapping semantic classes to ITU-R-compliant dielectric and conductivity values) (Zheng et al., 15 Nov 2025).
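As a rough illustration of how the four channels stack into a network-ready tensor (toy grid size and placeholder material values, not the actual SenseRay-3D pipeline or real ITU-R P.2040 coefficients):

```python
import numpy as np

# Toy grid: real SenseRay-3D uses 224 x 224 x 64 voxels at 0.1 m resolution.
shape = (16, 16, 8)
res = 0.1  # voxel edge length in meters

occupancy = np.zeros(shape, dtype=np.float32)
occupancy[4:8, 4:8, 0:4] = 1.0  # hypothetical occupied block (e.g. furniture)

# Per-voxel reflection / transmission coefficients in dB; the -6.0 / -3.0
# values are placeholders, not actual ITU-R P.2040 numbers.
rho = np.where(occupancy > 0, -6.0, 0.0).astype(np.float32)
tau = np.where(occupancy > 0, -3.0, 0.0).astype(np.float32)

# Euclidean distance from the transmitter voxel to every voxel center.
tx = np.array([2, 2, 2])
idx = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"), axis=-1)
dist = np.linalg.norm((idx - tx) * res, axis=-1).astype(np.float32)

# Channel-first tensor for a 3D network: (4, D, H, W).
x = np.stack([occupancy, rho, tau, dist], axis=0)
print(x.shape)  # (4, 16, 16, 8)
```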

3. Forward Pass and Prediction Task

Given the structured voxel grid input, the SwinUNETR-based network is trained to regress the environmental path-loss heatmap $L_{\mathrm{env}}(t, v)$ at each voxel, corresponding to:

$$L(t, v) = L_{\mathrm{fspl}}(t, v) + L_{\mathrm{env}}(t, v)$$

where the free-space path loss (in dB, with $d$ in meters and $f$ in Hz) is computed analytically as:

$$L_{\mathrm{fspl}}(t,v) = 20 \log_{10} d(t,v) + 20 \log_{10} f - 147.55$$

and the environmental attenuation $L_{\mathrm{env}}$ captures all remaining effects of blockage, obstruction, and material interaction (Zheng et al., 15 Nov 2025). The ground-truth labels are obtained by subtracting the analytical FSPL from the total simulated path loss returned by Sionna RT GPU-based ray tracing.
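To make the label construction concrete, here is a worked example of the analytical FSPL at the dataset's 3.5 GHz carrier (the total path-loss value below is hypothetical, not taken from the paper); the ground-truth target is the ray-traced total minus this analytical term:

```python
import math

def fspl_db(d_m: float, f_hz: float) -> float:
    """Free-space path loss in dB (distance in meters, frequency in Hz)."""
    return 20 * math.log10(d_m) + 20 * math.log10(f_hz) - 147.55

f = 3.5e9        # 5G band n78 carrier used in SenseRay-3D
d = 10.0         # transmitter-to-voxel distance in meters
total_pl = 75.0  # hypothetical ray-traced total path loss in dB

l_fspl = fspl_db(d, f)     # ~63.33 dB
l_env = total_pl - l_fspl  # residual environmental-attenuation label
print(round(l_fspl, 2), round(l_env, 2))  # 63.33 11.67
```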

The network outputs dense, height-indexed 2D heatmaps at 11 vertical slices (0.6 m to 1.6 m in 0.1 m increments), enabling full volumetric path-loss reconstruction.
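Assuming the slice heights map directly to voxel z-indices at the grid's 0.1 m resolution (an inferred convention, not stated explicitly in the paper), the 11 output slices would correspond to:

```python
# Heights 0.6 m through 1.6 m in 0.1 m steps -> 11 horizontal slices.
heights = [round(0.6 + 0.1 * i, 1) for i in range(11)]
z_indices = [round(h / 0.1) for h in heights]  # voxel z index per slice
print(heights)    # [0.6, 0.7, ..., 1.6]
print(z_indices)  # [6, 7, ..., 16]
```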

4. Training Methodology and Performance Benchmarks

Training proceeds over 1 446 samples across 56 apartments, with validation on 359 samples using unseen transmitter placements and testing on 225 samples in six entirely novel apartments. Performance is measured by mean absolute error (MAE) and root-mean-square error (RMSE) across the prediction volume, demonstrating the model's ability to generalize:

| Split | MAE (dB) | RMSE (dB) |
|---|---|---|
| Train | 2.88 | 4.08 |
| Validation | 3.51 | 4.99 |
| Test | 4.27 | 6.32 |

Inference is performed at 217 ms per sample on an NVIDIA RTX 5080, supporting real-time usage scenarios (Zheng et al., 15 Nov 2025).
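The reported metrics are standard voxel-wise errors; a minimal sketch of how they would be computed over a predicted heatmap (illustrative arrays, not the paper's actual evaluation code):

```python
import numpy as np

def mae_rmse(pred: np.ndarray, target: np.ndarray) -> tuple[float, float]:
    """Voxel-wise mean absolute error and root-mean-square error (dB)."""
    err = pred - target
    return float(np.abs(err).mean()), float(np.sqrt((err ** 2).mean()))

# Illustrative 2 x 2 path-loss slices in dB.
target = np.array([[60.0, 65.0], [70.0, 75.0]])
pred = np.array([[62.0, 64.0], [73.0, 75.0]])

mae, rmse = mae_rmse(pred, target)
print(mae, round(rmse, 4))  # 1.5 1.8708
```

RMSE penalizes large voxel errors more heavily than MAE, which is why the reported RMSE values sit consistently above the MAE on every split.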

5. Comparison to Prior Methods and Advantages

Traditional indoor radio propagation models require manual, per-scene geometry reconstruction and material annotation, leading to poor scalability. Previous learning-based systems, such as EM DeepRay, depended on explicit geometry or limited RGB-only cues. By contrast, the SwinUNETR-based approach in SenseRay-3D:

  • Eliminates the need for handcrafted environmental models by ingesting raw sensor data.
  • Encodes physics-based features (via ITU-R parameters) into the voxel grid, incorporating reflection and transmission properties explicitly.
  • Generalizes to unseen environments through robust training on a diverse synthetic dataset, making it suitable for scalable, sense-driven deployment in residential wireless planning (Zheng et al., 15 Nov 2025).

6. Dataset Context and Reproducibility

The synthetic dataset underpinning SwinUNETR-based inference comprises:

  • 56 furnished apartments with 1 805 unique transmitter–scene ray-tracing samples at 3.5 GHz (5G N78).
  • Voxelized representations (224 × 224 × 64 at 0.1 m resolution), storing per-voxel occupancy, material response ($\rho(v)$, $\tau(v)$), transmitter distance, and full path-loss label heatmaps.
  • Storage in HDF5 format for efficient access and standardized annotation, with per-voxel and per-slice details suitable for supervised deep learning workflows.
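A sample in the HDF5 layout described above might be written and read back as follows; the dataset names, toy shapes, and attribute keys here are hypothetical, not the actual SenseRay-3D schema:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "toy_sample.h5")

# Toy shapes for illustration; real grids are (4, 224, 224, 64) inputs
# with (11, 224, 224) height-sliced label heatmaps.
with h5py.File(path, "w") as f:
    f.create_dataset("input", data=np.zeros((4, 32, 32, 8), dtype=np.float32),
                     compression="gzip")
    f.create_dataset("label", data=np.zeros((11, 32, 32), dtype=np.float32),
                     compression="gzip")
    f.attrs["frequency_hz"] = 3.5e9

with h5py.File(path, "r") as f:
    x = f["input"][...]
    y = f["label"][...]
    freq = f.attrs["frequency_hz"]
print(x.shape, y.shape, freq)
```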

Such a benchmark enables consistent evaluation and rapid iteration of alternative volumetric neural architectures or representation schemes for indoor radio environments (Zheng et al., 15 Nov 2025).

7. Relevance to Broader Research in Propagation Modeling

SwinUNETR-based neural networks exemplify a broader shift in indoor wireless modeling from analytic or empirical formulae to supervised, data-driven inference leveraging rich geometric and material cues. Comparable datasets, such as WiSegRT (Zhang et al., 2023) and DeepTelecom (Wang et al., 20 Aug 2025), have adopted GPU-accelerated ray-tracing, semantic segmentation, and physics-informed modeling principles, but differ in the input/output modality and ML architectural choices. The integration of transformer-based architectures with physical prior encoding, as in SenseRay-3D, enables real-time and generalizable path-loss prediction at high spatial resolutions, advancing the state-of-the-art in digital-twin and planning workflows for future wireless networks (Zheng et al., 15 Nov 2025).
