Synthetic Indoor Propagation Dataset
- Synthetic indoor propagation datasets are curated collections of simulated wireless channel measurements enriched with detailed scene metadata to emulate realistic indoor environments.
- They employ advanced techniques such as GPU-accelerated ray tracing, digital twin construction, and generative adversarial models to capture multipath and attenuation effects.
- These datasets support robust AI-based channel modeling, wireless localization, and benchmarking by providing reproducible, scalable, and richly annotated testbeds.
A synthetic indoor propagation dataset is a curated corpus of radio channel measurements and associated scene data generated through simulation or generative models, rather than physical measurement campaigns. These datasets aim to represent, with controlled fidelity, the rich multipath and attenuation phenomena characteristic of realistic indoor wireless channels. They facilitate research in wireless communication, localization, and AI-based channel modeling, supporting reproducible experiments, scalable benchmarking, and generalization studies beyond what is feasible with measured datasets alone.
1. Methodologies for Synthetic Indoor Propagation Dataset Generation
Synthetic indoor propagation datasets are produced via two main approaches: (a) physically informed simulation frameworks and (b) generative modeling techniques. The most advanced approaches use automated digital-twin pipelines that combine detailed geometric modeling, electromagnetic material annotation, and GPU-accelerated ray tracing to yield high-fidelity, fully annotated multi-modal outputs.
DeepTelecom defines a four-stage digital-twin pipeline (Wang et al., 20 Aug 2025):
- LoD3 Indoor Digital Twin Construction: 3D LiDAR scans, fused via an Unscented Kalman Filter (UKF), generate global point clouds, which are segmented and meshed (e.g., as OBJ/FBX), with each mesh face tagged by an LLM for electromagnetic properties (e.g., complex permittivity $\varepsilon$, permeability $\mu$).
- Parameter Configuration: System-level definitions include base station (BS)/mobile terminal (MT)/RIS placement, antenna array geometry, and bandwidth/sampling grid for channel characterization.
- Ray Tracing: High-precision paths are computed with GPU-accelerated engines (e.g., NVIDIA Sionna+OptiX), incorporating Fresnel coefficients, Uniform Theory of Diffraction (UTD), path loss, and full 3D trajectory logging.
- Channel Data Extraction: Outputs include time/frequency-domain MIMO channel representations (CIR, CFR), multi-scale fading traces, synchronized RGB visualizations, and coverage heatmaps.
Alternative simulation-centric frameworks employ semi-automatic 3D scene building from 2D floorplans (Fu et al., 20 Sep 2024), use differentiable ray tracing tied directly to per-object electromagnetic and surface-roughness parameters (Zhang et al., 2023), or exploit RGB-D scan voxelization with per-material ITU-R properties for photo-realistic 3D semantic mapping (Zheng et al., 15 Nov 2025).
Generative data augmentation using GANs enables low-cost sample generation for fingerprint-based indoor localization, trained to mimic the statistical properties of measured RSS/CSI data in specific operational sub-classes (Nabati et al., 2021).
2. Physical Propagation Models and Key Equations
The core of high-fidelity dataset generation is the explicit modeling of the rich multipath environment using the principles of geometric optics (GO), UTD, and (if needed) stochastic small-scale fading models:
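- Free-space path loss (Friis law), in its standard form for distance $d$, carrier frequency $f$, and speed of light $c$: $PL_{\mathrm{FS}}(d) = 20\log_{10}\!\left(\frac{4\pi d f}{c}\right)$ dB.
- Multi-wall path-loss model, in its standard log-distance form with reference distance $d_0$: $PL(d) = PL(d_0) + 10\,n\log_{10}\!\left(\frac{d}{d_0}\right) + \sum_{k} L_{w,k}$ dB,
where $L_{w,k}$ is the penetration loss of the $k$-th wall crossed and the path-loss exponent $n$ is typically $2$ to $3.5$ indoors.
- Fresnel reflection coefficients, in their standard form for a wave incident from air at angle $\theta_i$ on a non-magnetic material with complex relative permittivity $\varepsilon_r$: $\Gamma_{\mathrm{TE}} = \frac{\cos\theta_i - \sqrt{\varepsilon_r - \sin^2\theta_i}}{\cos\theta_i + \sqrt{\varepsilon_r - \sin^2\theta_i}}$ and $\Gamma_{\mathrm{TM}} = \frac{\varepsilon_r\cos\theta_i - \sqrt{\varepsilon_r - \sin^2\theta_i}}{\varepsilon_r\cos\theta_i + \sqrt{\varepsilon_r - \sin^2\theta_i}}$.
- MIMO channel models, written per receive/transmit antenna pair $(r, s)$ as a sum over $L$ propagation paths with complex gains $a_l^{(r,s)}(t)$ and delays $\tau_l$:
- CIR: $h_{r,s}(t,\tau) = \sum_{l=1}^{L} a_l^{(r,s)}(t)\,\delta(\tau - \tau_l)$
- CFR: $H_{r,s}(t,f) = \sum_{l=1}^{L} a_l^{(r,s)}(t)\,e^{-j 2\pi f \tau_l}$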
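The deterministic building blocks listed above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the standard textbook forms: Friis free-space loss, a log-distance model with a fixed loss added per penetrated wall, and Fresnel reflection at an air-to-dielectric interface for a non-magnetic material. The function names, default values, and wall-loss figures are hypothetical and are not taken from any of the cited datasets.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s


def fspl_db(d_m, f_hz):
    """Free-space path loss in dB (Friis law, isotropic antennas)."""
    d_m = np.asarray(d_m, dtype=float)
    return 20.0 * np.log10(4.0 * np.pi * d_m * f_hz / C)


def multiwall_pl_db(d_m, f_hz, n=2.0, wall_losses_db=(), d0_m=1.0):
    """Log-distance path loss plus one fixed loss per wall crossed.

    n is the path-loss exponent; wall_losses_db lists the assumed
    penetration loss (dB) of each wall on the direct path.
    """
    d_m = np.asarray(d_m, dtype=float)
    pl_ref = fspl_db(d0_m, f_hz)  # reference loss at d0
    return pl_ref + 10.0 * n * np.log10(d_m / d0_m) + float(sum(wall_losses_db))


def fresnel_coefficients(theta_i_rad, eps_r):
    """TE and TM reflection coefficients, air -> non-magnetic dielectric.

    eps_r is the (possibly complex) relative permittivity of the material.
    """
    cos_i = np.cos(theta_i_rad)
    root = np.sqrt(eps_r - np.sin(theta_i_rad) ** 2 + 0j)
    gamma_te = (cos_i - root) / (cos_i + root)
    gamma_tm = (eps_r * cos_i - root) / (eps_r * cos_i + root)
    return gamma_te, gamma_tm


# Example: 10 m link at 2.4 GHz through two walls (assumed 5 dB and 8 dB).
pl = multiwall_pl_db(10.0, 2.4e9, n=2.8, wall_losses_db=(5.0, 8.0))
g_te, g_tm = fresnel_coefficients(np.deg2rad(45.0), 4.0 - 0.2j)
```

With `n=2.0` and no walls the multi-wall model reduces to free-space loss, which is a convenient sanity check. Note that the sign convention for the TM coefficient varies between references.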
Stochastic small-scale models (Rayleigh/Rician) may be applied at the CIR tap level for fading emulation. Hybrid models can combine measured and synthetic data using adversarial learning to mimic environment-specific statistics.
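As a rough illustration of how a ray-traced tap list relates to a frequency response and how tap-level fading can be layered on top, the sketch below evaluates the path sum $H(f) = \sum_l a_l e^{-j2\pi f\tau_l}$ for one antenna pair and draws a Rician (or, for $K = 0$, Rayleigh) realization per tap. The function names and the unit-mean-power normalization are assumptions made for this example, not the procedure of any cited dataset.

```python
import numpy as np


def cfr_from_paths(gains, delays_s, freqs_hz):
    """Frequency response H(f) = sum_l a_l * exp(-j 2 pi f tau_l)."""
    gains = np.asarray(gains, dtype=complex)
    delays_s = np.asarray(delays_s, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    # Phase matrix of shape [F, L]: one row per frequency, one column per path.
    phase = np.exp(-2j * np.pi * np.outer(freqs_hz, delays_s))
    return phase @ gains


def rician_taps(mean_gains, k_factor, rng):
    """Draw one fading realization per tap with unit mean power scaling.

    k_factor is the linear Rician K-factor; k_factor = 0 gives Rayleigh.
    """
    mean_gains = np.asarray(mean_gains, dtype=complex)
    los = np.sqrt(k_factor / (k_factor + 1.0))            # specular part
    sigma = np.sqrt(1.0 / (2.0 * (k_factor + 1.0)))       # per-dimension std
    scatter = sigma * (rng.standard_normal(mean_gains.shape)
                       + 1j * rng.standard_normal(mean_gains.shape))
    return mean_gains * (los + scatter)


# Two-path example: direct path plus an echo delayed by 10 ns.
rng = np.random.default_rng(0)
freqs = np.linspace(-50e6, 50e6, 64)
faded = rician_taps([1.0, 0.5], k_factor=5.0, rng=rng)
H = cfr_from_paths(faded, [0.0, 10e-9], freqs)
```

Because the multiplier `los + scatter` has unit mean power, the average tap power from the ray tracer is preserved across realizations.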
3. Dataset Structures, Modalities, and Annotation
Synthetic indoor propagation datasets may comprise a rich set of outputs, typically bundled per scenario in a standardized directory. The table below summarizes the canonical DeepTelecom structure (Wang et al., 20 Aug 2025):
| Output | Dimensions | Type | Format |
|---|---|---|---|
| CIR | [T × N_r × N_t × L_max] | complex64 | HDF5 |
| CFR | [F × N_r × N_t] | complex64 | HDF5 |
| Small-scale fading | [T × N_r × N_t] | float32 | HDF5 |
| Large-scale metrics | [T × {PL,DS,AS,ASD}] | float32 | HDF5 |
| RGB frames | [30 fps × 1080 × 1920 × 3] | uint8 | MP4 |
| Heatmap frames | [30 fps × 1080 × 1920] | float32 | MP4 |
| Path logs | [#rays × max_depth × (3D coords+labels)] | float32, int | npz/CSV |
Each scenario typically includes metadata (JSON/YAML) specifying geometry, materials, hardware configuration, and transmit/receive site sampling.
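A loader can guard against malformed scenarios by checking each tensor against the shapes and dtypes in the table above. The sketch below is one way to do this with NumPy arrays already in memory. The key names (`cir`, `cfr`, `dims`, and so on) and the metadata layout are hypothetical, since the actual file schema is not specified here.

```python
import json
import numpy as np

# Expected axes and dtype per output, following the table above.
# All key names are hypothetical placeholders.
EXPECTED = {
    "cir": (("T", "N_r", "N_t", "L_max"), np.complex64),
    "cfr": (("F", "N_r", "N_t"), np.complex64),
    "small_scale_fading": (("T", "N_r", "N_t"), np.float32),
    "large_scale_metrics": (("T", "n_metrics"), np.float32),
}


def check_scenario(arrays, meta):
    """Return a list of human-readable problems (empty if all checks pass)."""
    dims = meta["dims"]
    problems = []
    for name, (axes, dtype) in EXPECTED.items():
        if name not in arrays:
            problems.append(f"{name}: missing")
            continue
        arr = arrays[name]
        want = tuple(dims[a] for a in axes)
        if arr.shape != want:
            problems.append(f"{name}: shape {arr.shape}, expected {want}")
        if arr.dtype != dtype:
            problems.append(f"{name}: dtype {arr.dtype}, expected {np.dtype(dtype)}")
    return problems


# Hypothetical metadata, as it might be parsed from a scenario JSON file.
meta = json.loads(
    '{"dims": {"T": 4, "F": 64, "N_r": 2, "N_t": 2, "L_max": 8, "n_metrics": 4}}'
)
arrays = {
    name: np.zeros(tuple(meta["dims"][a] for a in axes), dtype=dtype)
    for name, (axes, dtype) in EXPECTED.items()
}
problems = check_scenario(arrays, meta)
```

Returning a list of problems rather than raising on the first mismatch makes it easier to audit a whole scenario directory in one pass.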
Voxelized datasets (e.g., SenseRay-3D (Zheng et al., 15 Nov 2025)) provide dense per-voxel annotation: occupancy, reflection and transmission properties, distance to Tx, and an FSPL baseline, with path-loss heatmaps across multiple elevation layers. Datasets supporting fine-grained segmentation (e.g., WiSegRT (Zhang et al., 2023)) allow explicit per-mesh EM parameterization, crucial for site-specific generalization.
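Two of the per-voxel channels mentioned above, distance to the transmitter and the free-space path-loss baseline, follow directly from the grid geometry. The sketch below computes both for a regular voxel grid. The grid layout, the clamping distance near the transmitter, and the channel ordering are assumptions made for illustration and do not describe the SenseRay-3D format.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s


def voxel_baseline(grid_shape, voxel_size_m, tx_pos_m, f_hz, min_dist_m=None):
    """Per-voxel distance to Tx (m) and free-space path loss (dB).

    Voxel centers sit at (index + 0.5) * voxel_size_m along each axis.
    Distances are clamped to min_dist_m before the logarithm so the
    voxel that contains the transmitter stays finite.
    """
    if min_dist_m is None:
        min_dist_m = voxel_size_m / 2.0
    axes = [(np.arange(n) + 0.5) * voxel_size_m for n in grid_shape]
    xx, yy, zz = np.meshgrid(*axes, indexing="ij")
    centers = np.stack([xx, yy, zz], axis=-1)
    dist = np.linalg.norm(centers - np.asarray(tx_pos_m, dtype=float), axis=-1)
    clamped = np.maximum(dist, min_dist_m)
    fspl = 20.0 * np.log10(4.0 * np.pi * clamped * f_hz / C)
    return dist, fspl


# Small 2 m x 2 m x 1 m grid with 0.5 m voxels and a Tx in the first voxel.
dist, fspl = voxel_baseline((4, 4, 2), 0.5, (0.25, 0.25, 0.25), 2.4e9)
occupancy = np.zeros((4, 4, 2), dtype=np.float32)  # placeholder: empty room
features = np.stack([occupancy, dist, fspl], axis=0)  # [channels, X, Y, Z]
```

A learned model can then predict the residual between the ray-traced path loss and this baseline channel, which is one common way to use an FSPL prior.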
4. Statistical Properties and Coverage
Statistical characterization, essential for verifying realism and benchmarking models, typically includes:
- Path-loss exponent $n$: Modeled as Gaussian-distributed over 10 scenes and 10 000 Tx-Rx pairs (Wang et al., 20 Aug 2025); alternative datasets report their own values for dense environments (Zhang et al., 2023).
- Delay spread: Modeled as log-normal (in seconds), revealing the "multipath richness" of complex indoor layouts (Wang et al., 20 Aug 2025).
- Angular spreads: Azimuth AoA spread modeled as Weibull, elevation spread as Gaussian.
- Path-loss dynamic range: Spanning 30–140 dB (close-in LoS to deep NLoS) (Zheng et al., 15 Nov 2025), with standard deviations typically 4–18 dB.
- Comparison of scene granularity: Inclusion of fine furniture and high-resolution segmentation increases RMS delay spread and path-loss variance, underscoring the importance of highly detailed geometric and material annotation for ML generalization (Zhang et al., 2023).
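Two of the statistics above can be estimated from raw samples with little code. The sketch below fits a path-loss exponent by least squares on the log-distance model $PL(d) = PL(d_0) + 10\,n\log_{10}(d/d_0)$ and computes the RMS delay spread of a power-delay profile. It assumes linear-scale tap powers and is a generic estimator, not necessarily the procedure used by the cited datasets.

```python
import numpy as np


def fit_path_loss_exponent(d_m, pl_db, d0_m=1.0):
    """Least-squares fit of PL(d) = PL(d0) + 10 n log10(d / d0).

    Returns (n, PL(d0) in dB, residual standard deviation in dB).
    """
    d_m = np.asarray(d_m, dtype=float)
    pl_db = np.asarray(pl_db, dtype=float)
    x = 10.0 * np.log10(d_m / d0_m)
    design = np.column_stack([x, np.ones_like(x)])
    (n, pl0), *_ = np.linalg.lstsq(design, pl_db, rcond=None)
    resid = pl_db - design @ np.array([n, pl0])
    return n, pl0, resid.std(ddof=2)  # two fitted parameters


def rms_delay_spread(powers_lin, delays_s):
    """RMS delay spread: power-weighted standard deviation of tap delays."""
    p = np.asarray(powers_lin, dtype=float)
    tau = np.asarray(delays_s, dtype=float)
    mean_delay = np.sum(p * tau) / np.sum(p)
    return np.sqrt(np.sum(p * (tau - mean_delay) ** 2) / np.sum(p))


# Synthetic check: noiseless samples generated with n = 3 and PL(1 m) = 40 dB.
d = np.array([1.0, 2.0, 5.0, 10.0, 20.0])
pl = 40.0 + 10.0 * 3.0 * np.log10(d)
n_hat, pl0_hat, sigma_hat = fit_path_loss_exponent(d, pl)
ds = rms_delay_spread([1.0, 1.0], [0.0, 100e-9])  # two equal taps, 100 ns apart
```

Running the fit per scene and then examining the distribution of the fitted exponents is one way to obtain the kind of per-dataset summary described above.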
5. Applications and Best Practices
Synthetic indoor propagation datasets serve as benchmarks and substrates for:
- Channel estimation training: Noisy and clean CIR/CFR tensors serve as input/target pairs for denoising networks (e.g., U-Net on CIR/CFR tensors (Wang et al., 20 Aug 2025)).
- MIMO beamforming: Learning spatially-precoded weights guided by AoD/AoA and full-channel matrices.
- Digital twin research: Sensitivity analysis through real-time EM property modification; supports studies of dynamic blockage and scenario variation (Wang et al., 20 Aug 2025).
- Benchmarking machine learning architectures: Scene-level train/val/test splitting as in SenseRay-3D (Zheng et al., 15 Nov 2025) provides rigorous separation between spatial memorization and true generalization.
- Data augmentation and transfer learning: Synthetic examples boost coverage and enable pretraining for transfer to real-world data (Nabati et al., 2021).
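Scene-level splitting, as mentioned in the benchmarking item above, means that every sample from a given scene lands in exactly one partition, so a model cannot score well simply by memorizing a floor plan. The sketch below shows one way to do this with the standard library. The function name, the split fractions, and the scene identifiers are hypothetical.

```python
import random


def scene_level_split(sample_scene_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign sample indices to train/val/test so that no scene is shared.

    sample_scene_ids[i] is the scene identifier of sample i.
    Returns a dict mapping split name to a list of sample indices.
    """
    scenes = sorted(set(sample_scene_ids))
    rng = random.Random(seed)  # seeded for a reproducible split
    rng.shuffle(scenes)
    n_train = int(round(fractions[0] * len(scenes)))
    n_val = int(round(fractions[1] * len(scenes)))
    groups = {
        "train": set(scenes[:n_train]),
        "val": set(scenes[n_train:n_train + n_val]),
        "test": set(scenes[n_train + n_val:]),
    }
    split = {name: [] for name in groups}
    for idx, scene in enumerate(sample_scene_ids):
        for name, members in groups.items():
            if scene in members:
                split[name].append(idx)
                break
    return split


# Ten hypothetical scenes with five Tx-Rx samples each.
ids = [f"scene_{i}" for i in range(10) for _ in range(5)]
split = scene_level_split(ids)
```

A random split over individual samples would instead place neighboring receiver positions from the same room in both train and test, which tends to inflate reported accuracy.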
Best practices include:
- Amplitude normalization by Friis reference prior to deep model training.
- Temporal subsampling to system integration timescales.
- Scenario filtering by material/structure through provided metadata.
- Augmentation through spatial rotations and flips to enhance robustness to geometric variation.
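Two of these practices are easy to illustrate. The sketch below divides channel coefficients by the free-space amplitude at each link distance (one reading of "normalization by Friis reference") and generates the eight rotation/flip variants of a 2D input map together with its target heatmap. Function names and array layouts are assumptions for this example.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s


def friis_normalize(h, d_m, f_hz):
    """Divide channel coefficients by the free-space amplitude gain.

    h has shape [N, ...] and d_m has shape [N] (one distance per link).
    After normalization, a pure free-space link has unit magnitude.
    """
    h = np.asarray(h)
    d_m = np.asarray(d_m, dtype=float)
    ref_amp = C / (4.0 * np.pi * d_m * f_hz)
    ref_amp = ref_amp.reshape((-1,) + (1,) * (h.ndim - 1))  # broadcast over taps
    return h / ref_amp


def dihedral_augment(inp, target):
    """Return the 8 rotation/flip variants of an (input, target) map pair.

    The same transform is applied to both arrays over their last two axes,
    so the spatial correspondence between input and label is preserved.
    """
    pairs = []
    for k in range(4):
        ri = np.rot90(inp, k, axes=(-2, -1))
        rt = np.rot90(target, k, axes=(-2, -1))
        pairs.append((ri, rt))
        pairs.append((np.flip(ri, axis=-1), np.flip(rt, axis=-1)))
    return pairs


# Example: two links at 1 m and 10 m, three taps each, pure free-space gains.
d = np.array([1.0, 10.0])
h = (C / (4.0 * np.pi * d * 2.4e9))[:, None] * np.ones((2, 3), dtype=complex)
h_norm = friis_normalize(h, d, 2.4e9)
variants = dihedral_augment(np.arange(16.0).reshape(4, 4), np.ones((4, 4)))
```

Rotations by 90 degrees are only label-preserving when grid cells are square and any direction-dependent inputs, such as antenna orientation, are transformed consistently.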
6. Representative Datasets and Comparative Features
Several prominent datasets implement these methodologies:
- DeepTelecom (Wang et al., 20 Aug 2025): LoD3 digital twins, GPU ray tracing, multimodal outputs (CIR, CFR, heatmaps, RGB), and comprehensive metadata.
- SenseRay-3D (Zheng et al., 15 Nov 2025): Physics-informed voxelized inputs, ray-traced ground-truth heatmaps, scene-level evaluation splits, and open-access release with code.
- WiSegRT (Zhang et al., 2023): Fine 3D segmentation, per-object EM properties, differentiable ray tracing, path logs, and direct support for ML and digital twin workflows.
- Wave Propagation Model Dataset (Fu et al., 20 Sep 2024): 3D CAD generation from 2D plans, labeled EM parameters, Wireless InSite RT, and ML-ready image encodings.
- Synthetic GAN-Generated Localization Data (Nabati et al., 2021): GANs trained on RSS “fingerprints” to augment and replace costly real measurements, maintaining test accuracy with a 90% reduction in real data collection.
Dataset Access and Format
File formats range from HDF5 for tensor data, MP4 for visualization, and JSON/YAML for scenario metadata, to CSV or pickle files for concise path logs or fingerprints. Open-access licensing and code resources (often in Python/PyTorch/TensorFlow) foster reproducibility and rapid integration into wireless ML workflows.
By leveraging digital-twin construction, GPU-accelerated propagation solvers, and standardized annotation across large-scale, highly-parameterizable scenes, synthetic indoor propagation datasets provide critical infrastructure for AI-native wireless research, robust benchmarking, and the development of generalizable, physically-grounded radio environment modeling.