Synthetic Haze Dataset Overview
- Synthetic Haze Dataset is a curated collection of paired clear and artificially hazed images created using controlled, physics-based models.
- It employs diverse synthesis pipelines including classical ASM, game engine rendering, and GAN-based domain adaptation for simulating various haze conditions.
- The dataset enables comprehensive training and benchmarking of dehazing algorithms, addressing the scarcity of real-world paired clean-hazy imagery.
A synthetic haze dataset is a curated collection of image pairs consisting of clear (ground-truth) and artificially hazed images, where haze is simulated by a controlled, physics-based model. These datasets are foundational in supervised deep learning for image dehazing, enabling the training and benchmarking of dehazing algorithms in the absence of real-world paired clean-hazy imagery. They span a broad spectrum of scenes, haze types, intensities, and photometric conditions, and are constructed via methodologies that range from analytic image formation models to photorealistic rendering in game engines and advanced domain adaptation pipelines.
1. Physical Basis and Synthesis Pipeline
Synthetic haze datasets are almost universally based on the Koschmieder/Atmospheric Scattering Model (ASM), formalized as:

$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr), \qquad t(x) = e^{-\beta d(x)},$$

where $I(x)$ is the observed hazy image at pixel $x$, $J(x)$ the scene radiance (clear image), $A$ the atmospheric light (RGB), $t(x)$ the transmission function, $\beta$ the medium scattering coefficient, and $d(x)$ the scene depth (Li et al., 2017).
When ground-truth depth is absent, a single-image depth estimator is often used, as in RESIDE’s OTS, where the network of Liu et al. provides per-pixel depth (Li et al., 2017), or in CANDY with Make3D/BSDS imagery (Swami et al., 2018). More recent techniques replace this with metric-accurate depth taken directly from 3D game engines (SimHaze (Lou et al., 2023), UNREAL-NH (Liu et al., 2023)), or decouple haze from depth by shuffling independent depth-image pairs (DA-HAZE (Xu et al., 2024)).
Parameterization strategies for $A$ and $\beta$ include uniform or truncated-normal sampling within experimentally validated ranges, e.g., for outdoor traffic scenes (Li et al., 2017) and in HazyDet (Feng et al., 2024). For multi-density datasets, haze levels are controlled via explicit scheduling or by learned latents in GANs (Zhang et al., 2021).
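The sampling strategy described above can be sketched as follows. The specific bounds, mean, and standard deviation used here are illustrative assumptions, not values reported by any of the cited datasets; a truncated normal is drawn by simple rejection sampling so the sketch needs only NumPy.

```python
import numpy as np

def sample_haze_params(n, a_range=(0.7, 1.0), beta_range=(0.5, 2.0),
                       beta_mean=1.0, beta_std=0.5, seed=0):
    """Sample per-image ASM parameters: atmospheric light A ~ Uniform,
    scattering coefficient beta ~ truncated Normal (rejection sampling)."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(*a_range, size=n)            # one global airlight per image
    lo, hi = beta_range
    beta = np.empty(0)
    while beta.size < n:                         # keep drawing until enough land in range
        draws = rng.normal(beta_mean, beta_std, size=2 * n)
        beta = np.concatenate([beta, draws[(draws >= lo) & (draws <= hi)]])
    return A, beta[:n]
```

Sampling one $(A, \beta)$ pair per clean image yields a haze-density distribution concentrated around the chosen mean while still covering the full validated range.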
2. Major Public Synthetic Haze Datasets
The diversity, scale, and methodology of synthetic haze datasets are summarized in the table below:
| Dataset | Generation Method | Size / Splits |
|---|---|---|
| RESIDE (Li et al., 2017) | ASM + monocular depth or stereo RGB-D | Indoor: 13,990 / Outdoor: 72,135 / SOTS: 500 |
| SimHaze (Lou et al., 2023) | Unreal Engine, perfect depth + ASM | “Tens of thousands”, urban/park, multi-haze |
| 4K-HAZE (Zheng et al., 2023) | ASM on 4K, CS-Mixer depth + GU-Net GAN | >30,000 4K pairs, day/night |
| UNREAL-NH (Liu et al., 2023) | Unreal Engine 4.27, photometric post-processing | 10,080 pairs (480×480 patches) |
| DA-HAZE (Xu et al., 2024) | ASM; shuffled depth-image, “depth-agnostic” | 313,950 pairs, 3× GSS expansion |
| HazeSpace2M (Islam et al., 2024) | ASM; multiple haze types/intensities | 2,070,600 synthetic hazy, 130,193 GT |
| HazyDet (Feng et al., 2024) | ASM, SOTA monocular depth, trunc-norm A/β | 11,000 synthetic images |
| Nighttime 3R (Zhang et al., 2020) | ASM + empirical color prior | NHR: 8,970, NHC: 2,750×3 |
| Hazy-COCO (Li et al., 2021) | Inverse-MLDCP w/ regression correction | ≈118,000 COCO images, multi-density |
| Gauge Haze (Ramírez-Agudelo et al., 15 Jan 2026) | Unreal Engine 5.1.1, controlled lighting | 4,796 haze + 9,590 smoke |
Key properties include the type of depth input, scene domain (indoor/outdoor), haze density control, and the inclusion of auxiliary data (e.g., depth maps, bounding boxes, mask annotations).
3. Dataset Construction Methodologies
a. Depth-Based ASM Synthesis
Classical approaches (RESIDE, CANDY, HazyDet) synthesize hazy images by applying ASM to in-the-wild imagery, using RGB-D pairs, estimated monocular depth, or stereo disparity. The accuracy of depth estimation is critical; SimHaze demonstrates improvements by avoiding monocular estimators entirely, instead rendering both clean images and ground-truth depth in Unreal Engine (Lou et al., 2023). The canonical pipeline:
- Acquire or estimate depth $d(x)$.
- Sample parameters $A$ and $\beta$.
- Compute transmission $t(x) = e^{-\beta d(x)}$.
- Blend the hazy image: $I(x) = J(x)\,t(x) + A\,(1 - t(x))$.
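The canonical ASM pipeline above reduces to a few lines of array code. This is a minimal sketch, assuming a clean image and depth map already normalized to floating-point arrays; real pipelines add per-channel airlight, noise, and tone-mapping steps.

```python
import numpy as np

def synthesize_haze(J, depth, A, beta):
    """Apply the atmospheric scattering model I = J*t + A*(1-t),
    with transmission t(x) = exp(-beta * d(x)).

    J:     clean image, floats in [0, 1], shape (H, W, 3)
    depth: per-pixel scene depth, shape (H, W)
    A:     global atmospheric light, scalar or length-3 RGB
    beta:  medium scattering coefficient
    """
    t = np.exp(-beta * depth)[..., None]     # (H, W, 1), broadcast over RGB
    return J * t + np.asarray(A) * (1.0 - t)

# toy example: a dark 2x2 "image" with increasing depth
J = np.full((2, 2, 3), 0.2)
depth = np.array([[1.0, 5.0], [10.0, 50.0]])
I = synthesize_haze(J, depth, A=1.0, beta=0.1)
# distant pixels converge to the atmospheric light A
```

The exponential transmission term makes distant pixels approach $A$ regardless of scene content, which is exactly why accurate depth (Section 3a-b) dominates the realism of the result.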
b. Engine-Based, Fully Synthetic Rendering
Game engine datasets (SimHaze, UNREAL-NH, Gauge Haze) leverage advanced photorealistic rendering to generate high-resolution clean/hazy pairs with accurate depth, illumination, and physically-based fog propagation (Lou et al., 2023, Liu et al., 2023, Ramírez-Agudelo et al., 15 Jan 2026). This approach eliminates scale or pose inconsistencies, supports true 3D haze and occlusion, and enables batch generation of synchronized data.
c. Physically-Informed GAN Domain Adaptation
Recent benchmarks (4K-HAZE) post-process ASM-based synthetic images with a GAN trained to map synthetic haze statistics onto the real domain (using GU-Net), closing the domain gap in pixel distribution without direct paired data (Zheng et al., 2023).
d. Haze-Type and Density Variants
Datasets are now annotated for haze types (fog, cloud, environmental), with separate models for each (HazeSpace2M (Islam et al., 2024)). Type labels derive from controlled ASM parameter sequences plus stylization (Photoshop Neural Filters, cloud overlays).
e. Depth-Agnostic and Content-Style Disentangled Synthesis
DA-HAZE (Xu et al., 2024) disrupts the depth–haze coupling by globally shuffling depth maps before ASM synthesis. Density-aware GANs (DAS (Zhang et al., 2021)) encode haze intensity as style latents, enabling smooth latent interpolation across haze strengths, directly supporting density-aware model evaluation.
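The depth-shuffling idea can be sketched as below, in the spirit of DA-HAZE's global shuffle: each clean image is paired with a depth map from a different scene before ASM synthesis, so learned models cannot exploit depth-haze correlations. The derangement-style fix-up that avoids self-pairing is an illustrative assumption, not the paper's exact policy.

```python
import numpy as np

def depth_agnostic_pairs(images, depths, seed=0):
    """Pair each clean image with a depth map drawn from a *different*
    scene, breaking the depth-haze coupling before ASM synthesis."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(images))
    # fix-up pass: swap away any fixed point so no image keeps its own depth
    for i in range(len(perm)):
        if perm[i] == i:
            j = (i + 1) % len(perm)
            perm[i], perm[j] = perm[j], perm[i]
    return [(images[i], depths[perm[i]]) for i in range(len(images))]
```

Running ASM on these shuffled pairs produces haze whose spatial layout is statistically independent of the underlying scene geometry.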
4. Dataset Statistics, Splits, and Benchmarking
Dataset sizes range from several thousand to over two million images (HazeSpace2M (Islam et al., 2024)). Standard protocol involves:
- Training: Multiple haze variants per clean image, often covering the full $A$, $\beta$ (or haze intensity/type) parameter space.
- Validation: Held-out set, matched distribution.
- Objective testing: Separate, often more “challenging,” synthetic test pairs (SOTS, DA-SOTS, HazyDet synthetic test).
- Real-world evaluation: Distinct benchmark of unpaired real hazy images, occasionally with pseudo-ground-truth or human-labeled bounding boxes (e.g., HazyDet’s RDDTS (Feng et al., 2024)).
Metrics for benchmarking include PSNR, SSIM, BRISQUE, FADE, FID/KID (synthetic/real similarity), mAP (for detection), and task-specific SSIM/accuracy for gauge reading (Ramírez-Agudelo et al., 15 Jan 2026).
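Because synthetic pairs provide pixel-aligned ground truth, full-reference metrics like PSNR are computed directly against the clean image. A minimal sketch of the standard PSNR definition, assuming images normalized to [0, 1]:

```python
import numpy as np

def psnr(reference, restored, max_val=1.0):
    """Peak signal-to-noise ratio between the ground-truth clean image
    and a dehazed output (full-reference metric; higher is better)."""
    mse = np.mean((reference.astype(np.float64) - restored) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

clean = np.zeros((8, 8, 3))
restored = clean + 0.1    # uniform error of 0.1 -> MSE = 0.01 -> 20 dB
```

SSIM, FID/KID, and mAP follow the same pattern of pairing each restored image with its synthetic ground truth (or labels), which is precisely what unpaired real-world benchmarks cannot offer.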
5. Advantages, Challenges, and Limitations
Advantages:
- Synthetic paired data enables full-reference quantitative evaluation (PSNR, SSIM).
- Parameterized control over haze properties facilitates robustness studies.
- Synthetic depth guarantees perfect ASM compliance when using engine rendering (Lou et al., 2023).
Limitations:
- Domain gap persists even with highly realistic simulation (SimHaze, UNREAL-NH), as evidenced by mAP/PSNR/SSIM drops when transferring from synthetic to real haze (Lou et al., 2023, Feng et al., 2024).
- Depth estimation errors (in monocular pipelines) cause local artifacts and over/under-hazing (Li et al., 2017).
- Most pipelines do not model non-uniform, multi-layer, or spectrally varying haze; rare phenomena (e.g., colored smog, thick smoke) may not be captured (Zheng et al., 2023, Ramírez-Agudelo et al., 15 Jan 2026).
- Photometric augmentations are usually limited to haze parameter sweeps; scene/semantic diversity is constrained by the underlying clean set or engine assets.
6. Specialized Variants and Application-Driven Datasets
- Nighttime haze: 3R pipeline (Zhang et al., 2020), UNREAL-NH (Liu et al., 2023) add artificially colored light sources, simulate glow/flare, and employ empirical night illumination priors.
- Detection under haze: HazyDet (Feng et al., 2024), Hazy-COCO (Li et al., 2021) include bounding boxes and depth maps, supporting both dehazing and downstream object detection.
- Fine-grained haze type/level: HazeSpace2M (Islam et al., 2024) spans multiple haze categories and intensities, supporting haze-type classification and specialized dehazing pipelines.
- Industrial/gauge reading: Gauge Haze (Ramírez-Agudelo et al., 15 Jan 2026) addresses photometric restoration in scene-specific environments via engine-generated analog instrument imagery.
7. Usage Guidelines and Future Directions
Synthetic haze datasets should be selected and configured to match the intended deployment domain in scene content, haze type, and density. For cross-domain deployment, depth-agnostic or domain-adapted methods (DA-HAZE (Xu et al., 2024), GAN-adapted (Zheng et al., 2023), style-disentangled (Zhang et al., 2021)) offer improved real-world generalization.
Current research targets more realistic multi-layer/hyperspectral haze simulation, integration of measured meteorological data, improved depth estimation, and jointly training on synthetic plus small-scale real haze (semi-supervised, domain adaptation) (Liu et al., 2023, Zhang et al., 2020). The release of datasets such as HazeSpace2M promises to enable haze-aware vision models operating robustly across real-world atmospheric phenomena (Islam et al., 2024).
References:
- RESIDE (Li et al., 2017)
- CANDY (Swami et al., 2018)
- SimHaze (Lou et al., 2023)
- UNREAL-NH (Liu et al., 2023)
- 4K-HAZE (Zheng et al., 2023)
- HazyDet (Feng et al., 2024)
- DA-HAZE (Xu et al., 2024)
- Nighttime 3R (Zhang et al., 2020)
- HazeSpace2M (Islam et al., 2024)
- Hazy-COCO (Li et al., 2021)
- Gauge Haze (Ramírez-Agudelo et al., 15 Jan 2026)
- Density-aware Synthesis (Zhang et al., 2021)