Soiling Detection in Automotive Cameras
- Soiling detection for automotive cameras is the automated identification of lens contaminants like mud, dust, and rain to ensure reliable perception in ADAS and autonomous driving.
- It employs techniques such as semantic segmentation, tile-based classification, and coverage regression to differentiate opaque, transparent, and semi-transparent soiling with high accuracy.
- Key challenges include managing noisy annotations, achieving real-time performance on embedded systems, and overcoming limited data diversity, often via GAN-based augmentation.
Soiling detection for automotive cameras refers to the automated identification and localization of contaminants—including mud, dust, rain droplets, snow, ice, and other environmental pollution—on in-vehicle camera lenses. Accurate detection of lens soiling is critical for ensuring the robustness and reliability of perception algorithms in advanced driver-assistance systems (ADAS) and autonomous driving, given the substantial degradation soiling can introduce to visual scene understanding. The field has evolved rapidly since the first large-scale annotated datasets of automotive fisheye imagery, with recent research focusing on semantic segmentation, tile-based classification, regression of soiling coverage, and advanced data augmentation to address both accuracy and low-latency embedded requirements.
1. Soiling Types, Dataset Construction, and Annotation Protocols
A core foundation of soiling detection research is the availability of annotated datasets that capture the breadth of real-world lens contaminants across multiple cameras and conditions. The WoodScape dataset introduced the first extensive multi-camera fisheye corpus with instance-level soiling annotations: 5,000 images labeled for soiling, equally drawn from front, rear, and lateral surround-view fisheye cameras (Yogamani et al., 2019, Beránek et al., 12 Nov 2025).
Soiling in these datasets is typically partitioned into:
- Opaque soiling: Mud, dust, and similar materials that block or heavily scatter light, occluding image regions.
- Transparent soiling: Water droplets, thin ice, or residues that refract or blur but do not fully occlude the underlying scene.
- Semi-transparent soiling: Strong blur or partial occlusion, where texture is significantly weakened (introduced in later works) (Uricar et al., 2021).
Annotation is performed via closed polygons at the instance level for each soiled patch, supporting both per-pixel segmentation and per-tile or global presence labels. The protocols emphasize per-pixel masks rather than coarse bounding boxes, facilitating high-granularity segmentation (Yogamani et al., 2019, Beránek et al., 12 Nov 2025). Data splits generally follow stratified ratios (e.g., 60% train, 10% val, 30% test in WoodScape), with five randomized splits available to encourage robust evaluation and class balance (Yogamani et al., 2019).
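As a concrete illustration, per-pixel masks can be rasterized from such closed-polygon annotations; the sketch below uses a simple even-odd (ray-casting) test. The function name and the toy square polygon are illustrative, not part of any dataset tooling.

```python
import numpy as np

def polygon_to_mask(polygon, height, width, class_id=1):
    """Rasterize a closed polygon (list of (x, y) vertices) into an
    integer label mask via the even-odd (ray-casting) rule."""
    ys, xs = np.mgrid[0:height, 0:width]
    inside = np.zeros((height, width), dtype=bool)
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        if y0 == y1:
            continue  # horizontal edges never cross the scan ray
        # Rows whose horizontal ray crosses this edge
        crosses = (ys >= min(y0, y1)) & (ys < max(y0, y1))
        # x-coordinate where the edge intersects each such row
        x_int = x0 + (ys - y0) * (x1 - x0) / (y1 - y0)
        inside ^= crosses & (xs < x_int)
    return np.where(inside, class_id, 0).astype(np.uint8)

# A square soiling patch covering pixels (2..5, 2..5) of an 8x8 frame
mask = polygon_to_mask([(2, 2), (6, 2), (6, 6), (2, 6)], 8, 8)
```

Each instance polygon would be rasterized this way with its class id, yielding the dense four-class label map used for segmentation training.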
Manual annotation, however, introduces noise due to the subtlety of soiling boundaries—especially between transparent and semi-transparent classes. Ensemble pseudo-labeling frameworks have been proposed to refine these noisy labels (see Section 3) (Uricar et al., 2021).
2. Algorithmic Approaches: Classification, Regression, Segmentation
Early methods employed tile-wise or global classification via compact convolutional networks, whereby the input image was partitioned into a regular grid (e.g., 4×4 tiles), and each tile was assigned a dominant soiling label (Uricar et al., 2019, Das, 2019). The SoilingNet and SoildNet architectures were embedded within multi-task frameworks and demonstrated that per-tile detection could enable partial system failover, ignoring or masking only contaminated regions in the frame (Uricar et al., 2019, Das, 2019).
Coverage regression was introduced in TiledSoilingNet on the grounds that co-occurrence of multiple soiling types within a tile is frequent. Rather than assigning a single class per tile, the approach regresses the fractional area of each soiling type in each tile: for tile i and soiling class c, a multi-output regression head predicts the coverage fraction in [0, 1], supervised with mean-squared error (Das et al., 2020). This allows more informative downstream decisions and is efficient in deployment (about 2 ms per frame on an automotive SoC, roughly an order of magnitude faster than full segmentation).
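The per-tile coverage targets can be derived directly from a per-pixel label mask; a minimal numpy sketch (grid size and class count are illustrative):

```python
import numpy as np

def tile_coverage(mask, grid=(4, 4), num_classes=4):
    """Fraction of each soiling class per tile.
    mask: (H, W) integer label map; returns (grid_h, grid_w, num_classes),
    where each tile's class fractions sum to 1."""
    H, W = mask.shape
    gh, gw = grid
    th, tw = H // gh, W // gw
    cov = np.zeros((gh, gw, num_classes))
    for i in range(gh):
        for j in range(gw):
            tile = mask[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            cov[i, j] = np.bincount(tile.ravel(),
                                    minlength=num_classes) / tile.size
    return cov

def coverage_mse(pred, target):
    """Mean-squared error between predicted and target coverages."""
    return float(np.mean((pred - target) ** 2))

mask = np.zeros((64, 64), dtype=np.int64)  # class 0 = clear everywhere
mask[:16, :16] = 3                         # opaque soiling fills one tile
cov = tile_coverage(mask)
```

The regression head is then trained to match these fractions, which is what makes mixed-soiling tiles representable without a hard class decision.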
Semantic segmentation has become the prevailing paradigm, treating soiling detection as a dense per-pixel four-class semantic segmentation problem (classes: clear, transparent, semi-transparent, opaque) (Beránek et al., 12 Nov 2025). Numerous established encoder–decoder architectures (U-Net, U-Net++, DeepLab V3, DeepLab V3+, FPN, PSPNet, MaNet, LinkNet, PAN) have been benchmarked directly on soiling detection, with pixel-wise categorical cross-entropy as the main loss function (Beránek et al., 12 Nov 2025). This transition has empirically improved pixel accuracy above previous tile-based baselines.
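The pixel-wise categorical cross-entropy used in these benchmarks reduces to averaging the negative log-probability of the true class at every pixel; a numpy sketch:

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    """Mean categorical cross-entropy over all pixels.
    logits: (H, W, C) raw class scores; labels: (H, W) integer class ids."""
    # Numerically stable log-softmax over the class axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    H, W = labels.shape
    # Pick the log-probability of the true class at each pixel
    picked = log_probs[np.arange(H)[:, None], np.arange(W)[None, :], labels]
    return float(-picked.mean())

labels = np.zeros((4, 4), dtype=np.int64)           # all pixels "clear"
perfect = np.zeros((4, 4, 4)); perfect[..., 0] = 10.0  # confident, correct
```

With four classes, uninformative (uniform) logits give a loss of ln(4) ≈ 1.386, the usual sanity check at the start of training.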
A concise comparison is captured in the table below:
| Architecture Type | Output | Loss | Typical Performance | Reference |
|---|---|---|---|---|
| Tile classification | Tile class | CE or RMSE | 87.4% tile accuracy (4×4 grid) | (Das et al., 2020) |
| Coverage regression | Tile coverages | MSE | RMSE 0.10–0.11 | (Das et al., 2020) |
| Full segmentation | Per-pixel mask | CE | 91.2–94.0% pixel accuracy | (Beránek et al., 12 Nov 2025) |
3. Handling Noisy Annotations and Data Quality
Manual polygonal annotations are inherently noisy, particularly between visually ambiguous categories. A robust semi-supervised ensemble strategy has been established (Uricar et al., 2021):
- Ensemble pseudo-labeling: Multiple noisy “pseudo-labels” (manual polygons + deep segmentation outputs with varied backbones and test-time augmentation) are aggregated.
- Two-stage network distillation:
- Stage 1: Train the network on one randomly selected pseudo-label per image.
- Stage 2: Freeze network, then for each image select the pseudo-label closest (in CE loss) to the consensus output.
- Refined ensemble masks replace raw manual labels for downstream training.
This yields up to +15 percentage point (pp) IoU improvements for transparent and semi-transparent soiling categories over models trained only on manual polygons (Uricar et al., 2021).
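Stage 2 of the distillation can be sketched as follows: with the network frozen, each image keeps the pseudo-label whose cross-entropy against the network's consensus prediction is lowest. All names here are illustrative, not the authors' code.

```python
import numpy as np

def cross_entropy_to_consensus(probs, label):
    """CE of an integer pseudo-label mask against consensus probabilities.
    probs: (H, W, C) frozen-network softmax output; label: (H, W) ints."""
    H, W = label.shape
    p = probs[np.arange(H)[:, None], np.arange(W)[None, :], label]
    return float(-np.log(np.clip(p, 1e-12, 1.0)).mean())

def select_pseudo_label(probs, candidates):
    """Return the candidate mask closest (in CE) to the consensus output."""
    losses = [cross_entropy_to_consensus(probs, c) for c in candidates]
    return candidates[int(np.argmin(losses))]

# Toy consensus that strongly favours class 1 everywhere
probs = np.full((4, 4, 2), 0.1); probs[..., 1] = 0.9
good = np.ones((4, 4), dtype=np.int64)   # agrees with the consensus
bad = np.zeros((4, 4), dtype=np.int64)   # contradicts it
chosen = select_pseudo_label(probs, [bad, good])
```

The selected masks then replace the raw manual polygons as training targets for the final model.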
Recent analyses discovered data leakage in WoodScape: temporally sequential frames from the same scene were split between training and test sets. This was addressed by sequence-based splits, which avoid such cross-contamination (Beránek et al., 12 Nov 2025). Further, curated subsets that exclude images with imprecise or inconsistent annotations (the “Correct clear” subset) retain full model accuracy (>94%) while halving the training-set size, underscoring the importance of annotation hygiene (Beránek et al., 12 Nov 2025).
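Leakage-free splitting amounts to partitioning by recording sequence rather than by frame, so no sequence straddles train and test; a minimal sketch (the sequence-id scheme is hypothetical):

```python
import random
from collections import defaultdict

def sequence_split(frames, test_frac=0.3, seed=0):
    """Split frames into train/test by sequence id, not by frame.
    frames: list of (sequence_id, frame_name) tuples."""
    by_seq = defaultdict(list)
    for seq, name in frames:
        by_seq[seq].append(name)
    seqs = sorted(by_seq)
    random.Random(seed).shuffle(seqs)
    n_test = max(1, round(len(seqs) * test_frac))
    test_seqs = set(seqs[:n_test])
    train = [n for s in seqs if s not in test_seqs for n in by_seq[s]]
    test = [n for s in test_seqs for n in by_seq[s]]
    return train, test

# 10 toy sequences of 5 frames each
frames = [(f"seq{i}", f"seq{i}_frame{j}") for i in range(10) for j in range(5)]
train, test = sequence_split(frames)
```

Because near-duplicate consecutive frames land on the same side of the split, test accuracy measured this way reflects generalization rather than memorization.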
4. Network Design, Loss Functions, and Embedded Deployment
Network architectures have been optimized for both accuracy and resource constraints:
- SoildNet employs dynamic group convolution and channel shuffling for 7.5× reduction in parameters and ∼6× lower compute, achieving macro-average F1 = 82.1% and sustaining >60 FPS on automotive SoC (Das, 2019).
- TiledSoilingNet’s coverage decoder is over an order of magnitude faster than FCN-based segmentation decoders, with 2 ms inference time at 500 mW vs. 20 ms at 5 W (Das et al., 2020).
- State-of-the-art segmentation models (e.g., FPNet+ResNet50) achieve up to 94.5% pixel accuracy on the public WoodScape data, with early stopping due to rapid convergence even on reduced datasets (Beránek et al., 12 Nov 2025).
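The channel-shuffle operation used by such compact designs interleaves channels across groups so that information mixes between consecutive group convolutions; a numpy sketch of the standard ShuffleNet-style reshape-transpose trick:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups (ShuffleNet-style).
    x: (N, C, H, W) feature map with C divisible by groups."""
    n, c, h, w = x.shape
    assert c % groups == 0
    # (N, G, C/G, H, W) -> swap group and channel axes -> flatten back
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))

x = np.arange(8).reshape(1, 8, 1, 1)   # channels 0..7, two groups of four
y = channel_shuffle(x, groups=2)
```

The shuffle itself is parameter-free, which is why group convolution plus shuffling can cut parameters and compute so sharply without isolating the groups from each other.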
Losses evaluated include categorical cross-entropy, Dice, focal loss, and pixelwise RMSE (the latter advocated in some industrial contexts (Das et al., 2020)), with categorical cross-entropy consistently yielding optimal or near-optimal results in empirical comparisons (Beránek et al., 12 Nov 2025). The addition of geometric input (e.g., camera tensors encoding intrinsic distortion) as in OmniDet further boosts cross-view generalization in unified multi-task settings (Kumar et al., 2021).
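Encoding camera geometry as extra input channels, as in OmniDet, can be sketched by concatenating normalized pixel-coordinate maps to the image tensor; this is a simplification of the paper's calibrated camera tensor, which additionally encodes the fisheye intrinsics.

```python
import numpy as np

def add_coord_channels(image):
    """Append normalized x/y coordinate channels to an (H, W, 3) image,
    a simplified stand-in for a calibrated camera geometry tensor."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs / (w - 1), ys / (h - 1)], axis=-1)
    return np.concatenate([image, coords], axis=-1)

img = np.zeros((32, 64, 3))
out = add_coord_channels(img)   # (32, 64, 5): RGB + x + y
```

Giving the network an explicit notion of image position lets it learn the strong radial variation in soiling appearance across the fisheye field of view.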
Feature-based approaches relying on handcrafted local statistics (MSCN coefficients, Laplacian, contrast, etc.) followed by SVM classification achieve competitive binary soiling-detection accuracy (95–99%) on limited data and are several orders of magnitude lighter than deep models, but lack the granularity for pixel-precise segmentation or per-type coverage estimation (Bauer, 2023).
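The handcrafted-feature route can be illustrated with MSCN (mean-subtracted contrast-normalized) coefficients, whose statistics collapse on blurred or occluded regions; the resulting summary statistics would then feed a lightweight classifier such as an SVM. This is a generic sketch using a box window, not the exact pipeline of the cited work.

```python
import numpy as np

def mscn_coefficients(image, window=7, c=1.0):
    """Mean-subtracted contrast-normalized coefficients of a grayscale
    image, using a simple box filter for the local mean and std."""
    img = image.astype(np.float64)
    pad = window // 2
    padded = np.pad(img, pad, mode="reflect")
    # Local statistics from sliding windows over the padded image
    win = np.lib.stride_tricks.sliding_window_view(padded, (window, window))
    mu = win.mean(axis=(-2, -1))
    sigma = win.std(axis=(-2, -1))
    return (img - mu) / (sigma + c)

rng = np.random.default_rng(0)
textured = rng.normal(128, 40, size=(32, 32))   # clean, textured scene
flat = np.full((32, 32), 128.0)                 # featureless soiled patch
```

The variance of the MSCN map is near zero on the featureless patch and large on natural texture, which is exactly the kind of scalar feature a binary soiling SVM can separate cheaply.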
5. Data Augmentation, Generalization, and Robustness
Soiling events are relatively rare, and synthetic data generation is essential. GAN-based augmentation is highly effective:
- CycleGAN converts clean to soiled images, and, combined with mask-guided VAE “DirtyGAN,” enables controlled synthesis of novel soiling patterns (Uricar et al., 2019).
- Synthetic–real data mixing yields a +17.8 percentage-point accuracy gain (from 73.95% to 91.71%) in per-pixel detection compared to real-data-only training (Uricar et al., 2019).
- GAN-augmented pipelines generalize to standard automotive (Cityscapes) scenes, with soiling-induced mIoU drops of 11–22 points for segmentation tasks, highlighting substantive downstream impact. Training on soiled data recovers the majority of the accuracy drop (Uricar et al., 2019).
Careful pairing of real and synthetic data is necessary to avoid distributional artifacts. DirtyGAN requires mask priors for effective VAE embedding; standard CycleGANs may modify backgrounds undesirably. Control over spatio-temporal soiling dynamics (e.g., moving water drops) remains a future challenge (Uricar et al., 2019).
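A common way to mix the two sources without letting synthetic artifacts dominate is to fix the synthetic share per batch; a minimal sampler sketch (the 25% ratio is illustrative, not a value from the cited papers):

```python
import random

def mixed_batch(real_items, synthetic_items, batch_size=8,
                synthetic_frac=0.25, seed=0):
    """Draw one training batch with a fixed share of synthetic samples."""
    rng = random.Random(seed)
    n_syn = int(batch_size * synthetic_frac)
    batch = (rng.sample(synthetic_items, n_syn)
             + rng.sample(real_items, batch_size - n_syn))
    rng.shuffle(batch)   # avoid a fixed real/synthetic ordering
    return batch

real = [("real", i) for i in range(100)]
syn = [("syn", i) for i in range(100)]
batch = mixed_batch(real, syn)   # 6 real + 2 synthetic samples
```

Keeping the ratio explicit also makes it easy to ablate how much synthetic data actually helps versus when its distributional artifacts start to hurt.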
6. Key Challenges, Limitations, and Open Problems
Automotive soiling detection must contend with:
- Sensor–environment interaction: Fisheye radial distortion leads to strong spatial variance in soiling appearance (Yogamani et al., 2019). Illumination changes cause specular and refractive ambiguities (e.g., water/shimmer vs. sky).
- Annotation ambiguity: Boundaries between transparent, semi-transparent, and opaque remain subjective; precision limits model performance unless ensemble or human-in-the-loop refinement is applied (Uricar et al., 2021).
- Computational constraints: Real-time performance on SoC is mandatory—compact architectures (group conv, coverage regression) dominate for embedded deployment (Das, 2019, Das et al., 2020).
- Generalization and bias: Dataset bias (limited soiling patterns/weather), class imbalance, and overfitting present open problems. Periodic retraining or domain adaptation is recommended for fielded systems (Bauer, 2023).
- Integration with perception stacks: Soiling masks must be integrated to trigger cleaning, re-weight confidence maps, and ensure selective disabling of contaminated vision channels.
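The integration point can be sketched as suppressing detections whose image footprint is mostly soiled and flagging the frame for cleaning once coverage passes a threshold; both thresholds below are illustrative, not values from the literature.

```python
import numpy as np

def gate_detections(soiling_mask, boxes, max_soiled_frac=0.3):
    """Drop detections whose box area is mostly soiled.
    soiling_mask: (H, W) bool; boxes: list of (x0, y0, x1, y1)."""
    kept = []
    for x0, y0, x1, y1 in boxes:
        patch = soiling_mask[y0:y1, x0:x1]
        if patch.size and patch.mean() <= max_soiled_frac:
            kept.append((x0, y0, x1, y1))
    return kept

def needs_cleaning(soiling_mask, threshold=0.2):
    """Trigger the lens-cleaning actuator above a coverage threshold."""
    return bool(soiling_mask.mean() > threshold)

mask = np.zeros((100, 100), dtype=bool)
mask[:, :50] = True                       # left half of the frame soiled
clean_box, dirty_box = (60, 10, 90, 40), (0, 10, 30, 40)
kept = gate_detections(mask, [clean_box, dirty_box])
```

In practice the same mask can also down-weight per-pixel confidence maps rather than hard-dropping detections, which degrades the perception stack more gracefully.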
No fisheye-specific convolution layers have yet been widely adopted, though several works advocate distortion-aware modules and geometry-encoded feature maps as future research directions (Yogamani et al., 2019, Kumar et al., 2021).
7. Comparative Assessment and Recommendations
- Semantic segmentation models (e.g., FPNet, U-Net++) outperform tile-level and feature-based baselines for pixel-accurate four-class soiling detection (94.0–94.5% accuracy) on the public WoodScape (Beránek et al., 12 Nov 2025).
- Tile-based classification and coverage regression offer fast, localized soiling assessment and are beneficial for edge-compute deployment or for partial-system failover (Das et al., 2020, Das, 2019).
- Ensemble-label refinement and curated data splits mitigate annotation noise, with up to +15 pp mean IoU improvements in ambiguous classes (Uricar et al., 2021, Beránek et al., 12 Nov 2025). Removing inconsistent masks can halve training time without loss of accuracy.
- GAN-based augmentation is a critical enabler for robust training against rare soiling types, with >17% accuracy gains over real-only models and clear improvement to downstream perception robustness (Uricar et al., 2019).
Best-practice recommendations are: apply curated per-pixel semantic segmentation models with cross-entropy loss, leverage ensemble or pseudo-label refinement, and include GAN-based synthetic data where training data is limited in diversity or volume (Beránek et al., 12 Nov 2025, Uricar et al., 2021, Uricar et al., 2019). For embedded deployment, coverage regression or compact group-convolution architectures enable low-latency inference (Das, 2019, Das et al., 2020).
References:
- (Yogamani et al., 2019): "WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving"
- (Uricar et al., 2019): "SoilingNet: Soiling Detection on Automotive Surround-View Cameras"
- (Das, 2019): "SoildNet: Soiling Degradation Detection in Autonomous Driving"
- (Das et al., 2020): "TiledSoilingNet: Tile-level Soiling Detection on Automotive Surround-view Cameras Using Coverage Metric"
- (Kumar et al., 2021): "OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving"
- (Uricar et al., 2021): "Ensemble-based Semi-supervised Learning to Improve Noisy Soiling Annotations in Autonomous Driving"
- (Bauer, 2023): "A Feature-based Approach for the Recognition of Image Quality Degradation in Automotive Applications"
- (Beránek et al., 12 Nov 2025): "Soiling detection for Advanced Driver Assistance Systems"
- (Uricar et al., 2019): "Let's Get Dirty: GAN Based Data Augmentation for Camera Lens Soiling Detection in Autonomous Driving"