SuperSimpleNet: Efficient Defect Detection
- SuperSimpleNet is a discriminative convolutional neural network architecture that unifies surface defect detection across unsupervised, weakly supervised, mixed, and fully supervised settings.
- It incorporates latent-space synthetic anomaly generation and a dual-head design for segmentation and classification to robustly address diverse annotation challenges.
- The model achieves state-of-the-art accuracy with sub-10 ms inference latency, making it ideal for real-time industrial quality control.
SuperSimpleNet is a discriminative convolutional neural network architecture designed for efficient, high-accuracy surface defect detection across all supervision regimes: unsupervised, weakly supervised, mixed supervision, and fully supervised learning. Developed as an extension of SimpleNet, it introduces latent-space synthetic anomaly generation, a dual-head design (segmentation and classification), and a unified training procedure to accommodate diverse annotation scenarios encountered in industrial quality inspection. SuperSimpleNet achieves state-of-the-art accuracy and sub-10 ms inference latency, operationalizing a single architecture and codepath across the full spectrum of manufacturing data annotation regimes (Rolih et al., 26 Aug 2025, Rolih et al., 2024).
1. Network Architecture
SuperSimpleNet employs a WideResNet-50 backbone pretrained on ImageNet as a frozen encoder. Intermediate feature maps from layers 2 and 3 are bilinearly upsampled to a common maximum spatial resolution. The upsampled features are concatenated channel-wise, and local spatial context is encoded via average pooling (stride 1).
A 1×1 convolution adaptor then projects the pooled features to a latent representation optimized for pixel-wise segmentation. During training, both feature maps undergo latent-space synthetic anomaly injection (see Section 2).
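The feature pipeline described above (upsample, concatenate, average-pool, 1×1 adaptor) can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the channel counts, the target resolution `out_size`, the 3×3 pooling window, the edge padding, and all function names are assumptions chosen for the example.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinearly resize a (C, H, W) array to (C, out_h, out_w)."""
    _, h, w = x.shape
    ys = np.linspace(0.0, h - 1.0, out_h)  # output rows in input coordinates
    xs = np.linspace(0.0, w - 1.0, out_w)  # output columns in input coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    rows0, rows1 = x[:, y0, :], x[:, y1, :]
    top = rows0[:, :, x0] * (1 - wx) + rows0[:, :, x1] * wx
    bottom = rows1[:, :, x0] * (1 - wx) + rows1[:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

def avg_pool_3x3(x):
    """Average pooling with stride 1; the 3x3 window and edge padding are assumptions."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def conv_1x1(x, weight):
    """1x1 convolution: a per-pixel linear map over the channel axis."""
    return np.einsum("oc,chw->ohw", weight, x)

def extract_features(layer2, layer3, adaptor_weight, out_size):
    """Upsample both maps, concatenate channel-wise, pool, then adapt."""
    up2 = bilinear_resize(layer2, out_size, out_size)
    up3 = bilinear_resize(layer3, out_size, out_size)
    fused = np.concatenate([up2, up3], axis=0)
    pooled = avg_pool_3x3(fused)
    adapted = conv_1x1(pooled, adaptor_weight)
    return pooled, adapted

# Toy stand-ins for backbone outputs (real maps have far more channels).
rng = np.random.default_rng(0)
layer2 = rng.normal(size=(8, 16, 16))
layer3 = rng.normal(size=(16, 8, 8))
adaptor_weight = rng.normal(size=(24, 24))
pooled, adapted = extract_features(layer2, layer3, adaptor_weight, out_size=32)
print(pooled.shape, adapted.shape)
```

In practice the same steps would run on GPU tensors inside a deep learning framework; the NumPy version only makes the order of operations and the tensor shapes explicit.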
The architecture comprises two heads:
- Segmentation head: A 3×3 convolution plus batch normalization (no activation), yielding a single-channel anomaly score map.
- Classification head: During training, concatenates the perturbed feature maps, processes them with a 5×5 convolutional block (conv–BN–ReLU), applies global max-pooling to the convolution output and to a second map, concatenates the pooled results, and maps them through a fully connected layer to a scalar anomaly score, followed by a sigmoid that gives the anomaly probability. At inference, the unperturbed features are used.
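The final steps of the classification head (global max-pooling, concatenation, fully connected layer, sigmoid) might look like the sketch below. It assumes that the second pooled tensor is the segmentation head's anomaly map, which is an assumption made for illustration; the function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classification_score(conv_out, anomaly_map, fc_weight, fc_bias):
    """Return (logit, probability) for one image.

    conv_out:    (C, H, W) output of the convolutional block
    anomaly_map: (1, H, W) map pooled alongside it (assumed to be the
                 segmentation output; this choice is an assumption)
    fc_weight:   (C + 1,) weights of the fully connected layer
    """
    pooled_conv = conv_out.max(axis=(1, 2))     # global max-pool -> (C,)
    pooled_map = anomaly_map.max(axis=(1, 2))   # global max-pool -> (1,)
    descriptor = np.concatenate([pooled_conv, pooled_map])
    logit = float(descriptor @ fc_weight + fc_bias)
    return logit, float(sigmoid(logit))

# Toy example: only the pooled anomaly-map value drives the score.
conv_out = np.zeros((4, 8, 8))
anomaly_map = np.zeros((1, 8, 8))
anomaly_map[0, 2, 3] = 3.0
fc_weight = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
logit, prob = classification_score(conv_out, anomaly_map, fc_weight, fc_bias=0.0)
print(logit, prob)
```

Max-pooling makes the image-level score sensitive to the single strongest local response, which suits small defects that occupy few pixels.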
Apart from the frozen backbone, the model adds approximately 2 million trainable parameters. The anomaly synthesis module is only active during training; inference proceeds with a single forward pass through the backbone and two lightweight heads (Rolih et al., 26 Aug 2025, Rolih et al., 2024).
2. Latent-Space Synthetic Anomaly Generation
To address label scarcity and enable effective training under any supervision regime, SuperSimpleNet synthesizes anomalies directly in its internal feature maps. This is accomplished by the following procedure:
- Generate Perlin noise and threshold it to create a binary anomaly region mask, with 1 marking synthetic anomaly pixels and 0 marking all others.
- In mixed and fully supervised modes, the threshold is chosen to produce smaller anomalies; in weakly supervised and unsupervised modes, it is dataset-dependent.
- Remove real-anomaly pixels from the synthetic mask using the ground truth. In settings without pixel masks, the ground-truth mask is treated as empty, so the synthetic mask is left unchanged.
- Draw Gaussian noise and mask it so that it is non-zero only inside the synthetic anomaly region.
- Perturb the features by adding the masked noise to each of the two feature maps.
- Each batch uses two independent perturbations to stabilize optimization.
This mechanism robustly bridges the domain gap between synthetic training defects and real-world anomalies, especially under scarce or absent pixel-level supervision. It also enables self-training of the segmentation branch in weakly supervised settings (Rolih et al., 26 Aug 2025, Rolih et al., 2024).
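The procedure above can be sketched as follows. This is an illustrative approximation under stated assumptions: the blocky random field stands in for true Perlin noise, and the `threshold`, `sigma`, and `cell` values as well as the function names are hypothetical rather than taken from the paper.

```python
import numpy as np

def synthetic_anomaly_mask(h, w, threshold, rng, cell=4):
    """Binary mask from thresholded coarse noise (a blocky stand-in for Perlin noise)."""
    coarse = rng.random((h // cell + 1, w // cell + 1))
    noise = np.repeat(np.repeat(coarse, cell, axis=0), cell, axis=1)[:h, :w]
    return (noise > threshold).astype(np.float64)

def inject_anomalies(features, gt_mask, threshold, sigma, rng):
    """Add masked Gaussian noise to a (C, H, W) feature map.

    gt_mask is a binary (H, W) array of real anomaly pixels, or None when
    no pixel-level labels exist (the mask is then used unchanged).
    """
    _, h, w = features.shape
    mask = synthetic_anomaly_mask(h, w, threshold, rng)
    if gt_mask is not None:
        mask = mask * (1.0 - gt_mask)  # keep synthetic anomalies off real ones
    noise = rng.normal(0.0, sigma, size=features.shape) * mask[None]
    return features + noise, mask

rng = np.random.default_rng(0)
features = np.zeros((4, 16, 16))
gt_mask = np.zeros((16, 16))
gt_mask[:8, :] = 1.0  # pretend the top half contains a real defect
perturbed, mask = inject_anomalies(features, gt_mask, threshold=0.5, sigma=1.0, rng=rng)
print(int(mask.sum()), "synthetic anomaly pixels")
```

Calling `inject_anomalies` twice with the same inputs yields two independent perturbations, mirroring the two-perturbations-per-batch scheme described in the list. The returned mask doubles as the segmentation target for the synthetic region.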
3. Dual-Headed Output and Unified Loss Function
The segmentation head outputs the pixel-wise soft anomaly map. The classification head produces a scalar image-level anomaly score, which a sigmoid converts into an anomaly probability.
Training uses a composite loss:
- Segmentation loss: A truncated L1 loss enforces a soft margin at each pixel. It is computed from the predicted logit, the ground-truth mask value, and a margin threshold; the mean over the spatial domain gives the segmentation term.
- A focal loss is applied to both segmentation and classification outputs to counter class imbalance.
- The aggregated loss combines these terms, with a per-image weight that takes one value for fully (pixel-)labeled images and another for weakly labeled anomalous images.
- Pixel-wise weights from a distance transform boost central anomaly pixels.
This loss unifies all annotation scenarios and automatically adapts as label granularity varies across the dataset (Rolih et al., 26 Aug 2025, Rolih et al., 2024).
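A hedged sketch of how such a composite loss could be assembled is given below. The hinge-style form of the truncated L1 term, the margin `th=0.5`, the focal-loss defaults (`gamma=2.0`, `alpha=0.25`), the way the terms are summed, and the `use_seg` switch are all assumptions made for illustration; the distance-transform pixel weights are omitted.

```python
import numpy as np

def truncated_l1(logits, mask, th=0.5):
    """Mean per-pixel margin penalty (assumed form).

    Anomalous pixels (mask == 1) are penalized until their logit reaches +th,
    normal pixels until their logit reaches -th; beyond that the loss is zero.
    """
    per_pixel = np.where(mask > 0,
                         np.maximum(0.0, th - logits),
                         np.maximum(0.0, th + logits))
    return float(per_pixel.mean())

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss on raw logits, averaged over all elements."""
    p = 1.0 / (1.0 + np.exp(-logits))
    p_t = np.where(targets > 0, p, 1.0 - p)
    alpha_t = np.where(targets > 0, alpha, 1.0 - alpha)
    log_p_t = np.log(np.clip(p_t, 1e-12, 1.0))
    return float((-alpha_t * (1.0 - p_t) ** gamma * log_p_t).mean())

def total_loss(seg_logits, seg_mask, cls_logit, cls_label, use_seg):
    """Combine segmentation and classification terms (hypothetical aggregation).

    use_seg is 1.0 when a trustworthy pixel mask exists and 0.0 otherwise.
    """
    seg = truncated_l1(seg_logits, seg_mask) + focal_loss(seg_logits, seg_mask)
    cls = focal_loss(np.array([cls_logit]), np.array([cls_label]))
    return use_seg * seg + cls

seg_logits = np.array([[2.0, -2.0], [0.1, -0.1]])
seg_mask = np.array([[1, 0], [1, 0]])
print(total_loss(seg_logits, seg_mask, cls_logit=1.5, cls_label=1, use_seg=1.0))
```

Setting `use_seg=0.0` leaves only the classification term, which is one simple way to let image-level labels train the model when no pixel mask is available.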
4. Supervision Regimes and Training Paradigms
SuperSimpleNet is the first model to fully support training under unsupervised (defect-free), weakly supervised (image-level only), mixed-supervision (partial pixel masks), and fully supervised (exhaustive masks) regimes in a single architecture:
- Unsupervised: Only normal images; trains on synthetic masked anomalies.
- Weakly supervised: Image-level anomaly tags without pixel masks; the segmentation head receives only synthetic masks, while the classification head uses real and synthetic image-level labels (anomalous images use the weakly labeled setting of the loss weight from Section 3).
- Mixed: Some images with masks; heads train according to available mask/alignment.
- Fully supervised: All anomalies with masks; both heads train on real and synthetic anomalies.
Training uses AdamW (batch size 32, 300 epochs) with separate learning rates for the heads and the adaptor, plus weight decay. The learning rate is reduced at epochs 240 and 270. Gradients are clipped at norm 1 for stability. Synthetic anomaly injection is active during training only, affecting 50% of training samples (Rolih et al., 26 Aug 2025, Rolih et al., 2024).
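The schedule-related parts of this recipe can be sketched in plain Python. The milestones (240, 270), the clipping norm of 1, and the 50% injection rate come from the text; the base learning rate and the decay factor `gamma=0.1` are hypothetical placeholders, and the helper names are invented for the example.

```python
import math
import random

def lr_at_epoch(base_lr, epoch, milestones=(240, 270), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone reached.

    gamma is a placeholder value, not one taken from the paper.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]

def should_inject(rng, probability=0.5):
    """Decide per training sample whether to apply synthetic anomaly injection."""
    return rng.random() < probability

rng = random.Random(0)
base_lr = 1e-4  # hypothetical value for illustration
for epoch in (0, 239, 240, 270, 299):
    print(epoch, lr_at_epoch(base_lr, epoch))
print(clip_by_global_norm([3.0, 4.0]))
print(sum(should_inject(rng) for _ in range(1000)), "of 1000 samples injected")
```

In a framework-based training loop, the heads and the adaptor would typically be placed in separate optimizer parameter groups so that each group can carry its own learning rate while sharing the same milestones.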
5. Experimental Evaluation
SuperSimpleNet was benchmarked on SensumSODF and KSDD2 (supervised) and on MVTec AD and VisA (unsupervised), with dataset-specific resolutions. Metrics included image-level AUROC for detection, pixel-level AUPRO for localization, and average precision for detection and localization ($AP_{det}$, $AP_{loc}$).
| Regime/Dataset | Detection metric | Detection value | Localization metric | Localization value |
|---|---|---|---|---|
| SensumSODF (full sup) | AUROC | 98.0% | AUPRO | 95.8% |
| KSDD2 (full sup) | $AP_{det}$ | 97.8% | $AP_{loc}$ | 81.3% |
| SensumSODF (weak) | AUROC | 97.4% | AUPRO | 92.8% |
| MVTec AD (unsup) | AUROC | 98.3% | AUPRO | 91.2% |
| VisA (unsup) | AUROC | 93.6% | AUPRO | 87.4% |
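For readers unfamiliar with the detection metric, image-level AUROC can be computed directly from anomaly scores and binary labels as the probability that a randomly chosen anomalous image scores higher than a randomly chosen normal one. The sketch below uses this pairwise definition; the scores and labels are made-up toy data, not results from the paper.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: two anomalous images (label 1) and two normal images (label 0).
scores = [0.9, 0.2, 0.8, 0.1]
labels = [1, 1, 0, 0]
print(auroc(scores, labels))
```

The pairwise form is quadratic in the number of images, which is fine for small test sets; rank-based implementations give the same value more efficiently on large ones.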
SuperSimpleNet is the only method simultaneously achieving sub-10 ms latency (9.5 ms per 256×256 image on a V100S GPU) and supporting all four supervision settings. Model parameter count is ≈34M (dominated by the backbone) and inference memory usage is ~0.5 GB at standard resolution (Rolih et al., 26 Aug 2025).
6. Deployment, Applicability, and Limitations
The architecture is optimized for speed and deployment: the backbone stays frozen, the anomaly synthesis module is disabled at inference, and only the two small heads run on top of the backbone features. This gives a single code path across all annotation regimes, with no generative back-projection or memory-bank lookups, and supports real-time applications (262 fps at 256×256 inputs).
Key industrial advantages include:
- Efficiency: Minimal inference latency, low memory/compute footprint.
- Adaptability: Handles transitions between unsupervised startup, incremental label acquisition, and fully annotated datasets seamlessly.
- Robustness: Latent-space anomaly module bridges annotation gaps, facilitating continuous self-training of the segmentation head (Rolih et al., 26 Aug 2025).
Limitations include reliance on pretrained feature quality, need for backbone-dependent tuning of synthetic noise, and reduced localization accuracy for very small defects (<1% area) without high-res input. The unsupervised setting can also underperform on images with multiple distinct objects (Rolih et al., 2024).
7. Comparative Analysis and Impact
Ablation studies demonstrate that each architectural component (feature upscaling, latent anomaly injection, and the classification head) contributes significantly to aggregate performance. Omitting synthetic anomalies, upscaling, or the classification head degrades detection AUROC and localization AUPRO by 1–4 percentage points on average.
By unifying training and inference across all supervision settings without loss of performance or speed, SuperSimpleNet establishes a new operational standard for industrial surface defect detection (Rolih et al., 26 Aug 2025, Rolih et al., 2024).