SuperSimpleNet: Efficient Defect Detection

Updated 4 May 2026

SuperSimpleNet is a discriminative convolutional neural network architecture that unifies surface defect detection across unsupervised, weakly supervised, mixed, and fully supervised settings.
It incorporates latent-space synthetic anomaly generation and a dual-head design for segmentation and classification to robustly address diverse annotation challenges.
The model achieves state-of-the-art accuracy with sub-10 ms inference latency, making it ideal for real-time industrial quality control.

SuperSimpleNet is a discriminative convolutional neural network architecture designed for efficient, high-accuracy surface defect detection across all supervision regimes: unsupervised, weakly supervised, mixed supervision, and fully supervised learning. Developed as an extension of SimpleNet, it introduces latent-space synthetic anomaly generation, a dual-head design (segmentation and classification), and a unified training procedure to accommodate diverse annotation scenarios encountered in industrial quality inspection. SuperSimpleNet achieves state-of-the-art accuracy and sub-10 ms inference latency, operationalizing a single architecture and codepath across the full spectrum of manufacturing data annotation regimes (Rolih et al., 26 Aug 2025, Rolih et al., 2024).

1. Network Architecture

SuperSimpleNet employs a WideResNet-50 backbone pretrained on ImageNet as a frozen encoder. Intermediate feature maps from layers 2 and 3, denoted $f_2 \in \mathbb{R}^{C_2 \times H_2 \times W_2}$ and $f_3 \in \mathbb{R}^{C_3 \times H_3 \times W_3}$, are bilinearly upsampled ($f_3$ by a factor of 4, $f_2$ by a factor of 2) to a common spatial resolution $(H_0, W_0)$. The upsampled features $F_2$ and $F_3$ are concatenated channel-wise to form $\hat{F} \in \mathbb{R}^{(C_2 + C_3) \times H_0 \times W_0}$. Local spatial context is encoded via $3 \times 3$ average pooling (stride 1), producing $F = \mathrm{AvgPool}_{3\times3}(\hat{F})$.
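
As a rough shape check for the fusion step, the sketch below tracks tensor shapes and applies the pooling to a single toy channel. The concrete sizes (512 and 1024 channels at 32×32 and 16×16 for a 256×256 input) are the usual WideResNet-50 layer-2 and layer-3 outputs and are an assumption here, not figures stated in this article. `fused_shape` and `avg_pool3x3` are hypothetical helper names, and the border handling of the pooling is likewise assumed.

```python
def fused_shape(f2_shape, f3_shape):
    """Shape of the concatenated map after x2 / x4 upsampling; shapes are (C, H, W)."""
    c2, h2, w2 = f2_shape
    c3, h3, w3 = f3_shape
    up2 = (h2 * 2, w2 * 2)  # layer-2 features upsampled by a factor of 2
    up3 = (h3 * 4, w3 * 4)  # layer-3 features upsampled by a factor of 4
    if up2 != up3:
        raise ValueError(f"resolutions differ after upsampling: {up2} vs {up3}")
    return (c2 + c3, up2[0], up2[1])


def avg_pool3x3(channel):
    """3x3 average pooling with stride 1 on one channel (a list of rows).

    Assumption: border cells average over their in-bounds neighbours only.
    """
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [channel[y][x]
                    for y in range(max(0, i - 1), min(h, i + 2))
                    for x in range(max(0, j - 1), min(w, j + 2))]
            out[i][j] = sum(vals) / len(vals)
    return out


# Assumed WideResNet-50 sizes for a 256x256 input (not taken from the article).
print(fused_shape((512, 32, 32), (1024, 16, 16)))  # (1536, 64, 64)

spike = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
print(avg_pool3x3(spike)[1][1])  # 1.0: the centre averages all nine cells
```

The pooling spreads each activation over its neighbourhood, which is how every location in $F$ comes to carry local context before the adaptor is applied.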

A $1 \times 1$ convolution adaptor then projects $F$ to a latent representation $A$ optimized for pixel-wise segmentation. During training, both $F$ and $A$ undergo latent-space synthetic anomaly injection (see Section 2).

The architecture comprises two heads:

Segmentation head ($D_{seg}$): A $3 \times 3$ convolution plus batch normalization (no activation), yielding a single-channel anomaly score map $M_o$.
Classification head ($D_{cls}$): During training, it concatenates the perturbed feature maps, processes them with a $5 \times 5$ convolutional block (conv–BN–ReLU), applies global max-pooling to both the conv output and the anomaly score map $M_o$, concatenates the pooled results, and maps them through a fully connected layer to a scalar anomaly score $s$, followed by a sigmoid that gives the anomaly probability. At inference, the unperturbed features are used.

Apart from the frozen backbone, the model adds approximately 2 million trainable parameters. The anomaly synthesis module is only active during training; inference proceeds with a single forward pass through the backbone and two lightweight heads (Rolih et al., 26 Aug 2025, Rolih et al., 2024).

2. Latent-Space Synthetic Anomaly Generation

To address label scarcity and enable effective training under any supervision regime, SuperSimpleNet synthesizes anomalies directly in its internal feature maps, the pooled backbone features $F$ and the adapted features $A$. The procedure is as follows:

Generate Perlin noise $P$ and threshold it at $\tau$ to create a binary anomaly region mask $M_p$: $M_p(i,j) = 1$ if $P(i,j) > \tau$, else $0$.
In mixed and fully supervised modes, $\tau$ is set so that the synthetic anomalies are smaller; in weakly supervised and unsupervised modes, $\tau$ is dataset-dependent.
Remove real-anomaly pixels using the ground-truth mask $M_{gt}$: $M = M_p \odot (1 - M_{gt})$. For settings without pixel masks, $M_{gt} = 0$, so $M = M_p$.
Draw Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and mask it so that $\epsilon_M(i,j) = \epsilon(i,j)$ if $M(i,j) = 1$, and zero otherwise.
Perturb the features: $\tilde{F} = F + \epsilon_M$, $\tilde{A} = A + \epsilon_M$.
Each batch uses two independent perturbations to stabilize optimization.
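
The steps above can be sketched in plain Python on nested lists. This is a minimal illustration under stated assumptions, not the authors' implementation: uniform random noise stands in for Perlin noise, the threshold and standard deviation are placeholder values, and `synthesize_anomaly` is a hypothetical name.

```python
import random


def synthesize_anomaly(features, noise_map, gt_mask, threshold, sigma, rng):
    """Perturb a (C, H, W) feature map inside a synthetic anomaly region.

    noise_map: (H, W) noise in [0, 1]; the article uses Perlin noise, but any
               2D noise field works for this sketch.
    gt_mask:   (H, W) real-anomaly mask, or None when no pixel labels exist.
    """
    h, w = len(noise_map), len(noise_map[0])
    # Binary region mask: 1 where the noise exceeds the threshold.
    mask = [[1 if noise_map[i][j] > threshold else 0 for j in range(w)]
            for i in range(h)]
    # Never place a synthetic anomaly on top of a labelled real one.
    if gt_mask is not None:
        mask = [[mask[i][j] * (1 - gt_mask[i][j]) for j in range(w)]
                for i in range(h)]
    # Add Gaussian noise only inside the masked region.
    perturbed = [
        [[features[c][i][j] + (rng.gauss(0.0, sigma) if mask[i][j] else 0.0)
          for j in range(w)] for i in range(h)]
        for c in range(len(features))
    ]
    return perturbed, mask


rng = random.Random(0)
h, w, channels = 4, 4, 2
features = [[[1.0] * w for _ in range(h)] for _ in range(channels)]
# Stand-in for Perlin noise: plain uniform noise (assumption, for brevity only).
noise_map = [[rng.random() for _ in range(w)] for _ in range(h)]
gt_mask = [[1 if (i, j) == (0, 0) else 0 for j in range(w)] for i in range(h)]

# threshold and sigma are placeholder values, not the article's settings.
perturbed, mask = synthesize_anomaly(features, noise_map, gt_mask,
                                     threshold=0.5, sigma=0.1, rng=rng)
```

The returned `mask` doubles as the segmentation target for the synthetic region, which is why no pixel annotation is needed to train the segmentation head on these samples.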

This mechanism robustly bridges the domain gap between synthetic training defects and real-world anomalies, especially under scarce or absent pixel-level supervision. It also enables self-training of the segmentation branch in weakly supervised settings (Rolih et al., 26 Aug 2025, Rolih et al., 2024).

3. Dual-Headed Output and Unified Loss Function

The segmentation head outputs the pixel-wise soft anomaly map $M_o$. The classification head produces a scalar image-level anomaly score $s$, converted to an anomaly probability $p = \sigma(s)$, where $\sigma$ denotes the sigmoid.

Training uses a composite loss:

Segmentation loss: A truncated $L_1$ loss enforces a soft margin at each pixel,

$\ell_{ij} = \begin{cases} \max(0,\, th - M_o(i,j)) & \text{if } M(i,j) = 1 \\ \max(0,\, th + M_o(i,j)) & \text{otherwise} \end{cases}$

where $M_o(i,j)$ is the predicted logit, $M(i,j)$ the mask, and $th$ the threshold. The mean over the spatial domain yields $\mathcal{L}_{1}$.

A focal loss, $\mathcal{L}_{foc}$, is applied for class imbalance to both segmentation and classification outputs.
The aggregated loss is:

$\mathcal{L} = \gamma \, (\mathcal{L}_{1} + \mathcal{L}_{foc}^{seg}) + \mathcal{L}_{foc}^{cls}$

where $\gamma = 1$ for fully/pixel-labeled images and $\gamma = 0$ for weakly labeled anomalies.

Pixel-wise weights from a distance transform boost central anomaly pixels.

This loss unifies all annotation scenarios and automatically adapts as label granularity varies across the dataset (Rolih et al., 26 Aug 2025, Rolih et al., 2024).
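
The loss described in this section can be sketched as follows. This is an illustrative reading rather than the reference implementation: the margin default `th=0.5` and the focal focusing parameter of 2 are assumed values, the distance-transform pixel weights are omitted, and the function names are hypothetical.

```python
import math


def truncated_l1(logits, mask, th):
    """Mean truncated L1 loss over a flat list of pixel logits.

    Anomalous pixels (mask 1) are pushed above +th and normal pixels below -th;
    pixels already beyond their margin contribute zero.
    """
    per_pixel = [max(0.0, th - s) if m == 1 else max(0.0, th + s)
                 for s, m in zip(logits, mask)]
    return sum(per_pixel) / len(per_pixel)


def focal_loss(logits, targets, focusing=2.0, eps=1e-12):
    """Mean binary focal loss (no class weighting) computed from raw logits."""
    total = 0.0
    for s, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-s))   # sigmoid
        p_t = p if t == 1 else 1.0 - p   # probability assigned to the true class
        total += -((1.0 - p_t) ** focusing) * math.log(max(p_t, eps))
    return total / len(logits)


def total_loss(seg_logits, seg_mask, cls_logit, cls_label, gamma, th=0.5):
    """gamma gates the segmentation terms: 1 with pixel labels, 0 without."""
    seg = truncated_l1(seg_logits, seg_mask, th) + focal_loss(seg_logits, seg_mask)
    cls = focal_loss([cls_logit], [cls_label])
    return gamma * seg + cls


# Two pixels are already beyond the margin, two sit exactly at zero.
print(truncated_l1([1.0, -1.0, 0.0, 0.0], [1, 0, 1, 0], th=0.5))  # 0.25
```

Setting `gamma=0` leaves only the classification term, so an image with an image-level label but no mask still contributes a training signal.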

4. Supervision Regimes and Training Paradigms

SuperSimpleNet is the first model to fully support training under unsupervised (defect-free), weakly supervised (image-level only), mixed-supervision (partial pixel masks), and fully supervised (exhaustive masks) regimes in a single architecture:

Unsupervised: Only normal images; trains on synthetic masked anomalies.
Weakly supervised: Image-level anomaly tags without pixel masks; the segmentation head receives only synthetic masks, and the classification head uses real and synthetic image-level labels (the segmentation loss is switched off on real anomalies).
Mixed: Some images with masks; heads train according to available mask/alignment.
Fully supervised: All anomalies with masks; both heads train on real and synthetic anomalies.
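
One way to read the four regimes is as a per-image decision about which loss terms apply. The sketch below encodes that reading; `loss_gates` is a hypothetical helper, and treating every normal image as having an implicit all-zero mask is an assumption of this sketch.

```python
def loss_gates(is_anomalous, has_pixel_mask):
    """Return which loss terms apply to one real training image.

    gamma = 1 enables the segmentation loss, gamma = 0 disables it.
    """
    if is_anomalous and not has_pixel_mask:
        # Weakly labelled anomaly: the defect location is unknown, so only
        # the image-level label can supervise this sample.
        gamma = 0
    else:
        # Normal images (implicit all-zero mask) and pixel-labelled anomalies.
        gamma = 1
    return {"gamma": gamma, "classification": True}


# (is_anomalous, has_pixel_mask) pairs that can occur in each regime.
regimes = {
    "unsupervised": [(False, False)],
    "weakly supervised": [(False, False), (True, False)],
    "mixed": [(False, False), (True, False), (True, True)],
    "fully supervised": [(False, False), (True, True)],
}
for name, samples in regimes.items():
    print(name, [loss_gates(a, m)["gamma"] for a, m in samples])
```

Because the decision is made per image, a dataset can move from one regime to another as annotations arrive without any change to the training loop.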

Training uses AdamW (batch size 32, 300 epochs), with separate learning rates for the heads and for the adaptor, and with weight decay. The learning rate is reduced by a constant factor at epochs 240 and 270. Gradients are clipped at norm 1 for stability. Synthetic anomaly injection is active during training only, affecting 50% of training samples (Rolih et al., 26 Aug 2025, Rolih et al., 2024).
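
The step schedule and gradient clipping can be illustrated with a few lines of plain Python. The milestones (240, 270) and the clipping norm of 1 come from the text above; the base learning rate and decay factor used below are placeholders rather than the published settings, and both function names are hypothetical.

```python
import math


def lr_at_epoch(base_lr, epoch, milestones=(240, 270), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


def clip_by_norm(grads, max_norm=1.0):
    """Rescale a flat gradient list so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


# Placeholder base rate and factor, chosen only to make the steps visible.
for epoch in (0, 239, 240, 270, 299):
    print(epoch, lr_at_epoch(1.0, epoch, factor=0.5))

print(clip_by_norm([3.0, 4.0]))  # norm 5 is rescaled to norm 1
```

In a framework-based training loop the same behaviour would normally come from the library's multi-step scheduler and gradient-norm clipping utilities rather than hand-written code.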

5. Experimental Evaluation

SuperSimpleNet was benchmarked on SensumSODF and KSDD2 (supervised) and on MVTec AD and VisA (unsupervised), with dataset-specific resolutions. Metrics included image-level detection (AUROC, $\mathrm{AP}_{det}$) and pixel-level localization (AUPRO, $\mathrm{AP}_{loc}$).

Regime/Dataset	Detection metric	Detection value	Localization metric	Localization value
SensumSODF (fully supervised)	AUROC	98.0%	AUPRO	95.8%
KSDD2 (fully supervised)	$\mathrm{AP}_{det}$	97.8%	$\mathrm{AP}_{loc}$	81.3%
SensumSODF (weakly supervised)	AUROC	97.4%	AUPRO	92.8%
MVTec AD (unsupervised)	AUROC	98.3%	AUPRO	91.2%
VisA (unsupervised)	AUROC	93.6%	AUPRO	87.4%
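
For readers less familiar with the detection metric, image-level AUROC can be computed directly from anomaly scores and binary labels. The sketch below uses the pairwise-ranking formulation; the scores and labels are made-up toy values, not results from the benchmarks above.

```python
def auroc(scores, labels):
    """AUROC as the probability that an anomalous sample outranks a normal one.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: one anomalous image (score 0.3) is ranked below a normal one (0.8).
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; evaluation libraries typically use a sort-based computation instead.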

SuperSimpleNet is the only method simultaneously achieving sub-10 ms latency (9.5 ms per 256×256 image on a V100S GPU) and supporting all four supervision settings. Model parameter count is ≈34M (dominated by the backbone) and inference memory usage is ~0.5 GB at standard resolution (Rolih et al., 26 Aug 2025).
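
To relate the latency figure to a production line, the arithmetic below converts per-image latency into a sequential frame rate and checks it against a cycle-time budget. The line rate of 3000 parts per minute is a hypothetical value chosen for illustration, and both function names are assumptions.

```python
def fps_from_latency(latency_ms):
    """Images per second when images are processed one at a time."""
    return 1000.0 / latency_ms


def fits_cycle_time(latency_ms, parts_per_minute):
    """Check whether one inference fits within the time budget per part."""
    budget_ms = 60_000.0 / parts_per_minute
    return latency_ms <= budget_ms


print(round(fps_from_latency(9.5), 1))  # 105.3
print(fits_cycle_time(9.5, 3000))       # True: the budget is 20 ms per part
```

Throughput measured with batched inference can exceed the sequential rate computed here, so latency and throughput figures are best compared under the same batching conditions.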

6. Deployment, Applicability, and Limitations

The architecture is optimized for speed and deployment: the backbone stays frozen, the anomaly synthesis module is disabled at inference, and only two small heads remain active. This offers a single code path across all annotation regimes, with no generative back-projection or memory-bank lookups, and supports real-time applications (a reported throughput of 262 fps at 256×256 inputs).

Key industrial advantages include:

Efficiency: Minimal inference latency, low memory/compute footprint.
Adaptability: Handles transitions between unsupervised startup, incremental label acquisition, and fully annotated datasets seamlessly.
Robustness: Latent-space anomaly module bridges annotation gaps, facilitating continuous self-training of the segmentation head (Rolih et al., 26 Aug 2025).

Limitations include reliance on pretrained feature quality, need for backbone-dependent tuning of synthetic noise, and reduced localization accuracy for very small defects (<1% area) without high-res input. The unsupervised setting can also underperform on images with multiple distinct objects (Rolih et al., 2024).

7. Comparative Analysis and Impact

Ablation studies demonstrate that each architectural innovation (feature upscaling, latent anomaly injection, and the classification head) contributes significantly to aggregate performance. Omitting synthetic anomalies, upscaling, or the classification head degrades detection AUROC and localization AUPRO by 1–4 percentage points on average.

By unifying training and inference across all supervision settings without loss of performance or speed, SuperSimpleNet establishes a new operational standard for industrial surface defect detection (Rolih et al., 26 Aug 2025, Rolih et al., 2024).

References

Rolih et al. (2025). No Label Left Behind: A Unified Surface Defect Detection Model for all Supervision Regimes.

Rolih et al. (2024). SuperSimpleNet: Unifying Unsupervised and Supervised Learning for Fast and Reliable Surface Defect Detection.
