
SuperSimpleNet: Efficient Defect Detection

Updated 4 May 2026
  • SuperSimpleNet is a discriminative convolutional neural network architecture that unifies surface defect detection across unsupervised, weakly supervised, mixed, and fully supervised settings.
  • It incorporates latent-space synthetic anomaly generation and a dual-head design for segmentation and classification to robustly address diverse annotation challenges.
  • The model achieves state-of-the-art accuracy with sub-10 ms inference latency, making it ideal for real-time industrial quality control.

SuperSimpleNet is a discriminative convolutional neural network architecture designed for efficient, high-accuracy surface defect detection across all supervision regimes: unsupervised, weakly supervised, mixed supervision, and fully supervised learning. Developed as an extension of SimpleNet, it introduces latent-space synthetic anomaly generation, a dual-head design (segmentation and classification), and a unified training procedure to accommodate diverse annotation scenarios encountered in industrial quality inspection. SuperSimpleNet achieves state-of-the-art accuracy and sub-10 ms inference latency, operationalizing a single architecture and codepath across the full spectrum of manufacturing data annotation regimes (Rolih et al., 26 Aug 2025, Rolih et al., 2024).

1. Network Architecture

SuperSimpleNet employs a WideResNet-50 backbone pretrained on ImageNet as a frozen encoder. Intermediate feature maps from layers 2 and 3, denoted $f_2 \in \mathbb{R}^{C_2 \times H_2 \times W_2}$ and $f_3 \in \mathbb{R}^{C_3 \times H_3 \times W_3}$, are bilinearly upsampled ($f_3$ by a factor of 4, $f_2$ by a factor of 2) to a common maximum spatial resolution $(H_0, W_0)$. These upsampled features, $F_2$ and $F_3$, are concatenated channel-wise to form $\hat{F} \in \mathbb{R}^{(C_2 + C_3) \times H_0 \times W_0}$. Local spatial context is encoded via $3 \times 3$ average pooling (stride 1), producing $F = \mathrm{AvgPool}_{3\times3}(\hat{F})$.
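The feature-preparation steps above (upsample, concatenate, average-pool) can be sketched in NumPy. This is a minimal illustration, not the authors' code: the function names, the toy tensor sizes, the half-pixel sampling convention, and the edge-replicated padding used to keep the spatial size are all assumptions.

```python
import numpy as np

def upsample_bilinear(x, scale):
    """Bilinearly upsample a (C, H, W) array by an integer factor (half-pixel centers, assumed)."""
    c, h, w = x.shape
    # Output pixel centers expressed in input coordinates, clamped to the valid range.
    ys = np.clip((np.arange(h * scale) + 0.5) / scale - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * scale) + 0.5) / scale - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    # Interpolate along the width for the two neighbouring rows, then along the height.
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def avg_pool3x3(x):
    """3x3 average pooling, stride 1; edge padding (an assumption) preserves H and W."""
    c, h, w = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros((c, h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += p[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def build_features(f2, f3):
    """f2: (C2, H2, W2), f3: (C3, H2/2, W2/2) -> F: (C2 + C3, 2*H2, 2*W2)."""
    F2 = upsample_bilinear(f2, 2)
    F3 = upsample_bilinear(f3, 4)
    F_hat = np.concatenate([F2, F3], axis=0)  # channel-wise concatenation
    return avg_pool3x3(F_hat)

rng = np.random.default_rng(0)
f2 = rng.normal(size=(4, 8, 8))  # toy stand-in for layer-2 features
f3 = rng.normal(size=(8, 4, 4))  # toy stand-in for layer-3 features
F = build_features(f2, f3)
print(F.shape)  # (12, 16, 16)
```

The toy channel counts are deliberately small; only the shape relationships between $f_2$, $f_3$, and $F$ are meant to carry over.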

A $1 \times 1$ convolution adaptor then projects $F$ to a latent representation optimized for pixel-wise segmentation. During training, both $F$ and this adapted representation undergo latent-space synthetic anomaly injection (see Section 2).

The architecture comprises two heads:

  • Segmentation head: A $3 \times 3$ convolution plus batch normalization (no activation), yielding a single-channel anomaly score map.
  • Classification head: During training, concatenates the perturbed feature maps, processes them with a $5 \times 5$ convolutional block (conv–BN–ReLU), applies global max-pooling to both the conv output and the anomaly score map, concatenates the result, and maps it through a fully connected layer to a scalar anomaly score, followed by a sigmoid for the anomaly probability. At inference, the unperturbed features are used.

Apart from the frozen backbone, the model adds approximately 2 million trainable parameters. The anomaly synthesis module is only active during training; inference proceeds with a single forward pass through the backbone and two lightweight heads (Rolih et al., 26 Aug 2025, Rolih et al., 2024).
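The pooling-and-scoring tail of the classification head can be illustrated with a small NumPy sketch. It assumes the two pooled inputs are the convolutional block's output and the single-channel anomaly score map; `classification_score`, `fc_weight`, and `fc_bias` are hypothetical names, and the numbers are toy values chosen so the result is easy to check by hand.

```python
import numpy as np

def classification_score(conv_out, anomaly_map, fc_weight, fc_bias):
    """conv_out: (C, H, W); anomaly_map: (1, H, W); fc_weight: (C + 1,).

    Returns (logit, probability) for the whole image.
    """
    pooled_feat = conv_out.max(axis=(1, 2))      # global max pool -> (C,)
    pooled_map = anomaly_map.max(axis=(1, 2))    # global max pool -> (1,)
    z = np.concatenate([pooled_feat, pooled_map])
    logit = float(z @ fc_weight + fc_bias)       # fully connected layer to a scalar
    prob = 1.0 / (1.0 + np.exp(-logit))          # sigmoid -> anomaly probability
    return logit, prob

conv_out = np.zeros((2, 3, 3))
conv_out[0, 1, 1] = 2.0                 # channel 0 peaks at 2, channel 1 stays at 0
anomaly_map = np.zeros((1, 3, 3))
anomaly_map[0, 0, 0] = 1.0              # anomaly map peaks at 1
logit, prob = classification_score(
    conv_out, anomaly_map, fc_weight=np.array([0.5, 1.0, -1.0]), fc_bias=0.0
)
print(logit, prob)  # 0.0 0.5
```

Global max-pooling makes the image-level score depend on the strongest local response, which suits small defects that occupy few pixels.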

2. Latent-Space Synthetic Anomaly Generation

To address label scarcity and enable effective training under any supervision regime, SuperSimpleNet synthesizes anomalies directly in its internal feature maps: the pooled features $F$ and their adapted latent representation. This is accomplished by the following procedure:

  • Generate Perlin noise and threshold it to create a binary anomaly region mask: a location is set to 1 where the noise exceeds the threshold and to 0 elsewhere.
  • In mixed/fully supervised modes, the threshold is chosen to produce smaller anomalies; in weak/unsupervised modes, it is dataset-dependent.
  • Remove real-anomaly pixels from the synthetic mask using the ground truth. For settings without pixel masks, the ground-truth mask is empty, so the synthetic mask is left unchanged.
  • Draw Gaussian noise and mask it so that it is kept where the synthetic mask is 1 and is zero otherwise.
  • Perturb features: add the masked noise to $F$ and to the adapted representation.
  • Each batch uses two independent perturbations to stabilize optimization.
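The procedure above can be sketched as follows. This is an illustrative approximation under stated assumptions: a blocky low-resolution random field stands in for Perlin noise, the perturbation is additive zero-mean Gaussian noise, and the `threshold` and `sigma` values are placeholders rather than the paper's settings. `synth_anomaly` is a hypothetical helper name.

```python
import numpy as np

def synth_anomaly(features, noise_field, gt_mask, threshold, sigma, rng):
    """features: (C, H, W); noise_field, gt_mask: (H, W).

    Returns (perturbed_features, synthetic_mask).
    """
    # 1. Threshold the noise field into a binary region mask.
    synth_mask = (noise_field > threshold).astype(float)
    # 2. Never place synthetic anomalies on top of real, annotated ones.
    synth_mask *= (1.0 - gt_mask)
    # 3. Gaussian noise, kept only inside the synthetic mask.
    eps = rng.normal(0.0, sigma, size=features.shape) * synth_mask[None]
    # 4. Additive perturbation of the feature map.
    return features + eps, synth_mask

rng = np.random.default_rng(0)
coarse = rng.random((4, 4))
noise_field = np.kron(coarse, np.ones((4, 4)))  # 16x16 blocky stand-in for Perlin noise
features = rng.normal(size=(3, 16, 16))
gt_mask = np.zeros((16, 16))
gt_mask[:4, :4] = 1.0                            # pretend a real defect sits here

perturbed, mask = synth_anomaly(
    features, noise_field, gt_mask, threshold=0.5, sigma=0.1, rng=rng
)
print(int(mask.sum()), "synthetic anomaly locations")
```

For images without pixel annotations, passing an all-zero `gt_mask` leaves the thresholded mask untouched. Calling the function twice with fresh noise fields gives the two independent perturbations per batch mentioned in the last step.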

This mechanism robustly bridges the domain gap between synthetic training defects and real-world anomalies, especially under scarce or absent pixel-level supervision. It also enables self-training of the segmentation branch in weakly supervised settings (Rolih et al., 26 Aug 2025, Rolih et al., 2024).

3. Dual-Headed Output and Unified Loss Function

The segmentation head outputs the pixel-wise soft anomaly map. The classification head produces a scalar image-level anomaly score, which a sigmoid converts into an anomaly probability.

Training uses a composite loss:

  • Segmentation loss: A truncated $L_1$ loss enforces a soft margin at each pixel: the predicted logit is penalized linearly until it clears a fixed threshold on the side indicated by the mask, after which the penalty is zero. The mean over the spatial domain yields the segmentation loss.

  • A focal loss is applied to both the segmentation and classification outputs to counter class imbalance.
  • The aggregated loss combines the segmentation and classification terms, with a per-image weight on the segmentation term that takes one value for fully/pixel-labeled images and another for weakly labeled anomalies.

  • Pixel-wise weights from a distance transform boost central anomaly pixels.
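One way to picture the distance-transform weighting is the brute-force sketch below, suitable only for small masks. The specific formula (one plus the normalized distance to the nearest background pixel) is an assumption made for illustration; the source does not give the exact weighting function, and `distance_weights` is a hypothetical name.

```python
import numpy as np

def distance_weights(mask):
    """mask: (H, W) binary array. Returns per-pixel weights >= 1.

    Anomaly pixels get 1 + d / d_max, where d is the Euclidean distance
    to the nearest background pixel; background pixels keep weight 1.
    """
    weights = np.ones(mask.shape, dtype=float)
    ys, xs = np.nonzero(mask)
    bys, bxs = np.nonzero(mask == 0)
    if len(ys) == 0 or len(bys) == 0:
        return weights
    # Pairwise distances between anomaly pixels and background pixels.
    d = np.sqrt((ys[:, None] - bys[None, :]) ** 2
                + (xs[:, None] - bxs[None, :]) ** 2).min(axis=1)
    weights[ys, xs] = 1.0 + d / d.max()
    return weights

mask = np.zeros((5, 5))
mask[1:4, 1:4] = 1.0          # a 3x3 defect in the middle
w = distance_weights(mask)
print(w[2, 2], w[1, 1], w[0, 0])  # 2.0 1.5 1.0
```

The center of the defect receives the largest weight, so uncertain annotation boundaries contribute less to the loss than the defect core.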

This loss unifies all annotation scenarios and automatically adapts as label granularity varies across the dataset (Rolih et al., 26 Aug 2025, Rolih et al., 2024).
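The loss terms described in this section can be sketched in NumPy as follows. The sign convention (anomalous pixels pushed above `+th`, normal pixels below `-th`), the threshold `th=0.5`, and the focusing parameter `gamma_f=2.0` are assumptions for illustration; `seg_weight` is left to the caller because its values are not recoverable from the text here.

```python
import numpy as np

def truncated_l1(logits, mask, th=0.5):
    """Per-pixel soft margin: zero penalty once a logit clears the threshold on its correct side."""
    loss_anom = np.maximum(0.0, th - logits) * mask          # anomalous pixels want logit >= +th
    loss_norm = np.maximum(0.0, th + logits) * (1.0 - mask)  # normal pixels want logit <= -th
    return float((loss_anom + loss_norm).mean())

def focal_loss(prob, target, gamma_f=2.0, eps=1e-7):
    """Binary focal loss; easy examples (p_t near 1) are down-weighted."""
    prob = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, prob, 1.0 - prob)
    return float((-((1.0 - p_t) ** gamma_f) * np.log(p_t)).mean())

def total_loss(seg_term, cls_term, seg_weight):
    """Aggregate loss with a per-image weight on the segmentation term."""
    return seg_weight * seg_term + cls_term

logits = np.array([[1.0, -1.0],
                   [0.0,  0.0]])
mask = np.array([[1.0, 0.0],
                 [1.0, 0.0]])
seg = truncated_l1(logits, mask)   # two pixels are satisfied, two pay 0.5 each
print(seg)                          # 0.25
```

Pixels whose logits already clear the margin contribute nothing, which keeps the segmentation head from overfitting to the synthetic anomalies.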

4. Supervision Regimes and Training Paradigms

SuperSimpleNet is the first model to fully support training under unsupervised (defect-free), weakly supervised (image-level only), mixed-supervision (partial pixel masks), and fully supervised (exhaustive masks) regimes in a single architecture:

  • Unsupervised: Only normal images; trains on synthetic masked anomalies.
  • Weakly supervised: Image-level anomaly tags without pixel masks; segmentation head receives only synthetic masks, classification head uses real/synthetic global labels (with the segmentation-loss weight set to its weakly labeled value on anomalies).
  • Mixed: Some images with masks; heads train according to available mask/alignment.
  • Fully supervised: All anomalies with masks; both heads train on real and synthetic anomalies.
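How the training targets differ across the four regimes can be sketched with two small helpers. Both are hypothetical: the union of real and synthetic masks and the "anomalous if either a real or a synthetic anomaly is present" rule are assumptions drawn from the descriptions above, not a confirmed implementation.

```python
import numpy as np

def segmentation_target(synth_mask, gt_mask=None):
    """Target for the segmentation head.

    gt_mask is None when no pixel annotation exists (unsupervised, weak, or the
    unlabeled part of a mixed dataset); the head then sees synthetic masks only.
    """
    if gt_mask is None:
        return synth_mask.copy()
    return np.maximum(synth_mask, gt_mask)  # union of real and synthetic regions

def image_label(is_real_anomaly, synth_mask):
    """Target for the classification head: 1 if any real or synthetic anomaly is present."""
    return int(bool(is_real_anomaly) or bool(synth_mask.any()))

synth = np.array([[0.0, 1.0],
                  [0.0, 0.0]])
real = np.array([[0.0, 0.0],
                 [1.0, 0.0]])

print(segmentation_target(synth))          # weak/unsupervised: synthetic mask only
print(segmentation_target(synth, real))    # full/mixed with mask: union of both
print(image_label(False, np.zeros((2, 2))))  # 0: normal image, no injection
```

Because only the availability of `gt_mask` changes between regimes, a single training loop can serve all four settings.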

Training uses AdamW (batch size 32, 300 epochs). The heads and the adaptor use separate learning rates, and weight decay is applied. The learning rate is reduced by a fixed factor at epochs 240 and 270. Gradients are clipped at norm 1 for stability. Synthetic anomaly injection is active during training only, affecting 50% of training samples (Rolih et al., 26 Aug 2025, Rolih et al., 2024).
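The step schedule and gradient clipping described above can be sketched in plain Python. The milestones (240, 270) and the clipping norm of 1 come from the text; the base learning rate and the decay `factor` used in the demo are placeholders, not the paper's values, and both function names are hypothetical.

```python
import math

def lr_at_epoch(base_lr, epoch, milestones=(240, 270), factor=0.5):
    """Multiply the learning rate by `factor` once for every milestone already reached."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * factor ** drops

def clip_grad_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]

for epoch in (0, 239, 240, 270, 299):
    print(epoch, lr_at_epoch(1.0, epoch))   # 1.0, 1.0, 0.5, 0.25, 0.25

clipped = clip_grad_norm([3.0, 4.0])         # norm 5 is rescaled to norm 1
```

In a PyTorch training loop the same behaviour is usually obtained from a multi-step scheduler and a gradient-norm clipping utility; the functions here only make the arithmetic explicit.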

5. Experimental Evaluation

SuperSimpleNet was benchmarked on SensumSODF, KSDD2 (supervised), and MVTec AD, VisA (unsupervised), with dataset-specific resolutions. Metrics included image-level AUROC and pixel-level localization (AUPRO, $AP_{det}$, $AP_{loc}$).

| Regime/Dataset | Detection metric | Localization metric | Detection value | Localization value |
|---|---|---|---|---|
| SensumSODF (full sup) | AUROC | AUPRO | 98.0% | 95.8% |
| KSDD2 (full sup) | $AP_{det}$ | $AP_{loc}$ | 97.8% | 81.3% |
| SensumSODF (weak) | AUROC | AUPRO | 97.4% | 92.8% |
| MVTec AD (unsup) | AUROC | AUPRO | 98.3% | 91.2% |
| VisA (unsup) | AUROC | AUPRO | 93.6% | 87.4% |
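For readers unfamiliar with the detection metric in the table, image-level AUROC can be computed as the fraction of (anomalous, normal) image pairs that the scores rank correctly. The sketch below is a generic pairwise implementation with made-up scores; it is not the evaluation code used for the reported numbers.

```python
import numpy as np

def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking; ties count as half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]   # anomalous images
    neg = scores[labels == 0]   # normal images
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: one anomalous image (0.35) scores below one normal image (0.4).
value = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(value)  # 0.75
```

Three of the four anomalous/normal pairs are ordered correctly, giving 0.75. The pairwise form is quadratic in the number of images, so sorting-based implementations are preferable for large test sets.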

SuperSimpleNet is the only method simultaneously achieving sub-10 ms latency (9.5 ms per 256×256 image on a V100S GPU) and supporting all four supervision settings. Model parameter count is ≈34M (dominated by the backbone) and inference memory usage is ~0.5 GB at standard resolution (Rolih et al., 26 Aug 2025).

6. Deployment, Applicability, and Limitations

The architecture is optimized for speed and deployment: the backbone stays frozen, the anomaly synthesis module is disabled at inference, and only two small heads remain active. This offers a single code path across all annotation regimes, with no generative back-projection or memory-bank lookups, and supports real-time applications (262 fps at 256×256 inputs).
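The latency and frame-rate figures quoted in this article measure different things, as a quick conversion shows. The numbers below are the ones given in the text. Reading the frame rate as throughput measured over batches is an assumption, not something the source states.

```python
latency_ms = 9.5        # per-image latency quoted for a 256x256 input
reported_fps = 262.0    # frame rate quoted for 256x256 inputs

# Frame rate implied by processing one image at a time at the quoted latency.
fps_from_latency = 1000.0 / latency_ms
# Average time per image implied by the quoted frame rate.
ms_from_fps = 1000.0 / reported_fps

print(round(fps_from_latency, 1))  # 105.3
print(round(ms_from_fps, 2))       # 3.82
```

A single-image latency of 9.5 ms corresponds to roughly 105 images per second, so a rate of 262 fps is only reachable if several images are processed concurrently, for example in a batch.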

Key industrial advantages include:

  • Efficiency: Minimal inference latency, low memory/compute footprint.
  • Adaptability: Handles transitions between unsupervised startup, incremental label acquisition, and fully annotated datasets seamlessly.
  • Robustness: Latent-space anomaly module bridges annotation gaps, facilitating continuous self-training of the segmentation head (Rolih et al., 26 Aug 2025).

Limitations include reliance on pretrained feature quality, need for backbone-dependent tuning of synthetic noise, and reduced localization accuracy for very small defects (<1% area) without high-res input. The unsupervised setting can also underperform on images with multiple distinct objects (Rolih et al., 2024).

7. Comparative Analysis and Impact

Ablation studies demonstrate that each architectural innovation—feature upscaling, latent anomaly injection, classification head—contributes significantly to aggregate performance. Omitting synthetic anomalies, upscaling, or the classification head degrades detection/localization AUROC and AUPRO by 1–4 percentage points on average.

By unifying training and inference across all supervision settings without loss of performance or speed, SuperSimpleNet establishes a new operational standard for industrial surface defect detection (Rolih et al., 26 Aug 2025, Rolih et al., 2024).
