ToCoAD: Two-Stage Contrastive Learning
- The paper introduces a two-stage training paradigm that leverages synthetic anomaly generation and bootstrap contrastive fine-tuning to bridge the domain gap in industrial anomaly detection.
- The methodology enhances pixel-level defect localization using focal loss and SimSiam-based contrastive learning, achieving state-of-the-art performance on benchmarks such as MVTec AD, VisA, and BTAD.
- The approach refines the feature space by repelling anomalous directions and contracting normal feature clusters, offering robust improvements for industrial inspection tasks.
ToCoAD (Two-Stage Contrastive Learning for Industrial Anomaly Detection) is a training paradigm developed to address the limitations of conventional unsupervised anomaly detection (A/D) in industrial contexts, particularly the domain gap between general-purpose feature extractors and the statistics of industrial data. ToCoAD employs synthetic anomaly generation and a bootstrap contrastive fine-tuning regimen, yielding state-of-the-art pixel-level anomaly localization performance on benchmarks such as MVTec AD, VisA, and BTAD (Liang et al., 2024).
1. Motivation: Bridging the Domain Gap in Industrial Anomaly Detection
Most industrial A/D pipelines rely on frozen feature extractors pretrained on ImageNet. However, industrial defects are often subtle and domain-specific, resulting in "normal" and "anomalous" image patches that diverge significantly from standard ImageNet feature manifolds. Such domain gaps severely hamper the discrimination of subtle defect cues; frozen representations typically fail to generalize to the nuanced statistical patterns of industrial samples. ToCoAD addresses this by (i) bootstrapping a discriminative prior using self-supervised synthetic anomalies, and (ii) explicitly repelling anomalous directions during joint contrastive fine-tuning, thus morphing the feature space to be more sensitive and generalizable to industrial anomaly types.
2. Stage I: Discriminative Network Pre-training via Synthetic Anomalies
2.1 Synthetic Anomaly Generation
The synthetic anomaly pipeline begins with a defect-free image $x$. Geometric augmentations (e.g., random rotations) are applied to enrich invariance. An anomaly mask $M$ is generated using Perlin noise (alternatives: CutPaste, Poisson-seamless patches as in NSA, AugMix-style blended anomalies). Optionally, a reference texture patch from datasets such as DTD is masked and pasted onto $x$ to form the synthetic-anomalous image $x_a$ with dense ground-truth mask $M$.
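The pipeline above can be sketched in a few lines. This is a simplified stand-in, not the paper's generator: true Perlin noise is replaced by a coarse random grid upsampled to image resolution and thresholded; `grid`, `thresh`, and the blend factor `beta` are illustrative parameters.

```python
import numpy as np

def synth_anomaly(image, texture, grid=8, thresh=0.7, beta=0.5, rng=None):
    """Corrupt a defect-free image with a texture patch under a noise mask.

    Simplified stand-in for Perlin-noise generation: a coarse random grid
    is nearest-upsampled to image resolution and thresholded into a binary
    mask M; the texture is alpha-blended into the masked region.
    """
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    coarse = rng.random((grid, grid))
    noise = np.kron(coarse, np.ones((h // grid, w // grid)))  # nearest upsample
    mask = (noise > thresh).astype(np.float32)                # ground-truth M
    m = mask[..., None] if image.ndim == 3 else mask
    corrupted = image * (1 - m) + (beta * texture + (1 - beta) * image) * m
    return corrupted, mask

# usage: a flat 64x64 "normal" image corrupted by a random texture patch
img = np.full((64, 64), 0.5, dtype=np.float32)
tex = np.random.default_rng(0).random((64, 64)).astype(np.float32)
x_a, M = synth_anomaly(img, tex, rng=0)
```

Outside the mask, the corrupted image is identical to the input, so $M$ is a pixel-exact dense label for Stage I supervision.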
2.2 Network Architecture and Training Objective
- Feature extractor $\phi$: frozen WideResNet-50, pretrained on ImageNet.
- Decoding head $D$: symmetric (inverse WideResNet) with UNet-style skip connections, outputting the predicted defect mask $\hat{M}$.
- Objective: pixel-wise focal loss to overcome severe class imbalance:

  $\mathcal{L}_{\text{focal}} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$

  where $p_t$ is the predicted probability for the true class per pixel, $\alpha_t$ is the balancing factor, and $\gamma$ modulates the down-weighting of easy examples.
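A minimal numpy sketch of this pixel-wise focal loss follows; the default values of `alpha` and `gamma` are illustrative, not the paper's tuned settings.

```python
import numpy as np

def focal_loss(p, target, alpha=0.75, gamma=2.0, eps=1e-7):
    """Pixel-wise binary focal loss.

    p      : predicted anomaly probability per pixel, shape (H, W)
    target : binary ground-truth mask M, shape (H, W)
    alpha, gamma are illustrative defaults, not the paper's values.
    """
    p = np.clip(p, eps, 1 - eps)
    # p_t: probability assigned to the true class of each pixel
    p_t = np.where(target == 1, p, 1 - p)
    alpha_t = np.where(target == 1, alpha, 1 - alpha)
    # (1 - p_t)^gamma down-weights easy, well-classified pixels
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

Confidently correct pixels contribute almost nothing, so the sparse anomalous pixels dominate the gradient, which is the point of using focal loss under severe class imbalance.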
3. Stage II: Negative-Guided Bootstrap Contrastive Learning
3.1 Architecture and Negative-Feature Guidance
Stage II fixes the weights of the decoding head $D$ and unfreezes the feature extractor $\phi$ for fine-tuning. A SimSiam-style contrastive module is stacked on top:
- Projector: 3-layer MLP.
- Predictor: 1-layer MLP. Feature maps are extracted from conv3 and conv4 layers for patchwise robustness.
For synthetic anomaly images $x_a$, the pretrained $D$ estimates defect masks $\hat{M}$, which function as explicit negative guides. A focal loss penalizes cases where the fine-tuned $\phi$ produces features that $D$ mistakenly recognizes as normal in anomalous regions.
3.2 Contrastive Loss and Bootstrap Negative Loss
- Positive pairs: produced via random augmentations of a normal image; features are projected to $z_1, z_2$ and predicted as $p_1, p_2$.
- Negative "pseudo-pairs": direct negative pairing is avoided; instead, the model penalizes cases where $D$ detects anomalous features that $\phi$ represents as normal.
Contrastive (SimSiam) similarity loss, averaged symmetrically over the two augmented views (with stop-gradient on $z$):

$\mathcal{L}_{\text{sim}} = \frac{1}{2}\left( -\frac{p_1}{\|p_1\|_2} \cdot \frac{z_2}{\|z_2\|_2} - \frac{p_2}{\|p_2\|_2} \cdot \frac{z_1}{\|z_1\|_2} \right)$
The negative bootstrap loss is the focal loss on $D$'s predicted mask for synthetic anomalies:

$\mathcal{L}_{\text{neg}} = \mathcal{L}_{\text{focal}}\big(D(\phi(x_a)),\, M\big)$
Total Stage II loss:

$\mathcal{L} = \mathcal{L}_{\text{sim}} + \lambda\, \mathcal{L}_{\text{neg}}$

where $\lambda$ balances contrastive tightness against defect-feature repulsion.
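The Stage II objective can be sketched as follows; `lam` is an illustrative weight, not the paper's tuned $\lambda$, and the stop-gradient on $z$ is implicit here because the projections are treated as constants.

```python
import numpy as np

def neg_cosine(p, z):
    """SimSiam similarity term: negative cosine between a predictor output p
    and a projection z (z carries stop-gradient in the real training loop)."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(np.dot(p, z))

def stage2_loss(p1, z1, p2, z2, l_neg, lam=0.5):
    """Total Stage II loss: symmetric SimSiam term plus a lambda-weighted
    bootstrap focal loss l_neg computed on synthetic-anomaly masks."""
    l_sim = 0.5 * (neg_cosine(p1, z2) + neg_cosine(p2, z1))
    return l_sim + lam * l_neg
```

When the two views collapse to identical, perfectly aligned features, the similarity term bottoms out at $-1$; the $\lambda$-weighted bootstrap term then decides how strongly anomalous directions are repelled.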
4. Training Algorithm and Key Hyperparameters
Training proceeds sequentially across both stages:
- Stage I: Freeze $\phi$, initialize $D$. In each epoch, batches of normal images are synthetically corrupted, and $D$ learns to predict masks using Adam (learning rate decayed at epochs 80/90).
- Stage II: Freeze $D$, jointly fine-tune $\phi$ and the contrastive module with SGD and a cosine-annealed learning rate; the number of fine-tuning epochs is dataset-dependent (differing, e.g., between MVTec AD and BTAD).
Typical batch size is 16. Focal loss uses a balancing factor $\alpha$ and focusing parameter $\gamma$, and the ablation identifies a best-performing setting of the loss weight $\lambda$. Among synthetic-anomaly generators, Perlin noise performs best, followed by CutPaste and NSA (Liang et al., 2024). SimSiam is used for the contrastive module, so no temperature parameter $\tau$ is required.
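The cosine-annealed schedule used for Stage II SGD can be written as a one-liner; `lr_max` and `lr_min` are placeholders, since the paper's exact learning rates are not reproduced here.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate: decays smoothly from lr_max at
    step 0 to lr_min at total_steps. lr_max/lr_min are placeholders."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

The schedule starts at `lr_max`, passes through the midpoint value halfway through training, and reaches `lr_min` at the final step.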
5. Anomaly Scoring and Inference
Patchwise features from layers 2 and 3 (the conv3/conv4 stages, after fine-tuning) are used to construct a training memory bank $\mathcal{M}$, reduced to 10% by greedy coreset selection (Sener & Savarese, 2017). Inference per test image $x$:
- Slide an overlapping patch grid and extract patch features $f_p$.
- Find the nearest coreset feature $m^{*} = \arg\min_{m \in \mathcal{M}} \|f_p - m\|_2$.
- Compute the raw anomaly score $s = \|f_p - m^{*}\|_2$.
- Reweight confidence using the neighboring bank features $\mathcal{N}_b(m^{*})$:

$s^{*} = \left(1 - \frac{\exp(\|f_p - m^{*}\|_2)}{\sum_{m \in \mathcal{N}_b(m^{*})} \exp(\|f_p - m\|_2)}\right) s$
- Patchwise scores are upsampled to full resolution for pixel-level maps (bilinear, with optional Gaussian smoothing). The image-level score is the maximum patch score, $\max_p s_p^{*}$.
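The coreset construction and nearest-neighbor scoring above can be sketched as follows. This is a minimal illustration: the greedy k-center subsampling follows the Sener & Savarese idea, and the neighborhood size `b` in the softmax reweighting is an assumed illustrative value.

```python
import numpy as np

def greedy_coreset(feats, ratio=0.1, rng=0):
    """Greedy k-center coreset subsampling: repeatedly pick the feature
    farthest from the current selection, keeping ratio * n features."""
    rng = np.random.default_rng(rng)
    n = len(feats)
    k = max(1, int(n * ratio))
    idx = [int(rng.integers(n))]
    d = np.linalg.norm(feats - feats[idx[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))          # farthest point from selection
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(feats - feats[nxt], axis=1))
    return feats[idx]

def patch_scores(test_feats, bank, b=3):
    """Nearest-coreset distance per patch, reweighted by the softmax of
    distances to the b nearest bank features (b is illustrative)."""
    dists = np.linalg.norm(test_feats[:, None, :] - bank[None, :, :], axis=2)
    s = dists.min(axis=1)                     # raw score: nearest-neighbor distance
    nb = np.sort(dists, axis=1)[:, :b]        # b nearest bank distances
    w = 1 - np.exp(nb[:, 0]) / np.exp(nb).sum(axis=1)
    return w * s

# usage: 200 training patch features -> 10% coreset -> score 16 test patches
bank = greedy_coreset(np.random.default_rng(1).random((200, 8)))
scores = patch_scores(np.random.default_rng(2).random((16, 8)), bank)
image_score = scores.max()  # image-level anomaly score
```

The reweighting shrinks the score when the nearest bank feature sits in a dense neighborhood (the match is well supported), and leaves it high for isolated matches.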
6. Experimental Results and Ablation Studies
6.1 Datasets, Metrics, and Evaluation
- MVTec AD: 15 classes, 3,629 train / 1,725 test images, pixel-level AUROC.
- VisA: 9,621 normals, 1,200 anomalous, 12 subsets, pixel-level AUROC.
- BTAD: Three product categories, ~1,800 train / ~1,000 test, pixel-level AUROC.
6.2 Quantitative Results
| Dataset | Image AUROC (%) | Pixel AUROC (%) | Method Comparison |
|---|---|---|---|
| MVTec AD | 99.10 | 98.21 | ToCoAD (Perlin) best overall |
| VisA | 95.35 | 98.43 | ToCoAD best pixel |
| BTAD | - | 97.70 | ToCoAD best |
- Layer selection: best results from conv3+conv4.
- Necessity of two-stage regime: only full regimen achieves top performance (~98.21% pixel-level), outperforming no-pretrain and "CL-only".
- Synthetic anomaly diversity: Perlin noise generator yields richer defects, outperforming CutPaste and NSA.
- SimSiam preferred over BYOL.
7. Insights, Limitations, and Research Directions
Stage I enables the network to generalize the concept of "anomaly" over a broad span of synthetic defect patterns, thus cultivating a defect prior. The stage II negative-guided bootstrap constrains the feature extractor to diverge from anomaly directions, while the contrastive objective contracts the normal feature locus, substantially minimizing overlap of normal and anomaly manifolds.
Current limitations include the restricted diversity of synthetic anomaly generators; real defects may lie outside their support. Memory-bank compaction and selection, while effective, could benefit from learned or dynamic dictionaries. Extending the method to few-shot or semi-supervised anomaly detection by leveraging limited real anomaly samples is suggested as a plausible avenue for improved robustness. Stronger augmentations (e.g., AugMix) or adversarial synthetic-defect generation may further improve generalizability.
In sum, ToCoAD establishes a robust framework bridging the typical ImageNet-to-industrial data gap by sequentially bootstrapping a discriminative defect prior and leveraging explicit negative guidance in contrastive backbone refinement, resulting in elevated anomaly detection performance across multiple industrial benchmarks (Liang et al., 2024).