ToCoAD: Two-Stage Contrastive Learning

Updated 6 January 2026
  • The paper introduces a two-stage training paradigm that leverages synthetic anomaly generation and bootstrap contrastive fine-tuning to bridge the domain gap in industrial anomaly detection.
  • The methodology enhances pixel-level defect localization using focal loss and SimSiam-based contrastive learning, achieving state-of-the-art performance on benchmarks such as MVTec AD, VisA, and BTAD.
  • The approach refines the feature space by repelling anomalous directions and contracting normal feature clusters, offering robust improvements for industrial inspection tasks.

ToCoAD (Two-Stage Contrastive Learning for Industrial Anomaly Detection) is a training paradigm developed to address the limitations of conventional unsupervised anomaly detection (A/D) in industrial contexts, particularly the domain gap between general-purpose feature extractors and the statistics of industrial data. ToCoAD employs synthetic anomaly generation and a bootstrap contrastive fine-tuning regimen, yielding state-of-the-art pixel-level anomaly localization performance on benchmarks such as MVTec AD, VisA, and BTAD (Liang et al., 2024).

1. Motivation: Bridging the Domain Gap in Industrial Anomaly Detection

Most industrial A/D pipelines rely on frozen feature extractors pretrained on ImageNet. However, industrial defects are often subtle and domain-specific, resulting in "normal" and "anomalous" image patches that diverge significantly from standard ImageNet feature manifolds. Such domain gaps severely hamper the discrimination of subtle defect cues; frozen representations typically fail to generalize to the nuanced statistical patterns of industrial samples. ToCoAD addresses this by (i) bootstrapping a discriminative prior using self-supervised synthetic anomalies, and (ii) explicitly repelling anomalous directions during joint contrastive fine-tuning, thus morphing the feature space to be more sensitive and generalizable to industrial anomaly types.

2. Stage I: Discriminative Network Pre-training via Synthetic Anomalies

2.1 Synthetic Anomaly Generation

The synthetic anomaly pipeline begins with a defect-free image $I$. Geometric augmentations (random rotations from $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$) are applied to enrich invariance. An anomaly mask $P$ is generated using Perlin noise (alternatives: CutPaste, Poisson-seamless patches as in NSA, AugMix-style blended anomalies). Optionally, a reference texture patch $A$ from datasets such as DTD is masked and pasted onto $I$ to form the synthetic-anomalous image $I_G$ and dense ground-truth mask $y_G \in \{0,1\}^{H \times W}$.
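A minimal sketch of this corruption step is shown below. It uses a simplified multi-octave value-noise generator as a stand-in for Perlin noise; the `pseudo_perlin_mask` and `synthesize_anomaly` helpers, the `threshold`, `octaves`, and `beta` blending parameters, and the texture source are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def pseudo_perlin_mask(h, w, threshold=0.6, octaves=(4, 8, 16), seed=0):
    """Smooth multi-octave noise, thresholded into a binary anomaly mask P."""
    rng = np.random.default_rng(seed)
    noise = np.zeros((h, w))
    for res in octaves:
        coarse = rng.random((res + 1, res + 1))
        ys, xs = np.linspace(0, res, h), np.linspace(0, res, w)
        y0, x0 = ys.astype(int), xs.astype(int)
        y1, x1 = np.minimum(y0 + 1, res), np.minimum(x0 + 1, res)
        ty, tx = ys - y0, xs - x0
        # bilinear upsampling of the coarse random grid to (h, w)
        top = coarse[y0][:, x0] * (1 - tx) + coarse[y0][:, x1] * tx
        bot = coarse[y1][:, x0] * (1 - tx) + coarse[y1][:, x1] * tx
        noise += top * (1 - ty)[:, None] + bot * ty[:, None]
    noise = (noise - noise.min()) / (noise.max() - noise.min() + 1e-8)
    return (noise > threshold).astype(np.float32)           # mask P in {0,1}^{H x W}

def synthesize_anomaly(image, texture, beta=0.5, seed=0):
    """Blend a DTD-style texture patch A into a defect-free image I under mask P."""
    h, w = image.shape[:2]
    mask = pseudo_perlin_mask(h, w, seed=seed)[..., None]
    # the random 90-degree rotations from the augmentation set are assumed applied upstream
    i_g = (1 - mask) * image + mask * (beta * image + (1 - beta) * texture)
    return i_g, mask[..., 0]                                 # I_G and ground-truth mask y_G
```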

2.2 Network Architecture and Training Objective

  • Feature extractor $F$: frozen WideResNet-50, pretrained on ImageNet.
  • Decoding head $D$: symmetric (inverse WideResNet) with UNet-style skip connections, outputting a defect mask $\hat{y}_G \in \mathbb{R}^{H \times W \times 2}$.
  • Objective: pixel-wise focal loss to overcome severe class imbalance:

$$\mathcal{L}_{\mathrm{disc}} = \mathcal{L}_{\mathrm{focal}}(\hat{y}_G, y_G) = -\sum_{t \in \{0,1\}} \alpha_t (1 - p_t)^\gamma \log p_t$$

where $p_t$ is the predicted probability for class $t$ per pixel, $\alpha_t$ is the balancing factor, and $\gamma > 0$ modulates down-weighting of easy examples.
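A minimal PyTorch sketch of this objective follows. It assumes the decoding head emits two logits per pixel and uses the standard RetinaNet-style class weighting, which may differ in detail from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Pixel-wise focal loss. logits: (B, 2, H, W) from D; target: (B, H, W) in {0, 1}."""
    log_probs = F.log_softmax(logits, dim=1)                              # log p per class
    log_pt = log_probs.gather(1, target.long().unsqueeze(1)).squeeze(1)   # log p_t per pixel
    pt = log_pt.exp()
    # alpha_t balances the two classes; (1 - p_t)^gamma down-weights easy pixels
    alpha_t = torch.where(target == 1, torch.full_like(pt, alpha),
                          torch.full_like(pt, 1 - alpha))
    return -(alpha_t * (1 - pt) ** gamma * log_pt).mean()
```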

3. Stage II: Negative-Guided Bootstrap Contrastive Learning

3.1 Architecture and Negative-Feature Guidance

Stage II fixes the weights of $D$ and unfreezes $F$ for fine-tuning. A SimSiam-style contrastive module $C$ is stacked (a sketch of $C$ follows the list below):

  • Projector: 3-layer MLP.
  • Predictor: 1-layer MLP. Feature maps are extracted from conv3 and conv4 layers for patchwise robustness.
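The sketch below instantiates such a contrastive module under the stated structure (3-layer projector, 1-layer predictor); the layer widths, normalization choices, and the pooling of conv3/conv4 features into a flat vector are illustrative assumptions.

```python
import torch.nn as nn

class ContrastiveHead(nn.Module):
    """SimSiam-style module C: 3-layer MLP projector + 1-layer MLP predictor."""
    def __init__(self, in_dim=1024, hidden_dim=2048, out_dim=2048):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim), nn.BatchNorm1d(out_dim),
        )
        self.predictor = nn.Linear(out_dim, out_dim)

    def forward(self, f):
        # f: pooled patch features from the conv3/conv4 stages of F
        z = self.projector(f)      # projection z
        p = self.predictor(z)      # prediction p
        return z, p                # z receives stop-gradient inside the loss
```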

For synthetic anomaly images $I'_G$, the pretrained $D$ estimates defect masks $\hat{y}'_G$, which function as explicit negative guides. Focal loss penalizes feature assignments where $F$ and $C$ mistakenly recognize anomalous features as normal.

3.2 Contrastive Loss and Bootstrap Negative Loss

  • Positive pairs: produced via random augmentations $v^1, v^2$ of a normal image, with features $f^1, f^2$ projected to $z_1, z_2$ and predicted as $p_1, p_2$.
  • Negative "pseudo-pairs": direct negative pairing is avoided; instead, the model penalizes cases where $D$ detects anomalous features that have been assigned as normal.

Contrastive similarity (SimSiam) loss:

$$\mathcal{D}(p_1, z_2) = -\frac{p_1}{\|p_1\|_2} \cdot \frac{z_2}{\|z_2\|_2}$$

$$L_{\mathrm{cossim}}(f^1, f^2) = \frac{1}{2}\mathcal{D}(p_1, \mathrm{SG}(z_2)) + \frac{1}{2}\mathcal{D}(p_2, \mathrm{SG}(z_1))$$

where $\mathrm{SG}(\cdot)$ denotes the stop-gradient operation. For $M$ augmented views, the loss is averaged over all pairs:

$$L_{\mathrm{sym}} = \frac{2}{M(M-1)} \sum_{i=1}^M \sum_{j>i} L_{\mathrm{cossim}}(f^i, f^j)$$

The negative bootstrap loss is the focal loss on $D$'s mask predictions for synthetic anomalies:

$$L_{\mathrm{neg}} = \mathcal{L}_{\mathrm{focal}}(\hat{y}'_G, y'_G)$$

Total Stage II loss:

$$\mathcal{L}_{\mathrm{ncl}} = \lambda L_{\mathrm{sym}} + (1-\lambda) L_{\mathrm{neg}}$$

where $\lambda \in [0,1]$ balances contrastive tightness against defect-feature repulsion.
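The following sketch assembles the Stage II objective for the two-view case ($M = 2$), reusing the `focal_loss` sketch above; variable names and the default value of $\lambda$ are assumptions for illustration.

```python
import torch.nn.functional as F

def neg_cosine(p, z):
    """D(p, z) = -(p / ||p||_2) . (z / ||z||_2), with stop-gradient on z."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def stage2_loss(p1, z1, p2, z2, anomaly_logits, anomaly_mask, lam=0.5):
    l_sym = 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)   # L_cossim for M = 2 views
    l_neg = focal_loss(anomaly_logits, anomaly_mask)              # bootstrap term on D's masks
    return lam * l_sym + (1.0 - lam) * l_neg                      # L_ncl
```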

4. Training Algorithm and Key Hyperparameters

Training proceeds sequentially across both stages:

  • Stage I: Freeze $F$, initialize $D$. For each of $E_1 = 100$ epochs, batches of normal images are synthetically corrupted and $D$ learns to predict the masks using Adam ($\mathrm{lr} = 10^{-4}$, decayed at epochs 80/90).
  • Stage II: Freeze $D$, jointly fine-tune $F$ and $C$ over $E_2$ epochs ($E_2 = 100$ for MVTec AD, $E_2 = 150$ for BTAD) with SGD (momentum $= 0.9$, weight decay $= 10^{-4}$, cosine-annealed learning rate).

Typical batch size is 16, with focal loss parameters $\gamma = 2$ and $\alpha_t = 0.25$; the best performance is observed for $\lambda = 0.5$ in ablation. Among synthetic-anomaly generators, Perlin noise performs best, followed by CutPaste and NSA (Liang et al., 2024). SimSiam is used for the contrastive module, so no temperature parameter ($\tau$) is required.
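A sketch of this two-stage optimizer setup is given below. The decoder and contrastive-head stand-ins, the Stage II base learning rate, and the step-decay factor are assumptions; only the optimizer types, epoch counts, momentum, and weight decay come from the description above.

```python
import torch
import torch.nn as nn
import torchvision

extractor = torchvision.models.wide_resnet50_2(weights="IMAGENET1K_V1")  # F, ImageNet-pretrained
decoder = nn.Conv2d(2048, 2, kernel_size=1)   # stand-in for the UNet-style decoding head D
contrast_head = nn.Linear(1024, 2048)         # stand-in for the SimSiam module C

# Stage I: freeze F, train D with Adam (lr 1e-4), decayed at epochs 80 and 90
for p in extractor.parameters():
    p.requires_grad_(False)
opt1 = torch.optim.Adam(decoder.parameters(), lr=1e-4)
sched1 = torch.optim.lr_scheduler.MultiStepLR(opt1, milestones=[80, 90], gamma=0.1)

# Stage II: freeze D, fine-tune F and C with SGD (momentum 0.9, wd 1e-4), cosine-annealed lr
for p in decoder.parameters():
    p.requires_grad_(False)
for p in extractor.parameters():              # unfreeze F for fine-tuning
    p.requires_grad_(True)
params = list(extractor.parameters()) + list(contrast_head.parameters())
opt2 = torch.optim.SGD(params, lr=0.05, momentum=0.9, weight_decay=1e-4)
sched2 = torch.optim.lr_scheduler.CosineAnnealingLR(opt2, T_max=100)     # E2 = 100 for MVTec AD
```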

5. Anomaly Scoring and Inference

Patchwise features from layers 2 and 3 (after fine-tuning) are used to construct a training memory bank $\mathcal{M}_O$, reduced to roughly 10% by greedy coreset selection (Sener & Savarese, 2017). Inference proceeds per test image $I_t$ (a code sketch follows the list):

  1. Slide an overlapping patch grid and extract patch features $p_t$.
  2. Find the nearest coreset feature $c^* = \arg\min_{c \in \mathcal{M}_O} \|p_t - c\|_2$.
  3. Compute the raw anomaly score $s'_t = \|p_t - c^*\|_2$.
  4. Reweight confidence using neighboring features:

$$s_t = \Bigl(1 - \frac{e^{s'_t}}{\sum_{c' \in \mathcal{N}_b(c^*)} e^{\|p_t - c'\|}} \Bigr) \cdot s'_t$$

  5. Upsample patchwise scores to full resolution for pixel-level maps (bilinear, with optional Gaussian smoothing). The image-level score is $\max_t s_t$.
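A compact sketch of this memory-bank inference path follows. The k-center greedy subsampling and the approximation of the neighbourhood $\mathcal{N}_b(c^*)$ by the $b$ coreset entries nearest to the test patch are implementation assumptions, not the paper's exact procedure.

```python
import torch

def greedy_coreset(features, ratio=0.1):
    """k-center greedy selection of ~ratio of normal patch features (memory bank M_O)."""
    n = features.shape[0]
    keep = [torch.randint(n, (1,)).item()]
    dists = torch.cdist(features, features[keep[-1]].unsqueeze(0)).squeeze(1)
    for _ in range(max(int(n * ratio) - 1, 0)):
        idx = torch.argmax(dists).item()                  # farthest point from current coreset
        keep.append(idx)
        new_d = torch.cdist(features, features[idx].unsqueeze(0)).squeeze(1)
        dists = torch.minimum(dists, new_d)
    return features[keep]

def patch_scores(test_feats, coreset, b=9):
    """Raw nearest-neighbour distances s'_t, reweighted by a local coreset neighbourhood."""
    d = torch.cdist(test_feats, coreset)                  # (num_patches, |M_O|)
    s_raw, _ = d.min(dim=1)                               # s'_t = ||p_t - c*||_2
    nb = d.topk(min(b, coreset.shape[0]), dim=1, largest=False).values
    weight = 1.0 - torch.exp(s_raw) / torch.exp(nb).sum(dim=1)
    return weight * s_raw                                 # s_t; image score = max over patches
```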

6. Experimental Results and Ablation Studies

6.1 Datasets, Metrics, and Evaluation

  • MVTec AD: 15 classes, 3629 train / 1725 test images, pixel-level AUROC.
  • VisA: 9,621 normals, 1,200 anomalous, 12 subsets, pixel-level AUROC.
  • BTAD: Three product categories, ~1,800 train / ~1,000 test, pixel-level AUROC.

6.2 Quantitative Results

  • MVTec AD: 99.10% image-level AUROC, 98.21% pixel-level AUROC; ToCoAD (Perlin) is best overall.
  • VisA: 95.35% image-level AUROC, 98.43% pixel-level AUROC; ToCoAD achieves the best pixel-level result.
  • BTAD: 97.70% pixel-level AUROC; ToCoAD is best.
  • Layer selection: best results from conv3+conv4.
  • Necessity of the two-stage regime: only the full regimen achieves top performance (~98.21% pixel-level AUROC), outperforming the no-pretrain and "CL-only" variants.
  • Synthetic anomaly diversity: Perlin noise generator yields richer defects, outperforming CutPaste and NSA.
  • SimSiam preferred over BYOL.

7. Insights, Limitations, and Research Directions

Stage I enables the network to generalize the concept of "anomaly" over a broad span of synthetic defect patterns, thus cultivating a defect prior. The Stage II negative-guided bootstrap constrains the feature extractor to diverge from anomaly directions, while the contrastive objective contracts the normal feature locus, substantially reducing the overlap of normal and anomaly manifolds.

Current limitations include the restricted diversity of synthetic anomaly generators; real defects may reside outside their support. Memory-bank compaction and coreset selection, while effective, could benefit from learned or dynamic dictionaries. Extending the framework to few-shot or semi-supervised anomaly detection by leveraging limited real anomaly samples is suggested as a plausible avenue for improving robustness. Stronger augmentations (e.g., AugMix) or adversarial synthetic-defect generation may further improve generalizability.

In sum, ToCoAD establishes a robust framework bridging the typical ImageNet-to-industrial data gap by sequentially bootstrapping a discriminative defect prior and leveraging explicit negative guidance in contrastive backbone refinement, resulting in elevated anomaly detection performance across multiple industrial benchmarks (Liang et al., 2024).
