ToCoAD: Two-Stage Contrastive Learning
- The paper introduces a two-stage training paradigm that leverages synthetic anomaly generation and bootstrap contrastive fine-tuning to bridge the domain gap in industrial anomaly detection.
- The methodology enhances pixel-level defect localization using focal loss and SimSiam-based contrastive learning, achieving state-of-the-art performance on benchmarks such as MVTec AD, VisA, and BTAD.
- The approach refines the feature space by repelling anomalous directions and contracting normal feature clusters, offering robust improvements for industrial inspection tasks.
ToCoAD (Two-Stage Contrastive Learning for Industrial Anomaly Detection) is a training paradigm developed to address the limitations of conventional unsupervised anomaly detection (A/D) in industrial contexts, particularly the domain gap between general-purpose feature extractors and the statistics of industrial data. ToCoAD employs synthetic anomaly generation and a bootstrap contrastive fine-tuning regimen, yielding state-of-the-art pixel-level anomaly localization performance on benchmarks such as MVTec AD, VisA, and BTAD (Liang et al., 2024).
1. Motivation: Bridging the Domain Gap in Industrial Anomaly Detection
Most industrial A/D pipelines rely on frozen feature extractors pretrained on ImageNet. However, industrial defects are often subtle and domain-specific, resulting in "normal" and "anomalous" image patches that diverge significantly from standard ImageNet feature manifolds. Such domain gaps severely hamper the discrimination of subtle defect cues; frozen representations typically fail to generalize to the nuanced statistical patterns of industrial samples. ToCoAD addresses this by (i) bootstrapping a discriminative prior using self-supervised synthetic anomalies, and (ii) explicitly repelling anomalous directions during joint contrastive fine-tuning, thus morphing the feature space to be more sensitive and generalizable to industrial anomaly types.
2. Stage I: Discriminative Network Pre-training via Synthetic Anomalies
2.1 Synthetic Anomaly Generation
The synthetic anomaly pipeline begins with a defect-free image $x$. Geometric augmentations (e.g., random rotations) are applied to enrich invariance. An anomaly mask $M$ is generated using Perlin noise (alternatives: CutPaste, Poisson-seamless patches as in NSA, AugMix-style blended anomalies). Optionally, a reference texture patch from datasets such as DTD is masked and pasted onto $x$ to form the synthetic-anomalous image $x_a$ with dense ground-truth mask $M$.
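The pipeline above can be sketched in a few lines. This is a simplified stand-in, not the paper's generator: true Perlin noise is replaced by a coarse random grid upsampled to image resolution and thresholded; `grid`, `thresh`, and the blend factor `beta` are illustrative parameters.

```python
import numpy as np

def synth_anomaly(image, texture, grid=8, thresh=0.7, beta=0.5, rng=None):
    """Corrupt a defect-free image with a texture patch under a noise mask.

    Simplified stand-in for Perlin-noise generation: a coarse random grid
    is nearest-upsampled to image resolution and thresholded into a binary
    mask M; the texture is alpha-blended into the masked region.
    """
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    coarse = rng.random((grid, grid))
    noise = np.kron(coarse, np.ones((h // grid, w // grid)))  # nearest upsample
    mask = (noise > thresh).astype(np.float32)                # ground-truth M
    m = mask[..., None] if image.ndim == 3 else mask
    corrupted = image * (1 - m) + (beta * texture + (1 - beta) * image) * m
    return corrupted, mask

# usage: a flat 64x64 "normal" image corrupted by a random texture patch
img = np.full((64, 64), 0.5, dtype=np.float32)
tex = np.random.default_rng(0).random((64, 64)).astype(np.float32)
x_a, M = synth_anomaly(img, tex, rng=0)
```

Outside the mask, the corrupted image is identical to the input, so $M$ is a pixel-exact dense label for Stage I supervision.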
2.2 Network Architecture and Training Objective
- Feature extractor $\phi$: frozen WideResNet-50, pretrained on ImageNet.
- Decoding head $D$: symmetric (inverse WideResNet) with UNet-style skip connections, outputting the predicted defect mask $\hat{M}$.
- Objective: pixel-wise focal loss to overcome severe class imbalance:

  $\mathcal{L}_{\text{focal}} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$

  where $p_t$ is the predicted probability for the true class per pixel, $\alpha_t$ is the balancing factor, and $\gamma$ modulates the down-weighting of easy examples.
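A minimal numpy sketch of this pixel-wise focal loss follows; the default values of `alpha` and `gamma` are illustrative, not the paper's tuned settings.

```python
import numpy as np

def focal_loss(p, target, alpha=0.75, gamma=2.0, eps=1e-7):
    """Pixel-wise binary focal loss.

    p      : predicted anomaly probability per pixel, shape (H, W)
    target : binary ground-truth mask M, shape (H, W)
    alpha, gamma are illustrative defaults, not the paper's values.
    """
    p = np.clip(p, eps, 1 - eps)
    # p_t: probability assigned to the true class of each pixel
    p_t = np.where(target == 1, p, 1 - p)
    alpha_t = np.where(target == 1, alpha, 1 - alpha)
    # (1 - p_t)^gamma down-weights easy, well-classified pixels
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```

Confidently correct pixels contribute almost nothing, so the sparse anomalous pixels dominate the gradient, which is the point of using focal loss under severe class imbalance.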
3. Stage II: Negative-Guided Bootstrap Contrastive Learning
3.1 Architecture and Negative-Feature Guidance
Stage II fixes the weights of the decoding head $D$ and unfreezes the feature extractor $\phi$ for fine-tuning. A SimSiam-style contrastive module is stacked on top:
- Projector: 3-layer MLP.
- Predictor: 1-layer MLP. Feature maps are extracted from conv3 and conv4 layers for patchwise robustness.
For synthetic anomaly images $x_a$, the pretrained $D$ estimates defect masks $\hat{M}$, which function as explicit negative guides. A focal loss penalizes cases where the fine-tuned $\phi$ produces features that $D$ mistakenly recognizes as normal in anomalous regions.
3.2 Contrastive Loss and Bootstrap Negative Loss
- Positive pairs: produced via random augmentations of a normal image; features are projected to $z_1, z_2$ and predicted as $p_1, p_2$.
- Negative "pseudo-pairs": direct negative pairing is avoided; instead, the model penalizes cases where $D$ detects anomalous features that $\phi$ represents as normal.
Contrastive (SimSiam) similarity loss, averaged symmetrically over the two augmented views (with stop-gradient on $z$):

$\mathcal{L}_{\text{sim}} = \frac{1}{2}\left( -\frac{p_1}{\|p_1\|_2} \cdot \frac{z_2}{\|z_2\|_2} - \frac{p_2}{\|p_2\|_2} \cdot \frac{z_1}{\|z_1\|_2} \right)$
The negative bootstrap loss is the focal loss on $D$'s predicted mask for synthetic anomalies:

$\mathcal{L}_{\text{neg}} = \mathcal{L}_{\text{focal}}\big(D(\phi(x_a)),\, M\big)$
Total Stage II loss:

$\mathcal{L} = \mathcal{L}_{\text{sim}} + \lambda\, \mathcal{L}_{\text{neg}}$

where $\lambda$ balances contrastive tightness against defect-feature repulsion.
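The Stage II objective can be sketched as follows; `lam` is an illustrative weight, not the paper's tuned $\lambda$, and the stop-gradient on $z$ is implicit here because the projections are treated as constants.

```python
import numpy as np

def neg_cosine(p, z):
    """SimSiam similarity term: negative cosine between a predictor output p
    and a projection z (z carries stop-gradient in the real training loop)."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(np.dot(p, z))

def stage2_loss(p1, z1, p2, z2, l_neg, lam=0.5):
    """Total Stage II loss: symmetric SimSiam term plus a lambda-weighted
    bootstrap focal loss l_neg computed on synthetic-anomaly masks."""
    l_sim = 0.5 * (neg_cosine(p1, z2) + neg_cosine(p2, z1))
    return l_sim + lam * l_neg
```

When the two views collapse to identical, perfectly aligned features, the similarity term bottoms out at $-1$; the $\lambda$-weighted bootstrap term then decides how strongly anomalous directions are repelled.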
4. Training Algorithm and Key Hyperparameters
Training proceeds sequentially across both stages:
- Stage I: Freeze $\phi$, initialize $D$. In each epoch, batches of normal images are synthetically corrupted, and $D$ learns to predict masks using Adam (learning rate decayed at epochs 80/90).
- Stage II: Freeze $D$, jointly fine-tune $\phi$ and the contrastive module with SGD and a cosine-annealed learning rate; the number of fine-tuning epochs is dataset-dependent (differing, e.g., between MVTec AD and BTAD).
Typical batch size is 16. Focal loss uses a balancing factor $\alpha$ and focusing parameter $\gamma$, and the ablation identifies a best-performing setting of the loss weight $\lambda$. Among synthetic-anomaly generators, Perlin noise performs best, followed by CutPaste and NSA (Liang et al., 2024). SimSiam is used for the contrastive module, so no temperature parameter $\tau$ is required.
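The cosine-annealed schedule used for Stage II SGD can be written as a one-liner; `lr_max` and `lr_min` are placeholders, since the paper's exact learning rates are not reproduced here.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate: decays smoothly from lr_max at
    step 0 to lr_min at total_steps. lr_max/lr_min are placeholders."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

The schedule starts at `lr_max`, passes through the midpoint value halfway through training, and reaches `lr_min` at the final step.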
5. Anomaly Scoring and Inference
Patchwise features from layers 2 and 3 (the conv3/conv4 stages, after fine-tuning) are used to construct a training memory bank $\mathcal{M}$, reduced to 10% by greedy coreset selection (Sener & Savarese, 2017). Inference per test image $x$:
- Slide an overlapping patch grid and extract patch features $f_p$.
- Find the nearest coreset feature $m^{*} = \arg\min_{m \in \mathcal{M}} \|f_p - m\|_2$.
- Compute the raw anomaly score $s = \|f_p - m^{*}\|_2$.
- Reweight confidence using the neighboring bank features $\mathcal{N}_b(m^{*})$:

$s^{*} = \left(1 - \frac{\exp(\|f_p - m^{*}\|_2)}{\sum_{m \in \mathcal{N}_b(m^{*})} \exp(\|f_p - m\|_2)}\right) s$
- Patchwise scores are upsampled to full resolution for pixel-level maps (bilinear, with optional Gaussian smoothing). The image-level score is the maximum patch score, $\max_p s_p^{*}$.
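The coreset construction and nearest-neighbor scoring above can be sketched as follows. This is a minimal illustration: the greedy k-center subsampling follows the Sener & Savarese idea, and the neighborhood size `b` in the softmax reweighting is an assumed illustrative value.

```python
import numpy as np

def greedy_coreset(feats, ratio=0.1, rng=0):
    """Greedy k-center coreset subsampling: repeatedly pick the feature
    farthest from the current selection, keeping ratio * n features."""
    rng = np.random.default_rng(rng)
    n = len(feats)
    k = max(1, int(n * ratio))
    idx = [int(rng.integers(n))]
    d = np.linalg.norm(feats - feats[idx[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))          # farthest point from selection
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(feats - feats[nxt], axis=1))
    return feats[idx]

def patch_scores(test_feats, bank, b=3):
    """Nearest-coreset distance per patch, reweighted by the softmax of
    distances to the b nearest bank features (b is illustrative)."""
    dists = np.linalg.norm(test_feats[:, None, :] - bank[None, :, :], axis=2)
    s = dists.min(axis=1)                     # raw score: nearest-neighbor distance
    nb = np.sort(dists, axis=1)[:, :b]        # b nearest bank distances
    w = 1 - np.exp(nb[:, 0]) / np.exp(nb).sum(axis=1)
    return w * s

# usage: 200 training patch features -> 10% coreset -> score 16 test patches
bank = greedy_coreset(np.random.default_rng(1).random((200, 8)))
scores = patch_scores(np.random.default_rng(2).random((16, 8)), bank)
image_score = scores.max()  # image-level anomaly score
```

The reweighting shrinks the score when the nearest bank feature sits in a dense neighborhood (the match is well supported), and leaves it high for isolated matches.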
6. Experimental Results and Ablation Studies
6.1 Datasets, Metrics, and Evaluation
- MVTec AD: 15 classes, 3,629 train / 1,725 test images, pixel-level AUROC.
- VisA: 9,621 normals, 1,200 anomalous, 12 subsets, pixel-level AUROC.
- BTAD: Three product categories, ~1,800 train / ~1,000 test, pixel-level AUROC.
6.2 Quantitative Results
| Dataset | Image AUROC (%) | Pixel AUROC (%) | Method Comparison |
|---|---|---|---|
| MVTec AD | 99.10 | 98.21 | ToCoAD (Perlin) best overall |
| VisA | 95.35 | 98.43 | ToCoAD best pixel |
| BTAD | - | 97.70 | ToCoAD best |
- Layer selection: best results from conv3+conv4.
- Necessity of two-stage regime: only full regimen achieves top performance (~98.21% pixel-level), outperforming no-pretrain and "CL-only".
- Synthetic anomaly diversity: Perlin noise generator yields richer defects, outperforming CutPaste and NSA.
- SimSiam preferred over BYOL.
7. Insights, Limitations, and Research Directions
Stage I enables the network to generalize the concept of "anomaly" over a broad span of synthetic defect patterns, thus cultivating a defect prior. The stage II negative-guided bootstrap constrains the feature extractor to diverge from anomaly directions, while the contrastive objective contracts the normal feature locus, substantially minimizing overlap of normal and anomaly manifolds.
Current limitations include the restricted diversity of synthetic anomaly generators; real defects may lie outside their support. Memory-bank compaction and selection, while effective, could benefit from learned or dynamic dictionaries. Extending the method to few-shot or semi-supervised anomaly detection by leveraging limited real anomaly samples is suggested as a plausible avenue for improved robustness. Stronger augmentations (e.g., AugMix) or adversarial synthetic-defect generation may further improve generalizability.
In sum, ToCoAD establishes a robust framework bridging the typical ImageNet-to-industrial data gap by sequentially bootstrapping a discriminative defect prior and leveraging explicit negative guidance in contrastive backbone refinement, resulting in elevated anomaly detection performance across multiple industrial benchmarks (Liang et al., 2024).