Dual-Modality Anomaly Synthesis (DAS3D)

Updated 18 May 2026

The paper introduces DAS3D, which synthesizes correlated anomalies in RGB images and depth maps to overcome limited defect data in industrial settings.
It applies structured depth perturbations and texture blending via skew-Gaussian filtering and noise-based masking to simulate realistic defects.
The approach demonstrates robust performance with high I-AUROC and AUPRO metrics, highlighting its efficacy in self-supervised anomaly detection.

Dual-Modality Anomaly Synthesis (DAS3D) refers to a class of computational frameworks and data augmentation strategies designed to synthesize realistic anomalies simultaneously in 2D (RGB) and 3D (depth or point cloud) modalities. The approach addresses the challenges of industrial anomaly detection, particularly in scenarios where real annotated defect samples are rare or unavailable and multi-modal data (image + geometry) are used jointly. By constructing paired synthetic anomalies in both views, these frameworks facilitate robust self-supervised anomaly detection and data augmentation, and provide a foundation for advancing fusion-based discriminative and generative models (Li et al., 2024).

1. Formalization and Rationale

DAS3D builds on the observation that while anomaly synthesis is well-established for 2D defect detection, extension to multi-modality—especially RGB + depth or point cloud (i.e., 3D)—requires methods that preserve spatial coherence and modality consistency. The central object is a dual-modality data tuple: $\mathcal{T} = \{(I_i, Z_i)\}_{i=1}^N,$ with $I_i \in \mathbb{R}^{H\times W\times 3}$ an RGB image and $Z_i \in \mathbb{R}^{H\times W}$ an aligned depth map. The core design is a pair of synthesis operators,

$\phi_{3D}: Z \rightarrow Z_a, \qquad \phi_{RGB}: (I, M^*) \rightarrow I_a,$

where $Z_a$ is a synthetic anomalous depth map and $I_a$ is an anomalous RGB image constructed to be spatially correlated with the defect region $M^*$ identified in depth. This mutual synthesis closes the distributional gap between 2D and 3D modalities, supporting discriminative fusion and yielding paired “normal” and “anomalous” samples for each modality (Li et al., 2024). Direct applications include self-supervised reconstruction-based detection and few-shot anomaly classification (Zuo et al., 2024).

2. Anomaly Synthesis Techniques

Depth anomaly synthesis ( $\phi_{3D}$ ) is realized by generating structured surface perturbations resembling industrial defects (pits, bumps, scratches):

Foreground segmentation via depth thresholding: $M_f[i,j] = \mathbf{1}\{Z[i,j]<t_f\}$ .
Localized ternary mask construction over Perlin noise:

$M_p[i,j] = \begin{cases} +1, & P[i,j] > t_p, \\ -1, & P[i,j] < -t_p, \\ 0, & \text{otherwise}, \end{cases}$

where $P$ is a Perlin noise field and $t_p$ a threshold.

Skew-Gaussian convolution for spatial smoothing and directional bias: the ternary mask is filtered with a skewed Gaussian kernel (kernel definition and notation as in (Li et al., 2024)).

Depth variation: the filtered mask perturbs the depth map within the foreground to produce the anomalous depth map $Z_a$.
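The depth synthesis steps listed above can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the skewed kernel is assumed to have a skew-normal profile (Gaussian times `1 + erf`), the Perlin field is replaced by smoothed white noise, and `amplitude`, the rule used to derive the mask, and all function names are hypothetical.

```python
import math
import numpy as np


def foreground_mask(depth, t_f):
    """M_f[i,j] = 1 where Z[i,j] < t_f, else 0."""
    return (depth < t_f).astype(np.float64)


def ternary_mask(noise, t_p):
    """M_p in {-1, 0, +1}: threshold a noise field at +t_p and -t_p."""
    mask = np.zeros(noise.shape, dtype=np.float64)
    mask[noise > t_p] = 1.0
    mask[noise < -t_p] = -1.0
    return mask


def skew_kernel_1d(size, sigma, alpha):
    """ASSUMED kernel shape: Gaussian times (1 + erf), a skew-normal profile."""
    x = np.arange(size, dtype=np.float64) - size // 2
    gauss = np.exp(-0.5 * (x / sigma) ** 2)
    skew = 1.0 + np.vectorize(math.erf)(alpha * x / (sigma * math.sqrt(2.0)))
    kernel = gauss * skew
    return kernel / kernel.sum()


def smooth(field, kernel):
    """Separable 2D filtering: convolve every row, then every column."""
    rows = np.apply_along_axis(np.convolve, 1, field, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="same")


def synthesize_depth_anomaly(depth, noise, t_f, t_p, kernel, amplitude=0.05):
    """Return (Z_a, M_star); amplitude and the mask rule are assumptions."""
    m_f = foreground_mask(depth, t_f)
    delta = amplitude * smooth(ternary_mask(noise, t_p), kernel) * m_f
    m_star = (np.abs(delta) > 1e-6).astype(np.float64)
    return depth + delta, m_star


rng = np.random.default_rng(0)
depth = np.full((32, 32), 1.0)   # background far from the sensor
depth[4:28, 4:28] = 0.5          # object closer to the sensor
# Stand-in for Perlin noise: white noise smoothed by a symmetric kernel.
noise = smooth(rng.standard_normal((32, 32)), skew_kernel_1d(9, 2.0, 0.0))
noise /= np.abs(noise).max()
z_a, m_star = synthesize_depth_anomaly(
    depth, noise, t_f=0.8, t_p=0.3, kernel=skew_kernel_1d(9, 2.0, 3.0)
)
```

Setting `alpha` to zero recovers a symmetric Gaussian kernel, which makes it easy to compare skewed and unskewed perturbations on the same noise field.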

The corresponding RGB anomaly ( $\phi_{RGB}$ ) uses the mask $M^*$ derived from depth synthesis to blend a random texture patch (e.g., from the DTD dataset) into the image $I$, yielding $I_a$. This design mimics texture contamination, color shifts, or physical damage aligned with the synthesized geometric defects. As an extension, some systems generate anomalies directly in latent space (e.g., via a Gaussian mixture model over the shared 2D/3D embedding (Ali et al., 20 Oct 2025)) and decode them back into anomalous samples in both modalities.
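A minimal sketch of mask-guided texture blending is given below. The convex blend with opacity `beta` is an assumption chosen for illustration; the exact blending rule used by DAS3D is not reproduced here, and the random array merely stands in for a texture crop.

```python
import numpy as np


def blend_texture(image, texture, mask, beta=0.7):
    """ASSUMED blend: inside the mask, mix image and texture with opacity beta."""
    m = mask[..., None].astype(np.float64)   # broadcast mask over RGB channels
    mixed = (1.0 - beta) * image + beta * texture
    return (1.0 - m) * image + m * mixed


rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))
texture = rng.random((16, 16, 3))   # stand-in for a DTD texture crop
mask = np.zeros((16, 16))
mask[4:8, 4:8] = 1.0                # defect region M* taken from depth synthesis
image_a = blend_texture(image, texture, mask)
```

Because the same mask drives both modalities, pixels outside the mask are left untouched and the RGB defect stays spatially aligned with the depth defect.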

3. Network Architectures for Dual-Modality Detection

DAS3D approaches typically leverage a reconstruction-based discriminative architecture with fused multimodal backbones:

Dual UNet-like reconstruction modules, one per modality, seek to recover the clean depth map $Z$ and RGB image $I$ from the anomalous inputs $Z_a$ and $I_a$, respectively.
Cross-modal fusion occurs at the feature level: shallow and deep features from both paths are concatenated (e.g., using channel-wise upsampling) and fed to a joint discriminator, itself often a UNet, which outputs a pixel-level anomaly mask (Li et al., 2024).
Losses comprise a depth L2 reconstruction term, an RGB L2 + SSIM reconstruction term, and a focal loss for mask supervision, which are combined into a single composite training objective.
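The three loss terms can be illustrated with a small NumPy sketch. Several choices here are assumptions made for brevity: SSIM is computed over a single global window rather than sliding windows, the terms are summed with unit weights `w`, and the focal loss uses `gamma = 2` without class weighting. The actual weighting used by DAS3D is not stated here.

```python
import numpy as np


def l2_loss(pred, target):
    """Mean squared reconstruction error."""
    return float(np.mean((pred - target) ** 2))


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (simplified, not windowed)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)


def focal_loss(prob, target, gamma=2.0, eps=1e-7):
    """Binary focal loss: -(1 - p_t)^gamma * log(p_t), averaged over pixels."""
    p = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target > 0.5, p, 1.0 - p)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))


def composite_loss(z_rec, z, i_rec, i, m_pred, m_true, w=(1.0, 1.0, 1.0)):
    """Weighted sum of depth, RGB, and segmentation terms (weights assumed)."""
    l_depth = l2_loss(z_rec, z)
    l_rgb = l2_loss(i_rec, i) + (1.0 - global_ssim(i_rec, i))
    l_seg = focal_loss(m_pred, m_true)
    return w[0] * l_depth + w[1] * l_rgb + w[2] * l_seg
```

A perfect reconstruction with a perfect mask drives every term to approximately zero, which is a quick sanity check when wiring such a loss into a training loop.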

Augmentation dropout regularizes the modality fusion: randomly only one branch (RGB or depth) receives the anomaly, preventing shortcut learning and improving robustness to real-world single-modality artifacts (Li et al., 2024).
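Augmentation dropout can be sketched as two independent Bernoulli draws that decide which branch receives the synthetic anomaly. The probabilities `p_rgb` and `p_depth`, the function name, and the rule of zeroing the mask when neither branch is anomalous are assumptions for illustration.

```python
import numpy as np


def augmentation_dropout(i, z, i_a, z_a, m_star, p_rgb=0.5, p_depth=0.5, rng=None):
    """Pick anomalous or clean input per modality and return the matching mask."""
    rng = np.random.default_rng() if rng is None else rng
    use_rgb = rng.random() < p_rgb       # Bernoulli draw for the RGB branch
    use_depth = rng.random() < p_depth   # Bernoulli draw for the depth branch
    rgb_in = i_a if use_rgb else i
    depth_in = z_a if use_depth else z
    # Assumed label rule: the mask stays active if either branch is anomalous.
    mask = m_star if (use_rgb or use_depth) else np.zeros_like(m_star)
    return rgb_in, depth_in, mask
```

With this rule the detector sees RGB-only, depth-only, dual, and fully normal inputs during training, so it cannot rely on a defect always appearing in both modalities.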

Alternative realizations, such as cross-modal latent synthesis, use feature extractors (e.g., ViT, PointMAE), concatenated through a fusion encoder and decoded with attention modules (CBAM) for each modality. Anomaly synthesis in latent space further exploits Gaussian priors and adversarial objectives to enrich the diversity and “realism” of generated defects (Ali et al., 20 Oct 2025).

4. Training Protocols and Synthesis Dropout

The DAS3D training procedure consists of generating synthetic anomalous samples on the fly for each batch and optimizing the network to reconstruct normal images and depth maps while learning to discriminate synthetic defects:

For a normal training pair $(I, Z)$, sample structured noise and skewed filters to synthesize the anomalous pair $(I_a, Z_a)$ and the corresponding mask $M^*$.
With independent Bernoulli sampling, randomly replace $I_a$ or $Z_a$ with its normal counterpart to enforce adaptation to single-modality or dual anomalies; propagate this to mask labels.
UNet subnetworks reconstruct $(I, Z)$; features are fused and fed into the joint discriminator to output the predicted anomaly mask.
Composite loss is minimized with Adam over 50 epochs, with batch size 8–16 and stochastic annealed dropout rates.
Inference proceeds by reconstructing from unaugmented test pairs; the predicted anomaly mask is taken as output, and image-level anomaly scores can be derived from it.
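As an illustration of the final inference step, the sketch below reduces a predicted pixel-level mask to a single image-level score. Both reductions shown (the maximum, and the mean of the `top_k` largest values) are common choices assumed here for illustration; they are not necessarily the rule used by DAS3D.

```python
import numpy as np


def image_score(mask_pred, top_k=None):
    """Reduce a pixel-level anomaly map to one score (assumed reduction)."""
    flat = np.sort(mask_pred.ravel())[::-1]   # descending pixel scores
    if top_k is None:
        return float(flat[0])                 # maximum response
    return float(flat[:top_k].mean())         # mean of the top-k responses
```

The top-k mean is less sensitive to a single spurious pixel than the maximum, at the cost of one extra hyperparameter.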

Key innovations include augmentation dropout—critical for forcing the detector to exploit both depth and color cues—and cycle-consistency or adversarial regularization when synthesis is performed in latent space (Ali et al., 20 Oct 2025). Such protocols have emerged as a robust solution in data-scarce industrial settings.

5. Evaluation Procedures and Performance

DAS3D and related dual-modality synthesis frameworks are typically benchmarked on the MVTec 3D-AD (industrial parts) and Eyescandies datasets. The principal evaluation metrics are:

I-AUROC (image-level area under ROC),
P-AUROC (pixel-level ROC),
AUPRO (area under per-region overlap curve).
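For reference, AUROC can be computed directly from its rank interpretation: the probability that a randomly chosen anomalous sample scores higher than a randomly chosen normal one, with ties counted as one half. The sketch below uses a hypothetical helper name and covers the image-level and pixel-level cases; AUPRO, which integrates per-region overlap over false-positive rates, is not covered.

```python
import numpy as np


def auroc(scores, labels):
    """AUROC as the fraction of (anomalous, normal) pairs ranked correctly."""
    scores = np.asarray(scores, dtype=np.float64).ravel()
    labels = np.asarray(labels).ravel().astype(bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))


# Image-level: one score and one label per image.
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))   # 0.75
```

Passing flattened anomaly maps and ground-truth masks gives the pixel-level variant, although the pairwise comparison is quadratic in memory and is only practical for small inputs.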

Reported results for the core DAS3D approach include:

On MVTec 3D-AD: I-AUROC = 0.982, AUPRO = 0.975, outperforming preceding methods such as ShapeGuided (I-AUROC 0.947), M3DM (0.945), and 3DSR (0.978).
On Eyescandies: I-AUROC = 0.915, AUPRO = 0.927 (Li et al., 2024).
Ablation studies confirm the importance of skew-Gaussian synthesis (removal drops I-AUROC to 0.959) and dropout (removal drops it to 0.941); removing both degrades performance further.
Inference is efficient: DAS3D requires 0.041 s per sample and 3.9 GB of GPU memory, versus ShapeGuided (5.8 s / 6.5 GB) and M3DM (2.5 s / 11 GB).
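The reported timings translate into speed-up factors of roughly 141x over ShapeGuided and 61x over M3DM, as the short calculation below shows. The dictionary simply restates the figures quoted above; the variable names are illustrative.

```python
# Reported (seconds per sample, GPU memory in GB), as quoted above.
reported = {
    "DAS3D": (0.041, 3.9),
    "ShapeGuided": (5.8, 6.5),
    "M3DM": (2.5, 11.0),
}

base_time, base_mem = reported["DAS3D"]
speedup = {k: t / base_time for k, (t, _) in reported.items() if k != "DAS3D"}
mem_ratio = {k: m / base_mem for k, (_, m) in reported.items() if k != "DAS3D"}

for name in speedup:
    print(f"{name}: {speedup[name]:.0f}x slower, {mem_ratio[name]:.1f}x memory")
```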

Qualitative analysis demonstrates that synthesized depth perturbations and texture blends mimic real-world surface anomalies, and the dual-modality network cleanly localizes a range of artifact types with high precision (Li et al., 2024). Related fusion techniques (e.g., attention-driven decoders and latent cross-modal synthesis) further refine anomaly sensitivity and enable generative capabilities for simulation and augmentation (Ali et al., 20 Oct 2025).

6. Relation to Alternative Dual-Modality Pipelines

While DAS3D implements anomaly synthesis directly in RGB + depth space, alternative designs—including those in CLIP3D-AD (Zuo et al., 2024) and MAFR (Ali et al., 20 Oct 2025)—exploit multi-view rendering or latent cross-modal fusion to harmonize and enhance the utility of both modalities. In CLIP3D-AD:

Synthetic “defect” patches and masks are generated via foreground masking and Perlin noise, then transferred to both RGB and multi-view point cloud renderings.
A frozen CLIP backbone, multimodal adapters, coarse-to-fine decoders, and multi-view fusion underpin few-shot learning performance.
On-the-fly anomaly pair synthesis enables contrastive and segmentation training without large memory banks.
The method achieves competitive few-shot 3D anomaly detection and segmentation.

MAFR-based DAS3D variants build a shared latent with cross-modal reconstruction losses, enforce cross-modality alignment using synthesis loss, and inject adversarial and cycle-consistency losses for robust OOD anomaly generation (Ali et al., 20 Oct 2025). Both architectural philosophies reinforce the principle that explicit dual-modality synthesis—whether in raw or latent space—improves detection sensitivity and generalizes better in industrial applications.

7. Implications and Outlook

The dual-modality anomaly synthesis paradigm advances the state of the art for 3D anomaly detection by exploiting structured, realistic synthetic defects across both geometry and appearance. Key design elements, such as skew-Gaussian depth perturbations, texture mixing, augmentation dropout, attention-guided restoration, and joint discrimination, jointly regularize the learned models for high precision, efficient inference, and robustness to incomplete modality artifacts. The approach matches or surpasses prior work on multiple industrial detection benchmarks. A plausible implication is that continued integration of latent generative modeling, cross-modal attention, and more sophisticated 3D synthesis schemes will further close the gap between real-world defect detection and self-supervised learning in multi-modal contexts (Li et al., 2024, Ali et al., 20 Oct 2025, Zuo et al., 2024).