
Weld-4M Benchmark: Multimodal Anomaly Detection

Updated 1 January 2026
  • Weld-4M is a benchmark dataset that synchronizes RGB video, acoustic spectrograms, high-frequency sensor streams, and post-weld images for unsupervised anomaly detection in robotic welding.
  • Its companion Causal-HM framework employs hierarchical fusion and causal modeling to overcome modality heterogeneity and causal blindness, achieving strong results (I–AUROC 90.7%, I–AP 99.1%, I–F1-max 97.4%).
  • The dataset enables rigorous evaluation under realistic factory conditions, addressing challenges like class imbalance and noise while setting a unified protocol for process-aware UAD.

Weld-4M is a multimodal benchmark dataset designed for unsupervised anomaly detection (UAD) in robotic welding, focusing on the restoration of physical generative logic via hierarchical fusion and causal modeling. It provides synchronized, high-fidelity multimodal data—including RGB video, acoustic emission spectrograms, high-frequency sensor streams, and post-weld images—collected from a robotic welding station in a real factory environment. Developed and introduced alongside the Causal-HM framework, Weld-4M addresses limitations in existing benchmarks, notably causal blindness, modality heterogeneity gaps, and robustness to realistic factory noise, establishing a rigorous baseline and protocol for evaluating physical-process-aware anomaly detection methods (Liu et al., 25 Dec 2025).

1. Benchmark Composition and Data Modalities

Weld-4M comprises 4,040 weld cycles, each sample synchronizing four distinct modalities:

  • Real-time RGB Video (“process” modality): Captured at 30 fps (resolution unspecified), preprocessed using a frozen V-JEPA2 backbone (vitl-fpc64-256).
  • Acoustic Emission Spectrograms: 192 kHz raw audio transformed to spectrograms of shape $F \times T_a$, featurized with a frozen AST backbone.
  • Sensor Time-Series: High-frequency current and voltage data sampled from sensors mounted on the welding torch, encoded via Mamba-SSM ($X_s \in \mathbb{R}^{T_s \times C_s}$).
  • Post-weld Color Images (“result” modality): $M$ multiview stills per weld, each $3 \times H \times W$, featurized with a frozen DINOv3.

Data was acquired under real manufacturing conditions—with natural variations in lighting and background noise. All modalities are tightly synchronized per welding cycle, ensuring temporal alignment between process and result observations.
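The four synchronized modalities of one weld cycle can be sketched as a single record. This is an illustrative data structure only, with assumed field names and placeholder shapes, not the actual Weld-4M file format:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class WeldCycle:
    """One synchronized weld-cycle sample (field names and shapes illustrative)."""
    video: np.ndarray       # (T_v, 3, H, W) RGB frames at 30 fps ("process")
    audio_spec: np.ndarray  # (F, T_a) acoustic-emission spectrogram (from 192 kHz audio)
    sensor: np.ndarray      # (T_s, C_s) current/voltage time-series
    post_weld: np.ndarray   # (M, 3, H, W) multiview post-weld images ("result")
    label: int              # 0 = good, 1 = anomalous (labels used at test time only)


# Minimal toy instance with placeholder dimensions
sample = WeldCycle(
    video=np.zeros((64, 3, 256, 256), dtype=np.float32),
    audio_spec=np.zeros((128, 512), dtype=np.float32),
    sensor=np.zeros((10_000, 2), dtype=np.float32),
    post_weld=np.zeros((4, 3, 224, 224), dtype=np.uint8),
    label=0,
)
print(sample.video.shape, sample.post_weld.shape)
```

The key property the benchmark requires is that all four arrays describe the same weld cycle, so a loader must index them jointly rather than per-modality.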

| Modality | Format | Capture Rate | # Per Sample |
|---|---|---|---|
| Video (P) | RGB clip, $T_v \times 3 \times H \times W$ | 30 fps | 1 clip ($\sim T_v$ frames) |
| Audio (P) | Spectrogram, $F \times T_a$ | 192 kHz $\to$ AST | 1 clip |
| Sensor (P) | Time-series, $T_s \times C_s$ | High-frequency sensors | 1 signal |
| Post-weld (R) | $M$ images, $3 \times H \times W$ | Static | $M$ viewpoints |

The training set contains 576 defect-free “good” cycles; the remaining 3,464 samples, comprising both normal and anomalous welds, are reserved for testing.
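The split arithmetic above can be sketched with index bookkeeping. Which cycles are “good” is hypothetical here (the real labels ship with the dataset); only the counts come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Prescribed split: 4,040 cycles total, 576 defect-free cycles for training,
# everything else (normal + anomalous) held out for testing.
n_total, n_train_good = 4040, 576

good_idx = np.arange(2000)  # hypothetical indices of defect-free cycles
train_idx = rng.choice(good_idx, size=n_train_good, replace=False)
test_idx = np.setdiff1d(np.arange(n_total), train_idx)

print(len(train_idx), len(test_idx))  # 576 3464
```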

2. Defect Taxonomy and Dataset Structure

Weld-4M encodes an explicit taxonomy of 12 defect categories (11 reported in summary tables; one referenced in text):

  1. Excessive Convexity
  2. Undercut
  3. Lack of Fusion (process-hidden)
  4. Porosity (internal)
  5. Spatter
  6. Burnthrough
  7. Porosity w/ EP (electrode penetration)
  8. Excessive Penetration
  9. Crater Cracks
  10. Warping
  11. Overlap
  12. [One additional class implied by the text]

Defect Types:

  • Surface defects (e.g., spatter, undercut, warping, excessive penetration) manifest visually and are typically accessible via post-weld images.
  • Hidden defects (Lack of Fusion, internal Porosity) may yield nominally acceptable beads in post-weld inspection, requiring causal cross-modal inference.

There is marked class imbalance: every defect class is far less frequent than normal samples, and some (notably the hidden defects) have fewer than 200 examples. Training uses only defect-free samples to emulate real-world unsupervised anomaly detection; the test set mixes unseen normals and anomalies.

3. Feature Extraction and Preprocessing Pipeline

Modalities are processed using state-of-the-art, frozen backbones:

  • Video: Frames are resized for V-JEPA2-vitl-fpc64-256, outputting $F_v \in \mathbb{R}^{T_v \times D}$.
  • Audio: Raw audio is first transformed into spectrograms, then passed through frozen AST to yield $F_a \in \mathbb{R}^{F \times D}$.
  • Sensor: Time-series $X_s$ are channel-wise normalized and encoded using Mamba-SSM, outputting a final hidden state $h_s \in \mathbb{R}^{d}$ with $d = 128$.
  • Images: Each of the $M$ post-weld images is processed via frozen DINOv3, yielding $F_i \in \mathbb{R}^{M \times D}$.

All features are projected to a unified hidden dimension $D = 512$ (the sensor branch uses $d = 128$ internally), facilitating modality alignment in subsequent fusion and anomaly-detection stages. Synchronization between modalities is a strict requirement of the benchmark, supporting precise P→R (Process-to-Result) causal modeling.
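The projection to the unified dimension $D = 512$ can be sketched with per-modality linear heads. The backbone output widths below are assumptions (V-JEPA2-L and DINOv3 at 1024, AST at 768); only the sensor width $d = 128$ and the target $D = 512$ are stated in the text:

```python
import torch
import torch.nn as nn

D = 512  # unified hidden dimension (from the text)

# Per-modality projections; input widths are assumed backbone output sizes.
proj = nn.ModuleDict({
    "video":  nn.Linear(1024, D),  # V-JEPA2 tokens (assumed width)
    "audio":  nn.Linear(768, D),   # AST tokens (assumed width)
    "sensor": nn.Linear(128, D),   # Mamba-SSM final hidden state, d=128
    "image":  nn.Linear(1024, D),  # DINOv3 tokens (assumed width)
})

feats = {
    "video":  torch.randn(64, 1024),  # (T_v, D_video)
    "audio":  torch.randn(128, 768),  # (F, D_audio)
    "sensor": torch.randn(1, 128),    # (1, d) final hidden state
    "image":  torch.randn(4, 1024),   # (M, D_image)
}

aligned = {name: proj[name](f) for name, f in feats.items()}
print({name: tuple(t.shape) for name, t in aligned.items()})
```

After this step every modality lives in the same 512-dimensional space, which is what makes token-level fusion across modalities possible.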

4. Experimental Protocol and Evaluation Metrics

The standard protocol prescribes training solely on the 576 good samples, with evaluation on all 4,040 cycles (test set normal and anomalous). No explicit validation or k-fold splits are defined; hyperparameter tuning is conducted either via held-out good samples or ablation on the test set, as reported in (Liu et al., 25 Dec 2025).

Anomaly detection performance is measured using three principal metrics:

  • Image-level AUROC (I–AUROC): Area under the receiver operating characteristic for post-weld image-level anomaly scores.

$$\mathrm{I\text{-}AUROC} = \int_{0}^{1} \mathrm{TPR}(\alpha)\, d\alpha, \quad \text{where } \alpha = \mathrm{FPR}$$

  • Image-level Average Precision (I–AP): Area under the precision-recall curve at the image level.
  • Image-level F1-max (I–F1-max): Maximum F1 score attained across all possible thresholds.

Performance must be reported for all three metrics due to the severe class imbalance in the benchmark.
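All three image-level metrics can be computed from a vector of anomaly scores and binary labels with standard scikit-learn calls; the scores below are synthetic stand-ins:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

rng = np.random.default_rng(0)

# Toy image-level anomaly scores: anomalies (label 1) score higher on average.
y_true = np.r_[np.zeros(100, dtype=int), np.ones(900, dtype=int)]
scores = np.r_[rng.normal(0.0, 1.0, 100), rng.normal(2.0, 1.0, 900)]

i_auroc = roc_auc_score(y_true, scores)            # I-AUROC
i_ap = average_precision_score(y_true, scores)     # I-AP

# I-F1-max: maximum F1 over all decision thresholds.
prec, rec, _ = precision_recall_curve(y_true, scores)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
i_f1_max = f1.max()

print(f"I-AUROC={i_auroc:.3f}  I-AP={i_ap:.3f}  I-F1-max={i_f1_max:.3f}")
```

Note how the 9:1 anomaly-to-normal ratio inflates I-AP and I-F1-max relative to I-AUROC, which is exactly why the benchmark requires reporting all three.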

5. Baselines and Comparative Evaluation

Weld-4M supports comparison across 18 baseline methods, stratified by modality and fusion paradigm:

  • Unimodal Image: Dinomaly, PatchCore, SimpleNet, ViTAD, etc.
  • Unimodal Audio: AST spectrogram autoencoder.
  • Multimodal Flat-Fusion: LateFusion-Video, LateFusion-Audio, LateFusion-Fusion, Concat-AE.
  • State-of-the-art Multimodal and Reconstruction: BTF, CFM, M3DM, 3D-ADNAS, Reconstruct (ReContrast), MVAD, RealNet, MambaAD, RD++, UniAD.

Table 1 in (Liu et al., 25 Dec 2025) provides per-category AUCs as well as overall I–AUROC, I–AP, and I–F1-max. Causal-HM achieves state-of-the-art performance: I–AUROC = 90.7%, I–AP = 99.1%, I–F1-max = 97.4%.

6. Insights, Challenges, and Usage Recommendations

Weld-4M reveals several critical challenges for multimodal UAD in manufacturing:

  • Causal Blindness: Symmetric, flat fusion approaches are unable to capture the physical generative logic (P→R), especially for process-hidden defects.
  • Heterogeneity Gap: High-dimensional visual/audio embeddings (~512-D, pooled from hundreds of thousands of pixels or spectrogram bins) can overwhelm the contributions of 2–10-channel sensor streams in late or flat fusion (“drowning out”).
  • Class Imbalance: Severe imbalance, particularly for rare and hidden defects, necessitates robust metric reporting and training on “good-only” samples.
  • Over-generalization: Pure reconstruction models (e.g., autoencoders) often degenerate to identity mappings and fail to flag anomalous results.
  • Robustness under Noise: Realistic factory noise has a disproportionate effect on naïve multimodal systems compared to architectures respecting process/result causality (e.g., Causal-HM).
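One way to avoid both causal blindness and the drowning-out effect is to let the sensor (“process”) embedding modulate the visual (“result”) tokens rather than concatenate with them. A minimal FiLM-style sketch of that idea follows; this is an illustration of modulation-based fusion in general, not the Causal-HM architecture:

```python
import torch
import torch.nn as nn

D = 512  # unified hidden dimension


class SensorFiLM(nn.Module):
    """FiLM-style modulation: the sensor embedding produces per-channel
    scale/shift parameters applied to visual tokens, so the low-dimensional
    process signal governs rather than competes with the visual features."""

    def __init__(self, d: int = D):
        super().__init__()
        self.to_gamma_beta = nn.Linear(d, 2 * d)

    def forward(self, visual_tokens: torch.Tensor, sensor_emb: torch.Tensor):
        # visual_tokens: (N, d); sensor_emb: (d,)
        gamma, beta = self.to_gamma_beta(sensor_emb).chunk(2, dim=-1)
        return visual_tokens * (1 + gamma) + beta


film = SensorFiLM()
out = film(torch.randn(16, D), torch.randn(D))
print(tuple(out.shape))
```

Because the sensor stream enters as scale/shift parameters rather than as extra tokens, its influence does not shrink as the visual token count grows.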

Recommended Usage:

  • Strictly adhere to the “good-only” training protocol and report I–AUROC, I–AP, and I–F1-max.
  • Use fixed, state-of-the-art backbones (V-JEPA2, AST, DINOv3) for modality-specific feature extraction.
  • Treat sensor streams as causal governors (P→R) and adopt hierarchical or modulation-based fusion rather than treating all modalities as parallel.
  • Maintain synchronized timestamp alignment and the train/test split if extending the dataset or adding modalities.
  • Evaluate robustness explicitly by introducing synthetic noise across modalities and reporting performance degradation.
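The last recommendation can be sketched as a small robustness sweep: inject synthetic Gaussian noise at increasing levels and report the metric at each level. The scorer and features below are toy stand-ins, not a Weld-4M model:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)


def score_fn(x: np.ndarray) -> np.ndarray:
    """Stand-in anomaly scorer: mean absolute feature magnitude (toy)."""
    return np.abs(x).mean(axis=1)


# Toy features: anomalous cycles have larger-magnitude features.
normal = rng.normal(0.0, 1.0, (200, 32))
anomalous = rng.normal(0.0, 2.0, (200, 32))
X = np.vstack([normal, anomalous])
y = np.r_[np.zeros(200, dtype=int), np.ones(200, dtype=int)]

aucs = []
for sigma in (0.0, 0.5, 1.0):  # synthetic "factory noise" levels
    X_noisy = X + rng.normal(0.0, sigma, X.shape)
    aucs.append(roc_auc_score(y, score_fn(X_noisy)))
    print(f"noise sigma={sigma}: I-AUROC={aucs[-1]:.3f}")
```

Reporting the full degradation curve, rather than a single clean-data number, is what separates noise-robust methods from those that merely fit the clean distribution.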

7. Benchmark Significance and Impact

Weld-4M systematically addresses the limitations of previous multimodal anomaly detection datasets in manufacturing by operationalizing the physical generative process from sensor/process to result. Its design enables the evaluation of UAD frameworks under realistic conditions—including severe modality imbalance, the presence of hidden-defect classes, and real-world factory noise.

The benchmark incentivizes research into multimodal fusion strategies that preserve causal dependencies and physical logic. By providing a comprehensive evaluation platform with a unified protocol, Weld-4M facilitates rigorous, direct comparison of models targeting process-aware anomaly detection in industrial contexts (Liu et al., 25 Dec 2025).
