FusWay: Multimodal Railway Defect Detection
- FusWay is a multimodal hybrid fusion architecture that combines visual features from YOLOv8n and event-driven audio cues to improve defect detection in railways.
- It employs a three-stage pipeline with synchronized video, specialized audio event segmentation, and element-wise fusion with spatial masking.
- Experimental results at high IoU thresholds demonstrate significant gains in overall accuracy, precision, and recall compared to vision-only models.
FusWay is a multimodal hybrid fusion architecture developed for railway defect detection. It integrates image and audio modalities to address the limitations of single-modality object detection systems. The framework couples a YOLOv8n object detector with a domain-informed audio subsystem and a lightweight Vision Transformer (ViT) classifier, which performs three-way classification across the “Rupture,” “Surface defect,” and “Nothing” categories. Its upstream fusion mechanism modulates vision features using quantized event-driven audio, achieving statistically significant improvements in defect localization performance relative to vision-only baselines (Zhukov et al., 2 Sep 2025).
1. System Architecture and Data Flow
FusWay adopts a three-stage multimodal pipeline for defect classification:
- YOLOv8n Vision Subsystem: Processes each RGB video frame, extracting convolutional feature tensors from three selected layers, alongside bounding-box proposals.
- Audio Analyzer/Encoder: In parallel, analyzes synchronously captured audio (sampled at 100 kHz over a 1 s temporal window), identifying transient events and producing, for each event, a class probability vector, a normalized peak amplitude, and a temporal segment.
- Fusion and Classification: The upstream fusion module modulates YOLO feature maps with audio-driven event tensors and spatial masking based on YOLO boxes. The fused multimodal tensors are linearized, tokenized, and passed into a ViT with an MLP head for final prediction.
In prose, the data flow is as follows: video frames enter YOLOv8n, which produces bounding boxes and feature maps from the selected layers; in parallel, the audio analyzer extracts events from the audio stream; the resulting audio event tensors are fused into the visual feature maps; and the fused tensors are fed to a ViT for three-class classification.
2. Vision Subsystem: YOLOv8n Feature Extraction
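The backbone YOLOv8n model operates at 30 fps, extracting multi-resolution feature maps following the SiLU activation. Feature maps are taken from three selected layers, balancing spatial resolution and semantic abstraction. For each layer, the output feature tensor is collapsed along the channel dimension via averaging, min-max normalized to $[0, 1]$, and stacked $C$ times, generating a $C$-channel tensor, where $C$ is the number of classes (3 in this context).
Explicitly, for layer $l$ with feature tensor $F_l \in \mathbb{R}^{K_l \times H_l \times W_l}$:
- Channel-averaged tensor: $\bar{F}_l = \frac{1}{K_l} \sum_{k=1}^{K_l} F_l^{(k)}$
- Normalized: $\hat{F}_l = \dfrac{\bar{F}_l - \min \bar{F}_l}{\max \bar{F}_l - \min \bar{F}_l}$
- Class-sliced stacking: $V_l = [\hat{F}_l, \dots, \hat{F}_l] \in \mathbb{R}^{C \times H_l \times W_l}$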
No attention mechanisms or dimensionality projections are introduced before fusion; the Transformer classification stage manages inter-class interactions.
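The collapse, normalize, and stack steps described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name `stack_feature_map`, the `eps` guard against constant maps, and the random input shape are all assumptions made for the example.

```python
import numpy as np

def stack_feature_map(feat: np.ndarray, num_classes: int = 3, eps: float = 1e-8) -> np.ndarray:
    """Turn a (K, H, W) feature tensor into a (num_classes, H, W) tensor."""
    # Average over the channel axis -> (H, W)
    mean_map = feat.mean(axis=0)
    # Min-max normalize to [0, 1]; eps (an assumption) avoids division by zero
    lo, hi = mean_map.min(), mean_map.max()
    norm_map = (mean_map - lo) / (hi - lo + eps)
    # Repeat the normalized map once per class -> (num_classes, H, W)
    return np.repeat(norm_map[None, :, :], num_classes, axis=0)

rng = np.random.default_rng(0)
feat = rng.normal(size=(64, 20, 20))  # hypothetical layer output, shape chosen arbitrarily
stacked = stack_feature_map(feat)
print(stacked.shape)  # (3, 20, 20)
```

Every class slice starts out identical; in this scheme the slices only diverge once the audio modulation is applied.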
3. Audio Subsystem and Temporal Alignment
FusWay’s audio analyzer does not compute global spectrograms or MFCC representations. Instead, it employs a domain-informed approach:
- Raw waveforms are segmented into events, each event $e$ yielding:
  - Class probabilities $p_e$ (one entry per class)
  - Normalized amplitude $a_e \in [0, 1]$
  - Event duration, given by the event's start and end times
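Temporal alignment maps each event's start and end times, within the 1 s analysis window, onto the spatial rows of the layer's stacked feature tensor, giving a start row and an end row for every event. Events then fill a zero-initialized audio tensor over the rows they span, using each event's class probabilities and normalized amplitude; all other entries remain zero.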
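One way to realize this alignment is sketched below. It is an illustration under stated assumptions rather than the paper's exact rule: the linear time-to-row mapping with floor and ceiling, the fill value (class probability times amplitude, broadcast across the width), and the names `AudioEvent` and `build_audio_tensor` are all hypothetical.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class AudioEvent:
    probs: np.ndarray   # class probabilities, shape (C,)
    amplitude: float    # normalized peak amplitude in [0, 1]
    t_start: float      # event start, seconds from window start
    t_end: float        # event end, seconds from window start

def build_audio_tensor(events, num_classes, height, width, window_s=1.0):
    """Fill a zero-initialized (C, H, W) tensor from audio events."""
    audio = np.zeros((num_classes, height, width))
    for ev in events:
        # Assumed linear mapping from time to feature-map rows
        r0 = int(np.floor(ev.t_start / window_s * height))
        r1 = int(np.ceil(ev.t_end / window_s * height))
        r0 = max(r0, 0)
        r1 = min(max(r1, r0 + 1), height)  # cover at least one row, stay in bounds
        # Assumed fill value: probability scaled by amplitude, same for every column.
        # Later events overwrite earlier ones where their rows overlap.
        audio[:, r0:r1, :] = (ev.probs * ev.amplitude)[:, None, None]
    return audio

events = [AudioEvent(probs=np.array([0.1, 0.2, 0.7]), amplitude=0.9, t_start=0.25, t_end=0.5)]
audio = build_audio_tensor(events, num_classes=3, height=20, width=20)
print(audio.shape)  # (3, 20, 20)
```

With a 20-row map and a 1 s window, the event above occupies rows 5 through 9; rows outside that range stay zero.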
4. Fusion Methodology and Spatial Masking
Fusion proceeds via element-wise modulation:
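- Define an all-ones tensor $\mathbf{1}$ with the same shape as the stacked vision tensor.
- Construct a binary mask $M$ over spatial locations from YOLO bounding boxes: $M(i, j) = 1$ if location $(i, j)$ lies inside any YOLO bounding box, and $M(i, j) = 0$ otherwise.
- Modulate the vision features by element-wise multiplication ($\odot$) with the audio event tensor, restricted by the mask $M$.
No additional thresholds, hand-tuned gating, or attention gates are introduced beyond YOLO’s default confidence and IoU-based pruning.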
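The masking and modulation steps can be sketched as follows. The combination rule used here, vision times (ones plus mask times audio), is an assumption chosen so that locations outside every box, or without audio evidence, keep their original vision value; it is not confirmed as the paper's exact expression. The helper names `box_mask` and `fuse`, and the box format (feature-map cell coordinates, end-exclusive), are likewise hypothetical.

```python
import numpy as np

def box_mask(boxes, height, width):
    """Binary (H, W) mask: 1 inside any box, 0 elsewhere.

    Boxes are (x0, y0, x1, y1) in feature-map cells, end-exclusive (an assumption).
    """
    mask = np.zeros((height, width))
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1.0
    return mask

def fuse(vision, audio, mask):
    """Element-wise modulation of (C, H, W) vision features by masked audio."""
    ones = np.ones_like(vision)
    # Assumed rule: V * (1 + M * A); the mask broadcasts over the class axis
    return vision * (ones + mask[None, :, :] * audio)

vision = np.full((3, 4, 4), 0.5)   # toy stacked vision tensor
audio = np.full((3, 4, 4), 0.6)    # toy audio event tensor
mask = box_mask([(1, 1, 3, 3)], height=4, width=4)
fused = fuse(vision, audio, mask)
print(fused[0])
```

Inside the box the toy values are amplified, while cells outside the box are returned unchanged.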
5. ViT Classifier and Training Protocol
The multimodally modulated tensors from the three selected YOLO layers are reshaped and linearized into input token sequences. These are embedded with positional encodings and input to a windowed ViT with 4 attention heads. The ViT’s CLS token is passed through an MLP head with a softmax output for three-way classification.
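The loss function combines standard multi-class cross-entropy with optional $L_2$ regularization. For a single sample over the $C = 3$ classes:
$$\mathcal{L}(\theta) = -\sum_{c=1}^{C} y_c \log \hat{y}_c + \lambda \lVert \theta \rVert_2^2$$
where $\theta$ denotes parameters, $\hat{y}$ the model predictions, $y$ the one-hot ground truth, and $\lambda$ the weight decay hyperparameter.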
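As a concrete reading of this objective, the sketch below computes cross-entropy plus an $L_2$ penalty in NumPy. Averaging the cross-entropy over the batch, the small constant inside the logarithm, and the function names are assumptions of this example, not details taken from the paper.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_l2_loss(logits, y_onehot, params, weight_decay=0.0):
    """Batch-mean cross-entropy plus weight_decay * sum of squared parameters."""
    probs = softmax(logits)
    # 1e-12 guards against log(0); batch averaging is an assumption
    ce = -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1))
    l2 = sum(float(np.sum(p ** 2)) for p in params)
    return ce + weight_decay * l2

logits = np.zeros((2, 3))                      # uniform predictions over 3 classes
y = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
params = [np.array([1.0, 2.0])]                # toy parameter list
print(ce_l2_loss(logits, y, params, weight_decay=0.1))
```

With uniform predictions the cross-entropy term equals $\ln 3$, and the toy parameters contribute $0.1 \times 5 = 0.5$ to the penalty.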
6. Experimental Setup and Performance
The evaluation leverages a railway dataset with 22,172 labeled image boxes (18,737 training, 3,435 validation), spanning:
- Rupture (Seal): 8,156 samples
- Surface defect: 3,026 samples
- Nothing: 10,990 samples
Synthesized audio aligns with class-conditional statistics (peak ranges: Nothing 0–0.2, Surface 0.3–0.6, Rupture 0.8–1). One-vs-all metrics (Precision, Recall, Accuracy) are reported for IoU thresholds of 0.3, 0.5, and 0.7, comparing YOLOv8n alone versus FusWay (YOLOv8n + ViT).
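The one-vs-all metrics can be illustrated with a short plain-Python sketch. It is a simplification under assumptions: each prediction is taken to be already paired with a ground-truth box (for example, by requiring their IoU to reach the chosen threshold), and the function names, box format, and toy labels are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def one_vs_all(y_true, y_pred, positive):
    """Precision, recall, and accuracy treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy

# Toy labels for illustration only, not data from the study
y_true = ["Rupture", "Rupture", "Nothing", "Surface defect", "Nothing"]
y_pred = ["Rupture", "Nothing", "Nothing", "Surface defect", "Rupture"]
print(one_vs_all(y_true, y_pred, positive="Rupture"))
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Raising the IoU threshold makes the pairing stricter, so fewer predictions qualify as matches before these counts are taken.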
Key test results at IoU=0.7 include:

| Metric | YOLOv8n | FusWay | Absolute Gain |
|----------------|---------|--------|---------------|
| Overall ACC | 0.3493 | 0.5095 | +0.1602 |
| Rupture P | 0.3945 | 0.6872 | +0.2927 |
| Rupture R | 0.3843 | 0.6695 | +0.2852 |
Statistical significance is supported by Student's unpaired t-test on mean accuracy: the difference between FusWay and YOLOv8n is significant at IoU=0.7 and at IoU=0.5, whereas at IoU=0.3 it is not. This indicates that multimodal fusion provides the largest benefit at higher localization thresholds.
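For reference, the statistic behind Student's unpaired t-test (equal-variance, pooled form) can be computed with the standard library alone. The sketch below uses toy numbers that are not results from the study, and the function name `student_t` is hypothetical.

```python
import math
from statistics import mean, variance

def student_t(a, b):
    """Pooled-variance two-sample t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled sample variance (assumes equal population variances)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))
    return t, df

# Toy samples for illustration only
t_stat, dof = student_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(round(t_stat, 4), dof)  # -1.2247 4
```

The p-value then follows from the t distribution with `dof` degrees of freedom; SciPy's `scipy.stats.ttest_ind` returns both the statistic and the p-value if that library is available.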
7. Context, Limitations, and Prospective Directions
FusWay's multimodal paradigm demonstrably mitigates false positives prevalent in visual-only defect detection, particularly by discriminating defects through their distinctive audio transients. The upstream fusion module capitalizes on event-driven audio cues to augment localized visual representations, resulting in improved precision under high IoU criteria.
Limitations are acknowledged: reliance on synthesized (rather than real) audio due to proprietary constraints, class imbalance (Surface defects under-represented), and the use of a simplified audio model without spectral decomposition. A plausible implication is that extending the audio front-end to include spectral features (e.g., STFT or MFCC) or incorporating additional sensor modalities (vibration, ultrasound) may yield further gains.
Overall, FusWay establishes that integrating lightweight YOLO feature extraction, domain-informed audio modulation, and transformer-based classification forms a tractable and effective pipeline for critical infrastructure defect detection, improving both aggregated accuracy and the reliability of localization at scale (Zhukov et al., 2 Sep 2025).