FusWay: Multimodal Railway Defect Detection

Updated 4 May 2026
  • FusWay is a multimodal hybrid fusion architecture that combines visual features from YOLOv8n and event-driven audio cues to improve defect detection in railways.
  • It employs a three-stage pipeline with synchronized video, specialized audio event segmentation, and element-wise fusion with spatial masking.
  • Experimental results at high IoU thresholds demonstrate significant gains in overall accuracy, precision, and recall compared to vision-only models.

FusWay is a multimodal hybrid fusion architecture developed for railway defect detection, integrating image and audio modalities to address the limitations of single-modality object detection systems. The framework tightly couples a YOLOv8n object detector with a domain-informed audio subsystem and a lightweight Vision Transformer (ViT) classifier, which performs three-way classification across “Rupture,” “Surface defect,” and “Nothing” categories. Its upstream fusion mechanism modulates vision features using quantized event-driven audio, achieving statistically significant improvements in defect localization performance relative to vision-only baselines (Zhukov et al., 2 Sep 2025).

1. System Architecture and Data Flow

FusWay adopts a three-stage multimodal pipeline for defect classification:

  1. YOLOv8n Vision Subsystem: Processes each RGB video frame, extracting convolutional feature tensors from layers $l = 7, 16, 19$ ($F_{0}^{\,l} \in \mathbb{R}^{C^l \times W^l \times H^l}$), alongside bounding-box proposals.
  2. Audio Analyzer/Encoder: In parallel, it analyzes synchronously captured audio (sampled at 100 kHz over a 1 s temporal window), identifying transient events and producing, per event $q$, a class probability vector $P_q \in [0,1]^K$, a normalized peak amplitude $G_q \in [0,1]$, and a temporal segment $[t_q, t_{q+1}]$.
  3. Fusion and Classification: The upstream fusion module modulates YOLO feature maps with audio-driven event tensors and spatial masking based on YOLO boxes. The fused multimodal tensors are linearized, tokenized, and passed into a ViT with an MLP head for final prediction.

In prose, the data flow is as follows. Video frames enter YOLOv8n, which produces bounding boxes and feature maps from the selected layers. In parallel, the audio analyzer extracts events from the synchronized audio. The resulting audio event tensors are fused into the visual feature maps, and the fused tensors are fed to a ViT for three-class classification.

2. Vision Subsystem: YOLOv8n Feature Extraction

The backbone YOLOv8n model operates at 30 fps, extracting multi-resolution feature maps following the SiLU activation. Feature map extraction is performed on layers $l = 7, 16, 19$, balancing spatial resolution and semantic abstraction. For each layer, the output feature tensor is collapsed along the channel dimension via averaging, min–max normalized to $[0,1]$, and stacked $K$ times, generating a $K$-channel tensor, where $K$ is the number of classes (3 in this context).

Explicitly, for layer $l$:

  • Channel-averaged tensor: $\bar{F}^{\,l} = \frac{1}{C^l} \sum_{c=1}^{C^l} F_{0}^{\,l}[c] \in \mathbb{R}^{W^l \times H^l}$
  • Normalized: $\hat{F}^{\,l} = \dfrac{\bar{F}^{\,l} - \min \bar{F}^{\,l}}{\max \bar{F}^{\,l} - \min \bar{F}^{\,l}} \in [0,1]^{W^l \times H^l}$
  • Class-sliced stacking: $F^{\,l} \in [0,1]^{K \times W^l \times H^l}$ with $F^{\,l}[k] = \hat{F}^{\,l}$ for $k = 1, \dots, K$

No attention mechanisms or dimensionality projections are introduced before fusion; the Transformer classification stage manages inter-class interactions.
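The three preprocessing steps (channel averaging, min–max normalization, and $K$-fold stacking) can be sketched in plain Python on nested lists. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and mapping a constant feature map to all zeros is an assumption made only to avoid division by zero.

```python
def channel_average(feat):
    """Average a C x W x H nested list over its channel axis, giving W x H."""
    c = len(feat)
    w, h = len(feat[0]), len(feat[0][0])
    return [[sum(feat[k][i][j] for k in range(c)) / c for j in range(h)]
            for i in range(w)]


def min_max_normalize(plane, eps=1e-12):
    """Rescale a W x H map to [0, 1].

    Assumption: a (near-)constant map is returned as all zeros.
    """
    flat = [v for row in plane for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span < eps:
        return [[0.0 for _ in row] for row in plane]
    return [[(v - lo) / span for v in row] for row in plane]


def stack_k(plane, k):
    """Copy a W x H map K times, giving K x W x H with independent rows."""
    return [[row[:] for row in plane] for _ in range(k)]


def prepare_layer(feat, k=3):
    """Channel-average, normalize, and stack one layer's feature tensor."""
    return stack_k(min_max_normalize(channel_average(feat)), k)


# Toy feature tensor with C=2 channels and a 2 x 2 spatial grid.
toy_feat = [
    [[0.0, 2.0], [4.0, 6.0]],
    [[2.0, 4.0], [6.0, 8.0]],
]
stacked = prepare_layer(toy_feat, k=3)
```

Every one of the $K$ slices is identical at this point; class-specific structure only appears once the audio event tensor is applied during fusion.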

3. Audio Subsystem and Temporal Alignment

FusWay’s audio analyzer does not compute global spectrograms or MFCC representations. Instead, it employs a domain-informed approach:

  • Raw waveforms are segmented into events, each yielding:
    • Class probabilities $P_q \in [0,1]^K$
    • Normalized amplitude $G_q \in [0,1]$
    • Event duration, given by the temporal segment $[t_q, t_{q+1}]$

Temporal alignment maps each event's interval $[t_q, t_{q+1}]$ within the 1 s window onto a band of spatial rows of the layer-$l$ feature map. Events then fill a zero-initialized audio tensor for that layer, with the entries inside each band set from the event's class probabilities $P_q$ and normalized amplitude $G_q$. The exact index mapping and fill rule are given in Zhukov et al. (2 Sep 2025).
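One way such an alignment could work is sketched below. Everything specific here is an assumption for illustration and is not taken from the paper: the linear time-to-row mapping, treating the last tensor axis as the "row" axis, the fill value $P_q[k] \cdot G_q$, and the function names `event_rows` and `build_audio_tensor`.

```python
import math


def event_rows(t_start, t_end, n_rows, window=1.0):
    """Map an event interval (seconds) to a half-open row range [r0, r1).

    Assumption: time is mapped linearly onto rows over the analysis window.
    """
    r0 = max(0, min(n_rows - 1, math.floor(t_start / window * n_rows)))
    r1 = max(r0 + 1, min(n_rows, math.ceil(t_end / window * n_rows)))
    return r0, r1


def build_audio_tensor(events, k, w, h, window=1.0):
    """Fill a zero-initialized K x W x H tensor from audio events.

    events: list of (probs, gain, t_start, t_end), where probs has length K.
    Assumption: every cell in the event's row band receives probs[c] * gain.
    """
    audio = [[[0.0] * h for _ in range(w)] for _ in range(k)]
    for probs, gain, t0, t1 in events:
        r0, r1 = event_rows(t0, t1, h, window)
        for c in range(k):
            value = probs[c] * gain  # assumed fill rule
            for i in range(w):
                for j in range(r0, r1):
                    audio[c][i][j] = value
    return audio


# One hypothetical event in the middle of the window, K=3 classes.
toy_events = [([0.1, 0.2, 0.7], 0.5, 0.25, 0.75)]
audio_tensor = build_audio_tensor(toy_events, k=3, w=2, h=4)
```

In this toy case the event occupies rows 1 and 2 of a 4-row map, and rows 0 and 3 stay at zero because no event covers them.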

4. Fusion Methodology and Spatial Masking

Fusion proceeds via element-wise modulation:

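  • Compute the all-ones tensor $\mathbf{1}$, matching the shape of the vision tensor.
  • Construct a binary mask $M$ over spatial locations from YOLO bounding boxes:

$M(x, y) = 1$ if $(x, y)$ lies inside a YOLO bounding box, and $M(x, y) = 0$ otherwise.

  • Modulate vision features: the vision tensor is combined element-wise with $\mathbf{1}$, $M$, and the audio event tensor, with the mask restricting the audio contribution to detected boxes. The exact expression is given in Zhukov et al. (2 Sep 2025).

All products in this step are element-wise ($\odot$). No additional thresholds, hand-tuned gating, or attention gates are introduced beyond YOLO’s default confidence and IoU-based pruning.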
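A masked element-wise modulation of this kind can be sketched as follows. The specific formula used here, `fused = vision * (1 + mask * audio)`, is an assumption chosen because it leaves features outside the boxes unchanged; it is not confirmed as the paper's expression. The helper names and the box format (half-open index ranges in feature-map coordinates) are also hypothetical.

```python
def box_mask(boxes, w, h):
    """Binary W x H mask that is 1.0 inside any box and 0.0 elsewhere.

    boxes: list of (i0, j0, i1, j1) half-open index ranges, assumed to be
    already rescaled to feature-map coordinates.
    """
    mask = [[0.0] * h for _ in range(w)]
    for i0, j0, i1, j1 in boxes:
        for i in range(max(0, i0), min(w, i1)):
            for j in range(max(0, j0), min(h, j1)):
                mask[i][j] = 1.0
    return mask


def fuse(vision, audio, mask):
    """Assumed fusion rule: vision * (1 + mask * audio), element-wise.

    vision, audio: K x W x H nested lists; mask: W x H, shared by all K slices.
    """
    k, w, h = len(vision), len(vision[0]), len(vision[0][0])
    return [[[vision[c][i][j] * (1.0 + mask[i][j] * audio[c][i][j])
              for j in range(h)]
             for i in range(w)]
            for c in range(k)]


# Toy example: one class slice on a 2 x 2 grid, one box covering cell (0, 0).
toy_vision = [[[0.5, 0.5], [0.5, 0.5]]]
toy_audio = [[[0.4, 0.4], [0.4, 0.4]]]
toy_mask = box_mask([(0, 0, 1, 1)], w=2, h=2)
toy_fused = fuse(toy_vision, toy_audio, toy_mask)
```

Under this assumed rule, the all-ones term acts as an identity: where the mask or the audio tensor is zero, the vision feature passes through untouched, so audio can only amplify responses inside detected boxes.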

5. ViT Classifier and Training Protocol

The multimodally modulated tensors from the three selected YOLO layers are reshaped and linearized into input token sequences. These are embedded with positional encodings and fed to a windowed ViT with 4 attention heads (the window size is specified in Zhukov et al., 2 Sep 2025). The ViT’s CLS token is passed through an MLP softmax head for three-way classification.

The loss function combines standard multi-class cross-entropy with optional $L_2$ regularization:

$\mathcal{L}(\theta) = -\sum_{k=1}^{K} y_k \log \hat{y}_k + \lambda \lVert \theta \rVert_2^2$

where $\theta$ denotes parameters, $\hat{y}$ the model predictions, $y$ the one-hot ground truth, and $\lambda$ the weight decay hyperparameter.
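For a single sample, a cross-entropy loss with an optional squared $L_2$ penalty can be computed as in the sketch below. The function names are hypothetical, the clamp `eps` is added only for numerical safety, and treating the penalty as `weight_decay` times the sum of squared parameters is the standard textbook form, assumed here rather than quoted from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot, eps=1e-12):
    """Multi-class cross-entropy between predicted probs and a one-hot label."""
    return -sum(y * math.log(max(p, eps)) for p, y in zip(probs, one_hot))


def regularized_loss(logits, one_hot, params, weight_decay=0.0):
    """Cross-entropy plus weight_decay * sum of squared parameters."""
    l2 = sum(p * p for p in params)
    return cross_entropy(softmax(logits), one_hot) + weight_decay * l2


# Uniform logits over three classes; the true class is the first one.
toy_loss_plain = regularized_loss([0.0, 0.0, 0.0], [1, 0, 0], [1.0, 2.0])
toy_loss_decay = regularized_loss([0.0, 0.0, 0.0], [1, 0, 0], [1.0, 2.0],
                                  weight_decay=0.1)
```

With uniform predictions over three classes the cross-entropy term equals $\log 3$, and the penalty adds $0.1 \times (1^2 + 2^2) = 0.5$ in the second call.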

6. Experimental Setup and Performance

The evaluation leverages a railway dataset with 22,172 labeled image boxes (18,737 training, 3,435 validation), spanning:

  • Rupture (Seal): 8,156 samples
  • Surface defect: 3,026 samples
  • Nothing: 10,990 samples

Synthesized audio aligns with class-conditional statistics (peak ranges: Nothing 0–0.2, Surface 0.3–0.6, Rupture 0.8–1). One-vs-all metrics (Precision, Recall, Accuracy) are reported for IoU thresholds of 0.3, 0.5, and 0.7, comparing YOLOv8n alone versus FusWay (YOLOv8n + ViT).
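To make the role of the IoU threshold concrete, the sketch below computes one-vs-all precision, recall, and accuracy for a target class. The matching protocol is an assumption for illustration: each prediction is paired with one ground-truth box, and it counts as a positive for the target class only if the label matches and the IoU reaches the threshold. The paper's exact evaluation procedure may differ, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def one_vs_all(pairs, target, iou_thr):
    """Precision, recall, and accuracy for `target` at a given IoU threshold.

    pairs: list of (pred_label, pred_box, true_label, true_box).
    """
    tp = fp = fn = tn = 0
    for pred_label, pred_box, true_label, true_box in pairs:
        predicted = pred_label == target and iou(pred_box, true_box) >= iou_thr
        actual = true_label == target
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Hypothetical toy detections; the second one is poorly localized (IoU = 1/3).
toy_pairs = [
    ("Rupture", (0, 0, 10, 10), "Rupture", (0, 0, 10, 10)),
    ("Rupture", (0, 0, 10, 10), "Rupture", (5, 0, 15, 10)),
    ("Rupture", (0, 0, 10, 10), "Nothing", (0, 0, 10, 10)),
    ("Nothing", (0, 0, 10, 10), "Nothing", (0, 0, 10, 10)),
]
loose = one_vs_all(toy_pairs, "Rupture", 0.3)
strict = one_vs_all(toy_pairs, "Rupture", 0.5)
```

Raising the threshold from 0.3 to 0.5 turns the poorly localized detection from a true positive into a miss, which is why metrics at high IoU are more sensitive to localization quality.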

Key test results at IoU=0.7 include:

| Metric      | YOLOv8n | FusWay | Absolute Gain |
|-------------|---------|--------|---------------|
| Overall ACC | 0.3493  | 0.5095 | +0.1602       |
| Rupture P   | 0.3945  | 0.6872 | +0.2927       |
| Rupture R   | 0.3843  | 0.6695 | +0.2852       |

Statistical significance is assessed with an unpaired Student's t-test on mean accuracy. The FusWay improvement is reported as significant at IoU=0.7 and at IoU=0.5, whereas at IoU=0.3 the difference is not significant; the test statistics and p-values are given in Zhukov et al. (2 Sep 2025). This pattern indicates that multimodal fusion provides the largest benefit at higher localization thresholds.
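For readers unfamiliar with the test, the unpaired Student's t statistic with pooled variance can be computed from two samples of accuracy scores as sketched below. The sample values are made up purely to exercise the function and are not results from the paper; `student_t` is a hypothetical helper name. The sketch returns only the statistic and the degrees of freedom, since converting them to a p-value requires the t-distribution CDF, which the Python standard library does not provide.

```python
import math
import statistics


def student_t(a, b):
    """Unpaired Student's t statistic (equal variances assumed) and its df."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    df = na + nb - 2
    pooled = ((na - 1) * va + (nb - 1) * vb) / df
    if pooled == 0:
        raise ValueError("both samples have zero variance")
    se = math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, df


# Made-up samples, used only to check the arithmetic.
toy_t, toy_df = student_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A larger absolute t value at a given number of degrees of freedom corresponds to a smaller p-value, so a gap that is large relative to run-to-run variance is what yields significance.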

7. Context, Limitations, and Prospective Directions

FusWay's multimodal paradigm demonstrably mitigates false positives prevalent in visual-only defect detection, particularly by discriminating defects through their distinctive audio transients. The upstream fusion module capitalizes on event-driven audio cues to augment localized visual representations, resulting in improved precision under high IoU criteria.

Limitations are acknowledged: reliance on synthesized (rather than real) audio due to proprietary constraints, class imbalance (Surface defects under-represented), and the use of a simplified audio model without spectral decomposition. A plausible implication is that extending the audio front-end to include spectral features (e.g., STFT or MFCC) or incorporating additional sensor modalities (vibration, ultrasound) may yield further gains.

Overall, FusWay establishes that integrating lightweight YOLO feature extraction, domain-informed audio modulation, and transformer-based classification forms a tractable and effective pipeline for critical infrastructure defect detection, improving both aggregated accuracy and the reliability of localization at scale (Zhukov et al., 2 Sep 2025).
