
Multi-View Spectrogram Transformers

Updated 22 June 2026
  • The paper introduces a Multi-View Spectrogram Transformer (MVST) that integrates multi-scale patching and adaptive gated fusion to capture detailed respiratory sound features.
  • MVST leverages five unique spectrogram views to effectively represent diverse acoustic events like crackles and wheezes through tailored time-frequency resolutions.
  • Empirical results on the ICBHI dataset demonstrate significant sensitivity gains and an improved average score, outperforming previous methods.

The Multi-View Spectrogram Transformer (MVST) is a deep learning framework for respiratory sound classification that builds time–frequency acoustic characteristics into a vision transformer architecture through multi-scale spectrogram patching and adaptive feature fusion. In contrast to prior approaches that treat spectrograms as generic images, MVST exploits their physical structure to better capture salient respiratory events such as crackles and wheezes, achieving state-of-the-art performance on the ICBHI respiratory sound dataset (He et al., 2023).

1. Multi-View Patch Construction of Mel-Spectrograms

MVST operates on mel-spectrograms $S$ of size $H \times W$ (with $H = 256$ frequency bins and $W = 1024$ time frames). Unlike standard vision transformers (ViT), which apply a single fixed-size square patch (e.g., $16 \times 16$), MVST decomposes $S$ into five distinct tilings ("views") using non-overlapping patches:

| View ($\ell$) | Patch size ($\mu_\ell \times \nu_\ell$) | Physical resolution |
|---|---|---|
| 0 | $256 \times 1$ | Full frequency range, 1 time step |
| 1 | $128 \times 2$ | 128 frequency bins, 2 time steps |
| 2 | $64 \times 4$ | 64 frequency bins, 4 time steps |
| 3 | $32 \times 8$ | 32 frequency bins, 8 time steps |
| 4 | $16 \times 16$ | 16 frequency bins, 16 time steps |

For all $\ell$, the patch area $\mu_\ell \nu_\ell = 256$ is held constant, so each view yields $HW/256 = 1024$ patches. This pyramid captures trade-offs between temporal and frequency resolution: view $\ell = 0$ captures instantaneous full-band spectra, while view $\ell = 4$ captures local time–frequency neighborhoods.
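
The five tilings can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names `patchify` and `multi_view_patches`, the random stand-in spectrogram, and the row ordering of patches (frequency band first, then time) are all assumptions made for the example.

```python
import numpy as np

# Patch sizes (mu, nu) = (frequency bins, time frames) for views 0..4, taken from the table above.
VIEW_PATCHES = [(256, 1), (128, 2), (64, 4), (32, 8), (16, 16)]

def patchify(spec, mu, nu):
    """Split an (H, W) spectrogram into non-overlapping (mu, nu) patches.

    Returns an array of shape (num_patches, mu * nu) with one flattened patch per row.
    Row ordering (frequency band first, then time) is an assumption of this sketch.
    """
    H, W = spec.shape
    if H % mu or W % nu:
        raise ValueError("patch size must divide the spectrogram dimensions")
    # (H/mu, mu, W/nu, nu) -> (H/mu, W/nu, mu, nu) -> (num_patches, mu * nu)
    blocks = spec.reshape(H // mu, mu, W // nu, nu).transpose(0, 2, 1, 3)
    return blocks.reshape(-1, mu * nu)

def multi_view_patches(spec):
    """Return one patch matrix per view."""
    return [patchify(spec, mu, nu) for mu, nu in VIEW_PATCHES]

# Random stand-in for a 256 x 1024 mel-spectrogram (hypothetical data).
spec = np.random.default_rng(0).standard_normal((256, 1024))
views = multi_view_patches(spec)
for level, patches in enumerate(views):
    print(level, patches.shape)  # every view has shape (1024, 256)
```

Because the patch area is fixed at 256, every view produces the same number of patches and the same flattened length, which is what allows the per-view features to be combined elementwise later on.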

The motivation is that physiological events, such as brief crackles or sustained wheezes, differ in their spectro-temporal localization, so no single patch scale is optimal for all event types. Multi-view patching grants robustness to frequency shifts, enables focus at multiple granularities, and more faithfully models the physical characteristics of sounds in the mel domain.

2. Positional Embedding and Patch Tokenization

Each patch in view $\ell$ ($x_{\ell,i} \in \mathbb{R}^{\mu_\ell \times \nu_\ell}$) is flattened to a vector of length $\mu_\ell \nu_\ell = 256$ and projected to a $D$-dimensional token via a learnable linear mapping: $$z_{\ell,i} = E_\ell \, \mathrm{vec}(x_{\ell,i})$$ where $E_\ell \in \mathbb{R}^{D \times 256}$.

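To preserve the temporal (time interval) and spectral (frequency band) information, MVST introduces a learnable positional embedding $P_\ell \in \mathbb{R}^{N \times D}$ for each view, where $N$ is the number of patches in the view and $D$ is the token dimension. With $z_{\ell,i}$ denoting the token of patch $i$, the input to the first transformer block per view is then: $$Z_\ell^{(0)} = [z_{\ell,1}; \ldots; z_{\ell,N}] + P_\ell$$ This annotation ensures the model maintains awareness of the location of each patch in the original spectrogram, both in time and frequency.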
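
A minimal sketch of this tokenization step for a single view follows. The helper name `embed_view`, the token width `D = 64`, and the random initialization are assumptions for illustration only. The code stores tokens as rows, so the projection appears as a right-multiplication by a `(256, D)` matrix.

```python
import numpy as np

def embed_view(patches, proj, pos):
    """Project flattened patches to tokens and add the positional embedding.

    patches: (N, 256) flattened patches of one view
    proj:    (256, D) learnable projection (row-vector convention)
    pos:     (N, D)   learnable positional embedding for this view
    """
    tokens = patches @ proj  # (N, D) patch tokens
    return tokens + pos      # input to the first transformer block

rng = np.random.default_rng(0)
N, P, D = 1024, 256, 64                    # D is an illustrative choice
patches = rng.standard_normal((N, P))      # stand-in for one view's patches
proj = 0.02 * rng.standard_normal((P, D))  # would be learned during training
pos = 0.02 * rng.standard_normal((N, D))   # would be learned during training

z0 = embed_view(patches, proj, pos)
print(z0.shape)  # (1024, 64)
```

Each view would hold its own projection and positional embedding, since patch `i` covers a different time–frequency region in each tiling.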

3. Transformer Encoder Architecture

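For each view $\ell$, the token sequence $Z_\ell^{(0)}$ is passed through $L$ stacked Transformer blocks, each including Layer Normalization (LN), Multi-Head Self-Attention (MSA), and a two-layer Feed-Forward Network (FFN), with residual connections for both sublayers. Formally, for block $k = 1, \ldots, L$ (written here in the pre-norm arrangement of the standard ViT):

  • Attention sublayer:

$$\tilde{Z}_\ell^{(k)} = Z_\ell^{(k-1)} + \mathrm{MSA}\big(\mathrm{LN}(Z_\ell^{(k-1)})\big)$$

  • FFN sublayer:

$$Z_\ell^{(k)} = \tilde{Z}_\ell^{(k)} + \mathrm{FFN}\big(\mathrm{LN}(\tilde{Z}_\ell^{(k)})\big)$$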

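MSA uses scaled dot-product attention for each head: $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$ where $Q = X W_Q$, $K = X W_K$, $V = X W_V$ and $d_k = D/h$, with $X$ the input tokens, $D$ the token dimension, and $h$ the number of heads.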

The FFN is implemented as: $$\mathrm{FFN}(x) = W_2 \, \phi(W_1 x + b_1) + b_2$$ where $\phi$ is the GELU nonlinearity, and typically $W_1 \in \mathbb{R}^{D_{ff} \times D}$, $W_2 \in \mathbb{R}^{D \times D_{ff}}$, with $D_{ff} = 4D$.

Each view independently produces a feature map $F_\ell \in \mathbb{R}^{N \times D}$ (one $D$-dimensional row per token) after $L$ layers.
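
The encoder block described in this section can be sketched in NumPy as below. Several details are assumptions not confirmed by the text: layer normalization is applied before each sublayer (pre-norm), it has no learnable scale or shift, the attention heads are followed by an output projection `Wo`, and GELU uses its common tanh approximation. All sizes and parameter names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its features (learnable scale/shift omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def msa(x, p, num_heads):
    n, d = x.shape
    d_k = d // num_heads

    def split(t):  # (n, d) -> (heads, n, d_k)
        return t.reshape(n, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))  # (heads, n, n)
    heads = (attn @ v).transpose(1, 0, 2).reshape(n, d)      # concatenate heads
    return heads @ p["Wo"]

def ffn(x, p):
    return gelu(x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]

def encoder_block(z, p, num_heads):
    z = z + msa(layer_norm(z), p, num_heads)  # attention sublayer + residual
    z = z + ffn(layer_norm(z), p)             # FFN sublayer + residual
    return z

def init_params(rng, d, scale=0.02):
    shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
              "W1": (d, 4 * d), "b1": (4 * d,), "W2": (4 * d, d), "b2": (d,)}
    return {name: scale * rng.standard_normal(shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
n_tokens, dim, heads, depth = 32, 16, 4, 2  # small illustrative sizes
z = rng.standard_normal((n_tokens, dim))
blocks = [init_params(rng, dim) for _ in range(depth)]

out = z
for p in blocks:
    out = encoder_block(out, p, heads)
print(out.shape)  # (32, 16)
```

In MVST each of the five views would run through its own stack of such blocks, giving five feature maps of identical shape.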

4. Adaptive Gated Fusion Across Multi-View Features

To aggregate evidence from the five patch resolutions, MVST employs a gated fusion mechanism. For each view, a "gate" $g_\ell$ is computed from its feature map $F_\ell$:

$$g_\ell = \sigma\big(F_\ell W_\ell^{g} + b_\ell^{g}\big)$$

with $\sigma$ denoting the elementwise sigmoid and $W_\ell^{g}, b_\ell^{g}$ trainable. Each gate modulates how much to trust view $\ell$ for each token.

The final multi-view representation is an elementwise fusion,

$$F = \sum_{\ell=0}^{4} g_\ell \odot F_\ell$$

where $\odot$ is the Hadamard product. $F$ is then aggregated (via pooling) and input to a shallow MLP for respiratory class prediction.

This mechanism permits the model to dynamically select the most informative view for every local spectrogram region, facilitating adaptive focus on spectro-temporal structures salient to the task.
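
The fusion step can be illustrated with the following sketch. It assumes that each gate is a sigmoid of a per-view linear map of that view's own features and that pooling is a mean over tokens. The source does not confirm either choice, and the names `gated_fusion`, `gate_w`, and `gate_b` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(features, gate_weights, gate_biases):
    """Sum per-view feature maps, each scaled elementwise by its own gate.

    features:     list of (N, D) arrays, one per view
    gate_weights: list of (D, D) arrays (assumed gate parameterization)
    gate_biases:  list of (D,) arrays
    """
    fused = np.zeros_like(features[0])
    for feat, w, b in zip(features, gate_weights, gate_biases):
        gate = sigmoid(feat @ w + b)  # (N, D), values in (0, 1)
        fused += gate * feat          # Hadamard product, accumulated over views
    return fused

rng = np.random.default_rng(0)
n_tokens, dim, n_views = 1024, 32, 5  # dim is an illustrative choice
features = [rng.standard_normal((n_tokens, dim)) for _ in range(n_views)]
gate_w = [0.02 * rng.standard_normal((dim, dim)) for _ in range(n_views)]
gate_b = [np.zeros(dim) for _ in range(n_views)]

fused = gated_fusion(features, gate_w, gate_b)
pooled = fused.mean(axis=0)  # mean pooling over tokens (assumed)
print(fused.shape, pooled.shape)  # (1024, 32) (32,)
```

With all gate parameters at zero, every gate equals 0.5 and the fusion reduces to a uniform average of the views up to a constant factor. Training moves the gates away from that equal weighting.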

5. Experimental Protocol and Benchmark Results

MVST was validated on the ICBHI respiratory-sound corpus (6,898 annotated breathing cycles, recorded at sampling rates from 4 kHz to 44.1 kHz), using the patient-level split of the official challenge protocol: 60% for training and 40% for testing. The principal evaluation metrics were specificity (SP), sensitivity (SE), and average score, $\mathrm{AS} = (\mathrm{SP} + \mathrm{SE})/2$.
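
The metrics can be computed as in the sketch below. The text only defines AS. The definitions of SP (fraction of normal cycles predicted as normal) and SE (fraction of abnormal cycles assigned their exact abnormal class) follow the usual ICBHI challenge convention and are an assumption here, as are the function name `icbhi_scores` and the label encoding.

```python
def icbhi_scores(y_true, y_pred, normal_label=0):
    """Return (specificity, sensitivity, average score) for 4-class cycle labels.

    Assumed convention: label 0 = normal; any other label = an abnormal class
    (crackle, wheeze, or both). An abnormal cycle counts as a hit only if the
    exact abnormal class is predicted.
    """
    pairs = list(zip(y_true, y_pred))
    normal_total = sum(1 for t, _ in pairs if t == normal_label)
    abnormal_total = len(pairs) - normal_total
    if normal_total == 0 or abnormal_total == 0:
        raise ValueError("need at least one normal and one abnormal cycle")
    normal_hits = sum(1 for t, p in pairs if t == normal_label and p == t)
    abnormal_hits = sum(1 for t, p in pairs if t != normal_label and p == t)
    sp = normal_hits / normal_total
    se = abnormal_hits / abnormal_total
    return sp, se, (sp + se) / 2

# Hypothetical labels: 0 = normal, 1 = crackle, 2 = wheeze, 3 = both.
y_true = [0, 0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 1, 1, 2, 2, 0]
sp, se, avg = icbhi_scores(y_true, y_pred)
print(sp, se, avg)  # 0.75 0.5 0.625
```

Because AS weights SP and SE equally regardless of class balance, a gain in sensitivity raises the score even when specificity is nearly unchanged.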

Key performance results:

| Model | Specificity (SP) | Sensitivity (SE) | Average Score (AS) |
|---|---|---|---|
| MVST | 81.99% | 51.10% | 66.55% |
| Previous best (AST + patch-mix CL) | 81.66% | 43.01% | 62.37% |

Training used cross-entropy loss and the AdamW optimizer (learning rate and weight decay as reported in He et al., 2023) for 50 epochs with a batch size of 8. MVST improved the average score by 4.18 percentage points over the previous best method, driven mainly by a sensitivity gain of 8.09 percentage points.

6. Role of Multi-View Patching and Gated Fusion Versus Image-Style Modeling

Spectrograms fundamentally differ from natural images: frequency and time axes have distinct physical semantics, and the mel scale is non-uniform, especially at high frequencies. Conventional ViT patching ($16 \times 16$) can be misaligned: either too wide in time (blurring rapid transients) or too narrow in frequency (missing shifts of relevant spectral content).

Multi-view patching forms a granular "pyramid" ensuring that at least one view can capture each acoustic event optimally. Gated fusion adaptively adjudicates which views contribute most per instance and per dimension, rather than weighting all equally. This design permits robust extraction of respiratory phenomena, especially under real-world noise.

Empirically, this approach yields improved detection and discrimination of crackles, wheezes, and their combinations, validating the benefit of embedding spectrogram physics directly into the model inductive bias.

7. Significance in Time–Frequency Audio Modeling

MVST demonstrates the importance of physically-informed architectural choices for audio classification, especially where events of interest manifest at widely differing time–frequency scales. Embedding multiple patch granularities and adaptive fusion into the transformer framework bridges spectrogram physics with powerful attention-based modeling, providing both enhanced theoretical insight and empirically superior results on benchmark respiratory-sound classification (He et al., 2023).

References (1)
