Multi-View Spectrogram Transformers
- The paper introduces a Multi-View Spectrogram Transformer (MVST) that integrates multi-scale patching and adaptive gated fusion to capture detailed respiratory sound features.
- MVST leverages five unique spectrogram views to effectively represent diverse acoustic events like crackles and wheezes through tailored time-frequency resolutions.
- Empirical results on the ICBHI dataset demonstrate significant sensitivity gains and an improved average score, outperforming previous methods.
The Multi-View Spectrogram Transformer (MVST) is a deep learning framework for respiratory sound classification that integrates time–frequency acoustic characteristics into a vision transformer architecture via multi-scale spectrogram patching and adaptive feature fusion. In contrast to prior approaches that treat spectrograms as generic images, MVST exploits their physical structure to better capture salient respiratory events such as crackles and wheezes, achieving state-of-the-art performance on the ICBHI respiratory sound dataset (He et al., 2023).
1. Multi-View Patch Construction of Mel-Spectrograms
MVST operates on mel-spectrograms $X \in \mathbb{R}^{F \times T}$ (where $F$ is the number of frequency bins and $T$ the number of time frames). Unlike standard vision transformers (ViT), which apply a single fixed-size square patch (e.g., $16 \times 16$), MVST decomposes $X$ into five distinct tilings ("views") using non-overlapping patches:
| View ($v$) | Patch Size ($h_v \times w_v$) | Physical Resolution |
|---|---|---|
| 0 | $F \times 1$ | Full frequency, 1 time step |
| 1 | $128 \times 2$ | 128 freq, 2 time steps |
| 2 | $64 \times 4$ | 64 freq, 4 time steps |
| 3 | $32 \times 8$ | 32 freq, 8 time steps |
| 4 | $16 \times 16$ | 16 freq, 16 time steps |
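For all views $v$, the patch area (frequency bins times time steps) is held constant: $128 \times 2 = 64 \times 4 = 32 \times 8 = 16 \times 16 = 256$ elements. This pyramid trades temporal resolution against frequency resolution: view 0 captures instantaneous full-band spectra, while view 4 captures local time–frequency neighborhoods.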
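The tiling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the names `VIEW_PATCHES`, `extract_patches`, and `multi_view_patches` are hypothetical, and the spectrogram height of 256 mel bins is an assumption inferred from the constant patch area (so that the full-frequency patch of view 0 is 256 by 1).

```python
import numpy as np

# (freq bins, time steps) per view. The value 256 for view 0 is an assumption:
# it makes the full-frequency patch match the 256-element area of the others.
VIEW_PATCHES = [(256, 1), (128, 2), (64, 4), (32, 8), (16, 16)]


def extract_patches(spec, patch_h, patch_w):
    """Split an (F, T) spectrogram into non-overlapping, flattened patches.

    Returns an array of shape (num_patches, patch_h * patch_w), ordered
    frequency-block first, then time-block.
    """
    n_freq, n_time = spec.shape
    if n_freq % patch_h or n_time % patch_w:
        raise ValueError("spectrogram shape must be divisible by the patch size")
    # (F/ph, ph, T/pw, pw) -> (F/ph, T/pw, ph, pw) -> (N, ph * pw)
    grid = spec.reshape(n_freq // patch_h, patch_h, n_time // patch_w, patch_w)
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch_h * patch_w)


def multi_view_patches(spec):
    """Return one patch matrix per view."""
    return [extract_patches(spec, h, w) for h, w in VIEW_PATCHES]


# Toy input: 256 mel bins, 64 time frames (sizes chosen for illustration only).
spec = np.random.default_rng(0).standard_normal((256, 64))
views = multi_view_patches(spec)
shapes = [v.shape for v in views]
print(shapes)
```

Because every view uses the same patch area, each view yields the same number of patches and the same flattened patch length for a given input, which keeps the per-view token sequences shape-compatible.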
The motivation is that physiological events—e.g., brief crackles or sustained wheezes—differ in their spectro-temporal localization, and a single patch scale is typically suboptimal for all event types. Multi-view patching grants robustness to frequency shifts, enables focus at multiple granularities, and more faithfully models the physical characteristics of sounds in the mel domain.
2. Positional Embedding and Patch Tokenization
Each patch $x^{(v)}_n$ in view $v$ ($n = 1, \dots, N_v$) is flattened to a vector of length $h_v w_v$ (patch height times patch width) and projected to a $D$-dimensional token via a learnable linear mapping: $e^{(v)}_n = E^{(v)} x^{(v)}_n$, where $E^{(v)} \in \mathbb{R}^{D \times h_v w_v}$.
To preserve temporal (time interval) and spectral (frequency band) information, MVST introduces a learnable positional embedding $P^{(v)} \in \mathbb{R}^{N_v \times D}$ for each view. The input to the first transformer block per view is then: $Z^{(v)}_0 = [e^{(v)}_1; \dots; e^{(v)}_{N_v}] + P^{(v)}$. This ensures the model retains the location of each patch in the original spectrogram, in both time and frequency.
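As a rough sketch of the tokenization step for a single view, the following NumPy code projects flattened patches to tokens and adds a positional embedding. The sizes (64 patches, patch length 256, embedding width 192) and the names `init_view_params` and `tokenize` are illustrative assumptions, and the randomly initialised arrays stand in for parameters that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these values are taken from the paper.
num_patches, patch_dim, embed_dim = 64, 256, 192


def init_view_params(num_patches, patch_dim, embed_dim, rng):
    """Random stand-ins for one view's learnable projection and positions."""
    return {
        "proj": rng.standard_normal((patch_dim, embed_dim)) * 0.02,
        "pos": rng.standard_normal((num_patches, embed_dim)) * 0.02,
    }


def tokenize(patches, params):
    """Project (N, patch_dim) patches to (N, embed_dim) tokens, then add positions."""
    tokens = patches @ params["proj"]   # linear patch embedding, row-vector convention
    return tokens + params["pos"]       # one positional vector per patch index


patches = rng.standard_normal((num_patches, patch_dim))
params = init_view_params(num_patches, patch_dim, embed_dim, rng)
z0 = tokenize(patches, params)
print(z0.shape)
```

Each view would hold its own `proj` and `pos` arrays, since the patch geometry (and therefore what a given patch index means in time and frequency) differs from view to view.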
3. Transformer Encoder Architecture
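For each view $v$, the input token sequence $Z^{(v)}_0$ is passed through $L$ stacked Transformer blocks, each including Layer Normalization (LN), Multi-Head Self-Attention (MSA), and a two-layer Feed-Forward Network (FFN), with residual connections for both sublayers. Formally, for block $\ell = 1, \dots, L$ (shown in the pre-normalization arrangement standard for ViT, with the view index omitted):
- Attention sublayer:
$\tilde{Z}_\ell = Z_{\ell-1} + \mathrm{MSA}(\mathrm{LN}(Z_{\ell-1}))$
- FFN sublayer:
$Z_\ell = \tilde{Z}_\ell + \mathrm{FFN}(\mathrm{LN}(\tilde{Z}_\ell))$
MSA uses scaled dot-product attention for each head: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right) V$, where $Q = ZW_Q$, $K = ZW_K$, $V = ZW_V$, and $d_k = D/H$ for $H$ heads.
The FFN is implemented as: $\mathrm{FFN}(z) = W_2\,\phi(W_1 z + b_1) + b_2$, where $\phi$ is the GELU nonlinearity, and typically $W_1 \in \mathbb{R}^{D_{\mathrm{ff}} \times D}$, $W_2 \in \mathbb{R}^{D \times D_{\mathrm{ff}}}$, with $D_{\mathrm{ff}} = 4D$.
Each view independently produces a feature map $Z^{(v)}_L \in \mathbb{R}^{N_v \times D}$ after $L$ layers.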
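The encoder block described in this section can be sketched in plain NumPy as follows. This is a simplified forward pass under several assumptions that are not taken from the paper: a pre-normalization layout as in ViT, layer normalization without learnable scale and shift, the tanh approximation of GELU, a hidden width of four times the embedding width, and illustrative sizes (64 tokens, width 192, 3 heads, 2 blocks). All function and parameter names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalise each token vector (learnable scale/shift omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    """Tanh approximation of the GELU nonlinearity."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(z, p, num_heads):
    """Scaled dot-product self-attention over an (N, D) token matrix."""
    n, d = z.shape
    d_head = d // num_heads

    def split(x):                               # (N, D) -> (H, N, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(z @ p["wq"]), split(z @ p["wk"]), split(z @ p["wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)     # (H, N, N)
    heads = softmax(scores) @ v                             # (H, N, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n, d)         # concatenate heads
    return merged @ p["wo"]


def transformer_block(z, p, num_heads):
    """Attention sublayer then FFN sublayer, each with a residual connection."""
    z = z + multi_head_self_attention(layer_norm(z), p, num_heads)
    hidden = gelu(layer_norm(z) @ p["w1"] + p["b1"])
    return z + hidden @ p["w2"] + p["b2"]


def init_block(d, rng, ff_mult=4, scale=0.02):
    """Random stand-ins for one block's learnable parameters."""
    d_ff = ff_mult * d
    p = {name: rng.standard_normal((d, d)) * scale for name in ("wq", "wk", "wv", "wo")}
    p.update({
        "w1": rng.standard_normal((d, d_ff)) * scale, "b1": np.zeros(d_ff),
        "w2": rng.standard_normal((d_ff, d)) * scale, "b2": np.zeros(d),
    })
    return p


rng = np.random.default_rng(0)
d, heads, depth = 192, 3, 2                     # illustrative sizes
z = rng.standard_normal((64, d))
blocks = [init_block(d, rng) for _ in range(depth)]
out = z
for p in blocks:
    out = transformer_block(out, p, heads)
print(out.shape)
```

In a multi-view setting, one such stack would be run per view, each with its own parameters, so that the five views yield five feature maps of identical shape.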
4. Adaptive Gated Fusion Across Multi-View Features
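To aggregate evidence from the five patch resolutions, MVST employs a gated fusion mechanism. For each view $v$, a "gate" $G^{(v)}$ is computed from that view's encoder output $Z^{(v)}_L$:
$G^{(v)} = \sigma\!\left(Z^{(v)}_L W^{(v)}_g + b^{(v)}_g\right)$
with $\sigma$ denoting the elementwise sigmoid and $W^{(v)}_g$, $b^{(v)}_g$ trainable. Each gate modulates how much to trust view $v$ for each token.
The final multi-view representation is an elementwise fusion,
$Z_{\mathrm{fused}} = \sum_{v=0}^{4} G^{(v)} \odot Z^{(v)}_L$
where $\odot$ is the Hadamard product. $Z_{\mathrm{fused}}$ is then aggregated (via pooling) and input to a shallow MLP for respiratory class prediction.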
This mechanism permits the model to dynamically select the most informative view for every local spectrogram region, facilitating adaptive focus on spectro-temporal structures salient to the task.
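A minimal NumPy sketch of this kind of gated fusion is given below. It assumes that each gate is a sigmoid of a per-view linear map of that view's own features and that the gated features are summed across views; the exact gate parameterization in the paper may differ. The single linear layer in `classify` is a simplified stand-in for the shallow MLP, the four output classes are an assumption based on the ICBHI labels (normal, crackle, wheeze, both), and all names and sizes are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(features, gate_weights, gate_biases):
    """Fuse per-view features with elementwise sigmoid gates.

    features:     list of (N, D) arrays, one per view (same shape for all views)
    gate_weights: list of (D, D) arrays, one per view (assumed gate form)
    gate_biases:  list of (D,) arrays, one per view
    Returns the fused (N, D) array and the list of gates.
    """
    fused = np.zeros_like(features[0])
    gates = []
    for z, w, b in zip(features, gate_weights, gate_biases):
        g = sigmoid(z @ w + b)      # (N, D), every entry in (0, 1)
        gates.append(g)
        fused += g * z              # Hadamard product, accumulated over views
    return fused, gates


def classify(fused, w_out, b_out):
    """Mean-pool tokens, then apply a linear head (stand-in for the shallow MLP)."""
    pooled = fused.mean(axis=0)     # (D,)
    return pooled @ w_out + b_out   # (num_classes,)


rng = np.random.default_rng(0)
n_views, n_tokens, dim, n_classes = 5, 64, 192, 4   # illustrative sizes
features = [rng.standard_normal((n_tokens, dim)) for _ in range(n_views)]
gate_weights = [rng.standard_normal((dim, dim)) * 0.02 for _ in range(n_views)]
gate_biases = [np.zeros(dim) for _ in range(n_views)]

fused, gates = gated_fusion(features, gate_weights, gate_biases)
logits = classify(fused, rng.standard_normal((dim, n_classes)) * 0.02, np.zeros(n_classes))
print(fused.shape, logits.shape)
```

With all gate weights and biases at zero, every gate equals 0.5 and the fusion reduces to half the plain sum of the views, which makes a convenient sanity check when experimenting with the gate.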
5. Experimental Protocol and Benchmark Results
MVST was validated on the ICBHI respiratory-sound corpus (6,898 annotated breathing cycles, recorded at sampling rates from 4 kHz to 44.1 kHz), using the patient-level split of the challenge protocol: 60% for training and 40% for testing. The principal evaluation metrics were specificity (SP), sensitivity (SE), and average score (AS = [SP + SE]/2).
Key performance results:
| Model | Specificity (SP) | Sensitivity (SE) | Average Score (AS) |
|---|---|---|---|
| MVST | 81.99% | 51.10% | 66.55% |
| Previous best (AST + patch-mix CL) | 81.66% | 43.01% | 62.37% |
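Training employed cross-entropy loss and the AdamW optimizer (learning rate and weight decay as reported by He et al., 2023), with 50 epochs and batch size 8. MVST improved the average score by 4.18 percentage points over the previous best method, predominantly through a marked gain in sensitivity (8.09 percentage points), while specificity was nearly unchanged (0.33 percentage points higher).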
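The arithmetic behind these figures can be checked with a short script. It uses only the SP and SE values from the table above; the function and variable names are illustrative.

```python
def average_score(specificity, sensitivity):
    """ICBHI average score: the mean of specificity and sensitivity (in %)."""
    return (specificity + sensitivity) / 2.0


# SP and SE in percent, copied from the results table.
mvst = {"sp": 81.99, "se": 51.10}
baseline = {"sp": 81.66, "se": 43.01}

mvst_as = average_score(mvst["sp"], mvst["se"])
baseline_as = average_score(baseline["sp"], baseline["se"])

se_gain = mvst["se"] - baseline["se"]   # sensitivity gain, percentage points
sp_gain = mvst["sp"] - baseline["sp"]   # specificity gain, percentage points

print(f"MVST AS (recomputed):     {mvst_as:.3f}")
print(f"Baseline AS (recomputed): {baseline_as:.3f}")
print(f"SE gain: {se_gain:.2f} pp, SP gain: {sp_gain:.2f} pp")
```

Recomputing from the rounded table entries gives roughly 66.545 for MVST, consistent with the reported 66.55%. For the baseline it gives roughly 62.335, slightly below the reported 62.37%; small mismatches of this kind can arise when the published SP, SE, and AS are each rounded or averaged separately, so the reported values are the ones to cite.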
6. Role of Multi-View Patching and Gated Fusion Versus Image-Style Modeling
Spectrograms fundamentally differ from natural images: frequency and time axes have distinct physical semantics, and the mel scale is non-uniform, especially at high frequencies. Conventional ViT patching ($16 \times 16$) can be misaligned with the event of interest: either too wide in time (blurring rapid transients) or too narrow in frequency (missing shifts of relevant spectral content).
Multi-view patching forms a granular "pyramid" ensuring that at least one view can capture each acoustic event optimally. Gated fusion adaptively adjudicates which views contribute most per instance and per dimension, rather than weighting all equally. This design permits robust extraction of respiratory phenomena, especially under real-world noise.
Empirically, this approach yields improved detection and discrimination of crackles, wheezes, and their combinations, validating the benefit of embedding spectrogram physics directly into the model inductive bias.
7. Significance in Time–Frequency Audio Modeling
MVST demonstrates the importance of physically-informed architectural choices for audio classification, especially where events of interest manifest at widely differing time–frequency scales. Embedding multiple patch granularities and adaptive fusion into the transformer framework bridges spectrogram physics with powerful attention-based modeling, providing both enhanced theoretical insight and empirically superior results on benchmark respiratory-sound classification (He et al., 2023).