Multi-View Spectrogram Transformers

Updated 22 June 2026

The paper introduces a Multi-View Spectrogram Transformer (MVST) that integrates multi-scale patching and adaptive gated fusion to capture detailed respiratory sound features.
MVST leverages five unique spectrogram views to effectively represent diverse acoustic events like crackles and wheezes through tailored time-frequency resolutions.
Empirical results on the ICBHI dataset demonstrate significant sensitivity gains and an improved average score, outperforming previous methods.

The Multi-View Spectrogram Transformer (MVST) is a deep learning framework for respiratory sound classification that integrates time–frequency acoustic characteristics into a vision transformer architecture via multi-scale spectrogram patching and adaptive feature fusion. In contrast to prior approaches that treat spectrograms as generic images, MVST exploits their physical structure to better capture salient respiratory events such as crackles and wheezes, and it reports state-of-the-art performance on the ICBHI respiratory sound dataset (He et al., 2023).

1. Multi-View Patch Construction of Mel-Spectrograms

MVST operates on mel-spectrograms $S$ with dimensions $H \times W$ (where $H=256$ frequency bins and $W=1024$ time frames). Unlike standard vision transformers (ViT), which apply a single fixed-size square patch (e.g., $16 \times 16$), MVST decomposes $S$ into five distinct tilings ("views") using non-overlapping patches:

View ($\ell$)	Patch Size ($\mu_\ell \times \nu_\ell$)	Physical Resolution
0	$256 \times 1$	256 freq bins (full band), 1 time step
1	$128 \times 2$	128 freq bins, 2 time steps
2	$64 \times 4$	64 freq bins, 4 time steps
3	$32 \times 8$	32 freq bins, 8 time steps
4	$16 \times 16$	16 freq bins, 16 time steps

For all $\ell$, the patch area $\mu_\ell \nu_\ell = 256$ is held constant, so every view tiles the spectrogram into the same number of patches, $HW / 256 = 1024$. This pyramid captures trade-offs between temporal and frequency resolution: $\ell = 0$ captures instantaneous full-band spectra, while $\ell = 4$ captures local time–frequency neighborhoods.
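As a rough illustration of the tiling described above, the following NumPy sketch cuts a stand-in spectrogram into non-overlapping patches for each of the five patch sizes in the table. The function name `patchify`, the row-major flattening order, and the random input are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

# Patch sizes (frequency bins x time frames) for views 0..4, as listed in the table.
VIEW_PATCHES = [(256, 1), (128, 2), (64, 4), (32, 8), (16, 16)]

def patchify(spec, patch_h, patch_w):
    """Split an (H, W) array into non-overlapping patches, one flattened patch per row."""
    H, W = spec.shape
    assert H % patch_h == 0 and W % patch_w == 0, "patch must tile the spectrogram"
    # (H/ph, ph, W/pw, pw) -> (H/ph, W/pw, ph, pw) -> (num_patches, ph * pw)
    grid = spec.reshape(H // patch_h, patch_h, W // patch_w, patch_w)
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch_h * patch_w)

# Random stand-in for a 256 x 1024 mel-spectrogram (hypothetical data).
spec = np.random.default_rng(0).standard_normal((256, 1024))
views = [patchify(spec, ph, pw) for ph, pw in VIEW_PATCHES]

for (ph, pw), v in zip(VIEW_PATCHES, views):
    print(f"patch {ph}x{pw}: {v.shape[0]} patches of length {v.shape[1]}")
```

Because every patch covers 256 elements, each view produces 1024 patches of length 256, which is what later allows the per-view token sequences to be combined position by position.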

The motivation is that physiological events, such as brief crackles or sustained wheezes, differ in their spectro-temporal localization, so a single patch scale is unlikely to suit all event types. Multi-view patching adds robustness to frequency shifts, enables focus at multiple granularities, and more faithfully models the physical characteristics of sounds in the mel domain.

2. Positional Embedding and Patch Tokenization

Each patch in view $\ell$ ($\ell = 0, \dots, 4$) is flattened to a vector of length $\mu_\ell \nu_\ell$ and projected to a $D$-dimensional token via a learnable linear mapping: $z_{\ell,i} = x_{\ell,i} E_\ell$, where $x_{\ell,i} \in \mathbb{R}^{1 \times \mu_\ell \nu_\ell}$ is the $i$-th flattened patch and $E_\ell \in \mathbb{R}^{\mu_\ell \nu_\ell \times D}$.

To preserve the temporal (time interval) and spectral (frequency band) information, MVST introduces a learnable positional embedding $P_\ell \in \mathbb{R}^{N \times D}$ for each view, where $N$ is the number of patches. The input to the first transformer block per view is then: $Z_\ell^{(0)} = [z_{\ell,1}; z_{\ell,2}; \dots; z_{\ell,N}] + P_\ell$. This embedding ensures the model stays aware of the location of each patch in the original spectrogram, both in time and frequency.
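The tokenization step for a single view can be sketched in NumPy as a matrix product followed by an additive positional term. The token width `D = 64`, the helper name `embed_patches`, and the random initialisation are illustrative assumptions; in a real model the projection and positional embedding would be trained parameters.

```python
import numpy as np

def embed_patches(patches, projection, pos_embedding):
    """Project flattened patches to tokens and add a per-position embedding.

    patches:       (N, P) array, one flattened patch per row
    projection:    (P, D) array standing in for the learnable linear mapping
    pos_embedding: (N, D) array standing in for the learnable positional embedding
    """
    return patches @ projection + pos_embedding

rng = np.random.default_rng(0)
N, P, D = 1024, 256, 64                       # D = 64 is an assumed toy width
patches = rng.standard_normal((N, P))         # stand-in for one view's patches
projection = rng.standard_normal((P, D)) * 0.02
pos_embedding = rng.standard_normal((N, D)) * 0.02

tokens = embed_patches(patches, projection, pos_embedding)
print(tokens.shape)  # (1024, 64)
```

Each view would hold its own projection and positional embedding, since the same row index refers to a different time–frequency region in each tiling.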

3. Transformer Encoder Architecture

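For each view $\ell$, the token sequence $Z_\ell^{(0)}$ is passed through $L$ stacked Transformer blocks, each including Layer Normalization (LN), Multi-Head Self-Attention (MSA), and a two-layer Feed-Forward Network (FFN), with residual connections for both sublayers. Formally, for block $k = 1, \dots, L$:

Attention sublayer:

$\tilde{Z}_\ell^{(k)} = \mathrm{MSA}\big(\mathrm{LN}(Z_\ell^{(k-1)})\big) + Z_\ell^{(k-1)}$

FFN sublayer:

$Z_\ell^{(k)} = \mathrm{FFN}\big(\mathrm{LN}(\tilde{Z}_\ell^{(k)})\big) + \tilde{Z}_\ell^{(k)}$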

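MSA uses scaled dot-product attention for each head: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$, where $Q = XW_Q$, $K = XW_K$, $V = XW_V$ are projections of the sublayer input $X$, and $d_k$ is the per-head dimension.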

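The FFN is implemented as: $\mathrm{FFN}(x) = \phi(xW_1 + b_1)W_2 + b_2$, where $\phi$ is the GELU nonlinearity, and typically $W_1 \in \mathbb{R}^{D \times D_{\mathrm{ff}}}$, $W_2 \in \mathbb{R}^{D_{\mathrm{ff}} \times D}$, with $D_{\mathrm{ff}} = 4D$.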

Each view independently produces a feature map $F_\ell \in \mathbb{R}^{N \times D}$ ($N$ tokens of width $D$) after $L$ layers.
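The encoder block described in this section can be sketched in plain NumPy as follows. This is a minimal illustration under stated assumptions: normalization is applied before each sublayer (the ViT convention), LayerNorm omits its learnable scale and shift, GELU uses the common tanh approximation, and all sizes, parameter names, and random weights are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token over its feature axis (learnable scale/shift omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def msa(x, p, heads):
    n, d_model = x.shape
    d_head = d_model // heads
    # Project, then split the feature axis into heads: (heads, n, d_head).
    q, k, v = ((x @ p[name]).reshape(n, heads, d_head).transpose(1, 0, 2)
               for name in ("Wq", "Wk", "Wv"))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (heads, n, n)
    merged = (attn @ v).transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads
    return merged @ p["Wo"]

def ffn(x, p):
    return gelu(x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]

def encoder_block(x, p, heads):
    x = x + msa(layer_norm(x), p, heads)   # attention sublayer with residual
    x = x + ffn(layer_norm(x), p)          # feed-forward sublayer with residual
    return x

def init_params(rng, d_model, d_hidden):
    w = lambda *shape: rng.standard_normal(shape) * 0.02
    return {"Wq": w(d_model, d_model), "Wk": w(d_model, d_model),
            "Wv": w(d_model, d_model), "Wo": w(d_model, d_model),
            "W1": w(d_model, d_hidden), "b1": np.zeros(d_hidden),
            "W2": w(d_hidden, d_model), "b2": np.zeros(d_model)}

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 32))                 # 16 tokens of width 32 (toy sizes)
layers = [init_params(rng, 32, 128) for _ in range(2)]  # two stacked blocks
out = tokens
for p in layers:
    out = encoder_block(out, p, heads=4)
print(out.shape)  # (16, 32)
```

The block preserves the token count and width, so the same stack can be run once per view and the five outputs remain aligned for the fusion step.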

4. Adaptive Gated Fusion Across Multi-View Features

To aggregate evidence from the five patch resolutions, MVST employs a gated fusion mechanism. For each view, a "gate" $G_\ell$ is computed from that view's encoder output $F_\ell$:

$G_\ell = \sigma(F_\ell W_\ell + b_\ell)$

with $\sigma$ denoting the elementwise sigmoid and $W_\ell, b_\ell$ trainable. Each gate modulates how much to trust view $\ell$ for each token.

The final multi-view representation is an elementwise fusion,

$F = \sum_{\ell=0}^{4} G_\ell \odot F_\ell$

where $\odot$ is the Hadamard product. $F$ is then aggregated (via pooling) and input to a shallow MLP for respiratory class prediction.

This mechanism permits the model to dynamically select the most informative view for every local spectrogram region, facilitating adaptive focus on spectro-temporal structures salient to the task.
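A minimal NumPy sketch of such a gated fusion is given below. Several details are assumptions rather than facts from the source: the gate is taken to be a sigmoid of a linear map of each view's own features, the gated views are summed, and mean pooling stands in for the unspecified pooling step. All names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(features, weights, biases):
    """Fuse per-view features of identical shape (n_tokens, d_model).

    weights[i] has shape (d_model, d_model) and biases[i] shape (d_model,);
    both stand in for the trainable gate parameters of view i.
    """
    fused = np.zeros_like(features[0])
    for feat, w, b in zip(features, weights, biases):
        gate = sigmoid(feat @ w + b)   # values in (0, 1), one per token and feature
        fused += gate * feat           # Hadamard product, summed over views (assumed)
    return fused

rng = np.random.default_rng(0)
n_views, n_tokens, d_model = 5, 8, 4   # toy sizes
features = [rng.standard_normal((n_tokens, d_model)) for _ in range(n_views)]
weights = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_views)]
biases = [np.zeros(d_model) for _ in range(n_views)]

fused = gated_fusion(features, weights, biases)
pooled = fused.mean(axis=0)            # mean pooling over tokens (assumed)
print(fused.shape, pooled.shape)  # (8, 4) (4,)
```

Because the gate has the same shape as the features, a view can be emphasised for some tokens and feature dimensions while being suppressed for others.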

5. Experimental Protocol and Benchmark Results

MVST was validated on the ICBHI respiratory-sound corpus (6,898 annotated breathing cycles, sampled at 4–44.1 kHz), using a patient-level split: 60% for training, 40% for testing per challenge protocol. The principal evaluation metrics were specificity (SP), sensitivity (SE), and average score (AS = [SP + SE]/2).
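The three metrics can be computed from per-cycle labels as in the sketch below. It assumes the usual ICBHI convention that specificity is the fraction of normal cycles predicted as normal and sensitivity is the fraction of abnormal cycles assigned their exact class; the label strings and the function name `icbhi_scores` are illustrative.

```python
def icbhi_scores(y_true, y_pred, normal="normal"):
    """Return (specificity, sensitivity, average score) as fractions in [0, 1]."""
    pairs = list(zip(y_true, y_pred))
    n_normal = sum(t == normal for t, _ in pairs)
    n_abnormal = len(pairs) - n_normal
    # A normal cycle counts as correct when it is predicted normal.
    correct_normal = sum(t == normal and p == normal for t, p in pairs)
    # An abnormal cycle counts as correct only when its exact class is predicted.
    correct_abnormal = sum(t != normal and p == t for t, p in pairs)
    sp = correct_normal / n_normal
    se = correct_abnormal / n_abnormal
    return sp, se, (sp + se) / 2

# Hypothetical labels for six breathing cycles.
y_true = ["normal", "normal", "crackle", "wheeze", "both", "normal"]
y_pred = ["normal", "crackle", "crackle", "normal", "wheeze", "normal"]

sp, se, avg = icbhi_scores(y_true, y_pred)
print(f"SP={sp:.3f} SE={se:.3f} AS={avg:.3f}")
```

In this toy case two of three normal cycles and one of three abnormal cycles are correct, so the average score sits halfway between the two rates.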

Key performance results:

Model	Specificity (SP)	Sensitivity (SE)	Average Score (AS)
MVST	81.99%	51.10%	66.55%
Previous best (AST + patch-mix CL)	81.66%	43.01%	62.37%

Training employed cross-entropy loss and the AdamW optimizer (learning rate and weight decay as specified in He et al., 2023), with 50 epochs and batch size 8. MVST improved the average score by 4.18 percentage points over the previous best, predominantly via a marked sensitivity gain of 8.09 percentage points.

6. Role of Multi-View Patching and Gated Fusion Versus Image-Style Modeling

Spectrograms fundamentally differ from natural images: frequency and time axes have distinct physical semantics, and the mel scale is non-uniform, especially at high frequencies. Conventional ViT patching ($16 \times 16$) can be misaligned: either too wide in time (blurring rapid transients) or too narrow in frequency (missing shifts of relevant spectral content).

Multi-view patching forms a "pyramid" of granularities so that at least one view is well matched to each acoustic event. Gated fusion adaptively decides which views contribute most per instance and per dimension, rather than weighting all equally. This design supports robust extraction of respiratory phenomena, especially under real-world noise.

Empirically, this approach yields improved detection and discrimination of crackles, wheezes, and their combinations, validating the benefit of embedding spectrogram physics directly into the model inductive bias.

7. Significance in Time–Frequency Audio Modeling

MVST demonstrates the importance of physically-informed architectural choices for audio classification, especially where events of interest manifest at widely differing time–frequency scales. Embedding multiple patch granularities and adaptive fusion into the transformer framework bridges spectrogram physics with powerful attention-based modeling, providing both enhanced theoretical insight and empirically superior results on benchmark respiratory-sound classification (He et al., 2023).


References (1)

Multi-View Spectrogram Transformer for Respiratory Sound Classification (2023)
