CogViT: Respiratory Sound Classifier

Updated 6 May 2026

CogViT is an automated respiratory sound classifier that combines a 64-channel gammatone-based cochleogram with a six-layer Vision Transformer.
The cochleogram's time–frequency representation and the transformer's global self-attention allow it to outperform conventional CNNs on binary and multiclass tasks.
Using patient-wise 10-fold cross-validation on the ICBHI dataset, CogViT demonstrates statistically significant improvements in detection accuracy.

CogViT is an automated adventitious respiratory sound classification model integrating biologically inspired cochleogram front-end representations with a Vision Transformer (ViT) back-end architecture. The design demonstrates state-of-the-art performance on the ICBHI respiratory sound dataset, outperforming established convolutional neural network (CNN) baselines and alternative time–frequency (TF) representations. CogViT’s methodology is grounded in the use of a 64-channel gammatone-based cochleogram and an unmodified light ViT with six transformer encoder layers, yielding statistically significant performance improvements for both binary and multiclass respiratory sound classification tasks (Mang et al., 2024).

1. Cochleogram Front-End Signal Representation

CogViT replaces conventional linear-scale short-time Fourier transform (STFT) and mel-cepstrum representations with a cochleogram, constructed using a 64-channel gammatone filter bank to approximate the human cochlea’s frequency analysis.

Each filter’s impulse response is modeled as:

$g(t) = t^{o-1} \cdot e^{-2\pi b(f_c) t} \cdot \cos(2\pi f_c t), \quad t > 0$

where $b(f_c) = 1.019 \cdot \text{ERB}(f_c)$, $\text{ERB}(f_c) = 24.7 \cdot (4.37 \cdot f_c / 1000 + 1)$, and the filter order is $o = 4$. The center frequencies $f_c(k)$ are spaced uniformly on the ERB scale from 100 Hz to 8 kHz.
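
The filter-bank definition above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the 50 ms impulse-response length, and the 16 kHz sampling rate in the usage lines are assumptions made for the example. Uniform ERB spacing is obtained by spacing uniformly in $\ln(1 + 4.37 f / 1000)$, which is proportional to the integral of $1/\text{ERB}(f)$.

```python
import math

def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz, as given in the text."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def erb_spaced_center_freqs(n_channels=64, f_lo=100.0, f_hi=8000.0):
    """Center frequencies spaced uniformly on the ERB-rate axis."""
    # The ERB-rate axis is proportional to ln(1 + 4.37 f / 1000).
    def warp(f):
        return math.log(1.0 + 4.37 * f / 1000.0)

    def unwarp(e):
        return (math.exp(e) - 1.0) * 1000.0 / 4.37

    lo, hi = warp(f_lo), warp(f_hi)
    step = (hi - lo) / (n_channels - 1)
    return [unwarp(lo + k * step) for k in range(n_channels)]

def gammatone_ir(fc_hz, fs_hz, duration_s=0.05, order=4):
    """Sampled g(t) = t^(o-1) * exp(-2 pi b t) * cos(2 pi fc t)."""
    b = 1.019 * erb(fc_hz)
    n_samples = round(duration_s * fs_hz)
    ir = []
    for i in range(n_samples):
        t = i / fs_hz
        ir.append(t ** (order - 1)
                  * math.exp(-2.0 * math.pi * b * t)
                  * math.cos(2.0 * math.pi * fc_hz * t))
    return ir

fcs = erb_spaced_center_freqs()          # 64 channels, 100 Hz to 8 kHz
ir = gammatone_ir(fcs[0], fs_hz=16000)   # fs_hz is an illustrative assumption
```

The impulse response is left unnormalized here; practical filter banks usually apply a per-channel gain, which the text does not specify.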

The envelope energy in each channel is calculated over sliding windows of length $N$ (hop $J$ samples), forming the cochleogram matrix:

$C(k, m) = \sum_{n=0}^{N-1} |y_k(n + mJ)| \cdot w(n)$

with $y_k$ as the $k^\text{th}$ filter output and $w(n)$ a window function (e.g., Hann). Standard practice suggests a 50% overlap, though the specific overlap value is not detailed.
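
The framing step can be sketched as follows. The helper names and the window and hop sizes in the usage lines are assumptions for illustration; only the formula for $C(k, m)$ comes from the text.

```python
import math

def hann(n_len):
    """Symmetric Hann window of length n_len (n_len >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (n_len - 1))
            for n in range(n_len)]

def cochleogram_row(y_k, win_len, hop):
    """One row of the cochleogram: C(k, m) = sum_n |y_k(n + m*hop)| * w(n)."""
    w = hann(win_len)
    n_frames = max(0, 1 + (len(y_k) - win_len) // hop)
    return [sum(abs(y_k[n + m * hop]) * w[n] for n in range(win_len))
            for m in range(n_frames)]

def cochleogram(filter_outputs, win_len, hop):
    """Stack one row per gammatone channel into a channels x frames matrix."""
    return [cochleogram_row(y_k, win_len, hop) for y_k in filter_outputs]

# Toy input: two constant "channels"; hop = win_len // 2 gives 50% overlap.
toy = [[1.0] * 16, [-2.0] * 16]
C = cochleogram(toy, win_len=8, hop=4)
```

With a constant unit-magnitude input, each entry equals the sum of the window, which for a symmetric Hann window of length 8 is 3.5.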

2. Vision Transformer Architecture

Patch Embedding: The cochleogram (64 frequency channels along one axis, time frames along the other) is partitioned into fixed-size patches, which are flattened and linearly projected into fixed-dimensional embeddings. The exact cochleogram, patch, and embedding sizes are not legible in this copy.
Positional Encoding: A learned 1D positional embedding (length equal to the number of patches plus one) is added, with a [CLS] token prepended for classification.
Encoder Stack: CogViT employs six transformer encoder layers, each with multi-head self-attention, a two-layer feedforward network, residual connections, LayerNorm, and dropout after both the attention and MLP sublayers. The head count, head dimension, feedforward hidden size, and dropout rate are likewise not legible in this copy.
Classification Head: The final [CLS] hidden state passes through a single-layer MLP that maps the embedding to $C$ outputs (where $C = 2$ for binary or $C = 4$ for multiclass), followed by a softmax function.
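
The patch partitioning and the resulting sequence length can be illustrated with a small sketch. The 64 x 96 input and the 16 x 16 patch size below are hypothetical values chosen for the example, not figures from the source, and `patchify` is an illustrative helper name.

```python
def patchify(image, patch_h, patch_w):
    """Split a 2D list (rows x cols) into flattened, non-overlapping patches."""
    rows, cols = len(image), len(image[0])
    if rows % patch_h or cols % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for r0 in range(0, rows, patch_h):          # top-left corner of each patch
        for c0 in range(0, cols, patch_w):
            patches.append([image[r][c]
                            for r in range(r0, r0 + patch_h)
                            for c in range(c0, c0 + patch_w)])
    return patches

# Hypothetical cochleogram: 64 channels x 96 frames, filled with index values.
coch = [[float(k * 96 + m) for m in range(96)] for k in range(64)]
patches = patchify(coch, 16, 16)

# The encoder sees one token per patch plus the prepended [CLS] token,
# which is also the length of the learned positional embedding.
seq_len = len(patches) + 1
```

Each flattened patch would then be multiplied by a learned projection matrix to obtain its embedding; that step is omitted here because the embedding dimension is not available.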

No modifications are introduced to the transformer block sizes or hyperparameters relative to the original ViT configuration. Hyperparameter ablation for layer depth, number of attention heads, or model dimension is not reported.

3. Training Regimen and Data Preprocessing

Training utilizes the ICBHI 2017 respiratory sound database: 920 audio recordings from 126 patients. All audio is downsampled to 4 kHz and segmented into fixed-duration 6 s windows, zero-padded as needed.
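
A minimal sketch of the segmentation step, assuming non-overlapping windows and padding at the end of the final segment (the text specifies neither); `segment_fixed` is a hypothetical helper name. At 4 kHz, a 6 s window is 24,000 samples.

```python
def segment_fixed(samples, fs_hz=4000, window_s=6.0):
    """Cut a signal into fixed-length windows, zero-padding the last one."""
    win = int(round(fs_hz * window_s))   # 24000 samples at 4 kHz and 6 s
    segments = []
    for start in range(0, len(samples), win):
        chunk = list(samples[start:start + win])
        chunk.extend([0.0] * (win - len(chunk)))   # pad only if short
        segments.append(chunk)
    return segments

# A 7.5 s toy signal of ones yields one full window and one padded window.
segs = segment_fixed([1.0] * 30000)
```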

Cross-validation: Patient-wise 10-fold cross-validation is employed (i.e., each fold forms a test set of unseen patients).
Optimizer and Hyperparameters: Adam optimizer with a fixed learning rate, batch size, and number of training epochs per fold (the numeric values are not legible in this copy). No on-the-fly data augmentation is applied. Early stopping is not described.
Loss Function: Cross-entropy over the target classes, with dropout in the transformer blocks as the only explicit regularization.
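
Patient-wise splitting, as described in the cross-validation item above, means that all recordings of a given patient fall into the same fold. The sketch below shows one way to do this; the round-robin assignment, the seed, and the synthetic patient IDs are assumptions for illustration, and only the counts (920 recordings, 126 patients, 10 folds) come from the text.

```python
import random

def patient_wise_folds(patient_ids, n_folds=10, seed=0):
    """Return per-fold lists of recording indices, grouped by patient."""
    unique = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Round-robin assignment of shuffled patients to folds.
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique)}
    folds = [[] for _ in range(n_folds)]
    for idx, pid in enumerate(patient_ids):
        folds[fold_of[pid]].append(idx)
    return folds

# Synthetic stand-in for the dataset: one patient ID per recording.
patient_ids = [i % 126 for i in range(920)]
folds = patient_wise_folds(patient_ids)

# For fold 0, the test patients never appear among the training patients.
test_patients = {patient_ids[i] for i in folds[0]}
train_patients = {patient_ids[i] for f in folds[1:] for i in f}
```

Splitting by recording instead of by patient would let segments from the same person appear in both training and test sets, which tends to inflate reported accuracy.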

4. Empirical Results and Comparative Analysis

CogViT is systematically evaluated in both two-class (wheezes-vs-others, crackles-vs-others) and four-class tasks (normal, wheeze, crackle, wheeze+crackle). All metrics are averaged across the 10 validation folds.

Binary Classification – Wheeze vs. Normal:

TF representation	Accuracy (%)	Sensitivity (%)	Specificity (%)	Score (%)	Precision (%)
STFT	82.1	65.1	70.1	67.6	44.8
MFCC	79.9	62.8	68.9	65.8	39.5
CQT	78.8	61.7	68.3	65.0	37.6
Cochleogram	85.9	76.0	91.0	83.5	57.6

Crackle Detection: Cochleogram+ViT achieves 75.5% accuracy, sensitivity 65.2%, specificity 80.2%, score 72.7%, and precision 57.6%.
Four-Class Task: Accuracy 67.99%, sensitivity 56.6%, specificity 71.3%, score 64.0%, precision 50.2%.
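
The Score values reported above are consistent with the usual ICBHI convention of averaging sensitivity and specificity. The short check below recomputes the score from the sensitivity and specificity figures quoted in this section; `icbhi_score` is an illustrative helper name.

```python
def icbhi_score(sensitivity, specificity):
    """ICBHI-style score: the mean of sensitivity and specificity (in %)."""
    return (sensitivity + specificity) / 2.0

# (sensitivity, specificity) pairs taken from the results in this section.
reported = {
    "STFT": (65.1, 70.1),
    "MFCC": (62.8, 68.9),
    "CQT": (61.7, 68.3),
    "Cochleogram": (76.0, 91.0),
    "Crackle detection": (65.2, 80.2),
    "Four-class": (56.6, 71.3),
}

scores = {name: icbhi_score(se, sp) for name, (se, sp) in reported.items()}
```

Each recomputed value agrees with the reported score to within rounding at one decimal place.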

CogViT outperforms CNN baselines (BaselineCNN, AlexNet, VGG16, ResNet50) on all evaluated TF representations (STFT, MFCC, CQT, cochleogram), achieving performance gains of 2–8 percentage points in accuracy.

Statistical significance is established via Mann–Whitney U and Wilcoxon signed-rank tests, with all comparisons to baselines reported as significant (the significance level and p-value threshold are not legible in this copy).
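
To make the first of these tests concrete, the sketch below computes the Mann–Whitney U statistic by direct pair counting. The two samples are made-up numbers for illustration only, not results from the study, and the function name is hypothetical. In practice one would more likely call `scipy.stats.mannwhitneyu` or `scipy.stats.wilcoxon`, which also return p-values.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: count pairs with x_i > y_j, ties count 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Made-up samples, purely to exercise the function.
a = [3, 4, 5]
b = [1, 2, 3]
u_a = mann_whitney_u(a, b)
u_b = mann_whitney_u(b, a)
```

The two statistics always sum to the number of pairs, `len(a) * len(b)`; a value of U close to that maximum indicates that one sample tends to exceed the other.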

5. Ablation and Sensitivity Analyses

Time–frequency representation ablation: For ViT, replacement of STFT with cochleogram confers an improvement of +4.1 percentage points in wheeze detection accuracy and +2.3 percentage points in crackle detection. MFCC and CQT representations underperform STFT by approximately 2–3 percentage points and are 6–10 points behind cochleogram.
Transformer architecture ablation: No variations to layer depth, number of attention heads, or model dimension are explored; all experiments retain the values of the initial design.

6. Significance, Limitations, and Implications

CogViT demonstrates the strongest empirical performance for adventitious sound classification on the ICBHI corpus when compared to standard CNNs and alternative TF representations. The cochleogram appears to be the critical factor: its biologically informed filtering yields a more separable representation of adventitious sound events. The transformer-based back-end, with global self-attention, provides additional advantages over CNNs with local receptive fields when parsing this specific TF structure.

Performance advantages are statistically significant, as evidenced by formal nonparametric testing. Apart from dropout in the transformer blocks, no regularization strategies, data augmentation, or early stopping protocols are incorporated, which suggests that further gains may be achievable with such enhancements.

A plausible implication is that incorporating cochlea-inspired TF representations with transformer architectures could generalize to other medical audio or event detection domains where spectral sparsity and temporal structure are informative.

7. Context within Respiratory Sound Analysis

CogViT represents the first application of cochleogram-ViT integration to adventitious sound recognition, extending prior work that relied primarily on CNN classifiers and mainstream TF representations such as STFT, MFCC, or CQT. Its utility is reinforced by statistically significant gains across the binary and four-class tasks, suggesting a new methodological baseline for future studies in respiratory-sound-based disease detection (Mang et al., 2024).

References

Mang et al. (2024). Classification of Adventitious Sounds Combining Cochleogram and Vision Transformers.
