
CogViT: Respiratory Sound Classifier

Updated 6 May 2026
  • CogViT is an automated respiratory sound classifier that combines a 64-channel gammatone-based cochleogram with a six-layer Vision Transformer.
  • It pairs this cochlea-inspired time–frequency representation with global self-attention to outperform CNN baselines on binary and multiclass tasks.
  • Using patient-wise 10-fold cross-validation on the ICBHI dataset, CogViT demonstrates statistically significant improvements in detection accuracy.

CogViT is an automated adventitious respiratory sound classification model integrating biologically inspired cochleogram front-end representations with a Vision Transformer (ViT) back-end architecture. The design demonstrates state-of-the-art performance on the ICBHI respiratory sound dataset, outperforming established convolutional neural network (CNN) baselines and alternative time–frequency (TF) representations. CogViT’s methodology is grounded in the use of a 64-channel gammatone-based cochleogram and an unmodified light ViT with six transformer encoder layers, yielding statistically significant performance improvements for both binary and multiclass respiratory sound classification tasks (Mang et al., 2024).

1. Cochleogram Front-End Signal Representation

CogViT replaces conventional linear-scale short-time Fourier transform (STFT) and mel-cepstrum representations with a cochleogram, constructed using a 64-channel gammatone filter bank to approximate the human cochlea’s frequency analysis.

Each filter’s impulse response is modeled as:

$$g(t) = t^{o-1} \cdot e^{-2\pi b(f_c) t} \cdot \cos(2\pi f_c t), \quad t > 0$$

where $b(f_c) = 1.019 \cdot \text{ERB}(f_c)$, $\text{ERB}(f_c) = 24.7 \cdot (4.37 \cdot f_c / 1000 + 1)$, and the filter order is $o = 4$. The center frequencies $f_c(k)$ are spaced uniformly on the ERB scale from 100 Hz to 8 kHz.
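
As a minimal sketch of the filter bank described above, the following NumPy code samples the impulse response and places 64 center frequencies uniformly on the ERB-number scale. The conversion $21.4 \log_{10}(4.37 f / 1000 + 1)$, the sampling rate `fs`, the 50 ms duration, and the peak normalization are assumptions made for illustration; the text does not specify them, and the function names are hypothetical.

```python
import numpy as np

def erb(fc):
    """ERB(f_c) in Hz, as defined in the text."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def erb_center_frequencies(f_lo=100.0, f_hi=8000.0, n_channels=64):
    """Center frequencies spaced uniformly on the ERB-number scale.

    Assumption: ERB-number conversion 21.4 * log10(4.37 f / 1000 + 1).
    """
    to_erb = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    from_erb = lambda e: (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    return from_erb(np.linspace(to_erb(f_lo), to_erb(f_hi), n_channels))

def gammatone_impulse_response(fc, fs, duration=0.05, order=4):
    """Sampled g(t) = t^(o-1) * exp(-2 pi b t) * cos(2 pi fc t) for t > 0."""
    t = np.arange(1, int(duration * fs) + 1) / fs  # strictly positive times
    b = 1.019 * erb(fc)
    g = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)
    return g / np.max(np.abs(g))  # peak normalization (assumption)

fcs = erb_center_frequencies()                         # 64 values, 100 Hz .. 8 kHz
g = gammatone_impulse_response(fc=1000.0, fs=16000.0)  # fs is a hypothetical rate
```

In this sketch `fs` has to exceed twice the center frequency being sampled, otherwise the cosine carrier aliases.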

The envelope energy in each channel is calculated over sliding windows of length NN (hop JJ samples), forming the cochleogram matrix:

$$C(k, m) = \sum_{n=0}^{N-1} |y_k(n + mJ)| \cdot w(n)$$

with $y_k$ as the $k^\text{th}$ filter output and $w(n)$ a window function (e.g., Hann). Standard practice suggests a 50% overlap, though the specific overlap value is not detailed.
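
The framing sum above can be sketched directly in NumPy. This is an illustrative implementation, not the authors' code: the function name, the choice to keep only complete frames, and the default Hann window are assumptions.

```python
import numpy as np

def cochleogram(y, N, J, window=None):
    """Compute C(k, m) = sum_n |y_k(n + mJ)| * w(n).

    y      : array of shape (K, T), one row per gammatone channel
    N, J   : window length and hop size in samples
    window : length-N weights; defaults to a Hann window (assumption)
    """
    y = np.asarray(y, dtype=float)
    K, T = y.shape
    if T < N:
        raise ValueError("signal is shorter than one analysis window")
    w = np.hanning(N) if window is None else np.asarray(window, dtype=float)
    M = 1 + (T - N) // J                          # number of complete frames
    C = np.empty((K, M))
    for m in range(M):
        frame = np.abs(y[:, m * J : m * J + N])   # rectified samples, (K, N)
        C[:, m] = frame @ w                       # weighted sum over n
    return C

# Toy input: two constant-magnitude channels, rectangular window, 50% overlap.
y_toy = np.array([[1.0] * 10, [-2.0] * 10])
C_toy = cochleogram(y_toy, N=4, J=2, window=np.ones(4))
```

With the toy input, every frame of the first channel sums four samples of magnitude 1 and every frame of the second sums four samples of magnitude 2, which makes the rectification and hop arithmetic easy to check by hand.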

2. Vision Transformer Architecture

  • Patch Embedding: The cochleogram (64 frequency channels by the number of time frames) is partitioned into fixed-size patches, each of which is flattened and linearly projected into a fixed-dimensional embedding.
  • Positional Encoding: A learned 1D positional embedding (length equal to the number of patches plus one) is added, with a [CLS] token prepended for classification.
  • Encoder Stack: CogViT employs six transformer encoder layers, each with multi-head self-attention, a two-layer feedforward network, residual connections, LayerNorm, and dropout after both the attention and MLP sublayers.
  • Classification Head: The final [CLS] hidden state passes through a single-layer MLP that maps the embedding to the class logits (two outputs for the binary tasks, four for the multiclass task), followed by a softmax function.

No modifications are introduced to the transformer block sizes or hyperparameters relative to the original ViT configuration. Hyperparameter ablation for layer depth, number of attention heads, or model dimension is not reported. The exact patch size, embedding dimension, head count, feedforward hidden size, and dropout rate are not legible in this copy of the text.
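
The patch-embedding and [CLS] steps of this section can be sketched in NumPy as follows. All sizes here (32 time frames, 16×16 patches, embedding dimension 8) are hypothetical placeholders chosen to keep the example small, and the projection matrix, [CLS] token, and positional embedding would be learned parameters in a real model rather than the random and zero arrays used below.

```python
import numpy as np

def patchify(C, ph, pw):
    """Split a (K, M) cochleogram into non-overlapping ph x pw patches.

    Returns an array of shape (P, ph * pw) with patches in row-major order.
    """
    K, M = C.shape
    if K % ph or M % pw:
        raise ValueError("patch size must divide the cochleogram dimensions")
    blocks = C.reshape(K // ph, ph, M // pw, pw).transpose(0, 2, 1, 3)
    return blocks.reshape(-1, ph * pw)

def embed(C, ph, pw, W, cls_token, pos_embedding):
    """Linear patch projection, [CLS] prepended, positional embedding added."""
    tokens = patchify(C, ph, pw) @ W            # (P, D)
    tokens = np.vstack([cls_token, tokens])     # (P + 1, D)
    return tokens + pos_embedding

rng = np.random.default_rng(0)
K, M, ph, pw, D = 64, 32, 16, 16, 8             # hypothetical sizes
C = np.arange(K * M, dtype=float).reshape(K, M) # stand-in cochleogram
P = (K // ph) * (M // pw)                       # number of patches
W = rng.normal(size=(ph * pw, D))               # placeholder for learned weights
cls_token = np.zeros((1, D))                    # placeholder for learned token
pos_embedding = np.zeros((P + 1, D))            # placeholder for learned positions
tokens = embed(C, ph, pw, W, cls_token, pos_embedding)
```

The resulting sequence has one more row than there are patches, matching the positional-embedding length stated above; the first row is the [CLS] position whose final hidden state feeds the classification head.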

3. Training Regimen and Data Preprocessing

Training utilizes the ICBHI 2017 respiratory sound database: 920 audio recordings from 126 patients. All audio is downsampled to 4 kHz and segmented into fixed-duration 6 s windows, zero-padded as needed.
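
A minimal sketch of the fixed-length segmentation, assuming the 4 kHz rate and 6 s windows stated above. How the source handles audio longer than one window is not described; cutting it into consecutive windows and zero-padding the last one is an assumption, and the function name is hypothetical.

```python
import numpy as np

def segment_fixed(x, fs=4000, seconds=6.0):
    """Cut a 1-D signal into consecutive windows of fs * seconds samples.

    The final (or only) window is zero-padded to full length.
    """
    x = np.asarray(x, dtype=float)
    L = int(round(fs * seconds))          # 24000 samples at 4 kHz
    n_seg = max(1, -(-len(x) // L))       # ceiling division, at least one window
    padded = np.zeros(n_seg * L)
    padded[: len(x)] = x
    return padded.reshape(n_seg, L)

# A 7.5 s dummy signal at 4 kHz yields two windows, the second one padded.
segments = segment_fixed(np.ones(30000))
```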

  • Cross-validation: Patient-wise 10-fold cross-validation is employed (i.e., each fold forms a test set of unseen patients).
  • Optimizer and Hyperparameters: Adam optimizer with a fixed learning rate, a fixed batch size, and a fixed number of training epochs per fold (the numeric values are not legible in this copy of the text). No on-the-fly data augmentation is applied. Early stopping is not described.
  • Loss Function: Cross-entropy over the target classes, with dropout in the transformer blocks as the only explicit regularization.
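
The patient-wise split in the list above can be illustrated with a short standard-library sketch. Patients, not recordings, are assigned to folds, so no patient appears in both the training and test sets of any fold. The round-robin assignment, the seed, and the synthetic patient IDs are assumptions for illustration; the source does not describe how patients were distributed across folds.

```python
import random

def patient_wise_folds(patient_ids, n_folds=10, seed=0):
    """Return n_folds lists of recording indices with disjoint patients.

    patient_ids[i] is the patient that recording i belongs to.
    """
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    # Round-robin assignment of shuffled patients to folds (assumption).
    fold_of = {p: i % n_folds for i, p in enumerate(patients)}
    folds = [[] for _ in range(n_folds)]
    for idx, p in enumerate(patient_ids):
        folds[fold_of[p]].append(idx)
    return folds

# Synthetic IDs with the dataset's overall shape: 920 recordings, 126 patients.
ids = [f"p{i % 126}" for i in range(920)]
folds = patient_wise_folds(ids)
test_idx = folds[0]                                   # held-out patients
train_idx = [i for f in folds[1:] for i in f]         # remaining nine folds
```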

4. Empirical Results and Comparative Analysis

CogViT is systematically evaluated in both two-class (wheezes-vs-others, crackles-vs-others) and four-class tasks (normal, wheeze, crackle, wheeze+crackle). All metrics are averaged across the 10 validation folds.

Binary Classification – Wheeze vs. Normal (all values in %):

TF representation   Acc    Sens   Spec   Score  Prec
STFT                82.1   65.1   70.1   67.6   44.8
MFCC                79.9   62.8   68.9   65.8   39.5
CQT                 78.8   61.7   68.3   65.0   37.6
Cochleogram         85.9   76.0   91.0   83.5   57.6

  • Crackle Detection: Cochleogram+ViT achieves 75.5% accuracy, sensitivity 65.2%, specificity 80.2%, score 72.7%, and precision 57.6%.
  • Four-Class Task: Accuracy 67.99%, sensitivity 56.6%, specificity 71.3%, score 64.0%, precision 50.2%.
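
The Score values reported above are consistent with the arithmetic mean of sensitivity and specificity, which is the usual ICBHI challenge score. The text does not state this definition explicitly, so the check below treats it as an assumption; the function name is hypothetical, and the numbers are copied from the results in this section.

```python
def icbhi_score(sensitivity, specificity):
    """Mean of sensitivity and specificity, both in percent (assumed definition)."""
    return (sensitivity + specificity) / 2.0

# (sensitivity, specificity, reported score) as listed in this section.
reported = {
    "STFT": (65.1, 70.1, 67.6),
    "MFCC": (62.8, 68.9, 65.8),
    "CQT": (61.7, 68.3, 65.0),
    "Cochleogram": (76.0, 91.0, 83.5),
    "Crackle detection": (65.2, 80.2, 72.7),
    "Four-class": (56.6, 71.3, 64.0),
}

# Largest gap between the recomputed and reported score, in percentage points.
max_gap = max(abs(icbhi_score(se, sp) - sc) for se, sp, sc in reported.values())
```

Under this definition every reported score is reproduced to within rounding of the one-decimal figures.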

CogViT outperforms CNN baselines (BaselineCNN, AlexNet, VGG16, ResNet50) on all evaluated TF representations (STFT, MFCC, CQT, cochleogram), achieving performance gains of 2–8 percentage points in accuracy.

Statistical significance is established via Mann–Whitney U and Wilcoxon signed-rank tests at a fixed significance level, with all comparisons to baselines yielding p-values below that level.
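
For readers unfamiliar with these tests, the sketch below shows how per-fold scores of two models could be compared with SciPy. The accuracy lists are made-up placeholders, not values from the paper, and the significance level of 0.05 is an assumption. The Wilcoxon signed-rank test treats the folds as paired, while the Mann–Whitney U test treats the two sets of scores as independent samples.

```python
from scipy.stats import mannwhitneyu, wilcoxon

# Placeholder per-fold accuracies for two models (NOT results from the paper).
model_a = [0.86, 0.84, 0.88, 0.85, 0.87, 0.83, 0.86, 0.89, 0.85, 0.84]
model_b = [0.85, 0.82, 0.85, 0.81, 0.82, 0.77, 0.79, 0.81, 0.76, 0.74]

alpha = 0.05  # assumed significance level

# Paired comparison: the same fold is evaluated by both models.
paired = wilcoxon(model_a, model_b)

# Unpaired comparison of the two score distributions.
unpaired = mannwhitneyu(model_a, model_b, alternative="two-sided")

significant = paired.pvalue < alpha and unpaired.pvalue < alpha
```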

5. Ablation and Sensitivity Analyses

  • Time–frequency representation ablation: For ViT, replacement of STFT with cochleogram confers an improvement of +4.1 percentage points in wheeze detection accuracy and +2.3 percentage points in crackle detection. MFCC and CQT representations underperform STFT by approximately 2–3 percentage points and are 6–10 points behind cochleogram.
  • Transformer architecture ablation: No variations to layer depth, number of attention heads, or model dimension are explored; all experiments retain the values of the initial design, including the six encoder layers.

6. Significance, Limitations, and Implications

CogViT demonstrates the strongest empirical performance for adventitious sound classification on the ICBHI corpus when compared to standard CNNs and alternative TF representations. The use of a cochleogram serves as a critical factor, leveraging the biologically informed filtering for more separable representation of adventitious sound events. The transformer-based backend with global self-attention provides additional advantages over local receptive field CNNs when parsing this specific TF structure.

Performance advantages are statistically significant as evidenced by formal nonparametric testing. Apart from dropout, no regularization strategies, data augmentation, or early stopping protocols are reported, which suggests that further gains may be achievable with such enhancements.

A plausible implication is that incorporating cochlea-inspired TF representations with transformer architectures could generalize to other medical audio or event detection domains where spectral sparsity and temporal structure are informative.

7. Context within Respiratory Sound Analysis

CogViT represents the inaugural application of cochleogram-ViT integration to adventitious sound recognition, extending prior work that relied primarily on CNN classifiers and mainstream TF representations such as STFT, MFCC, or CQT. Its utility is reinforced by the statistically robust outperformance across diverse class definitions, suggesting a new methodological baseline for future studies in respiratory-sound-based disease detection (Mang et al., 2024).
