CogViT: Respiratory Sound Classifier
- CogViT is an automated respiratory sound classifier that combines a 64-channel gammatone-based cochleogram with a six-layer Vision Transformer.
- It pairs a cochlea-inspired time–frequency representation with global self-attention and outperforms CNN baselines on binary and multiclass tasks.
- Using patient-wise 10-fold cross-validation on the ICBHI dataset, CogViT demonstrates statistically significant improvements in detection accuracy.
CogViT is an automated adventitious respiratory sound classification model integrating biologically inspired cochleogram front-end representations with a Vision Transformer (ViT) back-end architecture. The design demonstrates state-of-the-art performance on the ICBHI respiratory sound dataset, outperforming established convolutional neural network (CNN) baselines and alternative time–frequency (TF) representations. CogViT’s methodology is grounded in the use of a 64-channel gammatone-based cochleogram and an unmodified light ViT with six transformer encoder layers, yielding statistically significant performance improvements for both binary and multiclass respiratory sound classification tasks (Mang et al., 2024).
1. Cochleogram Front-End Signal Representation
CogViT replaces conventional linear-scale short-time Fourier transform (STFT) and mel-cepstrum representations with a cochleogram, constructed using a 64-channel gammatone filter bank to approximate the human cochlea’s frequency analysis.
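Each filter’s impulse response is modeled as:

$$g(t) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f_c t + \phi), \quad t \ge 0,$$

where $f_c$ is the channel's center frequency, $b$ its bandwidth parameter, $a$ a gain, $\phi$ the phase, and $n$ the filter order. The center frequencies are spaced uniformly on the ERB scale from 100 Hz to 8 kHz.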
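A gammatone filter bank of this kind can be sketched in NumPy. The snippet assumes the usual gammatone form $t^{n-1} e^{-2\pi b t} \cos(2\pi f_c t)$, the Glasberg and Moore ERB approximation, a filter order of 4, and $b = 1.019\,\mathrm{ERB}(f_c)$. These are common defaults rather than values confirmed by the source, and the function names and the sample rate `FS` are hypothetical.

```python
import numpy as np

def erb(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_space(f_low, f_high, n_channels):
    """Center frequencies spaced uniformly on the ERB-rate scale."""
    def hz_to_erb_rate(f):
        return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

    def erb_rate_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    e = np.linspace(hz_to_erb_rate(f_low), hz_to_erb_rate(f_high), n_channels)
    return erb_rate_to_hz(e)

def gammatone_ir(fc, fs, order=4, duration=0.05, phase=0.0):
    """Sampled gammatone impulse response, peak-normalised to 1."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)  # assumed bandwidth rule, not taken from the source
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)
    return g / np.max(np.abs(g))

# 64 channels between 100 Hz and 8 kHz, as described in the text.
centers = erb_space(100.0, 8000.0, 64)

# Hypothetical sample rate, chosen only so that it exceeds twice the top center frequency.
FS = 22050
bank = np.stack([gammatone_ir(fc, FS) for fc in centers])
print(centers[:3].round(1), bank.shape)
```

Filtering a signal with row `k` of `bank` (for example with `np.convolve`) gives the channel output used to build the cochleogram.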
The envelope energy in each channel is calculated over sliding windows of length $N$ (hop $H$ samples), forming the cochleogram matrix:

$$C(k, m) = \sum_{i} \lvert y_k(i) \rvert^2\, w(i - mH),$$

with $y_k(i)$ as the output of the $k$-th filter and $w$ a window function of length $N$ (e.g., Hann). Standard practice suggests a 50% overlap, though the specific overlap value is not detailed.
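The windowed energy computation can be illustrated with a short NumPy sketch. It assumes the energy is the Hann-weighted sum of squared filter outputs in each frame. The function name, the window length of 256, and the hop of 128 (a 50% overlap) are illustrative choices, not values from the source.

```python
import numpy as np

def cochleogram(channel_outputs, win_len, hop):
    """Frame-wise energy of each filter output.

    channel_outputs: array of shape (K, n_samples), one row per gammatone channel.
    Returns C of shape (K, M) with C[k, m] = sum_i |y_k(i)|^2 * w(i - m*hop).
    """
    y = np.asarray(channel_outputs, dtype=float)
    n_channels, n_samples = y.shape
    if n_samples < win_len:
        raise ValueError("signal is shorter than one analysis window")
    window = np.hanning(win_len)
    n_frames = 1 + (n_samples - win_len) // hop
    C = np.empty((n_channels, n_frames))
    for m in range(n_frames):
        frame = y[:, m * hop : m * hop + win_len]
        C[:, m] = np.sum(frame ** 2 * window, axis=1)
    return C

# Stand-in for real filter outputs: two channels of constant amplitude 1.
demo_outputs = np.ones((2, 1000))
C_demo = cochleogram(demo_outputs, win_len=256, hop=128)
print(C_demo.shape)
```

With a constant unit input every entry equals the sum of the window, which makes the sketch easy to sanity-check before applying it to real channel outputs.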
2. Vision Transformer Architecture
- Patch Embedding: The cochleogram (dimensions approximately $64 \times T$, frequency channels by time frames) is partitioned into fixed-size patches (e.g., $P \times P$), flattened, and linearly projected into $D$-dimensional embeddings.
- Positional Encoding: A learned 1D positional embedding (length equal to the number of patches plus one) is added, with a [CLS] token prepended for classification.
- Encoder Stack: CogViT employs $L = 6$ transformer encoder layers, each with multi-head self-attention ($h$ heads, head dimension $d_h$), a two-layer feedforward network (hidden size $d_{\mathrm{ff}}$), residual connections, LayerNorm, and dropout $p$ after both attention and MLP sublayers.
- Classification Head: The final [CLS] hidden state passes through a single-layer MLP (mapping the $D$-dimensional state to $C$ logits, where $C = 2$ for binary or $C = 4$ for multiclass), followed by a softmax function.
No modifications are introduced to the transformer block sizes or hyperparameters relative to the original ViT configuration. Hyperparameter ablation for the depth $L$, the number of heads $h$, or the model dimension $D$ is not reported.
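The data flow through such a model (patches, [CLS] token, positional embedding, encoder stack, classification head) can be traced with a forward-pass sketch in plain NumPy. Only the six encoder layers and the class counts come from the text. The patch size, embedding width, head count, hidden size, and input width below are hypothetical. Weights are random, dropout is omitted as at inference time, and LayerNorm has no learnable parameters, so this shows tensor shapes rather than a trained classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(x, p):
    """Split an (H, W) map into non-overlapping p x p patches -> (N, p*p)."""
    h, w = x.shape
    assert h % p == 0 and w % p == 0, "map must tile exactly into patches"
    x = x.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    return x.reshape(-1, p * p)

def layer_norm(x, eps=1e-6):
    # LayerNorm without the learnable scale and shift, for brevity.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def self_attention(x, wq, wk, wv, wo, n_heads):
    n, d = x.shape
    dh = d // n_heads

    def split(t):  # (n, d) -> (heads, n, dh)
        return t.reshape(n, n_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, n, n)
    merged = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return merged @ wo

def encoder_layer(x, layer, n_heads):
    # Pre-norm residual blocks, as in the original ViT.
    x = x + self_attention(layer_norm(x), *layer["attn"], n_heads)
    w1, w2 = layer["mlp"]
    return x + gelu(layer_norm(x) @ w1) @ w2

def init(shape):
    return rng.normal(0.0, 0.02, size=shape)

# Hypothetical sizes, except LAYERS = 6 and CLASSES = 4 (four-class task).
H, W, P, D, HEADS, D_FF, LAYERS, CLASSES = 64, 96, 16, 64, 4, 128, 6, 4
N_PATCHES = (H // P) * (W // P)

params = {
    "embed": init((P * P, D)),
    "cls": init((1, D)),
    "pos": init((N_PATCHES + 1, D)),
    "layers": [
        {"attn": [init((D, D)) for _ in range(4)],
         "mlp": [init((D, D_FF)), init((D_FF, D))]}
        for _ in range(LAYERS)
    ],
    "head": init((D, CLASSES)),
}

def forward(cochleogram):
    tokens = patchify(cochleogram, P) @ params["embed"]               # (N, D)
    x = np.concatenate([params["cls"], tokens], axis=0) + params["pos"]
    for layer in params["layers"]:
        x = encoder_layer(x, layer, HEADS)
    return softmax(layer_norm(x)[0] @ params["head"])                 # [CLS] -> class probabilities

probs = forward(rng.random((H, W)))
print(N_PATCHES, probs.shape)
```

Because attention mixes all patch tokens in every layer, each output depends on the whole time–frequency map, which is the global-context property the text contrasts with local CNN receptive fields.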
3. Training Regimen and Data Preprocessing
Training utilizes the ICBHI 2017 respiratory sound database: 920 audio recordings from 126 patients. All audio is downsampled to 4 kHz and segmented into fixed-duration 6 s windows, zero-padded as needed.
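The segmentation step can be sketched as follows, assuming non-overlapping 6 s windows with zero-padding applied to the final, shorter window. Whether the original pipeline uses overlap or cycle-based boundaries is not stated, and the function name is hypothetical.

```python
import numpy as np

def segment_fixed(audio, fs, seg_seconds=6.0):
    """Cut a 1-D signal into fixed-length segments, zero-padding the last one."""
    audio = np.asarray(audio, dtype=float)
    seg_len = int(round(seg_seconds * fs))
    n_segs = max(1, int(np.ceil(len(audio) / seg_len)))
    padded = np.zeros(n_segs * seg_len)
    padded[: len(audio)] = audio
    return padded.reshape(n_segs, seg_len)

# A 10 s stand-in recording at 4 kHz becomes two 6 s segments (24000 samples each).
segments = segment_fixed(np.ones(40000), fs=4000)
print(segments.shape)
```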
- Cross-validation: Patient-wise 10-fold cross-validation is employed (i.e., each fold forms a test set of unseen patients).
- Optimizer and Hyperparameters: Adam optimizer with a fixed learning rate $\eta$, batch size $B$, and $E$ training epochs per fold. No on-the-fly data augmentation is applied. Early stopping is not described.
- Loss Function: Cross-entropy over the target classes, with dropout $p$ in the transformer blocks as the only explicit regularization.
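The defining property of patient-wise cross-validation is that no patient contributes recordings to both the training and test side of a fold. A minimal sketch of such a split is shown here. The function name, the random seed, and the synthetic assignment of recordings to patients are assumptions for illustration only.

```python
import numpy as np

def patient_wise_folds(patient_ids, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs in which test patients are unseen in training."""
    patient_ids = np.asarray(patient_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(patient_ids))
    for test_patients in np.array_split(shuffled, n_folds):
        test_mask = np.isin(patient_ids, test_patients)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Synthetic stand-in: 920 recordings spread over 126 patient identifiers.
ids = np.arange(920) % 126
folds = list(patient_wise_folds(ids, n_folds=10))
for train_idx, test_idx in folds:
    # No patient may appear on both sides of a fold.
    assert not set(ids[train_idx]) & set(ids[test_idx])
print(len(folds), [len(test) for _, test in folds])
```

Splitting by recording instead of by patient would let segments from one person land on both sides, which typically inflates the reported accuracy.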
4. Empirical Results and Comparative Analysis
CogViT is systematically evaluated in both two-class (wheezes-vs-others, crackles-vs-others) and four-class tasks (normal, wheeze, crackle, wheeze+crackle). All metrics are averaged across the 10 validation folds.
Binary Classification – Wheeze vs. Normal:
| TF representation | Accuracy (%) | Sensitivity (%) | Specificity (%) | Score (%) | Precision (%) |
|---|---|---|---|---|---|
| STFT | 82.1 | 65.1 | 70.1 | 67.6 | 44.8 |
| MFCC | 79.9 | 62.8 | 68.9 | 65.8 | 39.5 |
| CQT | 78.8 | 61.7 | 68.3 | 65.0 | 37.6 |
| Cochleogram | 85.9 | 76.0 | 91.0 | 83.5 | 57.6 |
- Crackle Detection: Cochleogram+ViT achieves 75.5% accuracy, sensitivity 65.2%, specificity 80.2%, score 72.7%, and precision 57.6%.
- Four-Class Task: Accuracy 67.99%, sensitivity 56.6%, specificity 71.3%, score 64.0%, precision 50.2%.
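The reported score values are consistent with the ICBHI challenge score, defined as the mean of sensitivity and specificity. The check below recomputes it from the sensitivity and specificity figures quoted above. The function and variable names are illustrative, and the tolerance allows for rounding to one decimal place.

```python
def icbhi_score(sensitivity, specificity):
    """ICBHI challenge score: arithmetic mean of sensitivity and specificity."""
    return 0.5 * (sensitivity + specificity)

# (sensitivity, specificity, reported score), all in percent, copied from the text.
reported = {
    "STFT": (65.1, 70.1, 67.6),
    "MFCC": (62.8, 68.9, 65.8),
    "CQT": (61.7, 68.3, 65.0),
    "Cochleogram": (76.0, 91.0, 83.5),
    "Crackle detection": (65.2, 80.2, 72.7),
    "Four-class": (56.6, 71.3, 64.0),
}

mismatches = []
for name, (sens, spec, score) in reported.items():
    recomputed = icbhi_score(sens, spec)
    if abs(recomputed - score) > 0.06:  # tolerance for one-decimal rounding
        mismatches.append(name)
    print(f"{name}: recomputed {recomputed:.2f}, reported {score}")
```

Because the score weights both error types equally, it is less affected than accuracy by the imbalance between normal and adventitious segments.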
CogViT outperforms CNN baselines (BaselineCNN, AlexNet, VGG16, ResNet50) on all evaluated TF representations (STFT, MFCC, CQT, cochleogram), achieving performance gains of 2–8 percentage points in accuracy.
Statistical significance is established via Mann–Whitney U and Wilcoxon signed-rank tests (significance level $\alpha$), with all comparisons to baselines yielding $p < \alpha$.
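For paired per-fold results, a one-sided Wilcoxon signed-rank test can be computed exactly by enumerating all sign assignments, which is feasible for 10 folds ($2^{10}$ cases). The sketch below is a from-scratch illustration, not the authors' procedure, and the per-fold accuracies are made-up placeholders rather than results from the paper. Library routines such as `scipy.stats.wilcoxon` and `scipy.stats.mannwhitneyu` would normally be used instead.

```python
from itertools import product

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def wilcoxon_signed_rank_exact(x, y):
    """Return (W+, one-sided exact p-value) for H1: paired x tends to exceed y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    hits = 0
    total = 0
    # Under H0 every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        total += 1
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus:
            hits += 1
    return w_plus, hits / total

# Placeholder per-fold accuracies (percent); NOT values from the paper.
model_a = [86.0, 85.5, 87.0, 84.5, 88.0, 86.5, 85.0, 87.5, 86.0, 85.5]
model_b = [82.5, 83.0, 82.0, 83.5, 81.0, 84.0, 80.5, 82.5, 86.5, 79.5]

w_plus, p_value = wilcoxon_signed_rank_exact(model_a, model_b)
print(w_plus, p_value)
```

The signed-rank test suits this setting because both models are evaluated on the same folds, so the comparison is paired. The Mann–Whitney U test treats the two sets of fold results as independent samples.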
5. Ablation and Sensitivity Analyses
- Time–frequency representation ablation: For ViT, replacement of STFT with cochleogram confers an improvement of +4.1 percentage points in wheeze detection accuracy and +2.3 percentage points in crackle detection. MFCC and CQT representations underperform STFT by approximately 2–3 percentage points and are 6–10 points behind cochleogram.
- Transformer architecture ablation: No variations in layer depth $L$, number of attention heads $h$, or model dimension $D$ are explored; all experiments retain the values of the initial design.
6. Significance, Limitations, and Implications
CogViT demonstrates the strongest empirical performance for adventitious sound classification on the ICBHI corpus when compared to standard CNNs and alternative TF representations. The cochleogram appears to be the critical factor: its biologically informed filtering yields a more separable representation of adventitious sound events. The transformer-based backend with global self-attention provides additional advantages over local receptive field CNNs when parsing this specific TF structure.
Performance advantages are statistically significant as evidenced by formal nonparametric testing. Apart from dropout in the transformer blocks, no regularization strategies, data augmentation, or early stopping protocols are incorporated, which suggests that further gains may be achievable with such enhancements.
A plausible implication is that incorporating cochlea-inspired TF representations with transformer architectures could generalize to other medical audio or event detection domains where spectral sparsity and temporal structure are informative.
7. Context within Respiratory Sound Analysis
CogViT represents the first application of cochleogram-ViT integration to adventitious sound recognition, extending prior work that relied primarily on CNN classifiers and mainstream TF representations such as STFT, MFCC, or CQT. Its statistically significant gains across both binary and four-class task definitions suggest a new methodological baseline for future studies in respiratory-sound-based disease detection (Mang et al., 2024).