SincQDR-VAD: Noise-Robust Lightweight VAD
- The paper introduces a noise-robust VAD framework that integrates a learnable Sinc-extractor with a novel quadratic disparity ranking loss to directly optimize ranking metrics.
- The framework employs an interpretable filterbank design and reduces parameters by 69% compared to prior models while achieving superior AUROC and F2-scores on diverse noisy benchmarks.
- Empirical evaluations demonstrate that SincQDR-VAD efficiently handles noisy and resource-limited scenarios through robust training techniques and a split-transform-merge neural architecture.
SincQDR-VAD is a noise-robust and lightweight voice activity detection (VAD) framework that integrates a Sinc-extractor front-end with a novel quadratic disparity ranking (QDR) loss for end-to-end discriminative learning. The framework is designed to enhance the robustness of VAD in noisy and resource-limited scenarios by learning noise-resistant spectral features and explicitly optimizing pairwise score orderings between speech and non-speech frames, thus directly improving ranking-based performance metrics such as AUROC. SincQDR-VAD is distinguished by its parameter efficiency, utilizing only 8,000 parameters (69% fewer than prior lightweight models), empirical gains in both AUROC and F2-score on diverse benchmarks, and practical suitability for deployment on resource-constrained devices (Wang et al., 28 Aug 2025).
1. Sinc-Extractor Front End
The Sinc-extractor forms the initial stage of the SincQDR-VAD pipeline, processing raw audio waveform data through a bank of learnable band-pass filters constructed using parametrized sinc functions. This approach directly replaces conventional mel-filterbanks and standard 1D convolutions, yielding more interpretable and noise-adaptive sub-band log-energy features.
For the $t$-th audio frame $\mathbf{x}_t$ (25 ms length, 10 ms hop), the filterbank comprises $K = 64$ learnable filters. The $k$-th band log-energy is computed as

$$e_k[t] = \log\!\Big(\sum_{n} \big((\mathbf{x}_t * h_k)[n]\big)^2\Big),$$

where $*$ denotes convolution and $h_k$ is the filter impulse response. Each $h_k$ is synthesized as the difference between two ideal low-pass sinc functions, parameterized by end-to-end learnable cut-off frequencies $f_{1,k}$ (low) and $f_{2,k}$ (high):

$$g_k[n] = 2 f_{2,k}\,\operatorname{sinc}(2\pi f_{2,k} n) - 2 f_{1,k}\,\operatorname{sinc}(2\pi f_{1,k} n),$$

with $0 \le f_{1,k} < f_{2,k}$. After shifting $g_k$ by $(L-1)/2$ samples and truncating it to length $L$, the filter is modulated by a learnable band gain $\alpha_k$ and a fixed Hamming window $w[n]$:

$$h_k[n] = \alpha_k\, w[n]\, g_k[n],$$

and the resulting filterbank is optimized jointly during training. The log-energies exhibit both interpretability and adaptability to environmental noise, with learned filter responses concentrating on discriminative spectral regions rather than fixed mel spacings.
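A minimal NumPy sketch of this filter construction (learnable gain omitted for brevity). The 101-tap filter length and the 300–3000 Hz band are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sinc_bandpass(f1_hz, f2_hz, length=101, sr=16000):
    """Band-pass FIR filter as the difference of two ideal low-pass
    sinc filters, tapered by a Hamming window (SincNet-style).
    The 101-tap length is an illustrative choice."""
    f1, f2 = f1_hz / sr, f2_hz / sr            # normalized cutoffs
    n = np.arange(length) - (length - 1) / 2   # centered time axis
    # np.sinc(x) = sin(pi*x)/(pi*x), so 2f*sinc(2fn) is an ideal low-pass
    g = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    return g * np.hamming(length)

def band_log_energy(frame, h):
    """Log-energy of one frame after band-pass filtering."""
    y = np.convolve(frame, h, mode="same")
    return np.log(np.sum(y ** 2) + 1e-8)

# Example: a 300-3000 Hz band passes a 1 kHz tone and rejects a 7 kHz tone.
sr = 16000
t = np.arange(int(0.025 * sr)) / sr            # one 25 ms frame
h = sinc_bandpass(300.0, 3000.0)
e_in = band_log_energy(np.sin(2 * np.pi * 1000 * t), h)
e_out = band_log_energy(np.sin(2 * np.pi * 7000 * t), h)
```

In training, `f1_hz` and `f2_hz` (and the per-band gain) would be the learnable parameters; here they are fixed to show the filter's selectivity.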
2. Quadratic Disparity Ranking Loss
Standard per-frame losses, such as binary cross-entropy (BCE), do not directly optimize ranking-based metrics that predominate in VAD evaluation. The QDR loss, introduced in SincQDR-VAD, is motivated by the objective of maximizing AUROC by preferentially optimizing the ordering of speech versus non-speech frame scores.
For speech (positive) and non-speech (negative) frame index sets $\mathcal{P}$ and $\mathcal{N}$, and output probability $p_i$ for frame $i$, the QDR loss with margin $m$ is

$$\mathcal{L}_{\mathrm{QDR}} = \frac{1}{|\mathcal{P}|\,|\mathcal{N}|} \sum_{i \in \mathcal{P}} \sum_{j \in \mathcal{N}} \big[\max\big(0,\, m - (p_i - p_j)\big)\big]^2.$$

A positive–negative frame pair incurs a penalty if $p_i - p_j < m$, with the squared margin slack enforcing smooth gradients and stable optimization. To maintain calibration, the QDR loss is interpolated with standard BCE using a fixed weight $\lambda$ determined by validation:

$$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{QDR}} + (1 - \lambda)\,\mathcal{L}_{\mathrm{BCE}}.$$

This hybrid objective enforces both correct pairwise ranking and probability fidelity, directly enhancing AUROC and recall-oriented metrics.
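The pairwise squared-hinge structure described above can be sketched in NumPy. The margin of 0.2 and weight of 0.5 are illustrative placeholders; the paper tunes both on validation data:

```python
import numpy as np

def qdr_loss(scores, labels, margin=0.2):
    """Quadratic disparity ranking loss: squared hinge over every
    speech/non-speech score pair. margin=0.2 is an assumed value."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # pairwise disparities p_i - p_j for i in P, j in N
    disparity = pos[:, None] - neg[None, :]
    return np.mean(np.maximum(0.0, margin - disparity) ** 2)

def hybrid_loss(scores, labels, lam=0.5, margin=0.2):
    """QDR interpolated with BCE; lam stands in for the tuned weight."""
    eps = 1e-7
    p = np.clip(scores, eps, 1 - eps)
    bce = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return lam * qdr_loss(scores, labels, margin) + (1 - lam) * bce

scores = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1, 1, 0, 0])
loss = qdr_loss(scores, labels)  # all pairs separated by > margin -> 0.0
```

Note that a well-ranked batch incurs zero QDR penalty even when probabilities are not yet calibrated, which is exactly why the BCE term is retained.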
3. Neural Network Architecture
The SincQDR-VAD neural network processes framed audio (16 kHz, 25 ms frame/10 ms hop) via a 64-channel Sinc-extractor, yielding a $64 \times T$ log-energy feature map. A patchify module applies a non-overlapping 2D convolution that restructures this map into temporal–frequency "patches," reducing resolution but preserving local structure.
Three encoder layers employ a split-transform-merge design with parallel local and skip paths:
- The local path uses a depth-wise convolution followed by grouped point-wise convolutions (group size 8).
- The skip path is either identity or minimal projection.
- Features are concatenated and then passed through batch normalization and nonlinear activation. Residual connections are applied throughout.
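The parameter savings from the depthwise-plus-grouped-pointwise factorization can be checked with simple counting. The 64-channel width matches the front end, but the kernel size of 3 and the reading of "group size 8" as 8 groups are assumptions:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a 2D convolution with a square k-by-k kernel
    (bias terms omitted)."""
    return (c_in // groups) * c_out * k * k

C, K = 64, 3  # channel count from the front end; kernel size assumed

standard   = conv_params(C, C, K)             # dense 3x3 convolution
depthwise  = conv_params(C, C, K, groups=C)   # one 3x3 filter per channel
pointwise  = conv_params(C, C, 1, groups=8)   # grouped 1x1 channel mixing
factorized = depthwise + pointwise
```

Here the factorized pair uses roughly 3% of the dense convolution's weights (1,088 vs. 36,864), which is how a three-layer encoder can stay within an 8k parameter budget.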
The classifier head consists of global average pooling over time, a linear layer, and sigmoid activation to produce frame-level VAD probabilities.
4. Training Regimen and Optimization
SincQDR-VAD is trained on the SCF corpus (Speech Commands V2 and Freesound noise), comprising 105,000 clean one-second utterances and 2,800 noise clips, with an 8:1:1 train/validation/test split. Speech region labels correspond to 0.2–0.83 s in each clip, with label strides of 0.15 s. Evaluation benchmarks include AVA-Speech (various SNRs and music conditions) and ACAM (environmental recordings).
Preprocessing involves 25 ms frames (10 ms hop) and the 64-filter Sinc front end; augmentation applies random time shifts (80% probability) and additive white noise with amplitude sampled uniformly from [−90 dB, −46 dB].
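The augmentation pipeline can be sketched as follows. The maximum shift of 10 ms is an assumed value (the shift range is not stated in this summary); the dB range and 80% probability are from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(frame, sr=16000, shift_prob=0.8,
            noise_db=(-90.0, -46.0), max_shift_ms=10.0):
    """Random circular time shift plus low-level additive white noise.
    max_shift_ms is an assumption, not a value from the paper."""
    out = frame.copy()
    if rng.random() < shift_prob:
        max_shift = int(max_shift_ms * sr / 1000)
        out = np.roll(out, int(rng.integers(-max_shift, max_shift + 1)))
    # draw the noise amplitude uniformly in dB, convert to linear scale
    amp = 10.0 ** (rng.uniform(*noise_db) / 20.0)
    return out + amp * rng.standard_normal(out.shape)

x = np.zeros(400)      # one 25 ms frame at 16 kHz
y = augment(x)
```

Sampling the amplitude uniformly in dB (rather than linearly) spreads the noise levels evenly across several orders of magnitude, from essentially inaudible (−90 dB) to faint but measurable (−46 dB).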
Optimization is performed with SGD (momentum 0.9, weight decay 0.001), batch size 256, and 150 epochs. The learning-rate schedule consists of 5% linear warm-up, 45% hold, and 50% polynomial decay. The QDR margin $m$ and interpolation weight $\lambda$ are held fixed throughout training at values selected on the validation set.
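The warmup-hold-decay schedule above can be expressed as a simple step-to-rate function. The base learning rate and the decay polynomial's power are illustrative assumptions:

```python
def lr_at(step, total_steps, base_lr=0.01, warm=0.05, hold=0.45, power=2.0):
    """Warmup-Hold-Decay schedule: linear warm-up over the first 5% of
    steps, constant for the next 45%, polynomial decay over the final 50%.
    base_lr and power are assumed values."""
    warm_end = warm * total_steps
    hold_end = (warm + hold) * total_steps
    if step < warm_end:                       # linear warm-up phase
        return base_lr * step / warm_end
    if step < hold_end:                       # constant hold phase
        return base_lr
    frac = (step - hold_end) / (total_steps - hold_end)
    return base_lr * (1.0 - frac) ** power    # polynomial decay to zero

total = 1000  # e.g., total optimizer steps over 150 epochs
```

The hold phase lets the model train at full rate once warm-up stabilizes the Sinc filter cutoffs, before decay anneals toward convergence.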
5. Empirical Performance
Evaluation metrics include AUROC (primary), F2-score (recall emphasis, threshold 0.5), and total parameter count. Experimental results demonstrate that SincQDR-VAD provides significant improvements over prior lightweight models on noisier and real-world benchmarks.
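For reference, the F2-score is the standard F-beta measure with beta = 2, which weights recall four times as heavily as precision:

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score from precision and recall; beta=2 emphasizes recall,
    which suits VAD where missed speech is costlier than false alarms."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, a detector with recall 1.0 and precision 0.5 scores higher F2 than one with precision 1.0 and recall 0.5, reflecting the recall emphasis.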
| Model | AVA AUROC | AVA F2 | Params (k) |
|---|---|---|---|
| SincQDR-VAD | 0.914 | 0.911 | 8 |
| TinyVAD | 0.864 | 0.645 | 11.6 |
| MarbleNet | 0.858 | 0.635 | 88.9 |
| ResectNet | 0.900 | – | 11.1 |
On AVA-Speech with added ESC-50 noise at various SNRs, SincQDR-VAD achieves an average AUROC of 0.815 compared to 0.799 (TinyVAD) and 0.747 (MarbleNet). For real-world ACAM recordings:
- SincQDR-VAD: AUROC 0.97, F2 0.92
- TinyVAD: AUROC 0.96, F2 0.65
- MarbleNet: AUROC 0.90, F2 0.44
Ablation studies reveal that both the Sinc-extractor and the QDR loss provide substantial and complementary gains: removing either component severely degrades AUROC on both clean and noisy conditions.
6. Robustness and Efficiency
The learnable Sinc filters in the front end enable adaptive allocation of spectral bandwidths to emphasize speech-dominant frequency regions and attenuate noise-prone bands. Empirical analysis (see Fig. 6–7 in the source) indicates that learned cutoff frequencies deviate from uniform mel spacing and focus on maximally discriminative ranges.
With only 8,000 parameters, SincQDR-VAD is approximately 69% smaller than comparable lightweight architectures. Contributing factors include the shallow, split-transform-merge encoder and direct time-domain filtering. The framework is suitable for real-time voice activity detection on edge and resource-limited hardware, with no need for computationally expensive spectrogram or mel layers.
7. Implementation Specifics and Reproducibility
Key implementation details:
- Frame length and hop: 25 ms and 10 ms at 16 kHz
- Filterbank: 64 learnable Sinc band-pass filters
- QDR margin and hybrid loss weight: fixed values tuned on the validation set
- Data augmentation: 80% probability of time-shift, amplitude noise sampled from [−90,−46] dB
- Optimization: SGD (momentum 0.9, weight decay 0.001), batch size 256, 150 epochs, Warmup-Hold-Decay schedule
Source code and pretrained models are provided at https://github.com/JethroWangSir/SincQDR-VAD, facilitating reproduction and further study (Wang et al., 28 Aug 2025).