SincQDR-VAD: Noise-Robust Lightweight VAD
- The paper introduces a noise-robust VAD framework that integrates a learnable Sinc-extractor with a novel quadratic disparity ranking loss to directly optimize ranking metrics.
- The framework employs an interpretable filterbank design and reduces parameters by 69% compared to prior models while achieving superior AUROC and F2-scores on diverse noisy benchmarks.
- Empirical evaluations demonstrate that SincQDR-VAD efficiently handles noisy and resource-limited scenarios through robust training techniques and a split-transform-merge neural architecture.
SincQDR-VAD is a noise-robust and lightweight voice activity detection (VAD) framework that integrates a Sinc-extractor front-end with a novel quadratic disparity ranking (QDR) loss for end-to-end discriminative learning. The framework is designed to enhance the robustness of VAD in noisy and resource-limited scenarios by learning noise-resistant spectral features and explicitly optimizing pairwise score orderings between speech and non-speech frames, thus directly improving ranking-based performance metrics such as AUROC. SincQDR-VAD is distinguished by its parameter efficiency, utilizing only 8,000 parameters (69% fewer than prior lightweight models), empirical gains in both AUROC and F2-score on diverse benchmarks, and practical suitability for deployment on resource-constrained devices (Wang et al., 28 Aug 2025).
1. Sinc-Extractor Front End
The Sinc-extractor forms the initial stage of the SincQDR-VAD pipeline, processing raw audio waveform data through a bank of learnable band-pass filters constructed using parametrized sinc functions. This approach directly replaces conventional mel-filterbanks and standard 1D convolutions, yielding more interpretable and noise-adaptive sub-band log-energy features.
For the $t$-th audio frame $\mathbf{x}_t$ (25 ms length, 10 ms hop), the filterbank comprises $K = 64$ learnable filters. The $k$-th band log-energy is computed as

$$e_k[t] = \log\!\Big(\sum_{n} \big((\mathbf{x}_t * h_k)[n]\big)^2\Big),$$

where $*$ denotes convolution and $h_k$ is the filter impulse response. Each $h_k$ is synthesized as the difference between two ideal low-pass sinc functions, parameterized by end-to-end learnable cut-off frequencies $f_{1,k}$ (low) and $f_{2,k}$ (high):

$$g_k[n] = 2 f_{2,k}\,\operatorname{sinc}(2\pi f_{2,k} n) - 2 f_{1,k}\,\operatorname{sinc}(2\pi f_{1,k} n),$$

with $0 \le f_{1,k} < f_{2,k}$. After shifting $g_k$ by $(L-1)/2$ samples and truncating it to length $L$, the filter is modulated by a learnable band gain $\alpha_k$ and a fixed Hamming window $w[n]$:

$$h_k[n] = \alpha_k\, w[n]\, g_k[n],$$

and the resulting filterbank is optimized jointly during training. The log-energies exhibit both interpretability and adaptability to environmental noise, with learned filter responses concentrating on discriminative spectral regions rather than fixed mel spacings.
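A minimal NumPy sketch of this filter construction (learnable gain omitted for brevity). The 101-tap filter length and the 300–3000 Hz band are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sinc_bandpass(f1_hz, f2_hz, length=101, sr=16000):
    """Band-pass FIR filter as the difference of two ideal low-pass
    sinc filters, tapered by a Hamming window (SincNet-style).
    The 101-tap length is an illustrative choice."""
    f1, f2 = f1_hz / sr, f2_hz / sr            # normalized cutoffs
    n = np.arange(length) - (length - 1) / 2   # centered time axis
    # np.sinc(x) = sin(pi*x)/(pi*x), so 2f*sinc(2fn) is an ideal low-pass
    g = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    return g * np.hamming(length)

def band_log_energy(frame, h):
    """Log-energy of one frame after band-pass filtering."""
    y = np.convolve(frame, h, mode="same")
    return np.log(np.sum(y ** 2) + 1e-8)

# Example: a 300-3000 Hz band passes a 1 kHz tone and rejects a 7 kHz tone.
sr = 16000
t = np.arange(int(0.025 * sr)) / sr            # one 25 ms frame
h = sinc_bandpass(300.0, 3000.0)
e_in = band_log_energy(np.sin(2 * np.pi * 1000 * t), h)
e_out = band_log_energy(np.sin(2 * np.pi * 7000 * t), h)
```

In training, `f1_hz` and `f2_hz` (and the per-band gain) would be the learnable parameters; here they are fixed to show the filter's selectivity.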
2. Quadratic Disparity Ranking Loss
Standard per-frame losses, such as binary cross-entropy (BCE), do not directly optimize ranking-based metrics that predominate in VAD evaluation. The QDR loss, introduced in SincQDR-VAD, is motivated by the objective of maximizing AUROC by preferentially optimizing the ordering of speech versus non-speech frame scores.
For speech (positive) and non-speech (negative) frame index sets $\mathcal{P}$ and $\mathcal{N}$, and output probability $p_i$ for frame $i$, the QDR loss with margin $m$ is

$$\mathcal{L}_{\mathrm{QDR}} = \frac{1}{|\mathcal{P}|\,|\mathcal{N}|} \sum_{i \in \mathcal{P}} \sum_{j \in \mathcal{N}} \big[\max\big(0,\, m - (p_i - p_j)\big)\big]^2.$$

A positive–negative frame pair incurs a penalty if $p_i - p_j < m$, with the squared margin slack enforcing smooth gradients and stable optimization. To maintain calibration, the QDR loss is interpolated with standard BCE using a fixed weight $\lambda$ determined by validation:

$$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{QDR}} + (1 - \lambda)\,\mathcal{L}_{\mathrm{BCE}}.$$

This hybrid objective enforces both correct pairwise ranking and probability fidelity, directly enhancing AUROC and recall-oriented metrics.
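The pairwise squared-hinge structure described above can be sketched in NumPy. The margin of 0.2 and weight of 0.5 are illustrative placeholders; the paper tunes both on validation data:

```python
import numpy as np

def qdr_loss(scores, labels, margin=0.2):
    """Quadratic disparity ranking loss: squared hinge over every
    speech/non-speech score pair. margin=0.2 is an assumed value."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # pairwise disparities p_i - p_j for i in P, j in N
    disparity = pos[:, None] - neg[None, :]
    return np.mean(np.maximum(0.0, margin - disparity) ** 2)

def hybrid_loss(scores, labels, lam=0.5, margin=0.2):
    """QDR interpolated with BCE; lam stands in for the tuned weight."""
    eps = 1e-7
    p = np.clip(scores, eps, 1 - eps)
    bce = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return lam * qdr_loss(scores, labels, margin) + (1 - lam) * bce

scores = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1, 1, 0, 0])
loss = qdr_loss(scores, labels)  # all pairs separated by > margin -> 0.0
```

Note that a well-ranked batch incurs zero QDR penalty even when probabilities are not yet calibrated, which is exactly why the BCE term is retained.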
3. Neural Network Architecture
The SincQDR-VAD neural network processes framed audio (16 kHz, 25 ms frame/10 ms hop) via a 64-channel Sinc-extractor, yielding a $64 \times T$ log-energy feature map. A patchify module applies a non-overlapping 2D convolution that restructures this map into temporal–frequency "patches," reducing resolution but preserving local structure.
Three encoder layers employ a split-transform-merge design with parallel local and skip paths:
- The local path uses a depth-wise convolution followed by grouped point-wise convolutions (group size 8).
- The skip path is either identity or minimal projection.
- Features are concatenated and then passed through batch normalization and nonlinear activation. Residual connections are applied throughout.
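The parameter savings from the depthwise-plus-grouped-pointwise factorization can be checked with simple counting. The 64-channel width matches the front end, but the kernel size of 3 and the reading of "group size 8" as 8 groups are assumptions:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a 2D convolution with a square k-by-k kernel
    (bias terms omitted)."""
    return (c_in // groups) * c_out * k * k

C, K = 64, 3  # channel count from the front end; kernel size assumed

standard   = conv_params(C, C, K)             # dense 3x3 convolution
depthwise  = conv_params(C, C, K, groups=C)   # one 3x3 filter per channel
pointwise  = conv_params(C, C, 1, groups=8)   # grouped 1x1 channel mixing
factorized = depthwise + pointwise
```

Here the factorized pair uses roughly 3% of the dense convolution's weights (1,088 vs. 36,864), which is how a three-layer encoder can stay within an 8k parameter budget.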
The classifier head consists of global average pooling over time, a linear layer, and sigmoid activation to produce frame-level VAD probabilities.
4. Training Regimen and Optimization
SincQDR-VAD is trained on the SCF corpus (Speech Commands V2 and Freesound noise), comprising 105,000 clean one-second utterances and 2,800 noise clips, with an 8:1:1 train/validation/test split. Speech region labels correspond to 0.2–0.83 s in each clip, with label strides of 0.15 s. Evaluation benchmarks include AVA-Speech (various SNRs and music conditions) and ACAM (environmental recordings).
Preprocessing involves 25 ms frames (10 ms hop) and the 64-filter Sinc front end; augmentation applies random time shifts (80% probability) and additive white noise with amplitude sampled uniformly from [−90 dB, −46 dB].
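The augmentation pipeline can be sketched as follows. The maximum shift of 10 ms is an assumed value (the shift range is not stated in this summary); the dB range and 80% probability are from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(frame, sr=16000, shift_prob=0.8,
            noise_db=(-90.0, -46.0), max_shift_ms=10.0):
    """Random circular time shift plus low-level additive white noise.
    max_shift_ms is an assumption, not a value from the paper."""
    out = frame.copy()
    if rng.random() < shift_prob:
        max_shift = int(max_shift_ms * sr / 1000)
        out = np.roll(out, int(rng.integers(-max_shift, max_shift + 1)))
    # draw the noise amplitude uniformly in dB, convert to linear scale
    amp = 10.0 ** (rng.uniform(*noise_db) / 20.0)
    return out + amp * rng.standard_normal(out.shape)

x = np.zeros(400)      # one 25 ms frame at 16 kHz
y = augment(x)
```

Sampling the amplitude uniformly in dB (rather than linearly) spreads the noise levels evenly across several orders of magnitude, from essentially inaudible (−90 dB) to faint but measurable (−46 dB).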
Optimization is performed with SGD (momentum 0.9, weight decay 0.001), batch size 256, and 150 epochs. The learning-rate schedule consists of 5% linear warm-up, 45% hold, and 50% polynomial decay. The QDR margin $m$ and interpolation weight $\lambda$ are held fixed throughout training at values selected on the validation set.
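The warmup-hold-decay schedule above can be expressed as a simple step-to-rate function. The base learning rate and the decay polynomial's power are illustrative assumptions:

```python
def lr_at(step, total_steps, base_lr=0.01, warm=0.05, hold=0.45, power=2.0):
    """Warmup-Hold-Decay schedule: linear warm-up over the first 5% of
    steps, constant for the next 45%, polynomial decay over the final 50%.
    base_lr and power are assumed values."""
    warm_end = warm * total_steps
    hold_end = (warm + hold) * total_steps
    if step < warm_end:                       # linear warm-up phase
        return base_lr * step / warm_end
    if step < hold_end:                       # constant hold phase
        return base_lr
    frac = (step - hold_end) / (total_steps - hold_end)
    return base_lr * (1.0 - frac) ** power    # polynomial decay to zero

total = 1000  # e.g., total optimizer steps over 150 epochs
```

The hold phase lets the model train at full rate once warm-up stabilizes the Sinc filter cutoffs, before decay anneals toward convergence.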
5. Empirical Performance
Evaluation metrics include AUROC (primary), F2-score (recall emphasis, threshold 0.5), and total parameter count. Experimental results demonstrate that SincQDR-VAD provides significant improvements over prior lightweight models on noisier and real-world benchmarks.
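For reference, the F2-score is the standard F-beta measure with beta = 2, which weights recall four times as heavily as precision:

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score from precision and recall; beta=2 emphasizes recall,
    which suits VAD where missed speech is costlier than false alarms."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, a detector with recall 1.0 and precision 0.5 scores higher F2 than one with precision 1.0 and recall 0.5, reflecting the recall emphasis.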
| Model | AVA AUROC | AVA F2 | Params (k) |
|---|---|---|---|
| SincQDR-VAD | 0.914 | 0.911 | 8 |
| TinyVAD | 0.864 | 0.645 | 11.6 |
| MarbleNet | 0.858 | 0.635 | 88.9 |
| ResectNet | 0.900 | – | 11.1 |
On AVA-Speech with added ESC-50 noise at various SNRs, SincQDR-VAD achieves an average AUROC of 0.815 compared to 0.799 (TinyVAD) and 0.747 (MarbleNet). For real-world ACAM recordings:
- SincQDR-VAD: AUROC 0.97, F2 0.92
- TinyVAD: AUROC 0.96, F2 0.65
- MarbleNet: AUROC 0.90, F2 0.44
Ablation studies reveal that both the Sinc-extractor and the QDR loss provide substantial and complementary gains: removing either component severely degrades AUROC on both clean and noisy conditions.
6. Robustness and Efficiency
The learnable Sinc filters in the front end enable adaptive allocation of spectral bandwidths to emphasize speech-dominant frequency regions and attenuate noise-prone bands. Empirical analysis (see Fig. 6–7 in the source) indicates that learned cutoff frequencies deviate from uniform mel spacing and focus on maximally discriminative ranges.
With only 8,000 parameters, SincQDR-VAD is approximately 69% smaller than comparable lightweight architectures. Contributing factors include the shallow, split-transform-merge encoder and direct time-domain filtering. The framework is suitable for real-time voice activity detection on edge and resource-limited hardware, with no need for computationally expensive spectrogram or mel layers.
7. Implementation Specifics and Reproducibility
Key implementation details:
- Frame length and hop: 25 ms and 10 ms at 16 kHz
- Filterbank: 64 learnable Sinc band-pass filters
- QDR margin and hybrid loss weight: fixed values tuned on the validation set
- Data augmentation: 80% probability of time-shift, amplitude noise sampled from [−90,−46] dB
- Optimization: SGD (momentum 0.9, weight decay 0.001), batch size 256, 150 epochs, Warmup-Hold-Decay schedule
Source code and pretrained models are provided at https://github.com/JethroWangSir/SincQDR-VAD, facilitating reproduction and further study (Wang et al., 28 Aug 2025).