Multi-Scale SB-CQT Discriminator for Vocoders

Updated 2 May 2026

MS-SB-CQTD is a neural module that applies multi-scale Constant-Q Transform spectrograms to enhance pitch accuracy and harmonic detail in vocoder systems.
It employs octave-wise sub-band processing with dedicated 2D CNNs to align temporal features across variable window lengths inherent in CQT representations.
Integration with GAN vocoders such as HiFi-GAN demonstrates measurable improvements in F0 accuracy and perceptual quality, enabling high-fidelity speech and singing synthesis.

The Multi-Scale Sub-Band Constant-Q Transform Discriminator (MS-SB-CQTD) is a neural network module designed for use as the principal adversarial discriminator in GAN-based high-fidelity vocoders. It operates on multiple Constant-Q Transform (CQT) spectrograms at different frequency resolutions, leverages a sub-band (octave-wise) grouping architecture to align temporal features, and provides complementary discrimination to Short-Time Fourier Transform (STFT)-based models, specifically enhancing pitch accuracy and harmonic detail in both speech and singing voice synthesis (Gu et al., 2024, Gu et al., 2023).

1. Mathematical Foundations: Constant-Q Transform

The MS-SB-CQTD operates on the CQT, a time–frequency representation characterized by geometrically spaced center frequencies and variable window lengths, with a constant quality factor $Q$. Given a real input signal $x[n]$, the CQT is defined by:

$X^{\mathrm{CQT}}(k, t) = \sum_{m = t - \lfloor N_k/2 \rfloor}^{t + \lfloor N_k/2 \rfloor} x[m]\, a_k^*\left(m - t + N_k/2\right)$

where $k$ indexes frequency bins and $a_k^*$ is the complex conjugate of the filter

$a_k[n] = \frac{1}{N_k}\, w\!\left(\frac{n}{N_k}\right) \exp\left(-i 2\pi \frac{Q_k}{N_k} n\right)$

with $w(\cdot)$ a window function (typically Hann), and $N_k$ the window length for bin $k$. The filterbank is constructed with a constant Q-factor:

$Q_k = \frac{f_k}{\Delta f_k} = \left(2^{1/B}-1\right)^{-1}$

where $B$ is the number of bins per octave, $f_k = f_{\min} \cdot 2^{k/B}$ for a base frequency $f_{\min}$ (commonly 32.7 Hz, C1), and $N_k = Q_k f_s / f_k$ for sampling rate $f_s$. Thus, lower frequencies yield narrow-band, long-duration filters and higher frequencies give wide-band, short-duration filters.
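The definitions above can be checked numerically with a direct, unoptimized evaluation of a single CQT coefficient. The following is a minimal sketch, not the authors' implementation: the function names, the 8 kHz toy sampling rate, the choice of 12 bins per octave, and the convention that bin $k = 0$ sits at the base frequency are all assumptions made for illustration.

```python
import cmath
import math

def q_factor(bins_per_octave):
    """Constant Q for B bins per octave: Q = 1 / (2^(1/B) - 1)."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

def center_freq(k, bins_per_octave, f_min=32.7):
    """Geometrically spaced centre frequency of bin k (k = 0 is f_min)."""
    return f_min * 2.0 ** (k / bins_per_octave)

def window_len(k, bins_per_octave, sample_rate, f_min=32.7):
    """Window length N_k in samples; it shrinks as the bin frequency rises."""
    q = q_factor(bins_per_octave)
    return round(q * sample_rate / center_freq(k, bins_per_octave, f_min))

def cqt_bin(x, k, t, bins_per_octave, sample_rate, f_min=32.7):
    """Directly evaluate one CQT coefficient X(k, t) with a Hann window."""
    n_k = window_len(k, bins_per_octave, sample_rate, f_min)
    q = q_factor(bins_per_octave)
    start = t - n_k // 2  # window is centred on frame position t
    acc = 0j
    for n in range(n_k):
        w = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_k)        # Hann window
        a = (w / n_k) * cmath.exp(-2j * math.pi * q * n / n_k)   # filter a_k[n]
        acc += x[start + n] * a.conjugate()
    return acc

sr = 8000  # low rate chosen only to keep this toy example fast
tone = [math.cos(2.0 * math.pi * 440.0 * n / sr) for n in range(2000)]
mag_a4 = abs(cqt_bin(tone, 45, 1000, 12, sr))  # bin centred near 440 Hz
mag_a5 = abs(cqt_bin(tone, 57, 1000, 12, sr))  # bin one octave higher
print(window_len(0, 12, sr), window_len(45, 12, sr), mag_a4, mag_a5)
```

The lowest bin uses a far longer window than the bin near 440 Hz, and the 440 Hz tone responds strongly only in the bin whose center frequency matches it.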

2. Sub-Band Processor (SBP): Octave-wise Grouping and Temporal Alignment

Due to the octave-varying filter lengths, CQT spectrograms exhibit temporal misalignment across frequency bins. The SBP module mitigates this by dividing the CQT (real and imaginary channels) into $O$ octave sub-bands, each corresponding to the $B$ frequency bins of one octave (for typical configurations, $O = 9$ for 32.7 Hz to ~16 kHz). Each sub-band is processed by a dedicated 2D CNN (kernel size 3×9, two input channels), producing learned latents with $C$ channels per band, with $C = 32$. These are concatenated along the frequency axis, restoring temporal alignment across octaves and yielding a feature map that spans all $O \cdot B$ bins. This octave-wise grouping is essential for mitigating the desynchronization effects that arise in naive 2D convolutions over CQTs with frequency-dependent windows (Gu et al., 2024, Gu et al., 2023).
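The split, process, and concatenate pattern of the SBP can be sketched on plain nested lists. This is an illustrative sketch only: `process_band` is a hypothetical stand-in (a per-band scaling) for the dedicated per-octave 2D CNN, and a single channel is used instead of the real and imaginary pair to keep the example small.

```python
def split_octaves(cqt, n_octaves, bins_per_octave):
    """Split a [K][T] CQT (K = O * B rows, low to high) into O bands of [B][T]."""
    assert len(cqt) == n_octaves * bins_per_octave
    return [cqt[o * bins_per_octave:(o + 1) * bins_per_octave]
            for o in range(n_octaves)]

def process_band(band, gain):
    """Hypothetical stand-in for the per-octave CNN: scale every value."""
    return [[gain * v for v in row] for row in band]

def sub_band_processor(cqt, n_octaves, bins_per_octave, gains):
    """Apply a separate transform to each octave, then rejoin along frequency."""
    bands = split_octaves(cqt, n_octaves, bins_per_octave)
    outs = [process_band(b, g) for b, g in zip(bands, gains)]
    return [row for band in outs for row in band]

# Toy input: O = 3 octaves, B = 2 bins per octave, T = 4 frames.
toy = [[float(k * 10 + t) for t in range(4)] for k in range(6)]
bands = split_octaves(toy, 3, 2)
out = sub_band_processor(toy, 3, 2, gains=[1.0, 0.5, 0.25])
print(len(bands), len(out), len(out[0]))
```

Because every octave has its own parameters, each branch can learn to compensate for the window length of its own frequency range before the bands are rejoined.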

3. Multi-Scale CQT Architecture

MS-SB-CQTD employs $M$ sub-discriminators, each operating on a different CQT configuration to capture a range of time–frequency resolutions:

Each scale $i$ uses $B_i$ bins per octave
Small $B_i$ (e.g., 24): lower $Q$, shorter filters, better time resolution
Large $B_i$ (e.g., 48): higher $Q$, longer filters, finer frequency resolution

At each scale, the CQT is computed with a global hop length of 256 samples, yielding $K_i = O \cdot B_i$ total bins, where $O$ is the number of octaves.
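A short calculation makes the effect of the bins-per-octave setting concrete. The sketch below assumes 9 octaves starting at 32.7 Hz and a 48 kHz sampling rate; the function name and dictionary keys are illustrative, and only the two example values 24 and 48 from the text are used.

```python
def scale_summary(bins_per_octave, n_octaves=9, sample_rate=48000, f_min=32.7):
    """Summarize one CQT scale: Q, total bins, longest and shortest window."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    total_bins = n_octaves * bins_per_octave
    f_top = f_min * 2.0 ** ((total_bins - 1) / bins_per_octave)
    return {
        "q": q,
        "total_bins": total_bins,
        "longest_window": q * sample_rate / f_min,   # samples, lowest bin
        "shortest_window": q * sample_rate / f_top,  # samples, highest bin
    }

s24 = scale_summary(24)
s48 = scale_summary(48)
for name, s in (("B=24", s24), ("B=48", s48)):
    print(name, s["total_bins"], round(s["q"], 1),
          round(s["longest_window"]), round(s["shortest_window"]))
```

Doubling the bins per octave roughly doubles $Q$ and therefore every window length, which trades time resolution for frequency resolution. Using several values of $B_i$ lets the sub-discriminators cover both ends of that trade-off.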

4. Sub-Discriminator Neural Network Structures

Each of the $M$ sub-discriminators has the following architecture:

SBP front-end: 2D CNN on each octave sub-band, as described above
Main Stack:
- Conv2D (kernel=3×8, stride=1×2, channels=32, LeakyReLU, WeightNorm)
- Three dilated Conv2D blocks (kernel=3×8, stride=1×2, dilations 1/2/4, channels=32, LeakyReLU, WeightNorm)
- Final Conv2D (kernel=3×3, stride=1×1, channels=1)
The output is a scalar, $D_i(x)$, obtained by averaging the final feature map over time and frequency.

A representative layout is shown below:

Component	Kernel	Stride	Dilations	Channels	Activation
SBP Conv	3×9	1×1	—	2→32	LeakyReLU(0.2)
Main Conv #1	3×8	1×2	1	32	LeakyReLU
Main Conv #2	3×8	1×2	2	32	LeakyReLU
Main Conv #3	3×8	1×2	4	32	LeakyReLU
Final Conv	3×3	1×1	—	1	—

(Gu et al., 2024)
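The layout in the table can be traced with the standard convolution output-size formula to see how the stride-2 layers shrink one axis. This is a sketch under several assumptions that the table does not state: the axis order is (time, frequency), strides act on the frequency axis, dilations act on the time axis, and padding is `dilation * (kernel - 1) // 2` on each axis. The SBP convolution is treated as one layer over all bins, which gives the same shape as per-octave processing followed by concatenation.

```python
def conv_out(size, kernel, stride=1, dilation=1):
    """Output length of one conv axis with padding = dilation * (kernel - 1) // 2."""
    pad = dilation * (kernel - 1) // 2
    return (size + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1

# (name, kernel (time, freq), stride (time, freq), time dilation, out channels)
LAYERS = [
    ("SBP Conv",     (3, 9), (1, 1), 1, 32),
    ("Main Conv #1", (3, 8), (1, 2), 1, 32),
    ("Main Conv #2", (3, 8), (1, 2), 2, 32),
    ("Main Conv #3", (3, 8), (1, 2), 4, 32),
    ("Final Conv",   (3, 3), (1, 1), 1, 1),
]

def walk(frames, bins):
    """Return (layer name, (channels, frames, bins)) after each layer."""
    shape = (2, frames, bins)  # two input channels: real and imaginary parts
    trace = [("input", shape)]
    for name, (kt, kf), (st, sf), dil, channels in LAYERS:
        t = conv_out(shape[1], kt, st, dil)
        f = conv_out(shape[2], kf, sf, 1)
        shape = (channels, t, f)
        trace.append((name, shape))
    return trace

trace = walk(frames=100, bins=216)  # 216 = 9 octaves x 24 bins per octave
for name, shape in trace:
    print(f"{name:13s} {shape}")
```

Under these assumptions the time axis keeps its length while the frequency axis is halved by each strided layer; the final single-channel map is then averaged to a scalar score.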

5. Integration into GAN-Vocoder Training

MS-SB-CQTD is designed for drop-in integration with common GAN vocoders (e.g., HiFi-GAN, BigVGAN, APNet). The full adversarial training objective consists of:

Adversarial loss for each sub-discriminator $D_i$, evaluated on real $x$ and generated $\hat{x}$ inputs:

$\mathcal{L}_{\mathrm{adv}}(D_i; G)$

Generator adversarial and feature-matching losses against each $D_i$:

$\mathcal{L}_{\mathrm{adv}}(G; D_i)$

$\mathcal{L}_{\mathrm{fm}}(G; D_i)$

Total generator loss, summed over the $M$ sub-discriminators with feature-matching weight $\lambda_{\mathrm{fm}}$:

$\mathcal{L}_G = \sum_{i=1}^{M} \left[ \mathcal{L}_{\mathrm{adv}}(G; D_i) + \lambda_{\mathrm{fm}}\, \mathcal{L}_{\mathrm{fm}}(G; D_i) \right]$

Total discriminator loss:

$\mathcal{L}_D = \sum_{i=1}^{M} \mathcal{L}_{\mathrm{adv}}(D_i; G)$

This setup can be freely extended to include multi-scale discriminators based on both STFT and CQT, with empirical results supporting joint usage for maximal synthesis quality (Gu et al., 2024, Gu et al., 2023).
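One common way to instantiate these terms in GAN vocoders is a hinge adversarial loss with an L1 feature-matching term. The sketch below assumes that form; the hinge choice, the weight `lambda_fm = 2.0`, and the dictionary layout are assumptions for illustration rather than values taken from the source, and plain Python lists stand in for tensors.

```python
def hinge_d_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1, fake below -1."""
    real = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real + fake

def hinge_g_loss(fake_scores):
    """Generator hinge loss: push scores of generated audio above +1."""
    return sum(max(0.0, 1.0 - s) for s in fake_scores) / len(fake_scores)

def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between matching (flattened) intermediate features."""
    total = 0.0
    for r, f in zip(real_feats, fake_feats):
        total += sum(abs(a - b) for a, b in zip(r, f)) / len(r)
    return total / len(real_feats)

def total_losses(discs, lambda_fm=2.0):
    """Sum generator and discriminator losses over all sub-discriminators."""
    loss_d = sum(hinge_d_loss(d["real_scores"], d["fake_scores"]) for d in discs)
    loss_g = sum(
        hinge_g_loss(d["fake_scores"])
        + lambda_fm * feature_matching_loss(d["real_feats"], d["fake_feats"])
        for d in discs
    )
    return loss_g, loss_d

# Toy outputs for one sub-discriminator, reused for two scales.
disc = {
    "real_scores": [1.0, 2.0],
    "fake_scores": [-1.0, 0.0],
    "real_feats": [[1.0, 2.0]],
    "fake_feats": [[1.0, 4.0]],
}
loss_g, loss_d = total_losses([disc, disc])
print(loss_g, loss_d)
```

Adding STFT-based discriminators alongside the CQT-based ones amounts to appending their outputs to the same list, so both kinds contribute to the two sums.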

6. Empirical Gains, Ablations, and Implementation Details

Quantitative evaluation shows that MS-SB-CQTD improves both objective and subjective metrics. In singing voice vocoding (HiFi-GAN baseline), integrating MS-SB-CQTD yields:

F0 RMSE reduction from 56.96 Hz to 35.57 Hz
MOS improvement from 3.27 to 3.66 (seen singers)
F0 Pearson correlation increase from 0.954 to 0.970

Combining MS-STFT with MS-SB-CQTD leads to further improvements (up to 3.87 MOS). PESQ increases by 0.05–0.10, and F0-RMSE decreases by 10–20 cents when swapping the STFT-based discriminator for MS-SB-CQTD. An ablation removing SBP causes increased RMSE and reduced perceptual metrics, confirming the necessity of octave-wise processing (Gu et al., 2024).
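The pitch metrics quoted above appear in both Hz and cents, so it helps to see how the two relate. The following sketch uses the standard definitions (RMSE, cents as 1200 times the base-2 log of a frequency ratio, and Pearson correlation); it assumes both F0 tracks are already aligned and restricted to voiced frames, and the function names are illustrative rather than taken from any evaluation toolkit.

```python
import math

def f0_rmse_hz(ref, est):
    """Root-mean-square F0 error in Hz over aligned voiced frames."""
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref))

def cents(f_est, f_ref):
    """Pitch interval in cents; 1200 cents is one octave."""
    return 1200.0 * math.log2(f_est / f_ref)

def f0_rmse_cents(ref, est):
    """Root-mean-square F0 error expressed in cents."""
    return math.sqrt(sum(cents(e, r) ** 2 for r, e in zip(ref, est)) / len(ref))

def pearson(ref, est):
    """Pearson correlation coefficient between two F0 tracks."""
    n = len(ref)
    mr, me = sum(ref) / n, sum(est) / n
    cov = sum((r - mr) * (e - me) for r, e in zip(ref, est))
    var_r = sum((r - mr) ** 2 for r in ref)
    var_e = sum((e - me) ** 2 for e in est)
    return cov / math.sqrt(var_r * var_e)

ref_f0 = [220.0, 233.1, 246.9, 261.6]   # toy reference contour in Hz
est_f0 = [221.0, 232.0, 249.0, 260.0]   # toy estimated contour in Hz
print(f0_rmse_hz(ref_f0, est_f0), f0_rmse_cents(ref_f0, est_f0),
      pearson(ref_f0, est_f0))
```

A fixed error in Hz corresponds to more cents at low pitch than at high pitch, so the two units are not interchangeable when comparing systems.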

Key hyper-parameters and practices:

Audio sampled at 24 kHz, upsampled to 48 kHz before CQT (to avoid aliasing at highest octaves)
CQT global hop length: 256 samples
Sub-discriminators: $M = 3$
Octaves: $O = 9$
Sub-band filters: 32 channels; WeightNorm on all convolutions
Training: AdamW optimizer ($\beta_1$, $\beta_2$, lr), batch size 16 per GPU, approximately 1.5M steps
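The upsampling step can be motivated with a quick Nyquist check. The sketch below assumes 9 octaves above the 32.7 Hz base frequency (consistent with the stated range of 32.7 Hz to about 16 kHz); the constant and function names are illustrative.

```python
F_MIN = 32.7      # base frequency in Hz (C1)
N_OCTAVES = 9     # assumed number of octaves covered by the CQT

def top_frequency(f_min=F_MIN, n_octaves=N_OCTAVES):
    """Upper edge of the analysed range: each octave doubles the frequency."""
    return f_min * 2.0 ** n_octaves

def covers(sample_rate, f_min=F_MIN, n_octaves=N_OCTAVES):
    """True if the whole CQT range lies below the Nyquist frequency."""
    return top_frequency(f_min, n_octaves) < sample_rate / 2.0

top = top_frequency()
print(round(top, 1), covers(24000), covers(48000))
```

The range tops out near 16.7 kHz, above the 12 kHz Nyquist limit of 24 kHz audio but below the 24 kHz limit at 48 kHz, which is consistent with upsampling before the transform.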

MS-SB-CQTD extends time–frequency representation discriminators for neural vocoding. Unlike MS-STFT discriminators, which apply a fixed time–frequency resolution at each scale, the CQT and SBP architecture provides frequency-dependent resolution suited to harmonic structure, which matters for expressive signals such as singing voice. These design choices are validated empirically across multiple corpora and vocoder architectures, and the module integrates with existing GAN pipelines (Gu et al., 2024, Gu et al., 2023).

Earlier work by Sprechmann, Bruna, and LeCun explored the application of multi-scale CQT-like (scattering) transforms for audio feature extraction, but these were employed as input representations for supervised separation networks rather than adversarial discriminators (Sprechmann et al., 2014). The discriminative capacity of the CQT is distinct from, and complementary to, both STFT and continuous wavelet-based approaches. Empirical results indicate that joint training with multiple such discriminators provides optimal coverage of both pitch and temporal detail.

8. Limitations and Future Directions

MS-SB-CQTD adds no inference cost, as discriminators are not used at test time. The main limitation is the increased complexity during training. The architecture is fully modular, allowing for straightforward integration into any GAN-based vocoder. Prospective research directions include exploring more adaptive time–frequency transforms, optimizing SBP strategies, or designing hybrid frequency representation frameworks that further enhance harmonic discrimination (Gu et al., 2024, Gu et al., 2023).


References

Gu et al. (2024). An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder.

Gu et al. (2023). Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder.

Sprechmann et al. (2014). Audio Source Separation with Discriminative Scattering Networks.
