Multi-Scale SB-CQT Discriminator for Vocoders
- MS-SB-CQTD is a neural module that applies multi-scale Constant-Q Transform spectrograms to enhance pitch accuracy and harmonic detail in vocoder systems.
- It employs octave-wise sub-band processing with dedicated 2D CNNs to align temporal features across variable window lengths inherent in CQT representations.
- Integration with GAN vocoders such as HiFi-GAN demonstrates measurable improvements in F0 accuracy and perceptual quality, enabling high-fidelity speech and singing synthesis.
The Multi-Scale Sub-Band Constant-Q Transform Discriminator (MS-SB-CQTD) is a neural network module designed for use as the principal adversarial discriminator in GAN-based high-fidelity vocoders. It operates on multiple Constant-Q Transform (CQT) spectrograms at different frequency resolutions, leverages a sub-band (octave-wise) grouping architecture to align temporal features, and provides complementary discrimination to Short-Time Fourier Transform (STFT)-based models, specifically enhancing pitch accuracy and harmonic detail in both speech and singing voice synthesis (Gu et al., 2024, Gu et al., 2023).
1. Mathematical Foundations: Constant-Q Transform
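The MS-SB-CQTD operates on the CQT, a time–frequency representation characterized by geometrically spaced center frequencies and variable window lengths, with a constant quality factor $Q$. Given a real input signal $x(n)$, the CQT is defined by:

$$X^{\mathrm{CQ}}(k, n) = \sum_{j = n - \lfloor N_k/2 \rfloor}^{n + \lfloor N_k/2 \rfloor} x(j)\, a_k^{*}\!\left(j - n + N_k/2\right),$$

where $k$ indexes frequency bins, each associated with filter

$$a_k(n) = \frac{1}{N_k}\, w\!\left(\frac{n}{N_k}\right) \exp\!\left(-i\, 2\pi n \frac{f_k}{f_s}\right),$$

with $w(\cdot)$ a window function (typically Hann), $f_k$ the center frequency of bin $k$, $f_s$ the sampling rate, and $N_k$ the window length for bin $k$. The filterbank is constructed with a constant Q-factor:

$$Q = \frac{f_k}{f_{k+1} - f_k} = \frac{1}{2^{1/B} - 1},$$

where $B$ is the number of bins per octave, $f_k = f_1 \cdot 2^{(k-1)/B}$ for a base frequency $f_1$ (commonly 32.7 Hz, C1), and $N_k = Q f_s / f_k$. Thus, lower frequencies yield narrow-band, long-duration filters and higher frequencies give wide-band, short-duration filters.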
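The filterbank relations can be made concrete numerically. The following is a minimal sketch, assuming the textbook constant-Q relations $Q = 1/(2^{1/B} - 1)$, geometrically spaced center frequencies, and $N_k = Q f_s / f_k$ rounded up to whole samples. The helper name `cqt_filterbank`, the zero-based bin index, and the example settings (24 bins per octave, 9 octaves, 48 kHz) are illustrative choices rather than anything taken from the authors' code.

```python
import math

def cqt_filterbank(f_min, bins_per_octave, n_octaves, sample_rate):
    """Return (Q, center_freqs, window_lengths) for a constant-Q filterbank.

    Hypothetical helper for illustration. Bins are counted from 0 here,
    so f_k = f_min * 2**(k / B), and N_k = ceil(Q * fs / f_k).
    """
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    n_bins = bins_per_octave * n_octaves
    freqs = [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]
    lengths = [math.ceil(q * sample_rate / f) for f in freqs]
    return q, freqs, lengths

q, freqs, lengths = cqt_filterbank(32.7, 24, 9, 48_000)
print(f"Q = {q:.2f}")
print(f"lowest bin:  {freqs[0]:8.1f} Hz, window {lengths[0]} samples")
print(f"highest bin: {freqs[-1]:8.1f} Hz, window {lengths[-1]} samples")
```

The window length shrinks monotonically as the center frequency rises, which is the source of the frequency-dependent time resolution discussed in the next section.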
2. Sub-Band Processor (SBP): Octave-wise Grouping and Temporal Alignment
Because filter lengths vary from octave to octave, CQT spectrograms exhibit temporal misalignment across frequency bins. The SBP module mitigates this by dividing the CQT (real and imaginary channels) into $N_{\text{oct}}$ octave sub-bands, each containing the $B$ frequency bins of one octave (for typical configurations, $N_{\text{oct}} = 9$, covering 32.7 Hz to ~16 kHz). Each sub-band is processed by a dedicated 2D CNN (kernel size 3×9, two input channels), producing a learned latent of size $C \times T \times B$ per band, where $C$ is the number of output channels (32 in the layout given in Section 4) and $T$ is the number of frames. These latents are concatenated along the frequency axis, restoring temporal alignment across octaves and yielding a feature map of size $C \times T \times (B \cdot N_{\text{oct}})$. This octave-wise grouping is essential for mitigating the desynchronization effects that arise in naive 2D convolutions over CQTs with frequency-dependent windows (Gu et al., 2024, Gu et al., 2023).
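The group, process, and concatenate structure of the SBP can be sketched in a few lines. This toy example uses a single real-valued channel instead of the real and imaginary pair, and `process_band` is a stand-in that merely scales each octave; it is not the learned per-octave CNN. All function names are hypothetical.

```python
def split_into_octaves(spec, bins_per_octave):
    """Split a CQT grid spec[t][k] into per-octave sub-bands.

    Returns a list of sub-bands, each shaped [T][bins_per_octave].
    """
    n_bins = len(spec[0])
    if n_bins % bins_per_octave != 0:
        raise ValueError("bin count must be a multiple of bins_per_octave")
    n_octaves = n_bins // bins_per_octave
    return [
        [row[o * bins_per_octave:(o + 1) * bins_per_octave] for row in spec]
        for o in range(n_octaves)
    ]

def concat_along_frequency(bands):
    """Concatenate sub-bands back along the frequency axis, frame by frame."""
    n_frames = len(bands[0])
    return [
        [value for band in bands for value in band[t]]
        for t in range(n_frames)
    ]

def process_band(band, octave_index):
    """Stand-in for the per-octave 2D CNN (NOT the learned model):
    each band is scaled by an octave-specific constant."""
    scale = 1.0 / (octave_index + 1)
    return [[scale * v for v in row] for row in band]

# Toy input: T = 4 frames, K = 3 octaves * 2 bins per octave = 6 bins.
T, B, N_OCT = 4, 2, 3
spec = [[float(t * 10 + k) for k in range(B * N_OCT)] for t in range(T)]

bands = split_into_octaves(spec, B)
processed = [process_band(band, o) for o, band in enumerate(bands)]
merged = concat_along_frequency(processed)
print(len(bands), "bands ->", len(merged), "frames x", len(merged[0]), "bins")
```

Because every octave gets its own processing function, each one can learn to compensate for that octave's window length before the bands are merged into a single map.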
3. Multi-Scale CQT Architecture
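MS-SB-CQTD employs several sub-discriminators, each operating on a different CQT configuration to capture a range of time–frequency resolutions:
- Each scale $i$ uses $B_i$ bins per octave, and $Q = 1/(2^{1/B_i} - 1)$ grows with $B_i$
- Small $B_i$ (e.g., 24): lower $Q$ and shorter filters, giving coarser frequency resolution but better time resolution
- Large $B_i$ (e.g., 48): higher $Q$ and longer filters, giving finer frequency resolution at the cost of time resolution
At each scale, the CQT is computed with a global hop length of 256 samples, yielding $K_i = B_i \cdot N_{\text{oct}}$ total bins, where $N_{\text{oct}}$ is the number of octaves.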
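A quick numerical comparison shows how the bins-per-octave setting trades frequency resolution against time resolution. This is a sketch under stated assumptions: 9 octaves starting at 32.7 Hz, a 48 kHz sample rate, and a frame count of `ceil(n_samples / hop)` (libraries differ by about one frame depending on padding). The helper `scale_summary` is hypothetical.

```python
import math

def scale_summary(bins_per_octave, n_octaves=9, f_min=32.7,
                  sample_rate=48_000, hop=256, n_samples=48_000):
    """Summarize one CQT scale (illustrative helper, not library code)."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    return {
        "total_bins": bins_per_octave * n_octaves,
        "q": q,
        # Longest filter is the one at the lowest center frequency.
        "longest_window": math.ceil(q * sample_rate / f_min),
        # Frame count depends only on the shared hop, not on B.
        "frames": math.ceil(n_samples / hop),
    }

summaries = {b: scale_summary(b) for b in (24, 48)}
for b, s in summaries.items():
    print(f"B={b}: K={s['total_bins']}, Q={s['q']:.1f}, "
          f"longest window={s['longest_window']} samples, frames={s['frames']}")
```

Since the hop length is shared, every scale produces the same number of frames; only the number of bins and the filter lengths change.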
4. Sub-Discriminator Neural Network Structures
Each sub-discriminator has the following architecture:
- SBP front-end: 2D CNN on each octave sub-band, as described above
- Main stack:
  - Conv2D (kernel=3×8, stride=1×2, channels=32, LeakyReLU, WeightNorm)
  - Three dilated Conv2D blocks (kernel=3×8, stride=1×2, dilations 1/2/4, channels=32, LeakyReLU, WeightNorm)
  - Final Conv2D (kernel=3×3, stride=1×1, channels=1)
- The output is a scalar score, obtained by averaging the final feature map over time and frequency.
A representative layout is shown below:
| Component | Kernel | Stride | Dilations | Channels | Activation |
|---|---|---|---|---|---|
| SBP Conv | 3×9 | 1×1 | — | 2→32 | LeakyReLU(0.2) |
| Main Conv #1 | 3×8 | 1×2 | 1 | 32 | LeakyReLU |
| Main Conv #2 | 3×8 | 1×2 | 2 | 32 | LeakyReLU |
| Main Conv #3 | 3×8 | 1×2 | 4 | 32 | LeakyReLU |
| Final Conv | 3×3 | 1×1 | — | 1 | — |
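To see how the strided layers in the table shrink the feature map, the sketch below traces (time, frequency) sizes with the standard convolution output-size formula. Several things are assumptions not stated in the source: the first kernel axis is time and the second is frequency, dilation acts on the time axis only, and padding is `dilation * (kernel - 1) // 2` on each axis. The SBP row is traced on the concatenated map, which is valid because its padding preserves each band's width.

```python
def conv_out(size, kernel, stride=1, dilation=1, padding=0):
    """Output length along one axis of a convolution (standard formula)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def trace_shapes(n_frames, n_bins):
    """Trace (time, frequency) sizes through the layers in the table.

    Axis order, dilation axis, and padding are ASSUMPTIONS made for
    illustration; see the paragraph above.
    """
    layers = [
        # name, kernel (t, f), stride (t, f), dilation (t, f)
        ("SBP conv",     (3, 9), (1, 1), (1, 1)),
        ("Main conv #1", (3, 8), (1, 2), (1, 1)),
        ("Main conv #2", (3, 8), (1, 2), (2, 1)),
        ("Main conv #3", (3, 8), (1, 2), (4, 1)),
        ("Final conv",   (3, 3), (1, 1), (1, 1)),
    ]
    shapes = [("input", n_frames, n_bins)]
    t, f = n_frames, n_bins
    for name, kernel, stride, dilation in layers:
        pad = tuple(d * (k - 1) // 2 for k, d in zip(kernel, dilation))
        t = conv_out(t, kernel[0], stride[0], dilation[0], pad[0])
        f = conv_out(f, kernel[1], stride[1], dilation[1], pad[1])
        shapes.append((name, t, f))
    return shapes

# Example: 100 frames, 216 bins (24 bins per octave over 9 octaves).
shapes = trace_shapes(100, 216)
for name, t, f in shapes:
    print(f"{name:13s} time={t:4d} freq={f:4d}")
```

Under these assumptions the time axis is preserved while each stride-2 layer roughly halves the frequency axis, so the final map that gets averaged is much narrower in frequency than the input.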
5. Integration into GAN-Vocoder Training
MS-SB-CQTD is designed for drop-in integration with common GAN vocoders (e.g., HiFi-GAN, BigVGAN, APNet). The full adversarial training objective consists of:
- Adversarial loss for each sub-discriminator $D_i$ (for real $x$ and generated $\hat{x}$ inputs)
- Generator adversarial and feature-matching losses
- Total generator loss, aggregated over all sub-discriminators
- Total discriminator loss, aggregated over all sub-discriminators
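As a rough illustration of how these terms combine, the sketch below assumes a hinge-style adversarial loss and an L1 feature-matching loss, both common choices in GAN vocoders. The exact loss forms and weights used by Gu et al. may differ; the weight `lambda_fm`, the dictionary layout, and all function names are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def discriminator_hinge_loss(real_scores, fake_scores):
    """Hinge loss for one sub-discriminator: push real scores above +1
    and generated scores below -1. (Assumed form, not from the source.)"""
    real_term = mean(max(0.0, 1.0 - s) for s in real_scores)
    fake_term = mean(max(0.0, 1.0 + s) for s in fake_scores)
    return real_term + fake_term

def generator_hinge_loss(fake_scores):
    """One common generator-side variant: penalize generated scores below +1."""
    return mean(max(0.0, 1.0 - s) for s in fake_scores)

def feature_matching_loss(real_feats, fake_feats):
    """Mean absolute difference between intermediate feature maps
    (each a flat list of floats), averaged over layers."""
    per_layer = [
        mean(abs(r - f) for r, f in zip(real, fake))
        for real, fake in zip(real_feats, fake_feats)
    ]
    return mean(per_layer)

def total_losses(per_disc, lambda_fm=2.0):
    """Sum losses over sub-discriminators.

    lambda_fm is an illustrative weight, not a value from the source.
    """
    d_loss = sum(
        discriminator_hinge_loss(d["real_scores"], d["fake_scores"])
        for d in per_disc
    )
    g_loss = sum(
        generator_hinge_loss(d["fake_scores"])
        + lambda_fm * feature_matching_loss(d["real_feats"], d["fake_feats"])
        for d in per_disc
    )
    return d_loss, g_loss

# Toy scores and features for a single sub-discriminator.
example = [{
    "real_scores": [2.0, 0.5],
    "fake_scores": [-2.0, 0.0],
    "real_feats": [[1.0, 2.0]],
    "fake_feats": [[1.5, 2.0]],
}]
d_loss, g_loss = total_losses(example)
print(f"discriminator loss = {d_loss:.3f}, generator loss = {g_loss:.3f}")
```

Adding an STFT-based discriminator alongside the CQT one would simply append more entries to the list passed to `total_losses`.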
This setup can be freely extended to include multi-scale discriminators based on both STFT and CQT, with empirical results supporting joint usage for maximal synthesis quality (Gu et al., 2024, Gu et al., 2023).
6. Empirical Gains, Ablations, and Implementation Details
Extensive quantitative evaluation demonstrates that MS-SB-CQTD provides substantial improvements across objective and subjective metrics. In singing voice vocoding (HiFi-GAN baseline), integrating MS-SB-CQTD yields:
- F0 RMSE reduction from 56.96 Hz to 35.57 Hz
- MOS improvement from 3.27 to 3.66 (seen singers)
- F0 Pearson correlation increase from 0.954 to 0.970
Combining MS-STFT with MS-SB-CQTD leads to further improvements (up to 3.87 MOS). PESQ increases by 0.05–0.10, and F0 RMSE decreases by 10–20 cents when swapping the STFT-based discriminator for MS-SB-CQTD. An ablation that removes the SBP increases F0 RMSE and lowers perceptual metrics, confirming the necessity of octave-wise processing (Gu et al., 2024).
Key hyper-parameters and practices:
- Audio sampled at 24 kHz, upsampled to 48 kHz before the CQT (so that the highest CQT bins, near 16 kHz, stay below the Nyquist frequency)
- CQT global hop length: 256 samples
- Sub-discriminators: one per CQT scale
- Octaves: 9 (32.7 Hz to ~16 kHz)
- Sub-band filters: 32 channels; WeightNorm on all convolutions
- Training: AdamW optimizer, batch size 16 per GPU, 1.5M steps
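The reason for upsampling before the CQT can be checked with simple arithmetic. The sketch below assumes 9 octaves above a 32.7 Hz base frequency, consistent with the 32.7 Hz to ~16 kHz range quoted earlier; the helper name is hypothetical.

```python
def top_cqt_frequency(f_min, n_octaves):
    """Upper edge of a CQT spanning n_octaves above f_min."""
    return f_min * 2 ** n_octaves

f_top = top_cqt_frequency(32.7, 9)
for sample_rate in (24_000, 48_000):
    nyquist = sample_rate / 2
    status = "fits below" if f_top < nyquist else "exceeds"
    print(f"{sample_rate} Hz: top edge {f_top:.0f} Hz {status} "
          f"Nyquist ({nyquist:.0f} Hz)")
```

At the native 24 kHz rate the top octave would extend past the Nyquist frequency, whereas at 48 kHz the whole filterbank fits.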
7. Context, Significance, and Related Approaches
MS-SB-CQTD extends the family of time–frequency-representation discriminators used in neural vocoding. Unlike MS-STFT discriminators, where each scale applies one fixed time–frequency resolution across all frequencies, the CQT and SBP architecture provide frequency-dependent resolution suited to harmonic structure, which is vital for expressive signals such as singing voice. These design strengths are empirically validated across multiple corpora and vocoder architectures, and the module integrates into existing GAN pipelines as a drop-in discriminator (Gu et al., 2024, Gu et al., 2023).
Earlier work by Sprechmann, Bruna, and LeCun explored multi-scale CQT-like (scattering) transforms for audio feature extraction, but used them as input representations for supervised separation networks rather than as adversarial discriminators (Sprechmann et al., 2014). The discriminative capacity of the CQT is distinct from, and complementary to, both STFT and continuous wavelet-based approaches. Empirical results indicate that joint training with multiple such discriminators covers both pitch and temporal detail better than any single discriminator.
8. Limitations and Future Directions
Because discriminators are used only during training, MS-SB-CQTD adds no inference cost. The main limitation is the added training cost and complexity, since each scale requires its own CQT computation and sub-discriminator. The architecture is fully modular, allowing for straightforward integration into any GAN-based vocoder. Prospective research directions include exploring more adaptive time–frequency transforms, optimizing SBP strategies, and designing hybrid frequency-representation frameworks that further enhance harmonic discrimination (Gu et al., 2024, Gu et al., 2023).