Learnable Harmonic Triangular Filter Banks
- The paper presents a novel differentiable harmonic filterbank with triangular profiles that learns center frequencies, bandwidths, and a sharpness factor for precise harmonic tracking.
- It employs exponential center frequency spacing and ERB-inspired bandwidths, achieving improved perceptual and objective metrics in speech and singing synthesis and recognition.
- The method integrates seamlessly with both STFT and raw waveform pipelines, enabling joint optimization with GAN-based vocoders and end-to-end phone recognition systems.
A harmonic filter with learnable triangular band-pass filter banks defines a differentiable front-end signal representation architecture in which each filter targets a specific harmonic or subharmonic band, employs triangular frequency-domain profiles with flexible bandwidths, and allows data-driven adaptation of both the center frequencies and “sharpness” of the pass-bands. This structure explicitly aligns its bands to multiples of base frequencies, is suitable for both STFT-domain and raw waveform front-ends, and is optimized jointly with downstream neural architectures, including GAN-based vocoder discriminators and end-to-end phone recognition systems. The learnable parameters—center frequencies, bandwidths, and overall scaling—permit dynamic adjustment to the signal’s spectral structure and enable fine-grained harmonic resolution, notably improving perceptual and objective metrics for speech and singing synthesis and recognition (Xu et al., 3 Dec 2025, Zeghidour et al., 2017).
1. Mathematical Construction of Triangular Band-Pass Harmonic Filters
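The core of the design is a family of triangular band-pass filters parameterized by a harmonic index $h$ and a base-band index $k$. For a continuous frequency variable $f$ and base center frequencies $f_c(k)$, the triangular filter applied to the $h$-th harmonic is

$$\Lambda_{h,k}(f) = \left[\,1 - \frac{2\,\lvert f - h\,f_c(k)\rvert}{BW_{h,k}}\,\right]_+$$

where $BW_{h,k}$ is the bandwidth for the $h$-th harmonic and $[\cdot]_+$ denotes rectification (i.e., $[x]_+ = \max(x, 0)$). Each filter is peaked at $f = h\,f_c(k)$ and linearly decays to zero at $f = h\,f_c(k) \pm BW_{h,k}/2$.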
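As a minimal sketch of the triangular profile described above, the NumPy function below evaluates one filter on a frequency grid. The function name `triangular_filter`, the argument names, and the convention that the bandwidth is the full base width of the triangle (so the response reaches zero half a bandwidth away from the center) are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def triangular_filter(freqs, base_fc, harmonic, bandwidth):
    """Rectified triangle centered at harmonic * base_fc.

    Assumes `bandwidth` is the full base width in Hz, so the response
    falls linearly from 1 at the center to 0 at center +/- bandwidth / 2.
    """
    center = harmonic * base_fc
    response = 1.0 - 2.0 * np.abs(freqs - center) / bandwidth
    return np.maximum(response, 0.0)  # rectification [x]_+ = max(x, 0)

# Second harmonic of a 110 Hz base band with a 40 Hz wide triangle.
freqs = np.array([200.0, 210.0, 220.0, 230.0, 240.0])
print(triangular_filter(freqs, base_fc=110.0, harmonic=2, bandwidth=40.0))
```

On this grid the response is 0, 0.5, 1, 0.5, 0: the peak sits at 220 Hz and the filter vanishes 20 Hz on either side.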
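In time-domain filterbank learning (Zeghidour et al., 2017), each filter's idealized frequency response is triangular,

$$\hat{\psi}_n(f) = \left[\,1 - \frac{\lvert f - \eta_n\rvert}{w_n}\,\right]_+,$$

with center frequency $\eta_n$ and half-width $w_n$, and the corresponding impulse response (via inverse Fourier transform) is

$$\psi_n(t) = w_n\,\mathrm{sinc}^2(w_n t)\,e^{2\pi i \eta_n t}, \qquad \mathrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x}.$$

For differentiable implementations, complex Gabor wavelets are employed, with the Gaussian width tuned to match the triangular pass-band.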
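One way to realize the Gabor approximation mentioned above is to choose the Gaussian width so that the wavelet's frequency response has the same full width at half maximum (FWHM) as the target triangle. The sketch below assumes that matching rule and uses hypothetical names (`gabor_wavelet`, `fwhm_hz`); it illustrates the idea and is not claimed to be the exact procedure of Zeghidour et al.

```python
import numpy as np

def gabor_wavelet(center_hz, fwhm_hz, sample_rate, length):
    """Complex Gabor wavelet whose magnitude response has the given FWHM.

    A Gaussian envelope exp(-t^2 / (2 sigma^2)) has a Gaussian spectrum with
    FWHM = sqrt(2 ln 2) / (pi * sigma), so sigma follows from fwhm_hz.
    """
    sigma = np.sqrt(2.0 * np.log(2.0)) / (np.pi * fwhm_hz)  # seconds
    t = (np.arange(length) - length // 2) / sample_rate
    envelope = np.exp(-t**2 / (2.0 * sigma**2))
    return envelope * np.exp(2j * np.pi * center_hz * t)

# Check the response numerically for a 1 kHz filter with a 100 Hz FWHM.
sr, n = 16000, 4096
w = gabor_wavelet(1000.0, 100.0, sr, n)
mag = np.abs(np.fft.fft(w))
freqs = np.fft.fftfreq(n, d=1.0 / sr)
peak_hz = freqs[np.argmax(mag)]
above_half = freqs[mag >= 0.5 * mag.max()]
measured_fwhm = above_half.max() - above_half.min()
print(peak_hz, measured_fwhm)
```

For a triangle whose full base width is a given bandwidth, the FWHM to pass in is half that bandwidth. The measured width agrees with the requested one only up to the FFT bin spacing.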
2. Parameterization and Learning Rules for Harmonic Filter Banks
Base-band center frequencies are distributed exponentially:

$$f_c(k) = f_{\min} \cdot 2^{k/B},$$

where $f_{\min}$ is the lowest base frequency in Hz (e.g., a musical low C) and $B$ is the number of bins per octave. The center frequency of the $h$-th harmonic filter is $h\,f_c(k)$. Bandwidths follow an ERB-inspired scale evaluated at that center frequency and incorporate a learnable sharpness factor, a global scalar that is initialized to 1 and learned jointly with the network parameters during training.
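The parameterization above can be sketched in a few lines of NumPy. The ERB approximation ERB(f) = 0.1079 f + 24.7 Hz is the standard Glasberg-Moore form. Everything else here is an assumption for illustration: the function names, the numeric values (32.7 Hz, 12 bins per octave, 25 bands), and in particular the choice to divide the ERB width by the sharpness factor.

```python
import numpy as np

def base_center_frequencies(f_min, bins_per_octave, n_bands):
    """Exponentially spaced base frequencies f_min * 2**(k / bins_per_octave)."""
    k = np.arange(n_bands)
    return f_min * 2.0 ** (k / bins_per_octave)

def erb_bandwidth(freq_hz, sharpness=1.0):
    """ERB-style bandwidth in Hz, narrowed as `sharpness` grows.

    Uses the Glasberg-Moore approximation ERB(f) = 0.1079 * f + 24.7.
    Dividing by `sharpness` is an assumed form, not taken from the paper.
    """
    return (0.1079 * freq_hz + 24.7) / sharpness

# Illustrative values only: 32.7 Hz (about C1), 12 bins per octave, 3 harmonics.
fc = base_center_frequencies(f_min=32.7, bins_per_octave=12, n_bands=25)
harmonics = np.array([1, 2, 3])
centers = harmonics[:, None] * fc[None, :]      # (harmonics, bands)
bandwidths = erb_bandwidth(centers, sharpness=1.0)
print(centers.shape, fc[12] / fc[0])            # one octave up doubles the frequency
```

In a trainable version, `sharpness` would be a single learnable scalar initialized to 1, so that the initial bandwidths coincide with the plain ERB curve.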
In time-domain architectures (Zeghidour et al., 2017), each complex filter is parameterized directly by its convolutional kernel weights, which are updated via automatic differentiation. Optionally, filter center frequencies can be tied to harmonic multiples of a base frequency (with the base frequency learnable or predicted), or regularized softly via a penalty enforcing a harmonic comb structure.
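The soft regularization option could take many forms. One simple possibility, assumed here purely for illustration, penalizes the squared distance between each filter's center frequency and the nearest integer multiple of a base frequency. The function name `harmonic_comb_penalty` and the squared-distance form are hypothetical and do not come from either cited paper.

```python
import numpy as np

def harmonic_comb_penalty(center_hz, f0_hz):
    """Mean squared distance (in units of f0) from each center to the nearest harmonic."""
    ratio = np.asarray(center_hz, dtype=float) / f0_hz
    nearest = np.maximum(np.round(ratio), 1.0)   # harmonic index, at least 1
    return float(np.mean((ratio - nearest) ** 2))

print(harmonic_comb_penalty([100.0, 200.0, 300.0], 100.0))   # centers on the comb
print(harmonic_comb_penalty([100.0, 210.0, 330.0], 100.0))   # centers off the comb
```

The first call returns zero because every center is an exact multiple of 100 Hz; the second is positive. In an automatic-differentiation framework the rounded harmonic index would be treated as a constant, so gradients pull each center toward its nearest harmonic.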
3. Mechanisms for Dynamic Frequency-Resolution and Harmonic Tracking
Because bandwidth scales with frequency, low-frequency harmonics are represented by narrow bands that give high spectral resolution, while higher harmonics receive broader bands that favor temporal over spectral detail. The learnable global sharpness factor lets the overall resolution adapt to the data during training. Together, these mechanisms enable individual bands to track harmonics precisely, enhancing the representation of voiced content, vibrato, and singing-specific nuances.
The addition of a half-harmonic filter ($h = 1/2$) further increases sensitivity to low-pitch energy. Its parameters are set analogously: the center frequency is $\tfrac{1}{2} f_c(k)$, and the bandwidth follows the same ERB-inspired curve evaluated at that frequency.
4. Implementation in Signal Processing Pipelines
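For STFT-based front-ends (Xu et al., 3 Dec 2025), the filter bank is applied to magnitude spectrograms: the triangular filters of each harmonic are multiplied with the spectrogram, and the results are stacked into a 3-D tensor with one axis each for harmonics, base bands, and time frames. This tensor is the transformed input to the discriminator.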
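A minimal NumPy sketch of this STFT-domain step follows. The axis order (harmonics, bands, frames), the ERB-style bandwidth with a dividing sharpness factor, the toy sizes, and the function names are all assumptions for illustration rather than details confirmed by the paper.

```python
import numpy as np

def harmonic_filterbank(freqs, fc, harmonics, sharpness=1.0):
    """Triangular filter weights with shape (harmonics, bands, freq_bins)."""
    centers = harmonics[:, None] * fc[None, :]                  # (H, K)
    bw = (0.1079 * centers + 24.7) / sharpness                  # assumed ERB-style width
    dist = np.abs(freqs[None, None, :] - centers[:, :, None])   # (H, K, F)
    return np.maximum(1.0 - 2.0 * dist / bw[:, :, None], 0.0)

def apply_filterbank(magnitude, bank):
    """Project a (freq_bins, frames) spectrogram onto every filter."""
    return np.einsum("hkf,ft->hkt", bank, magnitude)

# Toy example: 513 STFT bins at 16 kHz, 24 base bands, harmonics 0.5, 1, 2, 3.
freqs = np.linspace(0.0, 8000.0, 513)
fc = 32.7 * 2.0 ** (np.arange(24) / 12.0)
harmonics = np.array([0.5, 1.0, 2.0, 3.0])
bank = harmonic_filterbank(freqs, fc, harmonics)
spec = np.abs(np.random.default_rng(0).standard_normal((513, 100)))
out = apply_filterbank(spec, bank)
print(out.shape)  # (4, 24, 100)
```

One practical caveat: at coarse FFT resolution the narrowest low-frequency triangles can be comparable to the bin spacing, so a filter may cover only one or two bins. A longer analysis window mitigates this.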
In time-domain architectures (Zeghidour et al., 2017), the workflow includes:
- Complex convolution with filterbank kernels,
- L2 modulus pooling across real/imaginary parts,
- Optional squaring and low-pass grouped convolution for smoothing/decimation,
- Logarithmic compression and per-utterance normalization,
- Feature extraction for downstream neural networks.
A sample PyTorch workflow is:
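```python
import torch
import torch.nn.functional as F

# x: (batch, 1, time); W: (2K, 1, width), first K kernels real, last K imaginary
c = F.conv1d(x, W, bias=None, padding=W.shape[-1] // 2)
re, im = c[:, :K, :], c[:, K:, :]
y = torch.sqrt(re**2 + im**2 + eps)                  # L2 modulus pooling
z = F.conv1d(y, V, stride=stride_lp, groups=K)       # V: (K, 1, width_lp) low-pass
out = torch.log10(torch.clamp(z, min=1e-5) + 1.0)    # logarithmic compression
features = out.unsqueeze(1)                          # (batch, 1, K, frames)
```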
5. Integration with GAN Discriminators and Training Objectives
The processed tensor serves as input to a time-frequency harmonic discriminator in a GAN-based vocoder framework (Xu et al., 3 Dec 2025). The loss function combines:
- Discriminator adversarial loss (hinge or least-squares form),
- Generator adversarial loss,
- Feature-matching loss over multi-scale discriminator features.
All filter bank parameters, including the global sharpness factor, receive gradients in end-to-end training and are learned directly from the task objective.
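The three loss terms listed above can be sketched as follows. This example picks the hinge variant of the adversarial loss (the source allows hinge or least-squares), uses plain NumPy arrays in place of discriminator outputs, and leaves out the relative weighting of the terms, which the text does not specify. Function names are hypothetical.

```python
import numpy as np

def discriminator_hinge_loss(real_scores, fake_scores):
    """Hinge loss: push real scores above +1 and fake scores below -1."""
    real_term = np.mean(np.maximum(0.0, 1.0 - real_scores))
    fake_term = np.mean(np.maximum(0.0, 1.0 + fake_scores))
    return float(real_term + fake_term)

def generator_hinge_loss(fake_scores):
    """Generator tries to raise the discriminator's score on generated audio."""
    return float(-np.mean(fake_scores))

def feature_matching_loss(real_feats, fake_feats):
    """Mean L1 distance between matching intermediate feature maps."""
    return float(np.mean([np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats)]))

real_scores = np.array([1.5, 0.5])
fake_scores = np.array([-2.0, 0.0])
print(discriminator_hinge_loss(real_scores, fake_scores))
print(generator_hinge_loss(fake_scores))
```

In a real training loop these would be computed on framework tensors so that gradients reach both the network weights and the filter bank parameters.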
6. Empirical Evaluation and Observed Effects
Objective and subjective metrics on both speech and singing demonstrate clear gains from harmonic filter banks with learnable triangular profiles:
- PESQ improved by 0.07–0.13,
- MCD reduced by 0.04–0.18 dB,
- F0RMSE reduced by 1–7 points,
- MOS (Mean Opinion Score) gains of 0.1–0.3 on five-point scales.
Spectrograms of generated signals display sharper and more accurate harmonic lines, and aliasing artifacts are substantially reduced at high frequencies.
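For readers unfamiliar with two of the objective metrics above, the sketch below shows commonly used definitions of mel-cepstral distortion (MCD, in dB) and F0 RMSE. It assumes time-aligned frames, excludes the zeroth cepstral coefficient, and restricts F0 RMSE to frames voiced in both tracks; the evaluation protocol of the cited papers may differ in these details.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Frame-averaged MCD in dB for (frames, coeffs) arrays, excluding coefficient 0."""
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    per_frame = np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(10.0 / np.log(10.0) * np.mean(per_frame))

def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames that are voiced (f0 > 0) in both tracks."""
    voiced = (ref_f0 > 0) & (syn_f0 > 0)
    return float(np.sqrt(np.mean((ref_f0[voiced] - syn_f0[voiced]) ** 2)))

ref_f0 = np.array([0.0, 100.0, 110.0, 120.0])
syn_f0 = np.array([0.0, 103.0, 106.0, 120.0])
print(f0_rmse(ref_f0, syn_f0))
```

Lower values are better for both metrics, which is why the reported changes are reductions.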
Ablation studies confirm:
- Removing the half-harmonic channel increases F0RMSE by ≈2 Hz.
- Disabling the learnable sharpness factor or the triangular band shape reliably degrades both spectral and pitch performance.
In time-domain phone recognition (Zeghidour et al., 2017), learnable band-pass filterbanks outperform hand-crafted mel-filterbank (MFSC) features by 0.3–0.6 absolute points of phone error rate (PER), and the learned filters display asymmetric impulse responses and spread bandwidths reminiscent of cochlear filters.
7. Architectural and Practical Considerations
The learnable harmonic triangular filterbank design is highly flexible:
- It admits differentiable, data-driven optimization in both STFT and raw waveform settings.
- Initialization can follow established mel-scale protocols, with subsequent learning.
- Harmonic tracking is achieved via explicit parameter tying or soft regularization.
- Implementation is straightforward in modern deep learning frameworks (PyTorch, TensorFlow).
- Preprocessing choices (window lengths, decimation rates, normalization strategies) can be tuned for specific tasks.
A plausible implication is that such filterbanks may generalize well to languages, voice types, and musical instruments with rich harmonic structure, providing a universal interface for time-frequency discriminators, recognition models, and vocoders.
References:
- "A Universal Harmonic Discriminator for High-quality GAN-based Vocoder" (Xu et al., 3 Dec 2025)
- "Learning Filterbanks from Raw Speech for Phone Recognition" (Zeghidour et al., 2017)