Learnable Harmonic Triangular Filter Banks

Updated 4 December 2025

The paper presents a novel differentiable harmonic filterbank with triangular profiles that learns center frequencies, bandwidths, and a sharpness factor for precise harmonic tracking.
It employs exponential center frequency spacing and ERB-inspired bandwidths, achieving improved perceptual and objective metrics in speech and singing synthesis and recognition.
The method integrates seamlessly with both STFT and raw waveform pipelines, enabling joint optimization with GAN-based vocoders and end-to-end phone recognition systems.

A harmonic filter bank with learnable triangular band-pass filters is a differentiable front-end in which each filter targets a specific harmonic or subharmonic band, has a triangular frequency-domain profile with flexible bandwidth, and allows data-driven adaptation of both the center frequencies and the “sharpness” of the pass-bands. The structure explicitly aligns its bands to multiples of base frequencies, suits both STFT-domain and raw-waveform front-ends, and is optimized jointly with downstream neural architectures, including GAN-based vocoder discriminators and end-to-end phone recognition systems. The learnable parameters (center frequencies, bandwidths, and overall scaling) let the filters adapt to the signal’s spectral structure and resolve individual harmonics, which improves perceptual and objective metrics for speech and singing synthesis and for recognition (Xu et al., 3 Dec 2025, Zeghidour et al., 2017).

1. Mathematical Construction of Triangular Band-Pass Harmonic Filters

The core of the design is a family of triangular band-pass filters parameterized by harmonic index $h$ and base-band index $n$. For a continuous frequency variable $f$ and base center frequencies $f_n$, the triangular filter applied to the $h$-th harmonic is

$H_{h,n}(f) = \Big[1 - \frac{2 |f - h f_n|}{f_{bw}^h}\Big]_+,$

where $f_{bw}^h$ is the bandwidth for the $h$-th harmonic and $[\,\cdot\,]_+$ denotes rectification (i.e., $\max(\cdot, 0)$). Each filter peaks at $f = h f_n$ and decays linearly to zero at $f = h f_n \pm f_{bw}^h/2$.
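
The triangular profile above can be evaluated directly on a grid of frequencies. The following is a minimal NumPy sketch; the function name `triangular_response` and the example values (a 110 Hz base band, a 48 Hz bandwidth) are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def triangular_response(f, center, bandwidth):
    """Evaluate [1 - 2|f - center| / bandwidth]_+ elementwise."""
    f = np.asarray(f, dtype=float)
    return np.maximum(1.0 - 2.0 * np.abs(f - center) / bandwidth, 0.0)

# Example: second harmonic (h = 2) of a 110 Hz base band with a 48 Hz bandwidth.
h, f_n, f_bw = 2, 110.0, 48.0
probe = np.array([196.0, 208.0, 220.0, 232.0, 244.0])  # frequencies in Hz
response = triangular_response(probe, h * f_n, f_bw)
```

For these probe frequencies the response rises from 0 at 196 Hz through 0.5 at 208 Hz to 1 at the center (220 Hz), then falls symmetrically back to 0 at 244 Hz, i.e., at $h f_n \pm f_{bw}^h/2$.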

In time-domain filterbank learning (Zeghidour et al., 2017), each filter’s idealized frequency response is triangular,

$H_k(\omega) = \begin{cases} 1 - \frac{|\omega - 2\pi f_k|}{2\pi \Delta f_k}, & |\omega - 2\pi f_k| \le 2\pi \Delta f_k \\ 0, & \text{otherwise} \end{cases}$

with the impulse response (via inverse Fourier transform)

$h_k(t) = \Delta f_k \left[\mathrm{sinc}\left(\Delta f_k t\right)\right]^2 \cos(2\pi f_k t).$

For differentiable implementations, complex Gabor wavelets are employed, with the Gaussian width $\sigma_k$ tuned to match the triangular pass-band.
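
How $\sigma_k$ is tuned is not spelled out here. One plausible choice, assumed for this sketch, is to match the full width at half maximum (FWHM) of the Gabor filter's Gaussian magnitude response to that of the triangle, which equals $\Delta f_k$ for the response above. For a time-domain Gaussian envelope $\exp(-t^2 / 2\sigma_k^2)$ this gives $\sigma_k = \sqrt{2 \ln 2} / (\pi \Delta f_k)$. The function name `gabor_kernel` and all numeric values are hypothetical.

```python
import numpy as np

def gabor_kernel(f_k, delta_f, sample_rate, width):
    """Complex Gabor kernel whose magnitude response has FWHM equal to delta_f (Hz)."""
    # Time-domain Gaussian std (seconds) that yields a frequency-domain FWHM of delta_f.
    sigma = np.sqrt(2.0 * np.log(2.0)) / (np.pi * delta_f)
    t = (np.arange(width) - width // 2) / sample_rate  # time axis centered on the kernel
    envelope = np.exp(-t**2 / (2.0 * sigma**2))
    return envelope * np.exp(2j * np.pi * f_k * t)

sample_rate = 16000
kernel = gabor_kernel(f_k=1000.0, delta_f=200.0, sample_rate=sample_rate, width=401)
# Magnitude response on a 1 Hz grid (n_fft equals the sample rate, so bin index = Hz).
magnitude = np.abs(np.fft.fft(kernel, n=sample_rate))
ratio_at_half_width = magnitude[1100] / magnitude[1000]
```

Under these assumptions the response at $f_k + \Delta f_k / 2$ is close to half the peak value, the same half-maximum point as the triangular filter.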

2. Parameterization and Learning Rules for Harmonic Filter Banks

Base-band center frequencies are distributed exponentially: $f_n = f_{\text{min}} \cdot 2^{(n-1)/B}, \quad n = 1, \ldots, F,$ where $f_{\text{min}} = 32.7$ Hz (the musical note C1) and $B$ is the number of bins per octave (e.g., $B = 24$, yielding $F \approx 124$). The center frequency of the harmonic-$h$ filter is $h f_n$. Bandwidths follow an ERB-inspired scale,

$f_{bw}(f_c) \simeq 0.1079 f_c + 24.7,$

and incorporate a learnable sharpness factor $\gamma \geq 1$: $f_{bw}^h = \frac{0.1079 \cdot h f_n + 24.7}{\gamma}.$ The factor $\gamma$ is a global scalar, initialized to 1 and learned jointly with the network parameters during training; larger values of $\gamma$ give narrower, sharper filters.
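
The two formulas in this section translate directly into code. This is a minimal NumPy sketch under the example values quoted above ($f_{\text{min}} = 32.7$ Hz, $B = 24$, $F = 124$); the function names are illustrative, and $\gamma$ is treated here as a plain number rather than a trained parameter.

```python
import numpy as np

def base_center_frequencies(f_min=32.7, bins_per_octave=24, n_bands=124):
    """f_n = f_min * 2^((n-1)/B) for n = 1..F."""
    n = np.arange(1, n_bands + 1)
    return f_min * 2.0 ** ((n - 1) / bins_per_octave)

def harmonic_bandwidth(h, f_n, gamma=1.0):
    """f_bw^h = (0.1079 * h * f_n + 24.7) / gamma."""
    return (0.1079 * h * f_n + 24.7) / gamma

f_n = base_center_frequencies()
bw_h1 = harmonic_bandwidth(1, f_n)                   # first harmonic, gamma = 1
bw_h1_sharp = harmonic_bandwidth(1, f_n, gamma=2.0)  # same bands, twice as sharp
```

With $B = 24$, every 24th band doubles in frequency (for example, `f_n[24]` is one octave above `f_n[0]`), and doubling $\gamma$ halves every bandwidth.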

In time-domain architectures (Zeghidour et al., 2017), each complex filter is parameterized by convolutional kernel weights $\mathbf{W} \in \mathbb{R}^{2K \times W}$, where $K$ is the number of complex filters (real and imaginary parts stored as separate rows) and $W$ is the kernel width in samples; updates are performed via automatic differentiation. Optionally, filter center frequencies can be tied to harmonic multiples $f_k = m_k F_0$ (with $F_0$ learnable or predicted), or regularized softly via a penalty enforcing a harmonic comb structure.

3. Mechanisms for Dynamic Frequency-Resolution and Harmonic Tracking

Because $f_{bw}^h$ grows with frequency, low-frequency harmonics are represented with narrow bands that give high spectral resolution, while higher harmonics receive broader bands that favor temporal over spectral detail. The learnable global sharpness $\gamma$ lets the overall resolution adapt to the training data. Together, these mechanisms enable individual bands to track harmonics precisely, enhancing the representation of voiced content, vibrato, and singing-specific nuances.
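
As a worked illustration (the numbers are chosen for this example only, with $\gamma = 1$ and a base band at $f_n = 100$ Hz), the first and tenth harmonics receive

$f_{bw}^{1} = 0.1079 \cdot 100 + 24.7 = 35.49 \text{ Hz}, \qquad f_{bw}^{10} = 0.1079 \cdot 1000 + 24.7 = 132.6 \text{ Hz}.$

The absolute bandwidth grows with frequency, while the bandwidth relative to the center frequency falls from about 35% to about 13%. Setting $\gamma = 2$ halves both values, to about 17.7 Hz and 66.3 Hz.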

The addition of a half-harmonic filter, $h = 0.5$, further increases sensitivity to low-pitch energy. Its parameters are set analogously: $H_{0.5, n}(f) = \Big[1 - \frac{2 |f - 0.5 f_n|}{f_{bw}^{0.5}}\Big]_+,$ where $f_{bw}^{0.5} = (0.1079 \cdot 0.5 f_n + 24.7)/\gamma$, i.e., the ERB-inspired bandwidth rule evaluated at $f_c = 0.5 f_n$ and divided by $\gamma$.

4. Implementation in Signal Processing Pipelines

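For STFT-based front-ends (Xu et al., 3 Dec 2025), the filter bank is applied to magnitude spectrograms $X(f_k, t)$ by forming a 3-D tensor $Y(h, n, t)$: $Y(h, n, t) = \sum_{k} X(f_k, t) \cdot H_{h, n}(f_k),$ where the sum runs over the STFT frequency bins $f_k$. This yields the transformed input to the discriminator, with shape $(H + 0.5) \times F \times T$, where the $0.5$ denotes the extra half-harmonic channel (that is, $H + 1$ channels for $H$ integer harmonics) and $T$ is the number of frames.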
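
The projection onto harmonic bands amounts to one tensor contraction over the STFT bin axis. The sketch below uses NumPy with a random stand-in spectrogram; the STFT size, the sample rate implied by the bin grid, and the choice of four integer harmonics are assumptions made for illustration, and the half-harmonic is simply stacked as an extra channel.

```python
import numpy as np

def harmonic_filterbank(bin_freqs, base_freqs, harmonics, gamma=1.0):
    """Return filters of shape (len(harmonics), len(base_freqs), len(bin_freqs))."""
    centers = np.asarray(harmonics)[:, None] * np.asarray(base_freqs)[None, :]
    bandwidths = (0.1079 * centers + 24.7) / gamma
    dist = np.abs(np.asarray(bin_freqs)[None, None, :] - centers[:, :, None])
    return np.maximum(1.0 - 2.0 * dist / bandwidths[:, :, None], 0.0)

# Hypothetical setup: 1025 STFT bins up to 11025 Hz, 50 frames of random magnitudes.
rng = np.random.default_rng(0)
bin_freqs = np.linspace(0.0, 11025.0, 1025)
X = rng.random((1025, 50))                         # X(f_k, t)
base_freqs = 32.7 * 2.0 ** (np.arange(124) / 24)   # f_n for n = 1..124
harmonics = [0.5, 1, 2, 3, 4]                      # half-harmonic plus four harmonics
filters = harmonic_filterbank(bin_freqs, base_freqs, harmonics)
Y = np.einsum("hnk,kt->hnt", filters, X)           # Y(h, n, t)
```

Here `Y` has one channel per entry of `harmonics`, one row per base band, and one column per frame, which is the layout a 2-D convolutional discriminator would consume.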

In time-domain architectures (Zeghidour et al., 2017), the workflow includes:

Complex convolution with filterbank kernels,
L2 modulus pooling across real/imaginary parts,
Optional squaring and low-pass grouped convolution for smoothing/decimation,
Logarithmic compression and per-utterance normalization,
Feature extraction for downstream neural networks.

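A sample PyTorch workflow is:

import torch
import torch.nn.functional as F

# Assumed shapes: x (batch, 1, samples); W (2K, 1, width); V (K, 1, width_lp).
c = F.conv1d(x, W, bias=None, padding=W.shape[-1] // 2)  # complex filterbank as 2K real channels
re, im = c[:, :K, :], c[:, K:, :]                         # split real and imaginary responses
y = torch.sqrt(re**2 + im**2 + eps)                       # L2 modulus pooling
z = F.conv1d(y, V, stride=stride_lp, groups=K)            # grouped low-pass smoothing and decimation
out = torch.log10(torch.clamp(z, min=1e-5) + 1.0)         # logarithmic compression
features = out.unsqueeze(1)                               # add a channel axis for the downstream network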

All relevant weights are updated via back-propagation of the task loss.

5. Integration with GAN Discriminators and Training Objectives

The processed tensor $Y$ serves as input to a time-frequency harmonic discriminator $D_\theta$ in a GAN-based vocoder framework (Xu et al., 3 Dec 2025). The loss function combines:

Discriminator adversarial loss (hinge or least-squares form),
Generator adversarial loss,
Feature-matching loss over multi-scale discriminator features.
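
The three loss terms can be sketched as follows. The hinge form is chosen here as one of the two options named above; the reduction (plain means), the absence of loss weights, and the function names are assumptions of this sketch rather than details confirmed by the cited paper. NumPy arrays stand in for discriminator outputs and feature maps.

```python
import numpy as np

def discriminator_hinge_loss(d_real, d_fake):
    """Hinge loss for the discriminator: push D(real) above +1 and D(fake) below -1."""
    return np.mean(np.maximum(0.0, 1.0 - d_real)) + np.mean(np.maximum(0.0, 1.0 + d_fake))

def generator_hinge_loss(d_fake):
    """Generator adversarial loss in the hinge formulation."""
    return -np.mean(d_fake)

def feature_matching_loss(feats_real, feats_fake):
    """Mean L1 distance between discriminator feature maps, averaged over layers."""
    return np.mean([np.mean(np.abs(r - f)) for r, f in zip(feats_real, feats_fake)])

# Toy discriminator scores for two real and two generated examples.
d_real = np.array([1.5, 0.5])
d_fake = np.array([-2.0, 0.0])
loss_d = discriminator_hinge_loss(d_real, d_fake)
loss_g = generator_hinge_loss(d_fake)
loss_fm = feature_matching_loss([np.ones((2, 3))], [np.zeros((2, 3))])
```

In a full training loop the generator would minimize a weighted sum of `loss_g` and `loss_fm`, while the discriminator (and the filter bank in front of it) would minimize `loss_d`.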

All filter bank parameters—including the global $\gamma$ —receive gradients in end-to-end training, learning directly from the task objective.
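
To illustrate how $\gamma$ can receive gradients, the following PyTorch sketch wraps the filter bank in a module with $\gamma$ as its only trainable parameter. The class name `HarmonicTriFilterbank`, the fixed (non-learnable) center frequencies, the chosen harmonics, and the stand-in loss are all assumptions made for brevity; the constraint $\gamma \geq 1$ is not enforced here because its handling is not described in this text.

```python
import torch

class HarmonicTriFilterbank(torch.nn.Module):
    """Triangular harmonic filter bank with a learnable global sharpness gamma."""

    def __init__(self, bin_freqs, f_min=32.7, bins_per_octave=24, n_bands=124,
                 harmonics=(0.5, 1.0, 2.0, 3.0)):
        super().__init__()
        n = torch.arange(n_bands, dtype=torch.float32)
        base = f_min * 2.0 ** (n / bins_per_octave)                  # f_n
        h = torch.tensor(harmonics, dtype=torch.float32)
        self.register_buffer("centers", h[:, None] * base[None, :])  # (H', F)
        self.register_buffer("bin_freqs",
                             torch.as_tensor(bin_freqs, dtype=torch.float32))
        self.gamma = torch.nn.Parameter(torch.tensor(1.0))           # initialized to 1

    def forward(self, spec):
        # spec: (batch, n_bins, frames) magnitude spectrogram.
        bw = (0.1079 * self.centers + 24.7) / self.gamma             # (H', F)
        dist = (self.bin_freqs[None, None, :] - self.centers[:, :, None]).abs()
        filt = torch.relu(1.0 - 2.0 * dist / bw[:, :, None])         # (H', F, n_bins)
        return torch.einsum("hfk,bkt->bhft", filt, spec)

bank = HarmonicTriFilterbank(torch.linspace(0.0, 8000.0, 513))
spec = torch.rand(2, 513, 10)
y = bank(spec)           # (batch, harmonics, bands, frames)
y.sum().backward()       # stand-in for a discriminator loss
```

After the backward pass, `bank.gamma.grad` holds the gradient of the stand-in loss with respect to $\gamma$, so an optimizer that includes `bank.parameters()` would update the sharpness alongside the discriminator weights.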

6. Empirical Evaluation and Observed Effects

Objective and subjective metrics on both speech and singing demonstrate clear gains from harmonic filter banks with learnable triangular profiles:

PESQ improved by 0.07–0.13,
MCD reduced by 0.04–0.18 dB,
F0RMSE reduced by 1–7 points,
MOS (Mean Opinion Score) gains of 0.1–0.3 on five-point scales.

Spectrograms of generated signals display sharper and more accurate harmonic lines, and aliasing artifacts are substantially reduced at high frequencies.

Ablation studies confirm:

Removing the half-harmonic channel increases F0RMSE by ≈2 Hz.
Disabling $\gamma$ or the triangular band shape reliably degrades both spectral and pitch performance.

In time-domain phone recognition (Zeghidour et al., 2017), learnable band-pass filterbanks outperform hand-crafted MFSC by 0.3–0.6 PER absolute, and learned filters display asymmetric impulse responses and spread bandwidths reminiscent of cochlear filters.

7. Architectural and Practical Considerations

The learnable harmonic triangular filterbank design is highly flexible:

It admits differentiable, data-driven optimization in both STFT and raw waveform settings.
Initialization can follow established mel-scale protocols, with subsequent learning.
Harmonic tracking is achieved via explicit parameter tying or soft regularization.
Implementation is straightforward in modern deep learning frameworks (PyTorch, TensorFlow).
Preprocessing choices (window lengths, decimation rates, normalization strategies) can be tuned for specific tasks.

A plausible implication is that such filterbanks may generalize well to languages, voice types, and musical instruments with rich harmonic structure, providing a universal interface for time-frequency discriminators, recognition models, and vocoders.

References:

"A Universal Harmonic Discriminator for High-quality GAN-based Vocoder" (Xu et al., 3 Dec 2025)
"Learning Filterbanks from Raw Speech for Phone Recognition" (Zeghidour et al., 2017)