
Learnable Harmonic Triangular Filter Banks

Updated 4 December 2025
  • The paper presents a novel differentiable harmonic filterbank with triangular profiles that learns center frequencies, bandwidths, and a sharpness factor for precise harmonic tracking.
  • It employs exponential center frequency spacing and ERB-inspired bandwidths, achieving improved perceptual and objective metrics in speech and singing synthesis and recognition.
  • The method integrates seamlessly with both STFT and raw waveform pipelines, enabling joint optimization with GAN-based vocoders and end-to-end phone recognition systems.

A harmonic filter with learnable triangular band-pass filter banks is a differentiable front-end signal representation in which each filter targets a specific harmonic or subharmonic band, uses a triangular frequency-domain profile with flexible bandwidth, and adapts both its center frequency and the “sharpness” of its pass-band from data. The bands are explicitly aligned to multiples of base frequencies. The design suits both STFT-domain and raw-waveform front-ends and is optimized jointly with downstream neural architectures, including GAN-based vocoder discriminators and end-to-end phone recognition systems. The learnable parameters (center frequencies, bandwidths, and overall scaling) let the representation adjust to the signal’s spectral structure and resolve individual harmonics, with reported improvements in perceptual and objective metrics for speech and singing synthesis and recognition (Xu et al., 3 Dec 2025, Zeghidour et al., 2017).

1. Mathematical Construction of Triangular Band-Pass Harmonic Filters

The core of the design is a family of triangular band-pass filters parameterized by harmonic index $h$ and base-band index $n$. For a continuous frequency variable $f$ and base center frequencies $f_n$, the triangular filter applied to the $h$-th harmonic is

$$H_{h,n}(f) = \Big[1 - \frac{2 |f - h f_n|}{f_{bw}^h}\Big]_+,$$

where $f_{bw}^h$ is the bandwidth for the $h$-th harmonic and $[\,\cdot\,]_+$ denotes rectification (i.e., $\max(\cdot, 0)$). Each filter peaks at $f = h f_n$ and decays linearly to zero at $f = h f_n \pm f_{bw}^h/2$.
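The definition above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation; the function name `triangular_response` and the example values (a 110 Hz base frequency, a 40 Hz bandwidth) are assumptions chosen for illustration.

```python
import numpy as np

def triangular_response(f, h, f_n, f_bw):
    """Evaluate H_{h,n}(f) = [1 - 2|f - h*f_n| / f_bw]_+ on frequencies f (Hz)."""
    f = np.asarray(f, dtype=float)
    return np.maximum(1.0 - 2.0 * np.abs(f - h * f_n) / f_bw, 0.0)

# Hypothetical values: second harmonic of a 110 Hz base frequency, 40 Hz bandwidth.
freqs = np.array([200.0, 210.0, 220.0, 230.0, 240.0])
response = triangular_response(freqs, h=2, f_n=110.0, f_bw=40.0)
# Peak of 1 at 220 Hz, 0.5 at +/-10 Hz, and 0 at the band edges 220 +/- 20 Hz.
print(response)
```

The response reaches zero exactly at $h f_n \pm f_{bw}^h/2$, so $f_{bw}^h$ is the full width of the triangle's base.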

In time-domain filterbank learning (Zeghidour et al., 2017), each filter’s idealized frequency response is triangular,

$$H_k(\omega) = \begin{cases} 1 - \frac{|\omega - 2\pi f_k|}{2\pi \Delta f_k}, & |\omega - 2\pi f_k| \le 2\pi \Delta f_k \\ 0, & \text{otherwise} \end{cases}$$

with the impulse response (via inverse Fourier transform)

$$h_k(t) = \Delta f_k \left[\mathrm{sinc}\left(\Delta f_k t\right)\right]^2 \cos(2\pi f_k t).$$

Here $\Delta f_k$ is the half-width of the band, whereas $f_{bw}^h$ above denotes the full width.
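The link between the squared-sinc impulse response and the triangular pass-band can be verified numerically. The sketch below assumes the normalized sinc, $\mathrm{sinc}(x) = \sin(\pi x)/(\pi x)$, which is what `np.sinc` computes, along with an arbitrary sampling rate and filter parameters. Under these conventions the real-valued $h_k(t)$ has two mirrored triangles (at $\pm f_k$), each with peak value 1/2, so it matches the triangular response up to a constant factor.

```python
import numpy as np

fs = 16000.0                   # sampling rate in Hz (assumed)
f_k, delta_f = 1000.0, 100.0   # center frequency and half-width in Hz (assumed)

# Sample h_k(t) on a 1 s window centered at t = 0.
t = np.arange(-0.5, 0.5, 1.0 / fs)
h_k = delta_f * np.sinc(delta_f * t) ** 2 * np.cos(2 * np.pi * f_k * t)

def spectrum_at(f):
    """Riemann-sum approximation of |Fourier transform of h_k| at frequency f (Hz)."""
    return np.abs(np.sum(h_k * np.exp(-2j * np.pi * f * t)) / fs)

peak = spectrum_at(f_k)                   # top of the triangle
half = spectrum_at(f_k + delta_f / 2)     # halfway down the slope
outside = spectrum_at(f_k + 2 * delta_f)  # beyond the pass-band
print(peak, half, outside)
```

The magnitude falls linearly from the peak to half its value at $f_k + \Delta f_k/2$ and is close to zero outside $f_k \pm \Delta f_k$; the small residual comes from truncating the window to one second.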

For differentiable implementations, complex Gabor wavelets are employed, with the Gaussian width $\sigma_k$ tuned to match the triangular pass-band.

2. Parameterization and Learning Rules for Harmonic Filter Banks

Base-band center frequencies are distributed exponentially:

$$f_n = f_{\text{min}} \cdot 2^{(n-1)/B}, \quad n = 1 \ldots F,$$

where $f_{\text{min}} = 32.7$ Hz (e.g., musical low C) and $B$ is bins-per-octave (e.g., $B = 24$ yielding $F \approx 124$). The harmonic $h$ filter’s center frequency is $h f_n$. Bandwidths follow an ERB-inspired scale,
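As a worked instance of the exponential spacing, the sketch below evaluates $f_n$ with the example values quoted in the text ($f_{\text{min}} = 32.7$ Hz, $B = 24$, $F = 124$). The function name `base_center_frequencies` is hypothetical.

```python
import numpy as np

def base_center_frequencies(f_min=32.7, bins_per_octave=24, num_bands=124):
    """f_n = f_min * 2^((n - 1) / B) for n = 1..F, in Hz."""
    n = np.arange(1, num_bands + 1)
    return f_min * 2.0 ** ((n - 1) / bins_per_octave)

f_n = base_center_frequencies()
print(f_n[0])    # first band sits at f_min
print(f_n[24])   # one octave (24 bins) higher, i.e. twice f_min
print(f_n[-1])   # highest base band, 123 bins above f_min
```

With these values the base bands span a little over five octaves, from 32.7 Hz to roughly 1.14 kHz; higher frequencies are covered by the harmonic copies at $h f_n$.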

$$f_{bw}(f_c) \simeq 0.1079 f_c + 24.7,$$

and incorporate a learnable sharpness factor $\gamma \geq 1$:

$$f_{bw}^h = \frac{0.1079 \cdot h f_n + 24.7}{\gamma}.$$

$\gamma$ is a global scalar, initialized to 1 and learned jointly with network parameters during training.
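To make the effect of the sharpness factor concrete, the following sketch evaluates the bandwidth formula at a low and a high center frequency. The function name and the chosen frequencies are illustrative assumptions; in the described method $\gamma$ is learned rather than set by hand.

```python
import numpy as np

def harmonic_bandwidth(h, f_n, gamma=1.0):
    """f_bw^h = (0.1079 * h * f_n + 24.7) / gamma, in Hz."""
    return (0.1079 * h * np.asarray(f_n, dtype=float) + 24.7) / gamma

bw_low = harmonic_bandwidth(1, 100.0)                 # 35.49 Hz at a 100 Hz center
bw_high = harmonic_bandwidth(1, 4000.0)               # 456.3 Hz at a 4 kHz center
bw_sharp = harmonic_bandwidth(1, 4000.0, gamma=2.0)   # 228.15 Hz: gamma = 2 halves the band
print(bw_low, bw_high, bw_sharp)
```

Because $\gamma$ divides every bandwidth by the same amount, increasing it narrows all filters at once while preserving the ERB-like growth of bandwidth with frequency.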

In time-domain architectures (Zeghidour et al., 2017), each complex filter is parameterized by convolutional kernel weights $\mathbf{W} \in \mathbb{R}^{2K \times W}$, with updates performed via automatic differentiation. Optionally, filter center frequencies can be tied to harmonic multiples $f_k = m_k F_0$ (with $F_0$ learnable or predicted), or regularized softly via a penalty enforcing a harmonic comb structure.

3. Mechanisms for Dynamic Frequency-Resolution and Harmonic Tracking

By scaling $f_{bw}^h$ with frequency, low-frequency harmonics are represented using narrow bands that permit high spectral resolution, while higher harmonics receive broader bands that favor temporal over spectral detail. The learnable global sharpness $\gamma$ lets the overall resolution adapt to the characteristics of the input signal. These mechanisms enable individual bands to track harmonics precisely, enhancing the representation of voiced content, vibrato, and singing-specific nuances.

The addition of a half-harmonic filter, $h = 0.5$, further increases sensitivity to low-pitch energy. Its parameters are set analogously:

$$H_{0.5, n}(f) = \Big[1 - \frac{2 |f - 0.5 f_n|}{f_{bw}^{0.5}}\Big]_+,$$

where $f_{bw}^{0.5}$ follows the ERB+$\gamma$ curve with $f_c = 0.5 f_n$.

4. Implementation in Signal Processing Pipelines

For STFT-based front-ends (Xu et al., 3 Dec 2025), the filter bank is applied to magnitude spectrograms $X(f_k, t)$ by forming a 3-D tensor $Y(h, n, t)$:

$$Y(h, n, t) = \sum_{k=1}^F X(f_k, t) \cdot H_{h, n}(f_k)$$

This yields the transformed input to the discriminator, with shape $(H + 0.5) \times F \times T$.
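The projection onto the harmonic tensor can be sketched in a few lines of NumPy. Everything specific here is an assumption made for illustration: the sampling rate, FFT size, the harmonic list `[0.5, 1, 2, 3]`, the helper name `build_filter_bank`, and the random stand-in spectrogram. In the code the sum runs over the STFT frequency bins, and the harmonic axis has one entry per element of the harmonic list, including the half-harmonic.

```python
import numpy as np

def build_filter_bank(stft_freqs, f_n, harmonics, gamma=1.0):
    """Triangular bank of shape (num_harmonics, F, K) sampled at K STFT bin frequencies."""
    h = np.asarray(harmonics, dtype=float)[:, None, None]   # (num_harmonics, 1, 1)
    centers = h * f_n[None, :, None]                         # harmonic centers h * f_n
    f_bw = (0.1079 * centers + 24.7) / gamma                 # ERB-inspired bandwidths
    dist = np.abs(stft_freqs[None, None, :] - centers)
    return np.maximum(1.0 - 2.0 * dist / f_bw, 0.0)

sample_rate, n_fft, num_frames = 16000, 1024, 50             # assumed settings
stft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)     # K = 513 bin frequencies
f_n = 32.7 * 2.0 ** (np.arange(124) / 24)                    # base center frequencies
harmonics = [0.5, 1, 2, 3]                                   # half-harmonic plus three harmonics

bank = build_filter_bank(stft_freqs, f_n, harmonics)         # (4, 124, 513)
rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((stft_freqs.size, num_frames)))  # stand-in magnitude spectrogram
Y = np.einsum("hnk,kt->hnt", bank, X)                        # sum over STFT bins k
print(Y.shape)
```

In a trainable setting the same computation would be written in an automatic-differentiation framework so that $\gamma$ receives gradients. Note also that with this FFT size the narrowest low-frequency filters cover only one or two STFT bins, so the frequency grid limits how much resolution they can actually deliver.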

In time-domain architectures (Zeghidour et al., 2017), the workflow includes:

  • Complex convolution with filterbank kernels,
  • L2 modulus pooling across real/imaginary parts,
  • Optional squaring and low-pass grouped convolution for smoothing/decimation,
  • Logarithmic compression and per-utterance normalization,
  • Feature extraction for downstream neural networks.

A sample PyTorch workflow is:

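import torch
import torch.nn.functional as F

# x: (batch, 1, samples); W: (2K, 1, width) filter kernels; V: (K, 1, width_lp) low-pass kernels
c = F.conv1d(x, W, bias=None, padding=W.shape[-1] // 2)  # complex convolution, real and imaginary parts stacked
re, im = c[:, :K, :], c[:, K:, :]
y = torch.sqrt(re**2 + im**2 + eps)                      # L2 modulus pooling
z = F.conv1d(y, V, stride=stride_lp, groups=K)           # per-channel low-pass smoothing and decimation
out = torch.log10(torch.clamp(z, min=1e-5) + 1.0)        # logarithmic compression
features = out.unsqueeze(1)                              # add a channel axis for downstream layers

All relevant weights are updated via back-propagation of the task loss.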
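The per-utterance normalization step from the workflow list can be sketched as follows. The source does not say which normalization is used, so per-channel mean and variance normalization over time is an assumption here, as are the function name and the small `eps` guard; the compression mirrors the `log10(max(z, 1e-5) + 1)` form of the sample.

```python
import numpy as np

def compress_and_normalize(z, floor=1e-5, eps=1e-8):
    """Log-compress energies z of shape (channels, frames), then normalize each channel.

    Assumption: normalization is zero mean and unit variance per channel,
    computed over the frames of a single utterance.
    """
    out = np.log10(np.maximum(z, floor) + 1.0)
    mean = out.mean(axis=-1, keepdims=True)
    std = out.std(axis=-1, keepdims=True)
    return (out - mean) / (std + eps)

rng = np.random.default_rng(0)
z = rng.random((40, 200)) * 10.0          # stand-in filterbank energies for one utterance
features = compress_and_normalize(z)
print(features.mean(axis=-1)[:3], features.std(axis=-1)[:3])
```

Normalizing per utterance removes channel and loudness offsets before the features reach the recognizer, at the cost of requiring the whole utterance to be available.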

5. Integration with GAN Discriminators and Training Objectives

The processed tensor $Y$ serves as input to a time-frequency harmonic discriminator $D_\theta$ in a GAN-based vocoder framework (Xu et al., 3 Dec 2025). The loss function combines:

  • Discriminator adversarial loss (hinge or least-squares form),
  • Generator adversarial loss,
  • Feature-matching loss over multi-scale discriminator features.

All filter bank parameters, including the global $\gamma$, receive gradients in end-to-end training and are learned directly from the task objective.
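For reference, the three loss terms can be written out in their common textbook form. This is a sketch of the standard hinge formulation and an L1 feature-matching term, not the exact losses or weights of the cited paper; the function names and toy discriminator outputs are hypothetical.

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    return np.mean(np.maximum(0.0, 1.0 - d_real)) + np.mean(np.maximum(0.0, 1.0 + d_fake))

def g_hinge_loss(d_fake):
    """Generator adversarial loss: raise the discriminator's score on generated audio."""
    return -np.mean(d_fake)

def feature_matching_loss(feats_real, feats_fake):
    """Mean L1 distance between discriminator feature maps, averaged over layers."""
    per_layer = [np.mean(np.abs(fr - ff)) for fr, ff in zip(feats_real, feats_fake)]
    return sum(per_layer) / len(per_layer)

# Toy discriminator outputs and feature maps (hypothetical values).
d_real, d_fake = np.array([2.0, 0.5]), np.array([-2.0, 0.0])
loss_d = d_hinge_loss(d_real, d_fake)
loss_g = g_hinge_loss(d_fake)
loss_fm = feature_matching_loss(
    [np.array([1.0, 2.0]), np.array([3.0])],
    [np.array([1.0, 4.0]), np.array([2.0])],
)
print(loss_d, loss_g, loss_fm)
```

In a full system these scalars would be computed in a differentiable framework on the discriminator's response to $Y$, so the same backward pass that updates the discriminator also updates the filter bank parameters.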

6. Empirical Evaluation and Observed Effects

Objective and subjective metrics on both speech and singing demonstrate clear gains from harmonic filter banks with learnable triangular profiles:

  • PESQ improved by 0.07–0.13,
  • MCD reduced by 0.04–0.18 dB,
  • F0RMSE reduced by 1–7 points,
  • MOS (Mean Opinion Score) gains of 0.1–0.3 on five-point scales.

Spectrograms of generated signals display sharper and more accurate harmonic lines, and aliasing artifacts are substantially reduced at high frequencies.

Ablation studies confirm:

  • Removing the half-harmonic channel increases F0RMSE by ≈2 Hz.
  • Disabling $\gamma$ or the triangular band shape reliably degrades both spectral and pitch performance.

In time-domain phone recognition (Zeghidour et al., 2017), learnable band-pass filterbanks outperform hand-crafted MFSC by 0.3–0.6 PER absolute, and learned filters display asymmetric impulse responses and spread bandwidths reminiscent of cochlear filters.

7. Architectural and Practical Considerations

The learnable harmonic triangular filterbank design is highly flexible:

  • It admits differentiable, data-driven optimization in both STFT and raw waveform settings.
  • Initialization can follow established mel-scale protocols, with subsequent learning.
  • Harmonic tracking is achieved via explicit parameter tying or soft regularization.
  • Implementation is straightforward in modern deep learning frameworks (PyTorch, TensorFlow).
  • Preprocessing choices (window lengths, decimation rates, normalization strategies) can be tuned for specific tasks.

A plausible implication is that such filterbanks may generalize well to languages, voice types, and musical instruments with rich harmonic structure, providing a universal interface for time-frequency discriminators, recognition models, and vocoders.

