Learnable Filterbanks for Adaptive Signal Processing
- Learnable filterbanks are differentiable, parameterized filters that adapt spectral selectivity and temporal properties in ML pipelines.
- They encompass diverse architectures including time-domain convolutional, parametric analytic, spectral-domain matrix, and hierarchical designs.
- These adaptive systems enhance applications such as speech recognition, source separation, music transcription, and biomedical event detection.
Learnable filterbanks are differentiable, parameterized filterbanks whose structure, frequency responses, or other operational parameters are exposed to optimization within a larger machine learning pipeline. Unlike conventional hand-engineered filterbanks (e.g., STFT, Mel, Gammatone), learnable filterbanks are trained end-to-end with the rest of the network and can adapt to the underlying data, task, or objective by reshaping both their spectral selectivity and their temporal or analytic properties. A wide spectrum of learnable filterbank architectures has been developed, encompassing fixed-shape parameterizations (e.g., Sinc, Gabor, analytic), unconstrained depthwise convolutional layers, matrix-based spectral domain transformations, and hybrid domain-specific architectures for tasks as varied as speech recognition, source separation, environmental sound classification, music information retrieval, and biomedical event detection.
1. Mathematical Parameterizations and Model Classes
Learnable filterbanks can be grouped based on parameterization and implementation.
a) Time-Domain Convolutional Filterbanks:
TD-filterbanks operate directly on raw waveforms using banks of real or complex convolutional kernels, where each filter is parameterized as a length-L vector h_k ∈ ℝ^L (or ℂ^L in analytic cases) and k = 1, …, K indexes the K channels. These filters can be initialized to approximate a mel-scale, Gabor wavelets, or Fourier bases, but are generally free to evolve under joint optimization with downstream supervised or unsupervised losses (Zeghidour et al., 2017, Cwitkowitz et al., 2021, Dai et al., 2023, Pariente et al., 2019).
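As a concrete sketch (not any particular paper's implementation), the following NumPy snippet builds a small bank of real Gabor kernels on a log-spaced frequency grid and applies it to a raw waveform by convolution; in a learnable frontend the kernel parameters would be updated by backpropagation, and all names and constants here are illustrative:

```python
import numpy as np

def gabor_kernel(fc, sigma, length, sr):
    # Real Gabor kernel: Gaussian window times a cosine carrier at fc (Hz).
    t = (np.arange(length) - length // 2) / sr
    return np.exp(-0.5 * (t / sigma) ** 2) * np.cos(2 * np.pi * fc * t)

def td_filterbank(x, centers_hz, sigmas_s, length=401, sr=16000):
    # One band-passed signal per channel; in a learnable frontend the
    # kernels (or the parameters generating them) are trained end-to-end.
    return np.stack([np.convolve(x, gabor_kernel(fc, s, length, sr), mode="same")
                     for fc, s in zip(centers_hz, sigmas_s)])

sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)    # 1 s, 440 Hz test tone
centers = np.geomspace(100, 4000, 8)                # 8 log-spaced channels
out = td_filterbank(x, centers, np.full(8, 0.004), sr=sr)
energies = (out ** 2).mean(axis=1)
print(out.shape, int(energies.argmax()))   # channel nearest 440 Hz dominates
```

Fed a pure 440 Hz tone, the channel whose center frequency lies closest to 440 Hz carries the most energy; end-to-end training would reshape exactly this selectivity.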
b) Parametric Analytic Filterbanks:
Here, each filter is defined by a small set of continuous parameters: a center frequency f_c and bandwidth σ (for Gabor-shaped filters), or lower and upper cutoff frequencies f_1 and f_2 (SincNet). Analyticity can be enforced by constructing the imaginary part via a Hilbert transform, yielding inherently one-sided frequency responses (Cwitkowitz et al., 2021, Pariente et al., 2019, Haider et al., 12 May 2025).
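A minimal sketch of the SincNet-style cutoff parameterization, assuming the standard difference-of-two-low-pass-sincs construction with a Hamming taper (all constants illustrative):

```python
import numpy as np

def sinc_bandpass(f1, f2, length=101, sr=16000):
    # SincNet-style kernel: the difference of two windowed low-pass sincs
    # yields a band-pass between the learnable cutoffs f1 < f2 (Hz).
    t = (np.arange(length) - length // 2) / sr
    lowpass = lambda fc: 2 * fc / sr * np.sinc(2 * fc * t)
    return (lowpass(f2) - lowpass(f1)) * np.hamming(length)

h = sinc_bandpass(300.0, 600.0)
H = np.abs(np.fft.rfft(h, 4096))
freqs = np.fft.rfftfreq(4096, d=1 / 16000)
gain = lambda f: H[np.argmin(np.abs(freqs - f))]
print(gain(450.0) > gain(3000.0))   # in-band gain exceeds stopband gain
```

Only two scalars per filter are learned, which is what makes this family cheap and keeps the filters interpretable as band edges.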
c) Spectral-Domain Matrix Filterbanks:
This approach learns a non-negative transformation matrix W operating on STFT magnitude or power spectra S, producing features F = WS. W can be initialized from conventional Mel or triangular bands, with non-negativity typically enforced via ReLU and training stabilized by batch normalization (López-Espejo et al., 2020, López-Espejo et al., 2022, Qu et al., 2016).
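The matrix formulation can be sketched in NumPy as follows, initializing W to triangular mel-spaced bands and clamping it non-negative with a ReLU as the cited frontends do; the helper names and sizes are illustrative:

```python
import numpy as np

def triangular_bands(n_bands, n_bins, sr):
    # Triangular weights on a mel-spaced grid: the usual initialization of
    # the learnable matrix W before end-to-end training reshapes it.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = inv_mel(np.linspace(mel(0), mel(sr / 2), n_bands + 2))
    bins = np.linspace(0, sr / 2, n_bins)
    W = np.zeros((n_bands, n_bins))
    for k in range(n_bands):
        lo, center, hi = edges[k], edges[k + 1], edges[k + 2]
        up = (bins - lo) / (center - lo)
        down = (hi - bins) / (hi - center)
        W[k] = np.clip(np.minimum(up, down), 0.0, None)
    return W

sr, n_fft = 16000, 512
rng = np.random.default_rng(0)
S = np.abs(rng.normal(size=(n_fft // 2 + 1, 100))) ** 2  # stand-in power spectrogram
W = np.maximum(triangular_bands(40, n_fft // 2 + 1, sr), 0.0)  # ReLU clamp
F = W @ S                                                # (bands x frames) features
print(F.shape)
```

During training, W would be a free parameter and the ReLU clamp keeps every band weight interpretable as a non-negative spectral gain.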
d) Physiologically-Motivated Parametric Banks:
Filterbanks based on auditory models such as gammatone or gammachirp allow learning over psychoacoustic parameters (center, bandwidth, asymmetry/chirp, envelope order) to better mimic or diverge from known perceptual filters (López-Espejo et al., 2020, Humayun et al., 2019).
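A sketch of a gammatone kernel whose psychoacoustic parameters (center frequency, bandwidth scale, envelope order) would be the learnable quantities, using the Glasberg-Moore ERB approximation; the constants are illustrative defaults, not values from the cited works:

```python
import numpy as np

def gammatone(fc, sr, length=2048, order=4, bw_scale=1.019):
    # Gammatone impulse response: t^(n-1) envelope, exponential decay set by
    # the ERB-scaled bandwidth, cosine carrier at fc. Center frequency,
    # bandwidth scale, and envelope order are the parameters a learnable
    # bank would expose to gradient descent.
    t = np.arange(length) / sr
    erb = 24.7 + fc / 9.265                     # Glasberg-Moore ERB (Hz)
    g = (t ** (order - 1)
         * np.exp(-2 * np.pi * bw_scale * erb * t)
         * np.cos(2 * np.pi * fc * t))
    return g / np.max(np.abs(g))

h = gammatone(1000.0, 16000)
H = np.abs(np.fft.rfft(h, 8192))
peak_hz = np.fft.rfftfreq(8192, d=1 / 16000)[H.argmax()]
print(round(peak_hz))    # close to the 1000 Hz center frequency
```

Letting the bandwidth scale or envelope order drift under training is precisely how such banks "diverge from known perceptual filters" while staying physiologically parameterized.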
e) Hierarchical or Multiscale Filterbanks:
Architectures such as the convolutional dictionary model employ multiple scales, allocating separate low-pass and high-pass filterbanks at each layer to capture both dense and sparse structures, and train these using alternating minimization over signal hierarchies (Zazo et al., 2019, Pfister et al., 2018).
2. Learning, Initialization, and Optimization Strategies
Learnable filterbanks are usually initialized to conventional filter shapes (mel, bark, linear, CQT) for stable convergence and perceptual plausibility. However, studies have established that the movement of learned parameters away from these initializations is often small, with the network frequently retaining mel-like spacing and bandwidths, especially when configuration and learning rates are not specifically tuned to encourage divergence (Anderson et al., 2023, Schlüter et al., 2022). For analytic filterbanks, the initialization is often as Gabor or windowed-sine, and analytic extensions are obtained via Hilbert transforms to ensure robust envelope extraction (Pariente et al., 2019, Cwitkowitz et al., 2021).
Alternate optimization strategies—including staged training (filterbank-only pretraining), stronger frontend learning rates, explicit regularization terms favoring filter movement, or multiple initialization trials—are proposed to overcome the high dependence of learned filterbanks on their starting configuration (Anderson et al., 2023).
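One of these remedies, a stronger frontend learning rate, amounts to optimizing the filterbank and the backbone as separate parameter groups with different step sizes. A toy NumPy sketch with dummy gradients (all names, values, and gradients are illustrative stand-ins for backpropagation output):

```python
import numpy as np

# Separate learning rates for frontend (filterbank) and backend parameters,
# a remedy proposed when learned filters barely leave their initialization.
rng = np.random.default_rng(0)
frontend = {"centers": np.geomspace(100.0, 4000.0, 8)}   # filter centers (Hz)
backend = {"w": rng.normal(size=8)}                      # downstream weights
lr = {"frontend": 1e-1, "backend": 1e-3}                 # stronger frontend lr

def sgd_step(params, grads, step):
    # Plain SGD update applied in place to one parameter group.
    for name in params:
        params[name] = params[name] - step * grads[name]

w0 = backend["w"].copy()
# One step with dummy all-ones gradients standing in for real backprop.
sgd_step(frontend, {"centers": np.ones(8)}, lr["frontend"])
sgd_step(backend, {"w": np.ones(8)}, lr["backend"])
print(frontend["centers"][0])   # frontend moved 100x further than backend
```

Staged training (filterbank-only pretraining) is the same idea taken to the limit: the backend learning rate is temporarily set to zero.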
3. Integration Architectures and Signal Flows
Learnable filterbanks are integrated as differentiable neural modules at the signal front-end. Three major embedding paradigms dominate:
- Direct Convolutional Frontends: Filterbank outputs are passed through standard nonlinearities (e.g., modulus, square, log, PCEN, batchnorm), optionally followed by temporal pooling or further convolutions, and consumed by DNN backbones such as CNNs, EfficientNet, or MobileNet variants (Schlüter et al., 2022, Elsborg et al., 29 May 2025, Li et al., 2022).
- Matrix-Based Spectral Transformations: These operate on intermediate time-frequency representations, mapping STFT or power spectrograms to lower-dimensional features using a learned matrix or parametric transformation, after which standard classification architectures operate on the resulting feature sequences (Qu et al., 2016, López-Espejo et al., 2020).
- Perfect-Reconstruction Architectures (e.g., ISAC): Analysis-synthesis pairs of learnable filterbanks with explicit PR conditions are implemented as Conv1D/ConvTranspose1D layers, supporting both feature extraction and signal resynthesis, with optional invertibility regularization via condition number penalties (Haider et al., 12 May 2025).
Pooling, normalization, and compression can be implemented as learnable modules (PCEN, log-median, Gaussian pooling, attention/instance normalization) for further adaptation to dataset statistics or invariances (Schlüter et al., 2022, Elsborg et al., 29 May 2025).
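Putting the pieces together, a direct convolutional frontend of the first kind can be sketched as filterbank convolution, squared modulus, temporal average pooling, and log compression; PCEN, learnable pooling, or normalization layers would slot into the last two stages. Everything below is an illustrative stand-in, not a specific published frontend:

```python
import numpy as np

def frontend(x, kernels, hop=160, eps=1e-6):
    # Filterbank convolution -> squared modulus -> frame-wise average
    # pooling -> log compression. PCEN, learnable pooling, or batchnorm
    # would replace the fixed pooling/compression in a trainable pipeline.
    y = np.stack([np.convolve(x, k, mode="same") for k in kernels])
    e = y ** 2
    n_frames = e.shape[1] // hop
    pooled = e[:, : n_frames * hop].reshape(len(kernels), n_frames, hop).mean(axis=-1)
    return np.log(pooled + eps)

sr = 16000
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)          # 1 s test tone
t = (np.arange(401) - 200) / sr
kernels = [np.exp(-0.5 * (t / 0.004) ** 2) * np.cos(2 * np.pi * fc * t)
           for fc in np.geomspace(100, 4000, 8)]           # Gabor-like bank
feats = frontend(x, kernels)
print(feats.shape)    # 8 channels x 100 frames at a 10 ms hop
```

The resulting (channels x frames) feature map is what the CNN or MobileNet-style backbones cited above consume.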
4. Task-Specific Applications and Empirical Findings
The empirical efficacy of learnable filterbanks spans various audio and signal domains:
- Speech and Keyword Spotting: End-to-end learned filterbanks yield comparable or modest gains over traditional mel features, with pronounced robustness to noise and substantial reductions in computational cost via aggressive channel reduction. For instance, keyword spotting systems achieve nearly identical accuracy using 8-channel learned banks as with 40-channel mel, but at 6.3× lower energy expenditure; learned filters adapt to suppress noise-dominated bands (López-Espejo et al., 2022, López-Espejo et al., 2020).
- Source Separation and Beamforming: Analytic, complex-valued learned filterbanks outperform real-valued learned banks and STFT in mask-based separation tasks at all but ultra-short window lengths (2 ms), providing shift-invariant representations and improved phase handling (Pariente et al., 2019, Cornell et al., 2021, Dai et al., 2023).
- Music Transcription: Analytic filterbanks with variational dropout exhibit interpretable harmonic structure and onset/decay asymmetry, and capture inharmonic features important for instrument recognition, though best-case performance still closely tracks fixed spectrogram baselines (Cwitkowitz et al., 2021).
- Speaker Verification: Learnable frequency filters with flexible (narrow-band) parameterizations offer gains over mel filterbanks and even low-complexity SincNet/Gabor frontends in low-latency or resource-constrained settings, by emphasizing discriminative spectral detail (Li et al., 2022).
- Biomedical and Environmental Audio: Domain-specific, learnable FIR or gammatone convolutional filters improve abnormal heart sound detection and robustness to sensor variability, with learned frequency responses displaying adaptive suppression of irrelevant bands and better waveform preservation (Humayun et al., 2019).
- Image Denoising/Analysis: In non-audio contexts, convolutional filterbank sparsifying transforms outperform patch-based methods for image restoration, showing flexibility and scalability to large signals and interpretable learned atoms (Pfister et al., 2018, Zazo et al., 2019).
A consistent empirical finding is that fixed mel-initializations remain near-optimal for many front-ends, with only moderate deviation during end-to-end learning unless explicitly encouraged. Nevertheless, in domains with severe SNR or spectral variability, learned banks can substantially outperform static baselines by reallocating resources to clean, information-rich regions (López-Espejo et al., 2022, Elsborg et al., 29 May 2025).
5. Robustness, Adaptivity, and Limitations
While end-to-end filterbank learning can yield robust, noise-adaptive representations, there are nontrivial limitations:
- Limited Filter Movement: Studies leveraging Jensen–Shannon metrics on learned filters demonstrate that, barring explicit intervention, filterbanks initialized to mel/bark/linear scales undergo minimal parameter drift, and most of the empirical gain stems from fine-tuning compression or normalization, not the band shapes themselves (Anderson et al., 2023, Schlüter et al., 2022).
- Static vs. Adaptive Front-ends: Static learnable filterbanks (parameters frozen post-training) exhibit vulnerability to nonstationary environments, motivating the development of adaptive methods such as Ada-FE, which introduces a neural feedback controller dynamically adjusting filter Q-factors according to context (level, modulation patterns). Ada-FE shows marked stability, faster convergence, and accuracy gains across environmental and speech-music-bioacoustic datasets (Zhang et al., 5 Feb 2025).
- Computational and Practical Trade-offs: EfficientLEAF and similar architectures demonstrate that staged architectural and per-filter optimizations (e.g., grouped strides, adaptable convolutional window lengths, parallelizable normalization) can recover most of the accuracy of full learnable front-ends at a small fraction of computational cost—often making them preferable for on-device inference or long-sequence processing (Schlüter et al., 2022, Elsborg et al., 29 May 2025).
- Interpretability: Data-driven, experience-guided learning (e.g., post-hoc smoothing and reinitialization of learned banks) can enhance both accuracy and interpretability, allowing extraction of domain-realistic, human-interpretable filters that reflect the spectral idiosyncrasies of the task (Qu et al., 2016).
6. Perspectives, Future Directions, and Open Questions
Recent research points to multiple areas of ongoing or future exploration:
- Optimization Strategies: There is a recognized need for optimization algorithms better suited to filter learning, including filterbank-only pretraining, regularizations favoring movement from initializations, and search–run–select pipelines to escape local minima imposed by strong auditory priors (Anderson et al., 2023).
- Perfect-Reconstruction Learnable Filterbanks: Designs such as ISAC enable tight, invertible, and perceptually-motivated frontends, facilitating training as analysis–synthesis pairs and supporting downstream wave-to-wave or feature-to-wave architectures (Haider et al., 12 May 2025, Pfister et al., 2018).
- Hybrid Analytic-/Data-Driven Schemes: Hybrid approaches, combining domain knowledge (e.g., gammatone shapes, psychoacoustics) with unconstrained filter learning and adaptive control, are promising for both accuracy and generalizability, especially in high-variability, cross-domain settings (Humayun et al., 2019, Zhang et al., 5 Feb 2025, Haider et al., 12 May 2025).
- Task- and Domain-specific Adaptation: Studies establish that in contexts with strong spatiotemporal distortions, learnable filterbanks can focus on frequency bands less affected by noise or attenuation, optimizing the representation for on-device, always-on, or variable-sensor scenarios (López-Espejo et al., 2022, Elsborg et al., 29 May 2025).
- Role of Adaptivity at Inference: Modulating filter parameters during inference (adaptive Q, feedback) is emerging as a mechanism to approach human-like auditory robustness, surpassing both handcrafted and statically learned filters in dynamic or unseen conditions (Zhang et al., 5 Feb 2025).
The overall trajectory of the field is toward increasingly modular, adaptive, and analytically flexible filterbanks, with tight integration into end-to-end learning, scalability to large data regimes, and robust performance under shifting real-world conditions. Theoretical and empirical understanding of when adaptation truly outperforms well-tuned mel-scale features remains open and is a primary subject of contemporary research.
Key References:
- (Zeghidour et al., 2017): Learning Filterbanks from Raw Speech for Phone Recognition
- (López-Espejo et al., 2020): Exploring Filterbank Learning for Keyword Spotting
- (López-Espejo et al., 2022): Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting
- (Qu et al., 2016): Learning Filter Banks Using Deep Learning For Acoustic Signals
- (Pfister et al., 2018): Learning Filter Bank Sparsifying Transforms
- (Elsborg et al., 29 May 2025): Acoustic Classification of Maritime Vessels using Learnable Filterbanks
- (Schlüter et al., 2022): EfficientLEAF: A Faster LEarnable Audio Frontend of Questionable Use
- (Haider et al., 12 May 2025): ISAC: An Invertible and Stable Auditory Filter Bank with Customizable Kernels for ML Integration
- (Pariente et al., 2019): Filterbank design for end-to-end speech separation
- (Anderson et al., 2023): Learnable Frontends that do not Learn: Quantifying Sensitivity to Filterbank Initialisation
- (Zhang et al., 5 Feb 2025): Should Audio Front-ends be Adaptive? Comparing Learnable and Adaptive Front-ends