Learnable MFCCs for Adaptive Feature Extraction
- The paper demonstrates that replacing static MFCC stages with learnable, parameterized transforms significantly improves performance in tasks like speaker verification and anomaly detection.
- Learnable MFCCs are defined by replacing the fixed windowing, DFT, Mel filterbank, and DCT with differentiable, regularized linear transforms to preserve signal-processing interpretability.
- Empirical results show relative EER reductions up to 9.7% and substantial F1 score improvements, validating the approach of end-to-end optimization in deep networks.
Learnable MFCCs (Mel-frequency cepstral coefficients) are a family of differentiable, data-adaptive feature extraction pipelines that generalize the classic MFCC representation by parameterizing one or more stages as learnable matrices within a deep network. This approach maintains the structure, interpretability, and signal-processing priors of traditional MFCCs while allowing the transforms—windowing, discrete Fourier transform (DFT), Mel filterbank, and discrete cosine transform (DCT)—to be optimized directly for downstream discrimination tasks such as speaker verification or anomaly detection. By integrating these learnable transforms into end-to-end training pipelines, performance gains are achieved without sacrificing the prior knowledge embedded in the traditional feature engineering (Liu et al., 2021, Lee et al., 14 Jul 2025).
1. Parametric Structure of Learnable MFCCs
The learnable MFCC pipeline replaces the four canonical MFCC stages with differentiable, parameterized linear transforms. For each frame $x \in \mathbb{R}^N$:
- Windowing: The fixed Hamming window is replaced by a learnable vector $w \in \mathbb{R}^N$, applied elementwise as $\tilde{x} = w \odot x$, initialized to the Hamming window and regularized for smoothness and symmetry.
Regularization preserves the cosine-like shape, and optional nonnegativity/symmetry constraints can be enforced after each optimizer step.
- Discrete Fourier Transform: The DFT is decomposed into real-valued linear maps $W_r, W_i \in \mathbb{R}^{K \times N}$ for the real and imaginary parts.
The power spectrum is $p = (W_r \tilde{x})^2 + (W_i \tilde{x})^2$ (squared elementwise), with initialization from the analytic DFT and symmetry-encouraging regularizers.
- Mel Filterbank: The Mel-scale triangular filterbank is replaced by a learnable nonnegative matrix $M \in \mathbb{R}^{F \times K}$, yielding log Mel energies $e = \log(Mp + \epsilon)$.
Initialized from the classic Mel weights, $M$ is regularized via a norm penalty and clamped to remain nonnegative.
- DCT Projection: The DCT-II is replaced by a trainable orthonormal matrix $D \in \mathbb{R}^{C \times F}$, producing cepstra $c = De$.
$D$ is initialized to the DCT-II and encouraged to remain orthonormal via soft constraints and optional QR projection.
This parametric structure enables each stage to be adapted to the data while maintaining a structurally interpretable mapping (Liu et al., 2021).
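The four stages above can be sketched in numpy as follows. This is a minimal illustration of the initializations and the forward pass only; the frame length, sampling rate, dimensionalities, and helper names are illustrative assumptions, and in practice these matrices would be held as trainable tensors in an autodiff framework.

```python
import numpy as np

def hamming(n):
    """Classic Hamming window, used to initialize the learnable window w."""
    return 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(n) / (n - 1))

def dft_matrices(n):
    """Real/imaginary parts of the analytic DFT (positive frequencies),
    used to initialize the learnable maps W_r, W_i."""
    k = np.arange(n // 2 + 1)[:, None]
    t = np.arange(n)[None, :]
    ang = 2.0 * np.pi * k * t / n
    return np.cos(ang), -np.sin(ang)

def mel_filterbank(n_mels, n_bins, sr=16000):
    """Triangular Mel filters, used to initialize the nonnegative matrix M."""
    hz2mel = lambda h: 2595.0 * np.log10(1.0 + h / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(0.0, hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((n_bins - 1) * pts / (sr / 2)).astype(int)
    M = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for j in range(lo, c):
            M[i, j] = (j - lo) / max(c - lo, 1)   # rising edge
        for j in range(c, hi):
            M[i, j] = (hi - j) / max(hi - c, 1)   # falling edge
    return M

def dct_ii(n_out, n_in):
    """Orthonormal DCT-II matrix, used to initialize the projection D."""
    k = np.arange(n_out)[:, None]
    f = np.arange(n_in)[None, :]
    D = np.sqrt(2.0 / n_in) * np.cos(np.pi * k * (2 * f + 1) / (2 * n_in))
    D[0] /= np.sqrt(2.0)
    return D

def learnable_mfcc_forward(x, w, Wr, Wi, M, D, eps=1e-8):
    """One frame through the four (potentially learned) stages."""
    xw = w * x                               # windowing
    p = (Wr @ xw) ** 2 + (Wi @ xw) ** 2      # power spectrum
    e = np.log(M @ p + eps)                  # log Mel energies
    return D @ e                             # cepstral coefficients
```

At initialization this reproduces a conventional MFCC computation; training then perturbs $w$, $W_r$, $W_i$, $M$, and $D$ away from these analytic values under the regularizers described above.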
2. End-to-End Differentiability and Training Objectives
Each MFCC stage is implemented as a differentiable operation, allowing gradients to propagate from downstream task losses to the transform parameters:
- Linear stages (windowing, DFT, DCT) are matrix multiplications.
- Nonlinearities (power spectrum, logarithm) are elementwise operations, with attention paid to subgradient definitions at non-differentiable points (e.g., adding a small constant $\epsilon$ inside the logarithm).
- Regularizers matched to each transform enforce smoothness, nonnegativity, symmetry, or orthonormality without explicit hard projection except when necessary.
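When hard projection is used, it is applied after each optimizer step. A minimal numpy sketch of such a projection (the symmetrization and QR details here are illustrative assumptions, not the papers' exact procedure):

```python
import numpy as np

def project_constraints(w, M, D):
    """Hard projections applied after an optimizer step (illustrative)."""
    w = 0.5 * (w + w[::-1])              # enforce window symmetry
    M = np.maximum(M, 0.0)               # keep Mel weights nonnegative
    Q, R = np.linalg.qr(D.T)             # re-orthonormalize the DCT rows
    D = (Q * np.sign(np.diag(R))).T      # sign fix keeps rows near the originals
    return w, M, D
```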
A typical loss function combines a cross-entropy task loss (e.g., for speaker identity classification or anomaly discrimination) with regularization terms:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \sum_i \lambda_i R_i,$$

with weighting factors $\lambda_i$ (e.g., a single shared value for all terms in (Liu et al., 2021)). Optimization is performed via Adam with standard learning rates and batch sizes, often beginning from a pretrained model and fine-tuning each MFCC component in isolation before potential joint adaptation (Liu et al., 2021).
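The regularizers can be sketched as simple penalties on the transform parameters. The penalty forms and the single shared weight below are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def mfcc_regularizers(w, M, D):
    """Illustrative penalties matched to each learnable transform."""
    r_smooth = np.sum(np.diff(w, n=2) ** 2)               # window smoothness
    r_sym = np.sum((w - w[::-1]) ** 2)                    # window symmetry
    r_nonneg = np.sum(np.minimum(M, 0.0) ** 2)            # Mel nonnegativity (soft)
    r_orth = np.sum((D @ D.T - np.eye(D.shape[0])) ** 2)  # DCT orthonormality
    return r_smooth, r_sym, r_nonneg, r_orth

def total_loss(task_loss, regs, lam=1e-2):
    # single shared weight lam across all regularizers (illustrative)
    return task_loss + lam * sum(regs)
```

At the classical initializations (symmetric Hamming window, nonnegative Mel weights, orthonormal DCT) every penalty is at or near zero, so training starts from the standard MFCC and pays a cost only for drifting away from its structure.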
3. Integration with Deep Learning Pipelines
Learnable MFCCs are directly integrated as the initial layers within deep architectures for a variety of tasks:
- In speaker verification, the 30-dimensional cepstral vectors are fed to the x-vector TDNN network, processed via standard speech activity detection (SAD) and cepstral mean normalization (CMN), then passed through PLDA back-ends and other normalization steps. The learnable MFCC stack is agnostic to downstream pipeline design, adds negligible computational overhead, and requires no further tuning once initialized (Liu et al., 2021).
- In network anomaly detection, learnable MFCCs preprocess raw IoT signal streams into cepstral “images,” which are then consumed by 2D CNNs such as ResNet-18. Pools of temporally-normalized cepstral frames enable spatial convolutional feature learning. Delta and delta-delta coefficients can be appended as additional input channels if required (Lee et al., 14 Jul 2025).
A key property is that the MFCC front-end is fully differentiable, allowing end-to-end joint optimization of spectral representation and classifier, with the potential for transfer to any task centered on time-series or spectral analysis.
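Turning a 1-D stream into a cepstral "image" for a 2-D CNN can be sketched as follows; the frame/hop lengths and the simple difference-based deltas are illustrative assumptions:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames: (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def deltas(C):
    """First-order temporal differences, padded to preserve shape."""
    return np.diff(C, axis=1, prepend=C[:, :1])

def cepstral_image(C):
    """Stack static, delta, delta-delta cepstra as channels:
    (3, n_coeffs, n_frames)."""
    d = deltas(C)
    return np.stack([C, d, deltas(d)])
```

Each per-frame cepstral vector (e.g., from the learnable front-end) fills one column of `C`; the resulting 3-channel image is what a ResNet-18-style CNN then consumes.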
4. Empirical Performance and Ablations
Quantitative experiments demonstrate the efficacy of learnable MFCCs over static, hand-crafted baselines. For speaker verification (Liu et al., 2021):
| System | VoxCeleb1-test EER (%) | SITW-DEV EER (%) |
|---|---|---|
| Baseline static MFCC | 4.64 | 6.72 |
| Learnable window | 4.40 (–5.2%) | 6.09 (–9.4%) |
| Learnable DFT | 4.33 (–6.7%) | 6.35 (–5.5%) |
| Learnable mel-bank | 4.45 (–4.1%) | 6.31 (–6.2%) |
| Learnable DCT | 4.36 (–6.0%) | 6.27 (–6.7%) |
Relative EER reductions up to 6.7% (VoxCeleb1) and 9.7% (SITW) over static MFCCs are achieved by tuning individual stages. Ablation studies demonstrate that domain-informed regularizers (nonnegativity, symmetry, orthonormality) meaningfully preserve interpretability while enabling adaptation for discrimination.
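The relative reductions quoted in the table follow directly from the EER pairs:

```python
def rel_reduction(baseline_eer, system_eer):
    """Relative EER reduction in percent."""
    return 100.0 * (baseline_eer - system_eer) / baseline_eer

# Learnable DFT on VoxCeleb1-test: 4.64 -> 4.33
print(round(rel_reduction(4.64, 4.33), 1))  # 6.7
# Learnable window on SITW-DEV: 6.72 -> 6.09
print(round(rel_reduction(6.72, 6.09), 1))  # 9.4
```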
For IoT anomaly detection (Lee et al., 14 Jul 2025), comparison of fixed vs. learnable MFCC feature extraction with a ResNet-18 back-end yields:
| Dataset | Fixed MFCC F1 | Learnable MFCC F1 |
|---|---|---|
| IoTID20 | 99.00% | 99.90% |
| CICIoT2023 | 72.82% | 99.38% |
| NSL-KDD | 99.71% | 100.00% |
The most pronounced absolute gain (+26.6 percentage points F1 over the fixed-MFCC baseline) is seen on CICIoT2023, with additional improvements when the learnable parameters are fine-tuned. A plausible implication is that the approach generalizes to tasks where the appropriate frequency emphasis and cepstral projection differ from prior-engineered speech models.
5. Interpretability, Domain Bias, and Practical Deployment
A defining property of learnable MFCCs is the combination of data-adaptivity and interpretability. By initializing from standard transforms and softly regularizing toward domain priors (e.g., cosine-shaped windows, DFT structure, Mel-scale nonnegativity, DCT orthogonality), the learned transforms maintain a physical interpretation and can be inspected or visualized as signal processing modules.
This hybrid of "hand-crafted" and end-to-end learning preserves the inductive biases crucial for small-data or transfer learning regimes, while still enabling adaptation to non-speech domains (IoT, biomedical time series, etc.) (Liu et al., 2021, Lee et al., 14 Jul 2025). In terms of computational cost, the front-end is negligible compared with the classifier, and light fine-tuning (∼1000 steps) suffices.
The pipeline is compatible with conventional normalization and post-processing (CMN/CMVN, temporal pooling), and can be augmented with additional normalization or variance-reduction steps prior to feeding the representation to DNNs.
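Such normalization is straightforward on a cepstral matrix `C` of shape `(n_coeffs, n_frames)`; a minimal sketch:

```python
import numpy as np

def cmn(C):
    """Cepstral mean normalization: zero-mean each coefficient over time."""
    return C - C.mean(axis=1, keepdims=True)

def cmvn(C, eps=1e-8):
    """Mean and variance normalization of each cepstral coefficient."""
    Z = cmn(C)
    return Z / (Z.std(axis=1, keepdims=True) + eps)
```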
6. Extensions, Applications, and Outlook
Learnable MFCCs have direct applications in speaker verification, robust network anomaly detection, and any domain where spectral content is semantically relevant. Practical extension possibilities include:
- Joint end-to-end adaptation of all four stages (or subset thereof) for task-specific optimization.
- Integration into high-capacity embedding back-ends such as extended-TDNN or large-scale ResNet for further gains.
- Rapid domain adaptation of pretrained MFCC-based systems through lightweight fine-tuning.
- Extension to non-speech and multivariate time-series domains where raw signal priors or filter shapes differ substantially from conventional speech.
As evidenced by empirical results on both classic and emerging datasets, learnable MFCCs provide a flexible, interpretable, and effective alternative to both fixed-feature and black-box representation learning (Liu et al., 2021, Lee et al., 14 Jul 2025).