MelT: GEMM-Native NDFT for Efficient Single-Stage Audio Frontends on Modern Accelerators

Published 31 May 2026 in cs.SD | (2606.01009v1)

Abstract: Modern audio processing networks are commonly deployed on accelerators whose peak throughput is obtained through dense linear algebra, whereas conventional acoustic frontends -- a Short-Time Fourier Transform (STFT) followed by sparse Mel aggregation -- remain structurally heterogeneous. This mismatch can introduce memory-bandwidth, dispatch, and intermediate-allocation overheads on contemporary accelerator backends. This work introduces MelT, a single-stage frontend framework in which Mel-spaced Non-Uniform Discrete Fourier Transform (NDFT) bases are precomputed and applied to time-domain acoustic frames through dense General Matrix Multiplication (GEMM) operations. The contribution is not the NDFT operator itself; rather, it is the formulation of Mel-spaced NDFT projection as a GEMM-native audio frontend and its evaluation as a hardware-efficient alternative to conventional STFT+Mel pipelines. Evaluated across platforms ranging from Apple A18 Pro edge hardware to NVIDIA H100 datacenter acceleration, MelT attains up to a $3.75\times$ speedup in inference latency and a $3.52\times$ reduction in energy consumption while maintaining downstream classification accuracy.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper proposes a single-stage audio frontend, MelT, that reformulates Mel-frequency extraction as dense matrix multiplications, achieving significant latency and energy reductions.
It replaces the conventional STFT+Mel pipeline with a GEMM-native NDFT, aligning computation with modern accelerator capabilities and optimizing performance across diverse hardware.
Empirical evaluations confirm that MelT maintains competitive classification accuracy while delivering faster inference and lower energy consumption on both edge and datacenter platforms.

MelT: GEMM-Native NDFT for Efficient Single-Stage Audio Frontends on Modern Accelerators

Overview

The paper presents MelT, a single-stage audio frontend framework that leverages dense General Matrix Multiplication (GEMM)-based Non-Uniform Discrete Fourier Transform (NDFT) to directly project acoustic frames into Mel-spaced frequency coordinates. By encoding the Mel-spaced NDFT as dense matrix operations, MelT exploits the architectural strengths of modern deep learning accelerators, such as GPUs and specialized matrix coprocessors, in contrast to the conventional STFT+Mel frontend which aligns computation with FFT rather than GEMM engines. The study empirically demonstrates substantial reductions in inference latency and energy consumption across edge and datacenter hardware, while maintaining practical classification accuracy on benchmark tasks.

Motivation and Methodological Innovations

Contemporary audio models, including for speech recognition and audio event classification, are increasingly deployed on hardware optimized for dense linear algebra. However, the prevailing STFT+Mel frontend pipeline was engineered for computational efficiency on classical CPUs and now creates execution inefficiencies on modern accelerators because it is multi-staged, incurs significant memory movement, and underutilizes matrix multiplication engines.

MelT addresses this by recasting the Mel-frequency spectral projection as a direct NDFT at Mel-spaced frequencies, structured for accelerator execution via GEMM. The proposed pipeline computes the Mel projection for each frame by applying a precomputed basis of sinusoidal functions at nonuniform (Mel) frequencies via matrix multiplication. The representation is extended to cepstral domains (MFCCT) by appending log compression and an orthonormal DCT-II transform, mirroring the MFCC pipeline but with all major operations expressible as dense matrix products.

Hardware-Efficient Formulation

The established MelT pipeline operates as follows: after standard frame blocking and windowing, each signal frame is projected via explicit dense matmuls to compute, for each of $M$ Mel bins, real and imaginary projections,

$R_{t,m} = \sum_n \tilde{x}_t[n] \cos \left( 2\pi f_m n / f_s \right), \quad I_{t,m} = \sum_n \tilde{x}_t[n] \sin \left( 2\pi f_m n / f_s \right)$

where $f_m$ are the Mel-spaced target frequencies obtained through the inverse Mel mapping. The resulting energy matrix $\mathbf{S}$ is constructed as $S_{t,m} = R_{t,m}^2 + I_{t,m}^2$ . The log-compressed $\mathbf{S}$ constitutes the MelT spectrogram, and its DCT-II transform yields the MFCCT cepstral representation. Critically, all major frontend steps are designed as dense GEMM operations, well-suited for high-throughput execution on modern accelerators.

The pipeline design is not an algebraic rearrangement of STFT followed by Mel aggregation, but a distinct projection sequence that omits intermediate linear-frequency spectra and sparse filterbank aggregation, yielding coherent Mel-space projections.

Empirical Evaluation and Results

Latency and Throughput

Comprehensive benchmarking was performed on Apple A18 Pro (edge), Apple M4 Pro (workstation), NVIDIA V100 (legacy datacenter), and NVIDIA H100 (modern datacenter). MelT attains significant reduction in inference latency compared to STFT+Mel, scalable with input duration and maximal for longer contexts. On the Apple A18 Pro, MelT achieves up to $3.75\times$ latency speedup for 160-second utterances. Datacenter GPUs, such as the H100, observe a $1.92\times$ reduction, reflecting backend-specific dispatch and memory dynamics.

Figure 1: Latency scaling (Panel A) and corresponding speedup (Panel B) for MelT versus the conventional STFT+Mel pipeline, exhibiting consistent acceleration as input duration increases.

Energy Consumption

Energy profiling shows that frontend computation energy reductions can exceed latency gains, primarily when the accelerator's power draw also decreases under MelT. On the NVIDIA H100, MelT reduces median chip-level power demand from 438.1 W (STFT+Mel) to 309.0 W, combining with latency reduction for a $2.74\times$ minimum energy cut. On the A18 Pro, MelT reduces the energy per forward pass by $3.52\times$ due almost entirely to shorter execution time, as power draw remains constant across pipelines.

Figure 2: Hardware energy consumption in millijoules over input size, illustrating substantial reductions with MelT compared to the traditional frontend.

Representation Fidelity and Downstream Task Performance

The coherence discrepancy between MelT and STFT+Mel (non-equivalence due to ordering of projection and energy computation) does not induce meaningful representation drift in practice: frame-level feature similarity (cosine) remains within $R_{t,m} = \sum_n \tilde{x}_t[n] \cos \left( 2\pi f_m n / f_s \right), \quad I_{t,m} = \sum_n \tilde{x}_t[n] \sin \left( 2\pi f_m n / f_s \right)$ 0.

On standard downstream tasks:

VoxCeleb1 gender classification: MFCCT matches standard MFCC accuracy within 0.2% (97.84% vs. 97.95%). Cross-evaluation demonstrates only minor degradation when training on STFT+Mel and evaluating on MelT-based features (88.91% vs. 88.81%).
SPIRA COVID-19 respiratory classification: MFCCT produces an F1 of 0.9860 vs. MFCC’s 0.9737.

Scaling with Mel Bin Count

Performance gain from MelT is sensitive to $R_{t,m} = \sum_n \tilde{x}_t[n] \cos \left( 2\pi f_m n / f_s \right), \quad I_{t,m} = \sum_n \tilde{x}_t[n] \sin \left( 2\pi f_m n / f_s \right)$ 1, the number of Mel bins, due to the $R_{t,m} = \sum_n \tilde{x}_t[n] \cos \left( 2\pi f_m n / f_s \right), \quad I_{t,m} = \sum_n \tilde{x}_t[n] \sin \left( 2\pi f_m n / f_s \right)$ 2 arithmetic pattern. As $R_{t,m} = \sum_n \tilde{x}_t[n] \cos \left( 2\pi f_m n / f_s \right), \quad I_{t,m} = \sum_n \tilde{x}_t[n] \sin \left( 2\pi f_m n / f_s \right)$ 3 grows large, the speedup diminishes—approaching parity with STFT+Mel at $R_{t,m} = \sum_n \tilde{x}_t[n] \cos \left( 2\pi f_m n / f_s \right), \quad I_{t,m} = \sum_n \tilde{x}_t[n] \sin \left( 2\pi f_m n / f_s \right)$ 4—but practical configurations (e.g., $R_{t,m} = \sum_n \tilde{x}_t[n] \cos \left( 2\pi f_m n / f_s \right), \quad I_{t,m} = \sum_n \tilde{x}_t[n] \sin \left( 2\pi f_m n / f_s \right)$ 5 for ASR) retain $R_{t,m} = \sum_n \tilde{x}_t[n] \cos \left( 2\pi f_m n / f_s \right), \quad I_{t,m} = \sum_n \tilde{x}_t[n] \sin \left( 2\pi f_m n / f_s \right)$ 6 acceleration.

Implications for Hardware-Software Co-Design

A central implication highlighted by this work is that the dominant factor in realized latency and energy efficiency for audio frontends on modern accelerators is shifting from algorithmic FLOP counts to hardware-execution path alignment. By recasting frontend computation as a unified dense matmul (GEMM), MelT aligns processing with accelerator strengths—kernel launch reduction, regular compute patterns, improved cache and memory throughput—enabling substantial practical gains often obscured by asymptotic complexity analysis.

The benefits extend to cepstral representations (MFCCT), indicating that the gains stem from the execution structure, not from specific frontend invariants.

The observed platform- and configuration-dependence of performance gains underscores the necessity for hardware-software co-design when engineering inference-critical frontend pipelines, especially in the context of proliferating matrix-native accelerators in mobile and server environments.

Future Research Directions

Several avenues arise from this work:

Learnable Mel-Space Projections: While MelT uses fixed Mel spacing, parameterizing the NDFT bases or incorporating learnable frequency positions could be explored for task-specific supervision.
Generalization to Other Spectral Representations: Extending the GEMM-NDFT pipeline to non-Mel and non-audio domains, including music and non-speech signals, could validate hardware-native benefits universally.
Hardware-Aware Parameter Optimization: Characterizing the $R_{t,m} = \sum_n \tilde{x}_t[n] \cos \left( 2\pi f_m n / f_s \right), \quad I_{t,m} = \sum_n \tilde{x}_t[n] \sin \left( 2\pi f_m n / f_s \right)$ 7-scaling crossover analytically and empirically to aid in frontend parameter selection co-optimized for both accuracy and efficiency.
Incorporation into End-to-End Systems: Integrating MelT and variants in trainable frontend layers for speech recognition, event detection, or audio self-supervised pretraining for system-level impact quantification.

Conclusion

This study demonstrates that MelT, by reframing Mel-spaced spectral feature extraction as a dense matmul operation, achieves marked reductions in latency and energy consumption relative to conventional pipelines, especially on accelerators where dense linear algebra is preferentially optimized. These advantages persist for common frontend configurations and do not sacrifice downstream classification performance, as evidenced on both speech and clinical tasks. The findings advocate for continued re-examination and redesign of classical signal-processing modules from a hardware-centric standpoint, with clear practical benefits for deep learning inference at scale.

Markdown Report Issue