Spectrogram-Based Loss Functions
- Spectrogram-based loss functions are mathematical formulations that compare STFT representations of reference and generated signals to align learning objectives with human auditory perception.
- They leverage multi-resolution designs with varied FFT sizes, window lengths, and hop sizes to capture fine temporal details and frequency structures in audio.
- Recent perceptually weighted and phase-aware extensions improve objective metrics and subjective audio quality by emphasizing frequency-specific and phase information.
Spectrogram-based loss functions, particularly those employing multi-resolution short-time Fourier transform (MR-STFT) criteria, have become fundamental in modern speech generation, enhancement, and super-resolution models. These losses directly compare network output and reference signals in the time-frequency domain, leveraging signal processing principles to better align machine learning objectives with perceptual and physical characteristics of audio. Advanced formulations incorporate frequency-dependent perceptual weightings, phase-aware terms, or multi-branch network architectures, leading to robust improvements in both objective and subjective speech quality metrics.
1. Mathematical Formulation of Spectrogram-Based Losses
The canonical spectrogram-based loss is constructed on the STFT of a signal. For a reference waveform $x$ and generated output $\hat{x}$, the magnitude STFT at a given resolution is denoted $|\mathrm{STFT}(x)[f,t]|$, where $f$ is the frequency bin and $t$ is the frame index. Two loss terms are standard:
- Spectral convergence:
$$L_{\mathrm{sc}}(x,\hat{x}) = \frac{\big\| |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \big\|_F}{\big\| |\mathrm{STFT}(x)| \big\|_F},$$
where $\|\cdot\|_F$ is the Frobenius norm.
- Log-magnitude loss:
$$L_{\mathrm{mag}}(x,\hat{x}) = \frac{1}{N} \big\| \log|\mathrm{STFT}(x)| - \log|\mathrm{STFT}(\hat{x})| \big\|_1,$$
where $\|\cdot\|_1$ is the element-wise L1 norm and $N$ is the total number of time–frequency bins.
The per-resolution loss is $L_s = L_{\mathrm{sc}} + L_{\mathrm{mag}}$. In the multi-resolution setting with $M$ resolutions, the overall auxiliary loss is typically the average
$$L_{\mathrm{aux}} = \frac{1}{M} \sum_{m=1}^{M} L_s^{(m)},$$
as detailed in (Yamamoto et al., 2019, Song et al., 2021, Wan et al., 2023, Tamiti et al., 30 Jun 2025, Shi et al., 2023).
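The following PyTorch sketch makes these definitions concrete. It is a minimal illustration, not code from any of the cited papers; the function names and the `RESOLUTIONS` triples (FFT size, hop, and window length in samples, assuming 24 kHz audio as in the table of Section 2) are assumptions made for the example.

```python
import torch

def stft_mag(x, fft_size, hop, win_length):
    """Magnitude STFT |STFT(x)| with a Hann window."""
    window = torch.hann_window(win_length, device=x.device)
    spec = torch.stft(x, n_fft=fft_size, hop_length=hop, win_length=win_length,
                      window=window, return_complex=True)
    # A small floor keeps the log-magnitude term finite.
    return spec.abs().clamp(min=1e-7)

def stft_loss(x, x_hat, fft_size, hop, win_length):
    """Per-resolution loss L_s = L_sc + L_mag."""
    S = stft_mag(x, fft_size, hop, win_length)
    S_hat = stft_mag(x_hat, fft_size, hop, win_length)
    l_sc = torch.norm(S - S_hat, p="fro") / torch.norm(S, p="fro")   # spectral convergence
    l_mag = torch.mean(torch.abs(torch.log(S) - torch.log(S_hat)))   # (1/N) * L1 log-magnitude
    return l_sc + l_mag

# (FFT size, hop, window length) in samples; mirrors the Section 2 table at 24 kHz.
RESOLUTIONS = [(512, 50, 240), (1024, 120, 600), (2048, 240, 1200)]

def mr_stft_loss(x, x_hat):
    """Multi-resolution average L_aux = (1/M) * sum_m L_s^(m)."""
    return sum(stft_loss(x, x_hat, *r) for r in RESOLUTIONS) / len(RESOLUTIONS)
```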
2. Multi-Resolution Loss Design and Parameters
Multi-resolution designs incorporate several STFT parameterizations. For example, (Yamamoto et al., 2019) and (Song et al., 2021) use three STFTs with different frame lengths and hop sizes to balance spectral and temporal resolution:
| STFT Index | FFT Size | Window Length (ms) | Hop Size (ms) | Window Type |
|---|---|---|---|---|
| 1 | 512 | 10 | ~2 | Hanning |
| 2 | 1024 | 25 | 5 | Hanning |
| 3 | 2048 | 50 | 10 | Hanning |
(Tamiti et al., 30 Jun 2025) utilizes square-root Hann windows of 256/512/1024 samples with corresponding hops for perfect reconstruction. The rationale for this configuration, as stated in (Yamamoto et al., 2019), is that different window/hop/FFT sizes emphasize either fine time structure (transients, rapid spectral changes) or frequency structure (formants, harmonics).
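The perfect-reconstruction property mentioned above can be checked directly: a square-root Hann analysis/synthesis window pair at 50% overlap satisfies the constant-overlap-add (COLA) condition. The snippet below is a quick numerical verification under that 50%-overlap assumption, not code from (Tamiti et al., 30 Jun 2025).

```python
import torch

def sqrt_hann(win_length):
    # Square-root Hann: analysis * synthesis window equals a periodic Hann.
    return torch.hann_window(win_length, periodic=True).sqrt()

for win in (256, 512, 1024):
    w2 = sqrt_hann(win) ** 2          # combined analysis-synthesis taper
    hop = win // 2                    # 50% overlap
    ola = w2 + torch.roll(w2, hop)    # overlap-add of adjacent frames
    assert torch.allclose(ola, torch.ones(win), atol=1e-6)  # sums to 1 (COLA)
```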
3. Integration into Network Training Objectives
Spectrogram-based losses are rarely used in isolation; instead, they are incorporated as auxiliary criteria to complement adversarial, time-domain, or perception-related losses. A typical generator objective in a waveform model (e.g., Parallel WaveGAN) is
$$L_G = L_{\mathrm{aux}} + \lambda_{\mathrm{adv}} L_{\mathrm{adv}},$$
with $\lambda_{\mathrm{adv}}$ a tunable scalar (Yamamoto et al., 2019, Song et al., 2021).
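A minimal sketch of this combined objective, assuming the `mr_stft_loss` helper above, a least-squares adversarial term as in Parallel WaveGAN, and a `discriminator` module supplied by the caller:

```python
import torch

def generator_loss(x, x_hat, discriminator, lambda_adv=4.0):
    """L_G = L_aux + lambda_adv * L_adv (LSGAN form; lambda_adv is tunable)."""
    l_aux = mr_stft_loss(x, x_hat)
    l_adv = torch.mean((discriminator(x_hat) - 1.0) ** 2)  # push D(x_hat) toward "real"
    return l_aux + lambda_adv * l_adv
```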
In speech enhancement and super-resolution, it is common to combine the MR-STFT loss with time-domain losses (e.g., L1/MAE or scale-invariant SI-SDR) and semantic/ASR-based losses. For instance, the MNTFA model (Wan et al., 2023) uses a composite objective of the form
$$L = L_{\mathrm{MR\text{-}STFT}} + \lambda_1 L_{\mathrm{time}} + \lambda_2 L_{\mathrm{sem}},$$
where $L_{\mathrm{sem}}$ supervises embeddings through a pre-trained model.
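A sketch of such a composite objective follows; the weights and the embedding model are placeholders, not the values or architecture from (Wan et al., 2023):

```python
import torch
import torch.nn.functional as F

def composite_loss(x, x_hat, embed_model, w_time=1.0, w_sem=1.0):
    l_spec = mr_stft_loss(x, x_hat)      # multi-resolution STFT term
    l_time = F.l1_loss(x_hat, x)         # time-domain MAE term
    with torch.no_grad():                # reference embedding from a frozen model
        e_ref = embed_model(x)
    l_sem = F.l1_loss(embed_model(x_hat), e_ref)  # semantic/embedding term
    return l_spec + w_time * l_time + w_sem * l_sem
```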
(Shi et al., 2023) introduces multi-branch decoder heads, each optimized with its own resolution-matched loss, with the final output being the mean waveform.
4. Perceptually Weighted and Phase-Aware Extensions
To better align loss sensitivity with human perception, (Song et al., 2021) introduces frequency-dependent weighting derived from averaged linear prediction (LP) spectra. The weight matrix $W$ amplifies error terms in perceptually sensitive frequency regions, e.g., in the spectral-convergence term:
$$L_{\mathrm{sc}}^{W}(x,\hat{x}) = \frac{\big\| W \odot \left( |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \right) \big\|_F}{\big\| W \odot |\mathrm{STFT}(x)| \big\|_F},$$
where $\odot$ denotes element-wise multiplication. A similar adjustment applies to the log-magnitude term.
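In code, the weighting amounts to broadcasting a per-bin weight vector across frames before taking norms; deriving the weights from LP spectra is outside the scope of this sketch:

```python
import torch

def weighted_sc_loss(S, S_hat, w):
    """S, S_hat: magnitude spectrograms (freq, time); w: per-frequency weights (freq,)."""
    W = w.unsqueeze(-1)                        # broadcast weights across time frames
    num = torch.norm(W * (S - S_hat), p="fro")
    den = torch.norm(W * S, p="fro")
    return num / den
```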
CTFT-Net (Tamiti et al., 30 Jun 2025) demonstrates phase-awareness by operating in the complex STFT domain and explicitly supervising only real-part spectral components, observing empirically beneficial trade-offs for frequency and phase fidelity. Time-domain SI-SDR further encourages accurate phase reconstruction, as misalignment in phase reduces time-domain fidelity.
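For reference, a standard SI-SDR loss (negated for minimization) of the kind used as the time-domain term; this is the conventional definition, not a specific paper's implementation:

```python
import torch

def si_sdr_loss(x, x_hat, eps=1e-8):
    """Negative SI-SDR between reference x and estimate x_hat, shapes (..., samples)."""
    x = x - x.mean(dim=-1, keepdim=True)
    x_hat = x_hat - x_hat.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = (x_hat * x).sum(-1, keepdim=True) / (x.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * x
    noise = x_hat - target
    si_sdr = 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_sdr.mean()
```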
5. Empirical Impacts and Ablative Evidence
MR-STFT losses consistently yield significant improvements across metrics:
- In Parallel WaveGAN, switching from single- to multi-resolution STFT yields major MOS gains, e.g., from 1.36 (single) to 4.06 (MR-STFT+adv) (Yamamoto et al., 2019).
- (Tamiti et al., 30 Jun 2025) reports a 30% reduction in log-spectral distance (LSD) when moving from single- to multi-resolution spectral supervision for 2→48 kHz speech super-resolution, alongside improvements in STOI, PESQ, and SI-SDR.
- MNTFA’s ablation with/without MR-STFT loss shows a +0.10 increase in PESQ and +0.49% in STOI (Wan et al., 2023).
- Introduction of perceptual weighting in (Song et al., 2021) yields MOS improvements of +0.24 for female and +0.10 for male speakers compared to unweighted MR-STFT, specifically reducing high-frequency noise in spectral valleys.
(Shi et al., 2023) demonstrates that supplying multiple strongly stationary spectrogram resolutions to the encoder and associating matched losses to parallel decoders adds +0.14 (absolute) to PESQ on VoiceBank.
6. Architectural Variants Leveraging Spectrogram Losses
Models exploiting spectrogram-based losses have developed several architectural innovations.
- Auxiliary MR-STFT Loss: Used as a secondary or “co-supervision” loss in fast waveform generation (Yamamoto et al., 2019, Song et al., 2021).
- Multi-Branch Decoders: Each branch is supervised using a specific resolution, with all outputs fused for the final waveform (Shi et al., 2023); see the sketch after this list. This reduces conflicting gradients and improves feature learning.
- Time-Frequency Attention: Attention over time and frequency axes is explicitly integrated to further exploit spectrogram-based supervision (Wan et al., 2023).
- Perceptually Weighted STFT Losses: Training targets can be focused on perceptually sensitive frequency regions derived from speech statistics (Song et al., 2021), emphasizing auditory salience.
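A schematic of the multi-branch pattern, reusing the `stft_loss` helper from Section 1; the decoder modules and fusion-by-mean are illustrative stand-ins for the architecture in (Shi et al., 2023):

```python
import torch

def multi_branch_step(encoded, target, branches, resolutions):
    """Each decoder head gets a resolution-matched loss; the output is the branch mean."""
    outs, losses = [], []
    for branch, (fft, hop, win) in zip(branches, resolutions):
        y = branch(encoded)                                  # each head predicts a waveform
        losses.append(stft_loss(target, y, fft, hop, win))   # loss at this head's resolution
        outs.append(y)
    y_final = torch.stack(outs).mean(dim=0)                  # fuse: mean waveform
    return y_final, sum(losses) / len(losses)
```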
7. Limitations and Open Challenges
Single-output networks often struggle to satisfy the potentially conflicting gradients arising from losses at multiple resolutions (Shi et al., 2023). Using separate decoder branches or carefully selecting "stationary" resolutions in the encoder can partially address this. Determining optimal STFT configurations remains empirical, as does the trade-off between log-magnitude and spectral convergence terms. Another challenge involves phase reconstruction: while SI-SDR and complex-domain losses indirectly improve phase, most MR-STFT formulations remain magnitude-centric. Recent works (e.g., Tamiti et al., 30 Jun 2025) suggest that careful selection of which spectral components to supervise (e.g., real vs. imaginary) yields the best trade-off for perceptual and objective scores.
In summary, spectrogram-based loss functions, especially multi-resolution STFT losses and their perceptually aware extensions, are now central to state-of-the-art speech and audio models. They exploit the local stationarity and auditory-relevant patterns in time-frequency representations to drive consistent improvements in both algorithmic and human-perceptual metrics across waveform generation, enhancement, and super-resolution tasks (Yamamoto et al., 2019, Song et al., 2021, Tamiti et al., 30 Jun 2025, Wan et al., 2023, Shi et al., 2023).