Analysis of discriminator’s learned focus on inaudible spectral regions during training

Analyze and characterize why, late in training of the Transformer Audio AutoEncoder (TAAE), the multi-resolution STFT discriminator's loss becomes dominated by extremely low-magnitude (inaudible) spectral regions, and develop principled methods to address this learned bias without degrading timbre or intelligibility.

Background

Late in training, sensitivity analyses showed the discriminator loss is mainly influenced by very low-magnitude parts of the spectrum, suggesting the discriminator differentiates real versus fake audio based on patterns in inaudible regions. The authors propose a power-law scaling of STFT magnitudes (raising each magnitude to a power α before the loss is computed) to counteract this tendency, with α = 1/2 working best empirically, but acknowledge that a deeper analysis is needed to understand and address the underlying bias.
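The proposed mitigation can be sketched as follows. This is a minimal numpy illustration of power-law magnitude compression, assuming magnitudes are pre-computed; the function name, the eps guard, and the toy values are illustrative choices, not details from the paper:

```python
import numpy as np

def power_law_compress(stft_mag, alpha=0.5, eps=1e-8):
    """Raise STFT magnitudes to the power alpha before computing
    spectral/discriminator losses; alpha = 1/2 is the value the
    paper reports working best empirically. The eps guard (an
    assumption here) avoids issues at zero-magnitude bins."""
    return np.power(stft_mag + eps, alpha)

# Toy spectrogram slice: one audible bin and one near-inaudible bin.
mag = np.array([1.0, 1e-6])
compressed = power_law_compress(mag)

# Compression shrinks the dynamic range: the loud/quiet ratio falls
# from ~1e6 on raw magnitudes to ~1e3 after alpha = 1/2 scaling, so
# near-silent bins contribute far less relative contrast.
```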

The issue directly impacts training dynamics and perceptual fidelity; a principled resolution could improve reconstruction quality and stability across different architectures and datasets.
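To build intuition for why low-magnitude bins can dominate, one can compare how differently scaled spectral losses weight each bin. The toy calculation below (an illustration, not the paper's sensitivity analysis) uses the fact that for an L1 loss on f(magnitude), the per-bin gradient magnitude is |f'(m)|:

```python
# Per-bin gradient magnitude of an L1 spectral loss |f(m_hat) - f(m)|
# under different magnitude scalings f, evaluated at bin magnitude m:
#   identity   f(m) = m      -> |f'(m)| = 1            (uniform weighting)
#   log        f(m) = log m  -> |f'(m)| = 1/m          (quiet bins dominate)
#   power law  f(m) = m**a   -> |f'(m)| = a * m**(a-1) (intermediate)
grad_log = lambda m: 1.0 / m
grad_pow = lambda m, a=0.5: a * m ** (a - 1.0)

loud, quiet = 1.0, 1e-6  # audible vs. near-inaudible bin magnitude

log_ratio = grad_log(quiet) / grad_log(loud)  # ~1e6
pow_ratio = grad_pow(quiet) / grad_pow(loud)  # ~1e3
```

Under log-like scaling the near-inaudible bin is weighted a million times more heavily than the audible one; the α = 1/2 power law reduces that imbalance to about a thousand, which is consistent with the paper's empirical preference for α = 1/2 over stronger compression.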

References

A more involved analysis for addressing this issue is left to future work.

Scaling Transformers for Low-Bitrate High-Quality Speech Coding (Parker et al., 29 Nov 2024, arXiv:2411.19842), Appendix "Systematic bias in loss functions", subsection "Learned bias during training".