Why discriminator sensitivity bias affects transformer-based codecs more than convolutional codecs

Investigate what causes the systematic sensitivity bias observed in multi-resolution STFT discriminators, a bias that induces periodic artifacts, and determine why it affects the predominantly transformer-based Transformer Audio AutoEncoder (TAAE) more strongly than earlier convolutional codec architectures (e.g., SoundStream/Encodec-style CNNs).

Background

The paper analyzes the sensitivity of discriminator losses with respect to the input signal and finds a systematic bias in multi-resolution STFT-based discriminators. Even freshly initialized discriminators exhibit strong horizontal and vertical lines in their sensitivity spectra, indicating that gradients are biased toward specific times and frequencies. This bias correlates with periodic artifacts in reconstructions, particularly at higher frequencies.
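The idea of a loss sensitivity with respect to the input can be illustrated with a toy example. The sketch below is not the paper's discriminator: it assumes a hypothetical score that is a randomly weighted sum of STFT power values, standing in for a freshly initialized network, and computes the gradient of that score with respect to each input sample analytically. The names `toy_score` and `toy_sensitivity`, the Hann window, and the frame and hop sizes are all illustrative assumptions.

```python
import numpy as np


def frame_layout(n_samples, n_fft, hop):
    """Return sample indices of each frame and the analysis window."""
    n_frames = 1 + (n_samples - n_fft) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(n_fft)[None, :]
    return idx, np.hanning(n_fft)


def toy_score(x, weights, n_fft, hop):
    """Hypothetical score: weighted sum of STFT power (not the paper's model)."""
    idx, window = frame_layout(len(x), n_fft, hop)
    spec = np.fft.rfft(x[idx] * window, axis=1)
    return float(np.sum(weights * np.abs(spec) ** 2))


def toy_sensitivity(x, weights, n_fft, hop):
    """Gradient of toy_score with respect to every input sample."""
    idx, window = frame_layout(len(x), n_fft, hop)
    bins = np.arange(n_fft // 2 + 1)
    # Explicit DFT matrix for the non-negative frequency bins used by rfft.
    dft = np.exp(-2j * np.pi * np.outer(bins, np.arange(n_fft)) / n_fft)
    spec = (x[idx] * window) @ dft.T
    # d|X|^2 / d(frame sample) = 2 * Re(conj(X) * dft), then undo the windowing.
    grad_frames = 2.0 * np.real((weights * np.conj(spec)) @ dft) * window
    grad = np.zeros_like(x)
    np.add.at(grad, idx, grad_frames)  # overlap-add back onto the time axis
    return grad


rng = np.random.default_rng(0)
signal = rng.standard_normal(256)
random_weights = rng.standard_normal((13, 33))  # 13 frames x 33 bins
sensitivity = toy_sensitivity(signal, random_weights, n_fft=64, hop=16)
```

Taking a spectrogram of `sensitivity` would give a rough analogue of a sensitivity spectrum. With a real discriminator, the same quantity would normally be obtained by automatic differentiation rather than a hand-derived gradient.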

While adjusting STFT resolutions to inharmonic sizes (near the golden ratio) mitigates these effects, the authors note that transformer-based architectures (TAAE) appear more susceptible to this discriminator-induced bias than earlier convolutional codecs, and the reason for this discrepancy remains to be examined.
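One way to picture "inharmonic" resolutions is to scale window sizes by the golden ratio instead of by powers of two. The sketch below assumes that reading. The function names, the base size of 64, and the rounding rule are hypothetical, and the sizes it produces are not the ones used in the paper. The helper `shared_bins` counts how many nonzero bin frequencies two FFT sizes have in common, as a rough measure of how strongly their frequency grids line up.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2


def harmonic_fft_sizes(base, count):
    """Conventional choice: each size is double the previous one."""
    return [base * 2 ** k for k in range(count)]


def inharmonic_fft_sizes(base, count):
    """Assumed alternative: consecutive sizes differ by roughly the golden ratio."""
    return [round(base * GOLDEN_RATIO ** k) for k in range(count)]


def shared_bins(n_a, n_b):
    """Count nonzero normalized bin frequencies k/n_a that also equal some j/n_b."""
    return math.gcd(n_a, n_b) - 1


harmonic = harmonic_fft_sizes(64, 4)
inharmonic = inharmonic_fft_sizes(64, 4)
for sizes in (harmonic, inharmonic):
    overlaps = [shared_bins(a, b) for a, b in zip(sizes, sizes[1:])]
    print(sizes, overlaps)
```

Power-of-two sizes share every bin of the smaller transform with the larger one, whereas golden-ratio spacing leaves far fewer coinciding bins. Whether this is the mechanism behind the reported mitigation is not established by the source.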

References

A deeper examination of the reason why this bias effects a transformer-based architecture more than previous convolutional architectures is left to future work.

— Scaling Transformers for Low-Bitrate High-Quality Speech Coding (2411.19842 - Parker et al., 29 Nov 2024) in Appendix: Systematic bias in loss functions
