On Frequency-Wise Normalizations for Better Recording Device Generalization in Audio Spectrogram Transformers (2306.11764v1)
Abstract: Varying conditions between the data seen at training and at application time remain a major challenge for machine learning. We study this problem in the context of Acoustic Scene Classification (ASC) with mismatching recording devices. Previous works successfully employed frequency-wise normalization of inputs and hidden layer activations in convolutional neural networks to reduce the recording device discrepancy. The main objective of this work was to adopt frequency-wise normalization for Audio Spectrogram Transformers (ASTs), which have recently become the dominant model architecture in ASC. To this end, we first investigate how recording device characteristics are encoded in the hidden layer activations of ASTs. We find that recording device information is initially encoded in the frequency dimension; however, after the first self-attention block, it is largely transformed into the token dimension. Based on this observation, we conjecture that suppressing recording device characteristics in the input spectrogram is the most effective approach. We propose a frequency-centering operation for spectrograms that improves the ASC performance on unseen recording devices on average by up to 18.2 percentage points.
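As a minimal sketch of what such a frequency-centering operation might look like (this is an illustration, not the paper's exact formulation; the function name, the log-mel input convention, and the choice of computing per-bin means over the time axis are assumptions):

```python
import numpy as np

def frequency_center(spec: np.ndarray) -> np.ndarray:
    """Center a spectrogram along the frequency dimension.

    spec: log-magnitude (e.g. log-mel) spectrogram of shape
          (n_freq_bins, n_frames).

    Returns the spectrogram with each frequency bin's mean over time
    subtracted, so every frequency band has zero mean. The idea is that
    a recording device imposes a roughly static, frequency-dependent
    coloration, which per-bin centering removes.
    """
    # Per-frequency-bin mean over the time axis, shape (n_freq_bins, 1).
    freq_means = spec.mean(axis=1, keepdims=True)
    return spec - freq_means

# Usage example with a stand-in spectrogram:
rng = np.random.default_rng(0)
spec = rng.normal(size=(128, 1000))        # hypothetical log-mel input
centered = frequency_center(spec)
assert np.allclose(centered.mean(axis=1), 0.0)  # each bin now zero-mean
```

Applying the operation to the input spectrogram, rather than to hidden activations, matches the paper's observation that device information lives in the frequency dimension only before the first self-attention block.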