On the Choice of the Optimal Temporal Support for Audio Classification with Pre-trained Embeddings (2312.14005v1)
Abstract: Current state-of-the-art audio analysis systems rely on pre-trained embedding models, often used off-the-shelf as (frozen) feature extractors. Choosing the best one for a given set of tasks is the subject of many recent publications. However, one aspect often overlooked in these works is the influence of the duration of the audio input from which an embedding is extracted, which we refer to as Temporal Support (TS). In this work, we study the influence of the TS for several well-established and emerging pre-trained embeddings, chosen to represent different types of architectures and learning paradigms. We conduct this evaluation on both musical instrument and environmental sound datasets, namely OpenMIC, TAU Urban Acoustic Scenes 2020 Mobile, and ESC-50. Notably, we show that Audio Spectrogram Transformer-based systems (PaSST and BEATs) remain effective with a shorter TS, which allows for a drastic reduction in memory and computational cost. Moreover, we show that choosing the optimal TS yields competitive results across all tasks. In particular, we improve on the state-of-the-art results on OpenMIC using BEATs and PaSST without any fine-tuning.
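The core idea, extracting frozen embeddings over audio chunks of a chosen duration (the TS) and pooling them into a clip-level representation, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the chunking and mean-pooling choices are assumptions, and `dummy_embed` is a hypothetical stand-in for a pretrained encoder such as PaSST or BEATs.

```python
import numpy as np

def chunk_waveform(wave, sr, ts_seconds, hop_seconds=None):
    """Split a waveform into fixed-length chunks of ts_seconds (the temporal support).

    A smaller TS yields more, shorter chunks per clip; each is embedded independently.
    """
    win = int(sr * ts_seconds)
    hop = int(sr * (hop_seconds if hop_seconds is not None else ts_seconds))
    chunks = [wave[i:i + win] for i in range(0, max(len(wave) - win + 1, 1), hop)]
    return [c for c in chunks if len(c) == win]  # drop a trailing partial chunk

def clip_embedding(wave, sr, ts_seconds, embed_fn):
    """Embed each chunk with a frozen model and mean-pool to one clip-level vector."""
    chunks = chunk_waveform(wave, sr, ts_seconds)
    return np.mean([embed_fn(c) for c in chunks], axis=0)

# Hypothetical stand-in for a pretrained encoder (PaSST, BEATs, ...):
# any function mapping an audio chunk to a fixed-size vector.
dummy_embed = lambda c: np.array([c.mean(), c.std()])

sr = 16000
wave = np.random.default_rng(0).standard_normal(sr * 10)  # a 10 s clip
emb_1s = clip_embedding(wave, sr, 1.0, dummy_embed)  # TS = 1 s -> 10 chunks embedded
emb_5s = clip_embedding(wave, sr, 5.0, dummy_embed)  # TS = 5 s -> 2 chunks embedded
```

The memory/compute trade-off discussed in the abstract follows from this setup: with a Transformer encoder, attention cost grows with the length of each chunk, so a shorter TS reduces the per-forward-pass cost even though more chunks are processed.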
- “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. ICASSP. IEEE, 2017, pp. 776–780.
- “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 29, pp. 3451–3460, 2021.
- “Fine-tuning strategies for faster inference using speech self-supervised models: A comparative study,” in Proc. ICASSP. IEEE, 2023, pp. 1–5.
- “Cosmopolite sound monitoring (cosmo): A study of urban sound event detection systems generalizing to multiple cities,” in Proc. ICASSP. IEEE, 2023, pp. 1–5.
- “An attention mechanism for musical instrument recognition,” in Proc. ISMIR, 2019, pp. 83–90.
- “CNN architectures for large-scale audio classification,” in Proc. ICASSP. IEEE, 2017, pp. 131–135.
- “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 2880–2894, 2020.
- “Efficient training of audio transformers with patchout,” in Proc. Interspeech. ISCA, 2022, pp. 2753–2757.
- “Byol for audio: Exploring pre-trained general-purpose audio representations,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 137–151, 2023.
- “BEATs: Audio pre-training with acoustic tokenizers,” in Proc. ICML. PMLR, 2023, vol. 202, pp. 5178–5193.
- “Look, listen, and learn more: Design choices for deep audio embeddings,” in Proc. ICASSP. IEEE, 2019, pp. 3852–3856.
- “Contrastive learning of general-purpose audio representations,” in Proc. ICASSP. IEEE, 2021, pp. 3875–3879.
- “AST: Audio Spectrogram Transformer,” in Proc. Interspeech. ISCA, 2021, pp. 571–575.
- “HEAR: Holistic evaluation of audio representations,” in NeurIPS 2021 Competitions and Demonstrations Track. PMLR, 2022, vol. 176, pp. 125–145.
- “Towards learning universal audio representations,” in Proc. ICASSP. IEEE, 2022, pp. 4593–4597.
- “SUPERB: Speech processing universal performance benchmark,” in Proc. Interspeech. ISCA, 2021, pp. 1194–1198.
- “Learning general audio representations with large-scale training of patchout audio transformers,” in HEAR. PMLR, 2021, vol. 166, pp. 65–89.
- “Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 26, pp. 379–393, 2018.
- “OpenMIC-2018: An open dataset for multiple instrument recognition,” in Proc. ISMIR, 2018, pp. 438–444.
- “A multi-device dataset for urban acoustic scene classification,” in Proc. DCASE, 2018, pp. 9–13.
- Karol J. Piczak, “ESC: dataset for environmental sound classification,” in Proc. Multimedia. ACM, 2015, pp. 1015–1018.
- “Adaptive pooling operators for weakly labeled sound event detection,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 26, pp. 2180–2193, Jul. 2018.
- “An attention-based approach to hierarchical multi-label music instrument classification,” in Proc. ICASSP. IEEE, 2023, pp. 1–5.
- “Bootstrap your own latent - A new approach to self-supervised learning,” in Proc. NeurIPS, 2020, pp. 6–12.
- “Training data-efficient image transformers & distillation through attention,” in Proc. ICML. PMLR, 2021, vol. 139, pp. 10347–10357.
- “FMA: A dataset for music analysis,” in Proc. ISMIR, 2017, pp. 316–323.
- “Receptive field regularization techniques for audio classification and tagging with deep convolutional neural networks,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 29, pp. 1987–2000, 2021.
- “Description and analysis of novelties introduced in DCASE task 4 2022 on the baseline system,” in Proc. DCASE, 2022, pp. 3–4.