Exploring the Power of Pure Attention Mechanisms in Blind Room Parameter Estimation (2402.16003v2)

Published 25 Feb 2024 in eess.AS

Abstract: Dynamic parameterization of acoustic environments has drawn widespread attention in the field of audio processing. Precise representation of local room acoustic characteristics is crucial when designing audio filters for various audio rendering applications. Key parameters in this context include reverberation time (RT60) and geometric room volume. In recent years, neural networks have been extensively applied to blind room parameter estimation. However, it remains an open question whether pure attention mechanisms can achieve superior performance in this task. To address this question, this study performs blind room parameter estimation from monaural noisy speech signals. Various model architectures are investigated, including a proposed attention-based model: a convolution-free Audio Spectrogram Transformer that combines patch splitting, attention mechanisms, and cross-modality transfer learning from a pretrained Vision Transformer. Experimental results suggest that the proposed model, relying purely on attention without any convolution, achieves significantly improved performance across various room parameter estimation tasks, especially with the help of dedicated pretraining and data augmentation schemes. The model also demonstrates greater adaptability and robustness than existing methods when handling variable-length audio inputs.
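To make the described pipeline concrete, below is a minimal, illustrative PyTorch sketch of a convolution-free, AST-style regressor of the kind the abstract outlines: a log-mel spectrogram is split into fixed-size patches, each patch is linearly projected, a [CLS] token and learned positional embeddings are added, and a Transformer encoder feeds a regression head that predicts a scalar room parameter (e.g., RT60 or log room volume). All names (PatchEmbed, ASTRegressor) and dimensions (128 mel bins, 16x16 patches, ViT-Base sizes) are assumptions for illustration, not the authors' implementation, and the sketch omits the cross-modality initialization from a pretrained Vision Transformer as well as the pretraining and augmentation schemes.

```python
# Illustrative sketch only (hypothetical names and sizes), not the paper's code:
# a pure-attention (convolution-free) spectrogram regressor for a room parameter.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a spectrogram into non-overlapping patches and project them
    linearly -- no convolutional feature extractor anywhere."""
    def __init__(self, patch=16, dim=768):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch, dim)

    def forward(self, spec):                        # spec: (B, n_mels, T)
        p = self.patch
        B, F, T = spec.shape
        spec = spec[:, : F - F % p, : T - T % p]    # trim to patch multiples
        x = spec.unfold(1, p, p).unfold(2, p, p)    # (B, F//p, T//p, p, p)
        x = x.contiguous().view(B, -1, p * p)       # (B, num_patches, p*p)
        return self.proj(x)                         # (B, num_patches, dim)

class ASTRegressor(nn.Module):
    """Transformer encoder over spectrogram patches; a [CLS] token summarizes
    the sequence and a linear head predicts one room parameter."""
    def __init__(self, dim=768, depth=12, heads=12, max_patches=1024):
        super().__init__()
        self.embed = PatchEmbed(dim=dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, max_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 1)

    def forward(self, spec):
        x = self.embed(spec)                              # (B, N, dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                    # prepend [CLS]
        x = x + self.pos[:, : x.size(1)]                  # slice to actual N+1
        x = self.encoder(x)
        return self.head(x[:, 0]).squeeze(-1)             # regress from [CLS]

# Usage: a batch of two 128-bin log-mel spectrograms, 160 frames each.
# y = ASTRegressor()(torch.randn(2, 128, 160))            # y: (2,)
```

Because the positional embedding is sliced to the actual number of patches, the same model accepts spectrograms of different lengths, which mirrors the variable-length adaptability the abstract claims for the attention-based approach.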

In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kuttruff, H.: Room acoustics. In: CRC Press (2016) Eaton et al. [2016] Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. 
IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. 
In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. 
In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. 
[2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. 
Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. 
[2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 
4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. 
arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. 
[2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 
4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 
864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. 
EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. 
[2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. 
[2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. 
In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. 
[2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 
4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. 
[2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. 
arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
  2. Mohammadiha, N., Doclo, S.: Speech dereverberation using non-negative convolutive transfer function and spectro-temporal modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(2), 276–289 (2016) Cecchi et al. [2018] Cecchi, S., Carini, A., Spors, S.: Room response equalization – a review. Applied Sciences 8(1) (2018) Jin and Kleijn [2015] Jin, W., Kleijn, W.B.: Theory and design of multizone soundfield reproduction using sparse methods. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23(12), 2343–2355 (2015) Jin [2016] Jin, W.: Adaptive reverberation cancelation for multizone soundfield reproduction using sparse methods. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 509–513 (2016) Neidhardt et al. [2022] Neidhardt, A., Schneiderwind, C., Klein, F.: Perceptual matching of room acoustics for auditory augmented reality in small rooms - literature review and theoretical framework. Trends in Hearing 26, 23312165221092919 (2022) Jot and Lee [2016] Jot, J.-M., Lee, K.S.: Augmented reality headphone environment rendering. In: Audio Engineering Society Conference: 2016 AES International Conference on Audio for Virtual and Augmented Reality (2016). Audio Engineering Society Kuttruff [2016] Kuttruff, H.: Room acoustics. In: CRC Press (2016) Eaton et al. [2016] Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. 
[2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009). https://doi.org/10.1109/TASL.2009.2015084
37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech 2019, pp. 2613–2617 (2019)
38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Proc. NIPS (2017)
40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: Proc. 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. 
[2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. 
IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. 
[2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. 
IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. 
[2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. 
[2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 
4933–4940 (2021). IEEE
He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
Tan and Le [2019] Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE
Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023)
Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021)
Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE
Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020)
Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using CRNNs. In: Interspeech (2020)
Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018)
Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
Jeub et al. [2009] Jeub, M., Schäfer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: Proc. 16th International Conference on Digital Signal Processing, pp. 1–5 (2009)
Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 863–876 (2019)
Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168 (2010)
Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
Murphy and Shelley [2010] Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965). https://doi.org/10.1121/1.1909343
Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017). https://doi.org/10.48550/arXiv.1710.04196
Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009). https://doi.org/10.1109/TASL.2009.2015084
Park et al. [2019] Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Interspeech, pp. 2613–2617 (2019)
Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
[2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jot, J.-M., Lee, K.S.: Augmented reality headphone environment rendering. In: Audio Engineering Society Conference: 2016 AES International Conference on Audio for Virtual and Augmented Reality (2016). Audio Engineering Society Kuttruff [2016] Kuttruff, H.: Room acoustics. In: CRC Press (2016) Eaton et al. [2016] Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). 
IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. 
Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kuttruff, H.: Room acoustics. In: CRC Press (2016) Eaton et al. [2016] Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. 
[2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. 
[2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. 
Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. 
arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). 
https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. 
[2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. 
[2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). 
[2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. 
[2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. 
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. 
[2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 
4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. 
In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). 
PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. 
International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
  4. Jin, W., Kleijn, W.B.: Theory and design of multizone soundfield reproduction using sparse methods. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23(12), 2343–2355 (2015)
  5. Jin, W.: Adaptive reverberation cancelation for multizone soundfield reproduction using sparse methods. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 509–513 (2016)
  6. Neidhardt, A., Schneiderwind, C., Klein, F.: Perceptual matching of room acoustics for auditory augmented reality in small rooms - literature review and theoretical framework. Trends in Hearing 26, 23312165221092919 (2022)
  7. Jot, J.-M., Lee, K.S.: Augmented reality headphone environment rendering. In: 2016 AES International Conference on Audio for Virtual and Augmented Reality (2016). Audio Engineering Society
  8. Kuttruff, H.: Room Acoustics. CRC Press (2016)
  9. Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016). https://doi.org/10.1109/TASLP.2016.2577502
  10. de M. Prego, T., de Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954
  11. Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015)
  12. Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014)
  13. Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp.
841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. 
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. 
[2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. 
In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. 
In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. 
[2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. 
[2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. 
In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. 
IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. 
IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
  5. Jin, W.: Adaptive reverberation cancelation for multizone soundfield reproduction using sparse methods. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 509–513 (2016) Neidhardt et al. [2022] Neidhardt, A., Schneiderwind, C., Klein, F.: Perceptual matching of room acoustics for auditory augmented reality in small rooms - literature review and theoretical framework. Trends in Hearing 26, 23312165221092919 (2022) Jot and Lee [2016] Jot, J.-M., Lee, K.S.: Augmented reality headphone environment rendering. In: Audio Engineering Society Conference: 2016 AES International Conference on Audio for Virtual and Augmented Reality (2016). Audio Engineering Society Kuttruff [2016] Kuttruff, H.: Room acoustics. In: CRC Press (2016) Eaton et al. [2016] Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). 
IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. 
Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Neidhardt, A., Schneiderwind, C., Klein, F.: Perceptual matching of room acoustics for auditory augmented reality in small rooms - literature review and theoretical framework. Trends in Hearing 26, 23312165221092919 (2022) Jot and Lee [2016] Jot, J.-M., Lee, K.S.: Augmented reality headphone environment rendering. In: Audio Engineering Society Conference: 2016 AES International Conference on Audio for Virtual and Augmented Reality (2016). Audio Engineering Society Kuttruff [2016] Kuttruff, H.: Room acoustics. In: CRC Press (2016) Eaton et al. [2016] Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. 
[2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. 
In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jot, J.-M., Lee, K.S.: Augmented reality headphone environment rendering. In: Audio Engineering Society Conference: 2016 AES International Conference on Audio for Virtual and Augmented Reality (2016). Audio Engineering Society Kuttruff [2016] Kuttruff, H.: Room acoustics. In: CRC Press (2016) Eaton et al. [2016] Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. 
[2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. 
In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kuttruff, H.: Room acoustics. In: CRC Press (2016) Eaton et al. [2016] Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). 
4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. 
EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). 
IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. 
Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. 
[2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. 
[2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. 
[2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. 
IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). 
IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. 
Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. 
[2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. 
[2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. 
[2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. 
In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 
864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. 
In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 
864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. 
[2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. 
In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. 
[2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. 
IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
  7. Jot, J.-M., Lee, K.S.: Augmented reality headphone environment rendering. In: Audio Engineering Society Conference: 2016 AES International Conference on Audio for Virtual and Augmented Reality (2016). Audio Engineering Society Kuttruff [2016] Kuttruff, H.: Room acoustics. In: CRC Press (2016) Eaton et al. [2016] Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. 
IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kuttruff, H.: Room acoustics. In: CRC Press (2016) Eaton et al. [2016] Eaton, J., Gaubitch, N.D., Moore, A.H., Naylor, P.A.: Estimation of room acoustic parameters: The ACE challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(10), 1681–1693 (2016) https://doi.org/10.1109/TASLP.2016.2577502 de M. Prego et al. [2015] M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954 Loellmann et al. [2015] Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015) Moore et al. [2014] Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). 
4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. 
EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). 
IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. 
Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. 
[2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. 
[2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. 
[2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. 
IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. 
In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 
864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014) Peters et al. [2012] Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012) Gamper and Tashev [2018] Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. 
Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018) Genovese et al. [2019] Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. 
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). 
IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. 
Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 
226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. 
[2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. 
IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. 
In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. 
IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. 
Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. 
arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. 
arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. 
In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. 
[2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168 (2010)
Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
Murphy and Shelley [2010] Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965). https://doi.org/10.1121/1.1909343
Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017). https://doi.org/10.48550/arXiv.1710.04196
Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009). https://doi.org/10.1109/TASL.2009.2015084
Park et al. [2019] Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech, pp. 2613–2617 (2019)
Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
Tan and Le [2019] Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
[2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. 
[2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). 
IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. 
Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. 
[2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. 
[2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 
4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. 
IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. 
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). 
https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. 
[2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. 
[2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. 
In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. 
EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. 
Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. 
International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
10. de M. Prego, T., Lima, A.A., Zambrano-López, R., Netto, S.L.: Blind estimators for reverberation time and direct-to-reverberant energy ratio using subband speech decomposition. In: 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2015). https://doi.org/10.1109/WASPAA.2015.7336954
11. Loellmann, H., Brendel, A., Vary, P., Kellermann, W.: Single-channel maximum-likelihood T60 estimation exploiting subband information. arXiv preprint arXiv:1511.04063 (2015)
12. Moore, A., Brookes, M., Naylor, P.: Room identification using roomprints. Journal of the Audio Engineering Society (2014)
13. Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012)
14. Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018)
15. Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE
16. Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE
17. Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE
18. Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023)
19. Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021)
20. Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE
21. Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020)
22. Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using CRNNs. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121
23. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
24. Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition. In: Interspeech (2018)
26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
29. Jeub, M., Schafer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: 16th International Conference on Digital Signal Processing, pp. 1–5 (2009)
30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. ICASSP, pp. 165–168 (2010). IEEE
32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
33. Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965). https://doi.org/10.1121/1.1909343
35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017). https://doi.org/10.48550/arXiv.1710.04196
36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009). https://doi.org/10.1109/TASL.2009.2015084
37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech 2019, pp. 2613–2617 (2019)
38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. 
[2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. 
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. 
arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. 
In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. 
EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 
571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. 
IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. 
Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE Bryan [2020] Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE Götz et al. [2022] Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. 
[2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE Saini and Peissig [2023] Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. 
In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023) Srivastava et al. [2021] Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. 
[2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. 
In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021) Christopher et al. [2023] Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. 
[2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 
4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE Callens and Cernak [2020] Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020) Deng et al. [2020] Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. 
[2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. 
arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. 
[2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 
571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. 
[2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. 
arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. 
arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. 
In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. 
[2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. 
IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
  13. Peters, N., Lei, H., Friedland, G.: Name that room: Room identification using acoustic features in a recording. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 841–844 (2012)
  14. Gamper, H., Tashev, I.J.: Blind reverberation time estimation using a convolutional neural network. In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 136–140 (2018)
  15. Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE
  16. Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE
  17. Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE
  18. Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023)
  19. Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021)
  20. Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE
  21. Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020)
  22. Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using CRNNs. In: Proc. Interspeech 2020 (2020). https://api.semanticscholar.org/CorpusID:226202121
  23. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
  24. Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
  25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition. In: Proc. Interspeech 2018 (2018)
  26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
  27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
  28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
  29. Jeub, M., Schäfer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: 2009 16th International Conference on Digital Signal Processing, pp. 1–5 (2009)
  30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
  31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. ICASSP, pp. 165–168 (2010). IEEE
  32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
  33. Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
  34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965). https://doi.org/10.1121/1.1909343
  35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017). https://doi.org/10.48550/arXiv.1710.04196
  36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009). https://doi.org/10.1109/TASL.2009.2015084
  37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech 2019, pp. 2613–2617 (2019)
  38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
  39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
  40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
  42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
  43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
  44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
  45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
  46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. 
arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. 
arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. 
In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. 
[2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. 
IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using crnns. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121 Kong et al. [2020] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. 
IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. 
[2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 
864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. 
International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
  15. Genovese, A.F., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.J.: Blind room volume estimation from single-channel noisy speech. In: Proc. ICASSP, pp. 231–235 (2019). IEEE
  16. Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE
  17. Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE
  18. Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023)
  19. Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021)
  20. Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE
  21. Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020)
  22. Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using CRNNs. In: Proc. Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121
  23. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
  24. Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
  25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition. In: Proc. Interspeech (2018)
  26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
  27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
  28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
  29. Jeub, M., Schafer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: Proc. 16th International Conference on Digital Signal Processing, pp. 1–5 (2009)
  30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13, 864–876 (2019)
  31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. ICASSP, pp. 165–168 (2010). IEEE
  32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
  33. Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
  34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343
  35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 [cs.SD] (2017)
  36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084
  37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech, pp. 2613–2617 (2019)
  38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
  39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
  40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
  42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
  43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
  44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
  45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
  46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. 
arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. 
arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. 
In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. 
[2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. 
IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
16. Bryan, N.J.: Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation. In: Proc. ICASSP, pp. 1–5 (2020). IEEE
17. Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE
18. Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023)
19. Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021)
20. Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE
21. Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020)
22. Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using CRNNs. In: Proc. Interspeech (2020)
23. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
24. Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition. In: Proc. Interspeech (2018)
26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
29. Jeub, M., Schäfer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: Proc. International Conference on Digital Signal Processing (DSP), pp. 1–5 (2009)
30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13, 864–876 (2019)
31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. ICASSP, pp. 165–168 (2010). IEEE
32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: A calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
33. Murphy, D., Shelley, S.: OpenAIR: An interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343
35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017)
36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084
37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech, pp. 2613–2617 (2019)
38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
[2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 
571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. 
The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. 
[2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. 
In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. 
International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
17. Götz, P., Tuna, C., Walther, A., Habets, E.A.P.: Blind reverberation time estimation in dynamic acoustic conditions. In: Proc. ICASSP, pp. 581–585 (2022). IEEE
18. Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023)
19. Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021)
20. Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE
21. Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020)
22. Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using CRNNs. In: Proc. Interspeech 2020 (2020)
23. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
24. Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition. In: Proc. Interspeech 2018 (2018)
26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
29. Jeub, M., Schafer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: 2009 16th International Conference on Digital Signal Processing, pp. 1–5 (2009). IEEE
30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 863–876 (2019)
31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. ICASSP, pp. 165–168 (2010). IEEE
32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
33. Murphy, D., Shelley, S.: OpenAIR: An interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343
35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017) https://doi.org/10.48550/arXiv.1710.04196
36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084
37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech 2019, pp. 2613–2617 (2019)
38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS) (2017)
40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. 
EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. 
Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
18. Saini, S., Peissig, J.: Blind room acoustic parameters estimation using mobile audio transformer. In: 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5 (2023)
19. Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021)
20. Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE
21. Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020)
22. Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using CRNNs. In: Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121
23. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
24. Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018)
26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
29. Jeub, M., Schafer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: 16th International Conference on Digital Signal Processing, pp. 1–5 (2009)
30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. ICASSP, pp. 165–168 (2010). IEEE
32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
33. Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965). https://doi.org/10.1121/1.1909343
35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017). https://doi.org/10.48550/arXiv.1710.04196
36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009). https://doi.org/10.1109/TASL.2009.2015084
37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech, pp. 2613–2617 (2019)
38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: Psla: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021) Li et al. [2018] Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. 
[2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. 
[2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. 
[2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. 
[2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. 
[2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 
4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. 
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. 
In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. 
[2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
19. Srivastava, P., Deleforge, A., Vincent, E.: Blind room parameter estimation using multiple multichannel speech recordings. In: 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 226–230 (2021)
20. Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE
21. Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020)
22. Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using CRNNs. In: Proc. Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121
23. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
24. Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018)
26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
29. Jeub, M., Schafer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: International Conference on Digital Signal Processing, pp. 1–5 (2009)
30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168 (2010)
32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
33. Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965). https://doi.org/10.1121/1.1909343
35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017). https://doi.org/10.48550/arXiv.1710.04196
36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009). https://doi.org/10.1109/TASL.2009.2015084
37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech, pp. 2613–2617 (2019)
38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. 
In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). 
PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. 
International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
20. Christopher, I., Mehrabi, A., Jin, W.: Blind acoustic room parameter estimation using phase features. In: Proc. ICASSP, Rhodes Island, Greece, pp. 1–5 (2023). IEEE
21. Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020)
22. Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using CRNNs. In: Proc. Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121
23. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
24. Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018)
26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
29. Jeub, M., Schäfer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: Proc. 16th International Conference on Digital Signal Processing, pp. 1–5 (2009)
30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. ICASSP, pp. 165–168 (2010). IEEE
32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
33. Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. In: Audio Engineering Society Convention 129 (2010). Audio Engineering Society
34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965). https://doi.org/10.1121/1.1909343
35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulation and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017). https://doi.org/10.48550/arXiv.1710.04196
36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009). https://doi.org/10.1109/TASL.2009.2015084
37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech, pp. 2613–2617 (2019)
38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018) Rybakov et al. [2020] Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. 
[2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020) Gong et al. [2021] Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 
864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021) Wang et al. [2023] Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. 
International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023) Jeub et al. [2009] Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Jeub, M., Schafer, M., Vary, P.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. Digital Signal Processing, 1–5 (2009) Szöke et al. [2019] Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binauralroom impulse response database for the evaluation of dereverberation algorithms. In: IEEE Journal of Selected Topics in Signal Processing, vol. 13, pp. 864–876 (2019) Stewart and Sandler [2010] Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Stewart, R., Sandler, M.: Database of omnidirectional and b-format room impulse responses. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 13(4), 165–168 (2010) Carlo et al. [2021] Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dechorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. 
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. 
International Journal of Electronics and Telecommunications 60, 321–326 (2014)
Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
Callens, P., Cernak, M.: Joint blind room acoustic characterization from speech and music signals using convolutional recurrent neural networks. arXiv preprint arXiv:2010.11167 (2020)
Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using CRNNs. In: Interspeech (2020)
Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018)
Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
Jeub, M., Schäfer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: International Conference on Digital Signal Processing, pp. 1–5 (2009)
Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168 (2010)
Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: A calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
Murphy, D., Shelley, S.: OpenAIR: An interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965). https://doi.org/10.1121/1.1909343
Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017). https://doi.org/10.48550/arXiv.1710.04196
Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009). https://doi.org/10.1109/TASL.2009.2015084
Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Interspeech, pp. 2613–2617 (2019)
PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
  22. Deng, S., Mack, W., Habets, E.: Online blind reverberation time estimation using CRNNs. In: Proc. Interspeech (2020). https://api.semanticscholar.org/CorpusID:226202121
  23. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
  24. Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
  25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018)
  26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
  27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
  28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
  29. Jeub, M., Schafer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: Proc. Digital Signal Processing, pp. 1–5 (2009)
  30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13, 864–876 (2019)
  31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168 (2010)
  32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: A calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
  33. Murphy, D., Shelley, S.: OpenAIR: An interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
  34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343
  35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017)
  36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084
  37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech, pp. 2613–2617 (2019)
  38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
  39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
  40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
  42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
  43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
  44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
  45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
  46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. 
[2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
  23. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., Plumbley, M.D.: PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2880–2894 (2020)
  24. Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
  25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition. In: Proc. Interspeech (2018)
  26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
  27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
  28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
  29. Jeub, M., Schäfer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: Proc. 16th International Conference on Digital Signal Processing, pp. 1–5 (2009)
  30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
  31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168 (2010)
  32. Di Carlo, D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
  33. Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
  34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343
  35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017)
  36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084
  37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech, pp. 2613–2617 (2019)
  38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
  39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS) (2017)
  40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
  42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
  43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
  44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
  45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
  46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
EURASIP Journal on Audio, Speech, and Music Processing, Springer 2021(1), 1–15 (2021) Murphy and Shelley [2010] Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. 
[2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
[2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
24. Gong, Y., Chung, Y.-A., Glass, J.: PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3292–3306 (2021)
25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition. In: Proc. Interspeech (2018)
26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
29. Jeub, M., Schafer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: 16th International Conference on Digital Signal Processing, pp. 1–5 (2009)
30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168 (2010)
32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: A calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
33. Murphy, D., Shelley, S.: OpenAIR: An interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343
35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017) https://doi.org/10.48550/arXiv.1710.04196
36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084
37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech 2019, pp. 2613–2617 (2019)
38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS) (2017)
40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
[2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Murphy, D., Shelley, S.: Openair: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010) Schroeder [1965] Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. 
Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. 
International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
  25. Li, P., Song, Y., McLoughlin, I.V., Guo, W., Dai, L.-R.: An attention pooling based representation learning method for speech emotion recognition (2018)
  26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
  27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
  28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
  29. Jeub, M., Schafer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: Proc. 16th International Conference on Digital Signal Processing (DSP), pp. 1–5 (2009)
  30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
  31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168 (2010)
  32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
  33. Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
  34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343
  35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017) https://doi.org/10.48550/arXiv.1710.04196
  36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084
  37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech 2019, pp. 2613–2617 (2019)
  38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
  39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS) (2017)
  40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
  42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
  43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
  44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
  45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
  46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. 
In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. 
[2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
26. Rybakov, O., Kononenko, N., Subrahmanya, N., Visontai, M., Laurenzo, S.: Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720 (2020)
27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
29. Jeub, M., Schäfer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: Proc. 16th International Conference on Digital Signal Processing (DSP), pp. 1–5 (2009)
30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168 (2010)
32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: A calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
33. Murphy, D., Shelley, S.: OpenAIR: An interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343
35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017) https://doi.org/10.48550/arXiv.1710.04196
36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084
37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech 2019, pp. 2613–2617 (2019)
38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS) (2017)
40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
27. Gong, Y., Chung, Y.-A., Glass, J.: AST: Audio Spectrogram Transformer. In: Proc. Interspeech 2021, pp. 571–575 (2021)
  28. Wang, C., Jia, M., Li, M., Bao, C., Jin, W.: Attention is all you need for blind room volume estimation. arXiv preprint arXiv:2309.13504 (2023)
  29. Jeub, M., Schafer, M., Vary, P.: A binaural room impulse response database for the evaluation of dereverberation algorithms. In: 16th International Conference on Digital Signal Processing (DSP), pp. 1–5 (2009)
  30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13(4), 864–876 (2019)
  31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168 (2010)
  32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
  33. Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
  34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965). https://doi.org/10.1121/1.1909343
  35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017). https://doi.org/10.48550/arXiv.1710.04196
  36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009). https://doi.org/10.1109/TASL.2009.2015084
  37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Proc. Interspeech, pp. 2613–2617 (2019)
  38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
  39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS) (2017)
  40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
  42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
  43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
  44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
  45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
  46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
  30. Szöke, I., Skácel, M., Mošner, L., Paliesek, J., Černocký, J.: A binaural room impulse response database for the evaluation of dereverberation algorithms. IEEE Journal of Selected Topics in Signal Processing 13, 864–876 (2019)
  31. Stewart, R., Sandler, M.: Database of omnidirectional and B-format room impulse responses. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168 (2010)
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). 
PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
32. Carlo, D.D., Tandeitnik, P., Foy, C., Bertin, N., Deleforge, A., Gannot, S.: dEchorate: a calibrated room impulse response dataset for echo-aware signal processing. EURASIP Journal on Audio, Speech, and Music Processing 2021(1), 1–15 (2021)
33. Murphy, D., Shelley, S.: OpenAIR: an interactive auralization web resource and database. Journal of the Audio Engineering Society (2010)
34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965). https://doi.org/10.1121/1.1909343
35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv preprint arXiv:1710.04196 (2017)
36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009). https://doi.org/10.1109/TASL.2009.2015084
37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Interspeech, pp. 2613–2617 (2019)
38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS) (2017)
40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. 
In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
  34. Schroeder, M.R.: New method of measuring reverberation time. The Journal of the Acoustical Society of America 37(3), 409–412 (1965) https://doi.org/10.1121/1.1909343 https://doi.org/10.1121/1.1909343 Scheibler et al. [2017] Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. 
[2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. 
arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. 
[2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. 
[2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
  35. Scheibler, R., Bezzam, E., Dokmanić, I.: Pyroomacoustics: A Python package for audio room simulations and array processing algorithms. arXiv e-prints, 1710–04196 (2017) https://doi.org/10.48550/arXiv.1710.04196 arXiv:1710.04196 [cs.SD] Krishnamurthy and Hansen [2009] Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. 
[2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. 
In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. 
[2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR
  36. Krishnamurthy, N., Hansen, J.H.L.: Babble noise: Modeling, analysis, and applications. IEEE Transactions on Audio, Speech, and Language Processing 17(7), 1394–1407 (2009) https://doi.org/10.1109/TASL.2009.2015084 Park et al. [2019] Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. [2021] Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021) Gwardys and Grzywczak [2014] Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014) Guzhov et al. [2021] Guzhov, A., Raue, F., Hees, J., Dengel, A.: Esrsuppesnet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE He et al. [2019] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019) Tan and Le [2019] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR Park, D.S., W. Chan, Y.Z., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech, 2613–2617 (2019) Srivastava et al. [2022] Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022) Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., N.Gomez, A., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017) Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR Sun et al. 
  37. Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition. In: Interspeech, pp. 2613–2617 (2019)
  38. Srivastava, P., Deleforge, A., Vincent, E.: Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators. arXiv preprint arXiv:2207.09133 (2022)
  39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
  40. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  41. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357 (2021). PMLR
  42. Sun, H., Liu, X., Xu, K., Miao, J., Luo, Q.: Emergency vehicles audio detection and localization in autonomous driving. arXiv preprint arXiv:2109.14797 (2021)
  43. Gwardys, G., Grzywczak, D.: Deep image features in music information retrieval. International Journal of Electronics and Telecommunications 60, 321–326 (2014)
  44. Guzhov, A., Raue, F., Hees, J., Dengel, A.: ESResNet: Environmental sound classification based on visual domain models. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 4933–4940 (2021). IEEE
  45. He, K., Girshick, R., Dollár, P.: Rethinking ImageNet pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4918–4927 (2019)
  46. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114 (2019). PMLR