AudioLog: LLMs-Powered Long Audio Logging with Hybrid Token-Semantic Contrastive Learning (2311.12371v2)

Published 21 Nov 2023 in eess.AS

Abstract: Previous studies in automated audio captioning have struggled to capture the complete temporal details of acoustic scenes and events within long audio sequences. This paper presents AudioLog, an LLM-powered audio logging system with hybrid token-semantic contrastive learning. Specifically, we propose to fine-tune the pre-trained hierarchical token-semantic audio Transformer by incorporating contrastive learning between hybrid acoustic representations. We then leverage LLMs to generate audio logs that summarize textual descriptions of the acoustic environment. Finally, we evaluate the AudioLog system on two datasets with both scene and event annotations. Experiments show that the proposed system achieves exceptional performance in acoustic scene classification and sound event detection, surpassing existing methods in the field. Further analysis of the prompts to LLMs demonstrates that AudioLog can effectively summarize long audio sequences. To the best of our knowledge, this approach is the first attempt to leverage LLMs for summarizing long audio sequences.
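The abstract does not spell out the form of the contrastive objective between the hybrid acoustic representations. A common choice for aligning two views of the same clip is a symmetric InfoNCE loss; the sketch below is a minimal illustration of that pattern, assuming each view is an L2-normalised embedding matrix where row i of both views comes from the same audio clip. The function name and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce(token_emb, semantic_emb, temperature=0.07):
    """Symmetric InfoNCE loss between two embedding views.

    token_emb, semantic_emb: (batch, dim) L2-normalised arrays;
    row i of each matrix is assumed to come from the same clip,
    so matching rows are positives and all other pairs negatives.
    """
    # Cosine-similarity logits for every (token, semantic) pair.
    logits = token_emb @ semantic_emb.T / temperature
    labels = np.arange(logits.shape[0])  # positives on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the token->semantic and semantic->token directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Matched views (identical rows) yield a near-zero loss, while mismatched pairings are penalised, which is the behaviour a contrastive fine-tuning stage relies on; the actual AudioLog training recipe may differ in loss form, temperature, and batching.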
