Can Large Language Models Understand Spatial Audio? (2406.07914v2)
Abstract: This paper explores enabling LLMs to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance understanding of 3D environments via audio. We study three spatial audio tasks: sound source localization (SSL), far-field speech recognition (FSR), and localization-informed speech extraction (LSE), achieving notable progress in each. For SSL, our approach achieves an MAE of $2.70^{\circ}$ on the Spatial LibriSpeech dataset, substantially surpassing the prior benchmark of about $6.60^{\circ}$. Moreover, our model can employ spatial cues to improve FSR accuracy and perform LSE by selectively attending to sounds originating from a direction specified via text prompts, even amidst overlapping speech. These findings highlight the potential of adapting LLMs to grasp physical audio concepts, paving the way for LLM-based agents in 3D environments.
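
To make the SSL metric above concrete, below is a minimal sketch of a mean absolute angular error computation for direction-of-arrival estimates, assuming predictions and references are given as azimuth/elevation pairs in degrees. This is an illustration only, not the authors' evaluation code, and the function names (`to_unit_vector`, `angular_mae`) are hypothetical.

```python
# Minimal sketch: mean absolute angular error (in degrees) between predicted
# and reference source directions, given as azimuth/elevation pairs.
# Illustrative only; not the paper's evaluation code.
import numpy as np


def to_unit_vector(azimuth_deg: np.ndarray, elevation_deg: np.ndarray) -> np.ndarray:
    """Convert azimuth/elevation angles (degrees) to 3D unit direction vectors."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    return np.stack(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)],
        axis=-1,
    )


def angular_mae(pred_az, pred_el, ref_az, ref_el) -> float:
    """Mean absolute angle (degrees) between predicted and reference directions."""
    p = to_unit_vector(np.asarray(pred_az, float), np.asarray(pred_el, float))
    r = to_unit_vector(np.asarray(ref_az, float), np.asarray(ref_el, float))
    # Angle between unit vectors; clipping guards against floating-point overshoot.
    cos_sim = np.clip(np.sum(p * r, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_sim)).mean())


if __name__ == "__main__":
    # Toy example: two predictions a few degrees away from their references.
    print(angular_mae([10.0, 200.0], [0.0, 15.0], [12.0, 197.0], [1.0, 14.0]))
```

Measuring the error as the angle between unit direction vectors avoids wraparound artifacts at the azimuth boundary (e.g. 359° vs. 1°), which a naive absolute difference of angles would overstate.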
- Changli Tang (15 papers)
- Wenyi Yu (14 papers)
- Guangzhi Sun (51 papers)
- Xianzhao Chen (10 papers)
- Tian Tan (21 papers)
- Wei Li (1121 papers)
- Jun Zhang (1008 papers)
- Lu Lu (189 papers)
- Zejun Ma (78 papers)
- Yuxuan Wang (239 papers)
- Chao Zhang (907 papers)