
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models (2409.10999v1)

Published 17 Sep 2024 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: Audio LLMs can understand audio inputs and perform a range of audio-related tasks based on instructions, such as speech recognition and audio captioning, where the instructions are usually textual prompts. Audio LLMs are mostly initialized from pre-trained audio encoders and LLMs. Although these pre-trained components were developed to support multiple languages, audio LLMs are trained predominantly on English data, which may limit their usability to only English instructions or English speech inputs. First, this paper examines the performance of existing audio LLMs in an underserved language, using Thai as an example. This paper demonstrates that, despite being built on multilingual backbones, audio LLMs do not exhibit cross-lingual emergent abilities for low-resource languages. Second, this paper studies data mixtures for developing audio LLMs that are optimized for a target language as well as English. In addition, this paper integrates audio comprehension and speech instruction-following capabilities into a single unified model. Our experiments provide insights into data mixtures for enhancing instruction-following capabilities in both a low-resource language and English. Our model, Typhoon-Audio, outperforms existing open-source audio LLMs by a considerable margin, and it is comparable to the state-of-the-art Gemini-1.5-Pro in both English and Thai.
