
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models (2409.10999v1)

Published 17 Sep 2024 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: Audio LLMs can understand audio inputs and perform a range of audio-related tasks based on instructions, such as speech recognition and audio captioning, where the instructions are usually textual prompts. Audio LLMs are mostly initialized from pre-trained audio encoders and LLMs. Although these pre-trained components were developed to support multiple languages, audio LLMs are trained predominantly on English data, which may limit their usability to only English instructions or English speech inputs. First, this paper examines the performance of existing audio LLMs in an underserved language, using Thai as an example. This paper demonstrates that, despite being built on multilingual backbones, audio LLMs do not exhibit cross-lingual emergent abilities for low-resource languages. Second, this paper studies data mixtures for developing audio LLMs that are optimized for a target language as well as English. In addition, this paper integrates audio comprehension and speech instruction-following capabilities into a single unified model. Our experiments provide insights into data mixtures for enhancing instruction-following capabilities in both a low-resource language and English. Our model, Typhoon-Audio, outperforms existing open-source audio LLMs by a considerable margin, and it is comparable to the state-of-the-art Gemini-1.5-Pro in both English and Thai.
