Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems
Introduction
Recent advances have shown that LLMs can leverage vast amounts of text data to perform remarkably well across a multitude of linguistic tasks. Their application to speech technologies, however, is often limited by narrow language coverage. To address this gap, a novel approach is proposed that leverages the deep multilingual understanding of LLMs to initialize a multi-modal Dual Encoder (DE) retrieval system that matches speech and text across languages. Remarkably, this framework requires no speech data during the LLM's pre-training phase, offering an innovative pathway for exploiting LLMs for speech-text matching across a wide array of languages.
Methodology
The core methodology is to train a transformer-based DE model that can encode both speech and text. Raw speech is first passed through a pre-trained speech encoder, and the resulting continuous representations are discretized via k-means clustering into audio tokens. The pivotal innovation lies in extending the LLM's embedding layer to support these audio tokens; the LLM is then converted into a retrieval system by training with a contrastive loss. This approach integrates the speech modality seamlessly into the predominantly text-based framework of LLMs.
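As a concrete illustration, the following minimal PyTorch sketch shows the two mechanisms described above: a shared embedding table extended with rows for k-means audio tokens, and a symmetric in-batch contrastive loss over pooled speech and text embeddings. All dimensions, token counts, and names (DualEncoder, contrastive_loss, and so on) are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Hypothetical sketch (not the authors' code). Audio token ids are assumed
# to come from k-means cluster assignments of speech-encoder features.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 32_000       # assumed text vocabulary size
NUM_AUDIO_TOKENS = 1_024  # assumed number of k-means clusters
DIM = 512                 # assumed embedding width

class DualEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared embedding table: text ids occupy [0, TEXT_VOCAB);
        # audio ids are offset into the newly appended rows.
        self.embed = nn.Embedding(TEXT_VOCAB + NUM_AUDIO_TOKENS, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def encode(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        # Mean-pool over the sequence and L2-normalize for cosine similarity.
        return F.normalize(h.mean(dim=1), dim=-1)

def contrastive_loss(speech_vec, text_vec, temperature=0.05):
    # In-batch softmax over speech-text similarities: matched pairs are
    # positives, every other pairing in the batch is a negative.
    logits = speech_vec @ text_vec.T / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy forward pass over a batch of 4 paired (audio-token, text-token) sequences.
model = DualEncoder()
audio_ids = torch.randint(TEXT_VOCAB, TEXT_VOCAB + NUM_AUDIO_TOKENS, (4, 50))
text_ids = torch.randint(0, TEXT_VOCAB, (4, 20))
loss = contrastive_loss(model.encode(audio_ids), model.encode(text_ids))
```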
Experimental Design and Results
Speech-to-Text Retrieval Task
Evaluation on the speech-to-text retrieval task used the FLEURS dataset, which covers 102 languages. The model achieved a 10% absolute improvement in Recall@1, averaged across languages, over existing systems trained explicitly on all 102 languages. This result is especially notable because the model was trained on only a 21-language subset, demonstrating a strong ability to generalize across linguistic barriers and significantly outperforming baseline systems such as mSLAM across various metrics.
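For reference, Recall@1 here is the fraction of speech utterances whose nearest text candidate, by cosine similarity, is the matched transcription. A minimal sketch, assuming L2-normalized embeddings in which row i of each matrix is a matched speech/text pair:

```python
# Illustrative metric code; variable names are assumptions, not the paper's.
import torch

def recall_at_1(speech_vecs: torch.Tensor, text_vecs: torch.Tensor) -> float:
    sims = speech_vecs @ text_vecs.T             # cosine similarity matrix
    nearest = sims.argmax(dim=1)                 # best text candidate per utterance
    correct = (nearest == torch.arange(len(sims))).float()
    return correct.mean().item()

# The paper's headline number would correspond to computing this per
# language and averaging the resulting Recall@1 values across languages.
```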
Cross-Modal and Cross-Lingual Translation Retrieval Task
On cross-lingual speech-to-text translation retrieval, the model shows remarkable zero-shot capabilities, which are further improved by incorporating readily available machine translation data: notable gains in 4-gram corpus BLEU are reported for languages including French, German, Dutch, and Polish. These findings illustrate the model's ability to exploit cross-lingual cues and set a precedent for using machine translation data to bolster cross-lingual retrieval performance.
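Retrieval quality in this setting can be scored with standard corpus-level 4-gram BLEU over the retrieved translations. A minimal sketch using the sacrebleu library; the data and the retrieval step itself are stand-ins, not the paper's evaluation code:

```python
# Hypothetical scoring sketch: for each utterance, some retrieval step has
# already produced a candidate text in the target language; we score those
# candidates against gold reference translations with corpus BLEU.
import sacrebleu

def translation_retrieval_bleu(retrieved_texts, reference_translations):
    # sacrebleu's corpus_bleu defaults to 4-gram BLEU; references are passed
    # as a list of reference streams.
    return sacrebleu.corpus_bleu(retrieved_texts, [reference_translations]).score

# Toy example with made-up strings:
hyps = ["the cat sits on the mat"]
refs = ["the cat sat on the mat"]
print(translation_retrieval_bleu(hyps, refs))
```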
Theoretical and Practical Implications
This research opens avenues for significantly advancing cross-modal and cross-lingual retrieval systems. Theoretically, it underscores the potential of text-centric LLMs to bridge modalities, offering a novel approach to dual-encoder training that retains multilingual capabilities. Practically, it offers a scalable remedy for the language disparity in speech technologies and a foundation for future work on broadening the linguistic accessibility of speech-based applications. Moreover, the model's ability to incorporate translation data to strengthen cross-lingual retrieval could pave the way for richer, more linguistically inclusive speech technologies.
Conclusion
Transforming LLMs into cross-modal and cross-lingual retrieval systems represents a significant stride toward overcoming the linguistic limitations of current speech technologies. By exploiting the linguistic diversity already captured in LLMs, this work not only broadens the horizon for speech-text matching across myriad languages but also exemplifies how machine translation data can enrich a model's cross-lingual competencies. It sets a promising trajectory for the future development of speech technologies that emphasize inclusivity and linguistic diversity.