Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems
Introduction
Recent advances have shown that LLMs can leverage vast amounts of text data to perform remarkably well across a multitude of linguistic tasks. Their application to speech technologies, however, is often limited by narrow language coverage. To address this gap, a novel approach is proposed that leverages the deep multilingual understanding of LLMs to initialize a multi-modal Dual Encoder (DE) retrieval system that matches speech and text across languages. Remarkably, this framework requires no speech data during the LLM's pre-training phase, offering an innovative pathway for exploiting LLMs for speech-text matching across a wide array of languages.
Methodology
The core methodology is to train a transformer-based DE model that can encode both speech and text. Raw speech is first passed through a pre-trained speech encoder, and the resulting continuous representations are discretized via k-means clustering into audio tokens. The pivotal innovation lies in extending the LLM's embedding layer to support these audio tokens; the LLM is then converted into a retrieval system by training with a contrastive loss. This approach integrates the speech modality seamlessly into the predominantly text-based framework of LLMs.
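As a concrete illustration, the following minimal PyTorch sketch shows the two mechanisms described above: a shared embedding table extended with rows for k-means audio tokens, and a symmetric in-batch contrastive loss over pooled speech and text embeddings. All dimensions, token counts, and names (DualEncoder, contrastive_loss, and so on) are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Hypothetical sketch (not the authors' code). Audio token ids are assumed
# to come from k-means cluster assignments of speech-encoder features.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 32_000       # assumed text vocabulary size
NUM_AUDIO_TOKENS = 1_024  # assumed number of k-means clusters
DIM = 512                 # assumed embedding width

class DualEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared embedding table: text ids occupy [0, TEXT_VOCAB);
        # audio ids are offset into the newly appended rows.
        self.embed = nn.Embedding(TEXT_VOCAB + NUM_AUDIO_TOKENS, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def encode(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        # Mean-pool over the sequence and L2-normalize for cosine similarity.
        return F.normalize(h.mean(dim=1), dim=-1)

def contrastive_loss(speech_vec, text_vec, temperature=0.05):
    # In-batch softmax over speech-text similarities: matched pairs are
    # positives, every other pairing in the batch is a negative.
    logits = speech_vec @ text_vec.T / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy forward pass over a batch of 4 paired (audio-token, text-token) sequences.
model = DualEncoder()
audio_ids = torch.randint(TEXT_VOCAB, TEXT_VOCAB + NUM_AUDIO_TOKENS, (4, 50))
text_ids = torch.randint(0, TEXT_VOCAB, (4, 20))
loss = contrastive_loss(model.encode(audio_ids), model.encode(text_ids))
```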
Experimental Design and Results
Speech-to-Text Retrieval Task
Evaluation on the speech-to-text retrieval task used the FLEURS dataset, which covers 102 languages. The model achieved a 10% absolute improvement in Recall@1, averaged across languages, over existing systems trained explicitly on all 102 languages. This result is especially notable because the model was trained on only a 21-language subset, demonstrating a strong ability to generalize across linguistic barriers and significantly outperforming baseline systems such as mSLAM across various metrics.
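For reference, Recall@1 here is the fraction of speech utterances whose nearest text candidate, by cosine similarity, is the matched transcription. A minimal sketch, assuming L2-normalized embeddings in which row i of each matrix is a matched speech/text pair:

```python
# Illustrative metric code; variable names are assumptions, not the paper's.
import torch

def recall_at_1(speech_vecs: torch.Tensor, text_vecs: torch.Tensor) -> float:
    sims = speech_vecs @ text_vecs.T             # cosine similarity matrix
    nearest = sims.argmax(dim=1)                 # best text candidate per utterance
    correct = (nearest == torch.arange(len(sims))).float()
    return correct.mean().item()

# The paper's headline number would correspond to computing this per
# language and averaging the resulting Recall@1 values across languages.
```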
Cross-Modal and Cross-Lingual Translation Retrieval Task
On cross-lingual speech-to-text translation retrieval, the model shows remarkable zero-shot capabilities, which are further improved by incorporating readily available machine translation data: notable gains in 4-gram corpus BLEU are reported for languages including French, German, Dutch, and Polish. These findings illustrate the model's ability to exploit cross-lingual cues and set a precedent for using machine translation data to bolster cross-lingual retrieval performance.
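Retrieval quality in this setting can be scored with standard corpus-level 4-gram BLEU over the retrieved translations. A minimal sketch using the sacrebleu library; the data and the retrieval step itself are stand-ins, not the paper's evaluation code:

```python
# Hypothetical scoring sketch: for each utterance, some retrieval step has
# already produced a candidate text in the target language; we score those
# candidates against gold reference translations with corpus BLEU.
import sacrebleu

def translation_retrieval_bleu(retrieved_texts, reference_translations):
    # sacrebleu's corpus_bleu defaults to 4-gram BLEU; references are passed
    # as a list of reference streams.
    return sacrebleu.corpus_bleu(retrieved_texts, [reference_translations]).score

# Toy example with made-up strings:
hyps = ["the cat sits on the mat"]
refs = ["the cat sat on the mat"]
print(translation_retrieval_bleu(hyps, refs))
```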
Theoretical and Practical Implications
This research opens avenues for significantly advancing cross-modal and cross-lingual retrieval systems. Theoretically, it underscores the potential of text-centric LLMs to bridge modalities, offering a novel approach to dual-encoder training that retains multilingual capabilities. Practically, it offers a scalable remedy for the language disparity in speech technologies and a foundation for future work on broadening the linguistic accessibility of speech-based applications. Moreover, the model's ability to incorporate translation data to strengthen cross-lingual retrieval could pave the way for richer, more linguistically inclusive speech technologies.
Conclusion
Transforming LLMs into cross-modal and cross-lingual retrieval systems represents a significant stride toward overcoming the linguistic limitations of current speech technologies. By exploiting the linguistic diversity already captured in LLMs, this work not only broadens the horizon for speech-text matching across myriad languages but also exemplifies how machine translation data can enrich a model's cross-lingual competencies. It sets a promising trajectory for the future development of speech technologies that emphasize inclusivity and linguistic diversity.