Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems (2404.01616v3)

Published 2 Apr 2024 in cs.CL, cs.IR, cs.SD, and eess.AS

Abstract: LLMs are trained on text-only data that go far beyond the languages with paired speech and text data. At the same time, Dual Encoder (DE) based retrieval systems project queries and documents into the same embedding space and have demonstrated their success in retrieval and bi-text mining. To match speech and text in many languages, we propose using LLMs to initialize multi-modal DE retrieval systems. Unlike traditional methods, our system doesn't require speech data during LLM pre-training and can exploit LLM's multilingual text understanding capabilities to match speech and text in languages unseen during retrieval training. Our multi-modal LLM-based retrieval system is capable of matching speech and text in 102 languages despite only training on 21 languages. Our system outperforms previous systems trained explicitly on all 102 languages. We achieve a 10% absolute improvement in Recall@1 averaged across these languages. Additionally, our model demonstrates cross-lingual speech and text matching, which is further enhanced by readily available machine translation data.

Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems

Introduction

Recent advances have shown that LLMs, trained on vast amounts of text, perform remarkably well across a wide range of linguistic tasks. Their application to speech technologies, however, is constrained by the small number of languages for which paired speech and text data exist. To address this gap, the authors propose using LLMs' multilingual text understanding to initialize multi-modal Dual Encoder (DE) retrieval systems that match speech and text across languages. Notably, the framework requires no speech data during LLM pre-training, offering a new pathway to speech-text matching across a large set of languages.

Methodology

The core of the method is a transformer-based DE model that encodes both speech and text. Raw speech is first discretized: a pre-trained speech encoder extracts frame-level features, which are quantized into audio tokens via k-means clustering. The pivotal step is extending the LLM's embedding layer to cover these audio tokens, after which the model is trained as a retrieval system with a contrastive loss. In this way the speech modality is integrated into the otherwise text-only framework of the LLM.
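
To make the recipe concrete, the sketch below shows the two key ingredients in PyTorch: an embedding table extended with audio-token rows, and an in-batch contrastive loss over pooled speech and text embeddings. The vocabulary sizes, the toy two-layer encoder, and mean pooling are illustrative assumptions, not the paper's actual configuration; the k-means tokenizer is represented here only by pre-computed audio token ids.

```python
# Minimal PyTorch sketch of the dual-encoder recipe (illustrative, not the
# paper's implementation). Audio ids are assumed to come from k-means
# clustering of pre-trained speech-encoder features.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 32_000   # assumed LLM text vocabulary size
AUDIO_VOCAB = 1_024   # assumed number of k-means clusters (audio tokens)
D_MODEL = 512

class DualEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Single embedding table: text rows plus appended audio rows, i.e.
        # the "extended embedding layer"; audio ids are offset by TEXT_VOCAB.
        self.embed = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the LLM body

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return F.normalize(h.mean(dim=1), dim=-1)  # pooled, unit-norm embedding

def contrastive_loss(speech_emb, text_emb, temperature=0.05):
    # In-batch softmax contrastive loss: matched speech/text pairs are
    # positives; every other pairing in the batch serves as a negative.
    logits = speech_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

model = DualEncoder()
speech_ids = torch.randint(TEXT_VOCAB, TEXT_VOCAB + AUDIO_VOCAB, (4, 50))  # discretized speech
text_ids = torch.randint(0, TEXT_VOCAB, (4, 20))                           # paired transcriptions
loss = contrastive_loss(model(speech_ids), model(text_ids))
print(loss.item())
```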

Experimental Design and Results

Speech-to-Text Retrieval Task

On the speech-to-text retrieval task, evaluated on the FLEURS dataset covering 102 languages, the model achieves a 10% absolute improvement in Recall@1 averaged across languages over systems trained explicitly on all 102 languages. That it does so while training on only 21 languages demonstrates strong cross-lingual generalization, and it significantly outperforms baselines such as mSLAM across various metrics.
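
For clarity, Recall@1 in this setting measures how often a speech query's nearest text embedding is its own paired transcription. A minimal sketch, with random embeddings standing in for model outputs:

```python
# Recall@1 for speech-to-text retrieval: the i-th speech query counts as
# correct if its most similar text embedding is the i-th (paired) text.
import torch
import torch.nn.functional as F

def recall_at_1(speech_emb, text_emb):
    sims = F.normalize(speech_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    return (sims.argmax(dim=1) == torch.arange(sims.size(0))).float().mean().item()

speech_emb = torch.randn(100, 512)  # stand-in for encoded speech queries
text_emb = torch.randn(100, 512)    # stand-in for encoded transcriptions
print(f"Recall@1: {recall_at_1(speech_emb, text_emb):.3f}")
```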

Cross-Modal and Cross-Lingual Translation Retrieval Task

In cross-lingual speech-to-text translation retrieval, the model shows strong zero-shot capabilities. These improve further when readily available machine translation data is added to training, yielding notable gains in 4-gram corpus-level BLEU for languages including French, German, Dutch, and Polish. The findings illustrate the model's ability to exploit cross-lingual cues and make a case for using machine translation data to bolster cross-lingual retrieval performance.
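
Corpus-level BLEU here scores the retrieved translations against reference translations over the whole test set. A hedged sketch using the sacrebleu library (a standard scoring tool; the sentences below are toy stand-ins, and the paper's exact scoring setup may differ):

```python
# Corpus-level (default 4-gram) BLEU over retrieved translations,
# computed with sacrebleu; hypotheses and references are toy examples.
import sacrebleu

retrieved = ["the cat sits on the mat", "machine translation helps retrieval"]
# One reference stream, aligned element-wise with the hypotheses.
references = [["the cat sat on the mat", "machine translation aids retrieval"]]
bleu = sacrebleu.corpus_bleu(retrieved, references)
print(f"BLEU: {bleu.score:.2f}")
```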

Theoretical and Practical Implications

The research opens avenues to advance cross-modal and cross-lingual retrieval systems on two fronts. Theoretically, it shows that text-centric LLMs can bridge modalities, offering a novel dual-encoder training approach with multilingual reach. Practically, it provides a scalable answer to the language disparity in speech technologies and a foundation for broadening the linguistic accessibility of speech-based applications. Moreover, the model's ability to incorporate translation data to strengthen cross-lingual retrieval could pave the way for richer, more linguistically inclusive speech technologies.

Conclusion

Transforming LLMs into cross-modal and cross-lingual retrieval systems represents a significant step toward overcoming the linguistic limitations of current speech technologies. By exploiting the linguistic diversity already captured in LLMs, this work broadens speech-text matching to a wide range of languages and demonstrates how machine translation data can enrich a model's cross-lingual competence. It sets a promising trajectory for future speech technologies that emphasize inclusivity and linguistic diversity.

References (33)
  1. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.
  2. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.
  3. mSLAM: Massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374.
  4. AudioLM: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
  5. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  6. PaLM: Scaling language modeling with Pathways. Journal of Machine Learning Research, 24(240):1–113.
  7. W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250. IEEE.
  8. FLEURS: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE.
  9. SpeechMatrix: A large-scale mined corpus of multilingual speech-to-speech translations. arXiv preprint arXiv:2211.04508.
  10. Multimodal and multilingual embeddings for large-scale speech mining. Advances in Neural Information Processing Systems, 34:15748–15761.
  11. Sentence-level multimodal and language-agnostic representations. arXiv preprint arXiv:2308.11466.
  12. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
  13. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.
  14. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742.
  15. Textually pretrained speech language models. arXiv preprint arXiv:2305.13009.
  16. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
  17. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.
  18. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
  19. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  20. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354.
  21. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1864–1874, Dublin, Ireland. Association for Computational Linguistics.
  22. Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  23. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR.
  24. AudioPaLM: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925.
  25. WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791.
  26. Towards the next 1000 languages in multilingual machine translation: Exploring the synergy between supervised and self-supervised learning. arXiv preprint arXiv:2201.03110.
  27. CoVoST 2 and massively multilingual speech translation. In Interspeech.
  28. CoVoST 2 and massively multilingual speech-to-text translation. arXiv preprint arXiv:2007.10310.
  29. Neural codec language models are zero-shot text to speech synthesizers.
  30. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.
  31. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. arXiv preprint arXiv:1902.08564.
  32. Learning spread-out local feature descriptors. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 4605–4613.
  33. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037.
Authors (6)
  1. Frank Palma Gomez
  2. Ramon Sanabria
  3. Daniel Cer
  4. Siddharth Dalmia
  5. Gustavo Hernandez Abrego
  6. Yun-Hsuan Sung