Evaluate LLM2Vec in Non-English Languages
Investigate how the LLM2Vec procedure (enabling bidirectional attention, masked next token prediction, and unsupervised SimCSE contrastive learning) performs on languages other than English by applying it to non-English text corpora and evaluating the resulting models on appropriate multilingual or language-specific embedding benchmarks.
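To make the third step of the procedure concrete, the following is a minimal sketch of an unsupervised SimCSE-style contrastive loss in plain Python. It is not the LLM2Vec authors' implementation. The function names `cosine` and `simcse_loss`, the toy two-dimensional vectors, and the temperature of 0.05 are assumptions chosen for illustration. In a real experiment the two views would be embeddings of the same non-English sentence produced by two forward passes with different dropout masks.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a, view_b, temperature=0.05):
    """Mean in-batch contrastive (InfoNCE) loss.

    view_a[i] and view_b[i] are two embeddings of the same sentence
    (the positive pair); every view_b[j] with j != i is a negative.
    The temperature value is an assumed hyperparameter.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching index i as the target.
        total += log_denom - logits[i]
    return total / n


# Toy batch: two "sentences", each with two slightly different views.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = simcse_loss(views_a, views_b)          # positives match
misaligned = simcse_loss(views_a, views_b[::-1])  # positives swapped
```

The loss is low when each sentence is closest to its own second view and high when the pairing is wrong. Nothing in the objective is specific to English, so the open question is an empirical one about the base model and the training corpus.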
References
We leave it to future work to investigate the performance of LLM2Vec on other languages.
— LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
(2404.05961 - BehnamGhader et al., 9 Apr 2024) in Appendix, Section "Limitations" (Extending to other languages)