Evaluate LLM2Vec in Non-English Languages

Investigate how well the LLM2Vec procedure—enabling bidirectional attention, masked next token prediction (MNTP), and unsupervised SimCSE—transfers to languages other than English. Concretely, apply the three adaptation steps to non-English text corpora and evaluate the resulting embeddings on appropriate multilingual or language-specific embedding benchmarks.
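Of the three steps, MNTP is the one with a non-obvious detail: per the LLM2Vec paper, the masked token at position i is predicted from the logits at position i−1. The sketch below illustrates that label shift on toy token ids; `MASK_ID`, the ids, and the function name are illustrative placeholders, not the paper's implementation.

```python
# Hedged sketch of MNTP label construction (assumed toy setting).
MASK_ID = 0  # hypothetical mask-token id

def build_mntp_batch(token_ids, mask_positions):
    """Mask the chosen positions and build labels so each masked token
    at position i is predicted from position i-1 (the defining shift of
    masked *next* token prediction); -100 marks ignored positions."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i in mask_positions:
        if i == 0:
            continue  # no previous position to predict from
        inputs[i] = MASK_ID
        labels[i - 1] = token_ids[i]  # position i-1 predicts token i
    return inputs, labels

# Example: masking position 2 of [5, 6, 7, 8] yields
# inputs [5, 6, MASK, 8] and labels [-100, 7, -100, -100].
```

Because the shift is language-independent, this step carries over to non-English corpora unchanged; only the tokenizer coverage and training data differ.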

Background

The methodology proposed in the paper is described as language-agnostic, but all experiments and evaluations were conducted on English corpora and benchmarks.

The authors explicitly defer exploration of non-English settings, making cross-lingual validation of LLM2Vec’s effectiveness an open direction.

References

"We leave it to future work to investigate the performance of LLM2Vec on other languages."

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (BehnamGhader et al., arXiv:2404.05961, 9 Apr 2024), Appendix, Section "Limitations" (Extending to other languages)