- The paper introduces the Max@R metric to better evaluate cross-lingual alignment capabilities.
- A combined training strategy with JSD and InfoNCE losses improves semantic proximity across languages.
- Significant performance gains and reduced language biases highlight a robust multilingual approach.
Problem Statement and Motivation
The paper addresses severe misalignment and bias challenges inherent in current multilingual embedding models for Cross-Lingual Information Retrieval (CLIR). While conventional CLIR evaluation settings focus on retrieving documents written in a language different from the query, they typically assume monolingual document pools and fail to assess performance in realistic scenarios where documents in multiple languages coexist. The study demonstrates that most state-of-the-art multilingual retrievers systematically prioritize unrelated English documents over semantically relevant documents in the query’s language, exposing prominent biases and ineffective cross-lingual semantic alignment.
Crucially, the paper observes that standard ranking metrics such as MAP, MRR, or NDCG@k are insufficient for diagnosing cross-lingual misalignment. There is strong empirical evidence of significant performance degradation and inconsistent retrieval behavior when the query and relevant documents are written in non-English languages, further underlining the urgency of addressing representational misalignment and language bias.
Novelty: Scenario, Metric, and Method
The authors introduce a realistic multi-reference cross-lingual retrieval scenario, in which English and target-language documents coexist and each query has parallel relevant documents in both languages, to rigorously evaluate cross-lingual alignment capabilities. To enable effective diagnosis under this scenario, they propose Max@R, a new evaluation metric capturing the worst (highest) retrieval rank needed to recover all parallel ground-truth documents per query. Max@R is inherently more sensitive to alignment errors, offering diagnostic capabilities for both semantic proximity and bias. They further introduce Max@R_norm, a normalized variant that enables comparison across datasets.
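A minimal sketch of how Max@R could be computed per query, assuming the ranking is a list of document IDs sorted by score; the min-max normalization shown for Max@R_norm is an illustrative assumption, not necessarily the paper's exact formula.

```python
def max_at_r(ranked_doc_ids, relevant_ids):
    """Max@R for a single query: the worst (largest) 1-indexed rank at which
    any of the parallel ground-truth documents is retrieved."""
    ranks = [ranked_doc_ids.index(doc_id) + 1 for doc_id in relevant_ids]
    return max(ranks)

def max_at_r_norm(max_rank, num_relevant, pool_size):
    """Assumed min-max normalization for cross-dataset comparison: 0 when all
    relevant documents occupy the very top ranks, 1 when the last one sits at
    the bottom of the candidate pool."""
    best, worst = num_relevant, pool_size
    return (max_rank - best) / (worst - best)

# Example: both parallel documents (en_3, zh_3) must be found in the ranking.
ranking = ["en_3", "en_7", "zh_1", "zh_3", "en_1"]
print(max_at_r(ranking, ["en_3", "zh_3"]))            # 4
print(max_at_r_norm(4, num_relevant=2, pool_size=5))  # ~0.667
```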
To address the observed misalignments, the paper proposes a unified training strategy combining:
- a Jensen-Shannon divergence (JSD) alignment loss that matches the dimension-wise embedding distributions of parallel English and target-language texts, and
- an InfoNCE contrastive loss that pulls each query toward its relevant documents in both languages at the instance level.
Together, these losses encourage both distribution-level (dimension-wise) and instance-level alignment, compelling the learned representation space to be robustly language-agnostic and semantically faithful.
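Below is a hedged PyTorch sketch of such a combined objective. The batching scheme, temperature, weighting factor lam, and the choice to apply the JSD term to dimension-wise softmax distributions of parallel document embeddings are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(q, d, temperature=0.05):
    """Instance-level contrastive loss with in-batch negatives:
    query i should score highest against document i."""
    logits = F.normalize(q, dim=-1) @ F.normalize(d, dim=-1).T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def jsd_alignment(e_en, e_tgt, eps=1e-8):
    """Distribution-level alignment: Jensen-Shannon divergence between the
    dimension-wise (softmax) distributions of parallel embeddings."""
    p, q = F.softmax(e_en, dim=-1), F.softmax(e_tgt, dim=-1)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(eps).log() - b.clamp_min(eps).log())).sum(-1)
    return (0.5 * (kl(p, m) + kl(q, m))).mean()

def combined_loss(q_emb, d_en_emb, d_tgt_emb, lam=1.0):
    """InfoNCE toward both the English and target-language positives,
    plus a JSD term pulling the two language distributions together."""
    l_nce = info_nce(q_emb, d_en_emb) + info_nce(q_emb, d_tgt_emb)
    l_jsd = jsd_alignment(d_en_emb, d_tgt_emb)
    return l_nce + lam * l_jsd
```

In this sketch, the JSD term regularizes the geometry of the two language-specific embedding distributions while the InfoNCE terms preserve retrieval ability, mirroring the complementary roles reported in the ablation.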
Empirical Results
Experimental Design
The approach is evaluated using fully-parallel multilingual datasets (XQuAD, Belebele), with three core retrieval scenarios (see the sketch after this list):
- Multi: Retrieve two ground-truth documents per query (English and target language).
- Multi-1: Retrieve only the cross-lingual (opposite language) relevant document per query.
- Mono: Retrieval over a single-language document pool, with the query in either the same language (standard monolingual retrieval) or a different one (the latter matching standard CLIR).
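As referenced above, a minimal sketch of the per-query ground-truth set in each scenario; the function name ground_truth and the identifiers en_id, tgt_id, and pool_lang are hypothetical and only serve to make the three settings concrete.

```python
def ground_truth(scenario, query_lang, pool_lang, en_id, tgt_id):
    """Relevant document set for one query under each evaluation scenario."""
    if scenario == "multi":      # mixed-language pool; both parallel docs are relevant
        return {en_id, tgt_id}
    if scenario == "multi-1":    # mixed-language pool; only the opposite-language doc counts
        return {tgt_id} if query_lang == "en" else {en_id}
    if scenario == "mono":       # single-language pool; the relevant doc follows the pool
        return {en_id} if pool_lang == "en" else {tgt_id}
    raise ValueError(f"unknown scenario: {scenario}")
```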
Four strong embedding baselines are considered: multilingual-E5, GTE-multilingual, jina-embeddings-v3, and BGE-m3. The small training set (2.8k parallel samples per language) is created by translating English positive documents via GPT-4o.
Quantitative Gains
The proposed training strategy yields consistently lower Max@R and higher Complete@10 scores across languages and baselines. For instance, for the multilingual-E5 model with Chinese queries, Max@R drops from 650.95 (baseline) to 23.10 (Ours) on XQuAD, a more than 28x reduction that directly quantifies improved semantic proximity and cross-lingual alignment.
Language bias, particularly the English inclination, is substantially mitigated. The disparity between English/target language performance for jina-embeddings-v3 (En+Zh, Complete@10) decreases from 6.89%p (baseline) to 1.77%p (Ours) on XQuAD and from 4.45%p to 0.12%p on Belebele, highlighting a more equitable retrieval landscape. In stricter Multi-1 and monolingual scenarios, the approach consistently outperforms baselines on NDCG@1 and MRR, even slightly improving monolingual retrieval performance.
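For reference, a small sketch of how Complete@10 and the reported language gap could be computed; the definition of Complete@k used here (all ground-truth documents retrieved within the top k) is inferred from context and should be treated as an assumption.

```python
def complete_at_k(ranked_doc_ids, relevant_ids, k=10):
    """1 if every ground-truth document appears within the top-k results,
    else 0; averaging over queries yields Complete@10."""
    top_k = set(ranked_doc_ids[:k])
    return float(all(doc_id in top_k for doc_id in relevant_ids))

def language_gap_pp(scores_en, scores_tgt):
    """Absolute English vs. target-language gap in percentage points (%p),
    e.g. between per-query Complete@10 scores for En and Zh queries."""
    mean = lambda xs: sum(xs) / len(xs)
    return 100.0 * abs(mean(scores_en) - mean(scores_tgt))
```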
Ablation Study
Ablation verifies the complementary roles of the JSD alignment and InfoNCE components: omitting either leads to clear performance drops in cross-lingual alignment or retrieval ability. The results also show that simply maximizing cross-lingual cosine similarity (as with the L_NCE^psg variant) is clearly inferior to distribution-level alignment, as evidenced by the persistent gaps in Max@R_norm.
Theoretical and Practical Implications
The work presents strong evidence that conventional CLIR evaluation protocols are insufficient: models excelling in these settings can still exhibit drastic misalignment, language dependency, and inefficiency when documents are mixed-language—an increasingly prevalent real-world setting.
Employing both distribution-level and instance-level constraints during training achieves a tighter cross-lingual semantic coupling, substantiated by robust gains on both standard and novel metrics. This result implies that embedding distribution geometry across languages must be actively regularized for truly language-agnostic IR—mere pairwise similarity maximization is inadequate.
The experiments also validate that strong cross-lingual alignment is not antagonistic to monolingual performance, contrary to common concerns regarding catastrophic forgetting in multilingual tuning.
Implications for Future AI Research
This research exposes critical limitations in evaluating and training multilingual IR systems and provides concrete diagnostic tools and remedies. Future work should:
- Extend diagnostic settings to more diverse multi-language pools and explore cross-lingual retrieval in settings involving more than two languages.
- Investigate potential side effects of heavy reliance on machine-translated training data, particularly with respect to cultural nuance and emerging language-specific biases.
- Adapt the combined loss framework for scale, e.g., to the entire MIRACL benchmark, and integrate alignment-aware fine-tuning in low-resource and domain-specific retrieval.
- Explore applications of distributional alignment in other cross-lingual tasks beyond IR, such as generation, summarization, or robust question-answering.
Conclusion
The paper provides a rigorous analysis of cross-lingual retrieval limitations, introduces a diagnostic multi-reference scenario and metric, and proposes a principled training approach that achieves substantial improvements in semantic alignment and language bias mitigation. The method is empirically validated across strong baselines and datasets, offering a new standard for evaluating and engineering robust multilingual IR systems. The study suggests theoretical refinements and practical training guidelines with significant implications for future research on fair, reliable, and language-inclusive information access.
Reference: "Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment" (2604.05684)