
Cross-Lingual Query Expansion

Updated 1 December 2025
  • Cross-lingual query expansion is a set of methods that generate semantically related terms in target languages to improve retrieval across language barriers.
  • Techniques include translation-based, embedding-alignment, and generative LLM-driven methods, each balancing precision, resource needs, and handling ambiguities.
  • Empirical results show significant gains in MAP, Recall, and nDCG, with hybrid pipelines proving effective for diverse, low-resource, and non-Latin script languages.

Cross-lingual query expansion is a collection of methodologies that reformulate a user's information need—expressed as a query in a source language—by generating semantically related or diversified expansions in one or more target languages to improve the coverage and effectiveness of cross-language information retrieval (CLIR) systems. These techniques address the terminological and representational mismatch arising from linguistic diversity, enabling retrieval of relevant documents written in a language different from the user's query. State-of-the-art approaches now span translation-based term expansion, dense cross-lingual embedding alignment, and generative methods leveraging multilingual LLMs (mLLMs), each with characteristic strengths, deployment considerations, and persistent challenges (Goworek et al., 1 Oct 2025, Macmillan-Scott et al., 24 Nov 2025, Ravichandran et al., 21 Feb 2024, Zhuang et al., 2023, Rahman et al., 2019).

1. Taxonomy of Cross-Lingual Query Expansion Methods

Cross-lingual query expansion has evolved through several methodological generations (Goworek et al., 1 Oct 2025):

a. Translation-Based Expansion

  • Dictionary-based and Statistical MT: Each source-language query term $s \in Q$ is expanded by looking up all candidate translations $T(s)$ in a bilingual dictionary or by applying statistical translation models (e.g., IBM Model 1), yielding weighted target-language queries:

    $$Q' = \bigcup_{s \in Q} T(s) \quad\text{or}\quad Q' = \{(t, w_{s \to t}) : s \in Q,\ t \in T(s)\}$$

    Strengths include minimal corpus requirements; weaknesses involve ambiguity and low coverage for named or domain-specific entities. A minimal dictionary-lookup sketch follows this list.

  • Neural Machine Translation (NMT): Encoder–decoder models integrate pre-expansion (via $k$-best outputs) and post-expansion (synonyms or back-translation). NMT enables more fluent and context-appropriate expansions but introduces resource intensity and hallucination risk.
  • Pivot-based and Back-Translation: Indirect translation via high-resource intermediary languages or multiple pivots can reduce ambiguity, but increases noise.
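
To make the dictionary-based variant concrete, here is a minimal sketch, assuming a toy bilingual lexicon with per-translation weights; the lexicon contents and weights are illustrative placeholders, not values from the cited papers:

```python
from collections import defaultdict

# Toy bilingual lexicon: source term -> list of (target translation, weight).
# Entries and weights are illustrative placeholders.
LEXICON = {
    "cheap": [("barato", 0.7), ("económico", 0.3)],
    "flights": [("vuelos", 0.9), ("pasajes", 0.1)],
}

def expand_query(source_terms):
    """Weighted dictionary-based expansion: Q' = {(t, w_{s->t})}."""
    expanded = defaultdict(float)
    for s in source_terms:
        for target, weight in LEXICON.get(s, []):
            # Accumulate weights when several source terms share a translation.
            expanded[target] += weight
    return dict(expanded)

print(expand_query(["cheap", "flights"]))
# {'barato': 0.7, 'económico': 0.3, 'vuelos': 0.9, 'pasajes': 0.1}
```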

b. Embedding-Based Expansion

  • Cross-lingual Embedding Alignment: Joint pre-training (e.g., mBERT, XLM-R) or supervised mapping aligns language spaces, enabling cross-lingual nearest-neighbor expansion. For each query term, retrieve its $k$ closest target-language embeddings, weighted by similarity:

    $$NN(s) = \text{arg\,top-}k_t\;\text{SIM}(e_s, e_t), \quad Q' = Q \cup \bigcup_{s \in Q} NN(s)$$

    Embedding-based approaches are robust to surface variation, support nearest-neighbor search over dense or hybrid representations, and are effective for morphologically rich languages. A nearest-neighbor expansion sketch follows this list.
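
A minimal nearest-neighbor expansion sketch, assuming source- and target-language vectors have already been projected into one shared space; the tiny random vectors below stand in for real aligned embeddings (e.g., from XLM-R):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for aligned cross-lingual embeddings: term -> vector.
target_vocab = ["vuelo", "avión", "billete", "hotel"]
target_emb = {t: rng.standard_normal(8) for t in target_vocab}
source_emb = {"flight": rng.standard_normal(8)}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nn_expand(term, k=2):
    """Return the k most similar target-language terms to a source term."""
    e_s = source_emb[term]
    scored = [(t, cosine(e_s, e_t)) for t, e_t in target_emb.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

print(nn_expand("flight"))  # k nearest target-language candidates with scores
```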

c. Generative LLM-Driven Expansion

  • Prompted Generation: Multilingual LLMs are prompted in zero-shot, few-shot, or chain-of-thought settings to generate either term lists or pseudo-documents in the target language. The output may be concatenated to the translated query or used to form an expanded representation for dense retrieval. Expansion via pseudo-documents bridges the short query–long document gap and provides contextually rich, language-aligned expansions.
  • Supervised and RLHF Fine-Tuning: Task-specific fine-tuning on (query, expansion) pairs, and optimization with retrieval-based reward models (e.g., via PPO), have demonstrated additional—but format-dependent—gains.
  • Hybrid Methods: Combining translation, embedding, and generative approaches, often via multi-stage retrieval or reciprocal rank fusion, achieves state-of-the-art effectiveness (Goworek et al., 1 Oct 2025, Macmillan-Scott et al., 24 Nov 2025, Zhuang et al., 2023, Ravichandran et al., 21 Feb 2024, Rahman et al., 2019); a reciprocal rank fusion sketch follows this list.
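
A minimal reciprocal rank fusion sketch for combining ranked lists from, say, a translation-based run and a generative-expansion run; the constant k=60 is the commonly used default, not a value prescribed by the cited papers:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked document-ID lists with RRF: score(d) = sum 1/(k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

translation_run = ["d3", "d1", "d7"]   # e.g., BM25 over the translated query
generative_run = ["d1", "d9", "d3"]    # e.g., dense retrieval over an LLM pseudo-document
print(reciprocal_rank_fusion([translation_run, generative_run]))
```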

2. Formal Models and Expansion Pipelines

Cross-lingual query expansion paradigms are implemented as modular pipelines. Canonical architectures are as follows:

| Methodology | Expansion Step | Scoring Function |
|---|---|---|
| Translation (dictionary/MT) | $Q_s \overset{T}{\longrightarrow} Q_t$; $Q' = Q_t \cup T(Q_s)$ | $\text{BM25}(Q', d)$; match via n-gram |
| Embedding-based | $e_s \rightarrow NN_k(e_s)$ in $L_t$ space | Cosine/dot product with document embeddings |
| Generative LLMs | $Q_t \xrightarrow{\text{LLM}} E$; $Q^\star = Q_t \cup E$ | Sparse/dense: $\text{BM25}(Q^\star, d)$ or $f(Q^\star)^\top g(d)$ |

(Relevant details: (Goworek et al., 1 Oct 2025, Macmillan-Scott et al., 24 Nov 2025, Ravichandran et al., 21 Feb 2024))
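
As a sketch of the sparse scoring path in the table (translated query terms concatenated with expansion terms, scored with BM25), assuming the third-party rank_bm25 package and a toy target-language corpus:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy target-language corpus (whitespace-tokenized for brevity).
docs = [
    "vuelos baratos a madrid desde lisboa",
    "hoteles económicos en el centro de madrid",
    "horarios de trenes entre madrid y sevilla",
]
bm25 = BM25Okapi([d.split() for d in docs])

translated_query = ["vuelos", "madrid"]
expansion_terms = ["baratos", "económicos"]           # output of any expansion stage
expanded_query = translated_query + expansion_terms   # Q* = Q_t ∪ E

print(bm25.get_scores(expanded_query))  # one BM25(Q*, d) score per document
```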

Recent frameworks augment these core pipelines with multi-level expansions:

  • Translation via multiple intermediate languages (back-translation), semantic embedding expansion (nearest neighbor terms), and user-profile centered augmentation—final term weights fuse base expansion score and user affinity (Ravichandran et al., 21 Feb 2024).
  • Generative pseudo-document expansion using multilingual LLMs (zero-shot or few-shot prompting), with the choice of prompt and expansion method controlled by input query length (Macmillan-Scott et al., 24 Nov 2025).
  • Rocchio-style passage representation augmentation via generated queries in multiple languages for dense retrieval, with all augmentation performed at indexing time (Zhuang et al., 2023); a Rocchio-style update sketch follows this list.
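
A minimal sketch of the Rocchio-style, index-time augmentation idea: a passage embedding is interpolated with the centroid of embeddings of queries generated for that passage in several languages. The interpolation weight and stand-in vectors are illustrative assumptions, not values from Zhuang et al. (2023):

```python
import numpy as np

rng = np.random.default_rng(1)

def augment_passage(passage_vec, generated_query_vecs, alpha=0.5):
    """Rocchio-style update: move the passage vector toward the centroid of
    embeddings of queries generated for it in multiple languages."""
    centroid = np.mean(generated_query_vecs, axis=0)
    return (1 - alpha) * passage_vec + alpha * centroid

passage_vec = rng.standard_normal(8)                       # stand-in dense passage embedding
query_vecs = [rng.standard_normal(8) for _ in range(3)]    # e.g., queries generated in en/es/zh
augmented = augment_passage(passage_vec, query_vecs)       # stored in the index at build time
print(augmented.round(3))
```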

3. Experimental Protocols and Empirical Findings

Performance assessment in cross-lingual query expansion employs retrieval metrics such as Recall@K, Hit@K, Mean Average Precision (MAP), and nDCG@K (Goworek et al., 1 Oct 2025, Macmillan-Scott et al., 24 Nov 2025). Benchmark datasets include CLIRMatrix, mMARCO, XOR-TyDi, and a broad spectrum of news, QA, and social media corpora covering over 40 languages.
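
For reference, hedged implementations of two of these measures over a single ranked list with binary relevance; production evaluations would typically rely on a dedicated toolkit rather than hand-rolled metrics:

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0

ranked = ["d2", "d5", "d1", "d9"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, 3), ndcg_at_k(ranked, relevant, 3))
```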

Key empirical trends:

  • Translation-based QE typically yields +8–15% MAP over unexpanded baselines on high-resource languages, but underperforms on distant/low-resource pairs (Goworek et al., 1 Oct 2025).
  • Embedding-based QE consistently produces +10–20% MAP and +5–10% nDCG@20 increases and degrades gracefully under typological distance.
  • LLM-based generative QE achieves up to +25% MAP and up to +15% absolute Recall increases in large-scale CLIR datasets, with chain-of-thought prompting and hybrid translation-generation frameworks yielding the best effectiveness (Macmillan-Scott et al., 24 Nov 2025).
  • Query expansion by word embeddings, DBpedia concept linking, and hypernymy collectively leads to significant MAP gains (e.g., 0.7143→0.8200 monolingual; 0.6757→0.7829 CLIR) on question re-ranking tasks (Rahman et al., 2019).
  • Multi-level translation and embedding expansion combined with user-profile weighting achieves up to +6.8% rel. ROUGE-L F1 over strong BM25 baselines on both news and Twitter datasets (Ravichandran et al., 21 Feb 2024).
  • Augmenting passage representations with cross-lingual query generation at indexing time yields statistically significant improvements (+2.1–3.2 points) in recall at the top m kilotokens (R@mkt) for dense retrieval (Zhuang et al., 2023).

Effectiveness is sensitive to the following:

  • Query length: On CLIRMatrix (short queries), zero-shot prompting is optimal; on mMARCO (longer queries), few-shot prompting excels (Macmillan-Scott et al., 24 Nov 2025).
  • Script and resource disparities: Relative gains are highest for low-baseline language pairs, but retrieval remains weakest for non-Latin scripts.
  • Supervised fine-tuning: Only beneficial when train/test data format and genre match; otherwise, may reduce performance (Macmillan-Scott et al., 24 Nov 2025).

4. Challenges and Failure Modes

Persistent technical challenges include (Goworek et al., 1 Oct 2025, Macmillan-Scott et al., 24 Nov 2025, Ravichandran et al., 21 Feb 2024, Rahman et al., 2019):

  • Data imbalance and low-resource language coverage: Scarcity of high-quality bilingual dictionaries or parallel data limits translation-based QE; unsupervised embedding alignment degrades for distant or noisy languages.
  • Semantic drift and polysemy: Short queries increase expansion ambiguity and the risk of off-topic terms. Expansion-only variants (word-embedding, DBpedia, or hypernym terms without the original keywords) can reduce retrieval precision (Rahman et al., 2019).
  • LLM hallucination: Generative expansion methods can introduce invalid terms, particularly under few-shot templates for underrepresented scripts.
  • User-centric adaptation: Unpersonalized expansions may miss idiosyncratic intent; profile-weighted terms improve recall but require user history (Ravichandran et al., 21 Feb 2024).
  • Systematic script-based disparities: Absolute performance remains lower for Chinese, Japanese, Arabic, and Hindi, even with expansion; these pairs benefit most in relative terms (Macmillan-Scott et al., 24 Nov 2025).

Mitigation strategies:

  • Back-translation and multi-pivot expansion to increase terminological diversity, while monitoring for increased semantic drift.
  • Consistency verification via back-translation or semantic entailment classifiers to filter hallucinated or off-topic expansions (Goworek et al., 1 Oct 2025); a back-translation filter sketch follows this list.
  • Hybrid pipelines combining sparse, dense, and generative expansions at query or indexing time.
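
A hedged sketch of the consistency-verification idea: keep only expansion terms whose back-translation is sufficiently similar to the original query. The `translate` and `similarity` callables are placeholders for whatever MT system and semantic-similarity model a deployment actually uses:

```python
def filter_expansions(query_terms, expansion_terms, translate, similarity, threshold=0.5):
    """Drop candidate expansion terms whose back-translation does not match
    the source query well enough (a simple guard against semantic drift)."""
    kept = []
    for term in expansion_terms:
        back = translate(term, target_lang="src")  # back-translate into the source language
        score = max(similarity(back, q) for q in query_terms)
        if score >= threshold:
            kept.append(term)
    return kept

# Usage with trivial stand-ins (real systems would plug in an MT model and an
# embedding-based similarity function here):
fake_translate = lambda t, target_lang: {"vuelos": "flights", "recetas": "recipes"}.get(t, t)
fake_similarity = lambda a, b: 1.0 if a == b else 0.0
print(filter_expansions(["flights"], ["vuelos", "recetas"], fake_translate, fake_similarity))
# ['vuelos']
```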

5. Recent Advances and Hybrid Architectures

The current frontier of cross-lingual query expansion features (Goworek et al., 1 Oct 2025, Macmillan-Scott et al., 24 Nov 2025, Zhuang et al., 2023, Ravichandran et al., 21 Feb 2024):

  • Modular hybrid pipelines: Compositional expansion strategies that concatenate translation-based, embedding-based, and generative outputs improve both recall and early precision.
  • Index-time passage augmentation: Generative query expansion applied to passage representations (not queries) retrofits any cross-lingual dense retriever infrastructure and improves language alignment without increasing query latency (Zhuang et al., 2023).
  • Personalized retrieval: Expansion terms selected or weighted according to user-profile vector affinities, increasing relevance for individual users (Ravichandran et al., 21 Feb 2024); a profile-weighting sketch follows this list.
  • Prompt optimization: Prompt complexity, the number of in-context examples, and whether expansion is applied before or after translation are tailored to query length and resource profile (Macmillan-Scott et al., 24 Nov 2025).
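
A hedged sketch of profile-weighted term selection: each candidate expansion term's base score is fused with its affinity to a user-profile vector. The fusion weight and the stand-in vectors are illustrative assumptions, not parameters from Ravichandran et al. (21 Feb 2024):

```python
import numpy as np

rng = np.random.default_rng(2)

def profile_weighted_expansion(candidates, term_vecs, user_profile, beta=0.3, top_k=3):
    """Fuse a base expansion score with user-profile affinity:
    final(t) = (1 - beta) * base(t) + beta * cos(e_t, u)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = {
        term: (1 - beta) * base + beta * cos(term_vecs[term], user_profile)
        for term, base in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

candidates = {"vuelos": 0.9, "hoteles": 0.6, "trenes": 0.5}   # base expansion scores
term_vecs = {t: rng.standard_normal(8) for t in candidates}   # stand-in term embeddings
user_profile = rng.standard_normal(8)                         # stand-in user-profile vector
print(profile_weighted_expansion(candidates, term_vecs, user_profile))
```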

Examples of LLM-based expansion pipelines:

| Expansion Strategy | Key Steps | Most Effective When |
|---|---|---|
| Zero-shot LLM pseudo-document | Translate → prompted expansion → retrieval (BM25/dense) | Short queries |
| Few-shot in-context LLM expansion | Translate → few-shot prompt expansion → Q+E concatenation | Longer queries |
| Embedding-informed user-profile QE | Back-translate, expand neighbors, profile weighting, re-rank | Personalized retrieval |
| Index-time xQG (multi-language) | Generate queries per language per passage, aggregate at index time | Dense IR, multilingual |
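
To illustrate the zero-shot pseudo-document row above, a hedged prompt-template sketch; the wording is an assumption rather than the prompt used in the cited work, and `call_llm` stands in for whatever multilingual LLM client is available:

```python
ZERO_SHOT_TEMPLATE = (
    "Write a short passage in {target_lang} that answers the query below. "
    "Use vocabulary a relevant document would contain.\n\nQuery: {query}\n\nPassage:"
)

def expand_with_pseudo_document(translated_query, target_lang, call_llm):
    """Zero-shot pseudo-document expansion: Q* = Q_t concatenated with the
    LLM-generated passage, to be scored by BM25 or a dense retriever."""
    prompt = ZERO_SHOT_TEMPLATE.format(target_lang=target_lang, query=translated_query)
    pseudo_doc = call_llm(prompt)  # placeholder for an actual mLLM call
    return f"{translated_query} {pseudo_doc}"

# Example with a canned response instead of a real model call:
print(expand_with_pseudo_document("vuelos baratos madrid", "Spanish",
                                  lambda p: "Los vuelos económicos a Madrid..."))
```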

6. Evaluation Benchmarks, Metrics, and Comparative Results

Extensive benchmark datasets include CLIRMatrix (139×138 language pairs), XOR-TyDi, LaMP, Sentiment140, AfriCLIRMatrix, MIRACL, and NeuMARCO/mMARCO (Goworek et al., 1 Oct 2025, Macmillan-Scott et al., 24 Nov 2025, Zhuang et al., 2023, Ravichandran et al., 21 Feb 2024).

Principal evaluation measures:

  • Recall@k: Coverage of relevant documents in top-k
  • Hit@k: Fraction of queries with ≥1 relevant hit in top-k
  • MAP, nDCG@k, MRR: Early precision and graded relevance

Comparative findings:

  • Dictionary/SMT-based QE outperforms unexpanded baselines by +8–15% MAP but is eclipsed by embedding- and LLM-based methods on dense, multilingual data (Goworek et al., 1 Oct 2025).
  • Embedding-based QE sustains +10–20% MAP and +5–10% nDCG@20 gains on multilingual Wikipedia and QA datasets.
  • LLM-based generative expansion increased MAP and Recall@k by 15–25% and 10–15% respectively, with the largest absolute improvements observed for low-resource and non-Latin-script languages, albeit with the lowest baselines (Macmillan-Scott et al., 24 Nov 2025).

A detailed ablation on question re-ranking tasks revealed that only the union of expansion sources with original keywords consistently improves MAP; single-source expansions without original keywords decrease performance, emphasizing the importance of robust term weighting and context incorporation (Rahman et al., 2019).

7. Open Problems and Future Directions

Several persistent challenges and directions have emerged (Goworek et al., 1 Oct 2025, Macmillan-Scott et al., 24 Nov 2025):

  • Robustness and Equitability: Alleviating the curse of multilinguality in mLLMs and embeddings, addressing data/resource imbalance, and reducing hallucination in generative QE remain open.
  • Graph-augmented and Knowledge-integrated Expansion: Integration with multilingual Wikidata, large knowledge graphs, and graph-based expansion is proposed to ground expansions and mitigate semantic drift.
  • Dynamic, End-to-End Learning: Adaptive pipelines trained end-to-end, incorporating reinforcement learning (RLHF) for expansion selection and sequence-level feedback, are research priorities.
  • Personalization and User Modeling: Personalized QE—conditioned on user embedding or interaction footprint—improves context matching but requires scalable user representation strategies.
  • Unified, Modular Pipelines: Modular hybrid pipelines that dynamically select and weight expansion strategies per query type, language affinity, and available resources offer a pragmatic blueprint for future systems.

These directions align with the strategic goal of developing CLIR systems that generalize robustly across typologies, domains, and scripts, with flexible and interpretable expansion mechanisms (Goworek et al., 1 Oct 2025, Macmillan-Scott et al., 24 Nov 2025, Ravichandran et al., 21 Feb 2024, Zhuang et al., 2023, Rahman et al., 2019).
