MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation (2403.17876v1)

Published 26 Mar 2024 in cs.IR

Abstract: Digital news platforms use news recommenders as the main instrument to cater to the individual information needs of readers. Despite an increasingly language-diverse online community, in which many Internet users consume news in multiple languages, the majority of news recommendation focuses on major, resource-rich languages, and English in particular. Moreover, nearly all news recommendation efforts assume monolingual news consumption, whereas more and more users tend to consume information in at least two languages. Accordingly, the existing body of work on news recommendation suffers from a lack of publicly available multilingual benchmarks that would catalyze development of news recommenders effective in multilingual settings and for low-resource languages. Aiming to fill this gap, we introduce xMIND, an open, multilingual news recommendation dataset derived from the English MIND dataset using machine translation, covering a set of 14 linguistically and geographically diverse languages, with digital footprints of varying sizes. Using xMIND, we systematically benchmark several state-of-the-art content-based neural news recommenders (NNRs) in both zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer scenarios, considering both monolingual and bilingual news consumption patterns. Our findings reveal that (i) current NNRs, even when based on a multilingual LLM, suffer from substantial performance losses under ZS-XLT and that (ii) inclusion of target-language data in FS-XLT training has limited benefits, particularly when combined with a bilingual news consumption. Our findings thus warrant a broader research effort in multilingual and cross-lingual news recommendation. The xMIND dataset is available at https://github.com/andreeaiana/xMIND.

Introducing xMIND: Multilingual Dataset for Cross-lingual News Recommendation

Overview

The paper "MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation" introduces xMIND, a comprehensive, open multilingual dataset derived from the English MIND dataset using machine translation. Covering 14 linguistically and geographically diverse languages, it aims to bridge the gap in multilingual news recommendation research, which predominantly focuses on English and other resource-rich languages. The authors systematically benchmark several state-of-the-art content-based neural news recommenders (NNRs) in zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual settings, addressing both monolingual and bilingual news consumption scenarios.
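The difference between the two transfer setups comes down to how the training set is constructed. The sketch below illustrates this; the function name, the toy samples, and the 10% target-language share are illustrative assumptions, not the paper's exact protocol (Swahili stands in for any xMIND target language):

```python
import random

def make_fsxlt_train_set(en_samples, tgt_samples, tgt_fraction, seed=0):
    """Build a few-shot cross-lingual (FS-XLT) training set by replacing a
    fraction of English training samples with target-language ones.
    tgt_fraction=0.0 reduces to the zero-shot (ZS-XLT) setup: English only."""
    rng = random.Random(seed)
    n_tgt = int(len(en_samples) * tgt_fraction)
    mixed = (rng.sample(tgt_samples, n_tgt)
             + rng.sample(en_samples, len(en_samples) - n_tgt))
    rng.shuffle(mixed)
    return mixed

# Toy usage: 100 English impressions and 100 parallel Swahili translations.
en = [("en", i) for i in range(100)]
swh = [("swh", i) for i in range(100)]

zs_train = make_fsxlt_train_set(en, swh, tgt_fraction=0.0)  # ZS-XLT
fs_train = make_fsxlt_train_set(en, swh, tgt_fraction=0.1)  # FS-XLT, 10% target
```

Setting `tgt_fraction=0.0` recovers the zero-shot setup; the paper's central finding is that even nonzero target-language fractions yield only limited gains.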

Key Contributions

  • xMIND Dataset: The dataset covers 14 high- and low-resource languages from diverse geographical areas and language families, some of which are underrepresented in current multilingual pretrained language models (mPLMs). This parallel corpus, derived through machine translation from the English MIND dataset, allows for direct performance comparisons of multilingual news recommenders and cross-lingual transfer approaches.
  • Cross-lingual Recommendation Scenarios: The paper evaluates a variety of content-based NNRs across ZS-XLT and FS-XLT setups, considering both monolingual and bilingual news consumption patterns. Findings highlight the substantial performance drops when recommenders trained on English are tested on other languages (ZS-XLT scenario), and how injecting data in the target language during training (FS-XLT scenario) has limited benefits.
  • Translation Quality Assessment: By translating the MIND dataset into 14 languages with NLLB and comparing against Google Neural Machine Translation (GNMT) on a subset, the authors provide insight into translation quality. Annotations on a sample set indicate generally high intelligibility and fidelity of the translations, albeit with variation across languages.
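Because xMIND is parallel to MIND at the article level, bilingual consumption can be simulated by rendering part of a user's click history in a target language. The sketch below assumes a hypothetical in-memory parallel corpus (the actual xMIND distribution format and schema may differ; Romanian stands in for any target language, and the titles are invented):

```python
import random

# Hypothetical parallel corpus: each news ID maps to one title per language,
# mirroring how xMIND pairs translations with English MIND articles.
news = {
    "N1": {"eng": "Markets rally", "ron": "Piețele cresc"},
    "N2": {"eng": "New vaccine approved", "ron": "Un nou vaccin aprobat"},
    "N3": {"eng": "Storm hits coast", "ron": "Furtuna lovește coasta"},
    "N4": {"eng": "Team wins final", "ron": "Echipa câștigă finala"},
}

def bilingual_history(clicked_ids, tgt_lang, tgt_share=0.5, seed=0):
    """Render a user's click history as mixed-language titles: a random
    share of the articles appears in the target language, the rest in English."""
    rng = random.Random(seed)
    n_tgt = round(len(clicked_ids) * tgt_share)
    tgt_ids = set(rng.sample(clicked_ids, n_tgt))
    return [news[nid][tgt_lang if nid in tgt_ids else "eng"]
            for nid in clicked_ids]

history = bilingual_history(["N1", "N2", "N3", "N4"], "ron")
```

Monolingual target-language consumption is the special case `tgt_share=1.0`; the paper evaluates recommenders under both patterns.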

Findings and Implications

  • The analysis reveals significant performance degradation of existing NNRs when evaluated in cross-lingual settings, underscoring the need for research into making these systems more robust and effective across languages.
  • The limited gains from FS-XLT training expose a methodological limitation of simply mixing target-language data into training and call for approaches tailored specifically to multilingual and cross-lingual news recommendation.
  • Comparing NLLB translations against those of a commercial system (GNMT), and measuring their impact on recommender performance, suggests that NNRs are fairly robust to translation quality. The observed variation in automatic translation quality nonetheless calls for careful attention to source texts and how well they translate.

Future Directions

The introduction of xMIND and the findings from benchmarking efforts encourage several future research directions:

  • Model Architecture Innovation: The mixed results from FS-XLT motivate exploring new NNR architectures that inherently support multilingual learning and better exploit cross-lingual signals during training.
  • User Behavior Modeling: Given the complexity of bilingual or multilingual news consumption patterns, future work could delve into modeling such behaviors more accurately and dynamically within recommender systems.
  • Domain Adaptation Strategies: Investigating domain adaptation strategies to leverage transfer learning more effectively between languages, especially focusing on low-resource and underrepresented languages, stands out as a promising research avenue.
  • Evaluation Frameworks: Developing more sophisticated evaluation frameworks that closely mimic real-world scenarios of multilingual news consumption can provide deeper insights into the operational effectiveness of these systems.

In summary, the xMIND dataset sets the stage for substantial advancements in the field of multilingual and cross-lingual news recommendation, posing challenges and opportunities for researchers to address the nuanced needs of a diverse global audience.

Authors (3)

  1. Andreea Iana
  2. Goran Glavaš
  3. Heiko Paulheim