MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation (2403.17876v1)

Published 26 Mar 2024 in cs.IR

Abstract: Digital news platforms use news recommenders as the main instrument to cater to the individual information needs of readers. Despite an increasingly language-diverse online community, in which many Internet users consume news in multiple languages, the majority of news recommendation focuses on major, resource-rich languages, and English in particular. Moreover, nearly all news recommendation efforts assume monolingual news consumption, whereas more and more users tend to consume information in at least two languages. Accordingly, the existing body of work on news recommendation suffers from a lack of publicly available multilingual benchmarks that would catalyze development of news recommenders effective in multilingual settings and for low-resource languages. Aiming to fill this gap, we introduce xMIND, an open, multilingual news recommendation dataset derived from the English MIND dataset using machine translation, covering a set of 14 linguistically and geographically diverse languages, with digital footprints of varying sizes. Using xMIND, we systematically benchmark several state-of-the-art content-based neural news recommenders (NNRs) in both zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer scenarios, considering both monolingual and bilingual news consumption patterns. Our findings reveal that (i) current NNRs, even when based on a multilingual LLM, suffer from substantial performance losses under ZS-XLT and that (ii) inclusion of target-language data in FS-XLT training has limited benefits, particularly when combined with a bilingual news consumption. Our findings thus warrant a broader research effort in multilingual and cross-lingual news recommendation. The xMIND dataset is available at https://github.com/andreeaiana/xMIND.

Introducing xMIND: Multilingual Dataset for Cross-lingual News Recommendation

Overview

The paper "MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation" introduces xMIND, a comprehensive, open multilingual dataset derived from the English MIND dataset using machine translation. Covering 14 linguistically and geographically diverse languages, it aims to bridge the gap in multilingual news recommendation research, which predominantly focuses on English and other resource-rich languages. The authors systematically benchmark several state-of-the-art content-based neural news recommenders (NNRs) in zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual settings, addressing both monolingual and bilingual news consumption scenarios.
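The difference between the two transfer setups comes down to how the training set is constructed. The sketch below illustrates this; the function name, the toy samples, and the 10% target-language share are illustrative assumptions, not the paper's exact protocol (Swahili stands in for any xMIND target language):

```python
import random

def make_fsxlt_train_set(en_samples, tgt_samples, tgt_fraction, seed=0):
    """Build a few-shot cross-lingual (FS-XLT) training set by replacing a
    fraction of English training samples with target-language ones.
    tgt_fraction=0.0 reduces to the zero-shot (ZS-XLT) setup: English only."""
    rng = random.Random(seed)
    n_tgt = int(len(en_samples) * tgt_fraction)
    mixed = (rng.sample(tgt_samples, n_tgt)
             + rng.sample(en_samples, len(en_samples) - n_tgt))
    rng.shuffle(mixed)
    return mixed

# Toy usage: 100 English impressions and 100 parallel Swahili translations.
en = [("en", i) for i in range(100)]
swh = [("swh", i) for i in range(100)]

zs_train = make_fsxlt_train_set(en, swh, tgt_fraction=0.0)  # ZS-XLT
fs_train = make_fsxlt_train_set(en, swh, tgt_fraction=0.1)  # FS-XLT, 10% target
```

Setting `tgt_fraction=0.0` recovers the zero-shot setup; the paper's central finding is that even nonzero target-language fractions yield only limited gains.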

Key Contributions

  • xMIND Dataset: The dataset covers 14 high- and low-resource languages from diverse geographical areas and language families, some of which are underrepresented in current multilingual pretrained language models (mPLMs). This parallel corpus, derived through machine translation from the English MIND dataset, allows for direct performance comparisons of multilingual news recommenders and cross-lingual transfer approaches.
  • Cross-lingual Recommendation Scenarios: The paper evaluates a variety of content-based NNRs across ZS-XLT and FS-XLT setups, considering both monolingual and bilingual news consumption patterns. Findings highlight the substantial performance drops when recommenders trained on English are tested on other languages (ZS-XLT scenario), and how injecting data in the target language during training (FS-XLT scenario) has limited benefits.
  • Translation Quality Assessment: By translating the MIND dataset into 14 languages with NLLB and comparing against Google Neural Machine Translation (GNMT) on a subset, the authors provide insight into translation quality. Annotations on a sample set indicate generally high intelligibility and fidelity of the translations, albeit with variation across languages.
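Because xMIND is parallel to MIND at the article level, bilingual consumption can be simulated by rendering part of a user's click history in a target language. The sketch below assumes a hypothetical in-memory parallel corpus (the actual xMIND distribution format and schema may differ; Romanian stands in for any target language, and the titles are invented):

```python
import random

# Hypothetical parallel corpus: each news ID maps to one title per language,
# mirroring how xMIND pairs translations with English MIND articles.
news = {
    "N1": {"eng": "Markets rally", "ron": "Piețele cresc"},
    "N2": {"eng": "New vaccine approved", "ron": "Un nou vaccin aprobat"},
    "N3": {"eng": "Storm hits coast", "ron": "Furtuna lovește coasta"},
    "N4": {"eng": "Team wins final", "ron": "Echipa câștigă finala"},
}

def bilingual_history(clicked_ids, tgt_lang, tgt_share=0.5, seed=0):
    """Render a user's click history as mixed-language titles: a random
    share of the articles appears in the target language, the rest in English."""
    rng = random.Random(seed)
    n_tgt = round(len(clicked_ids) * tgt_share)
    tgt_ids = set(rng.sample(clicked_ids, n_tgt))
    return [news[nid][tgt_lang if nid in tgt_ids else "eng"]
            for nid in clicked_ids]

history = bilingual_history(["N1", "N2", "N3", "N4"], "ron")
```

Monolingual target-language consumption is the special case `tgt_share=1.0`; the paper evaluates recommenders under both patterns.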

Findings and Implications

  • The analysis reveals significant performance degradation of existing NNRs when evaluated in cross-lingual settings, underscoring the need for research into making these systems more robust and effective across languages.
  • The limited gains from FS-XLT training expose a methodological limitation of simply mixing target-language data into training and call for approaches tailored specifically to multilingual and cross-lingual news recommendation.
  • Comparing NLLB translations against those of a commercial system (GNMT), and measuring their impact on recommender performance, suggests that NNRs are fairly robust to translation quality. The observed variation in automatic translation quality nonetheless calls for careful attention to source texts and how well they translate.

Future Directions

The introduction of xMIND and the findings from benchmarking efforts encourage several future research directions:

  • Model Architecture Innovation: The mixed results from FS-XLT motivate exploring new NNR architectures that inherently support multilingual learning and better exploit cross-lingual signals during training.
  • User Behavior Modeling: Given the complexity of bilingual or multilingual news consumption patterns, future work could delve into modeling such behaviors more accurately and dynamically within recommender systems.
  • Domain Adaptation Strategies: Investigating domain adaptation strategies to leverage transfer learning more effectively between languages, especially focusing on low-resource and underrepresented languages, stands out as a promising research avenue.
  • Evaluation Frameworks: Developing more sophisticated evaluation frameworks that closely mimic real-world scenarios of multilingual news consumption can provide deeper insights into the operational effectiveness of these systems.

In summary, the xMIND dataset sets the stage for substantial advancements in the field of multilingual and cross-lingual news recommendation, posing challenges and opportunities for researchers to address the nuanced needs of a diverse global audience.

Authors (3)

  1. Andreea Iana
  2. Goran Glavaš
  3. Heiko Paulheim