
A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery (2007.01359v3)

Published 2 Jul 2020 in cs.CL

Abstract: In this paper, we present a Bayesian multilingual document model for learning language-independent document embeddings. The model is an extension of BaySMM [Kesiraju et al., 2020] to the multilingual scenario. It represents document embeddings as Gaussian distributions, thereby encoding the uncertainty in the covariance. We propagate the learned uncertainties through linear classifiers, which benefits zero-shot cross-lingual topic identification. Our experiments on 17 languages show that the proposed multilingual Bayesian document model performs competitively with systems based on large-scale neural networks (LASER, XLM-R, mUSE) on 8 high-resource languages, and outperforms these systems on 9 mid-resource languages. We revisit cross-lingual topic identification in zero-shot settings by examining current datasets, baseline systems and the languages covered. We identify shortcomings in the existing evaluation protocol (MLDoc dataset) and propose a robust alternative scheme, while also extending the cross-lingual experimental setup to 17 languages. Finally, we consolidate the observations from all our experiments and discuss points that can benefit future research in applications relying on cross-lingual transfer.
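The abstract's idea of propagating embedding uncertainty through a linear classifier can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a diagonal-covariance Gaussian embedding, a softmax linear classifier, and a plain Monte Carlo average, and the function name `expected_class_probs` and all numbers below are hypothetical choices made for illustration.

```python
import numpy as np


def expected_class_probs(mean, log_var, W, b, n_samples=256, seed=0):
    """Monte Carlo estimate of E[softmax(W z + b)] for z ~ N(mean, diag(exp(log_var))).

    Hypothetical helper: averages class posteriors over samples of the
    Gaussian document embedding instead of classifying only its mean.
    """
    rng = np.random.default_rng(seed)
    std = np.exp(0.5 * log_var)
    # Draw embedding samples, shape (n_samples, dim).
    z = mean + std * rng.standard_normal((n_samples, mean.shape[0]))
    logits = z @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)


# Toy example (made-up values): same mean embedding, different uncertainty.
mean = np.array([1.0, -0.5])
W = np.array([[2.0, 0.0], [-2.0, 0.0]])  # two classes, two embedding dims
b = np.zeros(2)

confident = expected_class_probs(mean, np.full(2, -6.0), W, b)  # tiny variance
uncertain = expected_class_probs(mean, np.full(2, 2.0), W, b)   # large variance
```

In this toy setup the prediction for the high-variance embedding is less peaked than for the low-variance one, which is the behaviour one would hope for when a document embedding is poorly determined.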

References (40)
  1. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online, Nov. 2021. ACL.
  2. Massively multilingual word embeddings. CoRR, abs/1602.01925, 2016.
  3. M. Artetxe and H. Schwenk. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the ACL, 7:597–610, 2019.
  4. C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
  5. D. M. Blei. Probabilistic Topic Models. Commun. ACM, 55(4):77–84, Apr. 2012. ISSN 0001-0782.
  6. A. Bérard. Continual Learning in Multilingual NMT via Language-Specific Embeddings. In Proc. of the Sixth Conference on Machine Translation (WMT), pages 542–565. ACL, Nov 2021.
  7. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online, July 2020. ACL.
  8. A primer on pretrained multilingual language models. CoRR, abs/2107.00676, 2021.
  9. A. Eisele and Y. Chen. Multiun: A multilingual corpus from united nation documents. In Proceedings of the International Conference on LREC, 17-23 May 2010, Valletta, Malta. ELRA, 2010.
  10. CCAligned: A massive collection of cross-lingual web-document pairs. In Proceedings of the 2020 Conference on EMNLP, pages 5960–5969, Online, Nov. 2020. ACL.
  11. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the ACL (Volume 1: Long Papers), pages 878–891, Dublin, Ireland, May 2022. ACL.
  12. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization. In Proceedings of the 37th ICML, July 2020.
  13. Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks. In Proceedings of the 2019 Conference on EMNLP and the 9th IJCNLP, Hong Kong, China, November 3-7, 2019, pages 2485–2494. ACL, 2019.
  14. Indicnlpsuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages. In T. Cohn, Y. He, and Y. Liu, editors, Findings of the ACL: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 4948–4961. ACL, 2020.
  15. A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In Advances in Neural Information Processing Systems 30, pages 5574–5584. Curran Associates, Inc., 2017.
  16. Learning Document Embeddings Along With Their Uncertainties. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2319–2332, 2020.
  17. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  18. D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR Conference Track Proceedings, Banff, AB, Canada, April 2014.
  19. P. Koehn. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the 10th Machine Translation Summit, pages 79–86, Phuket, Thailand, 2005. AAMT.
  20. Neural variational inference for text processing. In Proceedings of the 33rd International Conference on ICML, ICML’16, pages 1727–1736, New York, NY, USA, 2016. JMLR.org.
  21. Polylingual topic models. In Proceedings of the 2009 Conference on EMNLP, pages 880–889, Singapore, Aug. 2009. ACL.
  22. Automatic differentiation in PyTorch. In NIPS Workshop, 2017.
  23. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  24. Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages. Transactions of the ACL, 10:145–162, Feb. 2022. ISSN 2307-387X.
  25. N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on EMNLP-IJCNLP, pages 3982–3992, Hong Kong, China, Nov. 2019. ACL.
  26. N. Reimers and I. Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on EMNLP, pages 4512–4525, Online, Nov. 2020. ACL.
  27. Stochastic backpropagation and approximate inference in deep generative models. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st ICML, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286, Beijing, China, 22–24 Jun 2014. PMLR.
  28. A survey of cross-lingual word embedding models. J. Artif. Int. Res., 65(1):569–630, May 2019. ISSN 1076-9757.
  29. H. Schwenk and M. Douze. Learning Joint Multilingual Sentence Representations with Neural Machine Translation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, August 3, 2017, pages 157–167, 2017.
  30. H. Schwenk and X. Li. A Corpus for Multilingual Document Classification in Eight Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018.
  31. Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI NY, USA, February 7-12, 2020, pages 8854–8861. AAAI Press, 2020.
  32. A multilingual parallel corpora collection effort for Indian languages. In Proceedings of the 12th LREC, pages 3743–3751, Marseille, France, May 2020. ELRA. ISBN 979-10-95546-34-4.
  33. S. Strassel and J. Tracey. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the 10th International Conference on LREC, pages 3273–3280, Portorož, Slovenia, May 2016. ELRA.
  34. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 3645–3650. ACL, 2019.
  35. J. Tiedemann. Parallel data, tools and interfaces in OPUS. In N. Calzolari, K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, Proceedings of the 8th International Conference on LREC, Istanbul, Turkey, May 23-25, pages 2214–2218. ELRA, 2012.
  36. S. Wu and M. Dredze. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT. In Proceedings of the 2019 Conference on EMNLP and the 9th IJCNLP, pages 833–844, Hong Kong, China, Nov. 2019. ACL.
  37. Y. Xiao and W. Y. Wang. Quantifying Uncertainties in Natural Language Processing Tasks. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 7322–7329, 2019.
  38. A multilingual topic model for learning weighted topic links across corpora with low comparability. In Proceedings of the 2019 Conference on EMNLP-IJCNLP, pages 1243–1248, Hong Kong, China, Nov. 2019. ACL.
  39. Multilingual universal sentence encoder for semantic retrieval. In Proceedings of the 58th Annual Meeting of the ACL: System Demonstrations, pages 87–94, Online, July 2020. ACL.
  40. The united nations parallel corpus v1.0. In Proceedings of the 10th International Conference on LREC, April 2016.
