LexGen: Domain-aware Multilingual Lexicon Generation (2405.11200v2)

Published 18 May 2024 in cs.CL

Abstract: Lexicon or dictionary generation across domains is of significant societal importance, as it can enhance information accessibility for a diverse user base while preserving language identity. Prior work in the field primarily focuses on bilingual lexical induction, which deals with word alignments using mapping-based or corpus-based approaches. Research on lexicon generation remains limited, and even more so for domain-specific lexicons. The task is particularly important in medical, engineering, and other technical domains, owing to the highly infrequent usage of technical terms and their negligible availability in data for many low-resource languages. To address this gap, we propose a new model that generates dictionary words for 6 Indian languages in a multi-domain setting. Our model consists of domain-specific and domain-generic layers that encode information, and these layers are invoked via a learnable routing technique. Further, we propose an approach that explicitly leverages the relatedness between these Indian languages to produce coherent translations. We also release a new benchmark dataset covering 6 Indian languages across 8 diverse domains to propel further research in domain-specific lexicon induction. We conduct both zero-shot and few-shot experiments across multiple domains to show the efficacy of our proposed model in generalizing to unseen domains and unseen languages.
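The abstract describes domain-specific and domain-generic layers combined via a learnable router. As a rough illustration of that idea (not the paper's actual architecture; all names and parameter shapes here are hypothetical), a router can compute a gate value from the input and form a convex combination of the two layers' outputs:

```python
import math

# Minimal sketch of learnable routing between a domain-generic and a
# domain-specific layer. Names and the single-gate design are illustrative
# assumptions, not taken from the paper's implementation.

def sigmoid(x):
    """Squash a scalar into (0, 1) so it can act as a routing gate."""
    return 1.0 / (1.0 + math.exp(-x))

def linear(x, w, b):
    """Single-output linear layer: dot(w, x) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def routed_forward(x, generic_w, specific_w, router_w, router_b):
    """Mix generic and specific layer outputs with a learned gate g in (0, 1).

    Both 'layers' are elementwise scalings here purely to keep the sketch
    short; in practice they would be full (e.g. feed-forward) sublayers.
    """
    g = sigmoid(linear(x, router_w, router_b))
    generic_out = [w * xi for w, xi in zip(generic_w, x)]
    specific_out = [w * xi for w, xi in zip(specific_w, x)]
    # Convex combination: larger g routes more toward the specific path.
    return [g * s + (1.0 - g) * gen for s, gen in zip(specific_out, generic_out)]
```

With zero router weights the gate is exactly 0.5, so the output is the average of the two paths; training would adjust `router_w`/`router_b` so the gate leans toward the domain-specific path for in-domain inputs.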

Authors (6)
  1. Karthika NJ (2 papers)
  2. Ayush Maheshwari (19 papers)
  3. Atul Kumar Singh (2 papers)
  4. Preethi Jyothi (51 papers)
  5. Ganesh Ramakrishnan (88 papers)
  6. Krishnakant Bhatt (2 papers)
Citations (1)
