Robust Guidance for Unsupervised Data Selection: Capturing Perplexing Named Entities for Domain-Specific Machine Translation (2402.19267v2)

Published 29 Feb 2024 in cs.CL and cs.AI

Abstract: Low-resource data presents a significant challenge for neural machine translation. In most cases, the low-resource setting arises from high costs: the need for domain experts or a shortage of language experts. Identifying the most training-efficient data in an unsupervised setting therefore emerges as a practical strategy. Recent research suggests that such effective data can be identified by selecting 'appropriately complex data' based on its volume, providing strong intuition for unsupervised data selection. However, we have found that establishing criteria for unsupervised data selection remains a challenge, as the 'appropriate level of difficulty' may vary with the data domain. We introduce a novel unsupervised data selection method, 'Capturing Perplexing Named Entities,' which uses the maximum inference entropy over translated named entities as its selection metric. Tested on the 'Korean-English Parallel Corpus of Specialized Domains,' our method provides robust guidance for identifying training-efficient data across different domains, in contrast to existing methods.
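
The paper itself gives the full formulation; the sketch below is only a minimal Python illustration of the selection signal the abstract describes: score each candidate sentence pair by the maximum token-level entropy a translation model assigns inside translated named-entity spans, then keep the highest-scoring pairs. The checkpoint name, the `ne_token_mask` input, and the ranking loop are assumptions made for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical checkpoint; any encoder-decoder MT model exposes the
# same interface through Hugging Face transformers.
MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def max_ne_entropy(src: str, tgt: str, ne_token_mask: list[bool]) -> float:
    """Score a sentence pair by the maximum inference entropy over
    target tokens inside named-entity spans.

    `ne_token_mask[i]` is True when target token i belongs to a named
    entity; producing the mask (with any NER tagger plus the
    tokenizer's offset mapping) is omitted here for brevity, and its
    length must match the tokenized target.
    """
    enc = tokenizer(src, return_tensors="pt")
    labels = tokenizer(text_target=tgt, return_tensors="pt")["input_ids"]
    # Teacher-forced logits: position i predicts target token i.
    logits = model(**enc, labels=labels).logits[0]           # (T, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    # Shannon entropy of the model's predictive distribution per step.
    token_entropy = -(log_probs.exp() * log_probs).sum(-1)   # (T,)
    ne_entropy = token_entropy[torch.tensor(ne_token_mask)]
    return float(ne_entropy.max()) if ne_entropy.numel() else 0.0

# Unsupervised selection: rank an unlabeled candidate pool and keep
# the top-k most "perplexing" pairs as the training subset.
# pool = [(src, tgt, ne_token_mask), ...]
# chosen = sorted(pool, key=lambda p: max_ne_entropy(*p), reverse=True)[:k]
```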

References (42)
  1. In-context examples selection for machine translation.
  2. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3874–3884, Minneapolis, Minnesota. Association for Computational Linguistics.
  3. The low-resource double bind: An empirical study of pruning for low-resource machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3316–3333, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  4. Data scaling laws in NMT: The effect of noise and architecture. In International Conference on Machine Learning, pages 1466–1482. PMLR.
  5. A statistical approach to machine translation. Computational linguistics, 16(2):79–85.
  6. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3198–3213, Online. Association for Computational Linguistics.
  7. Marzieh Fadaee and Christof Monz. 2018. Back-translation sampling by targeting difficult words in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 436–446, Brussels, Belgium. Association for Computational Linguistics.
  8. Vitaly Feldman and Chiyuan Zhang. 2020. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33:2881–2891.
  9. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.
  10. Understanding transformer memorization recall through idioms. arXiv preprint arXiv:2210.03588.
  11. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.
  12. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
  13. Self-training sampling with monolingual data uncertainty for neural machine translation. arXiv preprint arXiv:2106.00941.
  14. Martin Joos. 1936. Language, 12(3):196–210.
  15. Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
  16. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965.
  17. Small data, big impact: Leveraging minimal data for effective machine translation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2740–2756, Toronto, Canada. Association for Computational Linguistics.
  18. Trivial or impossible — dichotomous data difficulty masks model differences (on ImageNet and beyond). In International Conference on Learning Representations.
  19. Fast-paced improvements to named entity handling for neural machine translation. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 141–149, Ghent, Belgium. European Association for Machine Translation.
  20. Data diversification: A simple strategy for neural machine translation. Advances in Neural Information Processing Systems, 33:10018–10029.
  21. No language left behind: Scaling human-centered machine translation.
  22. Deep learning on a data diet: Finding important examples early in training. In Advances in Neural Information Processing Systems.
  23. Álvaro Peris and Francisco Casacuberta. 2018. Active learning for interactive neural machine translation of data streams. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 151–160, Brussels, Belgium. Association for Computational Linguistics.
  24. Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
  25. Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  26. On the spectral bias of neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5301–5310. PMLR.
  27. The curious case of hallucinations in neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1172–1183, Online. Association for Computational Linguistics.
  28. Are references really needed? unbabel-IST 2021 submission for the metrics shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 1030–1040, Online. Association for Computational Linguistics.
  29. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
  30. Improving the quality of neural machine translation through proper translation of name entities. In 2023 6th International Conference on Information Systems and Computer Networks (ISCON), pages 1–4.
  31. Beyond neural scaling laws: beating power law scaling via data pruning. In Advances in Neural Information Processing Systems.
  32. Selecting backtranslated data from multiple sources for improved neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3898–3908, Online. Association for Computational Linguistics.
  33. Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
  34. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations.
  35. Neural machine translation incorporating named entity. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3240–3250, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  36. Uncertainty-aware balancing for multilingual and multi-domain neural machine translation training. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7291–7305.
  37. End-to-end entity-aware neural machine translation. Machine Learning, pages 1–23.
  38. Vega-MT: The JD Explore Academy translation system for WMT22.
  39. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations.
  40. Knowledge graph enhanced neural machine translation via multi-task learning on sub-entity granularity. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4495–4505.
  41. Uncertainty-aware curriculum learning for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6934–6944, Online. Association for Computational Linguistics.
  42. Flitto. 2021. Korean-English Parallel Corpus of Specialized Domains. AI Hub.
