
To Label or Not to Label: Hybrid Active Learning for Neural Machine Translation (2403.09259v2)

Published 14 Mar 2024 in cs.CL and cs.LG

Abstract: Active learning (AL) techniques reduce labeling costs for training neural machine translation (NMT) models by selecting smaller representative subsets from unlabeled data for annotation. Diversity sampling techniques select heterogeneous instances, while uncertainty sampling methods select instances with the highest model uncertainty. Both approaches have limitations: diversity methods may extract varied but trivial examples, while uncertainty sampling can yield repetitive, uninformative instances. To bridge this gap, we propose Hybrid Uncertainty and Diversity Sampling (HUDS), an AL strategy for domain adaptation in NMT that combines uncertainty and diversity for sentence selection. HUDS computes uncertainty scores for unlabeled sentences and then stratifies them. It clusters sentence embeddings within each stratum and computes diversity scores as the distance to the cluster centroid. A weighted hybrid score combining uncertainty and diversity is then used to select the top instances for annotation in each AL iteration. Experiments on multi-domain German-English and French-English datasets demonstrate that HUDS outperforms other strong AL baselines. We analyze the sentences selected by HUDS and show that it prioritizes diverse instances with high model uncertainty for annotation in early AL iterations.
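
The abstract walks through the HUDS selection procedure step by step (uncertainty scoring, stratification, per-stratum clustering of sentence embeddings, centroid-distance diversity, and a weighted hybrid ranking). The sketch below illustrates that pipeline in plain Python with NumPy and scikit-learn; it is not the authors' implementation. The equal-frequency stratification, the k-means configuration, the min-max normalization, and the mixing weight `beta` are illustrative assumptions, and the uncertainty scores and sentence embeddings (which in practice would come from the NMT model and a sentence encoder) are simply taken as inputs.

```python
# Minimal sketch of a HUDS-style selection step, assuming precomputed
# uncertainty scores and sentence embeddings for the unlabeled pool.
import numpy as np
from sklearn.cluster import KMeans

def huds_select(uncertainty, embeddings, n_select, n_strata=4, n_clusters=8, beta=0.5):
    """Rank unlabeled sentences by a weighted hybrid of uncertainty and diversity.

    uncertainty : (N,) model uncertainty score per unlabeled sentence
    embeddings  : (N, d) sentence embeddings
    n_select    : number of sentences to pick for annotation this AL iteration
    beta        : weight on uncertainty vs. diversity (illustrative default)
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    N = len(uncertainty)

    # 1) Stratify sentences by uncertainty (equal-frequency bins, an assumption).
    order = np.argsort(uncertainty)
    strata = np.array_split(order, n_strata)

    diversity = np.zeros(N)
    for stratum in strata:
        # 2) Cluster sentence embeddings within each stratum.
        k = min(n_clusters, len(stratum))
        km = KMeans(n_clusters=k, n_init=10).fit(embeddings[stratum])
        # 3) Diversity score: distance of each sentence to its cluster centroid
        #    (farther from the centroid = more diverse).
        centroids = km.cluster_centers_[km.labels_]
        diversity[stratum] = np.linalg.norm(embeddings[stratum] - centroids, axis=1)

    # Normalize both signals to [0, 1] before mixing.
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-12)

    # 4) Weighted hybrid score; take the top-scoring instances for annotation.
    hybrid = beta * norm(uncertainty) + (1.0 - beta) * norm(diversity)
    return np.argsort(-hybrid)[:n_select]
```

In each AL iteration, the selected indices would be sent for annotation, added to the labeled set, and the NMT model retrained before recomputing uncertainty scores for the remaining pool.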

Authors (3)
  1. Abdul Hameed Azeemi (7 papers)
  2. Ihsan Ayyub Qazi (9 papers)
  3. Agha Ali Raza (12 papers)
Citations (1)
