Comprehensive Benchmarking of Entropy and Margin Based Scoring Metrics for Data Selection (2311.16302v1)

Published 27 Nov 2023 in cs.LG and cs.CL

Abstract: While data selection methods have been studied extensively in active learning, data pruning, and data augmentation settings, there is little evidence for the efficacy of these methods in industry-scale settings, particularly in low-resource languages. Our work presents ways of assessing prospective training examples in those settings for their "usefulness" or "difficulty". We also demonstrate how these measures can be used in selecting important examples for training supervised machine learning models. We primarily experiment with entropy and Error L2-Norm (EL2N) scores. We use these metrics to curate high-quality datasets from a large pool of Weak Signal Labeled data, which assigns no-defect, high-confidence hypotheses during inference as ground-truth labels. We then conduct training data augmentation experiments using these de-identified datasets and demonstrate that score-based selection can result in a 2% decrease in semantic error rate and a 4%-7% decrease in domain classification error rate compared to the baseline technique of random selection.
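
The two scoring metrics named in the abstract can be made concrete with a short sketch. The snippet below is a minimal illustration, not the paper's actual pipeline: it assumes per-example softmax outputs from a classifier and shows how predictive entropy and EL2N (the L2 norm of the error between the softmax vector and the one-hot label, following Paul et al., 2021) might be computed and used to rank candidate examples. The function names and toy data are assumptions made for illustration.

```python
import numpy as np

def entropy_score(probs, eps=1e-12):
    # Predictive entropy of a softmax distribution: -sum_i p_i * log(p_i).
    # Higher entropy means the model is less certain about the example.
    probs = np.clip(probs, eps, 1.0)
    return -np.sum(probs * np.log(probs), axis=-1)

def el2n_score(probs, labels, num_classes):
    # Error L2-Norm: ||softmax(x) - one_hot(y)||_2.
    # Higher scores flag examples the model gets "more wrong" w.r.t. its label.
    one_hot = np.eye(num_classes)[labels]
    return np.linalg.norm(probs - one_hot, axis=-1)

def select_top_k(scores, k):
    # Keep the k highest-scoring (hardest / most informative) examples.
    return np.argsort(scores)[::-1][:k]

# Toy usage: 4 candidate examples, 3 classes.
probs = np.array([
    [0.90, 0.05, 0.05],  # confident and consistent with its label
    [0.40, 0.35, 0.25],  # uncertain
    [0.20, 0.70, 0.10],  # confident but inconsistent with label 0
    [0.34, 0.33, 0.33],  # near-uniform
])
labels = np.array([0, 0, 0, 2])

print("entropy:", entropy_score(probs))
print("EL2N:   ", el2n_score(probs, labels, num_classes=3))
print("top-2 by EL2N:", select_top_k(el2n_score(probs, labels, 3), k=2))
```

In the paper's setting, scores like these would be computed over the pool of Weak Signal Labeled candidates, with the selected subset added to the supervised training data; the specific selection thresholds and training setup are experiment-dependent and not reproduced here.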

