A Survey on Data Selection for Language Models (2402.16827v3)

Published 26 Feb 2024 in cs.CL and cs.LG

Abstract: A major factor in the recent success of LLMs is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.

A Comprehensive Review of Data Selection Methods for LLMs

Introduction to Data Selection in Machine Learning

Data selection is a pivotal aspect of the machine learning pipeline, and it is especially consequential in the age of LLMs, which are trained on massive, heterogeneous corpora. Selecting the right data for training these models is not straightforward: it requires identifying which subsets of data will lead to the best model performance in terms of accuracy, efficiency, and fairness. The challenge lies not only in handling the sheer volume of available data but also in coping with the wide variation in its quality.

Taxonomy of Data Selection Methods

Data selection practices can be broadly classified by two primary goals: matching the distribution of the training data to the target task (distribution matching) and increasing the coverage and diversity of the dataset (distribution diversification). Both have their place: the former is crucial for domain-specific tasks requiring high precision, while the latter suits general-purpose models that need robustness and broad applicability.
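
To give a concrete flavor of the diversification goal, the following is a minimal sketch, not taken from the survey: it greedily adds the candidate document least similar to anything already selected, a crude stand-in for the coreset- and deduplication-style methods the survey covers. The toy corpus, the bag-of-words representation, and the max-min criterion are all illustrative assumptions.

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def diversify(docs: list[str], k: int) -> list[str]:
    """Greedy max-min selection: start from the first document and
    repeatedly add the candidate least similar to anything already chosen."""
    vecs = [bow(d) for d in docs]
    selected = [0]
    while len(selected) < min(k, len(docs)):
        best, best_closeness = None, float("inf")
        for i, v in enumerate(vecs):
            if i in selected:
                continue
            # similarity to the closest already-selected document
            closeness = max(cosine(v, vecs[j]) for j in selected)
            if closeness < best_closeness:
                best, best_closeness = i, closeness
        selected.append(best)
    return [docs[i] for i in selected]

corpus = [
    "the cat sat on the mat",
    "the cat sat on a mat",           # near-duplicate, picked last if at all
    "stock markets rallied today",
    "protein folding with neural networks",
]
print(diversify(corpus, k=3))
```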

The process of data selection comprises several strategic components, notably:

  • Utility Function Definition: This involves mapping data points to a numeric value representing their utility, which is crucial for filtering and prioritizing data.
  • Selection Mechanism: Decides which data points are included in the training set based on their assigned utility values (a combined sketch of the first two components follows this list).
  • Dataset Characteristics Adjustment: Methods in this category adjust the dataset's distribution to favor characteristics deemed desirable for the training objectives.
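
To make the first two components concrete, here is a minimal sketch that the survey does not prescribe: the utility function is a cross-entropy-difference-style score (how much more a document resembles a target domain than the general corpus, in the spirit of Moore-Lewis selection), and the selection mechanism is either deterministic top-k or Gumbel-top-k sampling from a softmax over utilities. The corpora, smoothing, and temperature below are illustrative assumptions.

```python
import math
import random
from collections import Counter

def make_lm(corpus):
    """Unigram counts and total token count for a small corpus."""
    counts = Counter(tok for doc in corpus for tok in doc.lower().split())
    return counts, sum(counts.values())

def logprob(counts, total, vocab_size, token, alpha=1.0):
    """Add-alpha smoothed unigram log-probability."""
    return math.log((counts[token] + alpha) / (total + alpha * vocab_size))

def utility(doc, target_lm, general_lm, vocab_size):
    """Utility function: per-token cross-entropy difference between a
    target-domain and a general-domain unigram model (higher = more target-like)."""
    t_counts, t_total = target_lm
    g_counts, g_total = general_lm
    tokens = doc.lower().split()
    score = sum(logprob(t_counts, t_total, vocab_size, t)
                - logprob(g_counts, g_total, vocab_size, t) for t in tokens)
    return score / max(len(tokens), 1)

def select_top_k(docs, scores, k):
    """Deterministic selection mechanism: keep the k highest-utility documents."""
    return [d for _, d in sorted(zip(scores, docs), reverse=True)[:k]]

def select_gumbel_top_k(docs, scores, k, temperature=1.0):
    """Stochastic alternative: Gumbel-top-k, i.e. sampling k documents
    without replacement from a softmax over utilities."""
    noisy = [s / temperature - math.log(-math.log(max(random.random(), 1e-12)))
             for s in scores]
    return [d for _, d in sorted(zip(noisy, docs), reverse=True)[:k]]

# Illustrative corpora and candidates (assumptions for this sketch).
target = ["the patient was given aspirin", "dosage depends on body weight"]
general = ["the weather is nice today", "stocks fell sharply", "the cat sat on the mat"]
candidates = ["aspirin dosage for adult patients",
              "football scores from last night",
              "patient recovery after treatment"]

target_lm, general_lm = make_lm(target), make_lm(general)
vocab_size = len(set(target_lm[0]) | set(general_lm[0]))
scores = [utility(d, target_lm, general_lm, vocab_size) for d in candidates]
print(select_top_k(candidates, scores, k=2))
print(select_gumbel_top_k(candidates, scores, k=2, temperature=0.5))
```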

Pretraining Data Selection

For pretraining LLMs, the goal is often to filter and curate data from extensive datasets like the Common Crawl corpus, ensuring the removal of low-quality or irrelevant information while retaining high-quality content. Various heuristic approaches are employed for this purpose, alongside more sophisticated model-based and perplexity-based quality filtering. The challenge is to achieve a balance that favors data efficiency and model performance without introducing significant biases.
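
As a sketch of what such heuristic filtering can look like in practice, the rules and thresholds below are illustrative assumptions loosely modeled on commonly reported web-filtering heuristics (document length, mean word length, alphabetic ratio, boilerplate detection); they are not the exact criteria of any pipeline discussed in the survey.

```python
def passes_heuristic_filters(doc: str) -> bool:
    """Return True if a document survives a few simple surface-level
    quality heuristics. All thresholds here are illustrative assumptions."""
    words = doc.split()
    if not 50 <= len(words) <= 100_000:                  # too short or absurdly long
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_word_len <= 10:                     # gibberish or token soup
        return False
    alpha_words = sum(1 for w in words if any(c.isalpha() for c in w))
    if alpha_words / len(words) < 0.8:                   # mostly symbols or numbers
        return False
    lines = [ln.strip() for ln in doc.splitlines() if ln.strip()]
    if lines and sum(ln.startswith(("-", "*", "•")) for ln in lines) / len(lines) > 0.9:
        return False                                     # bullet-point boilerplate
    # A perplexity- or classifier-based filter would slot in here, e.g. keeping
    # only documents scored as "high quality" by a reference model.
    return True

candidate_docs: list[str] = []   # raw web documents would be loaded here
kept = [doc for doc in candidate_docs if passes_heuristic_filters(doc)]
```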

Enhancing LLM Performance through Specific Data Selection Techniques

  • Fine-tuning and Multitask Learning: These methods leverage auxiliary datasets or diverse tasks to improve model performance on specific targets or across a multitude of tasks. The emphasis here is on domain-specific selection, where additional data is judiciously chosen to closely mirror the task at hand.
  • In-Context Learning: Techniques that select or generate effective demonstrations to place within prompts, showing how careful data selection can significantly influence model behavior even without training on the selected data (a retrieval sketch follows this list).
  • Task-specific Fine-tuning: These settings call for strategies that either increase the training data’s alignment with the target task or improve data efficiency and robustness by carefully curating and diversifying the training samples.
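
As promised above, a retrieval sketch for demonstration selection: score the labeled pool against the test input and place the most similar examples in the prompt. Bag-of-words cosine similarity stands in for a learned sentence encoder here, and the pool, query, and choice of k are assumptions for illustration, not the survey's recommendation.

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def select_demonstrations(query: str, pool: list[tuple[str, str]], k: int = 2) -> str:
    """Pick the k labeled examples most similar to the query and format
    them as few-shot demonstrations ahead of the query itself."""
    q = bow(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)
    demos = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in ranked[:k])
    return f"{demos}\n\nInput: {query}\nLabel:"

# Illustrative labeled pool and query (assumptions for this sketch).
pool = [
    ("the movie was wonderful", "positive"),
    ("terrible acting and a dull plot", "negative"),
    ("the service was painfully slow", "negative"),
    ("a delightful, heartwarming film", "positive"),
]
print(select_demonstrations("a wonderful heartwarming movie", pool, k=2))
```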

Future Directions and Challenges

The review underlines the nuanced trade-offs between memorization and generalization inherent in data selection decisions. Innovations in direct data-evaluation metrics, the development of comprehensive benchmarks, and a shift toward more holistic data processing strategies are highlighted as key future directions.

Conclusion

This survey aims to provide a structured understanding of the landscape of data selection methods in machine learning, with a focus on LLMs. It emphasizes the intricate balance required in selecting data that both aligns with target tasks and ensures models are robust, fair, and efficient. As the field evolves, so too will the strategies for selecting the optimal datasets, underscoring the importance of continued research and innovation in this space.

  268. CCNet: Extracting high quality monolingual datasets from web crawl data. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, pp.  4003–4012, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.494.
  269. A broad-coverage challenge corpus for sentence understanding through inference. In Marilyn Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1112–1122, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1101. URL https://aclanthology.org/N18-1101.
  270. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
  271. Scattershot: Interactive in-context example curation for text transformation. In Proceedings of the 28th International Conference on Intelligent User Interfaces, IUI ’23, pp.  353–367, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701061. doi: 10.1145/3581641.3584059. URL https://doi.org/10.1145/3581641.3584059.
  272. Sheared llama: Accelerating language model pre-training via structured pruning, 2023a.
  273. Less: Selecting influential data for targeted instruction tuning, 2024.
  274. Moderate coreset: A universal method of data selection for real-world data-efficient deep learning. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=7D5EECbOaf9.
  275. C-pack: Packaged resources to advance general chinese embedding. arXiv preprint arXiv:2309.07597, 2023.
  276. Doremi: Optimizing data mixtures speeds up language model pretraining. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=lXuByUeHhd.
  277. Data selection for language models via importance resampling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. URL https://openreview.net/forum?id=uPSQv0leAu.
  278. Detoxifying language models risks marginalizing minority voices. arXiv preprint arXiv:2104.06390, 2021.
  279. Curriculum learning for natural language understanding. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  6095–6104, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.542. URL https://aclanthology.org/2020.acl-main.542.
  280. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  281. Misconfidence-based demonstration selection for llm in-context learning. arXiv preprint arXiv:2401.06301, 2024.
  282. Perils of self-feedback: Self-bias amplifies in large language models, 2024a.
  283. In-context learning with retrieved demonstrations for language models: A survey. arXiv preprint arXiv:2401.11624, 2024b.
  284. mT5: A massively multilingual pre-trained text-to-text transformer. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  483–498, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41.
  285. Fair Class Balancing: Enhancing Model Fairness without Observing Sensitive Attributes. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, pp.  1715–1724, New York, NY, USA, October 2020. Association for Computing Machinery. ISBN 978-1-4503-6859-9. doi: 10.1145/3340531.3411980. URL https://dl.acm.org/doi/10.1145/3340531.3411980.
  286. Fairness with overlapping groups; a probabilistic perspective. Advances in Neural Information Processing Systems, 33, 2020a.
  287. Towards fairer datasets: filtering and balancing the distribution of the people subtree in the ImageNet hierarchy. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp.  547–558, Barcelona Spain, January 2020b. ACM. ISBN 978-1-4503-6936-7. doi: 10.1145/3351095.3375709. URL https://dl.acm.org/doi/10.1145/3351095.3375709.
  288. Image data augmentation for deep learning: A survey, 2023.
  289. Compositional exemplars for in-context learning. arXiv preprint arXiv:2302.05698, 2023.
  290. Bloom+ 1: Adding language support to bloom for zero-shot prompting. arXiv preprint arXiv:2212.09535, 2022.
  291. Real-fake: Effective training data synthesis through distribution matching. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=svIdLLZpsA.
  292. Self-rewarding language models, 2024b.
  293. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023.
  294. Mc^ 2: A multilingual corpus of minority languages in china. arXiv preprint arXiv:2311.08348, 2023a.
  295. Ideal: Influence-driven selective annotations empower in-context learners in large language models. arXiv preprint arXiv:2310.10873, 2023b.
  296. Instruction tuning for large language models: A survey, 2023c.
  297. Yiliang Zhang and Qi Long. Fairness in missing data imputation. arXiv preprint arXiv:2110.12002, 2021.
  298. Active example selection for in-context learning. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  9134–9148, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.622. URL https://aclanthology.org/2022.emnlp-main.622.
  299. Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, pp.  12674–12685. PMLR, 2021.
  300. Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  6514–6523, 2023.
  301. Dataset condensation with gradient matching. In International Conference on Learning Representations, 2020.
  302. Improved distribution matching for dataset condensation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  7856–7865, 2023.
  303. Coverage-centric coreset selection for high pruning rates. In The Eleventh International Conference on Learning Representations, 2022.
  304. LIMA: Less is more for alignment. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=KBMOKmX2he.
  305. Dataset distillation using neural feature regression. Advances in Neural Information Processing Systems, 35:9813–9827, 2022.
  306. Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023a.
  307. Multimodal c4: An open, billion-scale corpus of images interleaved with text, 2023b.
  308. Astraios: Parameter-efficient instruction tuning code large language models. arXiv preprint arXiv:2401.00788, 2024.
  309. Fine-tuning language models from human preferences, 2020.
Authors
  1. Alon Albalak
  2. Yanai Elazar
  3. Sang Michael Xie
  4. Shayne Longpre
  5. Nathan Lambert
  6. Xinyi Wang
  7. Niklas Muennighoff
  8. Bairu Hou
  9. Liangming Pan
  10. Haewon Jeong
  11. Colin Raffel
  12. Shiyu Chang
  13. Tatsunori Hashimoto
  14. William Yang Wang