
Dataset Difficulty and the Role of Inductive Bias (2401.01867v1)

Published 3 Jan 2024 in cs.LG

Abstract: Motivated by the goals of dataset pruning and defect identification, a growing body of methods has been developed to score individual examples within a dataset. These methods, which we call "example difficulty scores", are typically used to rank or categorize examples, but the consistency of rankings between different training runs, scoring methods, and model architectures is generally unknown. To determine how example rankings vary due to these random and controlled effects, we systematically compare different formulations of scores over a range of runs and model architectures. We find that scores largely share the following traits: they are noisy over individual runs of a model, strongly correlated with a single notion of difficulty, and reveal examples that range from being highly sensitive to insensitive to the inductive biases of certain model architectures. Drawing from statistical genetics, we develop a simple method for fingerprinting model architectures using a few sensitive examples. These findings guide practitioners in maximizing the consistency of their scores (e.g. by choosing appropriate scoring methods, number of runs, and subsets of examples), and establish comprehensive baselines for evaluating scores in the future.
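To make the abstract's central question concrete, the sketch below shows one simple (hypothetical) formulation of an example difficulty score, namely the mean error of the correct-class confidence averaged over training runs, and measures the rank consistency between two independent sets of runs with a Spearman correlation. The score definition, simulation, and noise model here are illustrative assumptions, not the paper's exact methods.

```python
import random
import statistics

def difficulty_scores(confidences_per_run):
    """Score each example as mean (1 - correct-class confidence) over runs.

    `confidences_per_run` is a list of runs; each run is a list giving the
    model's confidence in the correct class for every example.
    """
    n_examples = len(confidences_per_run[0])
    return [
        statistics.mean(1.0 - run[i] for run in confidences_per_run)
        for i in range(n_examples)
    ]

def ranks(xs):
    """Ranks of xs (no tie handling; fine for continuous toy scores)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = statistics.mean(ra), statistics.mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

random.seed(0)
# Toy setup: 20 examples with a latent difficulty; each run observes a
# noisy confidence 1 - difficulty + Gaussian noise, clipped to [0, 1].
latent = [random.random() for _ in range(20)]

def simulate_runs(n_runs):
    return [
        [max(0.0, min(1.0, 1.0 - d + random.gauss(0, 0.1))) for d in latent]
        for _ in range(n_runs)
    ]

scores_a = difficulty_scores(simulate_runs(5))  # first set of 5 runs
scores_b = difficulty_scores(simulate_runs(5))  # independent second set
print(f"rank consistency (Spearman): {spearman(scores_a, scores_b):.2f}")
```

Averaging over more runs drives the correlation toward 1, mirroring the paper's observation that single-run scores are noisy and that consistency improves with the number of runs.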
