Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models (2405.03869v4)
Abstract: A core data-centric learning challenge is the identification of training samples that are detrimental to model performance. Influence functions serve as a prominent tool for this task and offer a robust framework for assessing the influence of training data on model predictions. Despite their widespread use, the high computational cost of inverting the Hessian matrix poses constraints, particularly when analyzing large deep models. In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. This transformation not only yields a straightforward, Hessian-free formulation but also provides insight into the role of the gradient in sample impact. Through systematic empirical evaluations, we first validate the hypothesis behind our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and in selecting data samples that improve the performance of natural language processing transformer models. We also extend its use to identifying influential samples for fine-tuning large language models (LLMs).
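To make the Hessian-free idea concrete, here is a minimal sketch of the pipeline the abstract describes: compute each training sample's loss gradient, then flag gradient-space outliers as candidate detrimental samples. This is an illustrative assumption, not the paper's implementation; the function names (`per_sample_gradients`, `flag_detrimental`), the choice of scikit-learn's `IsolationForest` as the outlier detector, and the contamination rate are all hypothetical.

```python
# Sketch of outlier gradient analysis: score each training sample by its
# flattened loss gradient, then mark gradient-space outliers as candidate
# detrimental samples. Detector choice and all names are assumptions.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.ensemble import IsolationForest

def per_sample_gradients(model, xs, ys):
    """Return an (n_samples, n_params) matrix of per-sample loss gradients."""
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for x, y in zip(xs, ys):
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        rows.append(torch.cat([g.flatten() for g in grads]).cpu().numpy())
    return np.stack(rows)

def flag_detrimental(grad_matrix, contamination=0.05):
    """Mark gradient-space outliers (True = candidate detrimental sample)."""
    detector = IsolationForest(contamination=contamination, random_state=0)
    return detector.fit_predict(grad_matrix) == -1  # -1 denotes an outlier

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Linear(10, 2)  # toy classifier on random data
    xs, ys = torch.randn(100, 10), torch.randint(0, 2, (100,))
    mask = flag_detrimental(per_sample_gradients(model, xs, ys))
    print(f"flagged {mask.sum()} of {len(mask)} samples as potential outliers")
```

In this toy setting the detector treats gradients as ordinary feature vectors. For large deep models, the gradient dimension would typically be reduced first, for example by restricting to last-layer gradients or applying sparse random projections, before running outlier detection.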