Data Valuation with Gradient Similarity

Published 13 May 2024 in cs.LG, q-bio.GN, q-bio.QM, and stat.ML (arXiv:2405.08217v1)

Abstract: High-quality data is crucial for accurate machine learning and actionable analytics; however, mislabeled or noisy data is a common problem in many domains. Distinguishing low- from high-quality data can be challenging, often requiring expert knowledge and considerable manual intervention. Data valuation algorithms are a class of methods that seek to quantify the value of each sample in a dataset based on its contribution or importance to a given predictive task. These data values have shown an impressive ability to identify mislabeled observations, and filtering low-value data can boost machine learning performance. In this work, we present a simple alternative to existing methods, termed Data Valuation with Gradient Similarity (DVGS). This approach can be easily applied to any gradient-descent learning algorithm, scales well to large datasets, and performs comparably to or better than baseline valuation methods on tasks such as corrupted-label discovery and noise quantification. We evaluate DVGS on tabular, image, and RNA expression datasets to show its effectiveness across domains. Our approach can rapidly and accurately identify low-quality data, reducing the need for expert knowledge and manual intervention in data-cleaning tasks.


Summary

  • The paper introduces DVGS, a gradient similarity-based method that computes data value by comparing gradients between source and target datasets.
  • It demonstrates improved scalability and effectiveness in detecting mislabeled and noisy data compared to methods like Data Shapley and DVRL.
  • Applied to datasets including LINCS L1000, DVGS enhances predictive performance by efficiently filtering low-quality samples.

Data Valuation with Gradient Similarity

Abstract

The paper "Data Valuation with Gradient Similarity" introduces a novel algorithm termed Data Valuation with Gradient Similarity (DVGS), aimed at quantifying the value of data samples based on their usefulness to predictive tasks. This method is designed to address data quality issues prevalent across various domains, offering an automated approach that scales effectively for large datasets, particularly in high-throughput settings. The DVGS algorithm targets tasks including corrupted label discovery and noise quantification, demonstrating comparable or superior performance to established methods such as Data Shapley and Data Valuation using Reinforcement Learning (DVRL).

Background

Importance of Data Valuation

Data valuation is pivotal in machine learning, where data quality significantly impacts model performance. Techniques that evaluate the informativeness of individual samples can identify mislabeled or noisy data, enhancing the predictive capability of models. Existing methods such as Leave-One-Out (LOO), Data Shapley, and Data Valuation using Reinforcement Learning (DVRL) each have strengths, but they often face limitations in scalability and in robustness to hyperparameter choices: Data Shapley provides an equitable valuation framework but struggles to scale to large datasets, while DVRL offers competitive performance but requires extensive hyperparameter tuning.
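For concreteness, the LOO baseline can be sketched in a few lines. This is a generic illustration with a logistic-regression proxy model and accuracy as the performance metric, not the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def loo_values(X_train, y_train, X_val, y_val):
    """Leave-One-Out data values: the drop in validation performance
    when each training sample is removed in turn."""
    def fit_score(X, y):
        model = LogisticRegression(max_iter=1000).fit(X, y)
        return accuracy_score(y_val, model.predict(X_val))

    base = fit_score(X_train, y_train)
    n = len(X_train)
    values = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        # value of sample i = full-data score minus score without sample i
        values[i] = base - fit_score(X_train[mask], y_train[mask])
    return values
```

Each of the n refits costs a full training run, which is why LOO (and the combinatorially harder Data Shapley) scales poorly compared to a method that needs only a single training pass.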

Application to LINCS Dataset

The DVGS method is evaluated on various datasets, including the LINCS L1000, a large-scale transcriptomic dataset notable for its data quality challenges. Enhancing data quality in such datasets can significantly boost their utility, particularly in fields like drug development. The DVGS approach offers an alternative metric to previously established measures, such as the Average Pearson Correlation (APC), providing insights into sample quality that can inform data acquisition strategies.

Figure 1: We propose a method of data valuation that compares each source sample to the target samples by computing the similarity of gradients during stochastic gradient descent.

Methodology

DVGS Algorithm

The DVGS algorithm is designed to compute data values by comparing gradient similarities between source and target datasets during stochastic gradient descent (SGD). It operates under the hypothesis that samples contributing to a loss landscape similar to the target are more valuable. This method requires a differentiable predictive model optimized using SGD and performs gradient comparisons at select parameter points, focusing on crucial regions in weight-space.

Algorithm: The DVGS procedure samples mini-batches from the target dataset, computes the batch gradient, and evaluates its similarity to each source sample's gradient. Cosine similarity serves as the comparison metric, ignoring gradient magnitudes and focusing solely on directional alignment.
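The procedure above can be sketched for a linear model with squared loss. The model, loss, and hyperparameters here are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, guarded against zero-norm gradients."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def dvgs_values(X_src, y_src, X_tgt, y_tgt, steps=200, lr=0.01, batch=16, seed=0):
    """DVGS-style data values for a linear model with squared loss (a sketch,
    not the authors' implementation). While running SGD on the target data,
    accumulate the cosine similarity between each source sample's gradient
    and the target mini-batch gradient; return the per-sample averages."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X_src.shape[1])
    sims = np.zeros(len(X_src))
    for _ in range(steps):
        idx = rng.choice(len(X_tgt), size=min(batch, len(X_tgt)), replace=False)
        # gradient of 0.5 * mean squared error on the target mini-batch
        g_tgt = X_tgt[idx].T @ (X_tgt[idx] @ w - y_tgt[idx]) / len(idx)
        # per-source-sample gradients: (w . x_i - y_i) * x_i
        g_src = X_src * (X_src @ w - y_src)[:, None]
        sims += np.array([cosine(g, g_tgt) for g in g_src])
        w -= lr * g_tgt  # SGD step on the target loss
    return sims / steps
```

Here similarities are computed at every step for simplicity; as the runtime analysis below notes, they can instead be computed only at select points in weight-space to cut cost.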

Complexity and Scalability

DVGS is most efficient in settings where the source dataset is substantially larger than the target dataset. The algorithm scales linearly with the number of source samples and the number of iterations, and runs substantially faster than DVRL and Data Shapley. This efficiency makes the method viable for extensive datasets, where speed and scalability are crucial.

Results

Supervised and Unsupervised Learning

DVGS is assessed on multiple datasets (ADULT, BLOG, CIFAR10) under different corruption scenarios. It demonstrates robust performance in identifying corrupted labels and characterizing noise, outperforming baseline methods in most tasks. The method is particularly effective on high-dimensional image data when paired with pretrained models.

Figure 2: DVGS runtime on the ADULT dataset when computing gradient similarities every T steps.

Application to LINCS L1000

Applying DVGS to LINCS L1000 data revealed that the algorithm could effectively quantify sample quality, enabling informed filtering that enhances model performance. DVGS values were notably better at identifying low-quality data compared to APC values, highlighting their potential in guiding future data acquisition and curation.
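As a usage sketch, data values from either method can drive a simple filtering step before retraining; the keep fraction here is an arbitrary illustrative choice, not a recommendation from the paper:

```python
import numpy as np

def keep_top_fraction(values, keep_frac=0.8):
    """Indices of the highest-valued samples, for filtering a dataset before
    retraining (illustrative; the threshold is dataset-dependent)."""
    k = int(np.ceil(keep_frac * len(values)))
    order = np.argsort(values)[::-1]  # sample indices, descending by value
    return np.sort(order[:k])         # restore original dataset order
```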

Conclusion

The DVGS method presents a scalable, robust approach to data valuation, with applications spanning various domains and dataset sizes. Its ability to efficiently compute data values and improve predictive performance by filtering data underscores its practical significance. Future research should investigate extensions that further merge the interpretability advantages of Data Shapley with the scalability of DVGS, possibly through learned transformations between value metrics. Additionally, exploring class balance techniques and redundancy management could enhance DVGS's utility in diverse settings.
