Data Valuation with Gradient Similarity

Published 13 May 2024 in cs.LG, q-bio.GN, q-bio.QM, and stat.ML (arXiv:2405.08217v1)

Abstract: High-quality data is crucial for accurate machine learning and actionable analytics; however, mislabeled or noisy data is a common problem in many domains. Distinguishing low- from high-quality data can be challenging, often requiring expert knowledge and considerable manual intervention. Data valuation algorithms are a class of methods that seek to quantify the value of each sample in a dataset based on its contribution or importance to a given predictive task. These data values have shown an impressive ability to identify mislabeled observations, and filtering low-value data can boost machine learning performance. In this work, we present a simple alternative to existing methods, termed Data Valuation with Gradient Similarity (DVGS). This approach can be easily applied to any gradient-descent learning algorithm, scales well to large datasets, and performs comparably to or better than baseline valuation methods on tasks such as corrupted-label discovery and noise quantification. We evaluate DVGS on tabular, image, and RNA expression datasets to show its effectiveness across domains. Our approach can rapidly and accurately identify low-quality data, reducing the need for expert knowledge and manual intervention in data-cleaning tasks.


Summary

  • The paper introduces DVGS, a gradient similarity-based method that computes data value by comparing gradients between source and target datasets.
  • It demonstrates improved scalability and effectiveness in detecting mislabeled and noisy data compared to methods like Data Shapley and DVRL.
  • Applied to datasets including LINCS L1000, DVGS enhances predictive performance by efficiently filtering low-quality samples.

Data Valuation with Gradient Similarity

Abstract

The paper "Data Valuation with Gradient Similarity" introduces a novel algorithm termed Data Valuation with Gradient Similarity (DVGS), aimed at quantifying the value of data samples based on their usefulness to predictive tasks. This method is designed to address data quality issues prevalent across various domains, offering an automated approach that scales effectively for large datasets, particularly in high-throughput settings. The DVGS algorithm targets tasks including corrupted label discovery and noise quantification, demonstrating comparable or superior performance to established methods such as Data Shapley and Data Valuation using Reinforcement Learning (DVRL).

Background

Importance of Data Valuation

Data valuation is pivotal in machine learning, where data quality significantly impacts model performance. Techniques that evaluate the informativeness of individual samples can identify mislabeled or noisy data, enhancing the predictive capability of models. Existing methods such as Leave-One-Out (LOO), Data Shapley, and Data Valuation using Reinforcement Learning (DVRL) each have strengths, but they often face limitations in scalability and in robustness to hyperparameter choices: Data Shapley provides an equitable valuation framework but struggles to scale to large datasets, while DVRL offers competitive performance but requires extensive hyperparameter tuning.
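For concreteness, the LOO baseline can be sketched in a few lines. This is a generic illustration with a logistic-regression proxy model and accuracy as the performance metric, not the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def loo_values(X_train, y_train, X_val, y_val):
    """Leave-One-Out data values: the drop in validation performance
    when each training sample is removed in turn."""
    def fit_score(X, y):
        model = LogisticRegression(max_iter=1000).fit(X, y)
        return accuracy_score(y_val, model.predict(X_val))

    base = fit_score(X_train, y_train)
    n = len(X_train)
    values = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        # value of sample i = full-data score minus score without sample i
        values[i] = base - fit_score(X_train[mask], y_train[mask])
    return values
```

Each of the n refits costs a full training run, which is why LOO (and the combinatorially harder Data Shapley) scales poorly compared to a method that needs only a single training pass.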

Application to LINCS Dataset

The DVGS method is evaluated on various datasets, including the LINCS L1000, a large-scale transcriptomic dataset notable for its data quality challenges. Enhancing data quality in such datasets can significantly boost their utility, particularly in fields like drug development. The DVGS approach offers an alternative metric to previously established measures, such as the Average Pearson Correlation (APC), providing insights into sample quality that can inform data acquisition strategies.

Figure 1: We propose a method of data valuation that compares each source sample to the target samples by computing the similarity of gradients during stochastic gradient descent.

Methodology

DVGS Algorithm

The DVGS algorithm is designed to compute data values by comparing gradient similarities between source and target datasets during stochastic gradient descent (SGD). It operates under the hypothesis that samples contributing to a loss landscape similar to the target are more valuable. This method requires a differentiable predictive model optimized using SGD and performs gradient comparisons at select parameter points, focusing on crucial regions in weight-space.

Algorithm: The DVGS procedure samples mini-batches from the target dataset, computes the batch gradient, and evaluates its similarity to each source sample's gradient. Cosine similarity serves as the comparison metric, ignoring gradient magnitudes and focusing solely on directional alignment.
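The procedure above can be sketched for a linear model with squared loss. The model, loss, and hyperparameters here are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, guarded against zero-norm gradients."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def dvgs_values(X_src, y_src, X_tgt, y_tgt, steps=200, lr=0.01, batch=16, seed=0):
    """DVGS-style data values for a linear model with squared loss (a sketch,
    not the authors' implementation). While running SGD on the target data,
    accumulate the cosine similarity between each source sample's gradient
    and the target mini-batch gradient; return the per-sample averages."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X_src.shape[1])
    sims = np.zeros(len(X_src))
    for _ in range(steps):
        idx = rng.choice(len(X_tgt), size=min(batch, len(X_tgt)), replace=False)
        # gradient of 0.5 * mean squared error on the target mini-batch
        g_tgt = X_tgt[idx].T @ (X_tgt[idx] @ w - y_tgt[idx]) / len(idx)
        # per-source-sample gradients: (w . x_i - y_i) * x_i
        g_src = X_src * (X_src @ w - y_src)[:, None]
        sims += np.array([cosine(g, g_tgt) for g in g_src])
        w -= lr * g_tgt  # SGD step on the target loss
    return sims / steps
```

Here similarities are computed at every step for simplicity; as the runtime analysis below notes, they can instead be computed only at select points in weight-space to cut cost.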

Complexity and Scalability

DVGS is most efficient in settings where the source dataset is substantially larger than the target dataset. The algorithm scales linearly with the number of source samples and the number of iterations, and runs substantially faster than DVRL and Data Shapley. This efficiency makes the method viable for extensive datasets, where speed and scalability are crucial.

Results

Supervised and Unsupervised Learning

DVGS is assessed on multiple datasets (ADULT, BLOG, CIFAR10) under different corruption scenarios. It demonstrates robust performance in identifying corrupted labels and characterizing noise, outperforming baseline methods in most tasks. The method is particularly effective on high-dimensional image data when paired with pretrained models.

Figure 2: DVGS runtime on the ADULT dataset when computing gradient similarities every T steps.

Application to LINCS L1000

Applying DVGS to LINCS L1000 data revealed that the algorithm could effectively quantify sample quality, enabling informed filtering that enhances model performance. DVGS values were notably better at identifying low-quality data compared to APC values, highlighting their potential in guiding future data acquisition and curation.
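As a usage sketch, data values from either method can drive a simple filtering step before retraining; the keep fraction here is an arbitrary illustrative choice, not a recommendation from the paper:

```python
import numpy as np

def keep_top_fraction(values, keep_frac=0.8):
    """Indices of the highest-valued samples, for filtering a dataset before
    retraining (illustrative; the threshold is dataset-dependent)."""
    k = int(np.ceil(keep_frac * len(values)))
    order = np.argsort(values)[::-1]  # sample indices, descending by value
    return np.sort(order[:k])         # restore original dataset order
```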

Conclusion

The DVGS method presents a scalable, robust approach to data valuation, with applications spanning various domains and dataset sizes. Its ability to efficiently compute data values and improve predictive performance by filtering data underscores its practical significance. Future research should investigate extensions that further merge the interpretability advantages of Data Shapley with the scalability of DVGS, possibly through learned transformations between value metrics. Additionally, exploring class balance techniques and redundancy management could enhance DVGS's utility in diverse settings.
