Proper Dataset Valuation by Pointwise Mutual Information (2405.18253v3)
Abstract: Data plays a central role in advancements in modern artificial intelligence, with high-quality data emerging as a key driver of model performance. This has prompted the development of principled and effective data curation methods in recent years. However, existing methods largely rely on heuristics, and whether they are truly effective remains unclear. For instance, standard evaluation methods that assess a trained model's performance on specific benchmarks may incentivize assigning high scores to data that merely resembles the test set. This issue exemplifies Goodhart's law: when a measure becomes a target, it ceases to be a good measure. To address this issue, we propose an information-theoretic framework for evaluating data curation methods. We define dataset quality in terms of its informativeness about the true model parameters, formalized using the Blackwell ordering of informativeness. Under this ordering, Blackwell's theorem ensures that more informative data yields optimal models with lower expected loss on the true underlying distribution. To measure informativeness, we show that the Blackwell order can be determined by the Shannon mutual information between the curated data and the test data. To estimate this mutual information, we introduce a novel method that trains Bayesian models on embedded datasets and computes mutual information from the posteriors of model parameters. Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness, while traditional test score-based evaluation methods may favor data curation strategies that overfit to the test set but compromise the training data's informativeness.
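The core quantity described above, the Shannon mutual information between a curated dataset and the test data, can be estimated as an expected log-likelihood-ratio: I(D; T) = E[log p(T | D) - log p(T)], where p(T | D) marginalizes the model parameters over the posterior given D, and p(T) marginalizes over the prior. Below is a minimal sketch of that Monte Carlo estimator on a toy conjugate-Gaussian model (a stand-in assumption, not the paper's embedded-dataset Bayesian models); all function names and the generative setup are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model (an illustrative assumption, not the paper's setup):
# theta ~ N(0, 1); each data point x_i ~ N(theta, 1).

def log_marginal(x, theta_samples):
    """Monte Carlo estimate of log p(x) = log E_theta[p(x | theta)],
    with theta_samples drawn from the prior or a posterior."""
    ll = -0.5 * (x[None, :] - theta_samples[:, None]) ** 2 - 0.5 * np.log(2 * np.pi)
    ll = ll.sum(axis=1)                     # log p(x | theta_s) per sample s
    m = ll.max()                            # log-sum-exp for numerical stability
    return m + np.log(np.exp(ll - m).mean())

def posterior_samples(data, n=5000):
    """Conjugate Gaussian posterior: theta | data ~ N(mean, var)."""
    var = 1.0 / (1.0 + len(data))
    mean = var * data.sum()
    return rng.normal(mean, np.sqrt(var), size=n)

def mi_estimate(curated_sets, test_sets, prior_n=5000):
    """I(D; T) ~= mean over joint draws of log p(T | D) - log p(T)."""
    prior = rng.normal(0.0, 1.0, size=prior_n)
    vals = []
    for D, T in zip(curated_sets, test_sets):
        post = posterior_samples(D)
        vals.append(log_marginal(T, post) - log_marginal(T, prior))
    return float(np.mean(vals))

# Draw (curated, test) pairs sharing the same latent theta, so the
# curated data is genuinely informative about the test data.
pairs = []
for _ in range(200):
    theta = rng.normal()
    pairs.append((rng.normal(theta, 1, size=20), rng.normal(theta, 1, size=5)))

curated, test = zip(*pairs)
mi = mi_estimate(curated, test)   # positive: curated data informs test data
```

A curation strategy that destroys information about theta (e.g. replacing the curated set with noise drawn independently of theta) would drive this estimate toward zero, which is the behavior the paper's evaluation framework is designed to detect.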