- The paper introduces datamodels—a framework that leverages linear regression on training subsets to predict a model's outputs and reveal the influence of specific training examples.
- It quantifies the importance of individual training examples by measuring shifts in prediction margins, enabling brittleness analysis and counterfactual testing.
- The approach is validated on image classification benchmarks like CIFAR-10, uncovering data leakage and redundancies for improved model interpretability.
Overview of "Datamodels: Predicting Predictions from Training Data"
The paper "Datamodels: Predicting Predictions from Training Data" introduces a framework, termed datamodeling, to analyze the prediction behavior of a machine learning model based on its training data. Specifically, the paper formulates a datamodel as a parameterized function designed to predict the outcome of training a model on various subsets of a dataset and evaluating it on a target example. This approach presents a novel method for understanding the interplay between data and algorithms in generating predictions.
Key Concepts
Datamodeling is founded on multiple components:
- Surrogate Function: A simple surrogate approximates the complex end-to-end map from a training subset to the trained model's output on a target example. In practice, the paper uses a linear datamodel in this role.
- Datamodel Training Set: The surrogate is fit on pairs of (training subset, observed model output), collected by repeatedly training the target model on random subsets of the original dataset.
- Linear Regression: The authors use linear models as the surrogate for predicting model outputs. Despite the nonlinearity of deep networks, linear datamodels turn out to predict outcomes remarkably well (a minimal sketch of the surrogate follows this list).
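To make the linear surrogate concrete, here is a minimal sketch (function and variable names are illustrative, not from the paper): the training subset is encoded as a 0/1 indicator vector over the full training set, and the datamodel's prediction is an affine function of that vector.

```python
import numpy as np

def linear_datamodel(subset_mask: np.ndarray, theta: np.ndarray, bias: float) -> float:
    """Predict a trained model's output (e.g., margin) on one fixed target
    example, given the training subset encoded by `subset_mask`.

    subset_mask: 0/1 vector of length |S| (1 = training example included).
    theta:       learned per-training-example weights for this target.
    bias:        learned intercept.
    """
    return float(theta @ subset_mask) + bias
```

One datamodel of this form is estimated per target example, so `theta` assigns a weight to every training point with respect to that specific target.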
Methodology
The authors instantiate datamodels for image classification, on datasets such as CIFAR-10 and FMoW. They estimate one datamodel per target example, covering both test and training examples, and measure the model's output as the classification margin rather than 0/1 correctness, since a continuous target is better suited to regression (the margin is sketched below).
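Concretely, the margin for a target example is the correct-class logit minus the largest incorrect-class logit; a positive margin means a correct prediction, and its magnitude reflects confidence. A minimal sketch:

```python
import numpy as np

def correct_class_margin(logits: np.ndarray, label: int) -> float:
    """Margin = correct-class logit minus the highest incorrect-class logit.
    Positive means the model predicts correctly; the magnitude reflects
    how confidently it does so."""
    incorrect_logits = np.delete(logits, label)
    return float(logits[label] - incorrect_logits.max())
```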
The following methodology is employed:
- Training Set Subsampling: Fixed-size subsets, each containing an α-fraction of the training set, are drawn at random; a model is trained from scratch on each subset, and its output on the target example is recorded to build the datamodel training set.
- Regression Analysis: Linear regression with ℓ1-regularization (LASSO) is fit to the collected (subset, margin) pairs to estimate the datamodel parameters; the sparsity-inducing penalty drives most weights to zero.
- Application and Testing: Estimated datamodels are evaluated on held-out subsets, both at the subsampling fraction used for estimation and at other values of α, and their predicted margins correlate strongly with the outputs of actually trained models (the estimation loop is sketched after this list).
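Putting the pieces together, the loop below is a minimal sketch of the estimation procedure, not the paper's code. `train_and_eval_margin` is a hypothetical helper that trains a model on a given subset and returns the target example's margin (in the paper, each call is a full end-to-end training run); the ℓ1-regularized fit uses scikit-learn's LASSO.

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_datamodel(n_train: int, alpha: float, m: int,
                       train_and_eval_margin, lam: float = 0.01, seed: int = 0):
    """Fit a linear datamodel for one target example.

    n_train: size of the full training set S.
    alpha:   subsampling fraction (each subset has alpha * n_train examples).
    m:       number of (subset, margin) pairs to collect.
    train_and_eval_margin(mask) -> float: hypothetical helper that trains a
        model on the subset encoded by `mask` and returns the margin on the
        target example.
    """
    rng = np.random.default_rng(seed)
    k = int(alpha * n_train)
    X = np.zeros((m, n_train), dtype=np.float32)  # subset indicator vectors
    y = np.zeros(m, dtype=np.float32)             # observed margins
    for i in range(m):
        idx = rng.choice(n_train, size=k, replace=False)
        X[i, idx] = 1.0
        y[i] = train_and_eval_margin(X[i])
    reg = Lasso(alpha=lam)  # l1-regularized least squares
    reg.fit(X, y)
    return reg.coef_, reg.intercept_
```

The dominant cost is the m training runs, not the regression; the paper amortizes this by reusing the same pool of trained models across all target examples.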
Applications and Findings
Datamodels provide insight into various facets of model behavior and learning:
- Brittleness Testing: Datamodels estimate how strongly a prediction depends on particular training examples, enabling a brittleness analysis via targeted data removal. The paper finds that many CIFAR-10 test predictions can be flipped by removing a small, well-chosen fraction of the training set.
- Counterfactual Predictions: Because the surrogate is linear, datamodels cheaply predict how outputs shift when subsets of the training data are removed, and these predictions track the results of actually retraining (see the sketch after this list).
- Training Example Importance: The weight on each training example measures its influence on the target prediction; the highest-weight examples are often semantically similar to the target and expose train-test leakage.
- Feature Embedding: Stacking datamodel weights across target examples yields a representation space for examples, supporting clustering, similarity analysis, and the discovery of data subpopulations.
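Because the surrogate is linear, the predicted effect of removing a set of training examples is just the negative sum of their datamodel weights. The sketch below uses that fact for a simple greedy brittleness probe; this is an illustration under that assumption, not the paper's exact search procedure, and all names are hypothetical.

```python
import numpy as np

def predicted_margin_after_removal(theta, bias, base_mask, removed):
    """Linear-datamodel prediction of the target's margin after removing
    the training examples in `removed` from the subset in `base_mask`."""
    mask = base_mask.copy()
    mask[list(removed)] = 0.0
    return float(theta @ mask) + bias

def smallest_flipping_set(theta, bias, base_mask, max_k=500):
    """Greedily drop the training examples with the largest positive
    weights (the strongest 'supporters' of the current prediction)
    until the predicted margin turns negative."""
    order = np.argsort(-theta)  # most supportive training examples first
    removed = []
    for idx in order[:max_k]:
        if base_mask[idx] == 0:
            continue
        removed.append(idx)
        if predicted_margin_after_removal(theta, bias, base_mask, removed) < 0:
            return removed  # predicted prediction flip achieved
    return None  # no flip predicted within the removal budget
```

The same weights, stacked across many target examples, give the embedding mentioned above: each training example's weight vector can be fed directly to PCA or a clustering algorithm.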
Implications and Future Directions
The framework offers both theoretical and practical benefits. Theoretical advantages lie in its ability to model the dependence of predictions on data, while practical applications include enhancing dataset preprocessing by identifying redundancies and resolving data leakage issues.
Future work could explore:
- Refined Sampling: Optimizing the subset sampling strategy to improve datamodel estimation.
- Broader Applicability: Extending the framework beyond image classification to other domains.
- Enhanced Estimation: Developing more sophisticated priors and regularization techniques for datamodels.
This paper introduces an innovative approach to understanding machine learning models through their training data. Its strong empirical results make datamodels a practical tool for interpretability, dataset debugging, and the analysis of neural network training.