- The paper shows that the test error of deep networks can be estimated from the disagreement rate between two independently trained models on the same unlabeled data.
- It uses stochastic variation in data ordering and initialization to validate the Generalization Disagreement Equality (GDE) across architectures and datasets.
- The study shows that ensemble calibration underpins this equality, suggesting that calibration metrics can guide unsupervised evaluation of model generalization and enable label-free performance prediction.
Assessing Generalization via Disagreement: An Overview
The paper "Assessing Generalization via Disagreement" explores a compelling empirical phenomenon in deep network training processes. The authors demonstrate that the test error of deep learning models can be approximated through the disagreement rate between two independently-trained networks on the same data set, devoid of labeling. This not only provides practical advancements in estimating model generalization but also uncovers significant theoretical linkages between generalization and calibration within deep learning models.
Methodological Insights
The authors build on a prior observation that the disagreement between two models trained on separate draws of the training data closely tracks their test error. They extend this finding to a more intriguing scenario: when two models share the same training data but differ only in random initialization or data ordering, their disagreement still aligns closely with the test error. Through experiments focused primarily on convolutional architectures such as ResNet18 across datasets including CIFAR-10, CIFAR-100, and SVHN, the researchers provide robust evidence for this phenomenon.
Their methodology isolates different sources of stochasticity, pairing runs that differ only in data ordering, only in initialization, or in both, and finds that these variance-inducing factors are roughly equally effective at preserving the Generalization Disagreement Equality (GDE). The experimental setup also verifies the equality across hyperparameters such as model depth and width, broadening the understanding of when and why the phenomenon occurs.
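The following minimal sketch shows one way such pairs could be constructed, assuming a PyTorch pipeline; `build_resnet18` and `train_dataset` are hypothetical stand-ins, and the training loop itself is omitted.

```python
# Hedged sketch: isolating initialization vs. data-ordering stochasticity.
import torch
from torch.utils.data import DataLoader

def make_run(init_seed, order_seed, train_dataset, build_resnet18):
    torch.manual_seed(init_seed)            # controls weight initialization
    model = build_resnet18()                # hypothetical model constructor
    gen = torch.Generator().manual_seed(order_seed)
    loader = DataLoader(train_dataset, batch_size=128, shuffle=True, generator=gen)
    return model, loader                    # training loop omitted

# Pair differing only in initialization: make_run(0, 0, ...) vs. make_run(1, 0, ...)
# Pair differing only in data ordering:  make_run(0, 0, ...) vs. make_run(0, 1, ...)
# Pair differing in both:                make_run(0, 0, ...) vs. make_run(1, 1, ...)
```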
Theoretical Contributions
Central to the theoretical exploration is the concept of calibration, which traditionally refers to a model's predictive confidence matching actual outcomes. The authors connect this probabilistic notion of calibration to generalization through the disagreement measure. Formally, they establish that if the ensemble of models induced by stochastic variation in initialization or data ordering is class-wise calibrated, then the expected disagreement rate equals the expected test error, which is exactly the GDE.
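In the paper's notation (lightly paraphrased here), the GDE states that the expected disagreement between two hypotheses drawn from the distribution $\mathcal{H}_{\mathcal{A}}$ induced by the stochastic training algorithm equals the expected test error:

$$
\mathbb{E}_{h, h' \sim \mathcal{H}_{\mathcal{A}}}\big[\mathrm{Dis}_{\mathcal{D}}(h, h')\big]
= \mathbb{E}_{h \sim \mathcal{H}_{\mathcal{A}}}\big[\mathrm{Err}_{\mathcal{D}}(h)\big],
$$

with $\mathrm{Dis}_{\mathcal{D}}(h, h') = \mathbb{E}_{x \sim \mathcal{D}}\big[\mathbf{1}[h(x) \neq h'(x)]\big]$ and $\mathrm{Err}_{\mathcal{D}}(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\mathbf{1}[h(x) \neq y]\big]$.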
Additionally, the paper introduces a calibration metric, the Class Aggregated Calibration Error (CACE), that quantifies deviations from this ideal calibration condition. Their findings show that low CACE values coincide with small gaps between disagreement and test error, confirming that calibration error bounds the approximation error of the unsupervised test-error estimate.
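As a rough sketch only (the paper's exact binning and normalization may differ), a binned class-aggregated calibration error could be computed as follows, where `probs` is a hypothetical (N, K) matrix of ensemble confidences (e.g., softmax outputs averaged over models) and `labels` holds the true labels:

```python
# Hedged sketch of a class-aggregated calibration error in the spirit of CACE.
import numpy as np

def class_aggregated_calibration_error(probs, labels, n_bins=15):
    n, k = probs.shape
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    cace = 0.0
    for c in range(k):
        conf_c = probs[:, c]                    # ensemble confidence for class c
        is_c = (labels == c).astype(float)      # indicator that the true label is c
        bin_ids = np.clip(np.digitize(conf_c, bin_edges[1:-1]), 0, n_bins - 1)
        for b in range(n_bins):
            mask = bin_ids == b
            if mask.any():
                gap = abs(is_c[mask].mean() - conf_c[mask].mean())
                cace += (mask.sum() / n) * gap  # weight each bin by its mass
    return cace
```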
Empirical Validation and Scope
The empirical validation also ventures into out-of-distribution settings on the PACS dataset, assessing whether GDE persists under distribution shift. This examination additionally considers pre-training as a factor in calibration and asks how far the equality transfers when the training and test distributions differ.
Moreover, despite its abstract formulation, the estimator is practical: it requires only two trained models, and in the settings studied the disagreement rate tracks the actual test error with small variance and high correlation, as measured by R² and Kendall's τ.
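For illustration only, and with purely synthetic numbers rather than the paper's results, the sketch below shows how such agreement between disagreement rates and test errors could be quantified with standard tools:

```python
# Synthetic illustration: measuring how well disagreement tracks test error across runs.
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
test_error = rng.uniform(0.05, 0.35, size=20)               # synthetic per-run test errors
disagreement = test_error + rng.normal(0.0, 0.01, size=20)  # synthetic disagreement rates

r2 = r2_score(test_error, disagreement)        # treat disagreement as the prediction
tau, _ = kendalltau(disagreement, test_error)  # rank correlation
print(f"R^2 = {r2:.3f}, Kendall's tau = {tau:.3f}")
```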
Implications and Future Directions
The research offers a promising avenue for unsupervised model evaluation, particularly where labels are scarce or difficult to obtain due to regulatory or privacy constraints. By bridging theoretical insight with empirical observation, the authors open the door to calibration-based analyses of stochastic training, especially for deep networks.
Looking forward, this work calls for a clearer characterization of when such calibration-driven behavior arises and which sources of stochasticity most strongly shape model behavior. Ultimately, these connections could reshape how model robustness and reliability are assessed, using the divergence between stochastically trained models to predict generalization with minimal supervision.
In summary, "Assessing Generalization via Disagreement" contributes substantially to our understanding of model evaluation, melding empirical phenomenon discovery with a robust theoretical framework, establishing a richer conceptual layering within machine learning theory.