- The paper reveals that importance weighting significantly influences early training stages but its effect diminishes as over-parameterized models separate the training data.
- The paper demonstrates that regularization methods such as L2 and batch normalization can partially reintroduce weighting effects, though inconsistently across experiments.
- The paper’s findings challenge the routine use of importance weighting for tasks like class imbalance and domain adaptation, prompting reconsideration of its application.
Understanding the Impact of Importance Weighting in Deep Learning
The paper "What is the Effect of Importance Weighting in Deep Learning?" by Jonathon Byrd and Zachary C. Lipton addresses a significant yet underexplored question in machine learning: how importance weighting affects deep neural networks. Importance weighting is an established technique used across machine learning, including causal inference, domain adaptation, and reinforcement learning. However, its impact on the over-parameterized neural networks that have become predominant in recent years is not well understood. This paper provides crucial insights by studying the interaction between importance weighting and deep learning models.
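As background, importance weighting simply rescales each example's contribution to the training loss. A minimal sketch in NumPy (the function name is illustrative, not from the paper):

```python
import numpy as np

def weighted_log_loss(p, y, w):
    """Binary cross-entropy where importance weight w[i] rescales
    example i's contribution (hypothetical helper, for illustration)."""
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(np.sum(w * ce) / np.sum(w))

p = np.array([0.9, 0.2, 0.6])   # predicted P(y = 1)
y = np.array([1.0, 0.0, 1.0])   # true labels
# Uniform weights recover the plain mean loss; doubling one example's
# weight has the same effect on the loss as duplicating that example.
uniform = weighted_log_loss(p, y, np.ones(3))
```

In expectation, weighting is equivalent to resampling the data, which is why it is the standard correction for class imbalance and covariate shift.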
Key Findings
The authors investigate the effects of importance weights across multiple deep learning architectures, tasks, and datasets. Their surprising primary conclusion is that while importance weighting noticeably affects models early in training, its influence diminishes markedly as training continues. This holds as long as the network can separate the training data, a common capability given the over-parameterization typical of modern deep networks.
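This phenomenon can be reproduced in miniature. The sketch below (a toy construction, not the paper's code) runs gradient descent on an importance-weighted logistic loss over two separable 1-D points: the boundary starts strongly displaced toward the down-weighted class, then drifts back toward the unweighted solution as the margin grows:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def boundary_after(pos_weight, steps, lr=0.1):
    """Gradient descent on an importance-weighted logistic loss for two
    separable 1-D points: x=+1 (label +1, weight pos_weight) and
    x=-1 (label -1, weight 1). Returns the decision boundary -b/w."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        g_pos = sigmoid(-(w + b))   # -d softplus(-m)/dm at margin m = w + b
        g_neg = sigmoid(-(w - b))   # margin of the negative point is w - b
        w += lr * (pos_weight * g_pos + g_neg)  # both points grow the margin
        b += lr * (pos_weight * g_pos - g_neg)  # up-weighted point pulls b up
    return -b / w

early = boundary_after(10.0, steps=5)      # boundary pushed well toward x = -1
late = boundary_after(10.0, steps=50_000)  # shift largely decays away
```

The intuition matches the paper's: once the data are separated, the per-example gradients shrink exponentially in the margin, so a constant multiplicative weight stops mattering and the boundary approaches the one an unweighted model would find.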
Moreover, while techniques such as L2 regularization and batch normalization can restore some effects of importance weighting, they do so inconsistently and not through a straightforward mechanism. Batch normalization, for instance, interacts with importance weighting in the authors' experiments, but the mechanism behind that interaction is not clearly identified, underscoring the complexity of the dynamics in neural network training.
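One way to see how L2 regularization can reintroduce weighting effects: a weight penalty keeps the margin from growing without bound, so the weighted solution never converges back to the unweighted one. A toy two-point sketch (our construction, assuming a penalty on the slope only; not the paper's experiment):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def boundary(pos_weight, l2, steps=50_000, lr=0.1):
    """Two-point toy problem -- x=+1 (label +1, weight pos_weight) and
    x=-1 (label -1, weight 1) -- with an L2 penalty l2 * w**2 on the
    slope. Returns the decision boundary -b/w after training."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        g_pos = sigmoid(-(w + b))
        g_neg = sigmoid(-(w - b))
        w += lr * (pos_weight * g_pos + g_neg - 2.0 * l2 * w)  # decayed slope
        b += lr * (pos_weight * g_pos - g_neg)
    return -b / w

shifted = boundary(10.0, l2=0.01)   # penalty caps w, so the shift persists
vanished = boundary(10.0, l2=0.0)   # without L2 the shift decays toward 0
```

With the penalty, the slope settles at a finite value while the intercept still reflects the importance weights, so the asymmetry survives arbitrarily long training.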
Experimental results support these findings across a variety of networks and datasets. Key experiments involve classification tasks on both standard datasets, like CIFAR-10, and synthetic datasets where clear visualization of decision boundaries is possible. Notably, the experimental results from CIFAR-10 and the Microsoft Research Paraphrase Corpus indicate a significant reduction in importance weight effects as training progresses, illustrating the robustness of these observations.
Implications and Future Speculations
These findings have practical implications for the application of importance weighting in deep learning. They challenge the traditionally accepted practice of using importance weighting indiscriminately in modern deep networks. The results suggest reconsidering its use, particularly in settings where it might be assumed to play a critical role, such as correcting for class imbalance or domain shift.
Theoretically, this work aligns with contemporary understanding of the implicit bias of gradient descent in over-parameterized models. By demonstrating the diminishing role of importance weighting, it invites further exploration of how optimization dynamics produce this behavior. Such an understanding could lead to novel approaches for managing the impact of weighting and other training hyperparameters more effectively.
Moving forward, further research could probe the interactions of regularization methods with importance weighting; clarifying the mechanisms by which batch normalization and dropout affect these dynamics could prove invaluable. Additionally, the study's results advocate for developing principled approaches to hyperparameter tuning when importance weighting is used, potentially integrating these insights into automated hyperparameter optimization frameworks.
In conclusion, Byrd and Lipton's work highlights the need for a reevaluation of importance weighting's role in deep learning. By providing empirical evidence of its limited long-term effects, the paper opens multiple avenues for future research into training dynamics and optimization strategies, advancing our understanding of how to utilize deep networks’ capacity most effectively.