Deep ReLU Networks Have Surprisingly Few Activation Patterns (1906.00904v2)

Published 3 Jun 2019 in stat.ML, cs.LG, math.ST, and stat.TH

Abstract: The success of deep networks has been attributed in part to their expressivity: per parameter, deep networks can approximate a richer class of functions than shallow networks. In ReLU networks, the number of activation patterns is one measure of expressivity; and the maximum number of patterns grows exponentially with the depth. However, recent work has shown that the practical expressivity of deep networks - the functions they can learn rather than express - is often far from the theoretical maximum. In this paper, we show that the average number of activation patterns for ReLU networks at initialization is bounded by the total number of neurons raised to the input dimension. We show empirically that this bound, which is independent of the depth, is tight both at initialization and during training, even on memorization tasks that should maximize the number of activation patterns. Our work suggests that realizing the full expressivity of deep networks may not be possible in practice, at least with current methods.

Citations (205)

Summary

  • The paper shows that deep ReLU networks realize far fewer activation patterns than the exponential-in-depth theoretical maximum.
  • It derives a depth-independent upper bound on the expected number of activation patterns that depends only on the total neuron count and the input dimension.
  • Experiments confirm that training does not significantly increase activation-pattern diversity, even on memorization tasks, underscoring inherent architectural constraints.

An Analysis of Activation Patterns in Deep ReLU Networks

The paper "Deep ReLU Networks Have Surprisingly Few Activation Patterns" by Boris Hanin and David Rolnick presents a rigorous examination of the practical expressivity of deep ReLU neural networks. While theoretical results suggest that the number of activation patterns in such networks can grow exponentially with depth, this work investigates the extent to which this potential is realized during network initialization and training.

The authors begin by observing that, although deep networks are celebrated for their expressivity, the number of activation patterns they realize in practice can be substantially smaller than the theoretical maximum. They introduce activation patterns, the configurations of neurons switching on or off for a given input, and connect these patterns to the expressivity of the network, that is, how rich a class of functions it can learn to approximate.
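To make the notion concrete, the following sketch computes the activation pattern of a small fully connected ReLU network for a single input, namely the binary vector recording which neurons have positive pre-activation. The network sizes, the He-style initialization, and the NumPy implementation are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def activation_pattern(weights, biases, x):
    """Return the on/off pattern of every ReLU neuron for input x.

    weights, biases: per-layer parameters of a fully connected ReLU net
    (hidden layers only; a linear output layer would not contribute).
    The pattern is the concatenation, over layers, of 1[pre-activation > 0].
    """
    pattern = []
    h = x
    for W, b in zip(weights, biases):
        z = W @ h + b            # pre-activations of this layer
        pattern.append(z > 0)    # which neurons fire
        h = np.maximum(z, 0)     # ReLU
    return np.concatenate(pattern)

# Illustrative network with 2-dimensional input and two hidden layers of 16
# neurons, at a random (He-style) initialization.
rng = np.random.default_rng(0)
sizes = [2, 16, 16]
weights = [rng.normal(0, np.sqrt(2 / m), (n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

print(activation_pattern(weights, biases, np.array([0.3, -0.7])).astype(int))
```

Each distinct pattern corresponds to a linear region of the piecewise-linear function the network computes, which is why counting patterns is a natural measure of expressivity.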

An important contribution of this work is an average-case upper bound on the number of activation patterns of a ReLU network at initialization. The authors show that the expected number of activation patterns is bounded by the total number of neurons raised to the power of the input dimension, independent of the network's depth. This implies a fundamental limitation on the typical expressivity of ReLU networks, echoing earlier empirical observations but now grounded in a theoretical derivation.
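Schematically, and in the notation of the abstract, the result can be written as the bound below; the constant and the precise conditions on the weight and bias distributions are spelled out in the paper and omitted here.

```latex
% T    = total number of neurons in the network
% n_in = input dimension
% C    = a constant depending on the input region and the initialization,
%        but not on the depth of the network
\mathbb{E}\left[\#\,\text{activation patterns}\right] \;\le\; C \cdot T^{\,n_{\mathrm{in}}}
```

The key point is that depth enters only through the total neuron count T, not as an exponent, in contrast to worst-case constructions where the pattern count grows exponentially with depth.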

The authors provide substantial experimental evidence for these theoretical findings. Their experiments show that even as networks are trained, including on memorization-heavy tasks that should push the pattern count toward its maximum, the number of realized activation patterns does not approach the theoretical limit. Instead, the bound appears tight across a range of settings, pointing to a structural constraint on the behavior of these models.
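A rough way to probe this empirically is to count the distinct activation patterns a network produces on a large sample of inputs, before and after training. The sketch below is a minimal illustration under assumed sampling choices, reusing the activation_pattern helper and random network from the earlier example; it is not the authors' experimental code.

```python
def count_patterns(weights, biases, inputs):
    """Count distinct activation patterns realized over a batch of inputs."""
    seen = set()
    for x in inputs:
        # Hash the boolean pattern so distinct patterns can be counted.
        seen.add(activation_pattern(weights, biases, x).tobytes())
    return len(seen)

# Sample inputs from a bounded region of the plane and count the patterns
# realized by the randomly initialized network defined above.
samples = rng.uniform(-1, 1, size=(100_000, 2))
print(count_patterns(weights, biases, samples))
```

Running such a count before training, after training, and for networks of different depths with the same total neuron budget is one simple way to check whether the realized pattern count tracks the depth-independent bound rather than the exponential worst case.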

Several implications follow from this research. The disconnect between theoretical potential and practical utilization of activation patterns suggests that factors other than sheer depth are central to deep networks' success. Depth may matter more for enabling effective optimization than for expanding expressive capacity. The results also prompt a re-evaluation of whether architectures can be designed to exploit more of their theoretical expressivity, and whether design heuristics should rest on considerations beyond depth alone.

Future work on this discrepancy could explore alternative activation functions or architectures that are less constrained by the bounds outlined here. The insights on initialization and parameter distributions also provide groundwork for methods that adjust these elements during training to access more of a network's expressive capacity.

The robust nature of the theoretical framework presented and the empirical rigor underpinning the conclusions provide a sound basis for further exploration and optimization of neural network architectures. The paper's contributions afford a valuable perspective on understanding the real-world function learning landscape of deep ReLU networks.
