Half-Space Feature Learning in Neural Networks (2404.04312v1)
Abstract: Two extreme viewpoints currently exist for feature learning in neural networks: (i) neural networks simply implement a kernel method (a la NTK) and hence learn no features, and (ii) neural networks can represent (and hence learn) intricate hierarchical features suited to the data. We argue in this paper, based on a novel viewpoint, that neither interpretation is likely to be correct. Neural networks can be viewed as a mixture of experts, where each expert corresponds to a path through a sequence of hidden units, one per layer. We use this alternate interpretation to motivate a model, called the Deep Linearly Gated Network (DLGN), which sits midway between deep linear networks and ReLU networks. Unlike deep linear networks, the DLGN is capable of learning non-linear features (which are then linearly combined), and unlike ReLU networks these features are ultimately simple: each feature is effectively an indicator function of a region compactly described as an intersection of as many half-spaces of the input space as there are layers. This viewpoint allows for a comprehensive global visualization of features, unlike the local visualizations of neurons based on saliency/activation/gradient maps. We show that feature learning does occur in DLGNs, and that its mechanism is the learning of half-spaces in the input space that contain smooth regions of the target function. Due to the structure of DLGNs, the neurons in later layers are fundamentally the same as those in earlier layers (they all represent half-spaces); however, the dynamics of gradient descent impart a distinct clustering to the later-layer neurons. We hypothesize that ReLU networks exhibit similar feature-learning behaviour.
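The abstract does not spell out the architecture, so the following is only a minimal NumPy sketch of a DLGN-style forward pass, assuming the common two-path formulation suggested by the description: a purely linear gating path whose pre-activation signs gate a parallel linear value path. The names `init_dlgn` and `dlgn_forward` are illustrative, and the actual model likely uses a soft (e.g. sigmoid) gate during training rather than the hard indicator shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_dlgn(d_in, width, depth, d_out, scale=0.5):
    """Random weights for a width-`width`, depth-`depth` DLGN-style model (sketch)."""
    dims = [d_in] + [width] * depth
    gating = [scale * rng.standard_normal((dims[i], dims[i + 1])) for i in range(depth)]
    value  = [scale * rng.standard_normal((dims[i], dims[i + 1])) for i in range(depth)]
    head   = scale * rng.standard_normal((width, d_out))
    return gating, value, head

def dlgn_forward(x, gating, value, head):
    """x: (batch, d_in). The gating path is deep *linear*, so each gate is a
    half-space indicator in input space; the value path is linear and is
    multiplied by these gates layer by layer, then combined by a linear head."""
    g, v = x, x
    for Wg, Wv in zip(gating, value):
        g = g @ Wg                       # linear gating pre-activations
        gate = (g > 0).astype(x.dtype)   # hard half-space indicator per hidden unit
        v = (v @ Wv) * gate              # gate the otherwise linear value path
    return v @ head

x = rng.standard_normal((4, 8))
gating, value, head = init_dlgn(d_in=8, width=16, depth=3, d_out=1)
print(dlgn_forward(x, gating, value, head).shape)   # (4, 1)
```

Because the gating path contains no nonlinearity, each gating pre-activation is a linear function of the raw input, so each gate fires exactly on a half-space of the input space; the feature attached to a path through one unit per layer in a depth-L network is therefore the indicator of an intersection of L half-spaces, which is the "simple feature" the abstract refers to.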