- The paper introduces Sobolev Training, a method that incorporates derivative losses into the optimization process to improve function approximation.
- It extends universal approximation theory to Sobolev spaces by proving that ReLU networks can effectively model both functions and their derivatives.
- Empirical results demonstrate reduced errors in regression, policy distillation, and synthetic gradients, indicating enhanced generalization and efficiency.
Sobolev Training for Neural Networks: A Summary
The paper "Sobolev Training for Neural Networks" introduces an advanced technique that integrates derivative information into the training process of neural networks, referred to as Sobolev Training. This approach seeks to refine the classical neural network training paradigm, which predominantly focuses on approximating a function's output from input-output pairs. The authors posit that when derivative information is available, it can be utilized to enhance the learning efficiency and generalization of neural networks.
Theoretical Underpinnings
The theoretical motivation behind Sobolev Training is grounded in universal approximation theorems. Traditionally, neural networks are viewed as universal approximators of functions in spaces such as L2, which measure error on function values alone. This work extends that perspective to Sobolev spaces, which measure error on both function values and derivatives. The paper proves that networks with ReLU (Rectified Linear Unit) activations are universal approximators in Sobolev spaces, even though ReLU is unbounded and not continuously differentiable.
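For orientation (this is the standard textbook definition rather than the paper's notation): a Sobolev space measures a function by a norm that accounts for its values and its derivatives up to some order K, so convergence in this norm forces a network's derivatives to converge along with its outputs:

$$
\| f \|_{W^{K,p}(\Omega)} \;=\; \Bigg( \sum_{|\alpha| \le K} \int_{\Omega} \big| D^{\alpha} f(x) \big|^{p} \, dx \Bigg)^{1/p}
$$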
Formulation and Methodology
Sobolev Training augments the usual empirical loss minimization with losses on the derivatives of the function being approximated. The modified objective adds terms that penalize mismatch between the network's derivatives with respect to its inputs and the corresponding derivatives of the target function, up to a specified order. In practice, the target derivatives must be supplied (or computed from a differentiable teacher), while the model's derivatives are obtained by backpropagation through its inputs.
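To make the objective concrete, the sketch below implements a first-order version of this idea in PyTorch (the paper itself is framework-agnostic). The model is assumed to produce one scalar per example; the function name, `target_dydx`, and the weighting `alpha` are illustrative choices, not the paper's notation.

```python
import torch
import torch.nn.functional as F

def sobolev_loss(model, x, target_y, target_dydx, alpha=1.0):
    """First-order Sobolev loss: match target values and target input-gradients.

    Assumes `model` has a scalar output per example and that the target's
    gradients w.r.t. the inputs are available in `target_dydx` (same shape
    as `x`). Names and the weighting `alpha` are illustrative only.
    """
    x = x.requires_grad_(True)
    y = model(x).squeeze(-1)          # predicted values, shape [batch]
    # dy/dx via backpropagation; create_graph=True keeps this derivative
    # differentiable so the derivative-mismatch term can itself be trained.
    dydx = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    value_loss = F.mse_loss(y, target_y)
    deriv_loss = F.mse_loss(dydx, target_dydx)
    return value_loss + alpha * deriv_loss
```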
The authors also address the computational cost of matching high-dimensional Jacobians or Hessians by proposing a stochastic approximation: rather than comparing full derivative tensors, both the model's and the target's derivatives are projected onto random vectors, so each training step requires only a small number of additional backward passes.
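A minimal sketch of this stochastic variant, here phrased as a distillation loss between a student and a differentiable teacher, is shown below. The shared random direction `v` reduces the Jacobian comparison to one vector-Jacobian product per network; `student`, `teacher`, and `alpha` are placeholder names and the output shape is assumed to be [batch, output_dim].

```python
import torch
import torch.nn.functional as F

def stochastic_sobolev_loss(student, teacher, x, alpha=1.0):
    """Sobolev-style distillation loss with stochastically projected Jacobians.

    Rather than forming full Jacobians, both networks' Jacobians are
    projected onto a shared random output-space direction `v`, which needs
    only one vector-Jacobian product (backward pass) per network.
    """
    x = x.requires_grad_(True)
    y_s = student(x)
    y_t = teacher(x)

    # Shared random projection direction in output space, one per example.
    v = torch.randn_like(y_s)
    v = v / v.norm(dim=-1, keepdim=True)

    # v^T (dy/dx) for each network; the student's projection stays in the
    # graph (create_graph=True) so it can be optimized.
    proj_s = torch.autograd.grad(y_s, x, grad_outputs=v, create_graph=True)[0]
    proj_t = torch.autograd.grad(y_t, x, grad_outputs=v)[0]

    value_loss = F.mse_loss(y_s, y_t.detach())
    deriv_loss = F.mse_loss(proj_s, proj_t.detach())
    return value_loss + alpha * deriv_loss
```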
Empirical Evidence and Applications
The paper provides empirical evidence across several domains:
- Regression on Classical Datasets: Sobolev Training significantly reduces approximation error and improves generalization, especially in low-data regimes. Notably, it achieves lower test error on benchmark optimization functions than standard value-only training.
- Policy Distillation: The paper applies Sobolev Training to policy distillation in reinforcement learning, where matching derivatives improves the fidelity of the distilled policy to the original. Sobolev-trained student networks reproduced both the teacher's predicted action probabilities and the gradients of those predictions more closely than value-only distillation.
- Synthetic Gradients: The technique is further applied to synthetic gradient modules, which decouple weight updates in different sections of a neural network. The paper reports improved accuracy on large-scale datasets such as ImageNet when these modules are trained with a Sobolev loss, reinforcing the practical utility of the method.
Implications and Future Directions
The findings indicate that Sobolev Training encodes richer information about the target function within the network, and they suggest a path to better learning efficiency and lower sample complexity by exploiting available gradient information. The paper also hints at possible extensions, such as incorporating second-order derivatives to capture curvature or uncertainty, which could broaden Sobolev Training's applicability to areas such as meta-learning and generative modeling.
Future research could explore optimal architectural modifications that complement Sobolev Training, efficient stochastic approximations for very deep networks, and practical implementations in black-box scenarios where derivatives must be estimated.
In conclusion, the paper provides robust theoretical and empirical support for Sobolev Training, offering neural network practitioners a potent tool for leveraging gradient information in predictive modeling tasks. Such an approach is poised to enhance both the predictive power and interpretability of machine learning models, making it an attractive avenue for future research and application in AI.