- The paper presents a sparse function-space representation that converts pre-trained neural networks to Gaussian processes using dual parameterization.
- It leverages sparse approximation with inducing points to reduce computational complexity and quickly integrate new data without full retraining.
- Experiments demonstrate improved uncertainty quantification and memory retention in sequential learning tasks such as Split-MNIST and Permuted-MNIST.
Sparse Function-space Representation of Neural Networks for Sequential Learning
Introduction
Recent research introduces a Sparse Function-space Representation (sfr) for converting trained Neural Networks (NNs) into Gaussian Processes (GPs) using a dual parameterization technique. This method addresses the scalability issues associated with Gaussian Processes when applied to large datasets and complex inputs such as images, leveraging the strengths of both NNs and GPs. This paper discusses the methodology behind sfr, its practical implications, and showcases its effectiveness through a series of experiments.
Methodology
The sfr approach linearizes a trained NN around its maximum a posteriori (MAP) weights $\mathbf{w}_*$, i.e. $f_{\text{lin}}(\mathbf{x}) = f(\mathbf{x}; \mathbf{w}_*) + \mathcal{J}_{\mathbf{w}_*}(\mathbf{x})\,(\mathbf{w} - \mathbf{w}_*)$, moving from a weight-space to a function-space representation via a dual parameterization. Predictions are then formulated with a Bayesian Generalized Linear Model (GLM) whose kernel is the (empirical) Neural Tangent Kernel (NTK) built from the network Jacobians. The key innovation is a set of sparse dual parameters that enable efficient scaling and the assimilation of new data without retraining from scratch.
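To make the linearization step concrete, here is a minimal sketch of computing the empirical NTK from network Jacobians at the MAP weights. It assumes a small scalar-output PyTorch model; the helper names `flat_jacobian` and `ntk_kernel` are illustrative, not the paper's API.

```python
import torch
from torch.func import functional_call, jacrev

def flat_jacobian(model, params, xi):
    """Jacobian of the network output w.r.t. all (flattened) parameters at input xi."""
    def f(p, x):
        return functional_call(model, p, (x.unsqueeze(0),)).squeeze(0)
    jac = jacrev(f)(params, xi)  # dict: one Jacobian block per parameter tensor
    return torch.cat([j.reshape(j.shape[0], -1) for j in jac.values()], dim=-1)

def ntk_kernel(model, params, X1, X2, prior_precision=1.0):
    """Empirical NTK k(x, x') = J(x) J(x')^T / delta around the MAP weights (scalar output)."""
    J1 = torch.stack([flat_jacobian(model, params, x).squeeze(0) for x in X1])
    J2 = torch.stack([flat_jacobian(model, params, x).squeeze(0) for x in X2])
    return J1 @ J2.T / prior_precision

# Usage on a toy network standing in for the trained MAP model:
model = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
params = {k: v.detach() for k, v in model.named_parameters()}
X = torch.randn(5, 2)
K = ntk_kernel(model, params, X, X)  # (5, 5) kernel matrix
```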
Dual Parameterization
The core concept of sfr is the use of dual parameters, $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, derived from the MAP objective in function space. These parameters capture the first and second derivatives of the log-likelihood with respect to the function values, evaluated at the MAP predictions, and they let sfr form the GP posterior without subset-of-data approximations or additional optimization.
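As a concrete illustration of what these dual parameters look like, the snippet below evaluates the first and (negative) second derivative of the log-likelihood at the MAP function values for two common likelihoods; the function and variable names are illustrative and not tied to the paper's code.

```python
import torch

def dual_params_gaussian(f_map, y, noise_var=1.0):
    """Duals for a Gaussian likelihood N(y | f, noise_var), evaluated at f = f_map."""
    alpha = (y - f_map) / noise_var                   # first derivative of log-likelihood
    beta = torch.full_like(f_map, 1.0 / noise_var)    # negative second derivative
    return alpha, beta

def dual_params_bernoulli(f_map, y):
    """Duals for a Bernoulli likelihood with logits f, evaluated at f = f_map."""
    p = torch.sigmoid(f_map)
    alpha = y - p            # first derivative of log-likelihood w.r.t. f
    beta = p * (1.0 - p)     # negative second derivative (always positive)
    return alpha, beta
```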
Sparse Approximation
By projecting these dual parameters onto a set of $M$ inducing points, sfr summarizes the effect of all $N$ training points (with $M \ll N$) in a compact sparse representation, greatly reducing computational cost and allowing the method to scale to large data sets.
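A minimal sketch of this projection is given below, assuming the commonly used projected form $\boldsymbol{\alpha}_{\mathbf{u}} = \mathbf{K}_{\mathbf{z}\mathbf{x}}\boldsymbol{\alpha}$ and $\mathbf{B}_{\mathbf{u}} = \mathbf{K}_{\mathbf{z}\mathbf{x}}\,\mathrm{diag}(\boldsymbol{\beta})\,\mathbf{K}_{\mathbf{x}\mathbf{z}}$ together with the corresponding sparse predictive; the paper's exact expressions may differ in scaling or detail.

```python
import torch

def sparse_duals(K_zx, alpha, beta):
    """Summarize N per-datapoint duals into M inducing-point duals."""
    alpha_u = K_zx @ alpha                     # (M,)
    B_u = K_zx @ torch.diag(beta) @ K_zx.T     # (M, M)
    return alpha_u, B_u

def sparse_predict(K_sz, K_ss_diag, K_zz, alpha_u, B_u, jitter=1e-6):
    """Predictive mean and marginal variance at test points from the sparse duals."""
    eye = jitter * torch.eye(K_zz.shape[0])
    K_zz_inv = torch.linalg.inv(K_zz + eye)
    mean = K_sz @ K_zz_inv @ alpha_u
    inner = K_zz_inv - torch.linalg.inv(K_zz + B_u + eye)
    var = K_ss_diag - torch.einsum("sm,mn,sn->s", K_sz, inner, K_sz)
    return mean, var
```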
Practical Implications
Continual Learning
In scenarios where access to prior data is restricted, sfr provides a means for retaining knowledge from previous tasks through function-space regularization. This is particularly useful in continual learning applications, where sfr's capability to maintain a condensed representation of learned information can mitigate catastrophic forgetting.
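One illustrative way such a function-space regularizer can be implemented is sketched below: the current network's outputs at stored inducing inputs are pulled toward the old predictive means, weighted by the old predictive variances. This is only a sketch of the general idea under those assumptions; the paper's exact weighting and structure may differ.

```python
import torch

def function_space_reg(model, Z, old_mean, old_var, scale=1.0):
    """Penalize drift of f(z_j) from the old posterior mean, weighted by 1 / old variance."""
    f_new = model(Z).squeeze(-1)       # current function values at the M inducing inputs
    resid = f_new - old_mean           # deviation from the stored (old-task) predictions
    return 0.5 * scale * torch.sum(resid ** 2 / old_var)

# New-task training step (schematic):
#   loss = task_nll(model(x_new), y_new) + function_space_reg(model, Z, m_old, v_old)
```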
Incorporating New Data
sfr enables the integration of new data into the existing model framework via dual updates. This feature not only saves computational resources by avoiding retraining from scratch but also ensures swift adaptation to new information, making sfr particularly suited for dynamic and sequential learning tasks.
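A minimal sketch of such a dual update, assuming the same sparse duals as in the projection sketch above: an additive update folds the new batch's per-datapoint duals into the inducing-point summary, so no gradient-based retraining is needed. Function names are illustrative.

```python
import torch

def dual_update(alpha_u, B_u, K_z_new, alpha_new, beta_new):
    """Fold a new batch's dual parameters into the inducing-point summary."""
    alpha_u = alpha_u + K_z_new @ alpha_new                   # accumulate first-order information
    B_u = B_u + K_z_new @ torch.diag(beta_new) @ K_z_new.T    # accumulate second-order information
    return alpha_u, B_u
```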
Experiments and Results
Supervised Learning
Experiments demonstrate sfr's effectiveness in supervised learning tasks, including regression and classification on UCI datasets and image datasets such as Fashion-MNIST and CIFAR-10. sfr outperforms both the GP subset approach and the Laplace approximation in uncertainty quantification, highlighting its superior scalability and efficiency.
Sequential Learning
sfr proves advantageous in sequential learning contexts, particularly in continual learning benchmarks such as Split-MNIST and Permuted-MNIST. The method's function-space regularization notably enhances knowledge retention across tasks without requiring direct access to old data. Additionally, sfr's ability to quickly incorporate new data through dual updates showcases its potential for applications that require rapid model updates in response to new information.
Conclusion
The introduction of Sparse Function-space Representation (sfr) offers a promising avenue for merging the strengths of NNs and GPs, addressing key challenges in scalability, uncertainty quantification, and sequential learning. sfr's dual parameterization and sparse approximation techniques provide a robust framework for efficient learning in both static and dynamic environments, making it a valuable tool for a wide range of machine learning applications.