- The paper introduces two models—EP and Hierarchical Bayesian—that enable automatic bias learning across multi-task environments with theoretical sample-complexity guarantees.
- The EP model provides convergence bounds and outlines the necessary number of tasks and examples to ensure reliable generalization of hypothesis spaces.
- The Bayesian model adapts hyper-priors using observed data, demonstrating practical improvements in applications like speech, face, and handwriting recognition.
Theoretical Models of Learning to Learn
The paper "Theoretical Models of Learning to Learn" by Jonathan Baxter examines bias learning, also referred to as "learning to learn," through two theoretical models, one grounded in Empirical Process (EP) theory and one in a Hierarchical Bayesian framework. The work analyzes how a machine learning system can autonomously learn its own bias across an environment of related tasks, offering theoretical insights and potential practical implications for domains rich in related tasks, such as speech recognition, face recognition, and handwriting recognition.
Conceptual Underpinning
The foundational concept highlighted in this paper is that machine learning inherently requires some form of bias to be effective. Traditionally, this bias has been injected manually by experts through feature selection, which can be restrictive and limited by human expertise. The essence of bias learning is to automate this process, enabling machines to derive an appropriate bias from data encountered across multiple related tasks. Both the EP and Bayesian models postulate that the learner is embedded within an environment of related tasks from which it samples.
Overview of Models
Empirical Process (EP) Model
In the EP model, the learner aims to find a hypothesis space, H, that generalizes well across tasks sampled from the environment. This generalization is quantified through bounds on the convergence of the empirical loss to the expected loss over the entire environment. The key contribution of the model is a theorem that guarantees how many tasks n and examples per task m are required to ensure that a hypothesis space contains good solutions to novel tasks with high probability. The results show that both n and m must be sufficiently large to achieve reliable generalization.
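To make the interplay between n and m concrete, here is a minimal numerical sketch of how sample-complexity bounds of this general shape behave. The function name, the constants, and the `log_capacity_*` stand-ins for the capacity terms are all hypothetical illustrations, not quantities taken from the paper:

```python
import math

def ep_sample_bounds(epsilon, delta, log_capacity_H, log_capacity_Hn, n_tasks):
    """Sketch of EP-style sample-complexity bounds (hypothetical constants).

    Returns (n_required, m_required): a number of tasks and of examples per
    task sufficient for the empirical loss to track the true loss to within
    epsilon, with probability at least 1 - delta. The log_capacity_* arguments
    stand in for the log covering numbers of the hypothesis-space family,
    which are problem-dependent quantities.
    """
    n_req = math.ceil((1.0 / epsilon**2) * (log_capacity_H + math.log(1.0 / delta)))
    # The required examples per task shrink as the number of tasks grows:
    m_req = math.ceil((1.0 / (n_tasks * epsilon**2))
                      * (log_capacity_Hn + math.log(1.0 / delta)))
    return n_req, m_req

# With 10 tasks instead of 1, far fewer examples are needed per task.
n1, m1 = ep_sample_bounds(0.1, 0.05, 10.0, 50.0, n_tasks=1)
n10, m10 = ep_sample_bounds(0.1, 0.05, 10.0, 50.0, n_tasks=10)
```

The `1 / n_tasks` factor in `m_req` is what captures the central qualitative claim: sharing an environment of tasks reduces the per-task data requirement.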
### Key Theorem in the EP Model
The theorem posits that if the number of tasks satisfies
$$n \geq O\!\left(\frac{1}{\varepsilon^2}\log\frac{C(\varepsilon,\mathbb{H})}{\delta}\right)$$
and the number of examples per task satisfies
$$m \geq O\!\left(\frac{1}{n\varepsilon^2}\log\frac{C(\varepsilon,\mathbb{H}^n)}{\delta}\right)$$
where $C(\varepsilon,\cdot)$ is a covering-number measure of the capacity of the family of hypothesis spaces $\mathbb{H}$, then with high probability $1-\delta$, the empirical loss $\hat{\mathrm{er}}(H)$ of every hypothesis space $H \in \mathbb{H}$ will stay within an $\varepsilon$ bound of the true error $\mathrm{er}(H)$.
Hierarchical Bayesian Model
The Bayesian model enhances this framework by interpreting the task distribution within the environment as an objective prior distribution, represented by hyper-priors over possible task priors. This model updates the hyper-prior to a hyper-posterior based on observed data, facilitating the learning of novel tasks from the inferred distributions. Here, learning to learn is formally defined via the Kullback-Leibler (KL) divergence, with an information-theoretic risk that decays to its minimum as the numbers of tasks and examples grow.
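A toy sketch can make the hyper-prior-to-hyper-posterior update concrete. The two discrete candidate priors, the coin-flip tasks, and all numbers below are invented for illustration; the paper's treatment is over general spaces of priors, not this two-point family:

```python
# Two candidate task priors over a coin's bias theta in {0.2, 0.8}:
# prior "A" favors low-bias coins, prior "B" favors high-bias coins.
thetas = [0.2, 0.8]
priors = {"A": [0.9, 0.1], "B": [0.1, 0.9]}
hyper_prior = {"A": 0.5, "B": 0.5}  # initial belief over candidate priors

def task_marginal(prior, heads, flips):
    """P(task data | prior) = sum over theta of P(data | theta) P(theta | prior)."""
    return sum(p * theta**heads * (1 - theta)**(flips - heads)
               for theta, p in zip(thetas, prior))

def hyper_posterior(tasks):
    """Update the hyper-prior to a hyper-posterior from (heads, flips) per task."""
    scores = {}
    for name, prior in priors.items():
        weight = hyper_prior[name]
        for heads, flips in tasks:
            weight *= task_marginal(prior, heads, flips)
        scores[name] = weight
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

# Tasks that mostly came up heads shift belief toward prior "B".
post = hyper_posterior([(9, 10), (8, 10), (10, 10)])
```

The hyper-posterior then serves as the learned bias: a novel task is approached with the inferred prior rather than the original uninformed hyper-prior.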
Practical Applications and Theoretical Implications
The theoretical advancements outlined in these models translate into practical approaches for learning common biases applicable to various related tasks. One compelling application detailed in the paper is feature learning using neural networks. The theorems show that as the number of tasks increases, the number of examples required per task decreases significantly, facilitating efficient and scalable learning of features across a multi-task environment.
Feature Learning Example
The paper examines neural networks where feature maps learned through multiple related tasks can boost the performance of subsequent learning tasks. It demonstrates that the convergence bounds and sample complexities derived for EP models are applicable, thereby affirming their practical significance in crafting efficient representation learning.
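The shared-feature idea can be illustrated with a small synthetic experiment: two regression tasks generated from one common latent feature direction, fitted by alternating least squares over a shared feature vector and per-task output heads. Everything here, the data, the dimensions, and the use of a linear feature map with alternating least squares, is an illustrative stand-in, not the paper's neural-network construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two related regression tasks that share one latent feature direction f_true.
d, m = 5, 60
f_true = rng.normal(size=d)
f_true /= np.linalg.norm(f_true)
scales = [2.0, -1.5]  # per-task output heads used to generate the data
X = [rng.normal(size=(m, d)) for _ in scales]
Y = [s * (x @ f_true) for s, x in zip(scales, X)]

# Jointly fit a shared feature vector f and per-task heads c by
# alternating least squares.
f = rng.normal(size=d)
c = np.ones(len(scales))
for _ in range(20):
    # Per-task head: 1-D least squares given the shared feature f.
    for t in range(len(scales)):
        z = X[t] @ f
        c[t] = (Y[t] @ z) / (z @ z)
    # Shared feature: least squares given the heads, pooled over tasks.
    A = sum(c[t]**2 * X[t].T @ X[t] for t in range(len(scales)))
    b = sum(c[t] * X[t].T @ Y[t] for t in range(len(scales)))
    f = np.linalg.solve(A, b)

# The learned feature aligns with the shared direction (up to sign and scale).
alignment = abs(f @ f_true) / np.linalg.norm(f)
```

Because both tasks constrain the same shared vector f, each task effectively contributes data toward the common representation, which is the mechanism behind the per-task sample savings discussed above.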
Future Directions
The analytical frameworks provided herein pave the way for several future research trajectories. One pertinent area is extending these models to more complex environments and tasks, validating and refining theoretical bounds through extensive empirical evaluation. Another avenue is enhancing the expressive power of the hypothesis spaces considered, potentially incorporating state-of-the-art neural architectures and broader types of biases.
Conclusion
Jonathan Baxter's paper provides a robust theoretical foundation for bias learning through two insightful models. Both the EP and Bayesian approaches underscore the feasibility of a machine learning its own bias within an environment of related tasks — minimizing manual intervention and enhancing automated, scalable learning processes. The convergence bounds and sample complexity insights crucially contribute to the theoretical understanding of multi-task learning, representing a significant step forward in the quest for smarter, self-adaptive learning systems. As these models mature, their practical impact across various domains with rich task environments stands to be profoundly transformative.