Learning Internal Representations (COLT 1995) (1911.05781v3)

Published 13 Nov 2019 in cs.LG and stat.ML

Abstract: Probably the most important problem in machine learning is the preliminary biasing of a learner's hypothesis space so that it is small enough to ensure good generalisation from reasonable training sets, yet large enough that it contains a good solution to the problem being learnt. In this paper a mechanism for {\em automatically} learning or biasing the learner's hypothesis space is introduced. It works by first learning an appropriate {\em internal representation} for a learning environment and then using that representation to bias the learner's hypothesis space for the learning of future tasks drawn from the same environment. An internal representation must be learnt by sampling from {\em many similar tasks}, not just a single task as occurs in ordinary machine learning. It is proved that the number of examples $m$ {\em per task} required to ensure good generalisation from a representation learner obeys $m = O(a+b/n)$ where $n$ is the number of tasks being learnt and $a$ and $b$ are constants. If the tasks are learnt independently ({\em i.e.} without a common representation) then $m=O(a+b)$. It is argued that for learning environments such as speech and character recognition $b\gg a$ and hence representation learning in these environments can potentially yield a drastic reduction in the number of examples required per task. It is also proved that if $n = O(b)$ (with $m=O(a+b/n)$) then the representation learnt will be good for learning novel tasks from the same environment, and that the number of examples required to generalise well on a novel task will be reduced to $O(a)$ (as opposed to $O(a+b)$ if no representation is used). It is shown that gradient descent can be used to train neural network representations and experiment results are reported providing strong qualitative support for the theoretical results.

Citations (395)

Summary

  • The paper introduces a framework for automatically learning hypothesis spaces by leveraging shared structures across multiple tasks.
  • The paper derives that sample complexity per task reduces from O(a+b) to O(a+b/n), ensuring efficient generalization.
  • The paper demonstrates through neural network experiments that gradient descent can accurately capture transferable internal representations.

Learning Internal Representations

The paper "Learning Internal Representations," authored by Jonathan Baxter, presents a significant exploration into the domain of machine learning, particularly focusing on the automatic selection of a learner's hypothesis space through the development of internal representations. Rather than exploring the traditional focus of machine learning—optimizing a given hypothesis space—it introduces a framework that facilitates the automated learning of hypothesis spaces across multiple tasks. This methodology addresses the critical issue of balancing the trade-off between the size of the hypothesis space and the capability to generalize from training data.

The concept of learning internal representations is predicated on the idea that models can leverage shared structures among numerous tasks to bias hypothesis spaces more effectively. This approach significantly differs from conventional models that focus on individual tasks, which may not fully capture beneficial biases due to limited data exposure. By sampling from "many similar tasks," it is possible to define more efficient hypothesis spaces that reduce the sample complexity per task.
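
A minimal sketch of this setup, not code from the paper: the $n$ tasks share one representation network $f$, and each task $t$ contributes only a small output function $g_t$ composed on top of it, so the hypothesis space for any single task is biased by what the shared $f$ has absorbed from all of them. The PyTorch framing, layer sizes, and dimensions are illustrative assumptions.

```python
# Sketch: n tasks share one internal representation f; each task t has
# its own small head g_t, and the hypothesis for task t is g_t(f(x)).
import torch
import torch.nn as nn

n_tasks, input_dim, repr_dim = 8, 16, 4

# Shared internal representation f: R^{input_dim} -> R^{repr_dim}
shared_repr = nn.Sequential(
    nn.Linear(input_dim, 32), nn.ReLU(),
    nn.Linear(32, repr_dim), nn.ReLU(),
)

# One lightweight output head per task; only these differ across tasks.
task_heads = nn.ModuleList([nn.Linear(repr_dim, 1) for _ in range(n_tasks)])

def predict(x: torch.Tensor, task_id: int) -> torch.Tensor:
    """Hypothesis for task `task_id`: g_t(f(x)), returned as a logit."""
    return task_heads[task_id](shared_repr(x))
```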

The mathematical framework established in the paper comes with rigorous theoretical guarantees. For $n$ tasks and $m$ examples per task, the paper derives that the number of examples per task required to ensure good generalization scales as $m = O(a + b/n)$, where $a$ and $b$ are environment-specific constants. This stands in contrast to the independent-task learning scenario, which requires $m = O(a + b)$. Internal representation learning can therefore drastically reduce the sampling cost, especially in domains where $b \gg a$, such as speech or character recognition.
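
Stated as formulas (with $a$ and $b$ the environment-dependent constants from the abstract; their precise definitions are given in the paper), the contrast is:

```latex
% Per-task sample complexity, as stated in the abstract.
% a, b are environment-dependent constants; n is the number of tasks.
\begin{align*}
  \text{shared representation ($n$ tasks jointly):} \quad
      & m = O\!\left(a + \tfrac{b}{n}\right) \\
  \text{independent learning (no shared representation):} \quad
      & m = O(a + b) \\
  \text{hence, when } b \gg a \text{ and } n \text{ is large:} \quad
      & O\!\left(a + \tfrac{b}{n}\right) \approx O(a) \ll O(a + b).
\end{align*}
```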

Moreover, the paper provides theoretical bounds ensuring the quality of the learned representation. Specifically, if $n = O(b)$, the representation will facilitate generalization to novel tasks from the same environment, with only $m = O(a)$ examples needed for effective learning, a significant improvement over learning without a shared representation, which requires $O(a + b)$. These theoretical results are supported by experiments with neural networks trained by gradient descent, where the networks learn representations that transfer across tasks defined by translationally invariant Boolean functions.
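
A hedged sketch of how the novel-task claim plays out in practice (again not the paper's code; the PyTorch API, dimensions, and synthetic data are assumptions): the previously learned representation is held fixed and only a fresh output head is fit on the new task, so the number of trainable parameters, and hence the data required, no longer scales with the full hypothesis space.

```python
# Sketch: adapt to a novel task by freezing the learned representation
# and fitting only a new output head on a small sample.
import torch
import torch.nn as nn

input_dim, repr_dim = 16, 4

# Stand-in for the representation already learned on the n previous tasks.
shared_repr = nn.Sequential(
    nn.Linear(input_dim, 32), nn.ReLU(),
    nn.Linear(32, repr_dim), nn.ReLU(),
)
shared_repr.requires_grad_(False)      # the representation is kept fixed

new_head = nn.Linear(repr_dim, 1)      # only this is trained on the novel task
optimizer = torch.optim.SGD(new_head.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

# Small novel-task sample (random placeholder data).
x = torch.randn(32, input_dim)
y = torch.randint(0, 2, (32, 1)).float()

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(new_head(shared_repr(x)), y)
    loss.backward()
    optimizer.step()
```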

The experiments show that gradient descent can be used effectively to find network weights that capture these shared internal structures, providing strong qualitative support for the theoretical results. Once trained, the representations enable rapid adaptation to new tasks with minimal additional training data.
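
For concreteness, here is one plausible way to construct translationally invariant Boolean tasks of the kind referenced above; the paper's exact experimental construction may differ, so the bit-string length, the use of cyclic shifts, and the random labels are assumptions.

```python
# Sketch: each task labels a bit-string by a random Boolean function of
# its cyclic-shift equivalence class, so every cyclic shift of an input
# receives the same label (translation invariance).
import itertools
import random

n_bits = 8

def canonical(bits):
    """Lexicographically smallest cyclic shift: a shift-invariant key."""
    rotations = [bits[i:] + bits[:i] for i in range(len(bits))]
    return min(rotations)

def make_task(seed):
    rng = random.Random(seed)
    label = {}  # one Boolean label per shift-equivalence class
    for bits in itertools.product((0, 1), repeat=n_bits):
        key = canonical(bits)
        if key not in label:
            label[key] = rng.randint(0, 1)
    return lambda bits: label[canonical(tuple(bits))]

task = make_task(seed=0)
x = (1, 0, 1, 1, 0, 0, 0, 0)
assert task(x) == task(x[3:] + x[:3])  # labels are invariant to cyclic shifts
```

Because every cyclic shift of an input maps to the same label, a representation that collapses shift-equivalent inputs is useful for every task in this family, which is exactly the kind of shared structure the theory exploits.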

The implications of this research are profound both in theory and application. The method facilitates efficiency in environments with high task similarity, offering potential reductions in computational resources and time. Such approaches open avenues for developing adaptive intelligent systems capable of rapidly learning varied but related tasks—core attributes desired in advancing machine learning and AI technologies.

In future research, exploration of other environments and task distributions could extend these results and integrate this framework into broader deep learning and AI systems. Additionally, refinement of neural network architectures and the study of other methods for representation learning could offer further ways to reduce the complexity of learned models across increasingly complex domains. Overall, the insights from this paper lay the groundwork for ongoing research into automating learning across diverse tasks, refining both the theory and practice of machine learning models.
