- The paper demonstrates that diverse RNN architectures achieve comparable capacity, with about 5 bits of information stored per parameter.
- It highlights that effective training, rather than architectural differences, is key to performance, noting vanilla RNNs are harder to train than gated models.
- The study introduces novel architectures that simplify training in deep RNN stacks, offering practical insights for both theory and real-world applications.
Capacity and Trainability in Recurrent Neural Networks
The paper "Capacity and Trainability in Recurrent Neural Networks" by Collins, Sohl-Dickstein, and Sussillo investigates crucial aspects of recurrent neural networks (RNNs), namely their capacity to store information about tasks and input histories, as well as their trainability. The authors conducted a series of experiments to elucidate the performance limitations and capabilities intrinsic to different RNN architectures. This examination extends across common RNN frameworks, including vanilla RNNs, Long Short Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs), alongside two novel architectures introduced in the paper.
Key Findings and Results
The paper finds that with careful training, all common RNN architectures reach similar bounds on task- and unit-level capacity. A key result is that the amount of task information an RNN can store scales linearly with the number of network parameters, at approximately 5 bits per parameter. In addition, these networks can store roughly one real number from their input history per hidden unit.
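To make the scaling concrete, the sketch below turns the reported numbers into a back-of-envelope calculation. The parameter-count formula assumes a single-layer vanilla RNN with a linear readout, and the layer sizes and the 5-bits-per-parameter constant are illustrative values taken from the summary above, not the paper's exact experimental configuration.

```python
# Back-of-envelope capacity estimate based on the scaling reported above:
# roughly 5 bits of task information per parameter and about one real number
# of input history per hidden unit. The parameter-count formula assumes a
# single-layer vanilla RNN with a linear readout (an illustrative choice).

def vanilla_rnn_param_count(n_in: int, n_h: int, n_out: int) -> int:
    """Weights and biases of a single-layer vanilla RNN plus a linear readout."""
    recurrent = n_h * n_h        # hidden-to-hidden weights
    input_w = n_in * n_h         # input-to-hidden weights
    readout = n_h * n_out        # hidden-to-output weights
    biases = n_h + n_out
    return recurrent + input_w + readout + biases

def capacity_estimate(n_in: int, n_h: int, n_out: int,
                      bits_per_param: float = 5.0) -> dict:
    """Capacity figures implied by the reported scaling laws."""
    params = vanilla_rnn_param_count(n_in, n_h, n_out)
    return {
        "parameters": params,
        "task_bits": bits_per_param * params,  # ~5 bits per parameter
        "input_numbers": n_h,                  # ~1 real number per hidden unit
    }

print(capacity_estimate(n_in=64, n_h=256, n_out=64))
```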
The authors also determined that for several tasks, performance is governed primarily by the number of parameters available per task rather than by the choice of recurrent architecture. This suggests that many performance differences attributed to architecture may in fact stem from differences in training effectiveness.
The paper also examines how difficult the various architectures are to train. Vanilla RNNs, despite their slightly higher capacity, prove harder to train than gated models such as LSTMs and GRUs. The two novel architectures introduced in the paper address this gap: one of them reportedly trains more easily than conventional LSTM and GRU models, especially in deeply stacked configurations.
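As a rough illustration of the kind of simplified gating the paper advocates, the sketch below implements a single-gate recurrent cell in NumPy: one gate interpolates between the previous hidden state and a candidate update. The specific equations, parameter names, and the positive gate-bias initialization are plausible choices for such a design, not necessarily the paper's exact formulation of its new architectures.

```python
import numpy as np

# Minimal sketch of a single-gate recurrent cell in the spirit of the
# simplified gated designs the paper introduces: one gate interpolates
# between the previous hidden state and a candidate update. Illustrative
# formulation only, not necessarily the paper's exact equations.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SingleGateRNNCell:
    def __init__(self, n_in: int, n_h: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(n_h)
        # Candidate-update parameters.
        self.W_c = rng.normal(0, scale, (n_h, n_in))
        self.U_c = rng.normal(0, scale, (n_h, n_h))
        self.b_c = np.zeros(n_h)
        # Gate parameters; a positive gate bias keeps the gate mostly
        # "remember" early on, which tends to help gradient flow.
        self.W_g = rng.normal(0, scale, (n_h, n_in))
        self.U_g = rng.normal(0, scale, (n_h, n_h))
        self.b_g = np.ones(n_h)

    def step(self, x, h_prev):
        c = np.tanh(self.W_c @ x + self.U_c @ h_prev + self.b_c)   # candidate state
        g = sigmoid(self.W_g @ x + self.U_g @ h_prev + self.b_g)   # update gate
        return g * h_prev + (1.0 - g) * c                          # gated interpolation

# Usage: run one step on random data.
cell = SingleGateRNNCell(n_in=8, n_h=16)
h = np.zeros(16)
h = cell.step(np.random.default_rng(1).normal(size=8), h)
```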
Experimental Approach
The empirical experiments relied on extensive hyperparameter tuning to thoroughly explore the factors influencing RNN performance. The task suite spanned memorization tasks, language modeling, arithmetic tasks, and others designed to probe different operational facets of RNNs.
For capacity evaluation, the primary focus was per-parameter capacity for task storage and per-unit capacity for input memory. The researchers quantified how stored information scales with parameter count across architectures and found that the architectures do not differ significantly in the capacity bottlenecks associated with common RNN primitives.
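The following sketch gives a flavor of a per-unit memory probe, under simplifying assumptions: it drives a randomly initialized vanilla RNN with white-noise input and counts how many past inputs a linear readout of the current hidden state can recover. The paper trains its networks end-to-end with tuned hyperparameters, so this linear-decoding shortcut, along with the sizes and threshold chosen here, is only illustrative.

```python
import numpy as np

# Simplified probe of per-unit input memory: how many past inputs can a
# linear readout recover from the current state of a random vanilla RNN?
# This is an illustrative stand-in for the paper's end-to-end-trained
# unit-capacity experiments, not a reproduction of them.

def run_rnn(u, n_h, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.9 / np.sqrt(n_h), (n_h, n_h))  # recurrent weights
    w_in = rng.normal(0, 1.0, n_h)                     # input weights
    h = np.zeros(n_h)
    states = []
    for u_t in u:
        h = np.tanh(W @ h + w_in * u_t)
        states.append(h.copy())
    return np.array(states)

def recoverable_lags(n_h=64, T=5000, max_lag=128, r2_threshold=0.5):
    rng = np.random.default_rng(1)
    u = rng.uniform(-1, 1, T)
    H = run_rnn(u, n_h)
    count = 0
    for lag in range(1, max_lag + 1):
        X, y = H[lag:], u[:-lag]                       # predict input `lag` steps back
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # linear readout
        r2 = 1 - np.mean((X @ coef - y) ** 2) / np.var(y)
        count += r2 > r2_threshold
    return count

print(f"lags recoverable with 64 hidden units: {recoverable_lags()}")
```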
Implications and Future Research Directions
This examination of capacity and trainability supports a deeper understanding of RNN limitations and capabilities. On the practical side, the insights can guide the choice of RNN architecture under different computational and training-resource constraints, influencing model selection in real-world deployments.
Theoretically, combining these findings with ongoing architectural innovations could spur the development of more efficient RNN designs. Future research could examine the gradients and optimization landscapes specific to RNNs, potentially uncovering training paradigms that further improve performance while reducing computational cost.
Furthermore, the discourse on per-parameter capacity aligns interestingly with neurobiological studies on synaptic capacities, suggesting fertile ground for interdisciplinary exploration. Such cross-disciplinarity may inspire biomimetic designs in AI, fostering models that emulate the efficiency observed in biological systems.
Thus, while the paper avoids overstating its claims, it establishes a solid foundation for both applied machine learning and theoretical exploration, marking a notable step forward in understanding how RNNs work.