Recurrent Dropout without Memory Loss (1603.05118v2)

Published 16 Mar 2016 in cs.CL

Abstract: This paper presents a novel approach to recurrent neural network (RNN) regularization. Differently from the widely adopted dropout method, which is applied to *forward* connections of feed-forward architectures or RNNs, we propose to drop neurons directly in *recurrent* connections in a way that does not cause loss of long-term memory. Our approach is as easy to implement and apply as the regular feed-forward dropout and we demonstrate its effectiveness for Long Short-Term Memory network, the most popular type of RNN cells. Our experiments on NLP benchmarks show consistent improvements even when combined with conventional feed-forward dropout.

Authors (3)
  1. Stanislau Semeniuta (3 papers)
  2. Aliaksei Severyn (29 papers)
  3. Erhardt Barth (12 papers)
Citations (217)

Summary

Recurrent Dropout without Memory Loss

This paper presents a method for regularizing Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory networks (LSTMs), that mitigates the overfitting these architectures are prone to. The method extends the familiar dropout technique, commonly used in feed-forward neural networks, to the recurrent connections of RNNs, while avoiding the drawback that usually accompanies such adaptations: the loss of the network's long-term memory.

The proposed recurrent dropout differs from alternative methods in where the mask is applied: the authors drop units in the cell update vectors of the LSTM rather than in the hidden states or the gates themselves. Because the memory cell is never masked or rescaled directly, its contents are not repeatedly attenuated across time steps. This sidesteps the problem that arises when dropout acts directly on the hidden or cell states, where the accumulated scaling can shrink the stored values exponentially over time and erode the model's memory.
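
To make the placement of the mask concrete, here is a minimal sketch of one LSTM step with dropout applied only to the candidate cell update. It assumes a standard LSTM parameterization with concatenated gate weights; the function and parameter names are illustrative, not taken from the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step_recurrent_dropout(x_t, h_prev, c_prev, W, U, b, p_drop=0.25, training=True):
    """One LSTM step with dropout on the cell update vector g_t only.

    Because the mask touches only the candidate update, the memory cell
    c_t is accumulated without ever being zeroed or rescaled directly.
    """
    z = x_t @ W + h_prev @ U + b                 # joint pre-activation for all gates
    i, f, o, g = np.split(z, 4, axis=-1)         # input, forget, output gates + candidate
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)

    if training and p_drop > 0.0:
        # per-step mask on the candidate update (inverted dropout, so no test-time scaling)
        mask = (np.random.rand(*g.shape) > p_drop) / (1.0 - p_drop)
        g = g * mask

    c_t = f * c_prev + i * g                     # the cell state itself is never masked
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

The design point this sketch illustrates is that the dropped quantity only ever enters the cell state additively, so a zeroed unit at one step does not erase information already stored in the cell.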

The authors support their claims with extensive experimentation across a range of NLP tasks and benchmarks, including Named Entity Recognition, language modeling at both the word and character levels, and Twitter Sentiment Analysis. The empirical results consistently show measurable improvements in predictive performance across these tasks. Notably, the approach performs best when combined with forward dropout, strengthening the overall regularization effect without impairing the LSTMs' ability to model long-term dependencies in sequential data, a critical consideration for NLP tasks.

The paper also addresses the question of how dropout masks should be sampled. Whereas existing methods rely on per-sequence sampling, in which a single mask is reused for every time step of a sequence, the authors provide quantitative evaluations showing that per-step sampling, which draws a fresh mask at each time step, delivers comparable and often superior results. This flexibility broadens the applicability of the dropout scheme without compromising training effectiveness or the model's memory capabilities.
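
As a rough illustration of the difference between the two sampling strategies (the function name, shapes, and defaults here are assumptions made for the sketch, not the paper's implementation):

```python
import numpy as np

def sample_dropout_masks(seq_len, hidden_dim, p_drop, per_step=True, rng=None):
    """Draw inverted-dropout masks for one sequence.

    per_step=True  -> a fresh Bernoulli mask at every time step (per-step sampling)
    per_step=False -> a single mask reused across the whole sequence (per-sequence sampling)
    """
    rng = rng if rng is not None else np.random.default_rng()
    keep = 1.0 - p_drop
    if per_step:
        masks = rng.random((seq_len, hidden_dim)) < keep
    else:
        one_mask = rng.random((1, hidden_dim)) < keep
        masks = np.repeat(one_mask, seq_len, axis=0)
    return masks.astype(np.float32) / keep       # scale so expectations match at test time
```

Inside the recurrent loop, the per-step variant simply indexes `masks[t]` at each step, so it adds essentially no cost over per-sequence sampling.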

Furthermore, the proposed method integrates straightforwardly into existing neural network frameworks, since it conforms to standard implementations of dropout for feed-forward connections. This ease of integration, alongside the demonstrated empirical efficacy, suggests potential for broad adoption in the development of robust RNN models, particularly in domains where overfitting and memory retention are primary concerns. As deep learning tools evolve, future research might explore the strategy's applicability beyond the NLP tasks studied here, for example in sequence-to-sequence prediction or in other domains such as speech recognition, offering a regularization technique adaptable to diverse application contexts.