Weight-Dropped LSTM
- Weight-dropped LSTM is a recurrent network that applies DropConnect to hidden-to-hidden matrices to reduce overfitting in sequence models.
- The method integrates stochastic recurrent weight masking with conventional LSTM recurrence and, as part of the AWD-LSTM pipeline, achieved state-of-the-art perplexity on PTB and WikiText-2 at the time of publication (2017).
- Empirical results demonstrate that weight-dropping improves long-range temporal stability and overall generalization without disrupting optimized LSTM kernels.
A weight-dropped LSTM is a Long Short-Term Memory (LSTM) network in which the recurrent (hidden-to-hidden) weight matrices are regularized using DropConnect. This technique, introduced in "Regularizing and Optimizing LSTM Language Models" (Merity et al., 2017), aims to improve generalization in sequence modeling, particularly word-level language modeling, by stochastically masking elements of the recurrent matrices during training. The approach mitigates overfitting and enhances long-range temporal stability without interfering with black-box, highly optimized LSTM kernels such as those in cuDNN.
1. Standard LSTM Recurrence Relations
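The conventional LSTM processes sequential data via the following recurrence:

$$
\begin{aligned}
i_t &= \sigma(W^i x_t + U^i h_{t-1} + b^i) \\
f_t &= \sigma(W^f x_t + U^f h_{t-1} + b^f) \\
o_t &= \sigma(W^o x_t + U^o h_{t-1} + b^o) \\
\tilde{c}_t &= \tanh(W^c x_t + U^c h_{t-1} + b^c) \\
c_t &= i_t \odot \tilde{c}_t + f_t \odot c_{t-1} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here, $x_t$ is the input at time $t$, $h_t$ and $c_t$ the hidden and cell state, $W^*$ and $U^*$ the input-to-hidden and hidden-to-hidden weight matrices, $b^*$ biases, $\sigma$ the sigmoid nonlinearity, and $\odot$ the elementwise product.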
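As a concrete reading of the recurrence, the following NumPy sketch computes one LSTM time step. The dictionary layout keyed by gate name, the helper `init_params`, and the toy sizes are assumptions made for illustration, with `W`, `U`, and `b` standing for the input-to-hidden matrices, hidden-to-hidden matrices, and biases.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W, U, b are dicts keyed by gate name: "i", "f", "o", "c".
    W[g]: (hidden, input), U[g]: (hidden, hidden), b[g]: (hidden,).
    """
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell
    c_t = i * c_tilde + f * c_prev
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def init_params(input_size, hidden_size, rng):
    """Hypothetical helper: small random weights, zero biases."""
    gates = ("i", "f", "o", "c")
    W = {g: rng.normal(0, 0.1, (hidden_size, input_size)) for g in gates}
    U = {g: rng.normal(0, 0.1, (hidden_size, hidden_size)) for g in gates}
    b = {g: np.zeros(hidden_size) for g in gates}
    return W, U, b

# Run a toy sequence of 5 inputs through the cell.
rng = np.random.default_rng(0)
W, U, b = init_params(input_size=4, hidden_size=3, rng=rng)
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.normal(size=(5, 4)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Keeping the four hidden-to-hidden matrices separate from the input-to-hidden ones makes it easy to see which parameters a recurrent regularizer would touch.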
2. DropConnect-Based Recurrent Regularization
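In a weight-dropped LSTM, DropConnect is applied to each hidden-to-hidden matrix $U^g$, $g \in \{i, f, o, c\}$. For each $U^g$, a binary mask $M^g$ of the same shape is sampled elementwise:

$$M^g_{jk} \sim \mathrm{Bernoulli}(1 - p),$$

where $p$ is the drop probability, yielding a masked recurrent matrix

$$\tilde{U}^g = M^g \odot U^g.$$

Throughout a given batch, $M^g$ remains fixed for both forward and backward passes. The recurrence is rewritten as

$$i_t = \sigma(W^i x_t + \tilde{U}^i h_{t-1} + b^i),$$

with analogous masking for $f_t$, $o_t$, and $\tilde{c}_t$. This stochastically drops recurrent weights across batches, preventing parameter co-adaptation over time steps. No changes are needed to cuDNN kernels aside from substituting $\tilde{U}^g$ for $U^g$.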
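The masking step can be sketched in a few lines of NumPy. The function name `drop_connect`, the toy hidden size, and the `tanh` stand-in for a gate's recurrent term are assumptions for illustration; the point is only that the mask is drawn once and the same masked matrix is reused at every time step of the batch.

```python
import numpy as np

def drop_connect(U, p, rng):
    """Return (U_tilde, mask): each entry of U is zeroed independently with probability p."""
    mask = (rng.random(U.shape) >= p).astype(U.dtype)
    return mask * U, mask

rng = np.random.default_rng(0)
hidden = 6
U = rng.normal(size=(hidden, hidden))  # one hidden-to-hidden matrix

# Sample the mask once per batch ...
U_tilde, mask = drop_connect(U, p=0.5, rng=rng)

# ... and reuse the same masked matrix at every time step.
h = np.ones(hidden)
for _ in range(3):
    h = np.tanh(U_tilde @ h)  # stand-in for the recurrent term of a single gate
```

Implementations built on a standard dropout routine typically also rescale the kept weights by $1/(1-p)$; that rescaling is omitted here to match the plain masking described above.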
3. Model Hyperparameters and Regularization
The weight-dropped LSTM is typically integrated into a broader regularization and optimization pipeline. The key hyperparameters, as empirically tuned on Penn Treebank (PTB) and WikiText-2 (WT2), are:
| Parameter | PTB Value | WT2 Value |
|---|---|---|
| # layers (stacked LSTM) | 3 | 3 |
| Hidden size | 1150 | 1150 |
| Embedding size | 400 | 400 |
| Input dropout (on the input $x_t$) | 0.4 | 0.65 |
| Inter-layer dropout | 0.3 | 0.3 |
| Output dropout | 0.4 | 0.4 |
| Embedding dropout | 0.1 | 0.1 |
| Recurrent DropConnect | 0.5 | 0.5 |
| Activation regularization (AR) coefficient | 2 | 2 |
| Temporal activation regularization (TAR) coefficient | 1 | 1 |
Variational dropout is used for input, inter-layer, and output connections: each mask is sampled once per batch and shared across all time steps. Embedding dropout is applied to token embeddings, dropping entire word vectors. Weight tying is employed between the input embedding and output softmax layers.
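The two dropout variants differ mainly in the shape of the mask. The sketch below is a minimal NumPy illustration under assumed tensor layouts (`(seq_len, batch, features)` for activations, `(vocab, dim)` for the embedding matrix); the function names are hypothetical, and the inverted-dropout rescaling by $1/(1-p)$ is a common convention adopted here rather than something the text above specifies.

```python
import numpy as np

def variational_dropout(x, p, rng):
    """x: (seq_len, batch, features). One mask per (batch, feature), shared over time."""
    if p == 0.0:
        return x
    mask = (rng.random((1,) + x.shape[1:]) >= p) / (1.0 - p)
    return x * mask  # the leading axis of size 1 broadcasts over time steps

def embedding_dropout(E, p, rng):
    """E: (vocab, dim). Zero whole rows, so a dropped word is dropped everywhere it occurs."""
    if p == 0.0:
        return E
    keep = (rng.random((E.shape[0], 1)) >= p) / (1.0 - p)
    return E * keep

rng = np.random.default_rng(0)
acts = np.ones((4, 2, 3))                     # toy activations
dropped = variational_dropout(acts, 0.4, rng)
E = rng.normal(size=(10, 5))                  # toy embedding matrix
E_dropped = embedding_dropout(E, 0.1, rng)
```

Because the activation mask has a time axis of length one, every time step in `dropped` sees the same pattern of zeros, which is what distinguishes this scheme from ordinary per-step dropout.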
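Sequence-length jitter is introduced for BPTT: the window length is drawn from $\mathcal{N}(70, 5)$ with 95% probability, else from $\mathcal{N}(35, 5)$, and the learning rate is scaled by the ratio of the sampled length to the base length of 70.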
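A minimal sketch of the jitter, using only the standard library. The base length of 70, the standard deviation of 5, and the floor `min_len` are assumed values for illustration, and `sample_bptt_length` is a hypothetical name.

```python
import random

def sample_bptt_length(base=70, p_full=0.95, std=5.0, min_len=5, rng=random):
    """Draw a BPTT window length and the matching learning-rate scale.

    With probability p_full the window is centred on `base`, otherwise on base / 2.
    `min_len` is an assumed floor that guards against tiny or negative draws.
    """
    mean = base if rng.random() < p_full else base / 2
    seq_len = max(min_len, int(rng.gauss(mean, std)))
    lr_scale = seq_len / base  # shorter windows take proportionally smaller steps
    return seq_len, lr_scale

seq_len, lr_scale = sample_bptt_length(rng=random.Random(0))
```

Scaling the learning rate by the window length keeps short windows from being weighted as heavily as full-length ones in the parameter update.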
4. Optimization: Non-Monotonic Triggered ASGD
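Optimization is performed by NT-ASGD, a variant of Averaged Stochastic Gradient Descent (ASGD) wherein the averaging trigger is determined by a non-monotonic validation error criterion. Define $w_k$ as parameters at step $k$, with updates:

$$w_{k+1} = w_k - \gamma \, \hat{\nabla} f(w_k),$$

with fixed learning rate $\gamma$. Every epoch, the validation perplexity $v_t$ is logged. Averaging commences when $t > n$ and $v_t > \min_{0 \le i \le t-n-1} v_i$, where $n$ is the non-monotone interval, setting trigger $T = k$. The final parameters are

$$\bar{w} = \frac{1}{K - T + 1} \sum_{i=T}^{K} w_i,$$

where $K$ is the final step.
This adaptive mechanism removes the need for manual selection of $T$ as in classical ASGD.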
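The trigger logic can be illustrated with a toy driver in plain Python. This is a sketch under simplifying assumptions: parameters are scalars, there is exactly one iterate per validation check, and the non-monotone interval `n` and function names are chosen for illustration.

```python
def nt_asgd_should_trigger(val_log, v, n=5):
    """Non-monotonic criterion.

    val_log holds earlier validation perplexities, oldest first.
    Returns True if v is worse than the best value recorded more than n checks ago.
    """
    t = len(val_log)
    return t > n and v > min(val_log[: t - n])

def run_nt_asgd(val_ppls, weights, n=5):
    """Toy driver: weights[k] is the (scalar) iterate at validation check k.

    Returns (T, w_bar): the trigger index (None if never triggered) and the
    average of the iterates from T onward (or the last iterate if no trigger).
    """
    log, T = [], None
    for k, v in enumerate(val_ppls):
        if T is None and nt_asgd_should_trigger(log, v, n):
            T = k
        log.append(v)
    if T is None:
        return None, weights[-1]
    tail = weights[T:]
    return T, sum(tail) / len(tail)

val_ppls = [100, 90, 80, 70, 65, 64, 63, 66, 62]
weights = [float(k) for k in range(len(val_ppls))]
T, w_bar = run_nt_asgd(val_ppls, weights, n=2)
```

In this run the trigger fires at check 7, where the perplexity of 66 exceeds 65, the best value among the checks made more than `n` steps earlier; the returned average then covers iterates 7 and 8.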
5. Experimental Protocol
Experiments are conducted on Penn Treebank and WikiText-2, preprocessed to a vocabulary of 10k (PTB) and ~33k (WT2). Training uses a batch size of 40 (PTB) or 80 (WT2), 750 epochs of NT-ASGD, and gradient clipping at a maximum norm of 0.25. After NT-ASGD, a single ASGD fine-tuning pass is applied, with averaging active from the start ($T = 0$) and the same non-monotonic criterion used to stop the run. Computation is executed on NVIDIA GPUs using cuDNN LSTM for efficiency.
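Gradient clipping by norm, as used in this protocol, can be sketched as follows. The flat-list representation of the gradient and the function name are assumptions for illustration; deep-learning frameworks provide their own equivalents.

```python
import math

def clip_grad_norm(grads, max_norm=0.25):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm).
    """
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads), total
    scale = max_norm / total
    return [g * scale for g in grads], total

clipped, norm_before = clip_grad_norm([3.0, 4.0])  # norm 5.0, rescaled to 0.25
```

The direction of the gradient is preserved; only its magnitude is capped, which limits the size of any single update.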
6. Empirical Results
Single-model perplexities for AWD-LSTM (ASGD Weight-Dropped LSTM) are:
| Dataset | Model | #Params | Val PPL | Test PPL |
|---|---|---|---|---|
| PTB | AWD-LSTM (3×1150, drop-drop) | 24M | 60.0 | 57.3 |
| PTB | AWD-LSTM + neural cache | 24M | 53.9 | 52.8 |
| WT2 | AWD-LSTM (3×1150, increased inp drop) | 33M | 68.6 | 65.8 |
| WT2 | AWD-LSTM + neural cache | 33M | 53.8 | 52.0 |
These perplexities, state of the art at the time of publication, come from the full AWD-LSTM pipeline rather than from recurrent DropConnect alone. In ablations, removing weight-dropped recurrence degraded perplexity by ~11 points (PTB) and ~9 points (WT2), while omitting embedding or variational dropout yielded 2–6 point increases, indicating that weight-dropped regularization is the most critical single component for generalization.
7. Analysis and Implications
Weight-dropping regularizes the recurrent transition dynamics at the parameter level, analogously to variational dropout on hidden activations. By masking a subset of the hidden-to-hidden weights $U$ per batch, it prevents overfitting to specific recurrent pathways and mitigates co-adaptation. The method is fully compatible with optimized LSTM kernels, since the masked weights are computed once per batch and require only a single additional elementwise multiply per gate.
The approach acts synergistically with other regularization techniques (embedding dropout, AR/TAR penalties, weight tying), substantially improving long-range stability and generalization, as evidenced by ablation studies and held-out perplexity metrics (Merity et al., 2017). This suggests weight-dropped LSTM variants are well-suited for language modeling applications demanding robustness to overfitting, especially in low-resource or small-vocabulary settings.