Weight-Dropped LSTM

Updated 16 March 2026

Weight-dropped LSTM is a recurrent network that applies DropConnect to hidden-to-hidden matrices to reduce overfitting in sequence models.
The method combines stochastic masking of the recurrent weights with the conventional LSTM recurrence and, as part of the AWD-LSTM recipe, achieved state-of-the-art perplexity on the PTB and WikiText-2 datasets at the time of publication (2017).
Empirical results demonstrate that weight-dropping improves long-range temporal stability and overall generalization without disrupting optimized LSTM kernels.

A weight-dropped LSTM is a Long Short-Term Memory (LSTM) network in which the recurrent (hidden-to-hidden) weight matrices are regularized using DropConnect. This technique, introduced in "Regularizing and Optimizing LSTM Language Models" (Merity et al., 2017), aims to improve generalization in sequence modeling, particularly word-level language modeling, by stochastically masking elements of the recurrent matrices during training. Because the masking acts on the weights rather than on the activations inside the recurrence, it mitigates overfitting and improves long-range temporal stability without interfering with black-box, highly optimized LSTM kernels such as those in cuDNN.

1. Standard LSTM Recurrence Relations

The conventional LSTM processes sequential data via the following recurrence:

$\begin{aligned} i_t &= \sigma(W^i x_t + U^i h_{t-1} + b^i) \\ f_t &= \sigma(W^f x_t + U^f h_{t-1} + b^f) \\ o_t &= \sigma(W^o x_t + U^o h_{t-1} + b^o) \\ \tilde{c}_t &= \tanh(W^c x_t + U^c h_{t-1} + b^c) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}$

Here, $x_t \in \mathbb{R}^d$ is the input at time $t$, $h_t, c_t \in \mathbb{R}^H$ are the hidden and cell states, $W^\cdot$ and $U^\cdot$ are the input-to-hidden and hidden-to-hidden weight matrices, $b^\cdot$ are the biases, $\sigma$ is the sigmoid nonlinearity, and $\odot$ is the elementwise product.
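The recurrence above can be sketched as a single time step in NumPy. This is a minimal illustration, not an optimized kernel: the function names (`lstm_step`, `init_params`), the dictionary layout keyed by gate, and the random initialization scale are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W, U, b are dicts keyed by gate name ('i', 'f', 'o', 'c') holding the
    input-to-hidden matrices, hidden-to-hidden matrices, and biases.
    """
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(d, H, rng):
    """Hypothetical small random initialization, for illustration only."""
    gates = ("i", "f", "o", "c")
    W = {g: rng.normal(scale=0.1, size=(H, d)) for g in gates}
    U = {g: rng.normal(scale=0.1, size=(H, H)) for g in gates}
    b = {g: np.zeros(H) for g in gates}
    return W, U, b
```

Keeping the four $U$ matrices as separate entries makes it easy to see which parameters a recurrent regularizer would touch; fused implementations typically concatenate them into one matrix.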

2. DropConnect-Based Recurrent Regularization

In a weight-dropped LSTM, DropConnect is applied to each hidden-to-hidden matrix $U^\cdot$. For each $U \in \{U^i, U^f, U^o, U^c\}$, a binary mask $M$ is sampled:

$M_{jk} \sim \mathrm{Bernoulli}(1 - p_{\mathrm{rec}})$

yielding a masked recurrent matrix

$\widetilde{U} = U \odot M$

Throughout a given batch, $\widetilde{U}$ remains fixed for both forward and backward passes. The recurrence is rewritten as

$i_t = \sigma\bigl(W^i x_t + (U^i \odot M^i)\, h_{t-1} + b^i\bigr)$

with analogous masking for $U^f, U^o, U^c$. This stochastically drops recurrent weights across batches, preventing parameter co-adaptation over time steps. No changes are needed to cuDNN kernels aside from substituting $U$ with $\widetilde{U}$.
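The masking step can be sketched in NumPy as follows. The helper names (`sample_recurrent_masks`, `apply_weight_drop`) and the dictionary of per-gate matrices are assumptions for illustration; the sketch follows the formula $\widetilde{U} = U \odot M$ literally and does not rescale the kept weights.

```python
import numpy as np

def sample_recurrent_masks(U, p_rec, rng):
    """Sample one keep-mask per hidden-to-hidden matrix.

    Each entry is 1 with probability 1 - p_rec and 0 otherwise.
    """
    return {g: (rng.random(U_g.shape) >= p_rec).astype(U_g.dtype)
            for g, U_g in U.items()}

def apply_weight_drop(U, masks):
    """Return masked copies of the recurrent matrices; U itself is unchanged."""
    return {g: U[g] * masks[g] for g in U}

# Usage: sample once per batch, then reuse U_tilde at every time step
# of that batch's forward and backward pass.
rng = np.random.default_rng(0)
H = 4
U = {g: rng.normal(size=(H, H)) for g in ("i", "f", "o", "c")}
masks = sample_recurrent_masks(U, p_rec=0.5, rng=rng)
U_tilde = apply_weight_drop(U, masks)
```

Some implementations additionally rescale the kept weights by $1/(1 - p_{\mathrm{rec}})$, as in inverted dropout; that choice is not part of the formula above and is omitted here. At evaluation time the unmasked $U$ would be used.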

3. Model Hyperparameters and Regularization

The weight-dropped LSTM is typically integrated into a broader regularization and optimization pipeline. The key hyperparameters, as empirically tuned on Penn Treebank (PTB) and WikiText-2 (WT2), are:

Parameter	PTB Value	WT2 Value
# layers (stacked LSTM)	3	3
Hidden size $H$	1150	1150
Embedding size $d$	400	400
Input dropout (on $x_t$ )	0.4	0.65
Inter-layer dropout	0.3	0.3
Output dropout	0.4	0.4
Embedding dropout $p_e$	0.1	0.1
Recurrent DropConnect $p_{rec}$	0.5	0.5
Activation regularization $\alpha$	2	2
Temporal AR $\beta$	1	1

Variational dropout masks are used per-batch for input, inter-layer, and output connections. Embedding dropout is used on token embeddings. Weight tying is employed between input embedding and output softmax layers.
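The coefficients $\alpha$ and $\beta$ in the table weight two penalties added to the training loss: activation regularization (AR), which discourages large hidden activations, and temporal activation regularization (TAR), which discourages large changes between consecutive hidden states. The text above gives only the coefficients, so the sketch below makes assumptions: the function name `ar_tar_penalty` is hypothetical, the penalties are computed as means of squared values over a `(T, B, H)` array from the final LSTM layer, and AR is applied to the dropout-masked output.

```python
import numpy as np

def ar_tar_penalty(h, h_dropped, alpha=2.0, beta=1.0):
    """Combined AR + TAR penalty (illustrative form, see assumptions above).

    h:         final-layer hidden states, shape (T, B, H)
    h_dropped: the same states after output dropout, shape (T, B, H)
    """
    ar = alpha * np.mean(h_dropped ** 2)          # penalize large activations
    tar = beta * np.mean((h[1:] - h[:-1]) ** 2)   # penalize step-to-step change
    return ar + tar
```

The penalty would be added to the cross-entropy loss before backpropagation; with constant hidden states the TAR term vanishes and only the AR term remains.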

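Sequence-length jitter is introduced for BPTT: the length is drawn as $L \sim \mathcal{N}(70, 5)$ (mean 70, standard deviation 5) with 95% probability, else $L \sim \mathcal{N}(35, 5)$, and the learning rate is scaled by $L/70$ so that shorter sequences do not receive disproportionately large updates.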
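A minimal sketch of this sampling scheme using only the standard library is shown below. The function names, the truncation to an integer, and the lower bound `min_len` are assumptions added so that the example always returns a usable length.

```python
import random

def sample_bptt_length(base_len=70, std=5.0, p_full=0.95, min_len=5, rng=random):
    """Draw a BPTT sequence length.

    With probability p_full the mean is base_len, otherwise base_len / 2.
    min_len is a hypothetical safeguard against very short draws.
    """
    mean = base_len if rng.random() < p_full else base_len / 2
    return max(min_len, int(rng.gauss(mean, std)))

def scaled_lr(base_lr, length, base_len=70):
    """Rescale the learning rate in proportion to the sampled length."""
    return base_lr * length / base_len

# Usage: a length of 35 halves a base learning rate of 30.
lr = scaled_lr(30.0, 35)  # 15.0
```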

4. Optimization: Non-Monotonic Triggered ASGD

Optimization is performed by NT-ASGD, a variant of Averaged Stochastic Gradient Descent (ASGD) in which the averaging trigger is determined by a non-monotonic validation error criterion. Define $w_k$ as the parameters at step $k$, with updates:

$w_{k+1} = w_k - \gamma \, \widehat{\nabla} f(w_k)$

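with a fixed learning rate $\gamma = 30$. At the end of every epoch, the validation perplexity $v_t$ is logged. With a non-monotone interval of $n = 5$ epochs, averaging commences at the first epoch $t > n$ for which $v_t > \min \{ v_0, \ldots, v_{t-n-1} \}$, that is, when the current validation perplexity is worse than the best value recorded more than $n$ epochs earlier; the trigger is then set to $T = k$. The final parameters are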

$w_\mathrm{avg} = \frac{1}{K - T + 1} \sum_{i=T}^K w_i$

This adaptive mechanism removes the need for manual selection of $T$ as in classical ASGD.
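The averaging formula can be sketched in plain Python as a running mean over the iterates from the trigger onward. The function names are hypothetical, parameters are represented as flat lists of floats for simplicity, and the comparison helper leaves the choice of which past validation values enter the minimum to the caller.

```python
def exceeds_best(current_ppl, reference_ppls):
    """True when the current validation perplexity is worse than the best
    value in the caller-supplied set of earlier perplexities."""
    return bool(reference_ppls) and current_ppl > min(reference_ppls)

def average_from_trigger(iterates, T):
    """Mean of iterates[T], ..., iterates[K], computed as a running average.

    iterates: list of parameter vectors w_0, ..., w_K (flat lists of floats)
    T:        index of the first iterate included in the average
    """
    avg = list(iterates[T])
    for n, w in enumerate(iterates[T + 1:], start=2):
        # Incremental mean update: avg_n = avg_{n-1} + (w - avg_{n-1}) / n
        avg = [a + (w_j - a) / n for a, w_j in zip(avg, w)]
    return avg

# Usage: averaging the last two of four iterates.
iterates = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]
w_avg = average_from_trigger(iterates, T=2)  # [5.0, 1.0]
```

In practice the running mean is updated after each SGD step rather than from a stored list, so that only one extra copy of the parameters is kept in memory.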

5. Experimental Protocol

Experiments are conducted on Penn Treebank and WikiText-2, preprocessed to vocabularies of 10k (PTB) and ~33k (WT2) word types. Training uses a batch size of 40 (PTB) or 80 (WT2), runs NT-ASGD for 750 epochs, and clips gradients to a maximum norm of 0.25. After NT-ASGD, a single ASGD fine-tuning pass is applied with averaging enabled from the start ($T=0$) and terminated by an analogous non-monotonic rule. Computation is executed on NVIDIA GPUs using cuDNN LSTM for efficiency.
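Gradient clipping by global norm, as used with the 0.25 threshold above, can be sketched as follows. The function name and the representation of gradients as a list of flat lists are assumptions for the example; deep learning frameworks provide their own equivalents.

```python
import math

def clip_grad_norm(grads, max_norm=0.25):
    """Rescale gradients so their global L2 norm is at most max_norm.

    grads: list of flat lists of floats, one per parameter tensor.
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm
```

All tensors are scaled by the same factor, so the direction of the overall update is preserved and only its magnitude is limited.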

6. Empirical Results

Single-model perplexities for AWD-LSTM (ASGD Weight-Dropped LSTM) are:

Dataset	Model	#Params	Val PPL	Test PPL
PTB	AWD-LSTM (3×1150, drop-drop)	24M	60.0	57.3
PTB	AWD-LSTM + neural cache	24M	53.9	52.8
WT2	AWD-LSTM (3×1150, increased inp drop)	33M	68.6	65.8
WT2	AWD-LSTM + neural cache	33M	53.8	52.0

These results were state of the art at the time of publication and reflect the full AWD-LSTM recipe, in which DropConnect on the recurrent weights is the largest single contributor in the ablations. Removing weight-dropped recurrence degraded perplexity by ~11 points (PTB) and ~9 points (WT2), whereas omitting embedding or variational dropout yielded 2–6 point increases, indicating that weight-dropped regularization is critical for generalization.

7. Analysis and Implications

Weight-dropping regularizes the recurrent transition dynamics at the parameter level, analogously to variational dropout on hidden activations. By masking a subset of $U$'s weights per batch, it prevents overfitting to specific recurrent pathways and mitigates co-adaptation. The method is fully compatible with optimized LSTM kernels, since the masked weights are computed once per batch and require only a single additional elementwise multiply per gate.

The approach acts synergistically with other regularization techniques (embedding dropout, AR/TAR penalties, weight tying), substantially improving long-range stability and generalization, as evidenced by ablation studies and held-out perplexity metrics (Merity et al., 2017). This suggests weight-dropped LSTM variants are well-suited for language modeling applications demanding robustness to overfitting, especially in low-resource or small-vocabulary settings.

References

Merity, S., Keskar, N. S., and Socher, R. (2017). Regularizing and Optimizing LSTM Language Models.
