
Multi-Frequency LSTM Models

Updated 4 March 2026
  • Multi-frequency LSTM is a recurrent model that integrates multiple timescales to represent diverse sequential dynamics in language data.
  • One variant sets forget-gate biases so that unit memory timescales follow an inverse-gamma distribution; a second uses multiplicative gating. Both aim to improve memory retention and perplexity in language tasks.
  • The architecture offers clear interpretability by routing linguistic information based on word frequency, leading to measurable gains over standard LSTM models.

Multi-Frequency Long Short-Term Memory (mLSTM) refers to architectural enhancements of standard Long Short-Term Memory (LSTM) networks in order to explicitly capture, model, and route information at multiple timescales or frequency bands within sequential data. Multiple mLSTM variants exist, notably multi-timescale LSTM models motivated by the statistical properties of natural language, and multiplicative LSTM architectures featuring input-dependent second-order gating. Both lines of research target improved memory dynamics and input sensitivity, yielding performance and interpretability gains in recurrent language modeling (Mahto et al., 2020, Maupomé et al., 2019).

1. Timescale Dynamics in LSTM Networks

Traditional LSTMs update their memory cell $c_t$ as

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$

with forget gate $f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$ and input gate $i_t = \sigma(\cdots + b_i)$. The effective memory timescale $\tau$ of a cell is governed by the forget-gate bias $b_f$:

$$\tau = -\frac{1}{\ln f_0} = -\frac{1}{\ln \sigma(b_f)},$$

where $f_0$ is the stationary value of the forget gate. This establishes that the forget-gate bias directly controls the exponential decay rate of cell-state memory, quantifying each unit's memory retention span (Mahto et al., 2020).
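
The bias–timescale relation above can be inverted to set a desired memory span directly. A minimal sketch (function names are illustrative, not from the paper):

```python
import math

def timescale_from_bias(b_f: float) -> float:
    """Effective memory timescale: tau = -1 / ln(sigmoid(b_f))."""
    f0 = 1.0 / (1.0 + math.exp(-b_f))  # stationary forget-gate value
    return -1.0 / math.log(f0)

def bias_from_timescale(tau: float) -> float:
    """Invert the relation: choose b_f so a unit remembers for ~tau steps."""
    f0 = math.exp(-1.0 / tau)          # required stationary forget value
    return math.log(f0 / (1.0 - f0))   # logit of f0

# A bias of 1.0 (a common LSTM initialization) gives a short memory span
# of roughly three time steps:
tau = timescale_from_bias(1.0)
```

The inverse mapping is what allows a designer to fix a unit's timescale in advance rather than leaving it to training.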

2. Power-Law Decay and Inverse-Gamma Timescale Priors

Observations of natural language show that the mutual information between words, $\mathrm{MI}(w_k, w_{k+t})$, decays as a power law $t^{-d}$ rather than exponentially. A single-timescale LSTM cannot match this decay profile. However, a heterogeneous ensemble of LSTM units, each with its own timescale $\tau_i$ drawn from a distribution $P(\tau)$, yields an expected memory decay

$$E_\tau\!\left[e^{-t/\tau}\right] = \int_0^\infty P(\tau)\, e^{-t/\tau}\, d\tau \propto t^{-d}.$$

Via a Gamma-function identity and a change of variables, the distribution $P(\tau)$ that achieves the desired power-law decay is the inverse-gamma distribution:

$$P(\tau; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \tau^{-(\alpha+1)} \exp(-\beta/\tau),$$

with $\alpha = d$ and $\beta = 1$ in the canonical case (Mahto et al., 2020).
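
As a sanity check on this result: for $\tau \sim \mathrm{InvGamma}(\alpha, \beta)$ the mixture has the closed form $E_\tau[e^{-t/\tau}] = (1 + t/\beta)^{-\alpha}$ (the moment-generating function of the gamma-distributed rate $1/\tau$), which behaves as $t^{-\alpha}$ for large $t$. A short Monte-Carlo sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.56, 1.0  # values used in the multi-timescale construction

# Inverse-gamma samples: if u ~ Gamma(alpha, scale=1/beta), then 1/u ~ InvGamma(alpha, beta).
tau = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=1_000_000)

# Monte-Carlo estimate of E[exp(-t/tau)] at several lags t versus the closed form.
for t in (1.0, 10.0, 100.0):
    mc = np.exp(-t / tau).mean()
    exact = (1.0 + t / beta) ** -alpha
    print(f"t={t:6.1f}  MC={mc:.4f}  closed form={exact:.4f}")
```

The two columns agree closely, confirming that a population of exponential forgetters with inverse-gamma timescales produces power-law memory in aggregate.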

3. Multi-Timescale mLSTM Construction and Regularization

To operationalize this theory, a multi-timescale LSTM (termed here a "multi-frequency LSTM") is constructed as follows:

  • Layer 1: Fast-timescale layer (1150 units), forget-gate biases fixed for short $\tau$ (e.g., $\tau = 3$ and $\tau = 4$).
  • Layer 2: Inverse-gamma timescale layer (1150 units), each unit assigned a fixed $\tau_i \sim \mathrm{InvGamma}(\alpha = 0.56,\ \beta = 1)$.
  • Layer 3: Trainable timescale layer (400 units), forget-gate biases initialized randomly and optimized during training.
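
The three-layer bias assignment above can be sketched as follows. Layer sizes follow the list; splitting the fast layer evenly between $\tau = 3$ and $\tau = 4$ is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def bias_from_timescale(tau):
    """b_f such that tau = -1 / ln(sigmoid(b_f))."""
    f0 = np.exp(-1.0 / tau)            # required stationary forget value
    return np.log(f0 / (1.0 - f0))     # logit

# Layer 1: fast units, biases fixed for tau in {3, 4} (half each; an assumption).
tau_fast = np.repeat([3.0, 4.0], 1150 // 2)
b_layer1 = bias_from_timescale(tau_fast)

# Layer 2: fixed biases derived from tau_i ~ InvGamma(alpha=0.56, beta=1).
tau_ig = 1.0 / rng.gamma(shape=0.56, scale=1.0, size=1150)
b_layer2 = bias_from_timescale(tau_ig)

# Layer 3: trainable biases, random init (left to the optimizer).
b_layer3 = rng.normal(size=400)
```

The key point is that layers 1 and 2 freeze their forget-gate biases, so those units' memory spans are architectural choices rather than learned quantities.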

For units with timescales under inverse-gamma priors, regularization is imposed by adding a negative-log-prior penalty to the loss:

$$\mathcal{L}_{\mathrm{reg}} = \lambda \sum_{i=1}^{N} \left[(\alpha+1)\ln T_i + \beta/T_i\right] + \text{const.},$$

where $T_i = -1/\ln\sigma(b_{f,i})$. The regularizer hyperparameter $\lambda$ controls adherence to the timescale distribution (Mahto et al., 2020).
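
The penalty is straightforward to compute from the forget-gate biases. A minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def timescales(b_f):
    """T_i = -1 / ln(sigmoid(b_{f,i})) for a vector of forget-gate biases."""
    return -1.0 / np.log(1.0 / (1.0 + np.exp(-b_f)))

def inv_gamma_penalty(b_f, alpha=0.56, beta=1.0, lam=1.0):
    """Negative log inverse-gamma prior on unit timescales, up to a constant."""
    T = timescales(b_f)
    return lam * np.sum((alpha + 1.0) * np.log(T) + beta / T)
```

In a training loop this scalar would simply be added to the language-modeling loss; minimizing it pulls the empirical timescale distribution toward the inverse-gamma prior.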

4. Routing of Linguistic Information by Timescale

Empirical ablation reveals that the multi-timescale structure enables interpretable routing of word types:

  • Removal of large-$\tau$ units (long memory) primarily degrades rare (open-class) word prediction.
  • Removal of medium-$\tau$ units reduces performance on mid-frequency words.
  • Removal of small-$\tau$ units impacts common (closed-class) words.

This differential routing underscores functional specialization of memory channels for distinct frequency regimes, connecting unit-level memory dynamics directly to linguistic properties (Mahto et al., 2020).

5. Empirical Outcomes in Language Modeling

Quantitative evaluation demonstrates clear benefits of the multi-timescale/multi-frequency architecture:

Dataset             | Baseline LSTM PPL | mLSTM PPL | Δ PPL
Penn Treebank (PTB) | 61.40             | 59.69     | 1.71
WikiText-2 (WT2)    | 69.88             | 68.08     | 1.81

On rare-word subsets, perplexity reductions are even more substantial:

  • PTB: PPL drops from 2252.5 (vanilla LSTM) to 2100.9 (mLSTM).
  • WT2: PPL drops from 4631.1 to 4318.7.

On the Dyck-2 formal grammar task, correct-sequence accuracy increases (baseline: 91.66%; mLSTM: 93.82%), with the mLSTM outperforming baselines especially on long-range bracket dependencies (Mahto et al., 2020).
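
Dyck-2 is the language of well-nested strings over two bracket pairs, so ground-truth membership (used to score correct-sequence accuracy) reduces to stack-based bracket matching. A minimal checker, assuming the alphabet `()[]`:

```python
def is_dyck2(s: str) -> bool:
    """Check membership in Dyck-2: balanced, well-nested '()' and '[]'."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False  # mismatched or unopened closing bracket
        else:
            return False      # symbol outside the Dyck-2 alphabet
    return not stack          # every opened bracket must be closed
```

Long-range dependencies arise because a closing bracket may need to match an opener arbitrarily far back, which is exactly what large-$\tau$ units are positioned to track.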

6. Multiplicative mLSTM: Architecture and Benefits

A distinct lineage of "multiplicative LSTM" (mLSTM) focuses on augmenting LSTM cells with an input-dependent, second-order intermediate vector $m_t$ constructed via

$$m_t = (W_x x_t) \circ (W_h h_{t-1}),$$

where $\circ$ denotes element-wise multiplication. This $m_t$ is injected into all gate computations, yielding gate values such as

$$i_t = \sigma(W_i h_{t-1} + V_i m_t),$$

with analogous expressions for $f_t$ and $o_t$. The candidate update combines both linear and multiplicative terms. The key difference from a standard LSTM is that all gates share the same factorized $m_t$, reducing parameter count while retaining input-adaptive transitions (Maupomé et al., 2019).
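
One step of this cell can be sketched in a few lines. The candidate update is assumed to follow the same linear-plus-multiplicative pattern as the gates, and parameter names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x_t, h_prev, c_prev, p):
    """One multiplicative-LSTM step: all gates share the factorized m_t."""
    m_t = (p["Wx"] @ x_t) * (p["Wh"] @ h_prev)           # second-order term
    i_t = sigmoid(p["Wi"] @ h_prev + p["Vi"] @ m_t)      # input gate
    f_t = sigmoid(p["Wf"] @ h_prev + p["Vf"] @ m_t)      # forget gate
    o_t = sigmoid(p["Wo"] @ h_prev + p["Vo"] @ m_t)      # output gate
    c_tilde = np.tanh(p["Wc"] @ h_prev + p["Vc"] @ m_t)  # candidate (assumed form)
    c_t = f_t * c_prev + i_t * c_tilde                   # standard LSTM cell update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

d_in, d_h = 8, 16
p = {k: 0.1 * rng.standard_normal((d_h, d_in if k == "Wx" else d_h))
     for k in ["Wx", "Wh", "Wi", "Vi", "Wf", "Vf", "Wo", "Vo", "Wc", "Vc"]}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = mlstm_step(rng.standard_normal(d_in), h, c, p)
```

Note the parameter saving: a single $(W_x, W_h)$ pair produces $m_t$ for every gate, instead of each gate carrying its own input projection.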

This architecture provides compact models that, in language modeling benchmarks, achieve test bits-per-character (BPC) superior to standard LSTMs at the same or lower parameter budgets:

  • Penn Treebank (~292K params): LSTM, 1.38 BPC; mLSTM, 1.11 BPC.
  • Text8 (~133K params): LSTM, 1.43 BPC; mLSTM, 1.37 BPC.

Trade-offs include increased computational cost per time step, slightly reduced per-gate specialization, and more complex implementation. Nevertheless, the shared second-order term appears to regularize and enhance modeling capacity on character-level tasks (Maupomé et al., 2019).

7. Conclusion and Practical Takeaways

Multi-frequency LSTM architectures, whether via explicit multi-timescale distributional design or input-dependent multiplicative gating, enable recurrent models to more faithfully represent the diverse temporal dependencies of natural language. Using fixed and sampled forget-gate biases to engineer targeted memory timescales, combined with appropriate regularization, reliably improves perplexity—especially for long-range and low-frequency phenomena—and facilitates direct interpretability of how the network routes information. Multiplicative LSTM models further demonstrate that sharing an input-sensitive second-order signal across all gates provides parameter-efficient modeling capacity with empirically validated gains in language modeling (Mahto et al., 2020, Maupomé et al., 2019).

