
Multiplicative LSTM (mLSTM) Overview

Updated 3 March 2026
  • Multiplicative LSTM (mLSTM) is a recurrent neural network that integrates input-dependent multiplicative interactions to enhance state transition expressivity.
  • It replaces additive hidden-to-hidden transitions with a factorized, low-rank tensor approach, enabling efficient parameter scaling and richer input-state dynamics.
  • Empirical results show mLSTM models excel in language modeling tasks with superior memory, generalization, and the ability to scale with parallelizable matrix-memory variants.

Multiplicative Long Short-Term Memory (mLSTM) is a recurrent neural network (RNN) architecture that extends standard long short-term memory (LSTM) networks by introducing input-dependent, multiplicatively-modulated transition functions. mLSTM architectures have been shown to achieve improved expressivity and superior empirical performance on autoregressive density estimation and sequence modeling benchmarks, and they have been further evolved for greater parallelism and memory scaling in large-scale language modeling.

1. Core mLSTM Architecture and Mathematical Formulation

The classical mLSTM cell, introduced by Krause et al. (2016), augments the standard LSTM update by incorporating a per-timestep, elementwise multiplicative interaction between the incoming input and the previous hidden state. The mathematical formulation is as follows (Krause et al., 2016):

$$\begin{aligned}
m_t &= (W_{mx}\,x_t) \odot (W_{mh}\,h_{t-1}) \\
\hat{h}_t &= W_{hx}\,x_t + W_{hm}\,m_t \\
i_t &= \sigma(W_{ix}\,x_t + W_{im}\,m_t) \\
f_t &= \sigma(W_{fx}\,x_t + W_{fm}\,m_t) \\
o_t &= \sigma(W_{ox}\,x_t + W_{om}\,m_t) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(\hat{h}_t) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

Here $x_t$ is the current input, $h_{t-1}$ the previous hidden state, and $c_{t-1}$ the previous memory cell; $\odot$ denotes elementwise multiplication. The intermediate multiplicative state $m_t$ is shared across all gates.

This structure replaces the conventional additive hidden-to-hidden transitions in LSTM with a more expressive, input-dependent transformation. Importantly, this "low-rank tensor" factorization enables the recurrent transition matrix to adapt for each input token, while maintaining parameter efficiency.
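The per-timestep update above can be written compactly in a few lines of numpy. This is an illustrative sketch of the classic mLSTM cell, not a reference implementation; the weight names (`mx`, `mh`, `hx`, etc.) mirror the subscripts in the equations and are assumptions of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x_t, h_prev, c_prev, W):
    """One step of a classic mLSTM cell (after Krause et al., 2016)."""
    # Input-dependent multiplicative state, shared by all gates.
    m_t = (W["mx"] @ x_t) * (W["mh"] @ h_prev)
    h_hat = W["hx"] @ x_t + W["hm"] @ m_t          # candidate pre-activation
    i_t = sigmoid(W["ix"] @ x_t + W["im"] @ m_t)   # input gate
    f_t = sigmoid(W["fx"] @ x_t + W["fm"] @ m_t)   # forget gate
    o_t = sigmoid(W["ox"] @ x_t + W["om"] @ m_t)   # output gate
    c_t = f_t * c_prev + i_t * np.tanh(h_hat)      # memory cell update
    h_t = o_t * np.tanh(c_t)                       # hidden state
    return h_t, c_t

# Usage: random weights, input dim 8, hidden dim 16 (arbitrary toy sizes).
rng = np.random.default_rng(0)
D_in, D_h = 8, 16
W = {k: 0.1 * rng.standard_normal((D_h, D_in if k.endswith("x") else D_h))
     for k in ["mx", "mh", "hx", "hm", "ix", "im", "fx", "fm", "ox", "om"]}
h, c = np.zeros(D_h), np.zeros(D_h)
h, c = mlstm_step(rng.standard_normal(D_in), h, c, W)
print(h.shape)  # (16,)
```

Note that $m_t$ is computed once and reused in every gate, which is the parameter-sharing scheme discussed in Section 3.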

2. Distinctions from Standard LSTM and Tensor RNNs

In a standard LSTM, hidden-to-hidden transitions are governed by a single fixed matrix (e.g., $W_{hh}$), and the gates aggregate $x_t$ and $h_{t-1}$ linearly. In contrast, mLSTM introduces a second-order (multiplicative) term, $m_t$, which captures richer input–state interactions (Krause et al., 2016; Maupomé et al., 2019).

Directly learning a separate transition matrix per input symbol—as in full tensor RNNs—would result in an infeasible parameter explosion. mLSTM mitigates this by factorizing the transition:

$$W_{hh}^{(x_t)} = W_{hm}\,\mathrm{diag}(W_{mx}\,x_t)\,W_{mh}$$

This enables $O(D \cdot |V| + D^2)$ parameter scaling, as opposed to $O(D^2 \cdot |V|)$ for a full per-symbol tensor, broadening the space of possible hidden-state transitions without the cost of full parameterization.
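The savings are easy to quantify. The sketch below, assuming one-hot inputs over a vocabulary of size $|V|$ and hidden size $D$, compares the parameter count of a full per-symbol tensor RNN against the mLSTM factorization (ignoring gate and bias parameters):

```python
def tensor_rnn_params(D, V):
    # Full tensor RNN: one D x D transition matrix per vocabulary symbol.
    return D * D * V

def mlstm_factored_params(D, V):
    # Factorized transition: W_mx (D x V) plus W_mh and W_hm (each D x D).
    return D * V + 2 * D * D

D, V = 1024, 256  # hidden size, character-level vocabulary
print(tensor_rnn_params(D, V))      # 268435456
print(mlstm_factored_params(D, V))  # 2359296
```

At these (illustrative) sizes the factorization is over a hundred times smaller, which is what makes input-dependent transitions practical.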

3. Parameter Sharing and Expressivity

Notably, mLSTM uses a shared intermediate vector $m_t$ across all gates and the candidate computation. This parameter-sharing scheme halves the number of rank-one factors relative to a naive tensor approach and has been empirically shown to preserve model expressivity while reducing overfitting, especially in data-constrained regimes (Maupomé et al., 2019). Sharing $m_t$ lets every gate exploit a single, input-conditioned second-order encoding of $(x_t, h_{t-1})$. Experimental comparisons indicate that this does not degrade performance and can improve generalization.

4. Modern mLSTM Variants: Parallelizable Matrix Memory

Recent work further generalizes the mLSTM concept by replacing the vector-valued LSTM cell state with a matrix-valued memory, employing exponential gating, and enabling full parallelism. In the xLSTM framework, the mLSTM component stores a key–value covariance matrix $\mathbf{C}_t \in \mathbb{R}^{d \times d}$ and updates it as follows (Beck et al., 2024):

$$\begin{aligned}
\mathbf{C}_t &= r_{f,t}\,\mathbf{C}_{t-1} + r_{i,t}\,(\mathbf{v}_t\,\mathbf{k}_t^\top) \\
\mathbf{n}_t &= r_{f,t}\,\mathbf{n}_{t-1} + r_{i,t}\,\mathbf{k}_t \\
\tilde{\mathbf{h}}_t &= \frac{\mathbf{C}_t\,\mathbf{q}_t}{\max\{|\mathbf{n}_t^{\top}\mathbf{q}_t|,\,1\}} \\
\mathbf{h}_t &= r_{o,t} \odot \tilde{\mathbf{h}}_t
\end{aligned}$$

Here, $\mathbf{k}_t, \mathbf{v}_t, \mathbf{q}_t$ are learned projections of the input; $r_{i,t}$ and $r_{f,t}$ are scalar input and forget gates, and $r_{o,t}$ is the output gate. The input and (optionally) the forget gate are realized as stabilized exponentials.

This design is fully parallelizable: the update for $\mathbf{C}_t$ does not depend on previous outputs $\mathbf{h}_t$, enabling efficient batched implementations similar in spirit to attention kernels such as FlashAttention.
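The recurrence above can be sketched directly in numpy. This is a minimal single-step illustration of the matrix-memory update; the stabilized exponential gating of Beck et al. (2024) is omitted here, and the gate values are passed in as plain scalars:

```python
import numpy as np

def matrix_mlstm_step(C_prev, n_prev, k, v, q, f_gate, i_gate, o_gate):
    """One recurrence step of a matrix-memory mLSTM (xLSTM-style sketch)."""
    C = f_gate * C_prev + i_gate * np.outer(v, k)   # key-value covariance update
    n = f_gate * n_prev + i_gate * k                # normalizer state
    h_tilde = (C @ q) / max(abs(n @ q), 1.0)        # normalized retrieval
    h = o_gate * h_tilde                            # output gating
    return C, n, h

# Usage: store one key-value pair, then retrieve it with the matching query.
d = 4
k = np.array([1.0, 0.0, 0.0, 0.0])   # unit-norm key
v = np.array([0.0, 2.0, 0.0, 0.0])   # value to store
C, n = np.zeros((d, d)), np.zeros(d)
C, n, h = matrix_mlstm_step(C, n, k, v, q=k, f_gate=1.0, i_gate=1.0, o_gate=1.0)
print(h)  # [0. 2. 0. 0.] -- the stored value is retrieved
```

Because the state update touches only $\mathbf{C}$ and $\mathbf{n}$, all timesteps' contributions can in principle be computed in parallel and combined, which is what the batched xLSTM kernels exploit.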

5. Training Methodologies and Regularization Strategies

Published mLSTM implementations employ several modern training techniques to maximize performance and stability (Krause et al., 2016):

  • Optimizer: Adam with learning rate scheduling.
  • Initialization: Scaled orthogonal for recurrent weights, Glorot initialization elsewhere; forget gate biases set to positive values (e.g., +3) for stability.
  • Truncated backpropagation through time (BPTT) with sequence lengths of 200–250.
  • Weight normalization applied to recurrent matrices.
  • Input embedding layers preceding mLSTM core.
  • Variational dropout, sharing dropout masks across full sequences, applied to both input embeddings and hidden states, with dropout probability scaled by model size.

These choices are crucial for preventing overfitting and achieving state-of-the-art bits-per-character performance.
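Of the techniques above, variational dropout is the least standard, so a short sketch may help: one Bernoulli mask is sampled per sequence and reused at every timestep, rather than resampled per step. The function name and shapes are illustrative assumptions, not from a published implementation:

```python
import numpy as np

def variational_dropout_mask(batch, dim, p, rng):
    """Sample one dropout mask per sequence (not per timestep).

    The same mask is applied at every step, as in variational dropout;
    inverted scaling by 1/(1-p) keeps activations unbiased at train time.
    """
    keep = 1.0 - p
    return rng.binomial(1, keep, size=(batch, dim)) / keep

rng = np.random.default_rng(0)
mask = variational_dropout_mask(batch=2, dim=8, p=0.5, rng=rng)
seq = rng.standard_normal((10, 2, 8))  # (time, batch, dim)
dropped = seq * mask                   # broadcast: identical mask each step
print(dropped.shape)  # (10, 2, 8)
```

Sharing the mask across time preserves the recurrent dynamics within a sequence while still regularizing, which is why it combines well with truncated BPTT.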

6. Empirical Benchmarks and Comparative Performance

mLSTM models demonstrate superior performance on several standard character-level language modeling tasks. The following table summarizes selected results from (Krause et al., 2016, Maupomé et al., 2019):

Dataset         Model                  Params   Test BPC / PPL
Text8           mLSTM (reg., large)    46M      1.27 BPC
Text8           LSTM (deep)                     1.36–1.43 BPC
Hutter Prize    mLSTM (reg., large)    46M      1.24 BPC
Hutter Prize    Stacked LSTM                    1.53 BPC
WikiText-2      mLSTM (reg., large)    46M      1.26 BPC / 88.8 PPL
Penn Treebank   mLSTM                  292K     1.11 BPC

More recent matrix-memory mLSTM (in xLSTM) achieves state-of-the-art performance in both synthetic tasks (e.g., large-scale associative recall up to 256 key-value pairs) and large-scale language modeling (e.g., validation perplexity 13.43 at 409M parameters, outperforming Llama and GPT-3 at comparable model sizes) (Beck et al., 2024). xLSTM with all mLSTM blocks maintains superior scaling-law behavior and robust context-length extrapolation, outperforming Transformers in memory-intensive tasks.

7. Architectural Implementation Considerations

Key points for effective mLSTM implementation include:

  • For classic mLSTM, set the hidden and multiplicative state dimensions equal; share $m_t$ across gates to limit parameter growth.
  • Embedding folding avoids redundant layers at inference by absorbing linear embeddings into surrounding weight matrices.
  • Orthogonal initialization and positive forget biases are essential for convergence and stable long-term memory.
  • Variational dropout and weight normalization are required to reach best-in-class generalization.
  • Matrix-memory mLSTM maintains a $d \times d$ state per head, requiring $\mathcal{O}(d^2)$ memory and per-step compute, but gains parallelism analogous to modern attention mechanisms. Stabilization techniques (for the exponential gates) and layer normalization are required for numerical robustness.
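The stabilization point deserves a concrete sketch. Exponential gates overflow for large pre-activations, so implementations keep a running maximum in log space and subtract it before exponentiating. The exact form below is a simplified illustration of the max-state idea described by Beck et al. (2024), not their exact kernel:

```python
import numpy as np

def stabilized_exp_gates(i_pre, f_pre, m_prev):
    """Stabilize exponential gates with a running max state m.

    Subtracting the running maximum before exponentiating prevents
    overflow while leaving the ratio of input to forget contributions
    unchanged.
    """
    m = max(f_pre + m_prev, i_pre)       # new stabilizer state
    i_gate = np.exp(i_pre - m)           # stabilized input gate, <= 1
    f_gate = np.exp(f_pre + m_prev - m)  # stabilized forget gate, <= 1
    return i_gate, f_gate, m

# Even for a huge pre-activation, the gates stay finite.
i_gate, f_gate, m = stabilized_exp_gates(i_pre=500.0, f_pre=2.0, m_prev=0.0)
print(np.isfinite(i_gate) and np.isfinite(f_gate))  # True
```

Without the shared subtraction, `exp(500.0)` would overflow to infinity; with it, both gates remain representable floats.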

8. Relevance and Impact in Contemporary Sequence Modeling

mLSTM’s expressivity derives from input-conditioned transitions, mitigating the high correlation of hidden states seen in standard RNNs. The introduction of multiplicative terms enables rapid adaptation and recovery from sequence errors, particularly beneficial in character-level language modeling (Krause et al., 2016, Maupomé et al., 2019). In modern matrix-memory variants, mLSTM achieves parallelization previously unattainable with classical LSTM, closing performance gaps with state-of-the-art Transformer and state space models, particularly in memory, retrieval, and reasoning tasks (Beck et al., 2024). This suggests mLSTM is a viable foundation for large-scale sequence models beyond the scope of standard additive-gated architectures.
