xLSTM-mLSTM: Advanced Recurrent Memory Models
- xLSTM and mLSTM are advanced recurrent architectures that combine exponential gating with matrix-memory cells to enable robust long-context sequence modeling.
- They utilize Tiled Flash Linear Attention and mixture-of-experts routing to achieve superior computational efficiency and parallelism compared to traditional models.
- Empirical results show these models significantly improve perplexity and inference speed, making them effective for tasks like language modeling and sentiment analysis.
xLSTM-mLSTM refers to a lineage of recurrent neural architectures—Extended Long Short-Term Memory (xLSTM) and its matrix-memory variant (mLSTM)—that are designed to combine high-capacity associative memory and efficient, fully parallelizable sequence modeling. These models leverage exponential gating and advanced memory update rules to achieve performance and efficiency on par with, or surpassing, contemporary Transformer and State-Space Models. Recent innovations such as Tiled Flash Linear Attention (TFLA) and mixture-of-experts frameworks further enhance their computational properties and application reach, enabling efficient long-context processing and specialized token routing.
1. Architectural Foundations and Mathematical Formulation
xLSTM generalizes the classic LSTM by introducing exponential gating and novel memory structures. The core innovations are:
- Exponential gating. Gates may use $\exp(\cdot)$ rather than $\sigma(\cdot)$, enabling channels to multiplicatively accumulate or reset information across long sequences. Stabilization with a max-gate ($m_t = \max(\log f_t + m_{t-1},\, \log i_t)$) prevents numerical overflow.
- Matrix-memory cell (mLSTM). The cell state $C_t \in \mathbb{R}^{d \times d}$ encodes key–value outer products,
$$C_t = f_t\, C_{t-1} + i_t\, v_t k_t^{\top},$$
with an accompanying normalizer $n_t = f_t\, n_{t-1} + i_t\, k_t$ and stabilizer $m_t$ for controlled memory updates.
- Readout. The recurrent output is normalized,
$$\tilde{h}_t = \frac{C_t\, q_t}{\max\{\,|n_t^{\top} q_t|,\ 1\,\}}, \qquad h_t = o_t \odot \tilde{h}_t,$$
where $h_t$ is subsequently passed through RMSNorm or LayerNorm.
A "sigmoid-gate" mLSTM variant omits for reduced computation:
These update rules enable xLSTM/mLSTM cells to revise distant memory rapidly and robustly—a significant enhancement over prior additive memory architectures (Beck et al., 7 May 2024).
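The following is a minimal NumPy sketch of a single stabilized mLSTM step under the formulation above. It is illustrative only: gate pre-activations are passed in as scalars, per-head structure is omitted, and the function and variable names are chosen here for clarity rather than taken from any reference implementation.

```python
import numpy as np

def mlstm_step(C, n, m, q, k, v, i_pre, f_pre, o_pre):
    """One stabilized mLSTM step: matrix memory C, normalizer n, stabilizer m.

    q, k, v: query/key/value vectors of dimension d
    i_pre, f_pre, o_pre: scalar gate pre-activations (per-head in practice)
    """
    # Stabilizer keeps the exponential gates in a numerically safe range.
    m_new = max(f_pre + m, i_pre)
    i_gate = np.exp(i_pre - m_new)          # stabilized exponential input gate
    f_gate = np.exp(f_pre + m - m_new)      # stabilized exponential forget gate

    # Matrix-memory update: key-value outer-product accumulation.
    C_new = f_gate * C + i_gate * np.outer(v, k)
    n_new = f_gate * n + i_gate * k         # normalizer state

    # Normalized readout, gated by a sigmoid output gate.
    h_tilde = C_new @ q / max(abs(n_new @ q), 1.0)
    h = 1.0 / (1.0 + np.exp(-o_pre)) * h_tilde
    return C_new, n_new, m_new, h

# Usage: start from zero states with toy dimension d = 4.
d = 4
rng = np.random.default_rng(0)
C, n, m = np.zeros((d, d)), np.zeros(d), 0.0
C, n, m, h = mlstm_step(C, n, m, rng.normal(size=d), rng.normal(size=d),
                        rng.normal(size=d), 0.5, 1.0, 0.0)
```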
2. Computational Efficiency and Kernel Innovations
Traditional parallelization of recurrent networks is handicapped by sequential hidden-to-hidden recurrence. mLSTM eliminates this by removing feedback, permitting fully parallel computation via matrix operations (Beck et al., 7 May 2024). Further, the Tiled Flash Linear Attention (TFLA) kernel establishes new state-of-the-art runtimes for linear RNNs:
- Chunkwise and intra-chunk tiling. TFLA splits the sequence into chunks of length $L$, then performs tiled matrix multiplications inside each chunk, maximizing on-chip SRAM reuse and arithmetic intensity (see the sketch after this list).
- Memory I/O optimization. Only the inter-chunk states ($C_k$, $n_k$, $m_k$) are materialized in HBM, sharply reducing bandwidth demands.
- Complexity interpolation. TFLA's FLOP count and memory I/O are governed by the chunk size $L$, interpolating between the sequential ($L = 1$) and fully parallel ($L = T$) regimes.
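The chunkwise principle can be sketched in a few lines of NumPy: inter-chunk information flows only through the recurrent state carried across chunk boundaries, while intra-chunk interactions are computed as dense, causally masked matrix products. This simplified sketch omits gating, stabilization, and the second tiling level that TFLA adds inside each chunk; it is not the TFLA kernel itself.

```python
import numpy as np

def chunkwise_linear_attention(Q, K, V, chunk_len):
    """Chunkwise-parallel causal linear attention, simplified: no gates, no tiling.

    Q, K, V: arrays of shape (T, d). Returns outputs of shape (T, d).
    """
    T, d = Q.shape
    C = np.zeros((d, d))            # recurrent matrix state carried across chunks
    out = np.zeros_like(V)
    for s in range(0, T, chunk_len):
        q, k, v = Q[s:s+chunk_len], K[s:s+chunk_len], V[s:s+chunk_len]
        L = q.shape[0]
        # Inter-chunk contribution: each query reads the state from previous chunks.
        inter = q @ C
        # Intra-chunk contribution: causally masked dense matmul inside the chunk.
        scores = np.tril(q @ k.T)   # lower-triangular causal mask
        intra = scores @ v
        out[s:s+L] = inter + intra
        # Update the boundary state with this chunk's key-value outer products.
        C = C + k.T @ v
    return out

# Usage on a toy sequence of length 16 with chunk length 4.
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(16, 4)) for _ in range(3))
y = chunkwise_linear_attention(Q, K, V, chunk_len=4)
```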
Benchmarks on NVIDIA H100 GPUs demonstrate that TFLA-mLSTM significantly outpaces FlashAttention, Mamba, and simple FLA kernels for long-context processing, with the TFLA-mLSTMsig forward pass running faster than both Mamba-2 and TFLA-mLSTMexp (Beck et al., 18 Mar 2025).
| Seq Len | FlashAttn 3 | Mamba 2 | FLA (Simple) | xl_chunk mLSTMexp | xl_chunk mLSTMsig |
|---|---|---|---|---|---|
| 1 024 | 2.8 ms | 3.5 ms | 2.7 ms | 2.0 ms | 1.3 ms |
| 8 192 | 16.5 ms | 21.2 ms | 14.8 ms | 10.2 ms | 7.1 ms |
| 32 768 | 52.6 ms | 81.5 ms | 36.4 ms | 31.9 ms | 20.4 ms |
3. Theoretical Properties: Memory, Gating, and Expressivity
- Second-order memory mixing. Unlike the standard LSTM, mLSTM (and its fully factorized xLSTM version) combines inputs and hidden states via Hadamard products in the gate computation; in the fully factorized xLSTM, each gate has its own intermediate state (sketched below) (Maupomé et al., 2019).
- Parameter sharing. mLSTM shares a single intermediate state across all gates, requiring fewer parameters; xLSTM factorizes each gate independently for a marginal accuracy gain. On Penn Treebank and Text8, the fully factorized xLSTM yields a 0.02 BPC improvement over mLSTM at comparable parameter count (Maupomé et al., 2019).
- Exponential gating and normalizer. The use of exponential gates and a cumulative normalizer $n_t$ ensures numerical stability and bounded hidden-state updates over extremely long contexts (Beck et al., 7 May 2024).
A plausible implication is that this architectural design yields robust associative recall and mitigates the vanishing-gradient problem more effectively than additive or sigmoid-only gating.
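A compact sketch of the shared-versus-factorized distinction above, under the assumption that the "intermediate state" is a Hadamard product of projected input and previous hidden state (one shared state for mLSTM, one per gate for the fully factorized variant); the weight names are illustrative, not taken from the cited paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shared_intermediate_gates(x, h_prev, Wmx, Wmh, Wg):
    """mLSTM-style: one second-order intermediate state shared by all gates."""
    m = (Wmx @ x) * (Wmh @ h_prev)          # single Hadamard-product mixing term
    return [sigmoid(W @ m) for W in Wg]     # every gate reads the same m

def factorized_gates(x, h_prev, Wmx_list, Wmh_list, Wg):
    """Fully factorized xLSTM-style: each gate owns its own intermediate state."""
    gates = []
    for Wmx, Wmh, W in zip(Wmx_list, Wmh_list, Wg):
        m_g = (Wmx @ x) * (Wmh @ h_prev)    # per-gate Hadamard mixing term
        gates.append(sigmoid(W @ m_g))
    return gates

# Usage with toy dimensions: 3 gates, hidden size 4, input size 5.
rng = np.random.default_rng(2)
d_h, d_x, n_gates = 4, 5, 3
x, h_prev = rng.normal(size=d_x), rng.normal(size=d_h)
Wg = [rng.normal(size=(d_h, d_h)) for _ in range(n_gates)]
shared = shared_intermediate_gates(x, h_prev, rng.normal(size=(d_h, d_x)),
                                   rng.normal(size=(d_h, d_h)), Wg)
fact = factorized_gates(x, h_prev,
                        [rng.normal(size=(d_h, d_x)) for _ in range(n_gates)],
                        [rng.normal(size=(d_h, d_h)) for _ in range(n_gates)], Wg)
```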
4. Mixture-of-Experts and Adaptive Routing: MoxE
MoxE (Thiombiano et al., 1 May 2025) extends xLSTM/mLSTM block utility by routing tokens to expert modules based on estimated token “difficulty”:
- Architecture. A pool of experts (half mLSTM, half sLSTM) is dynamically activated per token by an entropy-aware router. The router increases the activation probability of mLSTM experts as token entropy (difficulty) grows, ensuring that rare or complex tokens preferentially utilize high-capacity mLSTM blocks (as sketched after this list).
- Auxiliary losses. Entropy alignment, group-wise balance, Z-loss (logit control), and load-balancing penalties promote stable and efficient training.
- Efficiency. MoxE activates only a small top-$k$ subset of experts per token, yielding an effective per-token cost well below that of densely activated counterparts.
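A hedged sketch of entropy-aware top-$k$ routing in the spirit of MoxE follows; the bias function, its strength `alpha`, and the assumption that the first half of the expert pool consists of mLSTM experts are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy_aware_route(router_logits, token_entropy, n_experts, top_k=2, alpha=1.0):
    """Pick top-k experts; bias the mLSTM half of the pool by token entropy.

    router_logits: (n_experts,) raw router scores for one token
    token_entropy: scalar difficulty estimate (e.g., entropy of the model's
                   predictive distribution for this token)
    alpha:         strength of the entropy bias (illustrative assumption)
    """
    biased = router_logits.copy()
    mlstm_ids = np.arange(n_experts // 2)        # assume first half are mLSTM experts
    biased[mlstm_ids] += alpha * token_entropy   # harder tokens favor mLSTM experts
    probs = softmax(biased)
    chosen = np.argsort(probs)[-top_k:]          # indices of the top-k experts
    weights = probs[chosen] / probs[chosen].sum()  # renormalized mixture weights
    return chosen, weights

# Usage: 8 experts (4 mLSTM + 4 sLSTM), a fairly "difficult" token.
rng = np.random.default_rng(3)
experts, w = entropy_aware_route(rng.normal(size=8), token_entropy=3.2, n_experts=8)
```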
Empirically, MoxE achieves superior perplexity (LAMBADA PPL reduced to approximately $65$k from a higher dense-baseline value), robust expert-group utilization, and an efficiency gain in inference FLOPs over comparable dense xLSTM models (Thiombiano et al., 1 May 2025).
5. Applications in Language Modeling and Sequence Tasks
xLSTM-mLSTM models have been extensively benchmarked on autoregressive language modeling tasks, long-context extrapolation, and fine-grained aspect-based sentiment analysis:
- Scaling properties. On 15B tokens (SlimPajama), pure mLSTM models achieve lower validation perplexity than Llama, Mamba, and RWKV-4 across model sizes (125M–2.7B parameters) (Beck et al., 7 May 2024).
- Long-context extrapolation. At context lengths of 16K tokens, xLSTM yields a perplexity of about 9, versus substantially higher values for Llama and about 14 for Mamba/RWKV (Beck et al., 7 May 2024).
- Sentiment analysis. In MEGA (Lawan et al., 1 Jul 2025), a bidirectional mLSTM architecture combined with Multihead Exponential Gated Fusion (MECGAF) delivers state-of-the-art results on ABSA tasks. MECGAF efficiently integrates a forward/global mLSTM stream and a partially flipped backward/local mLSTM stream via cross-head exponential gating, enabling nuanced modeling of both long-range and aspect-focused dependencies (an illustrative gating sketch follows the table below). Inference is $2$–$3\times$ faster than comparable Transformers, with compact parameter budgets and accuracy gains.
| Model | Restaurant Acc | Laptop Acc | Twitter Acc | Inference Speed |
|---|---|---|---|---|
| MEGA (BERT) | 87.72% | 81.87% | 78.54% | 2-3x faster |
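The fusion step can be pictured with a small sketch: two per-head streams (forward/global and backward/local) are combined through stabilized exponential gates. This is an illustrative reading of "exponential gated fusion" under assumed gate parameterization, not MEGA's published equations.

```python
import numpy as np

def exp_gated_fusion(h_fwd, h_bwd, w_f, w_b):
    """Fuse forward and backward mLSTM streams with per-head exponential gates.

    h_fwd, h_bwd: (n_heads, d_head) hidden states from the two directions
    w_f, w_b:     (n_heads, d_head) learned gate projections (illustrative)
    Returns a normalized, exponentially gated combination per head.
    """
    # Scalar gate pre-activation per head from each stream.
    g_f = np.sum(w_f * h_fwd, axis=-1, keepdims=True)
    g_b = np.sum(w_b * h_bwd, axis=-1, keepdims=True)
    # Stabilized exponential gates, normalized so they sum to one per head.
    m = np.maximum(g_f, g_b)
    e_f, e_b = np.exp(g_f - m), np.exp(g_b - m)
    return (e_f * h_fwd + e_b * h_bwd) / (e_f + e_b)

# Usage with 4 heads of width 8.
rng = np.random.default_rng(4)
fused = exp_gated_fusion(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)),
                         rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
```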
6. Comparative Analysis: Memory, Parallelism, and Practical Trade-offs
The following table summarizes computational and memory characteristics of key kernel and model variants (Beck et al., 18 Mar 2025):
| Kernel/Model | FLOPs Complexity | Memory I/O | Parallelism |
|---|---|---|---|
| FlashAttention | Quadratic in $T$ | Linear in $T$ (no attention matrix materialized) | Full, expensive |
| Mamba | Linear in $T$ | Constant-size recurrent state | Serial scan |
| FLA (chunkwise) | Linear in $T$ for fixed chunk size | Chunk-boundary states | Chunkwise |
| TFLA | Tunable (chunk size $L$) | Inter-chunk states only in HBM | 2-level, tiled |
| mLSTM (seq) | Linear in $T$ | Constant per step | Full, parallel |
- TFLA-mLSTM achieves true linear scaling in sequence length, maximized tensor-core utilization, and low memory traffic, tunable via the chunk-size parameter $L$ to optimally match the device roofline (a small cost-model sketch follows this list).
- Mamba and RWKV-4 offer higher efficiency in serial scan but are limited in parallelism.
- Transformer provides full parallelism but quadratic complexity in context length.
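A rough cost model makes the chunk-size trade-off concrete: inter-chunk work scales with the number of chunks times the state size, while intra-chunk work grows quadratically in the chunk length. The function and constants below are an illustrative shape-level model, not a kernel benchmark.

```python
def chunkwise_flops(T, d, L):
    """Illustrative FLOP count for chunkwise linear attention with chunk size L.

    Inter-chunk: each of the T/L chunks updates and reads a d x d state.
    Intra-chunk: each chunk performs masked (L x d) x (d x L) style matmuls.
    Constant factors are omitted.
    """
    n_chunks = T // L
    inter = n_chunks * (L * d * d)   # state updates/reads per chunk
    intra = n_chunks * (L * L * d)   # in-chunk attention-style matmuls
    return inter + intra

# L = 1 recovers recurrent-style cost ~ T*d^2; L = T recovers parallel cost ~ T^2*d.
T, d = 8192, 512
for L in (1, 64, 256, 1024, T):
    print(f"L={L:5d}  FLOPs~{chunkwise_flops(T, d, L):.3e}")
```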
The material supports the conclusion that xLSTM-mLSTM, with advanced kernel design and adaptive mixture routing, constitutes a family of recurrent models with state-of-the-art long-context efficiency, competitive scaling, and flexible trade-offs between memory capacity and computation (Beck et al., 18 Mar 2025, Beck et al., 7 May 2024, Thiombiano et al., 1 May 2025, Lawan et al., 1 Jul 2025, Maupomé et al., 2019).
7. Future Directions and Open Questions
- Kernel optimization. Continued refinement of TFLA and related kernels may further reduce hardware bottlenecks and improve scaling for ultra-long contexts.
- Mixture-of-Experts expansion. Leveraging more diverse expert types and richer routing mechanisms could expand the adaptability and generalization of xLSTM-based architectures.
- Broader domains. Empirical success in language modeling, long-context reasoning, and ABSA suggests applicability to other sequential domains, such as time series forecasting and biosequence analysis.
- Theoretical analysis. Further investigation into the expressive power and convergence guarantees of matrix-memory recurrent structures with exponential gating remains an active research direction.
A plausible implication is that xLSTM-mLSTM and their kernel and routing variants will increasingly serve as practical alternatives or complements to attention-based mechanisms, especially where sequence length, compute efficiency, and memory capacity are critical.