xLSTM 7B Architecture Overview

Updated 4 June 2026
  • xLSTM 7B architecture is a large-scale recurrent neural network designed for language modeling, extending classic LSTM principles with modern stabilization and parallelization techniques.
  • It employs 32 blocks of multi-head mLSTM cells combined with SwiGLU-activated MLPs to achieve efficient inference and constant per-token memory usage.
  • Empirical results demonstrate that xLSTM 7B outperforms Transformer models in inference speed and memory usage while exhibiting favorable scaling laws and training stability.

The xLSTM 7B architecture is a large-scale recurrent neural network pretrained for language modeling and other sequence modeling tasks. xLSTM generalizes and extends classic LSTM principles using modern stabilization, normalization, and parallelization techniques. At the billion-parameter scale, it achieves competitive or superior efficiency and performance relative to Transformer models, especially in inference speed and memory usage on long contexts. xLSTM-7B combines multi-head matrix-memory LSTM (mLSTM) cells, advanced gating, and fused GPU kernels; evaluations on extensive language modeling benchmarks report state-of-the-art inference efficiency and strong scaling behavior (Beck et al., 17 Mar 2025, Beck et al., 2 Oct 2025, Beck et al., 2024).

1. Architectural Overview and Parameterization

xLSTM-7B is constructed as a stack of 32 blocks, each consisting of an mLSTM sub-layer and a SwiGLU-activated (Swish-Gated Linear Unit) MLP. Embedding and hidden dimensions, gate subspaces, and per-head allocations are selected to maximize computational throughput while scaling to billions of parameters.

Parameter Summary Table:

Component | Quantity/Dimension | Parameter Count (Approx.)
Vocabulary | $V=50,\!257$ |
Model dimension | $d=4096$ |
Number of xLSTM blocks | $N=32$ |
Heads per block | $H=8$ |
Hidden per head | $d_{hv}=512$ |
Key/query per head | $d_{qk}=256$ |
Feed-forward width | $d_\mathrm{ff}=10944$ |
Embedding-In/Out | $V \times d$ each | $\approx 206$M each
Block modules (32×) | see below | $\approx 6.44$B
Total Parameter Count | | $\approx 6.865 \times 10^9$

Each mLSTM block consists of parallel input-gated and forget-gated matrix memory channels. The feed-forward sub-module in each block is a SwiGLU MLP with an expansion factor of approximately $2.67$ (i.e., $d_\mathrm{ff}=10944$ for $d=4096$). Specialized fused GPU kernels are applied for inference efficiency (Beck et al., 17 Mar 2025).
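
As a rough sanity check on the table, the arithmetic below recomputes the embedding size and the feed-forward expansion factor from the listed dimensions. The MLP count assumes a bias-free SwiGLU with three weight matrices per block, which is an assumption of this sketch rather than something the table states.

```python
# Dimensions taken from the parameter summary table.
V, d, d_ff, n_blocks = 50_257, 4096, 10_944, 32

embedding_params = V * d        # one embedding matrix (input or output)
expansion_factor = d_ff / d     # feed-forward width relative to model width

# Assumption: bias-free SwiGLU MLP with three weight matrices
# (gate and up projections d -> d_ff, down projection d_ff -> d).
mlp_params_per_block = 3 * d * d_ff
mlp_params_total = n_blocks * mlp_params_per_block

print(f"embedding params (each): {embedding_params:,}")
print(f"expansion factor:        {expansion_factor:.4f}")
print(f"SwiGLU MLP params/block: {mlp_params_per_block:,}")
print(f"SwiGLU MLP params total: {mlp_params_total:,}")
```

Under these assumptions the MLPs account for about 4.3B of the roughly 6.44B block parameters, with the remainder belonging to the mLSTM sub-layers and normalization weights.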

2. Internal Block Algorithms and Cell Equations

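Within each block, the model processes the sequence with multi-head mLSTM cells followed by the MLP nonlinearity. The update for each head at step $t$ is as follows (from Beck et al., 17 Mar 2025):

Given previous state $(C_{t-1}, n_{t-1}, m_{t-1})$ and input $x_t$, with query and key vectors $q_t, k_t \in \mathbb{R}^{d_{qk}}$, value vector $v_t \in \mathbb{R}^{d_{hv}}$, gate pre-activations $\tilde{i}_t, \tilde{f}_t$, and output gate $o_t$ all computed from $x_t$, the steps are

$$
\begin{aligned}
m_t &= \max\left(\log\sigma(\tilde{f}_t) + m_{t-1},\ \tilde{i}_t\right) \\
i_t &= \exp(\tilde{i}_t - m_t), \qquad f_t = \exp\left(\log\sigma(\tilde{f}_t) + m_{t-1} - m_t\right) \\
C_t &= f_t\, C_{t-1} + i_t\, k_t v_t^\top \\
n_t &= f_t\, n_{t-1} + i_t\, k_t \\
\tilde{h}_t &= \frac{C_t^\top \left(q_t / \sqrt{d_{qk}}\right)}{\max\left(\left|n_t^\top \left(q_t / \sqrt{d_{qk}}\right)\right|,\ \exp(-m_t)\right)} \\
h_t &= o_t \odot \mathrm{Norm}(\tilde{h}_t)
\end{aligned}
$$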

where Norm is LayerNorm or RMSNorm depending on position.
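
The per-head update can be sketched in NumPy as a single recurrent step. This is a minimal sketch, assuming the exponentially gated, max-stabilized mLSTM recurrence described in the xLSTM papers. The function name `mlstm_step`, its argument names, and the scalar gate pre-activations `i_pre` and `f_pre` are illustrative, and the output gate and Norm are omitted for brevity.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def mlstm_step(C, n, m, q, k, v, i_pre, f_pre):
    """One recurrent step for a single mLSTM head (illustrative sketch).

    C: (d_qk, d_hv) matrix memory, n: (d_qk,) normalizer, m: scalar stabilizer.
    q, k: (d_qk,), v: (d_hv,), i_pre / f_pre: scalar gate pre-activations.
    """
    d_qk = q.shape[0]
    m_new = max(log_sigmoid(f_pre) + m, i_pre)         # running log-scale stabilizer
    i_gate = np.exp(i_pre - m_new)                     # stabilized exponential input gate
    f_gate = np.exp(log_sigmoid(f_pre) + m - m_new)    # stabilized forget gate
    C_new = f_gate * C + i_gate * np.outer(k, v)       # rank-1 update of the matrix memory
    n_new = f_gate * n + i_gate * k                    # normalizer tracks the accumulated keys
    q_scaled = q / np.sqrt(d_qk)
    denom = max(abs(n_new @ q_scaled), np.exp(-m_new))
    h_tilde = (C_new.T @ q_scaled) / denom             # un-gated, un-normalized head output
    return C_new, n_new, m_new, h_tilde

# Toy run with the per-head sizes from the parameter table.
rng = np.random.default_rng(0)
d_qk, d_hv = 256, 512
C, n, m = np.zeros((d_qk, d_hv)), np.zeros(d_qk), 0.0
for _ in range(4):
    q, k = rng.standard_normal(d_qk), rng.standard_normal(d_qk)
    v = rng.standard_normal(d_hv)
    C, n, m, h = mlstm_step(C, n, m, q, k, v, i_pre=rng.normal(), f_pre=rng.normal())
print(C.shape, n.shape, h.shape)
```

The state carried between steps is only `(C, n, m)`, so its size does not change as more tokens are processed.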

The MLP sub-layer applies SwiGLU activation: $\mathrm{MLP}(x) = W_\mathrm{down}\left(\mathrm{Swish}(W_\mathrm{gate}\, x) \odot W_\mathrm{up}\, x\right)$
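
A SwiGLU feed-forward layer can be sketched as follows. This assumes the standard bias-free formulation with gate, up, and down projections; the weight names and the toy sizes are illustrative and are not taken from the model code.

```python
import numpy as np

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, W_gate, W_up, W_down):
    """x: (d,), W_gate and W_up: (d_ff, d), W_down: (d, d_ff)."""
    return W_down @ (swish(W_gate @ x) * (W_up @ x))

rng = np.random.default_rng(0)
d, d_ff = 8, 21  # toy sizes; the table lists d=4096 and d_ff=10944
W_gate = rng.standard_normal((d_ff, d))
W_up = rng.standard_normal((d_ff, d))
W_down = rng.standard_normal((d, d_ff))

y = swiglu_mlp(rng.standard_normal(d), W_gate, W_up, W_down)
print(y.shape)
```

The gate branch passes through Swish and multiplies the linear up branch elementwise, which is why the layer carries three weight matrices rather than two.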

xLSTM blocks are pre-normed and use RMSNorm at block input, with optional head-wise LayerNorm inside blocks.

3. Computational Complexity and Inference Scaling

The xLSTM-7B architecture is optimized for both training and inference efficiency:

  • Training (parallel/chunkwise across sequence): Each layer performs $d=4096$8 operations for sequence length $d=4096$9 and model dimension $N=32$0, aggregating to $N=32$1. Activations and parameter memory cost scale as $N=32$2.
  • Inference (recurrent, token-by-token): At each decoding step, computation per token is $N=32$3—strictly constant in $N=32$4. Memory required for hidden state, matrix memory, and normalizers is $N=32$5, independent of sequence length. There is no KV cache or context-size-dependent memory growth.
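
To make the constant-memory claim concrete, the sketch below counts the values held in the recurrent state using the dimensions from the parameter table, and compares them with the KV cache of a hypothetical Transformer. The baseline (32 layers, width 4096, full multi-head attention) and the one scalar stabilizer per head are assumptions of this example, not figures from the source.

```python
H, d_qk, d_hv, n_blocks = 8, 256, 512, 32

# Per head: matrix memory (d_qk x d_hv), normalizer (d_qk), one scalar stabilizer (assumed).
state_per_head = d_qk * d_hv + d_qk + 1
xlstm_state_values = n_blocks * H * state_per_head

def kv_cache_values(context_len, n_layers=32, d_model=4096):
    # Hypothetical Transformer baseline: keys and values for every layer and token.
    return 2 * n_layers * d_model * context_len

print(f"xLSTM recurrent state: {xlstm_state_values:,} values (any context length)")
for T in (128, 1024, 8192, 32768):
    ratio = kv_cache_values(T) / xlstm_state_values
    print(f"T={T:>6}: KV cache {kv_cache_values(T):>13,} values, {ratio:7.1f}x the state")
```

With these assumed baseline settings the two are roughly equal near 128 tokens of context, after which the cache keeps growing linearly while the recurrent state stays fixed.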

Empirical results demonstrate that xLSTM-7B is 1.5–2× faster than Transformer-based 7B LLMs for single-batch 1k-token generations, with 50–70% lower peak GPU memory usage (Beck et al., 17 Mar 2025).

4. Scaling Laws and Empirical Performance

xLSTM-7B exhibits favorable scaling characteristics when compared to Transformers (Beck et al., 2 Oct 2025):

  • Parametric scaling law:

$N=32$6

with fitted exponents $N=32$7, $N=32$8, $N=32$9, $H=8$0 (here, $H=8$1 is parameter count, $H=8$2 is training tokens).

  • IsoFLOP scaling (compute-optimality):

$H=8$3

where $H=8$4 is compute budget in FLOPs.

  • Compute-optimal regime: For 7B parameters, $H=8$5 tokens per parameter ($H=8$6 tokens).
  • In practice, xLSTM achieves better or comparable cross-entropy loss at fixed compute relative to GPT-style Transformers, with benefits increasingly pronounced as context length increases (Beck et al., 2 Oct 2025).

5. Model Optimization and Training Configuration

The xLSTM-7B model is pretrained with a setup designed for efficiency at large parameter and data scales:

  • Hardware: 128 × NVIDIA H100 GPUs using FSDP with activation checkpointing.
  • Training tokens: Approximately 2.3 trillion over 550k steps; batch size ramps from 128 to 512.
  • Context: Context length set to 8192, with final “cool-down” training at 32,768 tokens.
  • Optimizer: AdamW ($H=8$7, $H=8$8, $H=8$9, weight decay $d_{hv}=512$0), peak LR $d_{hv}=512$1.
  • LR scheduling: Warmup (3k steps) $\to$ exponential decay (to 10% at 500k) $\to$ linear cool-down.
  • Initialization: Gate bias $d_{hv}=512$4 for stability; soft-capping of gates at $d_{hv}=512$5; logits capped at $d_{hv}=512$6.
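
The three-phase learning-rate schedule in the list above can be sketched as a function of the step. This is an illustrative sketch: the linear shape of the warmup, the cool-down ending at zero on step 550k, and the function name `lr_at` are assumptions, and the peak learning rate is left as a parameter.

```python
def lr_at(step, peak_lr, warmup=3_000, decay_end=500_000, total=550_000, floor=0.1):
    """Learning rate at a given step (illustrative sketch of the three phases)."""
    if step < warmup:
        # Phase 1: warmup, assumed linear from zero.
        return peak_lr * step / warmup
    if step <= decay_end:
        # Phase 2: exponential decay reaching floor * peak_lr at decay_end.
        frac = (step - warmup) / (decay_end - warmup)
        return peak_lr * floor ** frac
    # Phase 3: linear cool-down, assumed to end at zero on the final step.
    remaining = (total - step) / (total - decay_end)
    return peak_lr * floor * max(remaining, 0.0)

# Print the schedule as a multiplier of the peak learning rate.
for s in (0, 1_500, 3_000, 250_000, 500_000, 525_000, 550_000):
    print(f"step {s:>7}: {lr_at(s, peak_lr=1.0):.4f} x peak")
```

As a consistency check on the token budget, if the batch size counts sequences, the final setting of 512 sequences of 8192 tokens covers about 4.19M tokens per step, so 550k steps give at most roughly 2.31T tokens, in line with the stated 2.3 trillion.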

This regimen supports stable convergence at billion-parameter scale, avoiding vanishing/exploding gradients via gate-stabilization and normalization protocols (Beck et al., 17 Mar 2025).
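
Soft-capping of gate pre-activations and logits is commonly implemented with a scaled tanh. The sketch below assumes that form; the function name `soft_cap` and the cap value used in the demonstration are hypothetical and are not the model's settings.

```python
import numpy as np

def soft_cap(x, cap):
    # Smoothly bounds values to (-cap, cap); close to the identity near zero.
    return cap * np.tanh(x / cap)

cap = 10.0  # hypothetical cap value, chosen only for illustration
x = np.array([-1000.0, -5.0, 0.0, 0.01, 5.0, 1000.0])
capped = soft_cap(x, cap)
print(capped)
```

Unlike hard clipping, the scaled tanh is differentiable everywhere, so large pre-activations are bounded without a kink in the function.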

6. Comparison to Transformers and Prior Recurrent Models

Relative to Transformer architectures, xLSTM-7B’s distinguishing features include:

  • Constant per-token inference time and memory: State size in $d_{hv}=512$7, no dependence on sequence length.
  • Absence of KV cache: Eliminates context-length–dependent memory cost entirely.
  • Parallelizable training: Chunkwise parallel kernels allow throughput comparable to attention mechanisms.
  • Headwise recurrent memory matrices (fast-weight mechanism): Each head learns and updates a covariance-style memory.
  • Normalization and block design: Pre-norm RMSNorm and per-head LayerNorm yield stable optimization even at scale.

By contrast, Transformers exhibit $O(T)$ memory scaling due to their KV-cache for context length $T$, and $O(T^2)$ compute/memory per layer when recomputation is required (Beck et al., 2 Oct 2025, Beck et al., 17 Mar 2025).

7. Model Variants, Limitations, and Context

While foundational papers on xLSTM (e.g., Beck et al., 2024) explored variants of up to 2.7B parameters, the 7B configuration is supported by both dedicated benchmarking (Beck et al., 17 Mar 2025) and scaling-law analysis (Beck et al., 2 Oct 2025). Alternate scaling variants vary depth, width, and the mixture ratio of mLSTM to sLSTM blocks (e.g., the mLSTM-dominant 7:1 ratio in Beck et al., 2024). Notably, LRAM models for robotics tasks use smaller xLSTM backbones (at most 206M parameters), which illustrates the versatility and extensibility of the xLSTM design, although these uses do not involve a 7B instantiation (Schmied et al., 2024).

xLSTM-7B establishes a plug-and-play recurrent LLM baseline for scenarios requiring high-throughput inference or long-context information mixing, with empirical and architectural advantages relative to contemporary Transformer and state-space models. Its open-source model code and weights further support the reproducibility and extensibility of this approach (Beck et al., 17 Mar 2025).
