xLSTM 7B Architecture Overview
- The xLSTM 7B architecture is a large-scale recurrent neural network designed for language modeling, extending classic LSTM principles with modern stabilization and parallelization techniques.
- It employs 32 blocks of multi-head mLSTM cells combined with SwiGLU-activated MLPs to achieve efficient inference and constant per-token memory usage.
- Empirical results demonstrate that xLSTM 7B outperforms Transformer models in inference speed and memory usage while exhibiting favorable scaling laws and training stability.
The xLSTM 7B architecture is a large-scale recurrent neural network pretrained for language modeling and other sequence modeling tasks. xLSTM generalizes and extends classic LSTM principles using modern stabilization, normalization, and parallelization techniques. At the billion-parameter scale, it achieves competitive or superior efficiency and performance relative to Transformer models, especially with respect to inference speed and memory usage on long contexts. xLSTM-7B uses multi-head matrix-memory LSTM (mLSTM) cells, advanced gating, and fused GPU kernels; evaluations on extensive language modeling benchmarks report state-of-the-art inference efficiency and strong scaling behavior (Beck et al., 17 Mar 2025, Beck et al., 2 Oct 2025, Beck et al., 2024).
1. Architectural Overview and Parameterization
xLSTM-7B is constructed as a stack of 32 blocks, each consisting of an mLSTM sub-layer and a SwiGLU-activated (Swish-Gated Linear Unit) MLP. Embedding and hidden dimensions, gate subspaces, and per-head allocations are selected to maximize computational throughput while scaling to billions of parameters.
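As a rough illustration of the MLP half of each block, the following pure-Python sketch implements a SwiGLU feed-forward layer on plain lists. The weight names `W_gate`, `W_up`, and `W_down`, the helper names, and the toy dimensions are assumptions made for illustration; they are not identifiers or sizes taken from the released model code.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def silu(x):
    # Swish / SiLU activation: x * sigmoid(x).
    return x * sigmoid(x)

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_mlp(x, W_gate, W_up, W_down):
    # The gate branch passes through SiLU, the up branch stays linear;
    # their elementwise product is projected back to the model dimension.
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)

# Toy example: model dimension 2, feed-forward width 3 (hypothetical sizes).
x = [1.0, -0.5]
W_gate = [[0.5, 0.1], [-0.3, 0.8], [0.2, 0.2]]
W_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_down = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
y = swiglu_mlp(x, W_gate, W_up, W_down)
print(y)
```

The gated form uses three weight matrices instead of the two in a plain MLP, which is why gated feed-forward layers are often given a narrower hidden width to keep the parameter count comparable.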
Parameter Summary Table:
| Component | Quantity/Dimension | Parameter Count (Approx.) |
|---|---|---|
| Vocabulary | – | – |
| Model dimension | $d=4096$ | – |
| Number of xLSTM blocks | $N=32$ | – |
| Heads per block | $H=8$ | – |
| Hidden per head | $d_{hv}=512$ | – |
| Key/query per head | $d_{qk}=256$ | – |
| Feed-forward width | – | – |
| Embedding-In/Out | vocabulary × $d$, each | 206M each |
| Block modules (32×) | see below | 6.44B |
| Total Parameter Count | – | $\approx 6.865 \times 10^{9}$ |

Specialized fused GPU kernels are applied for inference efficiency (Beck et al., 17 Mar 2025).
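A quick arithmetic check on these figures can be sketched as follows. It assumes the configuration values $d=4096$, $N=32$, $H=8$, $d_{hv}=512$, and $d_{qk}=256$ together with the rounded parameter counts from the summary table; the variable names are illustrative, and the derived per-block and vocabulary figures are only as precise as the rounded inputs.

```python
# Configuration values as reported for xLSTM-7B in this article.
d_model = 4096      # model dimension d
n_blocks = 32       # number of xLSTM blocks N
n_heads = 8         # heads per block H
d_hv = 512          # hidden (value) dimension per head
d_qk = 256          # key/query dimension per head

# Rounded parameter counts from the summary table.
embedding_params_each = 206_000_000   # input and output embeddings, each
block_params_total = 6_440_000_000    # all 32 blocks combined
reported_total = 6_865_000_000        # approximate reported total

# The heads tile the model dimension for values; keys/queries use half of it.
value_width = n_heads * d_hv          # 4096
qk_width = n_heads * d_qk             # 2048

# Sum of the rounded components versus the reported total.
summed_total = 2 * embedding_params_each + block_params_total
gap = reported_total - summed_total   # small remainder left by the rounded figures

# Averages implied by the rounded figures (approximate by construction).
params_per_block = block_params_total / n_blocks
implied_vocab = embedding_params_each / d_model

print(value_width, qk_width)
print(summed_total, gap)
print(params_per_block, round(implied_vocab))
```

The rounded components sum to within about 0.2% of the reported total, so the table is internally consistent at the precision given.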
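2. Internal Block Algorithms and Cell Equations
Within each block, the model processes the sequence with multi-head mLSTM cells followed by an MLP nonlinearity. The update for each head at step $t$ is as follows (from Beck et al., 17 Mar 2025). Given the previous state $(C_{t-1}, n_{t-1}, m_{t-1})$ and input $x_t$, the steps take the form

$$
\begin{aligned}
q_t &= W_q x_t, \quad k_t = W_k x_t, \quad v_t = W_v x_t, \\
\tilde{i}_t &= w_i^\top x_t + b_i, \quad \tilde{f}_t = w_f^\top x_t + b_f, \quad o_t = \sigma(W_o x_t + b_o), \\
m_t &= \max\{\log\sigma(\tilde{f}_t) + m_{t-1},\ \tilde{i}_t\}, \\
i_t &= \exp(\tilde{i}_t - m_t), \quad f_t = \exp(\log\sigma(\tilde{f}_t) + m_{t-1} - m_t), \\
C_t &= f_t C_{t-1} + i_t k_t v_t^\top, \\
n_t &= f_t n_{t-1} + i_t k_t, \\
\tilde{h}_t &= \frac{C_t^\top (q_t/\sqrt{d_{qk}})}{\max\{|n_t^\top (q_t/\sqrt{d_{qk}})|,\ \exp(-m_t)\}}, \\
h_t &= o_t \odot \mathrm{Norm}(\tilde{h}_t),
\end{aligned}
$$

where Norm is LayerNorm or RMSNorm depending on position. The MLP sub-layer applies SwiGLU activation:

$$
\mathrm{MLP}(x) = W_{\text{down}}\big(\mathrm{SiLU}(W_{\text{gate}} x) \odot W_{\text{up}} x\big).
$$

xLSTM blocks are pre-normed and use RMSNorm at block input, with optional head-wise LayerNorm inside blocks.
3. Computational Complexity and Inference Scaling
The xLSTM-7B architecture is optimized for both training and inference efficiency.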
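The constant per-token cost of recurrent inference can be illustrated with a single-head sketch of a stabilized mLSTM step in pure Python. It follows the commonly published exponential-gating formulation with a running-max stabilizer; the function and variable names are hypothetical, the gate pre-activations are passed in directly rather than computed from learned projections, and the output gate and normalization are omitted. It is meant as a reading aid, not as the fused-kernel implementation.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def mlstm_step(C, n, m, q, k, v, i_pre, f_pre):
    """One stabilized mLSTM step for a single head.

    C: d_qk x d_hv matrix memory, n: length-d_qk normalizer, m: scalar stabilizer.
    q, k: length-d_qk vectors; v: length-d_hv vector.
    i_pre, f_pre: scalar input/forget gate pre-activations.
    """
    d_qk, d_hv = len(k), len(v)
    log_f = log_sigmoid(f_pre)
    m_new = max(log_f + m, i_pre)           # running max keeps exponents bounded
    i_gate = math.exp(i_pre - m_new)        # always <= 1
    f_gate = math.exp(log_f + m - m_new)    # always <= 1
    C_new = [[f_gate * C[a][b] + i_gate * k[a] * v[b] for b in range(d_hv)]
             for a in range(d_qk)]
    n_new = [f_gate * n[a] + i_gate * k[a] for a in range(d_qk)]
    q_s = [qa / math.sqrt(d_qk) for qa in q]
    num = [sum(C_new[a][b] * q_s[a] for a in range(d_qk)) for b in range(d_hv)]
    den = max(abs(sum(na * qa for na, qa in zip(n_new, q_s))), math.exp(-m_new))
    h = [x / den for x in num]
    return C_new, n_new, m_new, h

# Toy run: the state (C, n, m) keeps the same size no matter how many tokens are seen.
d_qk, d_hv = 2, 3
C = [[0.0] * d_hv for _ in range(d_qk)]
n = [0.0] * d_qk
m = 0.0
inputs = [
    ([1.0, 0.0], [1.0, 0.0], [0.5, -1.0, 2.0], 100.0, 0.0),
    ([0.0, 1.0], [0.0, 1.0], [1.0, 1.0, 1.0], 2.0, 3.0),
    ([1.0, 1.0], [0.5, 0.5], [-2.0, 0.0, 4.0], -1.0, 1.0),
]
outputs = []
for q, k, v, i_pre, f_pre in inputs:
    C, n, m, h = mlstm_step(C, n, m, q, k, v, i_pre, f_pre)
    outputs.append(h)

# A very large input-gate pre-activation: exp(1000) would overflow,
# but the stabilized gates only ever exponentiate non-positive numbers.
zeros_C = [[0.0] * d_hv for _ in range(d_qk)]
_, _, m_big, h_big = mlstm_step(zeros_C, [0.0] * d_qk, 0.0,
                                [1.0, 0.0], [1.0, 0.0], [0.5, -1.0, 2.0], 1000.0, 0.0)
print(outputs, m_big, h_big)
```

Each call touches only the fixed-size state, so the work and memory per generated token do not grow with the number of tokens already processed. Training and prefill use parallel or chunkwise formulations instead of this token-by-token loop.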
Empirical results demonstrate that xLSTM-7B is 1.5–2× faster than Transformer-based 7B LLMs for single-batch 1k-token generations, with 50–70% lower peak GPU memory usage (Beck et al., 17 Mar 2025).
4. Scaling Laws and Empirical Performance
xLSTM-7B exhibits favorable scaling characteristics when compared to Transformers (Beck et al., 2 Oct 2025):
- A parametric loss fit in terms of parameter count and number of training tokens, with fitted exponents.
- A corresponding fit expressed in terms of the compute budget, measured in FLOPs.
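Fits of this kind are typically obtained by regression in log-log space. The sketch below shows the generic technique for a single power law $y = a\,x^{b}$; the functional form, the helper name `fit_power_law`, and all numbers are synthetic assumptions chosen for illustration and are not the fitted values or data reported for xLSTM.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    count = len(xs)
    mean_x = sum(lx) / count
    mean_y = sum(ly) / count
    # Slope of the regression line is the exponent b.
    b = (sum((u - mean_x) * (w - mean_y) for u, w in zip(lx, ly))
         / sum((u - mean_x) ** 2 for u in lx))
    # Intercept of the regression line is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic data generated from a known power law (NOT measured results).
true_a, true_b = 0.1, 0.5
compute_flops = [1e18, 1e19, 1e20, 1e21]
optimal_size = [true_a * c ** true_b for c in compute_flops]

a_hat, b_hat = fit_power_law(compute_flops, optimal_size)
print(a_hat, b_hat)
```

On noise-free synthetic data the fit recovers the generating exponent; with real training runs the points scatter around the line and the exponent carries an uncertainty that should be reported alongside it.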
5. Model Optimization and Training Configuration
The xLSTM-7B model is pretrained with high-efficiency methods supporting extremely large parameter and data regimes.
This regimen supports stable convergence at billion-parameter scale, avoiding vanishing and exploding gradients through gate stabilization and normalization (Beck et al., 17 Mar 2025).
6. Comparison to Transformers and Prior Recurrent Models
Relative to Transformer architectures, xLSTM-7B is distinguished by efficient recurrent inference with constant per-token memory usage.
By contrast, Transformers exhibit $\mathcal{O}(T)$ memory scaling due to their KV-cache for context length $T$, and $\mathcal{O}(T^2)$ compute/memory per layer when recomputation is required (Beck et al., 2 Oct 2025, Beck et al., 17 Mar 2025).
7. Model Variants, Limitations, and Context
While foundational papers on xLSTM (e.g., Beck et al., 2024) explored variants of up to 2.7B parameters, the 7B configuration is supported by both dedicated benchmarking (Beck et al., 17 Mar 2025) and scaling-law analysis (Beck et al., 2 Oct 2025). Alternate scaling variants vary depth, width, and the mixture ratio of mLSTM to sLSTM blocks (e.g., mLSTM-dominant 7:1 ratios, as in Beck et al., 2024). Notably, LRAM models for robotics tasks use smaller xLSTM backbones (at most 206M parameters), which illustrates the versatility and extensibility of the xLSTM design, though these uses do not describe a 7B instantiation (Schmied et al., 2024).
xLSTM-7B establishes a plug-and-play recurrent LLM baseline for scenarios requiring high-throughput inference or long-context information mixing, with empirical and architectural advantages relative to contemporary Transformer and state-space models. Its open-source model code and weights further support the reproducibility and extensibility of this approach (Beck et al., 17 Mar 2025).
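The memory comparison with Transformers from Section 6 can be made concrete with a back-of-the-envelope sketch. The xLSTM figures use the configuration reported in this article (32 blocks, 8 heads, $d_{qk}=256$, $d_{hv}=512$); the Transformer shape (32 layers, 32 KV heads, head dimension 128), the 16-bit storage assumption, and the function names are hypothetical choices for illustration. Only per-sequence state is counted, not model weights or activations.

```python
def mlstm_state_elems(n_blocks, n_heads, d_qk, d_hv):
    # Per head: matrix memory C (d_qk x d_hv), normalizer n (d_qk), stabilizer m (1).
    return n_blocks * n_heads * (d_qk * d_hv + d_qk + 1)

def kv_cache_elems(context_len, n_layers, n_kv_heads, head_dim):
    # Keys and values for every cached token in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len

BYTES_PER_ELEM = 2  # assumes 16-bit storage for both models

# Recurrent state: independent of context length.
xlstm_bytes = mlstm_state_elems(32, 8, 256, 512) * BYTES_PER_ELEM

# Hypothetical 7B-class Transformer: 32 layers, 32 KV heads, head dimension 128.
for T in (1_024, 8_192, 65_536):
    kv_bytes = kv_cache_elems(T, 32, 32, 128) * BYTES_PER_ELEM
    print(f"T={T:>6}: KV cache {kv_bytes / 2**20:8.1f} MiB, "
          f"mLSTM state {xlstm_bytes / 2**20:6.1f} MiB")
```

Under these assumptions the recurrent state stays fixed while the KV cache grows linearly with context length. Techniques such as grouped-query attention reduce the Transformer's constant factor but not the linear growth.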