xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference (2503.13427v1)

Published 17 Mar 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Recent breakthroughs in solving reasoning, math and coding problems with LLMs have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.

Introduction and Motivation

The xLSTM 7B architecture introduces a recurrent framework for LLMs that fundamentally rethinks inference efficiency at the multi-billion-parameter scale. By leveraging a variant of the LSTM with targeted optimizations, the architecture avoids the quadratic compute and growing memory costs associated with Transformer-based models. The 7-billion-parameter model achieves competitive downstream performance while offering significant improvements in generation speed and memory efficiency.

Architectural Innovations

xLSTM 7B employs several architectural refinements that are crucial for its efficient inference:

  • Optimized mLSTM Block:

The design shifts from a pre-up-projection to a post-up-projection strategy: the mLSTM cell now operates directly in the embedding space rather than in a higher-dimensional representation. This is complemented by position-wise feed-forward modules (SwiGLU) interleaved after each mLSTM layer, targeting a better FLOP distribution and streamlined GPU utilization. Channel-wise convolutions and learnable skip connections are removed, simplifying the compute graph and improving parallelization. A minimal sketch of this block layout appears after this list.

  • Parallelization Strategies:

With independent gate pre-activations computed per head, xLSTM 7B aligns its model parallelism approach with that of Transformer architectures. The chosen configuration of 8 heads with a head dimension of 512, together with a query/key dimension of $d_{qk} = 256$, strikes a balance between expressive capacity and memory overhead.

  • Memory Efficiency through Recurrent Computation:

The recurrent mechanism enables constant memory usage during generation. Unlike Transformers, where the KV cache grows with sequence length, the recurrent design keeps per-token compute and state size constant, so total compute scales only linearly with sequence length, which is critical for low-latency deployment. A hedged sketch of a single, fixed-state decode step also follows this list.

  • Fused Generation Kernels:

To mitigate the overhead of memory operations, the design incorporates fused GPU kernels that combine the outer products, dot products, and element-wise operations within the mLSTM cell. This integration reduces kernel launch overhead and increases throughput during inference.

  • Stability Enhancements:

Replacing LayerNorm with RMSNorm and soft-capping the input and forget gate pre-activations (via a tanh non-linearity) improves stability during both training and generation. Initializing the input gate bias to a negative value further helps keep gradient-norm spikes under control, contributing to more stable convergence.
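To make the post-up-projection layout concrete, the following is a minimal PyTorch sketch of one xLSTM-7B-style block: mLSTM sequence mixing at the embedding dimension (4096, i.e., 8 heads with a value head dimension of 512 and $d_{qk} = 256$), followed by an interleaved SwiGLU feed-forward module, each in a pre-norm residual wrapper. Module names, the feed-forward width, the pre-norm placement, and the simplified (unstabilized, uncapped) recurrence are illustrative assumptions and do not mirror the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Position-wise feed-forward module with a SwiGLU gate."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class MLSTMLayer(nn.Module):
    """mLSTM sequence mixing in the embedding space (post-up projection):
    q/k/v and per-head gate pre-activations are projected from d_model,
    mixed by a recurrence over time, and projected back to d_model.
    The recurrence below omits stabilization and soft-capping purely for
    illustration; see the decode-step sketch that follows."""

    def __init__(self, d_model=4096, n_heads=8, d_qk=256, d_v=512):
        super().__init__()
        self.n_heads, self.d_qk, self.d_v = n_heads, d_qk, d_v
        self.q_proj = nn.Linear(d_model, n_heads * d_qk, bias=False)
        self.k_proj = nn.Linear(d_model, n_heads * d_qk, bias=False)
        self.v_proj = nn.Linear(d_model, n_heads * d_v, bias=False)
        # independent input/forget gate pre-activations per head
        self.gate_proj = nn.Linear(d_model, 2 * n_heads, bias=True)
        self.out_proj = nn.Linear(n_heads * d_v, d_model, bias=False)

    def forward(self, x):  # x: (B, T, d_model)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_qk)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_qk)
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_v)
        i_pre, f_pre = self.gate_proj(x).view(B, T, self.n_heads, 2).unbind(-1)
        C = x.new_zeros(B, self.n_heads, self.d_qk, self.d_v)  # matrix memory
        n = x.new_zeros(B, self.n_heads, self.d_qk)            # normalizer
        outs = []
        for t in range(T):  # explicit loop; fast training kernels are chunkwise-parallel
            i_t = torch.exp(i_pre[:, t])[..., None, None]         # (B, H, 1, 1)
            f_t = torch.sigmoid(f_pre[:, t])[..., None, None]
            kv = torch.einsum("bhk,bhv->bhkv", k[:, t], v[:, t])  # outer product
            C = f_t * C + i_t * kv
            n = f_t[..., 0] * n + i_t[..., 0] * k[:, t]
            num = torch.einsum("bhkv,bhk->bhv", C, q[:, t])
            den = torch.einsum("bhk,bhk->bh", n, q[:, t]).abs().clamp(min=1.0)
            outs.append(num / den[..., None])
        h = torch.stack(outs, dim=1).reshape(B, T, -1)
        return self.out_proj(h)


class XLSTMBlock(nn.Module):
    """Pre-norm residual block: RMSNorm -> mLSTM -> residual,
    then RMSNorm -> SwiGLU -> residual (nn.RMSNorm needs PyTorch >= 2.4)."""

    def __init__(self, d_model=4096, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # illustrative width, not the released value
        self.norm1 = nn.RMSNorm(d_model)
        self.mlstm = MLSTMLayer(d_model)
        self.norm2 = nn.RMSNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x):
        x = x + self.mlstm(self.norm1(x))
        return x + self.ffn(self.norm2(x))
```

Note that every operation here works on tensors whose sizes are fixed by the configuration; only the time loop depends on sequence length, which is what gives the linear compute scaling discussed above.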
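The constant-memory property follows from the fact that a generation step only reads and updates a fixed-size per-head state, no matter how many tokens have already been produced. Below is a hedged sketch of a single stabilized mLSTM decode step in plain PyTorch, including tanh soft-capping of the gate pre-activations; the cap value, the sigmoid forget gate, and all function names are illustrative assumptions, and in the actual model these outer-product, dot-product, and element-wise operations are fused into a single GPU kernel rather than issued as separate tensor ops.

```python
import torch
import torch.nn.functional as F


def soft_cap(x: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    """Soft-capping via tanh, bounding pre-activations to (-cap, cap).
    The cap value is an illustrative assumption."""
    return cap * torch.tanh(x / cap)


@torch.no_grad()
def mlstm_decode_step(q, k, v, i_pre, f_pre, state):
    """One stabilized mLSTM generation step for a single new token.

    Shapes: q, k: (B, H, d_qk); v: (B, H, d_v); i_pre, f_pre: (B, H);
    state = (C, n, m) with C: (B, H, d_qk, d_v), n: (B, H, d_qk), m: (B, H).
    The state has a fixed size, so memory does not grow with the number of
    generated tokens (unlike a Transformer KV cache). A sigmoid forget gate
    is assumed; key scaling by 1/sqrt(d_qk) is omitted for brevity.
    """
    C, n, m = state
    i_pre = soft_cap(i_pre)  # capped gate pre-activations
    f_pre = soft_cap(f_pre)
    log_f = F.logsigmoid(f_pre)

    # Stabilizer state keeps the exp() arguments bounded.
    m_new = torch.maximum(log_f + m, i_pre)
    i_gate = torch.exp(i_pre - m_new)[..., None]      # (B, H, 1)
    f_gate = torch.exp(log_f + m - m_new)[..., None]  # (B, H, 1)

    # Outer-product memory update and normalizer update; in the real model
    # these and the read-out below are fused into one generation kernel.
    C = f_gate[..., None] * C + i_gate[..., None] * torch.einsum("bhk,bhv->bhkv", k, v)
    n = f_gate * n + i_gate * k

    # Read-out: dot products against the matrix memory, then normalization.
    num = torch.einsum("bhkv,bhk->bhv", C, q)
    den = torch.maximum(torch.einsum("bhk,bhk->bh", n, q).abs(), torch.exp(-m_new))
    h = num / den[..., None]  # (B, H, d_v)
    return h, (C, n, m_new)


# Usage with the paper's head layout (B=1, H=8, d_qk=256, d_v=512):
B, H, d_qk, d_v = 1, 8, 256, 512
state = (torch.zeros(B, H, d_qk, d_v), torch.zeros(B, H, d_qk), torch.zeros(B, H))
q, k = torch.randn(B, H, d_qk), torch.randn(B, H, d_qk)
v = torch.randn(B, H, d_v)
i_pre, f_pre = torch.randn(B, H), torch.randn(B, H)
h, state = mlstm_decode_step(q, k, v, i_pre, f_pre, state)
```

Because the state tuple (C, n, m) has a fixed shape, peak memory during generation does not depend on how many tokens have been decoded.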

Experimental Evaluation and Numerical Results

The experimental benchmark comparisons demonstrate robust performance metrics for xLSTM 7B:

  • Inference Speed and Throughput:

Empirical evaluation indicates that xLSTM 7B generates tokens approximately 50% faster than Mamba-based LLMs and matches or exceeds Llama-based models at prefill length 0. The architecture also offers about 70% higher prefill throughput in long-context scenarios, making it especially advantageous for tasks that exploit extended context windows.

  • Memory Footprint:

Recurrent operation keeps memory consumption constant with respect to sequence length. This results in a significantly smaller memory footprint than Transformer-based counterparts, whose KV cache scales with sequence length, and enables efficient inference in resource-constrained environments; a rough back-of-the-envelope comparison follows this list.

  • Task Performance:

Evaluations on standardized benchmarks such as the Open LLM Leaderboard v2 and the RULER benchmark demonstrate that xLSTM 7B is competitive with state-of-the-art models at the 7B parameter scale. Notably, on RULER, which is designed to probe long-context capability, xLSTM 7B reaches an average accuracy of 20% at a context length of 131k tokens, despite having been trained only up to a context length of 32k tokens. Ablation studies further reinforce the significance of RMSNorm, soft-capping, and the negative input gate bias.
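To make the footprint difference concrete, the back-of-the-envelope comparison below contrasts a growing Transformer KV cache with a fixed-size recurrent state. The Transformer configuration (32 layers, 32 heads, head dimension 128), the assumed 32 mLSTM blocks, and bf16 storage are illustrative assumptions for generic 7B-scale models, not figures reported in the paper.

```python
# Back-of-the-envelope comparison of a Transformer KV cache (grows with
# sequence length) vs. a fixed-size recurrent state. All configuration
# values are assumptions for generic 7B-scale models, not paper figures.
BYTES = 2  # bf16 storage


def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, d_head=128, batch=1):
    # keys + values, per layer, per head, per cached token
    return 2 * n_layers * n_heads * d_head * seq_len * batch * BYTES


def mlstm_state_bytes(n_blocks=32, n_heads=8, d_qk=256, d_v=512, batch=1):
    # matrix memory C (d_qk x d_v) plus normalizer (d_qk) and stabilizer (1),
    # per head and per block; independent of sequence length
    per_head = d_qk * d_v + d_qk + 1
    return n_blocks * n_heads * per_head * batch * BYTES


for T in (2_048, 32_768, 131_072):
    print(f"T={T:>7}: KV cache ~ {kv_cache_bytes(T) / 2**30:6.2f} GiB, "
          f"recurrent state ~ {mlstm_state_bytes() / 2**30:6.3f} GiB")
```

Under these assumptions the KV cache reaches tens of GiB at 131k tokens, while the recurrent state stays in the tens of MiB regardless of length.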

Broader Implications for Efficient Inference

The implications of adopting the xLSTM-based architecture extend beyond raw performance metrics:

  • Scalability in Inference:

Linear compute scaling with respect to sequence length benefits applications that demand high-turnover interactive sessions or rely on heavy test-time computation, such as tree-of-thought reasoning. Constant memory usage during generation additionally allows extended input contexts to be handled without throughput degradation.

  • Deployment on Edge and Local Devices:

The significantly lower memory footprint, coupled with the use of fused GPU kernels, supports local and edge deployment scenarios where memory and compute resources are at a premium. This is particularly impactful for real-time applications that require low-latency responses.

  • Sustainability and Efficiency Gains:

The architectural modifications contribute to a reduction in overall energy consumption during inference, thus promoting more sustainable practices in the deployment of LLMs. In scenarios with heavy reliance on inference, the reduced computational overhead can significantly lower operational costs and carbon footprint.

Practical Implementation Considerations

For practitioners considering the integration or further research into xLSTM 7B, several technical and deployment aspects warrant careful planning:

  • GPU Kernel Optimization:

Developers should examine the fused generation kernel implementations to tailor them to specific GPU architectures. Given the custom operations, low-level tuning in CUDA and potential integration with libraries such as cuBLAS might be necessary to maximize performance.

  • Memory Management:

The recurrent design simplifies memory allocation by maintaining a constant footprint. However, confirming that an implementation actually realizes this advantage requires rigorous profiling, especially in mixed-precision training/inference settings; a minimal profiling sketch follows this list.

  • Parallelization and Deployment Strategies:

The independent gate pre-activations facilitate model parallelism similar to Transformer-based systems. This allows for deployment strategies that distribute computation across multiple GPUs or even across nodes in a cluster. Integration with model parallel frameworks (e.g., Megatron or DeepSpeed) can further optimize resource utilization.

  • Hyperparameter Configuration:

The empirically derived configuration (8 heads, head dimension of 512, $d_{qk} = 256$) has proven effective; however, exploring alternative configurations may yield further refinements when adapting the model to specific domain tasks or scaling requirements. A small configuration sketch with these values also appears after this list.

  • Benchmarking and Evaluation:

It is advisable to replicate the benchmark protocols (e.g., Open LLM Leaderboard and RULER tasks) in a controlled environment to validate the performance claims pre-deployment. Detailed ablation studies may also help in understanding the impact of each architectural choice in specific application contexts.
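As a starting point for the profiling mentioned above, the sketch below measures peak GPU memory while generating from prompts of increasing length; constant-memory behavior should show up as a roughly flat curve. The checkpoint identifier, dtype, and prompt lengths are assumptions to be adapted to the actual released weights and hardware.

```python
# Sketch: check that peak generation memory stays roughly flat as prompt
# length grows. Checkpoint id, dtype, and lengths are assumptions to adapt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NX-AI/xLSTM-7b"  # assumed identifier; replace with the released checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True may be required depending on how the model is packaged
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

for prompt_len in (1_024, 8_192, 32_768):
    ids = torch.randint(0, tok.vocab_size, (1, prompt_len), device="cuda")
    torch.cuda.reset_peak_memory_stats()
    with torch.inference_mode():
        model.generate(ids, max_new_tokens=256, do_sample=False)
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"prompt={prompt_len:>6} tokens  peak GPU memory ~ {peak_gib:.1f} GiB")
```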
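The head configuration referenced in the hyperparameter item above can be captured in a small configuration object for experimentation; the field names and the consistency check are illustrative and do not mirror the released code's config schema.

```python
from dataclasses import dataclass


@dataclass
class XLSTM7BConfig:
    """Head/embedding configuration as described above; field names are
    illustrative, not the released code's schema."""
    d_model: int = 4096  # embedding dimension
    n_heads: int = 8     # mLSTM heads
    d_head: int = 512    # value head dimension
    d_qk: int = 256      # query/key dimension per head

    def __post_init__(self):
        # 8 heads x head dim 512 must equal the embedding dimension
        assert self.n_heads * self.d_head == self.d_model


# Scaling or ablation variants only need to preserve the invariant:
base = XLSTM7BConfig()
small = XLSTM7BConfig(d_model=2048, n_heads=8, d_head=256, d_qk=128)
```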

In summary, xLSTM 7B sets a practical precedent for efficiently scaling recurrent architectures in LLMs. Its methodological contributions, effective fusion of optimized kernels, and empirical benchmarks underscore its potential in environments that prioritize inference speed, memory efficiency, and scalability. The open-source release of model weights and code further accelerates its adoption in both academic research and industrial applications.

Authors (8)
  1. Maximilian Beck
  2. Korbinian Pöppel
  3. Phillip Lippe
  4. Richard Kurle
  5. Patrick M. Blies
  6. Günter Klambauer
  7. Sebastian Böck
  8. Sepp Hochreiter