Rethinking the long-range dependency in Mamba/SSM and transformer models (2509.04226v1)

Published 4 Sep 2025 in cs.LG

Abstract: Long-range dependency is one of the most desired properties of recent sequence models such as state-space models (particularly Mamba) and transformer models. New model architectures are being actively developed and benchmarked for prediction tasks that require long-range dependency. However, the ability of these models to capture long-range dependencies has not been investigated from a theoretical perspective, which hinders systematic improvement on this front. In this work, we mathematically define long-range dependency using the derivative of hidden states with respect to past inputs, and compare the capability of SSM and transformer models to model long-range dependency based on this definition. We show that the long-range dependency of SSMs decays exponentially with the sequence length, which aligns with the exponential decay of the memory function in RNNs. The attention mechanism used in transformers, by contrast, is more flexible and is not constrained to exponential decay, so it could in theory model long-range dependency better given sufficient training data, computing resources, and proper training. To combine the flexible long-range dependency of the attention mechanism with the computational efficiency of SSMs, we propose a new formulation for the hidden state update in SSMs and prove its stability under a standard Gaussian distribution of the input data.

Summary

  • The paper introduces a novel hybrid SSM formulation that integrates interaction terms reminiscent of transformer attention to mitigate exponential LRD decay.
  • It demonstrates that pure SSMs like Mamba excel in medium-range retention but suffer from exponential decay, while transformers offer more flexibility at the cost of increased data needs.
  • The research establishes a theoretical framework using hidden state derivatives for LRD analysis and provides probabilistic stability bounds under Gaussian assumptions.

Rethinking Long-Range Dependency in Mamba/SSM and Transformer Models

Introduction

The paper investigates the theoretical capability of modeling long-range dependencies (LRD) in state-space models (SSMs), particularly Mamba, and transformer models. Long-range dependencies are critical for sequence models used in tasks such as text generation, biological sequence analysis, and signal processing. While transformers have shown promise due to their flexible attention mechanism, there hasn't been a thorough theoretical exploration of the LRD capabilities and constraints of SSMs and transformers.

Background and Motivation

Recurrent Neural Networks (RNNs) have traditionally struggled with LRD due to vanishing gradients and training inefficiencies. Transformers mitigate some of these issues through parallel computation and attention mechanisms, but suffer from quadratic time complexity during prediction. SSMs like Mamba promise linear time complexity and better medium-range memory retention, using structured state-space parameterizations initialized with HiPPO matrices.

Long-Range Dependency Analysis

The paper provides a mathematical definition of LRD based on the derivative of hidden states with respect to past inputs. This framework allows for a comparison between SSMs and transformers in theoretical terms:

  1. SSMs/Mamba: LRD decays exponentially as the sequence length increases, and the eigenvalue-based initialization constrains how much long-range information can be captured. SSMs are efficient at medium-range dependency retention, but the exponential decay limits their ability to model very long dependencies (a numerical sketch follows this list).
  2. Transformers: The attention mechanism carries no exponential-decay constraint, theoretically allowing better long-range dependency modeling. This flexibility comes at the cost of weaker inductive biases than SSMs, and therefore more training data is needed to avoid overfitting.
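
To make the decay claim for SSMs concrete, below is a minimal numerical sketch, assuming a time-invariant linear SSM $h_t = \bar{A} h_{t-1} + \bar{B} x_t$ rather than the paper's exact selective (input-dependent) construction. In this simplified setting the sensitivity of the hidden state to an input $k$ steps in the past is $\partial h_t / \partial x_{t-k} = \bar{A}^{k} \bar{B}$, whose norm shrinks geometrically once the eigenvalues of $\bar{A}$ lie strictly inside the unit circle. All dimensions and values are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative linear SSM: h_t = A_bar @ h_{t-1} + B_bar @ x_t.
# Sensitivity to an input k steps back: d h_t / d x_{t-k} = A_bar^k @ B_bar,
# whose norm decays exponentially in k when the spectral radius of A_bar < 1.
rng = np.random.default_rng(0)
d_state, d_input = 16, 4

A = rng.standard_normal((d_state, d_state))
A_bar = 0.95 * A / np.abs(np.linalg.eigvals(A)).max()  # rescale to spectral radius 0.95
B_bar = rng.standard_normal((d_state, d_input))

for k in range(0, 51, 10):
    sens = np.linalg.matrix_power(A_bar, k) @ B_bar
    print(f"gap k={k:2d}  ||d h_t / d x_(t-k)|| = {np.linalg.norm(sens):.3e}")
```

The printed norms fall by a roughly constant factor every ten steps, which is the exponential decay the analysis formalizes; attention scores, by contrast, impose no such built-in decay on how strongly a past token can influence the current representation.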

Proposed Hybrid Model

Inspired by the strengths and weaknesses of both architectures, the paper introduces an innovative SSM formulation incorporating "interaction" terms similar to the attention mechanism in transformers:

  • Hidden State Update: Reformulated as $h_t = (\bar{A}_t + G x_t x_t^T W^T)\, h_{t-1} + \bar{B}_t x_t$, where the input-dependent interaction term modulates the hidden state transition (a sketch of this update follows the list).
  • Advantages: Demonstrated flexibility in LRD, breaking away from exponential decay, with empirical results showing variations in LRD response over time.
  • Stability: Probabilistic bounds are provided ensuring model stability under Gaussian input assumptions, with eigenvalues carefully constrained.
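
The following sketch implements the proposed update literally, one step at a time. The dimensions, the weight scales for $G$ and $W$, and the stability-motivated rescaling of $\bar{A}$ are illustrative assumptions, not the paper's exact parameterization; inputs are drawn from a standard Gaussian to mirror the setting of the stability result.

```python
import numpy as np

def hybrid_ssm_step(h_prev, x_t, A_bar, B_bar, G, W):
    """One step of the interaction-augmented update
    h_t = (A_bar + G x_t x_t^T W^T) h_{t-1} + B_bar x_t.
    The term G x_t x_t^T W^T makes the transition input-dependent, so the
    sensitivity to past inputs is not forced to decay at a fixed rate."""
    interaction = G @ np.outer(x_t, x_t) @ W.T  # (d_state, d_state) correction
    return (A_bar + interaction) @ h_prev + B_bar @ x_t

rng = np.random.default_rng(1)
d_state, d_input, T = 16, 4, 64

A = rng.standard_normal((d_state, d_state))
A_bar = 0.9 * A / np.abs(np.linalg.eigvals(A)).max()  # stable base transition
B_bar = rng.standard_normal((d_state, d_input))
# Small interaction weights, echoing the requirement that the perturbed
# transition remain stable for standard-Gaussian inputs.
G = 0.05 * rng.standard_normal((d_state, d_input))
W = 0.05 * rng.standard_normal((d_state, d_input))

h = np.zeros(d_state)
for t in range(T):
    x_t = rng.standard_normal(d_input)  # standard Gaussian input, as in the analysis
    h = hybrid_ssm_step(h, x_t, A_bar, B_bar, G, W)
    if (t + 1) % 16 == 0:
        print(f"t={t + 1:3d}  ||h_t|| = {np.linalg.norm(h):.3e}")
```

For standard-Gaussian inputs $\mathbb{E}[x_t x_t^\top] = I$, which hints at why a probabilistic stability bound is tractable under this input distribution; keeping $G$ and $W$ small keeps the effective transition close to the stable base $\bar{A}_t$, in the spirit of the constrained-eigenvalue condition stated above.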

Implications and Future Work

The new formulation promises more flexible modeling of long-range dependencies, free of the inherent decay constraint of pure SSM approaches, while aiming to retain their computational efficiency. It is a step toward resolving a fundamental limitation of SSMs. The paper suggests several future research pathways:

  • General Stability Conditions: Expanding the stability analysis beyond the simplified Gaussian-input assumption.
  • Algorithmic Efficiency: Developing efficient computation strategies analogous to those used in existing SSM/Mamba frameworks.
  • Benchmark Evaluation: Testing on existing benchmark datasets to gauge improvement over current hybrid models that combine Mamba and transformers.

Conclusion

The paper makes significant strides toward understanding and enhancing the theoretical modeling capabilities of sequence models with respect to long-range dependencies. By proposing a new interaction-based SSM framework, it extends the practical applicability and efficiency of these models in complex prediction tasks, setting a foundation for further exploration into hybrid modeling techniques combining structured state spaces and attention mechanisms.
