Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues (2411.12537v2)

Published 19 Nov 2024 in cs.LG, cs.CL, and cs.FL

Abstract: Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and DeltaNet have emerged as efficient alternatives to Transformers in language modeling, offering linear scaling with sequence length and improved training efficiency. However, LRNNs struggle to perform state-tracking which may impair performance in tasks such as code evaluation or tracking a chess game. Even parity, the simplest state-tracking task, which non-linear RNNs like LSTM handle effectively, cannot be solved by current LRNNs. Recently, Sarrof et al. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity stems from restricting the value range of their diagonal state-transition matrices to $[0, 1]$ and that incorporating negative values can resolve this issue. We extend this result to non-diagonal LRNNs, which have recently shown promise in models such as DeltaNet. We prove that finite precision LRNNs with state-transition matrices having only positive eigenvalues cannot solve parity, while complex eigenvalues are needed to count modulo $3$. Notably, we also prove that LRNNs can learn any regular language when their state-transition matrices are products of identity minus vector outer product matrices, each with eigenvalues in the range $[-1, 1]$. Our empirical results confirm that extending the eigenvalue range of models like Mamba and DeltaNet to include negative values not only enables them to solve parity but consistently improves their performance on state-tracking tasks. Furthermore, pre-training LRNNs with an extended eigenvalue range for language modeling achieves comparable performance and stability while showing promise on code and math data. Our work enhances the expressivity of modern LRNNs, broadening their applicability without changing the cost of training or inference.

Summary

  • The paper introduces a key modification to the eigenvalue range of LRNN state-transition matrices, showing that negative eigenvalues enable state-tracking, beginning with parity.
  • Empirical analysis demonstrates that extending the eigenvalue range to (-1, 1) improves LRNN performance on tasks such as modular arithmetic without sacrificing efficiency.
  • Theoretical results show that richer eigenvalue spectra allow LRNNs to emulate finite state automata for complex sequential tasks.

Unlocking State-Tracking in Linear RNNs through Negative Eigenvalues

This paper analyzes the limitations and potential of Linear Recurrent Neural Networks (LRNNs) for state-tracking and proposes a fundamental enhancement to their architecture. LRNNs offer a promising alternative to Transformers, scaling linearly with sequence length rather than quadratically as attention does. Despite this advantage, a notable limitation of LRNNs is their inability to perform state-tracking, which is essential in tasks ranging from code evaluation to tracking the state of a chess game.

The primary technical advancement addressed in this work is the modification of the eigenvalue range associated with the state-transition matrices of LRNNs. Conventionally, these matrices have eigenvalues confined between zero and one; however, the authors propose broadening this range to include negative values, specifically expanding the range to (-1, 1). This extension fundamentally enhances the expressive capacity of these networks, enabling them to solve tasks previously unattainable by LRNNs.
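As a concrete illustration, the sketch below contrasts a conventional gating scheme, which keeps the diagonal state-transition values in (0, 1), with a reparameterization that maps the same pre-activations to (-1, 1). The scan function and the 2*sigmoid(x) - 1 mapping are illustrative assumptions for a generic diagonal LRNN, not the exact parameterization of any specific model from the paper.

```python
import torch

def diagonal_lrnn_scan(x, a):
    """Sequentially apply h_t = a_t * h_{t-1} + x_t for a diagonal linear RNN.

    x, a: tensors of shape (seq_len, d). Returns all hidden states (seq_len, d).
    """
    h = torch.zeros(x.shape[1])
    states = []
    for t in range(x.shape[0]):
        h = a[t] * h + x[t]
        states.append(h)
    return torch.stack(states)

raw = torch.randn(16, 8)   # hypothetical gate pre-activations
x = torch.randn(16, 8)     # hypothetical inputs

# Conventional gating: diagonal eigenvalues confined to (0, 1).
a_positive = torch.sigmoid(raw)
h_positive = diagonal_lrnn_scan(x, a_positive)

# Extended gating: the same pre-activations mapped to (-1, 1),
# admitting negative eigenvalues at no extra compute cost.
a_extended = 2.0 * torch.sigmoid(raw) - 1.0
h_extended = diagonal_lrnn_scan(x, a_extended)
```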

Key theoretical contributions include a proof that finite precision LRNNs whose state-transition matrices have only positive eigenvalues cannot solve parity, a fundamental state-tracking task. Detailed proofs show that at least one negative eigenvalue is required for parity, while problems such as modular counting with a modulus that is not a power of two require complex (non-real) eigenvalues. Admitting such eigenvalues enriches the LRNN's capability to simulate finite state automata.
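To make the parity argument concrete, here is a minimal one-dimensional example (an illustration, not code from the paper): choosing the transition a_t = -1 whenever the input bit is 1 flips the sign of the state, so the final state encodes the parity of the input, something no recurrence with transitions restricted to [0, 1] can achieve in finite precision.

```python
def parity_via_negative_eigenvalue(bits):
    """Parity of a bit string via a scalar linear recurrence h_t = a_t * h_{t-1}.

    The input selects the 1x1 state-transition "matrix": a_t = -1 flips the
    state when the bit is 1, a_t = +1 leaves it unchanged. Starting from
    h_0 = 1, the final state is (-1)**(number of ones), so the parity is
    (1 - h_T) / 2. With a_t restricted to [0, 1] no such sign flip exists.
    """
    h = 1.0
    for bit in bits:
        a_t = -1.0 if bit == 1 else 1.0
        h = a_t * h
    return int((1 - h) // 2)

assert parity_via_negative_eigenvalue([1, 0, 1, 1]) == 1  # three ones -> odd
assert parity_via_negative_eigenvalue([1, 1, 0, 0]) == 0  # two ones  -> even
```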

Empirically, the paper shows that this modification improves performance across a suite of state-tracking tasks, including parity and modular arithmetic. The results also indicate that extending the eigenvalue range does not degrade the computational efficiency or training stability of LRNN variants such as Mamba and DeltaNet.

In the theoretical analysis, the authors also draw on formal language theory to extend the findings to non-diagonal state-transition matrices, showing that products of generalized Householder (GH) matrices, each of the form identity minus a vector outer product, can represent any matrix of bounded norm. This result underpins more expressive LRNN constructions and identifies conditions under which an LRNN can emulate any finite state automaton, which is crucial for broad applicability in NLP and beyond.
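The NumPy sketch below illustrates the kind of non-diagonal transitions this construction admits: a generalized Householder matrix I - beta * v v^T has eigenvalues 1 and 1 - beta, all within [-1, 1] for beta in [0, 2], and products of such matrices realize permutations of the state dimensions that no diagonal matrix can express. The example is illustrative only and is not taken from the paper.

```python
import numpy as np

def generalized_householder(v, beta):
    """Return I - beta * v v^T for a unit vector v and beta in [0, 2].

    The eigenvalues are 1 (with multiplicity n-1) and 1 - beta, so all of
    them lie in [-1, 1]; beta = 2 gives a reflection with eigenvalue -1.
    """
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - beta * np.outer(v, v)

# A single GH matrix with beta = 2 is a reflection: here it swaps two coordinates.
swap_01 = generalized_householder(np.array([1.0, -1.0, 0.0]), beta=2.0)
swap_12 = generalized_householder(np.array([0.0, 1.0, -1.0]), beta=2.0)

# Products of GH matrices reach transitions no single diagonal matrix can
# express, e.g. a 3-cycle permutation of the state dimensions.
three_cycle = swap_01 @ swap_12
print(np.round(three_cycle))       # permutation matrix implementing a 3-cycle
print(np.linalg.eigvals(swap_01))  # eigenvalues {1, 1, -1}, all in [-1, 1]
```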

Moving from expressivity to tasks of practical import, the research argues that incorporating negative eigenvalues makes LRNNs better suited to composite AI tasks involving nested and sequential state dependencies. Notably, these capabilities are gained without altering the architecture's depth or its training and inference cost, an important step toward efficient, scalable networks that support the full scope of regular language recognition.

Potential applications and implications of this work are extensive. The proposed architectural enhancements imply that LRNNs can be employed effectively in areas requiring rigorous state-tracking across long temporal contexts, such as real-time language processing in streaming data scenarios, or systematic exploration and simulation scenarios in strategy and games. Furthermore, the insights garnered here underline pathways for synthesizing hybrid models that might draw on both the favorable scaling properties of LRNNs and the expressive prowess of extended eigenvalue ranges, potentially leading to models that approach the adaptability of human intelligence over long sequences.

Future work might explore the balance between expressive potential and training complexity, as well as hybrid architectures that merge the formal language-friendly features of LRNNs with Transformer-like parallelism. Extending the theoretical underpinning of how eigenvalue diversity within transition matrices impacts tasks under different linguistic hierarchies could unveil further relationships between mathematical properties of RNNs and practical application scenarios, guiding the design of the next generation of efficient, task-specific intelligent agents.
