
LongSSM: On the Length Extension of State-space Models in Language Modelling

Published 4 Jun 2024 in cs.CL, cs.AI, cs.LG, and math.DS | (2406.02080v1)

Abstract: In this paper, we investigate the length-extension of state-space models (SSMs) in language modeling. Length extension involves training models on short sequences and testing them on longer ones. We show that state-space models trained with zero hidden states initialization have difficulty doing length extension. We explain this difficulty by pointing out the length extension is equivalent to polynomial extrapolation. Based on the theory, we propose a simple yet effective method - changing the hidden states initialization scheme - to improve the length extension. Moreover, our method shows that using long training sequence length is beneficial but not necessary to length extension. Changing the hidden state initialization enables the efficient training of long-memory model with a smaller training context length.

Summary

  • The paper introduces a novel hidden state initialization that converts the extrapolation challenge into interpolation, enhancing length extension in state-space models.
  • Experimental results show that initializing with previous hidden states outperforms zero initialization, maintaining robust performance up to 32768 tokens.
  • The study bridges theoretical insights from polynomial extrapolation with practical training methods, paving the way for more efficient language modelling.

The paper "LongSSM: On the Length Extension of State-space Models in Language Modelling" by Shida Wang addresses the challenge of length extension in state-space models (SSMs): training models on short sequences and testing them on longer ones. The central observation is that SSMs initialized with zero hidden states struggle with length extension. The paper attributes this difficulty to an equivalence between length extension and polynomial extrapolation, which is inherently hard.
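The extrapolation-versus-interpolation intuition behind this claim can be seen in a minimal numerical sketch (illustrative only, not the paper's construction): a polynomial fitted to a smooth function on a "training" interval stays accurate inside that interval but degrades rapidly outside it.

```python
import numpy as np

# Fit a degree-8 polynomial to exp(x) on the "training" interval [0, 1].
xs_train = np.linspace(0.0, 1.0, 12)
coeffs = np.polyfit(xs_train, np.exp(xs_train), deg=8)

def max_abs_err(lo, hi, n=200):
    """Worst-case fit error against exp(x) on [lo, hi]."""
    xs = np.linspace(lo, hi, n)
    return float(np.max(np.abs(np.polyval(coeffs, xs) - np.exp(xs))))

err_interp = max_abs_err(0.0, 1.0)  # inside the fitted interval: tiny
err_extrap = max_abs_err(1.0, 3.0)  # outside the interval: grows rapidly
```

The gap between `err_interp` and `err_extrap` is several orders of magnitude, which mirrors why a model whose length-extension behaviour reduces to polynomial extrapolation degrades past its training length.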

Key Contributions

  1. State-space Models and Length Extension: The paper begins by contrasting SSMs with attention-based transformers, emphasizing their suitability for maintaining long-term dependencies despite their recurrent nature. The challenge outlined is that while SSMs exhibit "infinite-in-time" memory, they often falter when required to extrapolate beyond their training sequence length.
  2. Length Extension Definition: Three types of length-extension capability are defined: strong, weak, and no length extension. For weak length extension, the aim is a monotonic decrease in perplexity as the sequence lengthens, indicating that the model retains its predictive power on longer inputs.
  3. Model Initialization: The paper proposes changing the hidden states initialization from zero to using previous hidden states (truncated backpropagation through time) to improve length extension. This technique effectively shifts the extrapolation problem towards interpolation, which is generally more manageable.
  4. Theoretical Analysis: A thorough theoretical analysis is provided, showing that the difficulty in length extension for zero-initialized hidden states is equivalent to polynomial extrapolation. In contrast, initializing hidden states with previous values transforms the task into one of interpolation, reducing the overall error.
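The proposed initialization change can be sketched with a toy linear SSM. Carrying each chunk's final hidden state into the next chunk (as in truncated backpropagation through time) reproduces the full-sequence scan exactly, while resetting to zero at every chunk boundary discards all pre-chunk history. The diagonal transition and dimensions below are illustrative choices, not the paper's architecture.

```python
import numpy as np

def ssm_scan(u, A, B, C, h0):
    """Linear SSM recurrence: h_t = A h_{t-1} + B u_t, y_t = C h_t."""
    h = h0.copy()
    ys = []
    for u_t in u:
        h = A @ h + B * u_t
        ys.append(C @ h)
    return np.array(ys), h

def chunked_scan(u, A, B, C, chunk_len, carry_state):
    """Process a long sequence in short chunks.

    carry_state=True passes each chunk's final hidden state into the next
    chunk (previous-state initialization); carry_state=False resets the
    state to zero at every chunk boundary (zero initialization)."""
    h = np.zeros(A.shape[0])
    out = []
    for s in range(0, len(u), chunk_len):
        if not carry_state:
            h = np.zeros(A.shape[0])  # zero init: pre-chunk history is lost
        ys, h = ssm_scan(u[s:s + chunk_len], A, B, C, h)
        out.append(ys)
    return np.concatenate(out)
```

With `carry_state=True` the chunked scan matches the full-sequence scan exactly; with `carry_state=False` the outputs diverge after the first chunk, which is precisely the history a zero-initialized model never sees.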

Experimental Results

Zero vs. Previous Initialization:

  • Models trained with zero-initialized hidden states demonstrate significant performance degradation beyond sequence lengths of 1024.
  • In contrast, models trained using previous hidden state initialization show robust length extension up to sequence lengths of 32768 without requiring overly long training sequences.
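Evaluating these regimes amounts to measuring perplexity at increasing sequence lengths and checking how the curve behaves past the training length. A small helper along these lines is sketched below; the 10% tolerance and the exact criteria are illustrative placeholders, not the paper's formal definitions.

```python
def classify_length_extension(ppl_by_len, train_len):
    """Label a perplexity-vs-evaluation-length curve.

    ppl_by_len maps evaluation length -> perplexity. "strong" means
    perplexity keeps (weakly) decreasing past train_len, "weak" means it
    stays close to its value at train_len, "none" means it blows up.
    Thresholds here are illustrative, not from the paper."""
    lens = sorted(ppl_by_len)
    base = ppl_by_len[max(l for l in lens if l <= train_len)]
    beyond = [ppl_by_len[l] for l in lens if l > train_len]
    if beyond and all(p <= base for p in beyond) and beyond == sorted(beyond, reverse=True):
        return "strong"
    if all(p <= 1.1 * base for p in beyond):
        return "weak"
    return "none"
```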

Length Extension and Model Size:

  • Larger models with zero initialization exhibit worse length extension, necessitating longer training sequences to mitigate overfitting.
  • The proposed change (initializing hidden states with previous values) allows models to generalize better even with shorter training sequences, drastically reducing GPU memory requirements.

Training Stability:

  • An additional challenge identified is training instability when previous-initialized hidden states are used in larger models; it becomes particularly notable at the 140M-parameter scale.

Implications

  1. Practical Applications: The proposed methodology offers a practical way around the computational cost of training models on long sequences. This is especially relevant for applications requiring long-context understanding, such as language modeling for novel writing or autonomous driving.
  2. Theoretical Developments: The paper provides a bridge between theoretical challenges in polynomial extrapolation and practical training methods in state-space models. This connection underscores the need for further exploration into stable training methods that maintain long-term dependencies without overfitting or instability.
  3. Future Research: The insights gathered strongly suggest the need for more robust methods to manage hidden state dynamics. Future developments could focus on stabilizing previous-initialized hidden states to harness their benefits without the associated training instability.

Conclusion

The paper "LongSSM: On the Length Extension of State-space Models in Language Modelling" makes significant contributions to the understanding and enhancement of length extension capabilities in state-space models. By addressing the limitations of zero-initialized hidden states, proposing a novel initialization scheme, and validating through comprehensive experiments, the paper paves the way for more efficient and effective language modeling techniques. Further research in stabilizing the training process could unlock even greater potential for SSMs in handling long-context sequences proficiently.
