Gated Slot Attention for Efficient Linear-Time Sequence Modeling (2409.07146v2)

Published 11 Sep 2024 in cs.CL

Abstract: Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant resources for training from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via $\operatorname{softmax}$, utilizing context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining compact recurrent state size. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and reduced state size. Additionally, retaining the $\operatorname{softmax}$ operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
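
For reference, the "two-layer GLA linked via $\operatorname{softmax}$" description above can be written as a gated slot recurrence. The formulation below is a reconstruction for illustration rather than a verbatim quote of the paper's equations; it assumes $m$ memory slots, a data-dependent per-slot forget gate $\alpha_t \in (0,1)^m$, and slot memories $\tilde{K}_t, \tilde{V}_t \in \mathbb{R}^{m \times d}$:

$$
\begin{aligned}
\tilde{K}_t &= \operatorname{Diag}(\alpha_t)\,\tilde{K}_{t-1} + (1-\alpha_t)\,k_t^\top,\\
\tilde{V}_t &= \operatorname{Diag}(\alpha_t)\,\tilde{V}_{t-1} + (1-\alpha_t)\,v_t^\top,\\
o_t &= \tilde{V}_t^\top \operatorname{softmax}\!\left(\tilde{K}_t\, q_t\right).
\end{aligned}
$$

Under this reading, $\operatorname{softmax}(\tilde{K}_t q_t)$ acts as the query for a second gated pass over $\tilde{V}_t$, and the recurrent state stays at $2md$ entries per head regardless of sequence length.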

An Evaluation of Gated Slot Attention for Sequence Modeling

The paper presents the development and evaluation of Gated Slot Attention (GSA), an architecture aimed at improving the efficiency and effectiveness of sequence modeling, particularly in settings that demand strong in-context recall and efficient finetuning of pretrained Transformers. The work seeks to alleviate the frequently cited training and inference costs of standard Transformers by combining a gated variant of linear attention with a bounded-memory slot design.

Central to the paper is a reformulation of the Attention with Bounded-Memory Control (ABC) model, which in its original form suffers from slow training and mediocre language modeling performance. Integrating the gating mechanism of Gated Linear Attention (GLA) into the ABC framework yields GSA: effectively a two-layer GLA linked via softmax, providing context-aware memory reading and adaptive forgetting during both training and inference.
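
To make this concrete, here is a minimal per-token sketch of that gated slot recurrence in PyTorch. It follows the reconstructed equations given after the abstract above; names and shapes are illustrative, and it is the slow recurrent form, not the hardware-efficient chunkwise training algorithm inherited from GLA.

```python
import torch
import torch.nn.functional as F

def gsa_recurrent_step(q_t, k_t, v_t, alpha_t, K_slots, V_slots):
    """One recurrent GSA step (illustrative sketch, not the official kernel).

    q_t, k_t, v_t : (d,)   per-token query / key / value
    alpha_t       : (m,)   per-slot forget gate in (0, 1)
    K_slots       : (m, d) slot key memory
    V_slots       : (m, d) slot value memory
    """
    # Gated write: decay old slot contents, write the new key/value with strength (1 - alpha).
    K_slots = alpha_t[:, None] * K_slots + (1 - alpha_t)[:, None] * k_t[None, :]
    V_slots = alpha_t[:, None] * V_slots + (1 - alpha_t)[:, None] * v_t[None, :]

    # Context-aware read: softmax over the m slots, then mix the slot values.
    slot_scores = K_slots @ q_t                 # (m,)
    read_weights = F.softmax(slot_scores, dim=-1)
    o_t = V_slots.T @ read_weights              # (d,)
    return o_t, K_slots, V_slots
```

During training the paper relies on GLA-style chunkwise parallelism; a step like this one corresponds to the constant-memory recurrent inference mode.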

Performance and Evaluation

The paper reports extensive empirical evaluations of GSA across language modeling and understanding tasks. GSA achieves performance comparable or superior to existing linear-attention models in contexts requiring in-context recall, supporting the claim that its benefits extend beyond standard language modeling benchmarks to recall-intensive workloads. Furthermore, GSA fits the "finetuning pretrained Transformers to RNNs" (T2R) paradigm more efficiently than comparable linear models, requiring fewer resources and less training data.

A focal point of the assessment was finetuning GSA from Mistral 7B, a pretrained LLM, which yielded a pronounced advantage over other linear models such as RWKV6-7B and Mamba-7B on recall-intensive tasks. This finding supports the paper's contention that retaining the softmax operator mitigates the mismatch that typically arises when pretrained softmax-attention models are converted to linear-attention counterparts.
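
Part of the appeal in the T2R setting is that GSA's slot read keeps a softmax, so a pretrained layer's query/key/value projections can plausibly be carried over while only the gating (and any slot-specific) parameters are newly learned. The wrapper below is a hypothetical illustration of that wiring; attribute names such as `q_proj` and the choice to add a fresh gate projection are assumptions, not the paper's exact conversion recipe.

```python
import torch
import torch.nn as nn

class GSAFromPretrainedAttention(nn.Module):
    """Hypothetical T2R-style conversion: keep the pretrained q/k/v projections,
    add a small gate projection, and replace softmax attention with the GSA
    recurrence sketched earlier. Illustrative only."""

    def __init__(self, pretrained_attn: nn.Module, num_slots: int = 64):
        super().__init__()
        # Reused from the pretrained Transformer layer (assumed attribute names).
        self.q_proj = pretrained_attn.q_proj
        self.k_proj = pretrained_attn.k_proj
        self.v_proj = pretrained_attn.v_proj
        d_model = self.q_proj.in_features
        # Newly initialized: per-slot forget gates, learned during finetuning.
        self.gate_proj = nn.Linear(d_model, num_slots)

    def step_gates(self, x_t: torch.Tensor) -> torch.Tensor:
        # Data-dependent forget gate in (0, 1) for each of the m slots.
        return torch.sigmoid(self.gate_proj(x_t))
```

In a full conversion, the forward pass would run the recurrent step sketched earlier over the projected queries, keys, values, and gates.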

Implications and Future Directions

The theoretical implications of the research suggest a potential shift in sequence modeling towards architectures that explicitly trade recurrent state size against recall performance. By retaining the softmax operation while keeping the recurrent state small, GSA broadens the applicability of linear attention models for deployment in resource-constrained environments. This development not only promises faster training and inference but also suggests an incremental expansion of the Pareto frontier of the recall-memory tradeoff.

Practically, GSA extends the ability of linear models to scale efficiently, avoiding the prohibitive computational cost of training Transformer-scale models from scratch on large datasets. This dual focus on training efficiency and model quality is of particular importance given the growing demand for scalable models that maintain competitive accuracy.

Future directions for this line of research could explore the incorporation of hybrid GSA-attention models, which may further optimize performance on tasks requiring both sequential precision and extensive context comprehension. Additionally, advancing the integration of delta-update rules may further enhance GSA's capability for in-context retrieval.

In summary, this paper provides a methodological advancement through Gated Slot Attention, addressing key deficiencies in linear sequence modeling and positioning itself as a viable alternative architecture for tasks that demand efficient memory and inference operations.

Authors (12)
  1. Yu Zhang (1399 papers)
  2. Songlin Yang (42 papers)
  3. Ruijie Zhu (22 papers)
  4. Yue Zhang (618 papers)
  5. Leyang Cui (50 papers)
  6. Yiqiao Wang (6 papers)
  7. Bolun Wang (15 papers)
  8. Freda Shi (16 papers)
  9. Bailin Wang (34 papers)
  10. Wei Bi (62 papers)
  11. Peng Zhou (136 papers)
  12. Guohong Fu (22 papers)
Citations (3)