Pretraining Without Attention

Published 20 Dec 2022 in cs.CL and cs.LG | (2212.10544v2)

Abstract: Transformers have been essential to pretraining success in NLP. While other architectures have been used, downstream accuracy is either significantly worse, or requires attention layers to match standard benchmarks such as GLUE. This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs). Our proposed model, Bidirectional Gated SSM (BiGS), combines SSM layers with a multiplicative gating architecture that has been effective in simplified sequence modeling architectures. The model learns static layers that do not consider pair-wise interactions. Even so, BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation. Analysis shows that while the models have similar average accuracy, the approach has different inductive biases than BERT in terms of interactions and syntactic representations. All models from this work are available at https://github.com/jxiw/BiGS.


Authors (4)


Summary

The paper demonstrates that BiGS, using state-space models and multiplicative gating, achieves competitive accuracy on GLUE without attention mechanisms.
BiGS matches BERT’s GLUE performance across pretraining budgets ranging from roughly 11B to 97B tokens.
The work highlights BiGS's capacity to model long-range dependencies and syntactic patterns, offering a promising alternative to Transformer architectures.

Pretraining Without Attention: A Critical Evaluation

The paper "Pretraining Without Attention" by Wang et al. proposes the Bidirectional Gated SSM (BiGS), an architecture that achieves competitive pretraining results in NLP without relying on attention mechanisms. Instead of the standard Transformer, the authors route information across the sequence with state-space models (SSMs), and they examine both the computational efficiency and the representational capacity of this approach for large-scale NLP pretraining.

Key Contributions

BiGS integrates SSM layers with a multiplicative gating architecture, which lets the model reach accuracy comparable to BERT on benchmarks such as GLUE. This result shows that robust language model pretraining is feasible without the quadratic-in-sequence-length cost typical of attention mechanisms.
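To make the idea of combining SSM layers with multiplicative gating concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the diagonal SSM parameterization, the GELU gate, the residual connection, and all names (`ssm_scan`, `bidirectional_gated_block`, `init_params`, `W_gate`, `W_out`) are assumptions chosen for illustration. It only shows the general pattern of a forward and a backward linear recurrence whose sum is multiplied element-wise by a gate computed from the input.

```python
import numpy as np

def ssm_scan(u, a, b, c):
    """Run a diagonal linear SSM along the time axis (illustrative only).

    u: (L, D) inputs; a, b, c: (D, N) per-channel parameters.
    Recurrence: x_k = a * x_{k-1} + b * u_k,  y_k = sum_n c_n * x_{k,n}.
    """
    L, D = u.shape
    x = np.zeros_like(a)
    y = np.empty((L, D))
    for k in range(L):
        x = a * x + b * u[k][:, None]   # state update, no pairwise token interaction
        y[k] = (c * x).sum(axis=1)      # read out one value per channel
    return y

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def bidirectional_gated_block(u, params):
    """Hypothetical gated block: gate(u) * (forward SSM + backward SSM)."""
    fwd = ssm_scan(u, *params["fwd"])               # left-to-right context
    bwd = ssm_scan(u[::-1], *params["bwd"])[::-1]   # right-to-left context
    gate = gelu(u @ params["W_gate"])               # position-wise gate
    mixed = gate * (fwd + bwd)                      # element-wise multiplicative gating
    return u + mixed @ params["W_out"]              # residual connection (assumed)

def init_params(D, N, seed=0):
    """Random parameters for the sketch; the value ranges are arbitrary assumptions."""
    rng = np.random.default_rng(seed)
    def ssm():
        a = rng.uniform(0.9, 0.99, size=(D, N))     # decay rates inside the unit interval
        b = rng.normal(size=(D, N)) / N
        c = rng.normal(size=(D, N))
        return a, b, c
    return {
        "fwd": ssm(),
        "bwd": ssm(),
        "W_gate": rng.normal(size=(D, D)) / np.sqrt(D),
        "W_out": rng.normal(size=(D, D)) / np.sqrt(D),
    }

if __name__ == "__main__":
    tokens = np.random.default_rng(1).normal(size=(16, 8))   # 16 positions, 8 channels
    out = bidirectional_gated_block(tokens, init_params(D=8, N=4))
    print(out.shape)
```

In this sketch the mixing weights between positions depend only on the learned parameters `a`, `b`, `c`, not on the token contents, which is one way to read the abstract's statement that the model "learns static layers that do not consider pair-wise interactions". The gate is the only place where the input modulates the mixed signal.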

Numerical Results

The paper highlights several numerical outcomes that illustrate the viability of the proposed methods:

In short training scenarios (~11B tokens), BiGS matches BERT's reported accuracy on GLUE, with an average score of 83.3.
For medium-scale training (~29B tokens), the model maintains parity with traditional attention-based architectures.
In full training conditions (~97B tokens), BiGS achieves an average accuracy of 85.8, affirming its competitive nature under extensive pretraining.

Taken together, these results indicate that BiGS remains competitive with BERT as the pretraining budget grows, and they support SSM-based models as a viable alternative to attention-based pretraining architectures.

Theoretical and Practical Implications

BiGS relies on the SSM's ability to model long-range dependencies. This is particularly useful when scaling to longer inputs: BiGS can be pretrained on sequences of up to 4096 tokens without approximation. The combination of SSM layers and element-wise multiplicative gating avoids the accuracy degradation typically seen in architectures that lack attention, while sidestepping attention's quadratic cost in sequence length.
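One way to see why an SSM layer can process 4096 tokens exactly is that a linear recurrence unrolls into a single long convolution, which an FFT evaluates in roughly L log L operations rather than the L² pairwise scores of full attention. The sketch below checks this equivalence numerically for a single channel. The diagonal parameterization and the function names (`ssm_recurrent`, `ssm_kernel`, `ssm_fft`) are assumptions for illustration and are not taken from the paper's code.

```python
import numpy as np

def ssm_recurrent(u, a, b, c):
    """Step-by-step recurrence: x_k = a * x_{k-1} + b * u_k, y_k = c . x_k."""
    x = np.zeros_like(a)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = a * x + b * u_k
        y[k] = np.dot(c, x)
    return y

def ssm_kernel(a, b, c, L):
    """Unrolled convolution kernel: K[j] = sum_n c_n * a_n**j * b_n."""
    powers = a[None, :] ** np.arange(L)[:, None]   # shape (L, N)
    return powers @ (c * b)

def ssm_fft(u, a, b, c):
    """Same output as ssm_recurrent, computed as a causal convolution via FFT."""
    L = len(u)
    K = ssm_kernel(a, b, c, L)
    n = 2 * L                                      # zero-pad so circular conv equals linear conv
    y = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(K, n), n)
    return y[:L]                                   # keep the causal part

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, N = 4096, 16                                # sequence length and state size (N is arbitrary)
    a = rng.uniform(0.5, 0.99, size=N)             # stable decay rates
    b = rng.normal(size=N)
    c = rng.normal(size=N)
    u = rng.normal(size=L)
    max_diff = np.max(np.abs(ssm_recurrent(u, a, b, c) - ssm_fft(u, a, b, c)))
    print("max abs difference:", max_diff)
```

The two paths should agree up to floating-point rounding, so the convolutional form is an exact reformulation rather than a sparse or low-rank approximation. For scale, at L = 4096 full attention forms 4096² = 16,777,216 pairwise scores per head, while L·log₂L is 4096 × 12 = 49,152, ignoring constant factors.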

From a theoretical perspective, the model has different inductive biases from conventional Transformers. The work suggests that these biases may favor syntactic pattern recognition, as evidenced by the model's performance on tasks that target syntactic phenomena and intricate syntactic dependencies.

Future Directions in AI Research

This research opens several avenues for future exploration. An important next step is to evaluate SSM-based models on non-English corpora and multilingual tasks, which would test how well the approach carries over to more diverse, real-world applications. In addition, improving the computational efficiency of SSM implementations on mainstream hardware could enable broader use in resource-constrained settings.

The findings also prompt further investigation into how language models represent syntax, which could deepen our understanding of different inductive biases and their implications for language representation and processing.

Conclusion

The BiGS framework makes a case for rethinking sequence modeling in pretraining. The method matches the performance of established attention-based models, and it does so with advantages in computational efficiency on long sequences and, potentially, in capturing syntactic nuances. These contributions are a step toward more efficient and flexible pretrained language models for NLP and its applications.
