
ReGLA: Refining Gated Linear Attention (2502.01578v2)

Published 3 Feb 2025 in cs.CL

Abstract: Recent LLMs have set themselves apart with exceptional performance on complex language modeling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity inherent in standard transformers. In this work, we conducted a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function that addresses crucial issues previous proposals overlooked. We then offered further rationale for integrating normalization layers to stabilize training. Moreover, we explored the saturation phenomenon of the gating mechanism and augmented it with a refining module. We conducted extensive experiments and showed that our architecture outperforms previous Gated Linear Attention mechanisms across a wide range of tasks, including training from scratch and post-linearization with continual pre-training.
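
The abstract refers to the three components of a gated linear attention layer: a feature map applied to queries and keys, a normalization term, and a per-step gate on the recurrent state. The following is a minimal sketch of a generic gated linear attention step in its recurrent form, assuming an ELU+1 feature map and an elementwise sigmoid gate; it illustrates where those components sit, not the paper's exact ReGLA design or refinements.

```python
# Sketch of one gated linear attention layer in recurrent form (assumed design,
# not the paper's ReGLA architecture): a positive feature map phi, a gated
# key-value state S, and a gated normalizer z for the read-out.
import torch

def gated_linear_attention(q, k, v, gate, eps=1e-6):
    """q, k: (T, d_k); v: (T, d_v); gate: (T, d_k) with values in (0, 1)."""
    T, d_k = q.shape
    d_v = v.shape[-1]
    phi = lambda x: torch.nn.functional.elu(x) + 1.0  # positive feature map (assumed choice)
    S = torch.zeros(d_k, d_v)   # running key-value state
    z = torch.zeros(d_k)        # running normalizer
    outputs = []
    for t in range(T):
        q_t, k_t, v_t, g_t = phi(q[t]), phi(k[t]), v[t], gate[t]
        S = g_t.unsqueeze(-1) * S + torch.outer(k_t, v_t)  # gated state update
        z = g_t * z + k_t                                  # gated normalizer update
        out = (q_t @ S) / (q_t @ z + eps)                  # normalized read-out
        outputs.append(out)
    return torch.stack(outputs)

# Example usage with random inputs
T, d_k, d_v = 8, 16, 32
q, k, v = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_v)
gate = torch.sigmoid(torch.randn(T, d_k))
print(gated_linear_attention(q, k, v, gate).shape)  # torch.Size([8, 32])
```

Because the state update is recurrent, each step costs O(d_k * d_v) regardless of sequence length, which is the linear space-time trade-off the abstract contrasts with quadratic softmax attention.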

Authors (5)
  1. Peng Lu (86 papers)
  2. Ivan Kobyzev (23 papers)
  3. Mehdi Rezagholizadeh (78 papers)
  4. Boxing Chen (67 papers)
  5. Philippe Langlais (23 papers)
