
Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers (2406.11274v1)

Published 17 Jun 2024 in cs.CL

Abstract: The Transformer architecture has significantly advanced deep learning, particularly in natural language processing, by effectively managing long-range dependencies. However, as the demand for understanding complex relationships grows, refining the Transformer's architecture becomes critical. This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models by enabling direct attention between non-adjacent layers. This method improves the model's ability to capture dependencies between high-level abstract features and low-level details. By facilitating direct attention between these diverse feature levels, our approach overcomes the limitations of current Transformers, which often rely on suboptimal intra-layer attention. Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer, thus enhancing the diversity of multi-head attention without additional computational burden. Extensive experiments demonstrate that our enhanced Transformer model achieves superior performance in language modeling tasks, highlighting the effectiveness of our skip-layer attention mechanism.
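
The mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is one plausible reading under stated assumptions. The assumption here is that the attention heads are split into two groups: some heads take keys and values from the current layer's hidden states, and the rest take them from the hidden states of one earlier layer, while all queries come from the current layer. Because the number of heads and their sizes are unchanged, the attention cost stays the same as in standard multi-head attention. The names `skip_layer_attention`, `n_skip`, `h_cur`, and `h_prev` are hypothetical, and the causal mask and output projection are omitted for brevity.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(q_src, kv_src, wq, wk, wv):
    """One scaled dot-product attention head (no causal mask, for brevity)."""
    q = matmul(q_src, wq)
    k = matmul(kv_src, wk)
    v = matmul(kv_src, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)

def skip_layer_attention(h_cur, h_prev, heads, n_skip):
    """Multi-head attention where the first n_skip heads read keys/values
    from an earlier layer (h_prev) and the remaining heads read them from
    the current layer (h_cur). Queries always come from h_cur.
    ASSUMPTION: this head split is an illustrative reading of the abstract,
    not a confirmed detail of the paper."""
    outputs = []
    for i, (wq, wk, wv) in enumerate(heads):
        kv_src = h_prev if i < n_skip else h_cur
        outputs.append(attention_head(h_cur, kv_src, wq, wk, wv))
    # Concatenate the per-head outputs along the feature dimension.
    return [sum((out[t] for out in outputs), []) for t in range(len(h_cur))]

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1], used only for the demo."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, n_heads, d_head = 4, 8, 4, 2
h_prev = rand_matrix(seq_len, d_model, rng)   # hidden states from an earlier layer
h_cur = rand_matrix(seq_len, d_model, rng)    # hidden states from the current layer
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]

out = skip_layer_attention(h_cur, h_prev, heads, n_skip=2)
print(len(out), len(out[0]))  # sequence length, n_heads * d_head
```

Setting `n_skip=0` in this sketch reduces it to ordinary multi-head self-attention, which makes the skip-layer heads easy to compare against a baseline.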

Authors (10)
  1. Qian Chen (264 papers)
  2. Wen Wang (144 papers)
  3. Qinglin Zhang (30 papers)
  4. Siqi Zheng (61 papers)
  5. Shiliang Zhang (132 papers)
  6. Chong Deng (22 papers)
  7. Hai Yu (40 papers)
  8. Jiaqing Liu (20 papers)
  9. Yukun Ma (33 papers)
  10. Chong Zhang (137 papers)
Citations (1)