Superiority of Multi-Head Attention in In-Context Linear Regression (2401.17426v1)

Published 30 Jan 2024 in cs.LG, cs.AI, and stat.ML

Abstract: We present a theoretical analysis of the performance of transformers with softmax attention on in-context linear regression tasks. While the existing literature predominantly focuses on the convergence of transformers with single-/multi-head attention, our research centers on comparing their performance. We conduct an exact theoretical analysis to demonstrate that multi-head attention with a substantial embedding dimension performs better than single-head attention. As the number of in-context examples D increases, the prediction loss for both single- and multi-head attention is in O(1/D), with multi-head attention attaining a smaller multiplicative constant. Beyond the simplest data distribution setting, we consider further scenarios, e.g., noisy labels, local examples, correlated features, and prior knowledge. We observe that, in general, multi-head attention is preferred over single-head attention. Our results verify the effectiveness of the multi-head attention design in the transformer architecture.
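The setting sketched in the abstract can be made concrete with a small simulation. The code below (a minimal sketch, assuming NumPy) draws random linear-regression tasks y_i = w·x_i, forms D in-context examples, and reads out a prediction for a query x_q via a softmax-attention head; the multi-head variant splits the feature dimension across heads and averages their outputs. The identity query/key maps, the head-splitting scheme, the noiseless labels, and the helper names (single_head_prediction, multi_head_prediction, icl_loss) are illustrative simplifications, not the paper's trained transformer or exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def single_head_prediction(X, y, x_q):
    """One softmax attention head over the D in-context examples.

    Scores are raw dot products x_q . x_i (identity query/key maps,
    a simplification of trained projections); values are the labels y_i.
    """
    d = X.shape[1]
    scores = X @ x_q / np.sqrt(d)
    return softmax(scores) @ y

def multi_head_prediction(X, y, x_q, n_heads):
    """Split the feature dimension into n_heads slices; each head attends
    using its own slice, and head outputs are averaged (a stand-in for a
    learned output projection)."""
    d = X.shape[1]
    assert d % n_heads == 0
    d_h = d // n_heads
    preds = []
    for h in range(n_heads):
        sl = slice(h * d_h, (h + 1) * d_h)
        scores = X[:, sl] @ x_q[sl] / np.sqrt(d_h)
        preds.append(softmax(scores) @ y)
    return float(np.mean(preds))

def icl_loss(predict, d=8, D=64, n_tasks=2000, **kw):
    """Average squared error of the attention read-out over random
    noiseless linear-regression tasks y_i = w . x_i."""
    errs = []
    for _ in range(n_tasks):
        w = rng.standard_normal(d)
        X = rng.standard_normal((D, d))
        y = X @ w
        x_q = rng.standard_normal(d)
        errs.append((predict(X, y, x_q, **kw) - x_q @ w) ** 2)
    return float(np.mean(errs))

if __name__ == "__main__":
    for D in (16, 64, 256):
        print(D,
              icl_loss(single_head_prediction, D=D),
              icl_loss(multi_head_prediction, D=D, n_heads=4))
```

Running the script prints an empirical loss for a few values of D; because the projections here are untrained, the numbers illustrate the in-context setup rather than the exact O(1/D) constants the paper derives for trained single- and multi-head attention.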

Authors (5)
  1. Yingqian Cui (14 papers)
  2. Jie Ren (329 papers)
  3. Pengfei He (36 papers)
  4. Jiliang Tang (204 papers)
  5. Yue Xing (47 papers)
Citations (10)
