How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT's Attention (2011.00943v2)

Published 2 Nov 2020 in cs.CL

Abstract: Recent research on the multi-head attention mechanism, especially in pre-trained models such as BERT, has provided heuristics and clues for analyzing various aspects of the mechanism. Since most of this research focuses on probing tasks or hidden states, previous work has identified some primitive patterns of attention-head behavior through heuristic analytical methods, but a systematic analysis dedicated to the attention patterns themselves remains preliminary. In this work, we cluster attention heatmaps into significantly different patterns through unsupervised clustering on top of a set of proposed features, which corroborates previous observations. We further study the corresponding functions of these patterns through analytical study. In addition, our proposed features can be used to explain and calibrate different attention heads in Transformer models.

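The pipeline the abstract describes can be illustrated with a short sketch: extract per-head attention maps from BERT, summarize each head with distance-based features, and cluster the heads without supervision. This is not the authors' implementation; the specific feature set (mean attended distance, diagonal mass, attention entropy) and the choice of k-means with five clusters are illustrative assumptions inspired by the paper's distance-based framing.

```python
# Minimal sketch (assumed, not the paper's code): cluster BERT attention
# heads by simple distance-based features of their attention maps.
import torch
import numpy as np
from sklearn.cluster import KMeans
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Attention heads in BERT show distinct structural patterns.",
]

def head_features(attn):
    """Summarize one (seq_len, seq_len) attention map.

    Features (illustrative assumptions): mean attended token distance,
    share of mass on the diagonal, and mean row entropy.
    """
    n = attn.shape[0]
    idx = np.arange(n)
    dist = np.abs(idx[None, :] - idx[:, None])      # pairwise token offsets
    mean_dist = (attn * dist).sum(axis=1).mean()    # how far the head looks
    diag_mass = np.trace(attn) / n                  # self-attention share
    entropy = -(attn * np.log(attn + 1e-12)).sum(axis=1).mean()
    return [mean_dist, diag_mass, entropy]

features = []
with torch.no_grad():
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt")
        attentions = model(**inputs).attentions    # tuple of (1, heads, n, n)
        for layer in attentions:
            for head in layer[0]:
                features.append(head_features(head.numpy()))

# Group the head instances (12 layers x 12 heads per sentence) into
# pattern clusters; the cluster count is an arbitrary choice here.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))
```

In practice one would average the features for each head over many sentences before clustering, so that each of the 144 heads of bert-base contributes a single point; the sketch above clusters per-sentence head instances for brevity.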
Authors (5)
  1. Yue Guan (40 papers)
  2. Jingwen Leng (50 papers)
  3. Chao Li (429 papers)
  4. Quan Chen (91 papers)
  5. Minyi Guo (98 papers)
Citations (18)