
Influence Patterns for Explaining Information Flow in BERT (2011.00740v3)

Published 2 Nov 2020 in cs.CL

Abstract: While "attention is all you need" may be proving true, we do not know why: attention-based transformer models such as BERT are superior, but how information flows from input tokens to output predictions is unclear. We introduce influence patterns, abstractions of sets of paths through a transformer model. Patterns quantify and localize the flow of information to paths passing through a sequence of model nodes. Experimentally, we find that a significant portion of information flow in BERT goes through skip connections instead of attention heads. We further show that consistency of patterns across instances is an indicator of BERT's performance. Finally, we demonstrate that patterns account for far more model performance than previous attention-based and layer-based methods.
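To make the skip-connection finding concrete, below is a minimal conceptual sketch of path-wise attribution in a toy residual block, not the paper's influence-pattern method (which defines distributional influence over BERT's computation graph). The `Block` module, its `attn` linear layer standing in for an attention head, and the gradient decomposition are illustrative assumptions.

```python
import torch

# Toy residual block: output = x + branch(x), mimicking a transformer
# sublayer where the skip connection and an "attention" branch both
# carry information from input to output.
class Block(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.attn = torch.nn.Linear(d, d)  # hypothetical stand-in for an attention head

    def forward(self, x):
        return x + torch.tanh(self.attn(x))  # skip connection + branch

d = 8
block = Block(d)
x = torch.randn(1, d, requires_grad=True)

# Total influence of x on a scalar output: gradient through all paths.
y = block(x).sum()
(total_grad,) = torch.autograd.grad(y, x)

# Influence through the attention branch alone: gradient of just that path.
branch = torch.tanh(block.attn(x)).sum()
(branch_grad,) = torch.autograd.grad(branch, x)

# What remains is the skip-path influence (here exactly 1 per coordinate,
# since the skip connection is the identity map).
skip_grad = total_grad - branch_grad
print("attention-branch share:", branch_grad.abs().sum().item())
print("skip-path share:", skip_grad.abs().sum().item())
```

In this sketch the skip path's contribution is often comparable to or larger than the branch's, which is the intuition behind the paper's observation that much of BERT's information flow bypasses attention heads.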

Authors (4)
  1. Kaiji Lu (5 papers)
  2. Zifan Wang (75 papers)
  3. Piotr Mardziel (18 papers)
  4. Anupam Datta (51 papers)
Citations (13)