Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
89 tokens/sec
Gemini 2.5 Pro Premium
50 tokens/sec
GPT-5 Medium
29 tokens/sec
GPT-5 High Premium
28 tokens/sec
GPT-4o
90 tokens/sec
DeepSeek R1 via Azure Premium
55 tokens/sec
GPT OSS 120B via Groq Premium
468 tokens/sec
Kimi K2 via Groq Premium
207 tokens/sec
2000 character limit reached

Quantifying Context Mixing in Transformers (2301.12971v2)

Published 30 Jan 2023 in cs.CL and cs.LG

Abstract: Self-attention weights and their transformed variants have been the main source of information for analyzing token-to-token interactions in Transformer-based models. But despite their ease of interpretation, these weights are not faithful to the models' decisions as they are only one part of an encoder, and other components in the encoder layer can have considerable impact on information mixing in the output representations. In this work, by expanding the scope of analysis to the whole encoder block, we propose Value Zeroing, a novel context mixing score customized for Transformers that provides us with a deeper understanding of how information is mixed at each encoder layer. We demonstrate the superiority of our context mixing score over other analysis methods through a series of complementary evaluations with different viewpoints based on linguistically informed rationales, probing, and faithfulness analysis.

Citations (21)

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.