Alleviating the Inequality of Attention Heads for Neural Machine Translation

Published 21 Sep 2020 in cs.CL | (2009.09672v2)

Abstract: Recent studies show that the attention heads in the Transformer are not equally important. We relate this phenomenon to the imbalanced training of multi-head attention and to the model's dependence on specific heads. To tackle this problem, we propose a simple masking method, HeadMask, in two specific variants. Experiments show translation improvements on multiple language pairs. Subsequent empirical analyses support our assumption and confirm the effectiveness of the method.
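The abstract does not spell out how HeadMask selects heads, so the following is only a minimal sketch of the general idea of masking attention heads during training, not the authors' implementation. It assumes that heads are chosen uniformly at random and that a masked head's output is zeroed before the per-head outputs are concatenated. The names `random_head_mask` and `apply_head_mask`, and the toy dimensions, are hypothetical.

```python
import random


def random_head_mask(num_heads, num_masked, rng):
    """Return a 0/1 list of length num_heads with num_masked zeros.

    Assumption: heads are picked uniformly at random; the abstract
    does not state the selection rule.
    """
    masked = set(rng.sample(range(num_heads), num_masked))
    return [0.0 if h in masked else 1.0 for h in range(num_heads)]


def apply_head_mask(head_outputs, mask):
    """Zero out masked heads, then concatenate heads per position.

    head_outputs[h][t] is the d_head-dimensional output of head h at
    position t. The result has one row per position, with the head
    outputs laid side by side (num_heads * d_head values per row).
    """
    seq_len = len(head_outputs[0])
    rows = []
    for t in range(seq_len):
        row = []
        for h, keep in enumerate(mask):
            row.extend(keep * x for x in head_outputs[h][t])
        rows.append(row)
    return rows


# Toy example with hypothetical sizes: 8 heads, 5 positions, 4 dims per head.
rng = random.Random(0)
num_heads, seq_len, d_head = 8, 5, 4
heads = [
    [[rng.gauss(0.0, 1.0) for _ in range(d_head)] for _ in range(seq_len)]
    for _ in range(num_heads)
]
mask = random_head_mask(num_heads, num_masked=2, rng=rng)
out = apply_head_mask(heads, mask)
```

In a real training loop a fresh mask would presumably be drawn for each batch and disabled at inference time, so that no single head can be relied on exclusively; that schedule is likewise an assumption here.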