Multi-branch Attentive Transformer (2006.10270v2)

Published 18 Jun 2020 in cs.CL

Abstract: While the multi-branch architecture is one of the key ingredients to the success of computer vision tasks, it has not been well investigated in natural language processing, especially sequence learning tasks. In this work, we propose a simple yet effective variant of Transformer called multi-branch attentive Transformer (briefly, MAT), where the attention layer is the average of multiple branches and each branch is an independent multi-head attention layer. We leverage two training techniques to regularize the training: drop-branch, which randomly drops individual branches during training, and proximal initialization, which uses a pre-trained Transformer model to initialize multiple branches. Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements. Our code is available at https://github.com/HA-Transformer.
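
The branch-averaged attention layer and the drop-branch trick described in the abstract can be sketched roughly as below. This is a minimal PyTorch sketch based only on the abstract: the class name MultiBranchAttention, the drop_branch_p parameter, and the plain mean over branches are illustrative assumptions, not the authors' implementation (see the repository linked above for that).

```python
import torch
import torch.nn as nn


class MultiBranchAttention(nn.Module):
    """Sketch of a multi-branch attention layer: the output is the average of
    several independent multi-head attention branches, with drop-branch
    regularization applied during training."""

    def __init__(self, d_model: int, n_heads: int, n_branches: int = 2,
                 drop_branch_p: float = 0.2):
        super().__init__()
        # Each branch is an independent multi-head attention layer.
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_branches)
        )
        self.drop_branch_p = drop_branch_p  # probability of dropping each branch

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        outputs = []
        for branch in self.branches:
            out, _ = branch(x, x, x, attn_mask=attn_mask)  # self-attention
            # Drop-branch: randomly zero out individual branches while training.
            if self.training and torch.rand(()) < self.drop_branch_p:
                out = torch.zeros_like(out)
            outputs.append(out)
        # Average the branches (a simple mean here; any renormalization of
        # surviving branches would follow the paper's released code).
        return torch.stack(outputs, dim=0).mean(dim=0)
```

Proximal initialization, as described in the abstract, would then amount to copying the attention weights of a pre-trained Transformer into every branch before training, e.g. loading the same attention state_dict into each nn.MultiheadAttention module in self.branches.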

Authors (7)
  1. Yang Fan (27 papers)
  2. Shufang Xie (29 papers)
  3. Yingce Xia (53 papers)
  4. Lijun Wu (113 papers)
  5. Tao Qin (201 papers)
  6. Xiang-Yang Li (77 papers)
  7. Tie-Yan Liu (242 papers)
Citations (17)
