Scheduled DropHead: A Regularization Method for Transformer Models (2004.13342v2)

Published 28 Apr 2020 in cs.CL and cs.LG

Abstract: In this paper, we introduce DropHead, a structured dropout method specifically designed for regularizing the multi-head attention mechanism, a key component of the transformer, a state-of-the-art model for various NLP tasks. In contrast to conventional dropout mechanisms, which randomly drop individual units or connections, DropHead drops entire attention heads during training. This prevents the multi-head attention model from being dominated by a small subset of attention heads and reduces the risk of overfitting the training data, thus making more efficient use of the multi-head attention mechanism. Motivated by recent studies of the learning dynamics of the multi-head attention mechanism, we propose a specific dropout rate schedule that adaptively adjusts the dropout rate of DropHead to achieve a better regularization effect. Experimental results on both machine translation and text classification benchmark datasets demonstrate the effectiveness of the proposed approach.
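
The core idea described in the abstract, dropping whole attention heads rather than individual units, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the tensor layout, the guard against dropping every head, and the rescaling of surviving heads are assumptions chosen for clarity.

```python
# A minimal sketch of DropHead-style head dropout for multi-head attention.
# Hypothetical implementation for illustration; shapes and the rescaling
# convention are assumptions, not taken from the paper's code.
import torch


def drop_head(attn_heads: torch.Tensor, p: float, training: bool = True) -> torch.Tensor:
    """Randomly zero out entire attention heads.

    attn_heads: per-head outputs of shape (batch, num_heads, seq_len, head_dim)
    p: probability of dropping each head
    """
    if not training or p == 0.0:
        return attn_heads

    batch, num_heads = attn_heads.shape[:2]
    # Sample one keep/drop decision per head per example (structured dropout).
    keep = (torch.rand(batch, num_heads, 1, 1, device=attn_heads.device) >= p).to(attn_heads.dtype)
    # Guard against dropping every head for an example: if all heads were
    # dropped, keep them all so the layer still produces output
    # (an assumption for robustness, not stated in the paper).
    all_dropped = keep.sum(dim=1, keepdim=True) == 0
    keep = torch.where(all_dropped, torch.ones_like(keep), keep)
    # Rescale surviving heads so the expected magnitude of the combined
    # output is roughly preserved.
    scale = num_heads / keep.sum(dim=1, keepdim=True)
    return attn_heads * keep * scale
```

For the scheduled variant, the per-step drop probability `p` would be supplied by a schedule that adapts the rate over training, as the abstract describes. The abstract does not spell out the exact shape of that schedule, so any concrete ramp (for example, a linear warm-up) should be treated as an illustrative placeholder rather than the authors' schedule.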

Authors (5)
  1. Wangchunshu Zhou (73 papers)
  2. Tao Ge (53 papers)
  3. Ke Xu (310 papers)
  4. Furu Wei (292 papers)
  5. Ming Zhou (182 papers)
Citations (35)
