Are Neighbors Enough? Multi-Head Neural n-gram can be Alternative to Self-attention (2207.13354v1)

Published 27 Jul 2022 in cs.CL

Abstract: The impressive performance of the Transformer has been attributed to self-attention, which considers dependencies across the entire input sequence at every position. In this work, we reformulate the neural $n$-gram model, which focuses on only a few surrounding representations of each position, with the multi-head mechanism of Vaswani et al. (2017). Through experiments on sequence-to-sequence tasks, we show that replacing self-attention in the Transformer with multi-head neural $n$-gram achieves comparable or better performance than the Transformer. From various analyses of our proposed method, we find that multi-head neural $n$-gram is complementary to self-attention, and that combining them can further improve the performance of the vanilla Transformer.
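
To make the idea in the abstract concrete, here is a minimal NumPy sketch of one way a multi-head neural $n$-gram layer could look. It is not the authors' implementation: the function name `multi_head_ngram`, the left-only (causal) window, the per-head linear map followed by a ReLU, and the output projection `w_out` are all assumptions chosen for illustration. The only points taken from the abstract are that each position looks at a small window of nearby representations and that the computation is split across heads.

```python
import numpy as np


def multi_head_ngram(x, head_weights, w_out, n):
    """Illustrative multi-head n-gram layer (hypothetical, not the paper's code).

    x:            (seq_len, d_model) input representations
    head_weights: (num_heads, n * d_head, d_head) one linear map per head
    w_out:        (d_model, d_model) output projection
    n:            window size (current position plus n - 1 predecessors)
    """
    seq_len, d_model = x.shape
    num_heads, _, d_head = head_weights.shape
    assert num_heads * d_head == d_model

    # Assumption: left-pad with zeros so every position has a full window.
    padded = np.concatenate([np.zeros((n - 1, d_model)), x], axis=0)

    head_outputs = []
    for h in range(num_heads):
        # Each head works on its own slice of the feature dimension.
        sub = padded[:, h * d_head:(h + 1) * d_head]
        # Row t concatenates original positions t-n+1 .. t for this head.
        windows = np.stack([sub[t:t + n].reshape(-1) for t in range(seq_len)])
        # Assumption: linear map plus ReLU over the concatenated window.
        head_outputs.append(np.maximum(windows @ head_weights[h], 0.0))

    # Concatenate the heads and mix them with the output projection.
    return np.concatenate(head_outputs, axis=1) @ w_out
```

Unlike self-attention, the output at a position in this sketch depends only on the `n` representations inside its window, so changing a token outside that window leaves the output unchanged.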

Authors (4)

Mengsay Loem (8 papers)
Sho Takase (25 papers)
Masahiro Kaneko (46 papers)
Naoaki Okazaki (70 papers)

Citations (1)


Related Papers

Self-Attention with Relative Position Representations (2018)
Fast Transformer Decoding: One Write-Head is All You Need (2019)
Transformers Can Represent $n$-gram Language Models (2024)
Transformer++ (2020)
Multi-Granularity Self-Attention for Neural Machine Translation (2019)