A Tensorized Transformer for Language Modeling (1906.09777v3)

Published 24 Jun 2019 in cs.CL and cs.LG

Abstract: Recent developments in neural models have connected the encoder and decoder through a self-attention mechanism. In particular, the Transformer, which is based solely on self-attention, has led to breakthroughs in NLP tasks. However, the multi-head attention mechanism, a key component of the Transformer, limits effective deployment of the model in resource-limited settings. In this paper, based on the ideas of tensor decomposition and parameter sharing, we propose a novel self-attention model (namely Multi-linear attention) with Block-Term Tensor Decomposition (BTD). We test and verify the proposed attention method on three language modeling tasks (i.e., PTB, WikiText-103 and One-billion) and a neural machine translation task (i.e., WMT-2016 English-German). Multi-linear attention can not only substantially compress the model parameters but also obtain performance improvements, compared with a number of language modeling approaches, such as Transformer, Transformer-XL, and Transformer with tensor train decomposition.

Authors (7)

Xindian Ma
Peng Zhang
Shuai Zhang
Nan Duan
Yuexian Hou
Dawei Song
Ming Zhou

Citations (157)

Related Papers