Sub-Linear Memory: How to Make Performers SLiM (2012.11346v1)

Published 21 Dec 2020 in cs.LG

Abstract: The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring $O(L^2)$ in serial time and memory as functions of input length $L$. Recent works proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation. We perform a thorough analysis of recent Transformer mechanisms with linear self-attention, Performers, in terms of overall computational complexity. We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory as a function of $L$ (in addition to negligible storage for the input sequence), at a cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only $O(1)$ memory during training, and still requires $O(L)$ time. This discovered time-memory tradeoff can be used for training or, due to complete backward-compatibility, for fine-tuning on a low-memory device, e.g. a smartphone or an earlier-generation GPU, thus contributing towards decentralized and democratized deep learning.

PDF Abstract

Summarize PDF Markdown Bookmark Chat (Pro)

Authors (5)

Valerii Likhosherstov (25 papers)
Krzysztof Choromanski (96 papers)
Jared Davis (12 papers)
Xingyou Song (31 papers)
Adrian Weller (150 papers)

Citations (18)

View on Semantic Scholar

Sub-Linear Memory: How to Make Performers SLiM (2012.11346v1)

Related Papers