- The paper introduces Flow-Attention, a novel mechanism that linearizes Transformers by reformulating attention as a flow network.
- It leverages conservation flows to eliminate quadratic complexity while ensuring competitive performance on tasks like language modeling and image recognition.
- Extensive benchmarks demonstrate competitive or improved results, such as higher average accuracy on LRA and lower perplexity on WikiText-103.
As Transformers have steadily advanced across natural language processing, computer vision, time series analysis, and reinforcement learning, the quadratic complexity of the attention mechanism has remained a persistent bottleneck for long-sequence processing. This paper introduces the "Flowformer," an attempt to linearize the Transformer architecture by reformulating attention through the lens of network flow theory, yielding a linear-time mechanism free of task-specific inductive biases while preserving attention's expressive power and generality.
Core Insights and Methodology
Flowformer tackles the quadratic bottleneck by re-envisioning attention as a flow network: a directed graph in which information flows from sources to sinks under capacity constraints. This formulation allows flow conservation to be enforced, meaning that for each sink (result), the incoming flow from the sources (values) is fixed, and for each source, the outgoing flow to the sinks is fixed. Conservation directly induces competition among sources and allocation among sinks, which stabilizes and diversifies attention weights without the locality-style inductive biases typically introduced by other linear attention methods.
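Concretely, writing φ for a non-negative feature map applied to the queries and keys (the notation below is an illustrative reconstruction, not the paper's exact formulation), the raw incoming flow of sink $i$ and the raw outgoing flow of source $j$ can be written as

$$
I_i = \phi(Q_i) \cdot \sum_{j=1}^{n} \phi(K_j), \qquad
O_j = \phi(K_j) \cdot \sum_{i=1}^{n} \phi(Q_i).
$$

Conservation then amounts to normalizing every sink's incoming flow and every source's outgoing flow to a fixed budget, which is what forces sources to compete for limited sink capacity and sinks to allocate their capacity non-uniformly across sources.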
The core mechanism, termed Flow-Attention, uses this flow conservation to calibrate attention dynamically: the outgoing (source) flow drives a competition that reweights the values, and the incoming (sink) flow drives an allocation that reweights the aggregated results, both applied as simple multiplicative transformations. The competition step prevents trivially uniform attention, and because the construction never materializes the full query-key attention matrix, the quadratic complexity is circumvented altogether.
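To make the linear-time aggregation concrete, below is a minimal, single-head NumPy sketch. It is a reconstruction under stated assumptions rather than the paper's reference implementation: φ is taken to be a sigmoid, the conservation normalizations are simplified, and the function and variable names (`flow_attention`, `I`, `O`, `V_hat`) are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def flow_attention(Q, K, V, eps=1e-6):
    """Simplified Flow-Attention sketch. Q, K: (n, d); V: (n, d_v)."""
    phi_Q, phi_K = sigmoid(Q), sigmoid(K)          # non-negative feature maps

    # Incoming flow per sink i and outgoing flow per source j.
    I = phi_Q @ phi_K.sum(axis=0)                  # shape (n,)
    O = phi_K @ phi_Q.sum(axis=0)                  # shape (n,)

    # Competition among sources: reweight values by softmax of outgoing flow.
    V_hat = softmax(O)[:, None] * V                # shape (n, d_v)

    # Linear-time aggregation: phi(K)^T V_hat is a small (d, d_v) summary,
    # so the n x n attention matrix is never formed.
    A = (phi_Q @ (phi_K.T @ V_hat)) / (I[:, None] + eps)

    # Allocation among sinks: gate each result by its incoming flow.
    return sigmoid(I)[:, None] * A

# Tiny usage example with random data.
rng = np.random.default_rng(0)
n, d = 8, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = flow_attention(Q, K, V)
print(out.shape)  # (8, 16)
```

Because only the small `(d, d_v)` summary `phi(K).T @ V_hat` is ever formed, the cost grows linearly with sequence length `n`, which is the practical payoff of the flow formulation.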
Empirical Evaluation
Flowformer is empirically validated across diverse benchmarks spanning long-sequence modeling on the LRA benchmark, language modeling on WikiText-103, image recognition on ImageNet-1K, time series classification from the UEA archive, and offline reinforcement learning via D4RL. It consistently achieves competitive or superior results compared to both the standard Transformer and other linear Transformer variants while maintaining linear complexity.
On the LRA benchmark, Flowformer outperforms the compared models, including the standard Transformer, raising average accuracy from 55.23% to 56.48%. In language modeling it reaches a perplexity of 30.8, surpassing even the canonical Transformer. On ImageNet-1K, Flowformer attains a Top-1 accuracy of 80.6%, demonstrating its applicability to large-scale image classification and its ability to slot into existing efficient Transformer designs such as DeiT.
Implications and Future Directions
The Flowformer model presents a pivotal development in Transformer architecture, addressing both efficiency and task adaptability without compromising the model’s generality or the informative diversity of its attention. The introduction of flow conservation as a foundational computational mechanism could inspire further exploration of graph-based perspectives on neural architectures, opening opportunities for novel algorithmic paradigms in AI models.
As the field progresses, a key direction is scaling the Flowformer to larger, more general pre-trained models, unlocking further efficiencies in both upstream and downstream applications. Future work may also explore further mathematical generalizations and optimizations within the flow network paradigm to enhance performance and robustness.
Flowformer opens a promising avenue, not merely by computing attention more efficiently but by rethinking the computational framework of attention itself within neural architectures.