FIT: Far-reaching Interleaved Transformers (2305.12689v2)

Published 22 May 2023 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: We present FIT: a transformer-based architecture with efficient self-attention and adaptive computation. Unlike original transformers, which operate on a single sequence of data tokens, we divide the data tokens into groups, with each group being a shorter sequence of tokens. We employ two types of transformer layers: local layers operate on data tokens within each group, while global layers operate on a smaller set of introduced latent tokens. These layers, comprising the same set of self-attention and feed-forward layers as standard transformers, are interleaved, and cross-attention is used to facilitate information exchange between data and latent tokens within the same group. The attention complexity is $O(n^2)$ locally within each group of size $n$, but can reach $O(L^{4/3})$ globally for sequence length of $L$. The efficiency can be further enhanced by relying more on global layers that perform adaptive computation using a smaller set of latent tokens. FIT is a versatile architecture and can function as an encoder, diffusion decoder, or autoregressive decoder. We provide initial evidence demonstrating its effectiveness in high-resolution image understanding and generation tasks. Notably, FIT exhibits potential in performing end-to-end training on gigabit-scale data, such as 6400$\times$6400 images, or 160K tokens (after patch tokenization), within a memory capacity of 16GB, without requiring specific optimizations or model parallelism.

Authors (2)
  1. Ting Chen (148 papers)
  2. Lala Li (11 papers)
Citations (9)

Summary

An Overview of FIT: Far-reaching Interleaved Transformers

The paper "FIT: Far-reaching Interleaved Transformers" introduces an innovative transformer-based architecture, named FIT, which emphasizes efficient self-attention and adaptive computation. FIT seeks to mitigate the complexity challenge faced by traditional transformers, especially when handling long sequences. This is achieved by segmenting data tokens into groups, where local interactions occur within each group, and global interactions are facilitated through latent tokens.

Architectural Composition

FIT interleaves two types of transformer layers. Local layers handle intra-group interactions with conventional self-attention, at $O(n^2)$ cost for a group of size $n$. Global layers operate on latent tokens to manage inter-group interactions, so that the overall attention cost can be kept to $O(L^{4/3})$ for a sequence of length $L$. This layered strategy reduces computational demands while retaining much of the effectiveness of full attention.
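One way to see where the $O(L^{4/3})$ figure comes from is a simple cost accounting. The assumptions of $L/n$ equal groups and a constant number of latent tokens per group are illustrative simplifications, not the paper's exact derivation:

```latex
% L tokens split into L/n groups of size n, with O(1) latent tokens per group (assumption).
\begin{aligned}
C_{\text{local}}  &= \tfrac{L}{n}\cdot n^{2} = L\,n
  && \text{(self-attention inside each group)}\\
C_{\text{global}} &= \big(\tfrac{L}{n}\big)^{2}
  && \text{(self-attention over all latent tokens)}\\
L\,n = \big(\tfrac{L}{n}\big)^{2}
  &\;\Longrightarrow\; n = L^{1/3}
  \;\Longrightarrow\; C_{\text{local}} + C_{\text{global}} = O\big(L^{4/3}\big).
\end{aligned}
```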

FIT can function as an encoder, a diffusion decoder, or an autoregressive decoder, showcasing its versatility across tasks, including high-resolution image understanding and generation. For instance, it supports end-to-end training on inputs as large as 6400×6400 images within 16GB of memory, without requiring intricate optimizations or model parallelism.
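As a concrete illustration, the sketch below shows one interleaved block in PyTorch: local self-attention inside each group, cross-attention from latent tokens into their group, global self-attention over all latents, and cross-attention back to the data tokens. The module layout, dimensions, and latent budget are assumptions for illustration (LayerNorms and feed-forward sublayers are omitted); this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class FITBlock(nn.Module):
    """Illustrative sketch of one local/global interleaved block."""
    def __init__(self, dim=256, heads=8, latents_per_group=16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(latents_per_group, dim))
        self.local_attn  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.read_attn   = nn.MultiheadAttention(dim, heads, batch_first=True)  # latents <- data
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.write_attn  = nn.MultiheadAttention(dim, heads, batch_first=True)  # data <- latents

    def forward(self, x, group_size):
        b, seq_len, dim = x.shape                  # (batch, L, dim)
        g = seq_len // group_size                  # number of groups
        x = x.reshape(b * g, group_size, dim)      # fold groups into the batch dimension

        # Local layer: O(n^2) self-attention inside each group.
        x = x + self.local_attn(x, x, x, need_weights=False)[0]

        # Latent tokens read from their own group via cross-attention.
        z = self.latents.expand(b * g, -1, -1)
        z = z + self.read_attn(z, x, x, need_weights=False)[0]

        # Global layer: self-attention over all latent tokens across groups.
        z = z.reshape(b, -1, dim)
        z = z + self.global_attn(z, z, z, need_weights=False)[0]
        z = z.reshape(b * g, -1, dim)

        # Data tokens read back from their group's latents.
        x = x + self.write_attn(x, z, z, need_weights=False)[0]
        return x.reshape(b, seq_len, dim)

# Example: 4096 tokens processed in groups of 64.
block = FITBlock()
out = block(torch.randn(2, 4096, 256), group_size=64)
print(out.shape)  # torch.Size([2, 4096, 256])
```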

Theoretical and Practical Implications

The theoretical appeal of FIT's design lies in its ability to balance work between local and global components. The latent tokens act as a learned, adaptive compression of each group, so the model can shift more of its computation onto the much shorter latent sequence, reducing both memory use and processing cost.

Practically, the architecture has been applied to high-resolution image understanding and pixel-based generative modeling. Its grouped local/global structure allows it to handle large-scale inputs efficiently, which is crucial for scaling models to more demanding data regimes.
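To give a sense of scale, here is a back-of-the-envelope comparison of attention-pair counts for the 6400×6400 case from the abstract. The 16×16 patch size matches the 160K-token figure; the group size and latent budget are assumed purely for illustration:

```python
# Rough attention-cost comparison for a 6400x6400 image (illustrative numbers).
L = (6400 // 16) ** 2                # 160,000 patch tokens with 16x16 patches
n = 400                              # assumed group size
m = 16                               # assumed latent tokens per group

full_attn   = L ** 2                               # vanilla transformer: all token pairs
local_attn  = (L // n) * n ** 2                    # per-group self-attention
global_attn = ((L // n) * m) ** 2                  # self-attention over all latents

print(f"full attention pairs:  {full_attn:.2e}")                 # 2.56e+10
print(f"local + global pairs:  {local_attn + global_attn:.2e}")  # ~1.05e+08
```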

Numerical Insights and Results

FIT demonstrates promising numerical results in terms of computational savings, with attention cost growing close to linearly with sequence length. In use cases such as Pix2Seq object detection, FIT outperforms traditional ViT backbones, delivering both performance improvements and faster training. Notably, FIT-B models interleave local and global layers within the backbone, leading to improved object detection pretraining results.

Future Directions

The introduction of FIT paves the way for further exploration into more refined interleaving mechanisms and potential expansion into other domains such as video processing and natural language. Open questions include how FIT might be adapted or further optimized for different data modalities and for longer-range tasks, bridging the gap between theoretical flexibility and practical application.

In conclusion, the paper presents an efficient and adaptable transformer design that balances computational cost and performance, opening new possibilities for architectures that process very large inputs. While the empirical evaluations presented are initial, they lay a foundation for future investigation into fully leveraging FIT's capabilities across diverse applications.