An Introduction to Transformers (2304.10557v5)

Published 20 Apr 2023 in cs.LG and cs.AI

Abstract: The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data-points. The transformer has driven recent advances in natural language processing, computer vision, and spatio-temporal modelling. There are many introductions to transformers, but most do not contain precise mathematical descriptions of the architecture and the intuitions behind the design choices are often also missing. Moreover, as research takes a winding path, the explanations for the components of the transformer can be idiosyncratic. In this note we aim for a mathematically precise, intuitive, and clean description of the transformer architecture. We will not discuss training as this is rather standard. We assume that the reader is familiar with fundamental topics in machine learning including multi-layer perceptrons, linear transformations, softmax functions and basic probability.

Summary

  • The paper presents the transformer architecture, which combines self-attention and MLP layers to process sequential or set-structured data.
  • It details how fixed and learnable embeddings convert raw data into tokens, giving the architecture broad applicability across modalities.
  • It highlights the critical role of position encoding and multi-head attention in capturing dependencies within sequences.

An Overview of the Transformer Architecture

Introduction to the Transformer Architecture

Transformers have emerged as a powerful architecture for processing sequences or sets of data points across various domains, including natural language processing, computer vision, and spatio-temporal modeling. Unlike prior architectures, which required bespoke designs for each modality, transformers apply a uniform approach, adapting to diverse data types through tokenization.

Data Preprocessing for Transformers

Transformers require data to be formatted as a set or sequence of N tokens, each of dimension D. This universal data representation allows for wide applicability across different modalities, such as textual data represented by word vectors and images divided into patch vectors. Both fixed and learnable embedding techniques can be employed to convert raw data into the requisite format, enhancing the model's flexibility and reducing the need for custom architectures catering to specific data types.
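
As an illustration of this preprocessing step, the NumPy sketch below builds an (N, D) token matrix from a toy sentence and from a toy image cut into patches. The sizes, the `word_embedding` matrix, and the `patch_projection` map are hypothetical stand-ins for what would be fixed or learned in a real model; the paper does not prescribe these values.

```python
import numpy as np

# Hypothetical sizes: a vocabulary of 1000 words and token dimension D = 64.
vocab_size, D = 1000, 64
rng = np.random.default_rng(0)

# Text: map integer word ids to D-dimensional vectors via a (learnable) embedding matrix.
word_embedding = 0.02 * rng.standard_normal((vocab_size, D))
word_ids = np.array([5, 42, 7])                 # a toy sentence of N = 3 tokens
text_tokens = word_embedding[word_ids]          # shape (3, 64)

# Images: cut an H x W x C image into P x P patches, flatten each patch,
# and project it to D dimensions with a (learnable) linear map.
H = W = 32; C = 3; P = 8
image = rng.random((H, W, C))
patches = (image.reshape(H // P, P, W // P, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, P * P * C))        # shape (16, 192): one row per patch
patch_projection = 0.02 * rng.standard_normal((P * P * C, D))
image_tokens = patches @ patch_projection       # shape (16, 64)

print(text_tokens.shape, image_tokens.shape)
```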

Transformer Block Structure

A transformer block is composed of two primary stages: self-attention across the sequence and a Multi-Layer Perceptron (MLP) across features. The self-attention mechanism, or Multi-Head Self-Attention (MHSA), aggregates information across the sequence, enabling each feature vector to be refined based on its relationship with all other vectors in the input set. This mechanism allows the model to capture dependencies among data points regardless of their position within the sequence. The MLP stage then refines each feature vector independently through nonlinear transformations, enhancing the representational power of the network. Both stages are illustrated in the sketch after the list below.

  • Self-Attention Mechanism: Utilizes an attention matrix to weigh the importance of different tokens in the sequence relative to each other, based on their content and positional relationship. It allows the model to dynamically adjust its focus on relevant parts of the input data.
  • Multi-Head Self-Attention: Enhances the capacity of the attention mechanism by allowing the model to attend to information from different representation subspaces simultaneously. This is achieved through parallel processing of input data with multiple attention heads, each capable of capturing different aspects of the data's relationships.
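
To make the block structure concrete, here is a minimal NumPy sketch of a single transformer block: multi-head self-attention followed by a position-wise MLP, each with a residual connection. It uses a pre-norm arrangement and omits learned layer-norm gains and biases, so it is one common variant rather than a line-by-line transcription of the paper's equations; all sizes and parameters are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified layer norm without learned gain and bias.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(X, params, num_heads):
    """One block: multi-head self-attention across the sequence, then an MLP across features.
    X is the (N, D) matrix of token vectors; both stages use residual connections."""
    N, D = X.shape
    d_h = D // num_heads

    # Multi-head self-attention: each head forms its own (N, N) attention matrix.
    Xn = layer_norm(X)
    heads = []
    for Wq, Wk, Wv in params["heads"]:            # each projection is (D, d_h)
        Q, K, V = Xn @ Wq, Xn @ Wk, Xn @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_h))       # how strongly token n attends to token n'
        heads.append(A @ V)                       # content-based mixing across the sequence
    X = X + np.concatenate(heads, axis=-1) @ params["Wo"]

    # Position-wise MLP: the same two-layer network applied to every token independently.
    Xn = layer_norm(X)
    hidden = np.maximum(0.0, Xn @ params["W1"] + params["b1"])
    return X + hidden @ params["W2"] + params["b2"]

# Toy usage with random (hypothetical) parameters.
N, D, num_heads = 5, 16, 4
rng = np.random.default_rng(0)
params = {
    "heads": [tuple(0.1 * rng.standard_normal((D, D // num_heads)) for _ in range(3))
              for _ in range(num_heads)],
    "Wo": 0.1 * rng.standard_normal((D, D)),
    "W1": 0.1 * rng.standard_normal((D, 4 * D)), "b1": np.zeros(4 * D),
    "W2": 0.1 * rng.standard_normal((4 * D, D)), "b2": np.zeros(D),
}
X = rng.standard_normal((N, D))
print(transformer_block(X, params, num_heads).shape)   # (5, 16)
```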

Position Encoding

A critical aspect of transformer design is the inclusion of positional information, as the self-attention mechanism treats the input data as an unordered set. Positional encoding schemes, either fixed or learnable, incorporate spatial or sequential information into the model so that the transformer can recognize and exploit the order of the input tokens. These encodings can be added directly to the token embeddings or injected into the attention calculations, preserving positional relationships as the data flows through the network.
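
The sketch below shows one fixed scheme, the sinusoidal encodings popularised by Vaswani et al., added directly to the token embeddings; a learnable alternative would simply replace `sinusoidal_position_encoding(N, D)` with a trainable (N, D) matrix. The sizes are illustrative only.

```python
import numpy as np

def sinusoidal_position_encoding(N, D):
    """Fixed sinusoidal encodings (assumes D is even); a learnable scheme would
    instead use a trainable (N, D) matrix of position embeddings."""
    positions = np.arange(N)[:, None]              # (N, 1)
    dims = np.arange(0, D, 2)[None, :]             # (1, D / 2)
    angles = positions / (10000 ** (dims / D))     # wavelengths grow along the feature axis
    enc = np.zeros((N, D))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

# Positional information is injected by adding the encodings to the (N, D) token embeddings.
N, D = 6, 16
tokens = np.random.default_rng(0).standard_normal((N, D))
tokens_with_position = tokens + sinusoidal_position_encoding(N, D)
```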

Application-Specific Variants

Transformers can be tailored for specific tasks by modifying the architecture's body or adding specific "heads" designed for the desired output. Examples include auto-regressive language modelling, where a masked version of the transformer is employed for efficient training and inference, and image classification, where a dedicated classification token is introduced to maintain a global representation of the input image throughout the processing stages. These adaptations highlight the versatility of transformers across a wide range of tasks beyond their initial applications in natural language processing.
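
As a rough illustration of these two adaptations, the sketch below computes a causally masked attention matrix (each token attends only to itself and earlier tokens, which is what allows auto-regressive training in parallel) and prepends a classification token to a set of patch tokens. The helper name `causal_attention_matrix` and all sizes are hypothetical, not taken from the paper.

```python
import numpy as np

def causal_attention_matrix(Q, K):
    """Masked ('causal') attention for auto-regressive modelling: token n may only
    attend to tokens 1..n, so the same left-to-right factorisation serves both
    parallel training and sequential generation."""
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                          # (N, N) raw attention scores
    future = np.triu(np.ones((N, N), dtype=bool), k=1)     # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)             # block attention to future tokens
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# For image classification, a dedicated classification token is prepended to the patch
# tokens; its final-layer representation is fed to the classification head.
rng = np.random.default_rng(0)
N, D = 4, 8
patch_tokens = rng.standard_normal((N, D))
cls_token = np.zeros((1, D))                               # learnable in practice
tokens = np.concatenate([cls_token, patch_tokens], axis=0) # shape (N + 1, D)

A = causal_attention_matrix(rng.standard_normal((N, D)), rng.standard_normal((N, D)))
print(np.allclose(np.triu(A, k=1), 0.0))                   # True: no attention to the future
```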

Conclusion

The transformer architecture represents a significant advance in machine learning, offering a flexible and powerful framework for processing sequential or set-based data. Its ability to dynamically focus on relevant parts of the input through self-attention and to integrate positional information makes it applicable to a broad spectrum of tasks and data modalities. Future developments in AI and machine learning are likely to further explore and expand the capabilities of transformers, potentially unlocking new insights and applications across diverse fields.
