Understanding positional embeddings in Transformers across modalities

Determine the role, necessity, and appropriate design of positional embeddings in Transformer architectures across modalities, including cases such as point clouds and sketch drawings where the tokens themselves already encode coordinates, in order to establish when positional information is required and how it should be incorporated to preserve structural relationships.

Background

The survey notes that positional embeddings are commonly added to token embeddings to retain positional information, but their necessity may vary across modalities. For example, in point clouds and sketch drawings, the tokens can themselves be coordinates, suggesting that positional embeddings might be optional. Conversely, self-attention without positional information is inherently permutation-invariant, which implies that positional embeddings are often necessary.
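As a concrete illustration of the permutation-invariance point, the following NumPy sketch (not taken from the survey; the single-head, projection-free attention and the function names are simplifications chosen here) checks that self-attention without positional information is permutation-equivariant, and that adding the standard sinusoidal positional embeddings to the token embeddings breaks this.

```python
import numpy as np

def attention(x):
    """Single-head scaled dot-product self-attention (no projections, no PE)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def sinusoidal_pe(n_tokens, d_model):
    """Sinusoidal positional embeddings as in Vaswani et al. (2017)."""
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))          # 6 tokens, model dimension 8
perm = rng.permutation(6)

# Without PE: permuting the input tokens merely permutes the outputs,
# so the model treats the sequence as an unordered set.
print(np.allclose(attention(tokens)[perm], attention(tokens[perm])))      # True

# With additive PE: the same permutation now changes the outputs,
# i.e. token order carries information.
pe = sinusoidal_pe(6, 8)
print(np.allclose(attention(tokens + pe)[perm], attention(tokens[perm] + pe)))  # False
```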

This ambiguity highlights a need for principled guidance on when and how to use positional embeddings in Transformer models based on modality-specific characteristics and structural requirements.
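For the point-cloud case mentioned above, one way positional information could be incorporated is to derive it from the raw coordinates rather than from token indices, so that a separate index-based embedding becomes redundant (the ordering of points carries no information). The sketch below is an illustrative assumption, not the survey's method: it uses a random Fourier lift of the xyz coordinates as a stand-in for the small learned MLP that coordinate-based positional encodings typically use, and all names and shapes are hypothetical.

```python
import numpy as np

def coordinate_pe(xyz, d_model, rng):
    """Coordinate-derived positional embedding for point clouds:
    a random Fourier-feature lift of raw xyz coordinates to d_model dims
    (a stand-in for a small learned MLP on the coordinates)."""
    b = rng.normal(size=(xyz.shape[-1], d_model // 2))
    proj = 2 * np.pi * xyz @ b
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

rng = np.random.default_rng(0)
points = rng.uniform(size=(1024, 3))      # raw xyz; the tokens already "know" where they are
features = rng.normal(size=(1024, 64))    # per-point feature tokens
tokens = features + coordinate_pe(points, 64, rng)  # position signal comes from geometry, not index
```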

References

How to understand positional embeddings in Transformers is an open problem.

Multimodal Learning with Transformers: A Survey (Xu et al., 2022, arXiv:2206.06488), Discussion under Section 2.1.1 "Input Tokenization".