Understanding positional embeddings in Transformers across modalities
Determine the role, necessity, and appropriate design of positional embeddings in Transformer architectures across different modalities, including cases such as point clouds and sketch drawings where the token elements already encode spatial coordinates, in order to ascertain when explicit positional information is required and how it should be incorporated so that structural relationships are preserved. An illustrative contrast between the two settings is sketched below.
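To make the design question concrete, the following minimal sketch (illustrative only, not taken from the survey; it assumes PyTorch, and the names sinusoidal_positional_embedding and CoordinatePositionalEncoding are hypothetical) contrasts an index-based sinusoidal embedding, suited to ordered sequences, with a coordinate-conditioned encoding for modalities such as point clouds or sketches, where positional information is already part of each token's features.

```python
# Illustrative sketch: two ways of injecting positional information into
# Transformer tokens. Assumes PyTorch; names are hypothetical.
import math
import torch
import torch.nn as nn

def sinusoidal_positional_embedding(seq_len: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal embeddings indexed by sequence position (e.g., text)."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))                   # (dim // 2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                            # (seq_len, dim)

class CoordinatePositionalEncoding(nn.Module):
    """Learned encoding of raw coordinates (e.g., xyz of a point cloud or
    (x, y) pen positions of a sketch stroke), where 'position' is a token
    feature rather than an index into an ordered sequence."""
    def __init__(self, coord_dim: int, model_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(coord_dim, model_dim),
            nn.ReLU(),
            nn.Linear(model_dim, model_dim),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (batch, num_tokens, coord_dim) -> (batch, num_tokens, model_dim)
        return self.mlp(coords)

# Usage: sequence tokens add an index-based embedding; point-cloud tokens may
# instead add a coordinate-derived embedding, or none at all, since their
# coordinates are already part of the input features.
text_tokens = torch.randn(2, 16, 64) + sinusoidal_positional_embedding(16, 64)
xyz = torch.rand(2, 1024, 3)                       # raw point coordinates
point_feats = torch.randn(2, 1024, 64)             # per-point features
point_tokens = point_feats + CoordinatePositionalEncoding(3, 64)(xyz)
```

Whether the coordinate-derived term is needed at all, or whether the raw coordinates in the token features already suffice, is precisely the open design question posed above.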
References
How to understand position embedding to Transformers is an open problem.
— Multimodal Learning with Transformers: A Survey
(arXiv:2206.06488, Xu et al., 2022), Discussion under Section 2.1.1, "Input Tokenization"