Identifying the strengths of Transformers for multimodal machine learning
Characterize the key strengths of Transformer architectures in multimodal machine learning through rigorous theoretical and empirical analysis, clarifying advantages such as encoding implicit knowledge, multi-head representation subspaces, global aggregation via self-attention, and graph-compatible tokenization across diverse modalities.
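One of the strengths named above, tokenization across diverse modalities, can be illustrated with a minimal sketch: different inputs (here, an image and a text sequence) are mapped into a shared embedding space so that a single Transformer can attend over the combined token sequence. All shapes, names, and the random projections below are illustrative assumptions, not part of any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # shared embedding width (illustrative choice)

# Vision: split an 8x8 single-channel image into four 4x4 patches,
# flatten each patch, and linearly project it to d_model dimensions.
image = rng.standard_normal((8, 8))
patches = image.reshape(2, 4, 2, 4).transpose(0, 2, 1, 3).reshape(4, 16)
w_patch = rng.standard_normal((16, d_model))  # hypothetical patch projection
vision_tokens = patches @ w_patch             # shape (4, d_model)

# Text: map token ids into the same d_model space via an embedding table.
vocab = rng.standard_normal((100, d_model))   # hypothetical embedding table
text_ids = np.array([7, 42, 3])
text_tokens = vocab[text_ids]                 # shape (3, d_model)

# Both modalities now form one token sequence a Transformer can attend over.
sequence = np.concatenate([vision_tokens, text_tokens], axis=0)
print(sequence.shape)  # (7, 16)
```

The key point is that once every modality is reduced to tokens of a common width, the attention mechanism itself is modality-agnostic, which is part of what the open problem asks to characterize rigorously.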
References
Identifying the strengths of Transformers for multimodal machine learning is a big open problem.
— Multimodal Learning with Transformers: A Survey
(2206.06488 - Xu et al., 2022) in Section 7 "Discussion and Outlook"