Extend injectivity analysis to multimodal Transformers
Extend the almost-sure injectivity analysis, which establishes that the map from input token sequences to last-token hidden representations in causal decoder-only Transformer language models is injective and remains so under gradient-based training, to the multimodal Transformer architectures used in vision and music. The goal is to determine the precise assumptions and architectural conditions under which analogous injectivity guarantees hold for these models' input tokenizations and internal representations.
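As a concrete starting point, the sketch below probes the language-model property numerically on a toy causal decoder with random weights; it is the kind of finite check one would want to repeat with patch tokenizations (vision) or symbolic-music tokenizations before attempting a formal extension. Everything here (the TinyCausalLM module, its sizes, the length-3 enumeration) is an illustrative assumption, not the paper's construction, and a positive minimum distance is only numerical evidence, not the paper's almost-sure guarantee.

```python
import itertools
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB, D_MODEL, N_HEAD, N_LAYER, MAX_LEN = 11, 32, 4, 2, 6

class TinyCausalLM(nn.Module):
    """Toy causal decoder-only Transformer with random (untrained) weights."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEAD, dim_feedforward=64,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYER)

    def last_hidden(self, ids):  # ids: (batch, seq_len)
        T = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(T))
        # Additive causal mask: -inf strictly above the diagonal.
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=causal)
        return h[:, -1, :]  # last-token hidden state

model = TinyCausalLM().eval()

# Enumerate every length-3 sequence over the toy vocabulary and embed each.
seqs = torch.tensor(list(itertools.product(range(VOCAB), repeat=3)))
with torch.no_grad():
    reps = model.last_hidden(seqs)

# A strictly positive minimum pairwise distance is consistent with
# injectivity on this finite set (evidence, not a proof).
d = torch.cdist(reps, reps)
d.fill_diagonal_(float("inf"))
print(f"min pairwise distance across {len(seqs)} sequences: {d.min().item():.3e}")
```

For the multimodal extension, the analogous experiment would swap the token embedding for a patch-embedding or MIDI-event tokenizer and ask whether the structural properties the language-model argument relies on survive that substitution; the open problem is to identify exactly which architectural conditions make the guarantee carry over.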
References
Extending the analysis to multimodal architectures such as music and vision Transformers is an open problem.
— Language Models are Injective and Hence Invertible
(arXiv:2510.15511 - Nikolaou et al., 17 Oct 2025), Section: Discussion and conclusions