
Extend injectivity analysis to multimodal Transformers

Extend the almost-sure injectivity analysis, which establishes that the map from input sequences to last-token hidden representations in causal decoder-only Transformer language models is injective and remains so under gradient-based training, to multimodal Transformer architectures used in vision and music. Determine the precise assumptions and architectural conditions under which analogous injectivity guarantees hold for their respective input tokenizations and internal representations.


Background

The paper proves that standard causal decoder-only Transformer LLMs are almost-surely injective with respect to the map from discrete prompts to the last-token hidden state, both at initialization and after any finite number of gradient descent updates. This is established via real-analyticity of components, measure-zero collision sets, and preservation of absolute continuity under training.
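The almost-sure flavor of the result can be illustrated numerically. The sketch below is not the paper's proof; it uses a hypothetical stand-in for a causal model (a position-weighted sum of randomly initialized embeddings, which is real-analytic in the parameters) and checks that, with generic random parameters, all distinct discrete inputs of a fixed length map to distinct last-token states:

```python
# Toy check of almost-sure injectivity (a sketch, not the paper's proof):
# with random real-valued parameters, distinct discrete inputs collide only
# on a measure-zero parameter set, so in practice all states come out distinct.
import itertools
import random

random.seed(0)
VOCAB, DIM, LENGTH = 5, 4, 3  # hypothetical toy sizes

# Random "embedding table": one real vector per token id.
emb = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]

def last_state(seq):
    """Stand-in for a causal model's last-token hidden state: a running
    position-weighted sum of token embeddings (weight 1/position)."""
    h = [0.0] * DIM
    for pos, tok in enumerate(seq, start=1):
        h = [hi + emb[tok][d] / pos for d, hi in enumerate(h)]
    return tuple(h)

# Enumerate every length-3 sequence over the toy vocabulary.
states = {last_state(s) for s in itertools.product(range(VOCAB), repeat=LENGTH)}
assert len(states) == VOCAB ** LENGTH  # no collisions among 5**3 sequences
```

A collision here would require the random embedding vectors to satisfy an exact rational linear relation, an event of probability zero, which mirrors the measure-zero collision-set argument.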

Building on this, the authors introduce SipIt, an algorithm that inverts hidden states to exactly recover input prompts with linear-time guarantees, operationalizing the theoretical injectivity. In the conclusions, the authors explicitly note that extending these results beyond text-only models to multimodal architectures (e.g., vision and music Transformers) remains open.
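The core idea of sequential inversion can be sketched in a few lines. The toy below is not the paper's SipIt implementation; it replaces the Transformer with a hypothetical injective map and recovers the prompt token by token, matching each prefix's "hidden state" by exhaustive search over the vocabulary, which is linear in sequence length (times vocabulary size):

```python
# Toy sketch of sequential hidden-state inversion, assuming only that we can
# query the model's last-token hidden state for any candidate prefix.
from typing import List, Tuple

VOCAB = list(range(10))  # hypothetical toy vocabulary of 10 token ids

def last_hidden(prefix: Tuple[int, ...]) -> int:
    """Stand-in for a Transformer's last-token hidden state: a base-31
    rolling map, injective because every digit t + 1 lies in 1..10 < 31."""
    h = 1
    for t in prefix:
        h = h * 31 + t + 1
    return h

def invert(target_states: List[int]) -> List[int]:
    """Recover the prompt token by token.

    target_states[i] is the last-token state after the first i + 1 tokens.
    At each position, try every vocabulary token and keep the one whose
    prefix state matches the target."""
    recovered: List[int] = []
    for target in target_states:
        for t in VOCAB:
            if last_hidden(tuple(recovered) + (t,)) == target:
                recovered.append(t)
                break
        else:
            raise ValueError("no matching token: map is not injective here")
    return recovered

prompt = [3, 1, 4, 1, 5, 9]
states = [last_hidden(tuple(prompt[: i + 1])) for i in range(len(prompt))]
assert invert(states) == prompt  # exact recovery of the original prompt
```

The open question is whether an analogous per-step matching argument survives the tokenizations and fusion layers of vision and music Transformers.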

References

Extending the analysis to multimodal architectures such as music and vision Transformers is an open problem.

Language Models are Injective and Hence Invertible (2510.15511, Nikolaou et al., 17 Oct 2025), Section: Discussion and conclusions.