Impact of text-as-image prompting at extremely large context lengths
Determine how effective representing text as images is for multimodal LLMs, and where it breaks down, when contexts span tens of thousands of tokens or more. This means quantifying the effect of text-as-image prompting on accuracy, efficiency, and latency, and assessing whether specialized techniques are needed for reliable performance at this scale.
References
Despite showing promising token savings on short to medium context scenarios, our work has not yet fully evaluated the impact of text-as-image prompting on extremely large contexts that span tens of thousands of tokens or more.
Source: Li et al., "Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs" (21 Oct 2025), Limitations section.