
Performance of Text-as-Image Prompting Across Domains and Tasks

Determine the performance of text-as-image prompting (rendering textual inputs as images for processing by multimodal large language models) in domains such as medicine and law and on tasks such as coding and translation.


Background

The paper investigates visual text representations (rendering a long textual context as a single image) as a way to reduce decoder token usage in multimodal LLMs while maintaining downstream performance. It demonstrates decoder-token savings of nearly half on long-context retrieval (RULER S-NIAH) and document summarization (CNN/DailyMail) with GPT-4.1-mini and Qwen2.5-VL-72B-Instruct.
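As a rough illustration of the approach, the sketch below assembles a text-as-image prompt for an OpenAI-style chat API, assuming the long context has already been rendered to a PNG (e.g., with an image library such as Pillow). The function name is hypothetical; the image payload follows the documented `image_url` data-URL format, but everything else is an assumption, not the paper's implementation.

```python
import base64


def text_as_image_message(image_bytes: bytes, question: str) -> dict:
    """Build a chat message that sends the rendered-text image plus a short
    textual question, instead of the full long context as text tokens.

    Sketch only: assumes `image_bytes` is a PNG of the rendered context.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            # The long context travels as image input (processed by the
            # vision encoder), not as text tokens in the decoder.
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
            # Only the short task instruction is sent as text.
            {"type": "text", "text": question},
        ],
    }


# Placeholder bytes stand in for a real rendered-context PNG.
msg = text_as_image_message(b"\x89PNG-placeholder",
                            "Summarize the document shown in the image.")
```

Only the brief instruction contributes text tokens; the context itself is consumed through the vision pathway, which is where the reported token savings come from.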

However, the experiments cover only a limited set of benchmarks and domains. In the Limitations section, the authors explicitly note that performance in other domains (e.g., medical, legal) and on other tasks (e.g., coding, translation) remains an open question, so the broader generalization of text-as-image prompting is unresolved.

References

"Furthermore, our experiments focus on a limited number of benchmarks, leaving open questions about performance on other domains (e.g., medical, legal) and tasks (e.g., coding, translation)."

Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs (2510.18279 - Li et al., 21 Oct 2025) in Section: Limitations