Overcoming Vocabulary Constraints with Pixel-level Fallback (2504.02122v1)
Abstract: Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained LLMs with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric LLMs, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual LLMs without extensive retraining and reduces decoding latency via input compression.
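The core idea of the paper's pixel-level fallback can be sketched as a small pipeline: render the input text as a grayscale bitmap, slice it into fixed-width patches, and project each flattened patch into the LLM's embedding dimension. This is an illustrative minimal sketch, not the paper's implementation: the Pillow-based rendering, the patch width, the embedding size, and the random projection (standing in for a trained vocabulary-free encoder) are all assumptions introduced here.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont  # Pillow, assumed available


def render_text(text, height=16, width_per_char=8):
    # Render text as a grayscale strip (white background, black glyphs).
    # A real system would use a proper text renderer with full script support.
    img = Image.new("L", (width_per_char * len(text), height), color=255)
    draw = ImageDraw.Draw(img)
    draw.text((0, 0), text, fill=0, font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32) / 255.0


def pixels_to_embeddings(pixels, patch_width=8, d_model=64, seed=0):
    # Slice the rendered strip into fixed-width patches and project each
    # flattened patch into the LLM's input embedding space. The random
    # projection is a hypothetical stand-in for the trained encoder.
    h, w = pixels.shape
    n = w // patch_width
    patches = pixels[:, : n * patch_width].reshape(h, n, patch_width)
    patches = patches.transpose(1, 0, 2).reshape(n, h * patch_width)
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((h * patch_width, d_model)).astype(np.float32)
    return patches @ proj  # shape: (n_patches, d_model)


emb = pixels_to_embeddings(render_text("Hello, world!"))
print(emb.shape)  # one embedding per fixed-width patch
```

Because the number of patches is tied to the rendered width rather than to a subword vocabulary, scripts that fragment badly under an English-centric tokenizer can yield fewer input vectors, which is the source of the decoding-latency reduction the abstract mentions.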