Overcoming Vocabulary Constraints with Pixel-level Fallback (2504.02122v1)
Abstract: Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained LLMs with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric LLMs, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual LLMs without extensive retraining and reduces decoding latency via input compression.
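The core idea of the paper's pixel-level fallback can be sketched as a small pipeline: render the input text as a grayscale bitmap, slice it into fixed-width patches, and project each flattened patch into the LLM's embedding dimension. This is an illustrative minimal sketch, not the paper's implementation: the Pillow-based rendering, the patch width, the embedding size, and the random projection (standing in for a trained vocabulary-free encoder) are all assumptions introduced here.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont  # Pillow, assumed available


def render_text(text, height=16, width_per_char=8):
    # Render text as a grayscale strip (white background, black glyphs).
    # A real system would use a proper text renderer with full script support.
    img = Image.new("L", (width_per_char * len(text), height), color=255)
    draw = ImageDraw.Draw(img)
    draw.text((0, 0), text, fill=0, font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32) / 255.0


def pixels_to_embeddings(pixels, patch_width=8, d_model=64, seed=0):
    # Slice the rendered strip into fixed-width patches and project each
    # flattened patch into the LLM's input embedding space. The random
    # projection is a hypothetical stand-in for the trained encoder.
    h, w = pixels.shape
    n = w // patch_width
    patches = pixels[:, : n * patch_width].reshape(h, n, patch_width)
    patches = patches.transpose(1, 0, 2).reshape(n, h * patch_width)
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((h * patch_width, d_model)).astype(np.float32)
    return patches @ proj  # shape: (n_patches, d_model)


emb = pixels_to_embeddings(render_text("Hello, world!"))
print(emb.shape)  # one embedding per fixed-width patch
```

Because the number of patches is tied to the rendered width rather than to a subword vocabulary, scripts that fragment badly under an English-centric tokenizer can yield fewer input vectors, which is the source of the decoding-latency reduction the abstract mentions.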