Overcoming Vocabulary Constraints with Pixel-level Fallback (2504.02122v1)

Published 2 Apr 2025 in cs.CL

Abstract: Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained LLMs with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric LLMs, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual LLMs without extensive retraining and reduces decoding latency via input compression.
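The input-compression claim can be illustrated with a rough back-of-the-envelope sketch. It is not the paper's implementation: the monospaced renderer, the 8-pixel character width, the 16-pixel patch width, and both function names are assumptions made purely for illustration. It compares how many input positions a string occupies under byte-level fallback with how many fixed-width image patches would cover the same text once rendered.

```python
import math


def byte_fallback_length(text: str) -> int:
    """Input positions needed if every character falls back to UTF-8 bytes."""
    return len(text.encode("utf-8"))


def pixel_patch_length(text: str, char_width_px: int = 8, patch_width_px: int = 16) -> int:
    """Fixed-width patches needed to cover the text rendered on a single line.

    Assumption: a hypothetical monospaced renderer in which each character is
    char_width_px pixels wide. Real renderers use variable glyph widths.
    """
    rendered_width_px = len(text) * char_width_px
    return math.ceil(rendered_width_px / patch_width_px)


if __name__ == "__main__":
    sample = "Привет, мир"  # Cyrillic letters take two UTF-8 bytes each
    print("byte-level positions:", byte_fallback_length(sample))
    print("pixel patches:", pixel_patch_length(sample))
```

Under these assumed widths, each patch covers two characters regardless of script, whereas byte-level fallback spends two or more positions per non-Latin character. The actual ratio depends on the renderer and patch size used.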
