Overcoming Vocabulary Constraints with Pixel-level Fallback

Published 2 Apr 2025 in cs.CL | (2504.02122v1)

Abstract: Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained LLMs with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric LLMs, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual LLMs without extensive retraining and reduces decoding latency via input compression.
