From Pixels and Words to Waves: A Unified Framework for Spectral Dictionary vLLMs (2506.18943v1)
Abstract: Vision-LLMs (VLMs) unify computer vision and natural language processing in a single architecture capable of interpreting and describing images. Most state-of-the-art systems rely on two computationally intensive components: convolutions in the vision encoder and quadratic self-attention for multimodal fusion. This work removes both by introducing a spectral dictionary token mixer, which represents each image patch or wordpiece as a sparse combination of learnable frequency atoms. Our 1.1B-parameter prototype, SDict-VLM, achieves BLEU-4 of 39.2, CIDEr of 127.5, and SPICE of 27.0 on MS-COCO captioning, along with 50.3 percent accuracy on VQAv2. These results close approximately 85 percent of the performance gap to BLIP-2 while using 60 percent fewer parameters, 2.3 times less peak GPU memory, and 2.2 times faster inference than PaLI-3. To our knowledge, this is the first VLM to eliminate both convolutions and self-attention while matching mid-scale transformer baselines. In addition to its O(L log L) complexity, the shared frequency dictionary enables transparent cross-modal alignment and offers a tunable trade-off between accuracy and compute, paving the way for efficient and interpretable VLMs.
Collections
Sign up for free to add this paper to one or more collections.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.