The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

Published 3 Apr 2026 in cs.RO, cs.CV, and cs.LG | (2604.03191v1)

Abstract: Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it--regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.


Authors (1)

Takuya Shiba

Summary

The paper introduces the Compression Gap principle, showing that fixed-capacity discrete tokenization limits the propagation of vision encoder improvements in VLA models.
Empirical results on the LIBERO benchmark reveal that continuous action representations gain over 21 percentage points in success rate, while discrete ones gain at most 10 percentage points.
The findings suggest practical strategies for scaling Physical AI: either predict continuous actions or use adaptive codebooks so that the action representation is no longer the bottleneck.

Information Bottlenecks in Vision-Language-Action Model Scaling

Principle of Compression Gap

The paper establishes the Compression Gap as an information-theoretic principle governing Vision-Language-Action (VLA) model scaling: within a visuomotor pipeline, the tightest information bottleneck determines whether scaling any single component pays off. When actions are parameterized in a continuous domain (e.g., Diffusion Policy), downstream task performance is tightly coupled to vision encoder quality, and upgrading the encoder improves action prediction by increasing the mutual information $I(O; Z)$. Conversely, discrete tokenization approaches (e.g., OAT) introduce a fixed-capacity codebook, so the mutual information $I(Z; T)$ is upper-bounded by the codebook’s capacity, $H_l \log_2 |\mathcal{V}|$ bits. This bound decouples encoder richness from downstream manipulation capability: once the codebook saturates, any additional upstream information is lost in quantization.
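The bound can be spelled out as a short chain of inequalities. This is a sketch under stated assumptions rather than the paper's own derivation: it treats the pipeline as a Markov chain $O \to Z \to T \to A$ (read here as observation, encoder features, action tokens, and executed action), introduces $A$ as a symbol that does not appear in the summary, and reads $H_l$ as the number of tokens emitted per action chunk, which the summary does not define.

$$
I(O; A) \;\le\; \min\{\, I(O; Z),\; I(Z; T) \,\}, \qquad I(Z; T) \;\le\; H(T) \;\le\; H_l \log_2 |\mathcal{V}|
$$

The first inequality is the data processing inequality applied along the chain; the second holds because a sequence of $H_l$ symbols drawn from a vocabulary of size $|\mathcal{V}|$ has entropy of at most $H_l \log_2 |\mathcal{V}|$ bits. With illustrative values that are not taken from the paper, $H_l = 8$ and $|\mathcal{V}| = 1024$ give $8 \times 10 = 80$ bits, so raising $I(O; Z)$ beyond that level cannot raise $I(O; A)$.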

Empirical Validation and Numerical Evidence

The authors validate the framework on the LIBERO benchmark along three experimental axes: encoder upgrades, an encoder quality gradient, and codebook size variation. With continuous action representations, vision encoder upgrades raise success rate by more than 21 percentage points. Discrete action representations show attenuated gains ($\leq$ 10 percentage points), consistent with the hypothesis that encoder improvements are blocked at the codebook bottleneck. Across four vision encoders (ResNet-18, SigLIP, DINOv2, SigLIP 2), only the continuous pathway improves monotonically with encoder quality; the discrete pathway remains flat. Finally, increasing codebook capacity in OAT partially recovers sensitivity to encoder quality, providing causal evidence for the Compression Gap.
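The qualitative pattern in these experiments (continuous tracks encoder quality, discrete stays flat, a larger codebook recovers some sensitivity) can be reproduced in a toy calculation. The sketch below is not the paper's experiment and uses no LIBERO data: the uniform source, the integer-binning "encoder", and the names `encode`, `tokenize`, `entropy_bits`, and `info_through` are all hypothetical stand-ins chosen for illustration.

```python
import math
from collections import Counter
from typing import Optional

N_STATES = 4096  # hypothetical number of distinguishable world states


def encode(x: int, levels: int) -> int:
    """Toy 'encoder': resolves the state into `levels` equal-width bins."""
    return x * levels // N_STATES


def tokenize(z: int, levels: int, codebook_size: int) -> int:
    """Toy 'codebook': re-bins the feature when it has more levels than codes."""
    if codebook_size >= levels:
        return z  # codebook is not binding; the feature passes through
    return z * codebook_size // levels


def entropy_bits(symbols) -> float:
    """Empirical Shannon entropy of a sequence of symbols, in bits."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def info_through(levels: int, codebook_size: Optional[int]) -> float:
    """Bits about the state that survive the toy pipeline.

    The pipeline is deterministic, so I(state; output) equals H(output).
    codebook_size=None models a continuous (unquantized) action head.
    """
    outputs = []
    for x in range(N_STATES):
        z = encode(x, levels)
        if codebook_size is not None:
            z = tokenize(z, levels, codebook_size)
        outputs.append(z)
    return entropy_bits(outputs)


if __name__ == "__main__":
    for levels in (8, 64, 256, 1024):  # increasing toy "encoder quality"
        cont = info_through(levels, None)
        disc = info_through(levels, 16)
        print(f"levels={levels:5d}  continuous={cont:5.2f}  discrete(K=16)={disc:5.2f}")
```

In this toy setting the continuous column grows from 3 to 10 bits as the encoder resolves more levels, while the 16-code column stops at 4 bits once the encoder exceeds 16 levels; enlarging the codebook to 256 codes lifts that ceiling to 8 bits. The numbers say nothing about success rates, only about where the information ceiling sits.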

Implications for Physical AI Scalability

The study argues that scaling strategies must be bottleneck-aware: uniformly increasing data, parameter count, or encoder quality does not guarantee downstream improvements unless the pipeline’s structural constraints are taken into account. With discrete tokenization, lossy compression stages impose hard bounds on information throughput, which undercuts the intuition that “better vision equals better manipulation.” The same reasoning is expected to apply to other discrete tokenization methods (e.g., FAST, Binning, VQ-BeT), where fixed-capacity codebooks create analogous bottlenecks. Continuous representations allow improvements to propagate end to end, supporting independent scaling of components in complex multimodal systems.

Practical Strategies and Theoretical Extensions

In practice, these results call for caution when pairing foundation vision encoders with discrete action tokenizers. The guidance is twofold: either remove the discrete bottleneck by predicting continuous actions, or use adaptive codebooks with enough capacity that they are no longer the binding constraint. Hybrid architectures that combine discrete reasoning with continuous decoding might offer complementary benefits. The theory also extends beyond action tokenization: the data processing inequality and mutual information bounds apply to any compression stage, including observation encoding, action representation, and inter-module communication. Identifying and relieving structural bottlenecks will be critical for scalable VLA and general Physical AI.
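One way to make "enough capacity" concrete is to invert the capacity bound and ask how large a codebook must be for a given information budget. The sketch below assumes the capacity formula $H_l \log_2 |\mathcal{V}|$ with $H_l$ read as tokens per action chunk; the helper names `capacity_bits` and `min_vocab_size`, and the idea of a known `target_bits` budget, are hypothetical and not a procedure described in the paper.

```python
import math


def capacity_bits(num_tokens: int, vocab_size: int) -> float:
    """Upper bound on bits carried by `num_tokens` symbols from a `vocab_size` codebook."""
    return num_tokens * math.log2(vocab_size)


def min_vocab_size(target_bits: float, num_tokens: int) -> int:
    """Smallest codebook whose capacity reaches `target_bits` using `num_tokens` tokens.

    `target_bits` is an assumed estimate of how much information the upstream
    encoder could usefully pass on; in practice it would have to be measured.
    """
    size = max(2, math.ceil(2 ** (target_bits / num_tokens)))
    # Correct for floating-point rounding in the exponentiation, in both directions.
    while capacity_bits(num_tokens, size) < target_bits:
        size += 1
    while size > 2 and capacity_bits(num_tokens, size - 1) >= target_bits:
        size -= 1
    return size


if __name__ == "__main__":
    # Illustrative numbers only: an 80-bit budget with 8 or 16 tokens per chunk.
    for tokens in (8, 16):
        print(tokens, min_vocab_size(80, tokens))
```

Because capacity grows only logarithmically in codebook size but linearly in the number of tokens, emitting more tokens per chunk is a much cheaper way to raise the ceiling than enlarging the vocabulary. Whether a policy can actually learn to use that capacity is a separate question this arithmetic does not address.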

Limitations and Prospects

Limitations include the focus on LIBERO-10, a restricted set of vision encoders, and two model scales. Avenues for future work include validation on additional benchmarks, exploration of vision encoders targeting task-relevant perceptual features (e.g., affordance, geometry, spatial dynamics), and systematic study of information bottlenecks in modular AI pipelines. The observed performance plateau with state-of-the-art vision encoders in continuous pathways suggests the need for novel encoders optimized for physical interaction, distinct from semantic or self-supervised learning regimes.

Conclusion

This paper shows that scaling behavior in VLA models depends not only on encoder quality but also on the structural location of information bottlenecks. Fixed-capacity discrete action tokenization restricts the propagation of upstream improvements, while continuous representation enables componentwise scaling. These findings underscore the need for bottleneck-aware architecture design in Physical AI, shifting emphasis from uniform scaling to identifying and relieving lossy compression stages. The Compression Gap framework has immediate utility for model selection, pipeline optimization, and future research into scalable multimodal AI systems.