
UIPress: Bringing Optical Token Compression to UI-to-Code Generation

Published 10 Apr 2026 in cs.CL | (2604.09442v1)

Abstract: UI-to-Code generation requires vision-language models (VLMs) to produce thousands of tokens of structured HTML/CSS from a single screenshot, making visual token efficiency critical. Existing compression methods either select tokens at inference time using task-agnostic heuristics, or zero out low-attention features without actually shortening the sequence; neither truly reduces prefill latency or adapts to the non-uniform information density of UI screenshots. Meanwhile, optical (encoder-side learned) compression has shown strong results for document OCR, yet no prior work has adapted this paradigm to UI-to-Code generation. We propose UIPress, a lightweight learned compression module inserted between the frozen ViT encoder and the LLM decoder of Qwen3-VL-8B. UIPress combines depthwise-separable convolutions, element-guided spatial reweighting, and Transformer refinement to compress ∼6,700 visual tokens to a fixed budget of 256. Together with Low-Rank Adaptation (LoRA) on the decoder to bridge the representation gap, the entire system adds only ∼21.7M trainable parameters (0.26% of the 8B base model). Under a fair comparison on the same base model against four baselines on Design2Code, UIPress at 256 tokens achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5% and the strongest inference-time method by +4.6%, while delivering 9.1× time-to-first-token speedup. To the best of our knowledge, UIPress is the first encoder-side learned compression method for the UI-to-Code task.

Summary

  • The paper introduces a learned optical token compression technique that compresses UI screenshots to 256 tokens, achieving up to 7.5% CLIP score improvement over uncompressed outputs.
  • It integrates a cascaded convolutional compressor with LoRA-adapted decoding, yielding a 9.1× speedup in time-to-first-token with only 0.26% extra parameters.
  • Empirical evaluations and ablation studies demonstrate that UIPress outperforms standard and inference-time token reduction methods in both quality and efficiency.

UIPress: Learned Optical Token Compression for Efficient UI-to-Code Generation

Introduction and Motivation

Automated UI-to-Code generation, the translation of UI screenshots into functional HTML/CSS, poses substantial challenges to vision-language models (VLMs) because of the long output sequences required (1K–4K tokens) and the dense visual tokenization induced by modern ViT-based encoders. For instance, Qwen3-VL-8B yields ∼6,700 visual tokens per typical high-resolution page, dominating both memory and latency during decoding. Existing token reduction approaches, including inference-time heuristics (e.g., feature-zeroing, token selection by L2 norm, resolution scaling), only partially mitigate runtime overhead and often fail to align with the highly non-uniform information distribution in UI screenshots. Moreover, encoder-side (optical) compression, successful in OCR, has not yet been adapted to UI-to-Code, largely due to representation mismatch and lack of task specificity.

Figure 1: Comparison between conventional visual token processing and the learned compression framework introduced by UIPress for UI-to-Code generation.

Methodological Framework of UIPress

UIPress introduces a modular, learned optical compression pipeline between a frozen ViT encoder and the Qwen3-VL-8B LLM decoder. The framework encompasses three key components:

  1. Convolutional Optical Compressor: Utilizes cascaded depthwise-separable convolutions for spatial downsampling (4×), followed by adaptive pooling and element-guided token reweighting using OmniParser V2 detections. This yields a compressed, fixed-length sequence (typically K=256, down from ∼6,700).
  2. Decoder Adaptation via LoRA: Low-Rank Adaptation is jointly trained on all query and value projections of the frozen LLM decoder, bridging the representation gap without incurring the cost of full model fine-tuning.
  3. End-to-End Joint Training: The combination of compressor and LoRA adapters is optimized using a standard autoregressive objective on 50K WebSight screenshot–HTML pairs.
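The pooling and reweighting step in component 1 can be illustrated with a minimal pure-Python sketch. It is not the paper's module: the learned convolutions and Transformer refinement are omitted, and the box format, the `boost` factor, and both function names are assumptions made for illustration. The sketch only shows how a variable-size grid of visual features can be reduced to a fixed token budget while cells inside detected UI elements count more toward each pooled token.

```python
from typing import List, Sequence, Tuple

# Hypothetical box format: (row0, col0, row1, col1), end-exclusive, in grid cells.
Box = Tuple[int, int, int, int]
Grid = List[List[List[float]]]  # H x W x D feature grid


def element_weights(height: int, width: int, boxes: Sequence[Box],
                    boost: float = 2.0) -> List[List[float]]:
    """Per-cell weights: 1.0 for background, `boost` inside any element box."""
    weights = [[1.0] * width for _ in range(height)]
    for r0, c0, r1, c1 in boxes:
        for r in range(max(r0, 0), min(r1, height)):
            for c in range(max(c0, 0), min(c1, width)):
                weights[r][c] = boost
    return weights


def weighted_adaptive_pool(grid: Grid, weights: List[List[float]],
                           out_h: int, out_w: int) -> List[List[float]]:
    """Pool an H x W grid to out_h * out_w tokens using a weighted mean per bin."""
    height, width, dim = len(grid), len(grid[0]), len(grid[0][0])
    tokens = []
    for i in range(out_h):
        # Adaptive-pooling bin edges: floor of the start, ceiling of the end.
        r0, r1 = (i * height) // out_h, -((-(i + 1) * height) // out_h)
        for j in range(out_w):
            c0, c1 = (j * width) // out_w, -((-(j + 1) * width) // out_w)
            total, norm = [0.0] * dim, 0.0
            for r in range(r0, r1):
                for c in range(c0, c1):
                    w = weights[r][c]
                    norm += w
                    for d in range(dim):
                        total[d] += w * grid[r][c][d]
            tokens.append([v / norm for v in total])
    return tokens


# Toy 4x4 grid with one "element" cell at the top-left corner.
demo_grid = [[[0.0] for _ in range(4)] for _ in range(4)]
demo_grid[0][0] = [1.0]
demo_weights = element_weights(4, 4, [(0, 0, 1, 1)], boost=3.0)
demo_tokens = weighted_adaptive_pool(demo_grid, demo_weights, 2, 2)
# demo_tokens[0] is [0.5]; an unweighted mean over the same bin would give 0.25.
```

Calling `weighted_adaptive_pool` with `out_h=16, out_w=16` yields 256 tokens for any input grid size, which is one way (assumed here, not confirmed by the source) to reach a fixed budget of K=256.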

The total added parameter count is low, ∼21.7M (0.26% of base parameters), which positions UIPress as both practical and efficient.
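To make the parameter overhead concrete, the following sketch counts LoRA parameters for a set of adapted projections and back-computes the base model size implied by the reported figures. The layer count, hidden size, and rank in the example are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two matrices per projection: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def total_lora_params(num_layers: int,
                      projections: Sequence[Tuple[int, int]],
                      rank: int) -> int:
    """Sum LoRA parameters over all adapted projections in every layer."""
    per_layer = sum(lora_param_count(d_in, d_out, rank)
                    for d_in, d_out in projections)
    return num_layers * per_layer


def implied_base_params(added: float, fraction: float) -> float:
    """Base model size implied by an added-parameter count and its share."""
    return added / fraction


# Illustrative only: 32 layers, square 4096-wide query and value projections, rank 8.
example_total = total_lora_params(32, [(4096, 4096), (4096, 4096)], rank=8)

# Figures reported in the text: ~21.7M added parameters, 0.26% of the base model.
implied_base = implied_base_params(21.7e6, 0.0026)  # roughly 8.3e9
```

Under these toy settings the adapters alone account for about 4.2M parameters; the reported 21.7M also includes the compressor itself, whose breakdown the summary does not give.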

Figure 2: Schematic overview of UIPress, detailing ViT encoding, convolutional compression, element-guided pooling, Transformer refinement, and LoRA-augmented decoding.

Empirical Results and Comparative Analysis

UIPress is evaluated on Design2Code, pitted against Qwen3-VL-8B with no compression and four inference-time compression baselines (resolution scaling, VisionZip, EfficientUICoder, FastV). Key highlights and strong claims include:

  • Substantial quality improvement under strong compression: UIPress at K=256 (25.5× compression) achieves a CLIP score of 0.8127, outperforming the uncompressed baseline by +7.5%, the strongest inference-time method by +4.6%, and VisionZip at equal token count by +10.8%, with a 9.1× speedup in time-to-first-token.
  • Ablation studies indicate that LoRA adaptation is responsible for the majority of the gain, but all system components (convolutional compression, transformer refinement, element reweighting) are synergistic.
  • Aggressive compression with minimal performance drop: The method demonstrates a sharp quality–efficiency elbow at K=256, with further reduction to K=128 inducing a 10.8% CLIP loss (Figure 3).

    Figure 3: Token count ablation reveals an optimal trade-off at K=256, surpassing non-compressed and resolution-scaled baselines.
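The compression figures quoted above are related by simple arithmetic, sketched below. The helper names are hypothetical. The sketch shows that a 25.5× ratio at K=256 corresponds to an average of about 6,528 input tokens, which is consistent with the approximate ∼6,700 figure given for a typical page.

```python
def compression_ratio(input_tokens: float, budget: int) -> float:
    """Ratio of visual tokens before compression to tokens after."""
    return input_tokens / budget


def implied_input_tokens(ratio: float, budget: int) -> float:
    """Input token count implied by a reported ratio and token budget."""
    return ratio * budget


ratio_from_typical_page = compression_ratio(6700, 256)   # 26.171875
avg_tokens_from_ratio = implied_input_tokens(25.5, 256)  # 6528.0
ratio_at_half_budget = compression_ratio(avg_tokens_from_ratio, 128)  # 51.0
```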

Training curves (Figure 4) show clear monotonic ascent on validation CLIP with no discernible overfitting, attesting to stable convergence properties.

Figure 4: UIPress training dynamics: validation CLIP increases from 0.7232 (random init) to a peak of 0.8127 (epoch 17), surpassing all baselines.

Pareto plots of compression-quality trade-offs (Figure 5) position UIPress-256 as Pareto-optimal among the tested configurations.

Figure 5: Comparative plot of method–budget configurations demonstrates UIPress-256’s Pareto optimality for compression vs. CLIP.
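Pareto optimality here means that no other configuration uses fewer or equal tokens while also scoring at least as well, with one of the two strictly better. A minimal sketch of that check follows. The configuration names and numbers are made up for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple


def pareto_front(configs: Dict[str, Tuple[int, float]]) -> List[str]:
    """Return names of configurations that no other configuration dominates.

    Each value is (visual_tokens, score): fewer tokens and higher score are better.
    """
    front = []
    for name, (tokens, score) in configs.items():
        dominated = any(
            t <= tokens and s >= score and (t < tokens or s > score)
            for other, (t, s) in configs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (tokens, score) pairs, NOT measurements from the paper.
toy_configs = {
    "config_a": (1000, 0.70),
    "config_b": (256, 0.80),
    "config_c": (256, 0.72),
    "config_d": (128, 0.71),
    "config_e": (512, 0.75),
}
toy_front = pareto_front(toy_configs)  # ["config_b", "config_d"]
```

In the toy data, `config_d` stays on the front despite its lower score because nothing else is as small, which is why a configuration at a smaller budget can share the frontier with the best-scoring one.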

Qualitative comparisons (Figure 6) confirm that even at extreme compression, UIPress-generated HTML reproductions maintain sidebar layouts, preserve fine structure, and avoid the content collapse that afflicts non-learned baselines.

Figure 6: Qualitative case: UIPress best preserves sidebar structure and section-level styling at 256 tokens, with superior CLIP versus all baselines.

Theoretical Perspective and Broader Implications

UIPress is, to date, the first encoder-side learned compression adapted to UI-to-Code. Its efficacy is underpinned by:

  • Task-aligned bottlenecking: The framework harnesses spatial redundancy and non-uniform semantic density in page screenshots via learned, element-prioritized pooling. The information bottleneck interpretation clarifies why high compression rates are plausible—UI pages are heavily over-tokenized by ViT encoders.
  • Joint representational adaptation: Introduction of LoRA to the decoder is essential for mitigating distribution mismatch, demonstrating that lightweight adaptation is necessary for effective utilization of novel compressed representations.
  • “Beyond-lossless” compression: Contrary to naive information-theoretic bounds, learned bottlenecks can improve performance by denoising or focusing model capacity, a result directly observed in both empirical metrics and qualitative fidelity.

Practically, UIPress enables the deployment of high-fidelity UI code generation with order-of-magnitude latency and memory reductions, a substantial advance toward real-world multimodal agents capable of synthesis or automated frontend engineering.

Limitations and Future Prospects

Notable limitations include:

  • Reliance on large (50K) task-specific training sets for optimal performance.
  • Fixed-token budgets per instance; adaptive allocation according to page complexity is an open problem.
  • CLIP is the principal metric—evaluation on fine-grained structural or semantic correctness remains an area for follow-up.
  • Reimplementation for mobile, Figma, or native apps would require domain adaptation.

The theoretical angle suggests further exploration along the lines of dynamic, instance-adaptive compression strategies, extending to other multimodal tasks (e.g., complex document synthesis, GUI automation), and possible co-design of visual encoder and bottleneck modules.

Conclusion

UIPress establishes encoder-side optical compression as a viable, high-yield approach to visual token reduction in UI-to-Code tasks. By combining convolutional downsampling, task-specific reweighting, transformer refinement, and lightweight decoder adaptation, it achieves significant gains in both output quality and computational efficiency—realizing truly practical, latency-aware UI-to-Code systems with minimal parameter overhead. This paradigm is likely extensible to broader classes of vision-language tasks exhibiting similar long-range, non-uniform, and semantically structured input domains.

Reference: "UIPress: Bringing Optical Token Compression to UI-to-Code Generation" (2604.09442)
