Papers

Topics

Authors

Recent

View all

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 81 tok/s

Gemini 2.5 Pro 42 tok/s Pro

GPT-5 Medium 23 tok/s Pro

GPT-5 High 20 tok/s Pro

GPT-4o 103 tok/s Pro

Kimi K2 188 tok/s Pro

GPT OSS 120B 454 tok/s Pro

Claude Sonnet 4 38 tok/s Pro

2000 character limit reached

EfficientUICoder: Efficient MLLM-based UI Code Generation via Input and Output Token Compression (2509.12159v1)

Published 15 Sep 2025 in cs.SE and cs.AI

Abstract: Multimodal LLMs have demonstrated exceptional performance in UI2Code tasks, significantly enhancing website development efficiency. However, these tasks incur substantially higher computational overhead than traditional code generation due to the large number of input image tokens and extensive output code tokens required. Our comprehensive study identifies significant redundancies in both image and code tokens that exacerbate computational complexity and hinder focus on key UI elements, resulting in excessively lengthy and often invalid HTML files. We propose EfficientUICoder, a compression framework for efficient UI code generation with three key components. First, Element and Layout-aware Token Compression preserves essential UI information by detecting element regions and constructing UI element trees. Second, Region-aware Token Refinement leverages attention scores to discard low-attention tokens from selected regions while integrating high-attention tokens from unselected regions. Third, Adaptive Duplicate Token Suppression dynamically reduces repetitive generation by tracking HTML/CSS structure frequencies and applying exponential penalties. Extensive experiments show EfficientUICoderachieves a 55%-60% compression ratio without compromising webpage quality and delivers superior efficiency improvements: reducing computational cost by 44.9%, generated tokens by 41.4%, prefill time by 46.6%, and inference time by 48.8% on 34B-level MLLMs. Code is available at https://github.com/WebPAI/EfficientUICoder.

Summary

The paper introduces a bidirectional token compression framework that reduces both visual and code token redundancies in MLLM-based UI code generation.
It combines element detection, attention-guided token refinement, and adaptive duplicate suppression to achieve significant improvements in FLOPs, token count, and inference speed.
Experimental results demonstrate up to 60% compression and enhanced output quality, validated by both automatic metrics and human evaluations.

EfficientUICoder: Bidirectional Token Compression for Efficient MLLM-Based UI Code Generation

Introduction and Motivation

EfficientUICoder addresses the computational inefficiencies inherent in Multimodal LLM (MLLM)-based UI-to-code (UI2Code) tasks. These inefficiencies stem from the excessive number of input image tokens and output code tokens required to represent complex webpage designs and their corresponding HTML/CSS implementations. Empirical analysis reveals that over 90% of token consumption in UI2Code benchmarks is attributed to image and code tokens (Figure 1).

Figure 1: The token ratios of different datasets.

The paper identifies two primary sources of redundancy: (1) visual token redundancy, where background and non-informative regions inflate input sequence length and distract model attention, and (2) code token redundancy, where MLLMs frequently generate duplicate HTML/CSS structures and textual content, leading to invalid or excessively lengthy outputs. These redundancies result in quadratic growth in computational complexity due to the self-attention mechanism's dependence on sequence length.

Redundancy Analysis in UI2Code Tasks

Visual Token Redundancy

Attention score visualizations using CLIP-based encoders demonstrate that only a minority of tokens correspond to semantically meaningful UI elements, while the majority are allocated to background regions (Figure 2). This misallocation not only increases computational cost but also degrades model focus on critical interface components.

Figure 2: Visual encoders' attention score visualization and distribution on two webpages.

Code Token Redundancy

Manual and automated analysis of MLLM outputs on Design2Code and WebCode2M datasets reveals three categories of code redundancy: repeated CSS selectors/properties, redundant HTML structural patterns, and duplicate textual content (Figure 3). These patterns can persist until the model's token limit is reached, resulting in invalid HTML and wasted computation.

Figure 3: Duplicate code token examples.

EfficientUICoder Framework

EfficientUICoder introduces a bidirectional compression pipeline comprising three modules:

Element and Layout-aware Token Compression (ELTC): Utilizes UI element detection (UIED) to extract bounding boxes for all UI components, merges fragmented text regions, and constructs a UI element graph. A minimum spanning tree (MST) algorithm is applied to preserve essential layout relationships while minimizing token count.
Region-aware Token Refinement (RTR): Refines the token set by leveraging attention scores from the visual encoder. Low-attention tokens within selected regions are dropped, while high-attention tokens from unselected regions (e.g., informative backgrounds) are incorporated. The refinement ratio $r$ is tuned to balance compression and semantic preservation.
Adaptive Duplicate Token Suppression (ADTS): During decoding, HTML/CSS and text repetition frequencies are tracked via custom parsers. An exponential penalty is applied to the logits of repeated tokens over a suppression window $s$ , with decay factor $\lambda$ , dynamically reducing the probability of redundant token generation.
Figure 4: EfficientUICoder framework.

Experimental Results

Performance and Compression

EfficientUICoder achieves 55–60% image token compression and 56–94% code redundancy reduction without compromising UI2Code performance. On Design2Code and WebCode2M, EfficientUICoder outperforms vanilla and baseline compression methods (Random, FastV, Pdrop, VisionZip) across block-match, text, position, color, CLIP, and BLEU metrics. Notably, EfficientUICoder sometimes surpasses vanilla in certain metrics due to improved focus on key UI elements and effective redundancy suppression.

Human Evaluation

Human annotators consistently prefer EfficientUICoder outputs over baselines, corroborating automatic metric results (Figure 5).

Figure 5: Human Evaluation on Design2Code.

Efficiency Gains

EfficientUICoder delivers substantial efficiency improvements: up to 44.9% reduction in FLOPs, 41.4% fewer generated tokens, 46.6% lower prefill time, and 48.8% faster inference time on 34B-level MLLMs. These gains are robust across datasets and model scales.

Figure 6: Inference time under different $\lambda$ and $r$ .

Ablation and Parameter Studies

Ablation experiments confirm that each module (ELTC, RTR, ADTS) contributes positively to performance and efficiency, with the full combination yielding optimal results. Parameter sweeps show that moderate suppression steps ( $s=3$ ) and decay factors ( $\lambda=0.5$ ) best balance redundancy penalization and output quality. The refinement ratio $r$ is optimal at 5–10%, and adding more tokens from unselected regions yields diminishing returns (Figures 7, 8, 10).

Figure 7: Suppression step $s$ and decay factor $\lambda$ analysis on Design2Code datatset.

Figure 8: Suppression step $s$ and decay factor $\lambda$ analysis on WebCode2M datatset.

Figure 9: The performance and efficiency on two datasets when adding unselected token. D2C denotes Design2Code and WC2M denotes WebCode2M.

Case Studies

Qualitative analysis demonstrates that EfficientUICoder reliably selects tokens covering all critical UI elements, avoids background redundancy, and prevents cyclical code generation. In contrast, baseline methods either omit essential components or generate excessive duplicates, resulting in incomplete or invalid webpages (Figures 11, 12).

Figure 10: Case paper of webpages generated by Vanilla, EfficientUICoder and VisionZip.

Figure 11: Token selection results.

Implementation Considerations

EfficientUICoder requires access to the visual encoder's attention maps and the ability to modify both encoding and decoding stages of the MLLM. This restricts deployment to open-source models (e.g., Llava-v1.6) rather than commercial black-box APIs. The framework is compatible with standard UI element detection tools and can be integrated into existing MLLM pipelines with moderate engineering effort. Resource requirements scale with model size, but the compression modules substantially mitigate memory and compute bottlenecks.

Implications and Future Directions

EfficientUICoder establishes that bidirectional token compression—jointly optimizing input and output redundancy—can dramatically improve the efficiency and reliability of MLLM-based UI2Code systems. The approach is extensible to other multimodal code generation tasks (e.g., slide, poster, or app synthesis) and can be adapted for real-time or resource-constrained deployment scenarios. Future work may explore adaptive compression strategies conditioned on design complexity, integration with commercial MLLMs as APIs evolve, and further refinement of redundancy detection via learned or self-supervised methods.

Conclusion

EfficientUICoder provides a principled, empirically validated solution to the computational challenges of MLLM-based UI code generation. By combining element/layout-aware token selection, attention-guided refinement, and dynamic output suppression, it achieves high compression ratios and efficiency gains without sacrificing output fidelity. The framework advances the state of the art in multimodal code generation and offers a robust foundation for scalable, practical deployment in web development automation.