
UI2Codeⁿ: UI Generation & Narayana Coding

Updated 13 November 2025
  • UI2Codeⁿ names two unrelated things: the first is a cutting-edge visual language model for interactive UI-to-code synthesis with iterative feedback and test-time scaling.
  • That model is a multimodal Transformer fusing a visual encoder with a language decoder, trained via pre-training, supervised fine-tuning, and reinforcement learning.
  • Separately, UI2Codeⁿ denotes a Narayana-based universal integer code with unique, prefix-free representations, logarithmic codeword growth, and proven optimality.

UI2Codeⁿ is a term with two rigorously specified but unrelated referents: a contemporary visual language model for interactive UI-to-code generation (Yang et al., 11 Nov 2025), and a universal binary integer code governed by the Narayana sequence (Kirthi et al., 2016). The dual usage spans modern AI system design and mathematical coding theory, each with its own formalism, objectives, and operational mechanics; the shared name is the only link between them.

1. Visual LLM for Interactive UI-to-Code Generation

The first and currently prominent usage of UI2Codeⁿ designates an open-source visual language model (VLM) architecture tailored for automatic, interactive, and test-time-scalable user interface generation from rendered screenshots. UI2Codeⁿ advances multimodal UI coding by unifying three core capabilities: direct UI-to-code synthesis, UI editing via natural-language instruction, and iterative UI code polishing. It thereby addresses two limitations of prior work: underdeveloped multimodal reasoning and non-interactive, single-pass generation.

1.1 Model Architecture

UI2Codeⁿ utilizes a multimodal encoder–decoder Transformer with the following building blocks:

  • Visual Encoder: A ViT-style backbone employing patch embedding, positional encoding, and self-attention, operating on UI screenshots or renders.
  • Language Decoder: A causal Transformer generating structured code tokens (typically HTML, CSS, JS).
  • Cross-modal Fusion: Decoder layers incorporate cross-attention to visual encoder outputs:

H_{\text{cross}} = \mathrm{softmax}\!\left(\frac{Q_{\text{dec}} K_{\text{vis}}^{\top}}{\sqrt{d}}\right) V_{\text{vis}},

where $Q_{\text{dec}} \in \mathbb{R}^{T \times d}$ are the decoder queries and $(K_{\text{vis}}, V_{\text{vis}}) \in \mathbb{R}^{N_{\text{patch}} \times d}$ are the visual keys/values.
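
To make the formula concrete, the fusion layer is ordinary scaled dot-product cross-attention from decoder tokens to image patches. A minimal single-head PyTorch sketch, with illustrative shapes and names rather than the released implementation:

import torch
import torch.nn.functional as F

def cross_modal_fusion(q_dec, k_vis, v_vis):
    """Single-head cross-attention from decoder queries to visual tokens.

    q_dec:  (T, d)        decoder queries
    k_vis:  (N_patch, d)  visual keys
    v_vis:  (N_patch, d)  visual values
    Returns (T, d) fused hidden states H_cross.
    """
    d = q_dec.size(-1)
    scores = q_dec @ k_vis.transpose(-1, -2) / d ** 0.5  # (T, N_patch)
    attn = F.softmax(scores, dim=-1)                     # each row sums to 1
    return attn @ v_vis                                  # (T, d)

# Toy shapes: 5 code tokens attending over 196 image patches, width 64.
H_cross = cross_modal_fusion(torch.randn(5, 64), torch.randn(196, 64), torch.randn(196, 64))
print(H_cross.shape)  # torch.Size([5, 64])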

Inputs and decoding prompts distinguish among three tasks:

  • UI-to-code: screenshot → code, with special <answer>…</answer> scaffolding around the emitted code.
  • UI polishing: target screenshot, initial code tokens, and the rendered output → refined code.
  • UI editing: reference screenshot, prior code, and text instruction → revised code.

The same model core is reused, altering only modality subcomponents and task-specific input templates.
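
To make the template reuse concrete, here is a hedged sketch of how the three task inputs might be assembled; the field names, tags, and wording are hypothetical placeholders, not the actual UI2Codeⁿ scaffolding:

def build_prompt(task, screenshot, code=None, render=None, instruction=None):
    """Assemble a task-specific multimodal input for the shared model core.

    The (modality, payload) layout and instruction strings are illustrative
    only; the point is that the core is fixed and only the template varies.
    """
    if task == "ui2code":
        return [("image", screenshot),
                ("text", "Generate code reproducing this UI.")]
    if task == "polish":
        return [("image", screenshot), ("text", code), ("image", render),
                ("text", "Refine the code so the render matches the target.")]
    if task == "edit":
        return [("image", screenshot), ("text", code),
                ("text", f"Apply this edit: {instruction}")]
    raise ValueError(f"unknown task: {task}")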

1.2 Training Methodology

The UI2Codeⁿ training pipeline consists of three stages:

  1. Continual Pre-training: Approximately $10^7$ real webpage pairs (screenshot + HTML) and $2 \times 10^6$ synthetic pairs are used, interleaved with general VLM objectives (captioning, VQA, OCR, video QA). Primary objectives are:

    • Masked language modeling (MLM) on code tokens:

    \mathcal{L}_{\mathrm{MLM}} = -\sum_{t \in M} \log p(x_t \mid x_{<t}, I)

    • GUI referring, predicting HTML spans from screenshots and DOM locations:

    \mathcal{L}_{\mathrm{span}} = -\sum_{i=1}^{L} \log p(h_i \mid h_{<i}, I, \mathrm{bbox})

    • Image–text contrastive learning (see the sketch after this list):

    \mathcal{L}_{\mathrm{CTR}} = -\sum_{i} \log \frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_j \exp(\mathrm{sim}(I_i, T_j)/\tau)}

  2. Supervised Fine-tuning (SFT): Uses 80,000 curated examples with scaffolding for all three tasks; optimizes cross-entropy on canonical target outputs for each task.

  3. Reinforcement Learning (RL): Policy-gradient training (GRPO) on 12,000 real + 30,000 synthetic rollouts. Reward signals come from a VLM-based verifier (GLM-4.5V) that scores image similarity, augmented by a human-aligned “comparator” and round-robin ranking:

J(\theta) = \mathbb{E}_{a_{1:T} \sim \pi_\theta}\Bigl[\sum_{t=1}^{T} r_t\Bigr], \qquad \nabla_\theta J(\theta) = \mathbb{E}\bigl[r_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\bigr]
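
Among these objectives, the contrastive term is the most self-contained. A minimal PyTorch sketch of the one-directional (image-to-text) InfoNCE loss $\mathcal{L}_{\mathrm{CTR}}$ above, with illustrative batch shapes and temperature, not the authors' exact implementation:

import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, tau=0.07):
    """Image-to-text InfoNCE over a batch of paired embeddings.

    img_emb, txt_emb: (B, d); row i of each side forms a positive pair,
    every other row in the batch serves as a negative.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / tau             # (B, B) cosine similarities / temperature
    targets = torch.arange(sim.size(0))   # positives sit on the diagonal
    return F.cross_entropy(sim, targets)  # -log softmax of each row's positive

loss = contrastive_loss(torch.randn(8, 64), torch.randn(8, 64))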

1.3 Multi-Turn Interactive Workflow

A hallmark of UI2Codeⁿ is test-time scaling, leveraging iterative visual feedback:

procedure INTERACTIVE_UI2CODE(I_image, N_rounds):
  C⁰ ← GenerateCode(I_image)            # initial draft from the target screenshot
  R⁰ ← Render(C⁰)
  for t in 1..N_rounds−1:
    Input ← (I_image, Cᵗ⁻¹, Rᵗ⁻¹)       # target, previous code, previous render
    Cᵗ ← PolishingModel(Input)
    Rᵗ ← Render(Cᵗ)
  return C^{N_rounds−1}, R^{N_rounds−1}
Each round typically improves CLIP/VLM score by 2–4%.
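
The same loop can be wired up as a small Python skeleton; the three callables stand in for the model and the renderer and are placeholders, not a real API:

def interactive_ui2code(image, n_rounds, generate_code, render, polish):
    """Multi-turn loop: draft once, then polish against rendered feedback.

    generate_code, render, and polish are injected callables standing in
    for the VLM and the browser renderer; no particular API is assumed.
    """
    code = generate_code(image)            # C^0
    shot = render(code)                    # R^0
    for _ in range(n_rounds - 1):
        code = polish(image, code, shot)   # sees target, prior code, prior render
        shot = render(code)
    return code, shot                      # C^{N-1}, R^{N-1}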

2. Evaluation and Empirical Performance

Evaluation on several public and proprietary UI-to-code and UI-polishing benchmarks demonstrates state-of-the-art open-source performance and competitiveness with leading closed-source models (Claude-4-Sonnet, GPT-5).

Model            | Design2Code Acc | Flame Acc | Web2Code Acc | UI-Polish (Synth) | UI-Polish (Real)
InternVL3-78B    | 30.0            | 51.3      | 45.5         | 15%               | 10%
Qwen2.5-VL-72B   | 41.9            | 46.3      | 64.1         | 38%               | 23%
GLM-4.1V-9B      | 64.7            | 72.5      | 71.3         | 46%               | 42%
Claude-4-Sonnet  | 81.2            | 76.3      | 85.1         | 65%               | 78%
Gemini-2.5-Pro   | 89.5            | 87.5      | 90.6         | 68%               | 74%
GPT-5            | 89.7            | 91.3      | 93.7         | 68%               | 85%
UI2Codeⁿ-9B-RL   | 88.6            | 95.0      | 92.5         | 94%               | 80%

Per column, the best overall scores go to GPT-5 (Design2Code, Web2Code, UI-Polish Real) and UI2Codeⁿ-9B-RL (Flame, UI-Polish Synth); among open-source models, UI2Codeⁿ-9B-RL is the strongest in every column.

Test-time scaling shows that UI2Codeⁿ achieves further gains as polish rounds increase (real benchmark: 66%, 68%, 70%, 73%, and 74% after rounds 1 through 5), and ablations show that RL reward tuning and real data have strong positive effects on real-world benchmarks.

3. Practical Applications and Limitations

UI2Codeⁿ is immediately applicable to automated code generation from UI screens, iterative UI refinement, and natural-language-driven editing of preexisting UIs. Representative qualitative behaviors include:

  • Correction of style-level errors (e.g., button padding, typographic details) after polishing.

  • Restoration of complex layouts initially misrepresented in draft code.

  • Language-driven edits, such as changing navigation bar color or repositioning UI elements by code injection.

Limitations include an inability to handle dynamic or interactive JavaScript widgets (such as carousels and modals), code truncation on inputs exceeding 32k tokens, and occasional pixel-level inaccuracies in sub-pixel rendering scenarios.

4. Universal Coding with the Narayana Sequence

The second sense of UI2Codeⁿ refers to a mathematically rigorous, prefix-free, universal integer code whose codeword lengths and enumeration are determined by the Narayana sequence (Kirthi et al., 2016).

4.1 Narayana Sequence and Coding Basis

The classical Narayana sequence $\{N_k\}_{k \ge 0}$ is recursively defined by $N_0 = N_1 = N_2 = 1$, $N_{k+1} = N_k + N_{k-2}$. Shifting by two gives a basis sequence $\{J_i\}_{i \ge 0}$ with $J_i = N_{i+2}$, i.e. $J_0 = 1, J_1 = 2, J_2 = 3, J_3 = 4, \ldots$, and $J_i = J_{i-1} + J_{i-3}$.

The dominant Narayana ratio $L \approx 1.46557$ satisfies $L^3 - L^2 - 1 = 0$ and controls the exponential growth rate of $J_i$.
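
A few lines of Python reproduce the shifted basis and show the ratio of successive terms converging to $L$ (the function name is ours):

def narayana_basis(m):
    """First m terms of J_i: J_0=1, J_1=2, J_2=3, J_i = J_{i-1} + J_{i-3}."""
    J = [1, 2, 3]
    while len(J) < m:
        J.append(J[-1] + J[-3])
    return J[:m]

J = narayana_basis(30)
print(J[:8])          # [1, 2, 3, 4, 6, 9, 13, 19]
print(J[-1] / J[-2])  # ~1.46557, the real root of L^3 = L^2 + 1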

4.2 Encoding and Decoding Procedures

The UI2Codeⁿ integer code is based on a Zeckendorf-type expansion in the $J$-basis in which the indices of the basis elements used are pairwise at least 3 apart (in particular, no two consecutive basis elements appear).

Encoding

For an integer $n \in \mathbb{N}^+$:

  1. Compute the maximal $d$ with $J_d \le n < J_{d+1}$.

  2. Find the binary vector $B \in \{0,1\}^{d+1}$ such that:

    • $\sum_{i=0}^{d} B_i J_i = n$
    • $B_d = 1$
    • $B_i B_j = 0$ whenever $|i - j| < 3$, i.e. any two set indices are at least 3 apart (the greedy algorithm in Section 4.4 produces exactly this expansion).
  3. The codeword is $c = B_0 B_1 \ldots B_d\,1$, that is, $B$ followed by an appended ‘1’.

Decoding

Given a binary codeword $c$ (ending in ‘11’):

  1. Let $\ell = |c|$ be its length and discard the terminating ‘1’.
  2. Set $d = \ell - 2$, and compute $\{J_0, \ldots, J_d\}$.
  3. $n = \sum_{i=0}^{d} c_i J_i$.

Both directions are well defined: the gap condition makes the expansion unique, and every codeword ends with the signature “11”, which cannot occur anywhere else inside a codeword.
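
Because “11” occurs only as a terminator, a concatenated stream of codewords is instantaneously decodable by cutting after each “11”. A small illustrative Python splitter:

def split_codewords(bits):
    """Split a concatenated bit string into codewords by cutting after each '11'.

    Valid because no '1' pair occurs inside the body B of any codeword,
    so the first '11' encountered is always the current terminator.
    """
    words, start, i = [], 0, 1
    while i < len(bits):
        if bits[i - 1] == '1' and bits[i] == '1':
            words.append(bits[start:i + 1])
            start = i + 1
            i = start + 1   # resume scanning inside the next codeword
        else:
            i += 1
    return words

print(split_codewords("11" + "011" + "10011"))  # ['11', '011', '10011']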

4.3 Theoretical Properties

  • Prefix Property: No codeword is a prefix of another, because the “…11” ending is unique and cannot occur internally (the gap condition forbids adjacent 1s inside $B$).
  • Universality: Every $n \ge 1$ can be represented (see Theorem 2 in (Kirthi et al., 2016)).
  • Length formula: $\ell(n) = d(n) + 2$ with $d(n) = \max\{i : J_i \le n\}$.
  • Asymptotics: $\ell(n) = \log_L n + O(1)$, so codeword length grows logarithmically, with base $L$.
  • Enumeration: The number of codewords of length $\ell$ is $N_{\ell-2}$ (e.g., lengths 2, 3, 4, 5, 6 admit 1, 1, 1, 2, 3 codewords).
  • Redundancy: Tends to zero; the scheme encodes every $n$ deterministically and prefix-free without prior distributional knowledge.
  • Optimality: Alternative bases (e.g., other shifts of the Narayana sequence) either omit some $n$ or lose uniqueness.

4.4 Algorithmic Realizations and Examples

Encoding pseudocode:

function ENCODE_UI2Coden(n):
  Compute J_i until J_d <= n < J_{d+1}   # d = highest index with J_d <= n
  r ← n
  for i = d downto 0:                    # greedy: take the largest J_i that fits
    if J_i <= r:
      B_i ← 1
      r ← r - J_i                        # leftover < J_{i-2}, so set indices stay >= 3 apart
    else:
      B_i ← 0
  c ← concat(B_0 ... B_d, '1')           # append '1'; since B_d = 1, c ends in "11"
  return c
Decoding is the reverse: after dropping the final bit, sum the $J_i$ at each index $i$ where $c_i = 1$.
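
For reference, a compact runnable Python version of both directions, mirroring the pseudocode above (helper names are ours; the gap condition is enforced automatically by the greedy subtraction):

def encode(n):
    """Narayana (UI2Codeⁿ) codeword for a positive integer n."""
    assert n >= 1
    J = [1, 2, 3]
    while J[-1] <= n:                    # grow the basis until J[-1] > n
        J.append(J[-1] + J[-3])
    d = len(J) - 2
    while J[d] > n:                      # largest d with J[d] <= n
        d -= 1
    B, r = [0] * (d + 1), n
    for i in range(d, -1, -1):           # greedy, highest index first
        if J[i] <= r:
            B[i], r = 1, r - J[i]        # leftover < J[i-2]: gaps of >= 3 are automatic
    return ''.join(map(str, B)) + '1'    # append '1'; B_d = 1, so c ends in "11"

def decode(c):
    """Inverse of encode: drop the final '1', sum J_i over the set bits."""
    J = [1, 2, 3]
    while len(J) < len(c) - 1:
        J.append(J[-1] + J[-3])
    return sum(J[i] for i, b in enumerate(c[:-1]) if b == '1')

for n in (1, 2, 3, 5, 100):              # round-trip check
    assert decode(encode(n)) == n
print(encode(5))                         # -> 10011, matching the example table below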

Examples:

n | J-expansion         | Vector B     | Codeword c
1 | $1 \cdot J_0$       | (1)          | 11
2 | $1 \cdot J_1$       | (0, 1)       | 011
3 | $1 \cdot J_2$       | (0, 0, 1)    | 0011
5 | $1 \cdot J_3 + 1 \cdot J_0$ | (1, 0, 0, 1) | 10011

4.5 Comparative Remarks

Because the Narayana ratio $L \approx 1.4656$ is smaller than the golden ratio $\varphi \approx 1.618$, the basis grows more slowly than the Fibonacci numbers, so UI2Codeⁿ codewords are somewhat longer than Fibonacci codewords for the same $n$ (and both exceed the $\log_2 n$ length of plain binary, which is not prefix-free). It is a “universal” code in the Elias sense.

5. Significance and Context

The VLM UI2Codeⁿ sets a new state of the art in open-source visual UI coding, matched closely only by proprietary VLMs. The test-time scalable, multi-turn interaction paradigm allows for systematic improvements leveraging visual feedback, closing the gap between automated and human-in-the-loop design workflows. The code was released publicly, enabling further research.

The UI2Codeⁿ universal integer code constitutes an independent advance in prefix coding theory, extending the family of Fibonacci- and Lucas-based codes by employing the Narayana basis. Its optimality is tied to the combinatorial properties of the sequence, and no alternative shift or variant achieves the same universality and uniqueness guarantees.

The dual appearance of UI2Codeⁿ, in AI systems and in coding theory, is at present purely nominal: beyond the shared name and a common insistence on precise formalism, there is no technical connection between the two.
