
UI2Codeⁿ: UI Generation & Narayana Coding

Updated 13 November 2025
  • UI2Codeⁿ names two unrelated things: the first is a cutting-edge visual language model for interactive UI-to-code synthesis with iterative feedback and test-time scaling.
  • That model is a multimodal Transformer fusing a visual encoder with a language decoder, trained via pre-training, supervised fine-tuning, and reinforcement learning.
  • Separately, UI2Codeⁿ denotes a Narayana-based universal integer code with unique, prefix-free representations, logarithmic codeword growth, and proven optimality.

UI2Codeⁿ is a term with two rigorously specified but unrelated referents: a contemporary visual language model for interactive UI-to-code generation (Yang et al., 11 Nov 2025), and a universal binary integer code governed by the Narayana sequence (Kirthi et al., 2016). The dual usage spans modern AI system design and mathematical coding theory, each with its own formalism, objectives, and operational mechanics; the shared name is the only link between them.

1. Visual LLM for Interactive UI-to-Code Generation

The first and currently prominent usage of UI2Codeⁿ designates an open-source visual language model (VLM) architecture tailored for automatic, interactive, and test-time-scalable user interface generation from rendered screenshots. UI2Codeⁿ advances multimodal UI coding by unifying three core capabilities: direct UI-to-code synthesis, UI editing via natural-language instruction, and iterative UI code polishing. It thereby addresses two limitations of prior work: underdeveloped multimodal reasoning and non-interactive, single-pass generation.

1.1 Model Architecture

UI2Codeⁿ utilizes a multimodal encoder–decoder Transformer with the following building blocks:

  • Visual Encoder: A ViT-style backbone employing patch embedding, positional encoding, and self-attention, operating on UI screenshots or renders.
  • Language Decoder: A causal Transformer generating structured code tokens (typically HTML, CSS, JS).
  • Cross-modal Fusion: Decoder layers incorporate cross-attention to visual encoder outputs:

H_{\text{cross}} = \mathrm{softmax}\!\left(\frac{Q_{\text{dec}} K_{\text{vis}}^{\top}}{\sqrt{d}}\right) V_{\text{vis}},

where $Q_{\text{dec}} \in \mathbb{R}^{T \times d}$ are the decoder queries and $(K_{\text{vis}}, V_{\text{vis}}) \in \mathbb{R}^{N_{\text{patch}} \times d}$ are the visual keys/values.
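
To make the formula concrete, the fusion layer is ordinary scaled dot-product cross-attention from decoder tokens to image patches. A minimal single-head PyTorch sketch, with illustrative shapes and names rather than the released implementation:

import torch
import torch.nn.functional as F

def cross_modal_fusion(q_dec, k_vis, v_vis):
    """Single-head cross-attention from decoder queries to visual tokens.

    q_dec:  (T, d)        decoder queries
    k_vis:  (N_patch, d)  visual keys
    v_vis:  (N_patch, d)  visual values
    Returns (T, d) fused hidden states H_cross.
    """
    d = q_dec.size(-1)
    scores = q_dec @ k_vis.transpose(-1, -2) / d ** 0.5  # (T, N_patch)
    attn = F.softmax(scores, dim=-1)                     # each row sums to 1
    return attn @ v_vis                                  # (T, d)

# Toy shapes: 5 code tokens attending over 196 image patches, width 64.
H_cross = cross_modal_fusion(torch.randn(5, 64), torch.randn(196, 64), torch.randn(196, 64))
print(H_cross.shape)  # torch.Size([5, 64])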

Inputs and decoding prompts distinguish among three tasks:

  • UI-to-code: screenshot → code, with special <answer>…</answer> scaffolding around the emitted code.
  • UI polishing: target screenshot, initial code tokens, and the rendered output → refined code.
  • UI editing: reference screenshot, prior code, and text instruction → revised code.

The same model core is reused, altering only modality subcomponents and task-specific input templates.
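
To make the template reuse concrete, here is a hedged sketch of how the three task inputs might be assembled; the field names, tags, and wording are hypothetical placeholders, not the actual UI2Codeⁿ scaffolding:

def build_prompt(task, screenshot, code=None, render=None, instruction=None):
    """Assemble a task-specific multimodal input for the shared model core.

    The (modality, payload) layout and instruction strings are illustrative
    only; the point is that the core is fixed and only the template varies.
    """
    if task == "ui2code":
        return [("image", screenshot),
                ("text", "Generate code reproducing this UI.")]
    if task == "polish":
        return [("image", screenshot), ("text", code), ("image", render),
                ("text", "Refine the code so the render matches the target.")]
    if task == "edit":
        return [("image", screenshot), ("text", code),
                ("text", f"Apply this edit: {instruction}")]
    raise ValueError(f"unknown task: {task}")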

1.2 Training Methodology

The UI2Codeⁿ training pipeline consists of three stages:

  1. Continual Pre-training: Approximately $10^7$ real webpage pairs (screenshot + HTML) and $2 \times 10^6$ synthetic pairs are used, interleaved with general VLM objectives (captioning, VQA, OCR, video QA). Primary objectives are:

    • Masked language modeling (MLM) on code tokens:

    \mathcal{L}_{\mathrm{MLM}} = -\sum_{t \in M} \log p(x_t \mid x_{<t}, I)

    • GUI referring, predicting HTML spans from screenshots and DOM locations:

    \mathcal{L}_{\mathrm{span}} = -\sum_{i=1}^{L} \log p(h_i \mid h_{<i}, I, \mathrm{bbox})

    • Image–text contrastive learning (see the sketch after this list):

    \mathcal{L}_{\mathrm{CTR}} = -\sum_{i} \log \frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_j \exp(\mathrm{sim}(I_i, T_j)/\tau)}

  2. Supervised Fine-tuning (SFT): Uses 80,000 curated examples with scaffolding for all three tasks; optimizes cross-entropy on canonical target outputs for each task.

  3. Reinforcement Learning (RL): Policy-gradient training (GRPO) on 12,000 real + 30,000 synthetic rollouts. Reward signals come from a VLM-based verifier (GLM-4.5V) that scores image similarity, augmented by a human-aligned “comparator” and round-robin ranking:

J(\theta) = \mathbb{E}_{a_{1:T} \sim \pi_\theta}\Bigl[\sum_{t=1}^{T} r_t\Bigr], \qquad \nabla_\theta J(\theta) = \mathbb{E}\bigl[r_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\bigr]
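
Among these objectives, the contrastive term is the most self-contained. A minimal PyTorch sketch of the one-directional (image-to-text) InfoNCE loss $\mathcal{L}_{\mathrm{CTR}}$ above, with illustrative batch shapes and temperature, not the authors' exact implementation:

import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, tau=0.07):
    """Image-to-text InfoNCE over a batch of paired embeddings.

    img_emb, txt_emb: (B, d); row i of each side forms a positive pair,
    every other row in the batch serves as a negative.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / tau             # (B, B) cosine similarities / temperature
    targets = torch.arange(sim.size(0))   # positives sit on the diagonal
    return F.cross_entropy(sim, targets)  # -log softmax of each row's positive

loss = contrastive_loss(torch.randn(8, 64), torch.randn(8, 64))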

1.3 Multi-Turn Interactive Workflow

A hallmark of UI2Codeⁿ is test-time scaling, leveraging iterative visual feedback:

procedure INTERACTIVE_UI2CODE(I_image, N_rounds):
  C⁰ ← GenerateCode(I_image)            # initial draft from the target screenshot
  R⁰ ← Render(C⁰)
  for t in 1..N_rounds−1:
    Input ← (I_image, Cᵗ⁻¹, Rᵗ⁻¹)       # target, previous code, previous render
    Cᵗ ← PolishingModel(Input)
    Rᵗ ← Render(Cᵗ)
  return C^{N_rounds−1}, R^{N_rounds−1}
Each round typically improves CLIP/VLM score by 2–4%.
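
The same loop can be wired up as a small Python skeleton; the three callables stand in for the model and the renderer and are placeholders, not a real API:

def interactive_ui2code(image, n_rounds, generate_code, render, polish):
    """Multi-turn loop: draft once, then polish against rendered feedback.

    generate_code, render, and polish are injected callables standing in
    for the VLM and the browser renderer; no particular API is assumed.
    """
    code = generate_code(image)            # C^0
    shot = render(code)                    # R^0
    for _ in range(n_rounds - 1):
        code = polish(image, code, shot)   # sees target, prior code, prior render
        shot = render(code)
    return code, shot                      # C^{N-1}, R^{N-1}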

2. Evaluation and Empirical Performance

Evaluation on several public and proprietary UI-to-code and UI-polishing benchmarks demonstrates state-of-the-art open-source performance and competitiveness with leading closed-source models (Claude-4-Sonnet, GPT-5).

Model            | Design2Code Acc | Flame Acc | Web2Code Acc | UI-Polish (Synth) | UI-Polish (Real)
InternVL3-78B    | 30.0            | 51.3      | 45.5         | 15%               | 10%
Qwen2.5-VL-72B   | 41.9            | 46.3      | 64.1         | 38%               | 23%
GLM-4.1V-9B      | 64.7            | 72.5      | 71.3         | 46%               | 42%
Claude-4-Sonnet  | 81.2            | 76.3      | 85.1         | 65%               | 78%
Gemini-2.5-Pro   | 89.5            | 87.5      | 90.6         | 68%               | 74%
GPT-5            | 89.7            | 91.3      | 93.7         | 68%               | 85%
UI2Codeⁿ-9B-RL   | 88.6            | 95.0      | 92.5         | 94%               | 80%

Per column, the best overall scores go to GPT-5 (Design2Code, Web2Code, UI-Polish Real) and UI2Codeⁿ-9B-RL (Flame, UI-Polish Synth); among open-source models, UI2Codeⁿ-9B-RL is the strongest in every column.

Test-time scaling shows that UI2Codeⁿ achieves further gains as polish rounds increase (real benchmark: 66%, 68%, 70%, 73%, and 74% after rounds 1 through 5), and ablations show that RL reward tuning and real data have strong positive effects on real-world benchmarks.

3. Practical Applications and Limitations

UI2Codeⁿ is immediately applicable to automated code generation from UI screens, iterative UI refinement, and natural-language-driven editing of preexisting UIs. Representative qualitative behaviors include:

  • Correction of style-level errors (e.g., button padding, typographic details) after polishing.

  • Restoration of complex layouts initially misrepresented in draft code.

  • Language-driven edits, such as changing navigation bar color or repositioning UI elements by code injection.

Limitations include an inability to handle dynamic or interactive JavaScript widgets (such as carousels and modals), code truncation on inputs exceeding 32k tokens, and occasional pixel-level inaccuracies in sub-pixel rendering scenarios.

4. Universal Coding with the Narayana Sequence

The second sense of UI2Codeⁿ refers to a mathematically rigorous, prefix-free, universal integer code whose codeword lengths and enumeration are determined by the Narayana sequence (Kirthi et al., 2016).

4.1 Narayana Sequence and Coding Basis

The classical Narayana sequence $\{N_k\}_{k \ge 0}$ is recursively defined by $N_0 = N_1 = N_2 = 1$, $N_{k+1} = N_k + N_{k-2}$. Shifting by two gives a basis sequence $\{J_i\}_{i \ge 0}$ with $J_i = N_{i+2}$, i.e. $J_0 = 1, J_1 = 2, J_2 = 3, J_3 = 4, \ldots$, and $J_i = J_{i-1} + J_{i-3}$.

The dominant Narayana ratio $L \approx 1.46557$ satisfies $L^3 - L^2 - 1 = 0$ and controls the exponential growth rate of $J_i$.
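
A few lines of Python reproduce the shifted basis and show the ratio of successive terms converging to $L$ (the function name is ours):

def narayana_basis(m):
    """First m terms of J_i: J_0=1, J_1=2, J_2=3, J_i = J_{i-1} + J_{i-3}."""
    J = [1, 2, 3]
    while len(J) < m:
        J.append(J[-1] + J[-3])
    return J[:m]

J = narayana_basis(30)
print(J[:8])          # [1, 2, 3, 4, 6, 9, 13, 19]
print(J[-1] / J[-2])  # ~1.46557, the real root of L^3 = L^2 + 1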

4.2 Encoding and Decoding Procedures

The UI2Codeⁿ integer code is based on a Zeckendorf-type expansion in the $J$-basis in which the indices of the basis elements used are pairwise at least 3 apart (in particular, no two consecutive basis elements appear).

Encoding

For an integer $n \in \mathbb{N}^+$:

  1. Compute the maximal $d$ with $J_d \le n < J_{d+1}$.

  2. Find the binary vector $B \in \{0,1\}^{d+1}$ such that:

    • $\sum_{i=0}^{d} B_i J_i = n$
    • $B_d = 1$
    • $B_i B_j = 0$ whenever $|i - j| < 3$, i.e. any two set indices are at least 3 apart (the greedy algorithm in Section 4.4 produces exactly this expansion).
  3. The codeword is $c = B_0 B_1 \ldots B_d\,1$, that is, $B$ followed by an appended ‘1’.

Decoding

Given a binary codeword $c$ (ending in ‘11’):

  1. Let $\ell = |c|$ be its length and discard the terminating ‘1’.
  2. Set $d = \ell - 2$, and compute $\{J_0, \ldots, J_d\}$.
  3. $n = \sum_{i=0}^{d} c_i J_i$.

Both directions are well defined: the gap condition makes the expansion unique, and every codeword ends with the signature “11”, which cannot occur anywhere else inside a codeword.
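
Because “11” occurs only as a terminator, a concatenated stream of codewords is instantaneously decodable by cutting after each “11”. A small illustrative Python splitter:

def split_codewords(bits):
    """Split a concatenated bit string into codewords by cutting after each '11'.

    Valid because no '1' pair occurs inside the body B of any codeword,
    so the first '11' encountered is always the current terminator.
    """
    words, start, i = [], 0, 1
    while i < len(bits):
        if bits[i - 1] == '1' and bits[i] == '1':
            words.append(bits[start:i + 1])
            start = i + 1
            i = start + 1   # resume scanning inside the next codeword
        else:
            i += 1
    return words

print(split_codewords("11" + "011" + "10011"))  # ['11', '011', '10011']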

4.3 Theoretical Properties

  • Prefix Property: No codeword is a prefix of another, because the “…11” ending is unique and cannot occur internally (the gap condition forbids adjacent 1s inside $B$).
  • Universality: Every $n \ge 1$ can be represented (see Theorem 2 in (Kirthi et al., 2016)).
  • Length formula: $\ell(n) = d(n) + 2$ with $d(n) = \max\{i : J_i \le n\}$.
  • Asymptotics: $\ell(n) = \log_L n + O(1)$, so codeword length grows logarithmically, with base $L$.
  • Enumeration: The number of codewords of length $\ell$ is $N_{\ell-2}$ (e.g., lengths 2, 3, 4, 5, 6 admit 1, 1, 1, 2, 3 codewords).
  • Redundancy: Tends to zero; the scheme encodes every $n$ deterministically and prefix-free without prior distributional knowledge.
  • Optimality: Alternative bases (e.g., other shifts of the Narayana sequence) either omit some $n$ or lose uniqueness.

4.4 Algorithmic Realizations and Examples

Encoding pseudocode:

function ENCODE_UI2Coden(n):
  Compute J_i until J_d <= n < J_{d+1}   # d = highest index with J_d <= n
  r ← n
  for i = d downto 0:                    # greedy: take the largest J_i that fits
    if J_i <= r:
      B_i ← 1
      r ← r - J_i                        # leftover < J_{i-2}, so set indices stay >= 3 apart
    else:
      B_i ← 0
  c ← concat(B_0 ... B_d, '1')           # append '1'; since B_d = 1, c ends in "11"
  return c
Decoding is the reverse: after dropping the final bit, sum the $J_i$ at each index $i$ where $c_i = 1$.
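
For reference, a compact runnable Python version of both directions, mirroring the pseudocode above (helper names are ours; the gap condition is enforced automatically by the greedy subtraction):

def encode(n):
    """Narayana (UI2Codeⁿ) codeword for a positive integer n."""
    assert n >= 1
    J = [1, 2, 3]
    while J[-1] <= n:                    # grow the basis until J[-1] > n
        J.append(J[-1] + J[-3])
    d = len(J) - 2
    while J[d] > n:                      # largest d with J[d] <= n
        d -= 1
    B, r = [0] * (d + 1), n
    for i in range(d, -1, -1):           # greedy, highest index first
        if J[i] <= r:
            B[i], r = 1, r - J[i]        # leftover < J[i-2]: gaps of >= 3 are automatic
    return ''.join(map(str, B)) + '1'    # append '1'; B_d = 1, so c ends in "11"

def decode(c):
    """Inverse of encode: drop the final '1', sum J_i over the set bits."""
    J = [1, 2, 3]
    while len(J) < len(c) - 1:
        J.append(J[-1] + J[-3])
    return sum(J[i] for i, b in enumerate(c[:-1]) if b == '1')

for n in (1, 2, 3, 5, 100):              # round-trip check
    assert decode(encode(n)) == n
print(encode(5))                         # -> 10011, matching the example table below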

Examples:

n | J-expansion         | Vector B     | Codeword c
1 | $1 \cdot J_0$       | (1)          | 11
2 | $1 \cdot J_1$       | (0, 1)       | 011
3 | $1 \cdot J_2$       | (0, 0, 1)    | 0011
5 | $1 \cdot J_3 + 1 \cdot J_0$ | (1, 0, 0, 1) | 10011

4.5 Comparative Remarks

Because the Narayana ratio $L \approx 1.4656$ is smaller than the golden ratio $\varphi \approx 1.618$, the basis grows more slowly than the Fibonacci numbers, so UI2Codeⁿ codewords are somewhat longer than Fibonacci codewords for the same $n$ (and both exceed the $\log_2 n$ length of plain binary, which is not prefix-free). It is a “universal” code in the Elias sense.

5. Significance and Context

The VLM UI2Codeⁿ sets a new state of the art in open-source visual UI coding, matched closely only by proprietary VLMs. The test-time scalable, multi-turn interaction paradigm allows for systematic improvements leveraging visual feedback, closing the gap between automated and human-in-the-loop design workflows. The code was released publicly, enabling further research.

The UI2Codeⁿ universal integer code constitutes an independent advance in prefix coding theory, extending the family of Fibonacci- and Lucas-based codes by employing the Narayana basis. Its optimality is tied to the combinatorial properties of the sequence, and no alternative shift or variant achieves the same universality and uniqueness guarantees.

The dual appearance of UI2Codeⁿ, in AI systems and in coding theory, is at present purely nominal: beyond the shared name and a common insistence on precise formalism, there is no technical connection between the two.
