TransBench: AI Transferability Benchmarks

Updated 29 March 2026
  • TransBench is a suite of large-scale benchmarks designed to evaluate AI transferability across diverse tasks, notably for GUI agent grounding and machine translation.
  • It assesses critical challenges such as cross-version, cross-platform, and cross-application generalization using detailed protocols and metrics like grounding accuracy and BLEU scores.
  • The benchmarks leverage real-world data and multi-stage human verification to drive robust, reproducible insights for improving AI model generalization in practical settings.

TransBench is the shared name of several large-scale benchmarks designed to quantify transferability in artificial intelligence systems across domains such as graphical user interface agents, machine translation, monocular height estimation, and repository-level code translation. The suites were developed independently by different research groups, and each addresses a distinct transfer learning challenge, aiming to expose the limits of current models and drive progress toward robust, generalizable solutions for complex, real-world environments.

1. Purpose and Scope

TransBench refers to multiple distinct benchmarks with two prominent instantiations: (1) transferable GUI agent grounding (Lu et al., 23 May 2025) and (2) industrial-scale machine translation evaluation (Li et al., 20 May 2025). Each instantiation provides a comprehensive dataset and standardized evaluation protocols that probe transferability in high-impact AI application areas where academic benchmarks fail to capture real-world challenges. The guiding intent is to bridge the generalization gap by evaluating models under shifts along well-defined axes, such as platform, language, domain, application, or version, that typically degrade performance outside laboratory settings.

2. Transferability Dimensions: GUI Agent Grounding

The TransBench suite for GUI agent grounding (Lu et al., 23 May 2025) is architected around three orthogonal transferability axes critical to practical digital workflows:

  1. Cross-Version Transferability: Tests a model’s ability to generalize grounding from one app version to its successor, confronting layout, icon, and text shifts, along with additions and removals of UI elements. For instance, a survey over 20 app version pairs found: 10.0% layout-only changes, 2.0% text/icon-only, 13.6% both, 30.8% additions, 14.0% deletions, and 29.6% invariant layouts.
  2. Cross-Platform Transferability: Evaluates grounding transfer from one platform (e.g., Android) to another (iOS, Web), factoring different interaction paradigms and DOM hierarchies. Models fine-tuned only on Android exhibit more brittle generalization to the Web, highlighting platform-specific representational gaps.
  3. Cross-Application Transferability: Assesses comprehension transfer between apps within and across functional categories (e.g., shopping to finance). Although semantic similarity offers slight gains, cross-platform and cross-version variances dominate model errors, indicating that superficial functional resemblance is insufficient for robustness.
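
The three axes can be read as three ways of partitioning the same pool of annotated screenshots into a source (training) side and a target (test) side. The sketch below is only an illustration of that idea and is not the benchmark's own tooling: the `Sample` record, the platform tags, the app names, and both split helpers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one annotated screenshot."""
    app: str
    category: str
    platform: str  # assumed tags: "android_old", "android_new", "ios", "web"


def split_by_platform(samples, source, target):
    """Train on one platform tag and test on another.

    Covers both cross-version (android_old -> android_new) and
    cross-platform (e.g. android_old -> web) settings.
    """
    train = [s for s in samples if s.platform == source]
    test = [s for s in samples if s.platform == target]
    return train, test


def split_by_app(samples, held_out_apps):
    """Cross-application setting: hold out whole apps for testing."""
    train = [s for s in samples if s.app not in held_out_apps]
    test = [s for s in samples if s.app in held_out_apps]
    return train, test


# Toy data; the app names are invented for illustration.
samples = [
    Sample("shop_a", "Shopping", "android_old"),
    Sample("shop_a", "Shopping", "android_new"),
    Sample("shop_a", "Shopping", "web"),
    Sample("bank_b", "Finance", "android_old"),
    Sample("bank_b", "Finance", "ios"),
]

cv_train, cv_test = split_by_platform(samples, "android_old", "android_new")
cp_train, cp_test = split_by_platform(samples, "android_old", "web")
ca_train, ca_test = split_by_app(samples, {"bank_b"})
```

Treating the old and new Android versions as two platform tags lets a single helper express both the cross-version and the cross-platform setting; a real pipeline might instead keep version as a separate field.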

3. Dataset Construction and Benchmark Protocols

TransBench for GUI agents covers 81 widely used Chinese apps spanning 15 categories (e.g., Shopping, Video Streaming, Finance), sampled across an older and a newer Android version, iOS, and Web for a total of 1,459 screenshots. Not every app appears on every platform, as the per-platform app counts show:

| Platform | Apps | Screenshots | Bounding Boxes | Instructions |
| --- | --- | --- | --- | --- |
| Android (old) | 77 | 393 | 17,455 | 5,696 |
| Android (new) | 80 | 432 | 19,384 | 6,305 |
| iOS | 81 | 429 | 14,477 | 6,046 |
| Web | 47 | 205 | 14,341 | 4,191 |

Semantic element annotation leverages automated detection (OmniParser) followed by human verification; natural-language instructions (22,000+) are generated and cross-validated, achieving verified correspondence at 95.5% accuracy.
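
As a quick consistency check, the per-platform figures in the table can be summed to recover the headline totals quoted in the text. The dictionary layout and variable names below are illustrative choices; the numbers are copied from the table.

```python
# Per-platform statistics copied from the table above.
stats = {
    "Android (old)": {"apps": 77, "screenshots": 393, "boxes": 17455, "instructions": 5696},
    "Android (new)": {"apps": 80, "screenshots": 432, "boxes": 19384, "instructions": 6305},
    "iOS":           {"apps": 81, "screenshots": 429, "boxes": 14477, "instructions": 6046},
    "Web":           {"apps": 47, "screenshots": 205, "boxes": 14341, "instructions": 4191},
}

total_screenshots = sum(row["screenshots"] for row in stats.values())
total_boxes = sum(row["boxes"] for row in stats.values())
total_instructions = sum(row["instructions"] for row in stats.values())

print(total_screenshots)   # 1459, matching the "1,459 screenshots" in the text
print(total_boxes)         # 65657
print(total_instructions)  # 22238, consistent with "22,000+" instructions

# Average number of annotated boxes per screenshot, by platform.
boxes_per_screenshot = {
    name: row["boxes"] / row["screenshots"] for name, row in stats.items()
}
```

The per-screenshot density gives a rough sense of how crowded each platform's screens are, with Web pages carrying the most annotated elements per screenshot in this table.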

TransBench for industrial MT targets performance in global e-commerce and finance, encompassing:

  • 17,000 e-commerce sentences, professionally translated across 33 language pairs and 4 main user scenarios.
  • 12,000 finance sentences in 60 language directions.

Scenario coverage encapsulates product titles, marketing material, customer service dialogs, and consumer reviews. Data provenance is established via real-world logs, with rigorous cleaning, desensitization, and multi-stage human annotation (double-blind verification, domain expert sign-off).

4. Task Definitions and Evaluation Metrics

4.1 GUI Agent Grounding

Task

Given a screenshot and a natural-language user instruction, the model outputs either a point $(x, y)$ or a bounding box corresponding to the target GUI element.

Metrics

  • Grounding Accuracy: $\text{Acc} = (1/N)\sum_k \mathbb{1}[(x_k, y_k)\in b_k]$, where a prediction is correct if it falls inside the target bounding box $b_k$.
  • Average Distance: $D = \frac{1}{N}\sum_{i=1}^N \|\,(x'_i, y'_i)-(\hat{x}'_i, \hat{y}'_i)\|_2$, with all coordinates normalized to $[0, 100]$ and $(\hat{x}'_i, \hat{y}'_i)$ denoting the center of the ground-truth box.
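
Both metrics are simple to compute once predictions and ground-truth boxes are available. The following is a minimal sketch rather than the benchmark's reference implementation; it assumes boxes are given as `(x1, y1, x2, y2)` in pixels, that box edges count as inside, and that normalization to $[0, 100]$ divides each axis by the screenshot's width or height. None of these conventions is stated in the source.

```python
import math


def grounding_accuracy(points, boxes):
    """Fraction of predicted points that fall inside their target box.

    points: list of (x, y); boxes: list of (x1, y1, x2, y2).
    Box edges are treated as inside (an assumption).
    """
    hits = sum(
        1
        for (x, y), (x1, y1, x2, y2) in zip(points, boxes)
        if x1 <= x <= x2 and y1 <= y <= y2
    )
    return hits / len(points)


def average_distance(points, boxes, width, height):
    """Mean Euclidean distance from each prediction to the box center,
    with both axes rescaled to [0, 100] (assumed per-axis normalization).
    """
    total = 0.0
    for (x, y), (x1, y1, x2, y2) in zip(points, boxes):
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        dx = (x - cx) / width * 100
        dy = (y - cy) / height * 100
        total += math.hypot(dx, dy)
    return total / len(points)


# Toy example on a 200x100 screenshot: the first point hits, the second misses.
points = [(50, 50), (150, 20)]
boxes = [(40, 40, 60, 60), (0, 0, 100, 100)]

acc = grounding_accuracy(points, boxes)            # 0.5
dist = average_distance(points, boxes, 200, 100)
```

Accuracy rewards only exact hits, while average distance also distinguishes a near miss from a prediction on the wrong side of the screen, which is why the two are reported together.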

4.2 Machine Translation

Task

Translate source text in e-commerce or finance across diverse language pairs, retaining domain jargon, stylistic conventions, and cultural appropriateness.

Metrics

| Metric | Purpose | Formula/Definition |
| --- | --- | --- |
| BLEU | General fluency/adequacy | $BLEU = BP \cdot \exp\left(\sum_{n=1}^N w_n \log p_n\right)$ |
| TER | Edit-based similarity | $TER = \#\text{edits}/\lvert\text{reference}\rvert$ |
| chrF | Character-based n-gram F-score | $chrF = (1+\beta^2)\cdot(chrP\cdot chrR)/(\beta^2\cdot chrP + chrR)$ |
| Marco-MOS | Domain MOS regression (0–5 score) | $L(\theta) = (1/M)\sum_{i=1}^M (y_i - f_\theta(S_i, H_i))^2$ |
| Taboo Accuracy | Cultural safety: forbidden terms | $ACC_{taboo} = (1/\lvert T\rvert)\sum_{s\in T} \mathbb{1}\{\text{no taboo in } H_s\}$ |
| Honorific Acc. | Cultural formality (JP/KR pairs) | $ACC_{hon} = (1/\lvert H\rvert)\sum_{s\in H} \mathbb{1}\{\text{all required honorifics}\}$ |
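
Several rows of the table reduce to a few lines of arithmetic. The sketch below illustrates the chrF combination step, the TER ratio, taboo accuracy, and the squared-error loss used to fit Marco-MOS. It is not the benchmark's scoring code: the function names, the default $\beta = 2$, and the case-insensitive substring test for taboo terms are assumptions made for illustration, and the upstream steps (n-gram counting, edit-distance computation) are left out.

```python
def chrf_from_pr(chr_p, chr_r, beta=2.0):
    """Combine character n-gram precision and recall into an F-beta score.

    beta=2.0 is an assumed default; the source does not state the value used.
    """
    if chr_p == 0 and chr_r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * chr_p * chr_r / (b2 * chr_p + chr_r)


def ter(num_edits, reference_length):
    """Translation Edit Rate: edits divided by reference length."""
    return num_edits / reference_length


def taboo_accuracy(hypotheses, taboo_terms):
    """Share of hypotheses that contain none of the taboo terms.

    Uses case-insensitive substring matching against lowercase terms
    (an illustrative matching rule, not the benchmark's).
    """
    clean = sum(
        1 for h in hypotheses if not any(t in h.lower() for t in taboo_terms)
    )
    return clean / len(hypotheses)


def mse_loss(targets, predictions):
    """Mean squared error between human scores y_i and model scores f(S_i, H_i)."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)


f = chrf_from_pr(0.5, 0.5)                                   # 0.5
t = ter(3, 12)                                               # 0.25
a = taboo_accuracy(["a nice product", "has badword here"], {"badword"})  # 0.5
l = mse_loss([4.0, 3.0], [3.5, 3.5])                         # 0.25
```

Honorific accuracy has the same shape as taboo accuracy, with the per-sentence test replaced by a check that every required honorific is present.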

5. Empirical Findings

  • SOTA models (e.g., Qwen2.5VL) can achieve up to 89.6% grounding accuracy on Android; performance drops to 79.8% on Web.
  • Fine-tuning on old Android versions yields a +7.0% gain on new Android versions and confers cross-platform benefits, particularly improving iOS test accuracy (+4.96%).
  • Cross-application fine-tuning raises accuracy by 8–10%, but platform and version mismatches remain the primary impediment to transfer.
  • Error taxonomy: (1) selecting the wrong GUI element, (2) near misses that fall just outside the ground-truth box, (3) targets that are too large or poorly localized.
  • General-purpose MT models, absent domain adaptation, are insufficient for industrial translation, especially concerning style, terminology, and cultural adaptation.
  • Marco-MOS demonstrates higher alignment with human quality estimates than general LLM-based QE (ρ=0.65 versus 0.48 for GPT-4).
  • The benchmark’s multi-level protocol reveals quality “blind spots” that generic BLEU-focused evaluation does not capture, especially regarding honorific use and taboo avoidance.

6. Practical and Methodological Recommendations

  • For GUI agents, bootstrapping transfer with historical logs, multi-platform data fusion, and diverse app samples is critical; platform and version diversity in training data enhances generalization to less-represented settings (e.g., Web).
  • For MT, the multi-level capability framework (basic linguistic competence, domain-specific proficiency, cultural adaptation) should inform training, error analysis, and reporting protocols.
  • Open-source tools, scenario-wise dashboards, and layered quality estimation (e.g., Marco-MOS) are central for reproducible, actionable evaluation in both domains.

7. Limitations and Prospective Directions

  • Computational Overhead: Full training/fine-tuning for large VLM benchmarks (e.g., GUI agents) can require ~200 A800 GPU-hours.
  • Domain/Language Coverage: Many benchmarks (e.g., GUI agent grounding) currently focus on Chinese-language apps and require extension for broader applicability.
  • Scenario Granularity: While current benchmarks emphasize version, platform, and application dynamics (or, in MT, domain/cultural adaptation), further extension to sequential multi-step workflows, full multilinguality, and accessibility-oriented evaluation remains essential.
  • Continual and Lightweight Adaptation: Future research should prioritize continual learning, less compute-intensive strategies, and modular, plugin-style adaptation for deployment in rapidly-evolving software or industrial environments.

TransBench benchmarks, across their various incarnations, provide rigorous, multi-dimensional frameworks to evaluate and drive progress on transferability, a central requirement for robust, real-world AI deployment in dynamic digital, industrial, and cross-domain settings (Lu et al., 23 May 2025, Li et al., 20 May 2025).
