
GUI-Libra: Post-Training for Native GUI Agents

Updated 4 July 2026
  • GUI-Libra is a post-training framework for native GUI agents that maps user instructions, screen observations, and interaction history directly to executable actions.
  • It integrates a curated 81K-step GUI reasoning dataset with action-aware supervised fine-tuning and conservative RL to enhance long-horizon navigation.
  • The framework balances explicit reasoning and precise grounding, yielding significant improvements in both offline and online navigation benchmarks.

GUI-Libra is a post-training framework for native GUI agents: single end-to-end vision-language-action models that map a user instruction, GUI observation, and interaction history directly to executable actions in web and mobile environments. It is designed for long-horizon navigation, where open-source native GUI agents have lagged behind stronger closed-source or multi-module systems, and it combines a curated 81K-step / 9K-trajectory GUI reasoning dataset, action-aware supervised fine-tuning, and conservative RL for partially verifiable step-wise rewards (Yang et al., 25 Feb 2026). The name should not be conflated with the earlier Libra Toolkit for Probabilistic Models, which is a command-line toolkit for discrete probabilistic modeling rather than a graphical user interface system (Lowd et al., 2015).

1. Scope, problem setting, and nomenclature

In the terminology of the 2026 paper, native GUI agents are single end-to-end models that directly produce executable GUI actions from instruction and observation, instead of relying on an external planner or a separate grounding module. The policy is formulated as

$$\pi_\theta(a_t \mid \ell, h_t, o_t),$$

where $\ell$ is the task instruction, $h_t$ the interaction history, and $o_t$ the current observation. The target use case is high-level multi-step GUI navigation across Android and web interfaces, rather than isolated grounding or one-step clicking (Yang et al., 25 Feb 2026).
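The policy formulation can be read as an interaction loop. The following is a minimal sketch, assuming a hypothetical `policy` callable and a hypothetical `env_step` function that executes an action and returns the next observation; neither name comes from the paper, and screenshots are represented by plain strings for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    observation: str  # stand-in for a screenshot (e.g. a file path); an assumption
    action: dict      # executable action emitted by the policy


@dataclass
class Episode:
    instruction: str                                   # task instruction (l)
    history: List[Step] = field(default_factory=list)  # interaction history (h_t)


# A native agent is one callable: (instruction, history, observation) -> action.
Policy = Callable[[str, List[Step], str], dict]


def run_episode(policy: Policy,
                env_step: Callable[[dict], str],
                instruction: str,
                first_observation: str,
                max_steps: int = 50) -> Episode:
    """Roll out the policy until it emits Terminate or max_steps is reached."""
    episode = Episode(instruction=instruction)
    observation = first_observation
    for _ in range(max_steps):
        action = policy(episode.instruction, episode.history, observation)
        episode.history.append(Step(observation, action))
        if action.get("action_type") == "Terminate":
            break
        observation = env_step(action)  # hypothetical environment call
    return episode
```

There is no planner or separate grounding module in the loop: the single `policy` call is responsible for both deciding what to do and where to do it.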

The work is motivated by two failures in generic post-training pipelines when transferred to GUI agents. First, standard SFT with CoT reasoning often hurts grounding. Second, step-wise RLVR-style training faces partial verifiability, because multiple actions can be correct at a given state while offline supervision usually verifies only a single demonstrated action. The paper treats these as structural properties of GUI training rather than incidental implementation defects. A plausible implication is that GUI-Libra is less an architectural proposal than a training recipe specialized to the joint requirements of reasoning, grounding, and sequential control.

A recurrent source of confusion is the word “Libra.” The 2015 Libra Toolkit is a collection of command-line programs and shared libraries for learning and inference with discrete probabilistic models, including Bayesian networks, Markov networks, dependency networks, sum-product networks, arithmetic circuits, and mixtures of trees. It is explicitly not presented as a GUI, visualization dashboard, or interactive graphical front-end; it is described as a toolkit of command-line executables plus shared OCaml libraries (Lowd et al., 2015). In that sense, “GUI-Libra” is not a GUI wrapper around the 2015 Libra software, but a separate 2026 line of work on GUI agents.

2. Overall training recipe and system formulation

GUI-Libra is organized as a three-part recipe: a curated reasoning dataset, Action-Aware SFT (ASFT), and conservative RL. The dataset is intended to compensate for the shortage of high-quality, action-aligned reasoning traces. ASFT is intended to preserve explicit reasoning without allowing long chain-of-thought to dominate grounding learning. Conservative RL is intended to stabilize policy optimization when step rewards are only partially verifiable (Yang et al., 25 Feb 2026).

The model input contains a system prompt listing available actions, a user instruction, interaction history, and the current screenshot. The model output uses a structured format: a tagged reasoning segment, followed by the action JSON wrapped in <answer> ... </answer>. The unified action schema contains action_type, action_description, action_target, value, and point_2d. The action space comprises 13 action types: Click, Write, Terminate, Swipe, Scroll, NavigateHome, Answer, Wait, OpenAPP, NavigateBack, KeyboardPress, LongPress, and Select (Yang et al., 25 Feb 2026).
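To make the output format concrete, here is a minimal sketch of a parser for the answer block. The field names and the 13 action types are taken from the description above; the example values, the use of null for unused fields, and the list form of point_2d are assumptions made for illustration.

```python
import json
import re

ACTION_TYPES = frozenset({
    "Click", "Write", "Terminate", "Swipe", "Scroll", "NavigateHome", "Answer",
    "Wait", "OpenAPP", "NavigateBack", "KeyboardPress", "LongPress", "Select",
})
REQUIRED_FIELDS = ("action_type", "action_description", "action_target",
                   "value", "point_2d")
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_action(model_output: str) -> dict:
    """Extract and validate the action JSON from a model response."""
    match = _ANSWER_RE.search(model_output)
    if match is None:
        raise ValueError("no <answer> block found")
    action = json.loads(match.group(1))
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in action]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if action["action_type"] not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action['action_type']}")
    return action


# Hypothetical response; the reasoning text and all values are illustrative.
example = (
    "The Settings icon is visible in the app drawer, so I should tap it. "
    '<answer>{"action_type": "Click", "action_description": "Tap Settings", '
    '"action_target": "Settings icon", "value": null, "point_2d": [540, 1210]}'
    "</answer>"
)
action = parse_action(example)
```

Whatever precedes the answer block is ignored by this parser, so it works regardless of how the reasoning segment is delimited.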

The formulation distinguishes reasoning tokens, action tokens, and grounding tokens. This partition is central because the paper’s diagnosis is that autoregressive training on long CoT responses causes optimization to overemphasize semantic reasoning relative to executable action specification. The resulting degradation is not merely theoretical: the paper reports that longer responses correlate with lower grounding accuracy on ScreenSpot-v2. This suggests that, in GUI settings, token-level supervision must be explicitly structured around execution-critical output rather than treated as a homogeneous language modeling target.

3. GUI-Libra-81K: dataset construction and filtering

The released dataset, GUI-Libra-81K, is constructed from public GUI trajectory corpora: GUI-Odyssey, AMEX, AndroidControl, AitZ, AitW, GUIAct, and MM-Mind2Web. Relative to AGUVIS’s collection, the dataset additionally includes the Chinese subset from GUIAct. Initial cleaning removes incomplete trajectories, trajectories shorter than 3 steps, trajectories longer than 50 steps, and steps with compound actions outside the action space. After this stage, the source pool contains 19K trajectories and 170K steps (Yang et al., 25 Feb 2026).
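The initial cleaning rules can be sketched as a simple filter. This is an illustration only: the trajectory layout (a dict with a `complete` flag and a list of `steps`) is hypothetical, and applying the length check before dropping out-of-space steps is an assumption, since the order is not stated.

```python
ACTION_TYPES = frozenset({
    "Click", "Write", "Terminate", "Swipe", "Scroll", "NavigateHome", "Answer",
    "Wait", "OpenAPP", "NavigateBack", "KeyboardPress", "LongPress", "Select",
})


def clean_source_pool(trajectories, min_steps=3, max_steps=50):
    """Return (kept_trajectories, kept_steps) after the initial cleaning rules."""
    kept_trajectories, kept_steps = [], []
    for trajectory in trajectories:
        steps = trajectory["steps"]
        # Rule 1: drop incomplete trajectories.
        if not trajectory.get("complete", False):
            continue
        # Rules 2 and 3: drop trajectories shorter than 3 or longer than 50 steps.
        if not (min_steps <= len(steps) <= max_steps):
            continue
        # Rule 4: drop individual steps whose action is outside the action space.
        usable = [s for s in steps if s["action_type"] in ACTION_TYPES]
        kept_trajectories.append(trajectory)
        kept_steps.extend(usable)
    return kept_trajectories, kept_steps
```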

Most source datasets do not contain rich reasoning traces, so GUI-Libra synthesizes them with prompting. The paper compares GPT-4o, o4-mini, and GPT-4.1, and uses GPT-4.1 because it produces richer visible rationales. The prompt includes observation description, reflection on instruction and history, planning, action-related requirements, and strict output formatting. Importantly, the generator is not forced to repeat the original dataset action exactly; the original annotation is treated as a reference, and a different action may be produced if justified. Coordinates, however, are reused from the original dataset (Yang et al., 25 Feb 2026).

Filtering is aggressive and two-stage. First, Qwen3-VL-8B-Instruct is run for 10 stochastic runs on each input, and a step is discarded if action re-prediction accuracy is below 0.3. Second, coordinate alignment is verified by prompting Qwen3-VL-32B-Instruct to predict a bounding box from screenshot and target description; a sample is kept only if the original point lies inside the predicted box. After filtering, the final SFT set contains 81K steps and 9K trajectories. The RL subset is further balanced by downsampling early steps and mobile-heavy data, producing 40K steps (Yang et al., 25 Feb 2026).
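The two filtering stages can be sketched as follows. The model calls are replaced by hypothetical callables (`predict_action_type`, `predict_box`), re-prediction is scored here by action-type match, and steps without coordinates skip the second stage; all three choices are assumptions for illustration rather than details confirmed by the paper.

```python
def reprediction_accuracy(predict_action_type, sample, runs=10):
    """Fraction of stochastic runs that reproduce the annotated action type."""
    hits = sum(predict_action_type(sample) == sample["action_type"]
               for _ in range(runs))
    return hits / runs


def point_in_box(point, box):
    """True if (x, y) lies inside the box (x0, y0, x1, y1), edges included."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def keep_step(sample, predict_action_type, predict_box,
              min_accuracy=0.3, runs=10):
    """Apply both filters; a step is kept only if it passes each one."""
    # Stage 1: discard steps whose re-prediction accuracy is below the threshold.
    if reprediction_accuracy(predict_action_type, sample, runs) < min_accuracy:
        return False
    # Stage 2: the original point must fall inside the predicted bounding box.
    if sample.get("point_2d") is None:
        return True  # assumption: actions without coordinates skip this stage
    return point_in_box(sample["point_2d"], predict_box(sample))
```

In this sketch a step with exactly 3 matching runs out of 10 is kept, because the rule discards only accuracies strictly below 0.3.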

The dataset is predominantly mobile: only 14.3% of the SFT data is web-domain data. The average reasoning length is 210 thought tokens per step, compared with 11 for AndroidControl, 37 for GUI-Net-1M, 56 for AGUVIS Stage 2 L2, and 85 for AGUVIS Stage 2 L3. The action distribution is highly skewed: Click accounts for about 60% of steps, followed by Write, Terminate, and Swipe, while LongPress and Select are rare. This imbalance is one of the paper’s explicit motivations for RL.

| Component | Value | Role |
| --- | --- | --- |
| Source pool after cleaning | 19K trajectories / 170K steps | Pre-filter corpus |
| Final SFT set | 81K steps / 9K trajectories | GUI-Libra-81K |
| Final RL subset | 40K steps | Balanced RL data |

4. Action-aware supervised fine-tuning

ASFT modifies standard SFT in two ways. It mixes reasoning-then-action samples with direct-action samples produced by removing the CoT segment, and it reweights token groups to emphasize action specification and especially grounding coordinates. The loss is

$$\mathcal{L}_{\text{ASFT}}(\theta) = - \mathbb{E}_{(x_t, c_t, a_t, g_t) \sim D_{\rm mix}} \frac{ \log \pi_\theta(c_t \mid x_t) + \alpha_a \log \pi_\theta(a_t \mid x_t, c_t) + \alpha_g \log \pi_\theta(g_t \mid x_t, c_t, a_t) }{ |c_t| + \alpha_a |a_t| + \alpha_g |g_t| },$$

where $c_t$ denotes reasoning tokens, $a_t$ action tokens excluding point_2d, and $g_t$ the grounding tokens associated with point_2d (Yang et al., 25 Feb 2026).

Default weights are $\alpha_a = 2$ and $\alpha_g = 4$, except for GUI-Libra-4B, which uses a different setting. In effect, coordinates receive the strongest supervision. The paper also notes useful special cases: unit weights ($\alpha_a = \alpha_g = 1$) recover standard SFT; making both weights very large approximates CoT-free SFT; and letting the grounding weight dominate the other two terms approximates grounding-only SFT. This framing makes ASFT a controlled interpolation among standard instruction tuning, direct-action tuning, and grounding-focused tuning.
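The per-sample term of the ASFT loss can be sketched in a few lines. This is a minimal illustration operating on lists of per-token log-probabilities; the function name and the list-based interface are assumptions, and a real implementation would compute the same quantity on batched tensors.

```python
def asft_loss(logp_reasoning, logp_action, logp_grounding,
              alpha_a=2.0, alpha_g=4.0):
    """Weighted negative log-likelihood for one sample.

    Each argument is a list of per-token log-probabilities for one token
    group. Direct-action samples simply pass an empty reasoning list.
    """
    numerator = (sum(logp_reasoning)
                 + alpha_a * sum(logp_action)
                 + alpha_g * sum(logp_grounding))
    denominator = (len(logp_reasoning)
                   + alpha_a * len(logp_action)
                   + alpha_g * len(logp_grounding))
    return -numerator / denominator


# Toy sample: 4 reasoning tokens, 2 action tokens, 1 grounding token.
reasoning, action, grounding = [-1.0] * 4, [-2.0] * 2, [-3.0]
weighted = asft_loss(reasoning, action, grounding)           # default weights
uniform = asft_loss(reasoning, action, grounding, 1.0, 1.0)  # unit weights
```

With unit weights the expression reduces to the mean negative log-likelihood over all tokens, which is ordinary token-averaged SFT. With the default weights, the single grounding token in the toy sample contributes as much to the denominator as the four reasoning tokens combined.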

Ablations show that mixed direct-action data already yields large gains, particularly on online AndroidWorld, and that action-aware weighting adds further improvements. With Qwen2.5-VL-3B, the progression on MM-Mind2Web-v2 Pass@1 / AndroidControl-v2 High Pass@1 / AndroidWorld is 23.4 / 36.4 / 3.5 for the base model, 28.5 / 45.7 / 5.2 for SFT, 30.2 / 45.5 / 11.3 for SFT plus mixed data, and 32.0 / 44.5 / 13.0 for ASFT (Yang et al., 25 Feb 2026).

The grounding analysis is especially significant. Under reasoning mode on ScreenSpot-style evaluation, SFT-3B scores 73.4, SFT 3B + mixed data scores 73.8, ASFT 3B scores 76.2, and GUI-Libra-3B reaches 83.4. At 7B, the corresponding reasoning-mode scores rise from 79.0 for SFT-7B to 81.4 for mixed-data SFT, 83.4 for ASFT, and 89.3 for GUI-Libra-7B. The paper interprets this as evidence that RL nearly removes the remaining reasoning-grounding gap after ASFT.

5. Conservative RL under partial verifiability

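The paper defines GUI RL rewards as partially verifiable. For a given state, the dataset provides one demonstrated action, and the verifier reward is 1 when the predicted action matches that demonstration and 0 otherwise. This is partially verifiable because a reward of 1 certifies a valid action, whereas a reward of 0 does not certify an invalid one.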

A zero reward may therefore correspond either to a truly incorrect action or to a valid alternative that the offline annotation did not record (Yang et al., 25 Feb 2026).

RL uses GRPO with KL regularization: a clipped policy-ratio surrogate with a KL penalty toward the reference policy, computed with the probability ratio between the current and rollout policies and a group-normalized advantage. Reward shaping combines a small format term and a dominant action-accuracy term. The accuracy term is built from three checks: one verifies the action type, one passes if word-level F1 for value exceeds 0.5, and one verifies that the predicted point lies inside the demonstrated bounding box (Yang et al., 25 Feb 2026).
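The reward checks and the group-normalized advantage can be sketched as below. Several details are assumptions made for illustration and are not confirmed by the text: the three accuracy checks are combined by requiring all of them to pass, the format weight `w_format=0.1` is a placeholder, steps without a demonstrated box skip the point check, and the advantage uses the common mean/standard-deviation normalization.

```python
from collections import Counter
import statistics


def word_f1(predicted: str, reference: str) -> float:
    """Word-level F1 between two strings (case-insensitive, whitespace split)."""
    pred, ref = predicted.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 1.0 if pred == ref else 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def accuracy_reward(pred: dict, demo: dict) -> float:
    """1.0 only if type, value, and point checks all pass (an assumption)."""
    type_ok = pred["action_type"] == demo["action_type"]
    value_ok = word_f1(pred.get("value") or "", demo.get("value") or "") > 0.5
    box = demo.get("bbox")  # hypothetical field: (x0, y0, x1, y1)
    if box is None:
        point_ok = True
    elif pred.get("point_2d") is None:
        point_ok = False
    else:
        (x, y), (x0, y0, x1, y1) = pred["point_2d"], box
        point_ok = x0 <= x <= x1 and y0 <= y <= y1
    return float(type_ok and value_ok and point_ok)


def step_reward(format_ok: bool, pred: dict, demo: dict, w_format=0.1) -> float:
    """Small format term plus dominant accuracy term; the weight is a placeholder."""
    return w_format * float(format_ok) + (1.0 - w_format) * accuracy_reward(pred, demo)


def group_normalized_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the reward compares against a single demonstration, a valid alternative action scores the same as a wrong one here, which is the partial-verifiability problem the conservative RL design is meant to address.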

A central theoretical claim is that offline step metrics predict online success only when occupancy mismatch and off-demonstration valid-action mass are controlled. The paper formalizes these quantities and states its main result as a lower bound expressed in terms of them.

This suggests that improving offline step matching is insufficient unless the policy remains close enough to the reference distribution and does not move excessive probability mass onto uncredited but valid alternatives.

The RL recipe therefore emphasizes a KL trust region. Empirically, a small but nonzero KL coefficient works best: for 3B on AndroidWorld, no KL gives 21.7, KL = 0.001 gives 25.2, KL = 0.01 gives 21.7, and KL = 0.05 gives 20.0. Checkpoint correlations between offline and online metrics strengthen markedly when KL is present: overall Pearson correlation is 0.76; with KL regularization it rises to 0.89 with Spearman 0.83; without KL it drops to Pearson 0.63 and Spearman 0.53 (Yang et al., 25 Feb 2026). The paper treats this as evidence that KL is not merely a generic stabilizer but a mechanism for preserving offline-to-online predictability under partial verifiability.
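For readers who want to reproduce this kind of checkpoint analysis, the two correlation statistics can be computed as in the sketch below. It takes paired lists of offline and online scores, one pair per checkpoint; the function names are illustrative and no data from the paper is reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Pearson measures linear agreement between offline and online scores, while Spearman only asks whether checkpoints are ordered the same way, which is the property that matters when offline metrics are used for checkpoint selection.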

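GUI-Libra also introduces success-adaptive negative gradient scaling. The success rate within each rollout group determines a scaling factor, and negative advantages are multiplied by that factor while positive advantages are left unchanged.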

This downweights ambiguous negative updates when group success is low. On GUI-Libra-4B, the method raises AndroidWorld from 39.1 to 42.6 and WebArena-Lite-v2 from 22.2 to 24.4 (Yang et al., 25 Feb 2026).
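The idea can be illustrated with a small sketch. The scaling rule used here, a linear ramp from a floor value up to 1 as group success rises, is a hypothetical choice made only to show the mechanism; it is not the paper's formula, and `floor=0.2` is a placeholder.

```python
def scale_negative_advantages(advantages, successes, floor=0.2):
    """Shrink negative advantages when few rollouts in the group succeed.

    advantages: per-rollout advantages for one group.
    successes:  per-rollout 0/1 flags (1 = matched the demonstrated action).
    floor:      hypothetical minimum scale applied at zero group success.
    """
    success_rate = sum(successes) / len(successes)
    # Hypothetical rule: linear ramp from `floor` (no success) to 1 (all succeed).
    scale = floor + (1.0 - floor) * success_rate
    return [a * scale if a < 0 else a for a in advantages]


# One success out of four rollouts: negatives are scaled by 0.2 + 0.8 * 0.25.
scaled = scale_negative_advantages([1.5, -0.5, -0.5, -0.5], [1, 0, 0, 0])
```

The intuition is that when most rollouts in a group fail to match the demonstration, some of those failures are likely valid alternatives, so penalizing them at full strength is risky.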

6. Models, evaluation, and significance

GUI-Libra is trained from Qwen2.5-VL-3B-Instruct, Qwen2.5-VL-7B-Instruct, Qwen3-VL-4B-Instruct, and Qwen3-VL-8B-Instruct, yielding released models from 3B to 8B. The observation modality is primarily the screenshot plus textual context. Qwen2.5-VL-based models use absolute pixel coordinates, whereas Qwen3-VL-based models use normalized coordinates. SFT uses full-parameter fine-tuning with effective batch size 256. RL uses 300 training iterations, rollout batch size 256, global batch size 128, group size 8, and KL coefficient 0.005 for 7B and 0.001 for others by default (Yang et al., 25 Feb 2026).
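Because the two model families use different coordinate conventions, training data and predictions need to be converted between them. The sketch below shows one way to do this; the normalization range is left as an explicit `scale` parameter because its value is not given here, and the value 1000 in the example is purely illustrative.

```python
def to_normalized(point, width, height, scale):
    """Convert absolute pixel coordinates to a [0, scale] normalized range."""
    x, y = point
    return (x / width * scale, y / height * scale)


def to_pixels(point, width, height, scale):
    """Convert normalized coordinates back to integer pixel coordinates."""
    x, y = point
    return (round(x / scale * width), round(y / scale * height))


# Illustrative values only: a 1080x2400 screenshot and an assumed scale of 1000.
normalized = to_normalized((540, 1200), width=1080, height=2400, scale=1000)
restored = to_pixels(normalized, width=1080, height=2400, scale=1000)
```

Normalized coordinates make a prediction independent of screenshot resolution, at the cost of a rounding step when the action is finally executed.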

Evaluation spans grounding, offline navigation, and online navigation. Grounding uses ScreenSpot-v2 and ScreenSpot-Pro. Offline navigation uses AndroidControl-v2 and MM-Mind2Web-v2. Online navigation uses AndroidWorld, WebArena-Lite-v2, and Online-Mind2Web. The offline datasets were refined because MM-Mind2Web contains symbolic or non-natural action histories and AndroidControl contains substantial label errors. The online suites cover 115 usable AndroidWorld tasks, 154 WebArena-Lite-v2 tasks, and 300 Online-Mind2Web tasks over 136 websites (Yang et al., 25 Feb 2026).

The main quantitative results show consistent gains in both step-wise accuracy and end-to-end success. On AndroidControl-v2 High Pass@1, GUI-Libra improves from 36.4 to 57.3 at 3B, from 46.5 to 59.3 at 7B, from 49.3 to 62.3 at 4B, and from 54.8 to 64.3 at 8B. On MM-Mind2Web-v2 average Pass@1, the improvements are 23.4 → 42.7 at 3B, 32.5 → 46.5 at 7B, 41.2 → 50.0 at 4B, and 43.8 → 50.5 at 8B (Yang et al., 25 Feb 2026).

Online gains are larger still. On AndroidWorld, the progression is 3.5 → 25.2 at 3B, 7.8 → 29.6 at 7B, 27.0 → 42.6 at 4B, and 30.4 → 42.6 at 8B. On WebArena-Lite-v2 average, the scores become 0.8 → 16.7 at 3B, 4.9 → 22.6 at 7B, 11.9 → 24.4 at 4B, and 15.3 → 26.6 at 8B. On Online-Mind2Web average overall, the changes are 4.8 → 21.3, 15.8 → 25.5, 21.7 → 25.7, and 19.3 → 28.0 respectively (Yang et al., 25 Feb 2026).

These results support three broader claims. First, long-horizon GUI performance can be improved substantially without costly online data collection, provided data curation and post-training are specialized to GUI interaction. Second, explicit reasoning is useful, but only when balanced against executable action supervision; removing CoT from training hurts sharply, especially online. Third, RL for GUI agents is not well described by fully verifiable RLVR assumptions. GUI-Libra’s central conceptual contribution is therefore the reframing of GUI post-training around action-aligned reasoning, grounding-sensitive supervision, and partial verifiability.

The paper also states clear limitations. The training data are limited to existing open-source corpora, web coverage is still small relative to mobile coverage, fully online RL is not studied, and direct grounding supervision can trade off against reasoning and navigation performance. This suggests that GUI-Libra should be understood as a data-efficient post-training framework rather than a complete solution to open GUI-agent learning (Yang et al., 25 Feb 2026).
