GUI-Libra: Post-Training for Native GUI Agents

Updated 4 July 2026

GUI-Libra is a post-training framework for native GUI agents that maps user instructions, screen observations, and interaction history directly to executable actions.
It integrates a curated 81K-step GUI reasoning dataset with action-aware supervised fine-tuning and conservative RL to enhance long-horizon navigation.
The framework balances explicit reasoning and precise grounding, yielding significant improvements in both offline and online navigation benchmarks.

GUI-Libra is a post-training framework for native GUI agents: single end-to-end vision-language-action models that map a user instruction, GUI observation, and interaction history directly to executable actions in web and mobile environments. It is designed for long-horizon navigation, where open-source native GUI agents have lagged behind stronger closed-source or multi-module systems, and it combines a curated 81K-step / 9K-trajectory GUI reasoning dataset, action-aware supervised fine-tuning, and conservative RL for partially verifiable step-wise rewards (Yang et al., 25 Feb 2026). The name should not be conflated with the earlier Libra Toolkit for Probabilistic Models, which is a command-line toolkit for discrete probabilistic modeling rather than a graphical user interface system (Lowd et al., 2015).

1. Scope, problem setting, and nomenclature

In the terminology of the 2026 paper, native GUI agents are single end-to-end models that directly produce executable GUI actions from instruction and observation, instead of relying on an external planner or a separate grounding module. The policy is formulated as

$\pi_\theta(a_t \mid \ell, h_t, o_t),$

where $\ell$ is the task instruction, $h_t$ the interaction history, and $o_t$ the current observation. The target use case is high-level multi-step GUI navigation across Android and web interfaces, rather than isolated grounding or one-step clicking (Yang et al., 25 Feb 2026).

The work is motivated by two failures in generic post-training pipelines when transferred to GUI agents. First, standard SFT with CoT reasoning often hurts grounding. Second, step-wise RLVR-style training faces partial verifiability, because multiple actions can be correct at a given state while offline supervision usually verifies only a single demonstrated action. The paper treats these as structural properties of GUI training rather than incidental implementation defects. A plausible implication is that GUI-Libra is less an architectural proposal than a training recipe specialized to the joint requirements of reasoning, grounding, and sequential control.

A recurrent source of confusion is the word “Libra.” The 2015 Libra Toolkit is a collection of command-line programs and shared libraries for learning and inference with discrete probabilistic models, including Bayesian networks, Markov networks, dependency networks, sum-product networks, arithmetic circuits, and mixtures of trees. It is explicitly not presented as a GUI, visualization dashboard, or interactive graphical front-end; it is described as a toolkit of command-line executables plus shared OCaml libraries (Lowd et al., 2015). In that sense, “GUI-Libra” is not a GUI wrapper around the 2015 Libra software, but a separate 2026 line of work on GUI agents.

2. Overall training recipe and system formulation

GUI-Libra is organized as a three-part recipe: a curated reasoning dataset, Action-Aware SFT (ASFT), and conservative RL. The dataset is intended to compensate for the shortage of high-quality, action-aligned reasoning traces. ASFT is intended to preserve explicit reasoning without allowing long chain-of-thought to dominate grounding learning. Conservative RL is intended to stabilize policy optimization when step rewards are only partially verifiable (Yang et al., 25 Feb 2026).

The model input contains a system prompt listing available actions, a user instruction, interaction history, and the current screenshot. The model output uses a structured format: a tagged reasoning segment followed by an <answer> ... </answer> block that holds the action JSON. The unified action schema contains action_type, action_description, action_target, value, and point_2d. The action space comprises 13 action types: Click, Write, Terminate, Swipe, Scroll, NavigateHome, Answer, Wait, OpenAPP, NavigateBack, KeyboardPress, LongPress, and Select (Yang et al., 25 Feb 2026).
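As a rough illustration of the output schema, the following sketch extracts and validates the action JSON from a response. The field names and action types come from the description above; the `GuiAction` class, the `parse_action` helper, the field types, the assumption that the answer block holds a single JSON object, and the example response are all hypothetical.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# The 13 action types listed in the text.
ACTION_TYPES = {
    "Click", "Write", "Terminate", "Swipe", "Scroll", "NavigateHome", "Answer",
    "Wait", "OpenAPP", "NavigateBack", "KeyboardPress", "LongPress", "Select",
}


@dataclass
class GuiAction:
    """Hypothetical container for the unified action schema."""
    action_type: str
    action_description: str
    action_target: Optional[str]
    value: Optional[str]
    point_2d: Optional[Tuple[float, float]]


ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_action(response: str) -> GuiAction:
    """Extract the <answer> block and validate it (assumes one JSON object)."""
    match = ANSWER_RE.search(response)
    if match is None:
        raise ValueError("no <answer> block found")
    payload = json.loads(match.group(1))
    if payload.get("action_type") not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {payload.get('action_type')!r}")
    point = payload.get("point_2d")
    return GuiAction(
        action_type=payload["action_type"],
        action_description=payload.get("action_description", ""),
        action_target=payload.get("action_target"),
        value=payload.get("value"),
        point_2d=tuple(point) if point is not None else None,
    )


# Made-up response text, for illustration only.
example = (
    "The search box is at the top of the page. "
    '<answer>{"action_type": "Click", "action_description": "Tap the search box", '
    '"action_target": "search box", "value": null, "point_2d": [512, 88]}</answer>'
)
action = parse_action(example)
print(action.action_type, action.point_2d)
```

Validating against a closed set of action types mirrors the fixed action space, so malformed or out-of-vocabulary outputs fail early rather than reaching an executor.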

The formulation distinguishes reasoning tokens, action tokens, and grounding tokens. This partition is central because the paper’s diagnosis is that autoregressive training on long CoT responses causes optimization to overemphasize semantic reasoning relative to executable action specification. The resulting degradation is not merely theoretical: the paper reports that longer responses correlate with lower grounding accuracy on ScreenSpot-v2. This suggests that, in GUI settings, token-level supervision must be explicitly structured around execution-critical output rather than treated as a homogeneous language modeling target.

3. GUI-Libra-81K: dataset construction and filtering

The released dataset, GUI-Libra-81K, is constructed from public GUI trajectory corpora: GUI-Odyssey, AMEX, AndroidControl, AitZ, AitW, GUIAct, and MM-Mind2Web. Relative to AGUVIS’s collection, the dataset additionally includes the Chinese subset from GUIAct. Initial cleaning removes incomplete trajectories, trajectories shorter than 3 steps, trajectories longer than 50 steps, and steps with compound actions outside the action space. After this stage, the source pool contains 19K trajectories and 170K steps (Yang et al., 25 Feb 2026).
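The initial cleaning rules can be sketched as a simple filter. This is a minimal sketch, not the authors' pipeline: the trajectory layout (`complete`, `steps`, `action_type` keys) is hypothetical, and applying the length limits before dropping out-of-space steps is an assumption about the order of operations.

```python
from typing import Dict, List

ACTION_TYPES = {
    "Click", "Write", "Terminate", "Swipe", "Scroll", "NavigateHome", "Answer",
    "Wait", "OpenAPP", "NavigateBack", "KeyboardPress", "LongPress", "Select",
}
MIN_STEPS, MAX_STEPS = 3, 50  # length limits stated in the text


def clean_trajectories(trajectories: List[Dict]) -> List[Dict]:
    """Drop incomplete or badly sized trajectories, then drop unsupported steps."""
    kept = []
    for traj in trajectories:
        steps = traj["steps"]
        if not traj.get("complete", False):
            continue  # incomplete trajectory
        if len(steps) < MIN_STEPS or len(steps) > MAX_STEPS:
            continue  # shorter than 3 or longer than 50 steps
        # Remove steps whose action falls outside the action space.
        valid_steps = [s for s in steps if s["action_type"] in ACTION_TYPES]
        kept.append({**traj, "steps": valid_steps})
    return kept
```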

Most source datasets do not contain rich reasoning traces, so GUI-Libra synthesizes them with prompting. The paper compares GPT-4o, o4-mini, and GPT-4.1, and uses GPT-4.1 because it produces richer visible rationales. The prompt includes observation description, reflection on instruction and history, planning, action-related requirements, and strict output formatting. Importantly, the generator is not forced to repeat the original dataset action exactly; the original annotation is treated as a reference, and a different action may be produced if justified. Coordinates, however, are reused from the original dataset (Yang et al., 25 Feb 2026).

Filtering is aggressive and two-stage. First, Qwen3-VL-8B-Instruct is run for 10 stochastic runs on each input, and a step is discarded if action re-prediction accuracy is below 0.3. Second, coordinate alignment is verified by prompting Qwen3-VL-32B-Instruct to predict a bounding box from screenshot and target description; a sample is kept only if the original point lies inside the predicted box. After filtering, the final SFT set contains 81K steps and 9K trajectories. The RL subset is further balanced by downsampling early steps and mobile-heavy data, producing 40K steps (Yang et al., 25 Feb 2026).
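The two filtering stages can be sketched as a keep/discard decision per step. The thresholds (0.3 re-prediction accuracy over the stochastic runs, point-inside-box) come from the text; the function names, the string comparison used to match actions, and the box layout `(x_min, y_min, x_max, y_max)` are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def reprediction_accuracy(predicted_actions: Sequence[str], reference_action: str) -> float:
    """Fraction of stochastic runs that reproduce the reference action."""
    if not predicted_actions:
        return 0.0
    hits = sum(1 for p in predicted_actions if p == reference_action)
    return hits / len(predicted_actions)


def point_in_box(point: Tuple[float, float], box: Box) -> bool:
    """True if the annotated point lies inside the predicted bounding box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def keep_step(
    predicted_actions: Sequence[str],
    reference_action: str,
    point: Optional[Tuple[float, float]],
    predicted_box: Optional[Box],
    min_accuracy: float = 0.3,
) -> bool:
    """Stage 1: re-prediction accuracy; stage 2: coordinate alignment."""
    if reprediction_accuracy(predicted_actions, reference_action) < min_accuracy:
        return False
    if point is not None:
        if predicted_box is None or not point_in_box(point, predicted_box):
            return False
    return True
```

In this sketch, steps without a coordinate skip the second stage; whether the actual pipeline treats coordinate-free actions this way is not stated in the text.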

The dataset is predominantly mobile: only 14.3% of the SFT data is web-domain data. The average reasoning length is 210 thought tokens per step, compared with 11 for AndroidControl, 37 for GUI-Net-1M, 56 for AGUVIS Stage 2 L2, and 85 for AGUVIS Stage 2 L3. The action distribution is highly skewed: Click accounts for about 60% of steps, followed by Write, Terminate, and Swipe, while LongPress and Select are rare. This imbalance is one of the paper’s explicit motivations for RL.

Component	Value	Role
Source pool after cleaning	19K trajectories / 170K steps	Pre-filter corpus
Final SFT set	81K steps / 9K trajectories	GUI-Libra-81K
Final RL subset	40K steps	Balanced RL data

4. Action-aware supervised fine-tuning

ASFT modifies standard SFT in two ways. It mixes reasoning-then-action samples with direct-action samples produced by removing the CoT segment, and it reweights token groups to emphasize action specification and especially grounding coordinates. The loss is

$\mathcal{L}_{\text{ASFT}}(\theta) = - \mathbb{E}_{(x_t, c_t, a_t, g_t) \sim D_{\rm mix}} \frac{ \log \pi_\theta(c_t \mid x_t) + \alpha_a \log \pi_\theta(a_t \mid x_t, c_t) + \alpha_g \log \pi_\theta(g_t \mid x_t, c_t, a_t) }{ |c_t| + \alpha_a |a_t| + \alpha_g |g_t| },$

where $c_t$ denotes reasoning tokens, $a_t$ action tokens excluding point_2d, and $g_t$ the grounding tokens associated with point_2d (Yang et al., 25 Feb 2026).

Default weights are $\alpha_a = 2$ and $\alpha_g = 4$, with a different setting used for GUI-Libra-4B. In effect, coordinates receive the strongest supervision. The paper also notes useful special cases: $\alpha_a = \alpha_g = 1$ recovers standard SFT; $\alpha_a = \alpha_g \to \infty$ approximates CoT-free SFT; and $\alpha_g \to \infty$ with $\alpha_a$ held fixed approximates grounding-only SFT. This framing makes ASFT a controlled interpolation among standard instruction tuning, direct-action tuning, and grounding-focused tuning.
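To make the weighting concrete, here is a minimal per-sample sketch of the ASFT loss above, written over plain Python lists rather than a training framework. The function name and the `"c"`/`"a"`/`"g"` group labels are hypothetical; a real implementation would operate on batched tensors and average over the mixed dataset.

```python
from typing import Sequence


def asft_loss(
    logprobs: Sequence[float],
    groups: Sequence[str],
    alpha_a: float = 2.0,
    alpha_g: float = 4.0,
) -> float:
    """Weighted, length-normalized negative log-likelihood for one sample.

    logprobs[i] is the log-probability of target token i, and groups[i] marks
    it as reasoning ("c"), action ("a"), or grounding ("g").
    """
    weights = {"c": 1.0, "a": alpha_a, "g": alpha_g}
    weighted_logprob = sum(weights[g] * lp for lp, g in zip(logprobs, groups))
    weighted_length = sum(weights[g] for g in groups)
    return -weighted_logprob / weighted_length


# Toy sample: two reasoning tokens, one action token, one grounding token.
logprobs = [-1.0, -1.0, -2.0, -4.0]
groups = ["c", "c", "a", "g"]

standard = asft_loss(logprobs, groups, alpha_a=1.0, alpha_g=1.0)  # mean token NLL
weighted = asft_loss(logprobs, groups)                            # default weights
print(standard, weighted)
```

With both weights set to 1, the expression reduces to the ordinary mean token negative log-likelihood, which is why that setting corresponds to standard SFT. A direct-action sample simply has no `"c"` tokens.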

Ablations show that mixed direct-action data already yields large gains, particularly on online AndroidWorld, and that action-aware weighting adds further improvements. With Qwen2.5-VL-3B, the progression on MM-Mind2Web-v2 Pass@1 / AndroidControl-v2 High Pass@1 / AndroidWorld is 23.4 / 36.4 / 3.5 for the base model, 28.5 / 45.7 / 5.2 for SFT, 30.2 / 45.5 / 11.3 for SFT plus mixed data, and 32.0 / 44.5 / 13.0 for ASFT (Yang et al., 25 Feb 2026).

The grounding analysis is especially significant. Under reasoning mode on ScreenSpot-style evaluation, SFT-3B scores 73.4, SFT-3B with mixed data scores 73.8, ASFT-3B scores 76.2, and GUI-Libra-3B reaches 83.4. At 7B, the corresponding reasoning-mode scores rise from 79.0 for SFT-7B to 81.4 for mixed-data SFT, 83.4 for ASFT, and 89.3 for GUI-Libra-7B. The paper interprets this as evidence that RL nearly removes the remaining reasoning-grounding gap after ASFT.

5. Conservative RL under partial verifiability

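The paper defines GUI RL rewards as partially verifiable. For a state $s_t$, the dataset provides one demonstrated action $\tilde{a}_t$, and the verifier reward is

$\tilde{r}(s_t, a) = \mathbb{1}\left[ a \text{ matches } \tilde{a}_t \right].$

This is partially verifiable because

$\tilde{r}(s_t, a) = 1 \;\Rightarrow\; a \text{ is valid at } s_t, \quad \text{but} \quad \tilde{r}(s_t, a) = 0 \;\not\Rightarrow\; a \text{ is invalid at } s_t.$

A zero reward may therefore correspond either to a truly incorrect action or to a valid alternative that the offline annotation did not record (Yang et al., 25 Feb 2026).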

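RL uses GRPO with KL regularization. In its standard form, for a group of $G$ responses $y_1, \dots, y_G$ sampled at a state $s_t$, the objective is

$J(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( \rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1 - \epsilon, 1 + \epsilon)\, \hat{A}_i \right) - \beta\, D_{\mathrm{KL}}\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \right],$

with $\rho_i = \pi_\theta(y_i \mid s_t) / \pi_{\theta_{\mathrm{old}}}(y_i \mid s_t)$ and group-normalized advantage $\hat{A}_i = \left( r_i - \operatorname{mean}(r_1, \dots, r_G) \right) / \operatorname{std}(r_1, \dots, r_G)$. Reward shaping combines a small format term and a mostly action-accuracy term:

$r = w_{\text{fmt}}\, r_{\text{fmt}} + w_{\text{acc}}\, r_{\text{acc}}, \quad w_{\text{fmt}} \ll w_{\text{acc}},$

where $r_{\text{acc}}$ is built from three checks: $r_{\text{type}}$ verifies action type, $r_{\text{value}} = 1$ if word-level F1 for value exceeds 0.5, and $r_{\text{point}}$ verifies that the predicted point lies inside the demonstrated bounding box (Yang et al., 25 Feb 2026).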
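The reward checks described above can be sketched as follows. Only the three checks and the 0.5 F1 threshold come from the text. Combining the checks as a conjunction, the format weight `w_format = 0.1`, the dictionary layout, and the whitespace tokenization used for word-level F1 are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional


def word_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between two strings (lowercased, whitespace-tokenized)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def accuracy_reward(pred: dict, demo: dict) -> float:
    """1.0 only if type, value, and point checks all pass (assumed conjunction)."""
    type_ok = pred.get("action_type") == demo["action_type"]
    value_ok = True
    if demo.get("value") is not None:
        value_ok = word_f1(pred.get("value") or "", demo["value"]) > 0.5
    point_ok = True
    if demo.get("bbox") is not None:
        point = pred.get("point_2d")
        if point is None:
            point_ok = False
        else:
            x, y = point
            x_min, y_min, x_max, y_max = demo["bbox"]
            point_ok = x_min <= x <= x_max and y_min <= y <= y_max
    return float(type_ok and value_ok and point_ok)


def shaped_reward(format_ok: bool, pred: Optional[dict], demo: dict,
                  w_format: float = 0.1) -> float:
    """Small format term plus a dominant accuracy term (weight is hypothetical)."""
    r_acc = accuracy_reward(pred, demo) if format_ok and pred is not None else 0.0
    return w_format * float(format_ok) + (1.0 - w_format) * r_acc
```

Note that a reward of `w_format` alone means only that the prediction did not match the single demonstrated action, which is exactly the ambiguity that partial verifiability describes.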

A central theoretical claim is that offline step metrics predict online success only when occupancy mismatch and off-demonstration valid-action mass are controlled. The paper formalizes these quantities and derives a lower bound on online success in terms of them.

This suggests that improving offline step matching is insufficient unless the policy remains close enough to the reference distribution and does not move excessive probability mass onto uncredited but valid alternatives.

The RL recipe therefore emphasizes a KL trust region. Empirically, a small but nonzero KL coefficient works best: for 3B on AndroidWorld, no KL gives 21.7, a coefficient of 0.001 gives 25.2, 0.01 gives 21.7, and 0.05 gives 20.0. Checkpoint correlations between offline and online metrics strengthen markedly when KL is present: the overall Pearson correlation is 0.76; with KL regularization it rises to 0.89 with Spearman 0.83; without KL it drops to Pearson 0.63 and Spearman 0.53 (Yang et al., 25 Feb 2026). The paper treats this as evidence that KL is not merely a generic stabilizer but a mechanism for preserving offline-to-online predictability under partial verifiability.
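For readers who want to run the same kind of checkpoint analysis, the sketch below computes Pearson and Spearman correlations between paired offline and online scores using only the standard library. The helper names are hypothetical, and the checkpoint scores are made-up numbers for illustration only, not results from the paper.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up per-checkpoint scores (offline step accuracy, online success rate).
offline = [40.0, 42.0, 45.0, 44.0, 47.0]
online = [18.0, 20.0, 24.0, 21.0, 25.0]
print(pearson(offline, online), spearman(offline, online))
```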

GUI-Libra also introduces success-adaptive negative gradient scaling. For each sampled group, a scaling factor is computed from the group's success rate, and negative advantages are multiplied by that factor. This downweights ambiguous negative updates when group success is low. On GUI-Libra-4B, the method raises AndroidWorld from 39.1 to 42.6 and WebArena-Lite-v2 from 22.2 to 24.4 (Yang et al., 25 Feb 2026).
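The idea of success-adaptive scaling can be illustrated with a small sketch on top of group-normalized advantages. The scaling function used here (multiply negative advantages by the group's success rate) is a hypothetical stand-in chosen only because it shrinks negative updates when group success is low. The paper's exact formula is not reproduced here, and the function names, the `eps` term, and the success threshold are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO-style advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def scale_negative_advantages(
    advantages: Sequence[float],
    rewards: Sequence[float],
    success_threshold: float = 1.0,
) -> List[float]:
    """Shrink negative advantages by the group success rate (hypothetical rule)."""
    success_rate = sum(r >= success_threshold for r in rewards) / len(rewards)
    return [a * success_rate if a < 0 else a for a in advantages]


# One group of 4 sampled responses, only one of which matched the demonstration.
rewards = [1.0, 0.0, 0.0, 0.0]
adv = group_normalized_advantages(rewards)
scaled = scale_negative_advantages(adv, rewards)
print(adv, scaled)
```

Under this stand-in rule, a group with a single success out of four keeps its positive advantage unchanged while its negative advantages are multiplied by 0.25. The unmatched samples, which may include valid but uncredited actions, are therefore pushed down less.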

6. Models, evaluation, and significance

GUI-Libra is trained from Qwen2.5-VL-3B-Instruct, Qwen2.5-VL-7B-Instruct, Qwen3-VL-4B-Instruct, and Qwen3-VL-8B-Instruct, yielding released models from 3B to 8B. The observation modality is primarily the screenshot plus textual context. Qwen2.5-VL-based models use absolute pixel coordinates, whereas Qwen3-VL-based models use normalized coordinates. SFT uses full-parameter fine-tuning with an effective batch size of 256. RL uses 300 training iterations, a rollout batch size of 256, a global batch size of 128, a group size of 8, ratio clipping, and a default KL coefficient of 0.005 for 7B and 0.001 for the other models (Yang et al., 25 Feb 2026).
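Because the two model families use different coordinate conventions, evaluation code typically needs a conversion step. The sketch below shows one way to do this. The function names are hypothetical, and `scale` (the upper end of the normalized range) is left as a required argument because it is an assumption that depends on the model family; the value 1000 in the usage lines is only an example.

```python
from typing import Tuple


def to_normalized(x_px: float, y_px: float, width: int, height: int,
                  scale: float) -> Tuple[float, float]:
    """Map absolute pixel coordinates to the range [0, scale]."""
    return (x_px / width * scale, y_px / height * scale)


def to_pixels(x_norm: float, y_norm: float, width: int, height: int,
              scale: float) -> Tuple[float, float]:
    """Map coordinates in [0, scale] back to absolute pixels."""
    return (x_norm / scale * width, y_norm / scale * height)


# Example on a 1080 x 2400 screenshot, with an assumed scale of 1000.
center = to_normalized(540, 1200, 1080, 2400, scale=1000)
print(center)
print(to_pixels(*center, 1080, 2400, scale=1000))
```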

Evaluation spans grounding, offline navigation, and online navigation. Grounding uses ScreenSpot-v2 and ScreenSpot-Pro. Offline navigation uses AndroidControl-v2 and MM-Mind2Web-v2. Online navigation uses AndroidWorld, WebArena-Lite-v2, and Online-Mind2Web. The offline datasets were refined because MM-Mind2Web contains symbolic or non-natural action histories and AndroidControl contains substantial label errors. The online suites cover 115 usable AndroidWorld tasks, 154 WebArena-Lite-v2 tasks, and 300 Online-Mind2Web tasks over 136 websites (Yang et al., 25 Feb 2026).

The main quantitative results show consistent gains in both step-wise accuracy and end-to-end success. On AndroidControl-v2 High Pass@1, GUI-Libra improves from 36.4 to 57.3 at 3B, from 46.5 to 59.3 at 7B, from 49.3 to 62.3 at 4B, and from 54.8 to 64.3 at 8B. On MM-Mind2Web-v2 average Pass@1, the improvements are 23.4 → 42.7 at 3B, 32.5 → 46.5 at 7B, 41.2 → 50.0 at 4B, and 43.8 → 50.5 at 8B (Yang et al., 25 Feb 2026).

Online gains are larger still. On AndroidWorld, the progression is 3.5 → 25.2 at 3B, 7.8 → 29.6 at 7B, 27.0 → 42.6 at 4B, and 30.4 → 42.6 at 8B. On WebArena-Lite-v2 average, the scores rise from 0.8 to 16.7 at 3B, 4.9 to 22.6 at 7B, 11.9 to 24.4 at 4B, and 15.3 to 26.6 at 8B. On Online-Mind2Web average overall, the changes are 4.8 → 21.3 at 3B, 15.8 → 25.5 at 7B, 21.7 → 25.7 at 4B, and 19.3 → 28.0 at 8B (Yang et al., 25 Feb 2026).

These results support three broader claims. First, long-horizon GUI performance can be improved substantially without costly online data collection, provided data curation and post-training are specialized to GUI interaction. Second, explicit reasoning is useful, but only when balanced against executable action supervision; removing CoT from training hurts sharply, especially online. Third, RL for GUI agents is not well described by fully verifiable RLVR assumptions. GUI-Libra’s central conceptual contribution is therefore the reframing of GUI post-training around action-aligned reasoning, grounding-sensitive supervision, and partial verifiability.

The paper also states clear limitations. The training data are limited to existing open-source corpora, web coverage is still small relative to mobile coverage, fully online RL is not studied, and direct grounding supervision can trade off against reasoning and navigation performance. This suggests that GUI-Libra should be understood as a data-efficient post-training framework rather than a complete solution to open GUI-agent learning (Yang et al., 25 Feb 2026).


References

Yang et al. (2026). GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL.

Lowd et al. (2015). The Libra Toolkit for Probabilistic Models.
