Action-Aware SFT: Structured Supervision

Updated 4 July 2026

Action-Aware SFT is a supervised fine-tuning paradigm that structures its learning signal around action-critical tokens, grounded entities, and policy-induced states.
It enhances performance by reweighting and regularizing outputs—such as action tokens and grounding coordinates—to better align with executable behaviors in tasks like robotics and GUI control.
The approach serves as a design principle for integrating structured action supervision into model training, improving control capabilities without solely relying on reinforcement learning.

Action-Aware SFT is a family of supervised fine-tuning regimes in which the supervision signal is organized around actions, action-critical tokens, grounded entities, or policy-induced states, rather than treated as uniform next-token imitation. In recent arXiv work, the term spans instruction-conditioned vision-language-action policies that jointly decode category, box, and motor command (Lin et al., 22 Jun 2026), native GUI agents that rebalance reasoning, action, and point_2d grounding tokens (Yang et al., 25 Feb 2026), robotic finetuning that regularizes distributions over feasible action neighborhoods (Niu et al., 2 Apr 2026), and manipulation policies that reweight low-velocity timesteps as physically critical (Peng et al., 13 May 2026). The term does not denote a single canonical algorithm; it denotes a recurring design principle for making supervised post-training respect the structure of control.

1. Scope, terminology, and problem framing

Across recent work, Action-Aware SFT refers to supervised post-training schemes in which the target of learning is not merely a text continuation, but a structured action-bearing object: a discrete motor primitive, a continuous action chunk, a GUI command with coordinates, a streaming moderation decision, or a response-level policy outcome. This suggests that the defining property is not the output modality alone, but whether supervision is shaped by the operational role of the output in downstream behavior.

A compact way to classify the literature is to distinguish what counts as the action-bearing unit and how supervision is modified around it.

Setting	Action-bearing unit	Representative mechanism
VLA robotics/endoscopy	Motor command or action chunk	Joint grounding-and-action prediction
GUI agents	Structured GUI action plus coordinates	Token reweighting over action and grounding spans
Robotic manipulation	Continuous action distribution or timestep loss	FAN regularization or inverse-velocity weighting
Streaming guardrails	Prefix-time SAFE/UNSAFE decision	Prefix-labeled SFT on cumulative prefixes
Instruction following	Full generated response	Response-level RL from supervised reward

The phrase also sits in a terminological field with notable ambiguity. In most of the relevant machine learning literature here, SFT means supervised fine-tuning. However, the symbolic dynamics paper “SFT covers for actions of the first Grigorchuk group” uses SFT to mean subshift of finite type, which is unrelated to post-training of neural models (Grigorchuk et al., 2024). That acronym collision is a frequent source of confusion. A second boundary case is streaming safety classification: “Guard Vector” treats the intermediate action as a prefix-time SAFE/UNSAFE decision rather than a motor command, while “RLSR” treats the full response as the action and replaces token-level imitation with sequence-level reward optimization (Lee et al., 27 Sep 2025); (Wang et al., 16 Oct 2025).

2. Joint prediction of grounding and action

One major branch of Action-Aware SFT makes action prediction part of a structured, grounded output rather than a separate downstream head. In “BiliVLA” (Lin et al., 22 Jun 2026), biliary endoscopic navigation is formalized as an instruction-conditioned visuomotor problem: at each timestep the policy receives an RGB endoscopic image $o_t$ and a stage-specific instruction $I$, and predicts a tuple $s_t=(c_t,b_t,a_t)$ consisting of target category, bounding box, and action. The action space is discrete, $\mathcal{A} = \{\text{left-up}, \text{left-down}, \text{right-up}, \text{right-down}, \text{left}, \text{right}, \text{up}, \text{down}, \text{forward}, \text{backward}, \text{stop}\}$, but each discrete action maps deterministically to a continuous 3-DoF motor increment

$\Delta\boldsymbol{\theta}_t = (\Delta \theta_{x,t}, \Delta \theta_{y,t}, \Delta \theta_{z,t}), \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \Delta\boldsymbol{\theta}(a_t).$

The key point is that action tokens are embedded directly in the autoregressively generated target sequence; the paper explicitly states that action labels are part of the same teacher-forced structured output as category and box tokens, rather than an auxiliary head trained after grounding.
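The deterministic mapping from a discrete primitive to a motor increment can be illustrated with a small lookup table. This is a minimal sketch, not the paper's implementation: the eleven action names come from the text, but the step size, the axis assignment (x for left/right, y for up/down, z for forward/backward), and the names `ACTION_TO_DELTA` and `apply_action` are assumptions made for illustration.

```python
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]

# HYPOTHETICAL step size and axis convention; only the action names are from the text.
STEP = 1.0
ACTION_TO_DELTA: Dict[str, Vec3] = {
    "left-up": (-STEP, STEP, 0.0),
    "left-down": (-STEP, -STEP, 0.0),
    "right-up": (STEP, STEP, 0.0),
    "right-down": (STEP, -STEP, 0.0),
    "left": (-STEP, 0.0, 0.0),
    "right": (STEP, 0.0, 0.0),
    "up": (0.0, STEP, 0.0),
    "down": (0.0, -STEP, 0.0),
    "forward": (0.0, 0.0, STEP),
    "backward": (0.0, 0.0, -STEP),
    "stop": (0.0, 0.0, 0.0),
}


def apply_action(theta: Vec3, action: str) -> Vec3:
    """Return theta_{t+1} = theta_t + delta_theta(a_t) for a discrete action."""
    dx, dy, dz = ACTION_TO_DELTA[action]
    return (theta[0] + dx, theta[1] + dy, theta[2] + dz)
```

Because the table is fixed, supervising the discrete token is equivalent to supervising the increment, which is what lets the action sit in the same token sequence as the category and box.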

This grounded-output pattern appears in a different form in GUI-Libra. There, each step has context $x_t=(\ell,h_t,o_t)$, with instruction $\ell$, history $h_t$, and screenshot $o_t$, and the model emits a reasoning trace followed by a structured action. The action schema includes fields such as action_type, action_description, action_target, value, and point_2d, with 13 supported action types including Click, Write, Swipe, NavigateBack, LongPress, and Select (Yang et al., 25 Feb 2026). Action-aware supervision therefore includes both the symbolic action type and the grounding coordinates that make it executable.
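A structured action of this kind can be sketched as a small record type. The field names below are the ones listed in the text; everything else is an assumption for illustration: the class name `GuiAction`, the optional-field defaults, the restriction to the six action types the text names (the full set of 13 is not listed), and the rule that Click and LongPress require coordinates.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple

# Only the six types named in the text; the remaining seven are not listed there.
KNOWN_ACTION_TYPES = {"Click", "Write", "Swipe", "NavigateBack", "LongPress", "Select"}
# HYPOTHETICAL rule: which types need a grounding point.
NEEDS_POINT = {"Click", "LongPress"}


@dataclass
class GuiAction:
    action_type: str
    action_description: str
    action_target: Optional[str] = None
    value: Optional[str] = None
    point_2d: Optional[Tuple[int, int]] = None

    def validate(self) -> None:
        """Raise ValueError if the action is not executable under the rules above."""
        if self.action_type not in KNOWN_ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type}")
        if self.action_type in NEEDS_POINT and self.point_2d is None:
            raise ValueError(f"{self.action_type} requires point_2d")

    def to_json(self) -> str:
        """Serialize a validated action to a JSON string."""
        self.validate()
        return json.dumps(asdict(self))
```

A call such as `GuiAction("Click", "tap the login button", "login button", None, (120, 340)).to_json()` yields one JSON object in which the symbolic type and the grounding point appear side by side, which is the property the supervision relies on.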

Z-1 instantiates a complementary design in continuous-control VLA form. Rather than unifying language and action in one autoregressive token stream, it keeps a modular decomposition: a PaliGemma vision-language backbone produces a multimodal prefix representation, and a flow-based Action Expert consumes that representation plus robot state to generate continuous action chunks (Cao et al., 30 Jun 2026). The policy conditions on an input that includes the language instruction, visual observation, and proprioceptive state. During SFT, both the VLM and the Action Expert are updated together on RoboCasa demonstrations, so action supervision backpropagates through the full perception-language-to-action stack even though the action space is continuous and chunked rather than discrete and tokenized.

System	Structured output	Action representation
BiliVLA	Category, box, action	11 discrete primitives mapped to 3-DoF motor increments
GUI-Libra	Reasoning plus JSON action	Structured GUI action with `point_2d` grounding
Z-1	Continuous action chunk conditioned on multimodal prefix	Flow-based continuous action chunks

A common implication is that Action-Aware SFT is strongest when the supervised output already contains the variables that operationally define action selection. In BiliVLA, desired motion is defined relative to the grounded target center; in GUI-Libra, executable behavior depends on both action type and coordinates; in Z-1, action generation remains separate from language decoding, but SFT still jointly adapts the perception and action modules.

3. Reweighting, regularizing, and relabeling the supervised signal

A second branch of Action-Aware SFT keeps the same broad input-output interface but changes the loss geometry so that action-relevant parts of the output receive disproportionate supervision. GUI-Libra is a direct example. Its Action-Aware SFT (ASFT) objective decomposes output tokens into reasoning tokens, non-grounding action tokens, and grounding tokens for point_2d, then applies a normalized weighted loss in which each group carries its own weight.

The default setting fixes these group weights and trains on a mixed dataset containing both reasoning-then-action and direct-action variants (Yang et al., 25 Feb 2026). The design target is explicit: preserve long-horizon reasoning while preventing long CoT spans from dominating gradients and degrading grounding.
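A normalized weighted token loss of this general shape can be sketched in a few lines. This is an illustration of the idea rather than GUI-Libra's exact objective: the function name, the group labels, and the example weights are assumptions, and the normalization shown (dividing by the sum of weights) is one common choice.

```python
from typing import Dict, Sequence


def weighted_token_loss(
    nll: Sequence[float],
    groups: Sequence[str],
    weights: Dict[str, float],
) -> float:
    """Weighted mean of per-token negative log-likelihoods.

    nll[i] is the loss of token i and groups[i] is its span type
    ("reasoning", "action", or "grounding" in this sketch).
    """
    num = 0.0
    den = 0.0
    for loss, group in zip(nll, groups):
        w = weights[group]
        num += w * loss
        den += w
    return num / den if den > 0 else 0.0


# Eight easy reasoning tokens followed by two hard action/grounding tokens.
nll = [1.0] * 8 + [4.0, 4.0]
groups = ["reasoning"] * 8 + ["action", "grounding"]

uniform = weighted_token_loss(nll, groups, {"reasoning": 1.0, "action": 1.0, "grounding": 1.0})
# HYPOTHETICAL weights that down-weight the long reasoning span.
reweighted = weighted_token_loss(nll, groups, {"reasoning": 0.25, "action": 1.0, "grounding": 1.0})
```

With uniform weights the eight reasoning tokens dominate the average; after down-weighting them, the two action-bearing tokens account for most of the loss, which is the effect the objective is designed to produce.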

In robotic manipulation, the same principle appears as action-distribution shaping rather than token reweighting. “Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior” argues that standard VLA SFT inherits a one-exact-action bias from language modeling, even though physical control admits neighborhoods of near-equivalent actions (Niu et al., 2 Apr 2026). It formally defines the feasible action neighborhood (FAN) and adds a KL regularizer toward a Gaussian prior centered at the policy mode. This is action-aware because the supervisory target is no longer a single demonstrated bin; it is a smooth, unimodal, distance-aware action neighborhood.
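The effect of a Gaussian prior centered at the policy mode can be seen on a toy distribution over discretized action bins. The sketch below is written under stated assumptions and is not the paper's formulation: the KL direction (policy to prior), the bin-index distance, the values of `sigma` and `lam`, and all function names are illustrative choices.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def gaussian_prior(num_bins: int, center: int, sigma: float) -> List[float]:
    """Discretized Gaussian over bin indices, normalized to sum to 1."""
    w = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(num_bins)]
    z = sum(w)
    return [x / z for x in w]


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def fan_kl(logits: Sequence[float], sigma: float = 1.0) -> float:
    """KL from the policy to a Gaussian prior centered at the policy's own mode."""
    probs = softmax(logits)
    mode = max(range(len(probs)), key=probs.__getitem__)
    return kl(probs, gaussian_prior(len(probs), mode, sigma))


def fan_regularized_loss(
    logits: Sequence[float], target_bin: int, sigma: float = 1.0, lam: float = 0.1
) -> float:
    """Cross-entropy on the demonstrated bin plus the neighborhood regularizer."""
    ce = -math.log(softmax(logits)[target_bin])
    return ce + lam * fan_kl(logits, sigma)
```

A policy whose mass sits in one smooth bump around its mode incurs a small penalty, while a policy that splits its mass between distant bins is penalized heavily, which matches the unimodal, distance-aware target described above.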

AttenA+ applies a different kind of restructuring: it treats action supervision as temporally heterogeneous. Its baseline supervised objective sums losses uniformly across time; AttenA+ replaces that uniform sum with inverse-velocity weighting, in which each timestep's loss is scaled by a weight that grows as the motion velocity falls. The main experiments use inverse-squared weighting (Peng et al., 13 May 2026). Low-velocity segments are treated as precision-critical, so supervised gradients are reallocated toward contact-rich or alignment-sensitive timesteps.
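Inverse-velocity weighting can be sketched as follows. The stabilizing constant `eps`, the normalization to a mean weight of 1, and the function names are assumptions made for this illustration; `power=2.0` corresponds to the inverse-squared variant mentioned in the text.

```python
from typing import List, Sequence


def inverse_velocity_weights(
    speeds: Sequence[float], power: float = 2.0, eps: float = 1e-3
) -> List[float]:
    """Per-timestep weights proportional to 1 / (speed**power + eps).

    Weights are rescaled so that they average to 1 over the trajectory,
    which keeps the overall loss scale comparable to the uniform baseline.
    """
    raw = [1.0 / (s ** power + eps) for s in speeds]
    total = sum(raw)
    n = len(raw)
    return [n * r / total for r in raw]


def weighted_trajectory_loss(
    step_losses: Sequence[float], speeds: Sequence[float],
    power: float = 2.0, eps: float = 1e-3,
) -> float:
    """Mean of per-timestep losses after inverse-velocity reweighting."""
    w = inverse_velocity_weights(speeds, power, eps)
    return sum(wi * li for wi, li in zip(w, step_losses)) / len(step_losses)


# A fast approach phase followed by a slow, precise final segment.
speeds = [1.0, 1.0, 0.1, 0.1]
weights = inverse_velocity_weights(speeds)
```

In this toy trajectory the two slow timesteps receive almost all of the weight, so an error made during the slow segment costs far more than the same error made during the fast approach.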

A fourth variant relabels supervision across time rather than across output spans. “Guard Vector” trains a guard model on cumulative prefixes of the text and labels each prefix SAFE or UNSAFE depending on whether harmful content has appeared by that point, then uses binary cross-entropy over the two label tokens <SAFE> and <UNSAFE> (Lee et al., 27 Sep 2025). The action being supervised is the streaming moderation decision, and the action-awareness lies in aligning SFT with the deployment-time decision loop instead of only full-text classification.
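Prefix labeling of this kind can be sketched as a small data-construction routine. The sketch assumes token-level granularity and an annotation giving the index of the first harmful token; both, along with the function name `prefix_examples`, are illustrative assumptions rather than details from the paper.

```python
from typing import List, Optional, Sequence, Tuple


def prefix_examples(
    tokens: Sequence[str], first_harmful_index: Optional[int]
) -> List[Tuple[str, str]]:
    """Build (prefix_text, label) pairs for every cumulative prefix.

    first_harmful_index is the 0-based position of the first harmful token,
    or None if the whole text is benign. A prefix is UNSAFE once it includes
    that token and stays UNSAFE for every longer prefix.
    """
    examples = []
    for end in range(1, len(tokens) + 1):
        unsafe = first_harmful_index is not None and end > first_harmful_index
        label = "<UNSAFE>" if unsafe else "<SAFE>"
        examples.append((" ".join(tokens[:end]), label))
    return examples


labels = [lab for _, lab in prefix_examples(["a", "b", "c", "d"], first_harmful_index=2)]
```

The label sequence flips exactly once and never flips back, so the training data mirrors the decision a streaming guard has to make as each new chunk arrives.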

Taken together, these methods suggest several distinct axes of action-aware supervision: which output spans matter most, which regions of action space are acceptable, which timesteps are physically critical, and when during a sequence a decision should flip.

4. From offline imitation to on-policy and stage-coupled post-training

Recent work rarely treats Action-Aware SFT as a terminal stage. Instead, it is frequently the initialization or the supervised component of a larger SFT-then-RL or on-policy correction pipeline. In BiliVLA, grounding-enhanced SFT initializes a policy that is later refined by GRPO with a decomposed reward, so localization, action correctness, and executable structure remain coupled during RL (Lin et al., 22 Jun 2026). Z-1 follows the same pattern at a larger robotic scale: a scene-specific SFT stage on RoboCasa demonstrations initializes the policy, and task-wise GRPO then raises average success across 24 tasks (Cao et al., 30 Jun 2026). In both cases, SFT supplies a semantically grounded action prior, while RL is used to improve recovery, consistency, and off-manifold robustness.
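A reward that couples localization, action correctness, and output structure can be sketched together with GRPO's group-relative advantage. The component definitions below (box IoU, exact action match, a well-formedness gate), the weights, and the function names are assumptions for illustration and should not be read as BiliVLA's reward; the advantage step is the standard GRPO normalization of each reward by the mean and standard deviation of its sampling group.

```python
import statistics
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def decomposed_reward(
    pred_box: Box, gold_box: Box, pred_action: str, gold_action: str,
    well_formed: bool, w_loc: float = 1.0, w_act: float = 1.0, w_fmt: float = 0.5,
) -> float:
    """HYPOTHETICAL reward: zero if the output cannot be parsed, else a weighted sum."""
    if not well_formed:
        return 0.0
    return w_loc * iou(pred_box, gold_box) + w_act * float(pred_action == gold_action) + w_fmt


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt's samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because all three components feed one scalar, a sampled output cannot score well by getting the action right while grounding the wrong region, which is the coupling the text describes.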

VLA-OPD pushes this further by making supervision explicitly depend on the student’s own action-induced states. At each iteration, the student policy rolls out trajectories, and a frozen teacher provides dense action distributions on the states those rollouts visit. The alignment objective is Reverse-KL, which the paper argues is bounded, mode-seeking, and more stable than Forward-KL or hard cross-entropy on teacher argmax actions (Zhong et al., 27 Mar 2026). This is a particularly strong formulation of action-aware supervised correction because labels are attached to learner-induced states rather than only expert-distribution states.
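The mode-seeking behavior of Reverse-KL can be checked on a toy discrete action distribution. In the sketch below, Reverse-KL means KL from student to teacher and Forward-KL means KL from teacher to student; the three-action distributions and the function names are made up for illustration, and in the on-policy setting these divergences would be evaluated on states visited by the student's own rollouts.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def reverse_kl(student: Sequence[float], teacher: Sequence[float]) -> float:
    """KL(student || teacher): penalizes student mass where the teacher has little."""
    return kl(student, teacher)


def forward_kl(student: Sequence[float], teacher: Sequence[float]) -> float:
    """KL(teacher || student): penalizes the student for missing teacher mass."""
    return kl(teacher, student)


# Teacher is bimodal: two good actions, one bad action in between.
teacher = [0.49, 0.02, 0.49]
committed = [0.96, 0.02, 0.02]   # student commits to one teacher mode
spread = [1 / 3, 1 / 3, 1 / 3]   # student spreads mass, including on the bad action
```

Reverse-KL scores the committed student better than the spread one, while Forward-KL prefers the spread student; for control, committing to one valid action is usually preferable to averaging across modes.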

WAM-RL shows a different coupling. Its online video SFT refines the world model, not the actor, using successful rollouts, while the actor is updated by RL using reconstruction rewards. The SFT is only indirectly action-aware, since it is applied to the world model rather than an action head, but the paper’s empirical claim is that joint optimization of world model and actor is critical for long-horizon control (Qian et al., 16 Jun 2026).

A related but more data-centric line argues that the assignment of samples to SFT or RL should itself be policy-aware. “Learning What to Learn” partitions data by empirical solve rate, uses medium samples plus bridge-transformed hard samples for SFT, restricts RL to prompts with at least one successful rollout, and recycles all-zero-reward failures into new supervised data via Critique Fine-Tuning (He et al., 3 Jun 2026). “Decouple before Integration” (DoTS) reaches a similar conclusion at the parameter level: SFT and RLVR update subspaces that are partly complementary but also partly conflicting, so the two may be better trained independently and combined only at test time through sparsified task vectors (Yuan et al., 1 May 2026).
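The solve-rate routing described for “Learning What to Learn” can be sketched as a per-prompt decision. The thresholds `easy_at` and `hard_at`, the bucket names, and the function name `route_prompt` are assumptions for illustration; the text does not say how easy samples are used, so the sketch leaves them out of the SFT pool without claiming that is the paper's choice.

```python
from typing import Dict, Sequence, Union


def route_prompt(
    rewards: Sequence[float], easy_at: float = 0.8, hard_at: float = 0.2
) -> Dict[str, Union[float, str, bool]]:
    """Route one prompt based on the rewards of its sampled rollouts.

    A rollout counts as solved when its reward is positive.
    """
    rate = sum(1 for r in rewards if r > 0) / len(rewards)
    if rate >= easy_at:
        bucket = "easy"
    elif rate > hard_at:
        bucket = "medium"
    else:
        bucket = "hard"
    return {
        "solve_rate": rate,
        "bucket": bucket,
        "sft": bucket in {"medium", "hard"},   # medium, plus hard after bridging
        "needs_bridge": bucket == "hard",      # hard samples are transformed first
        "rl": rate > 0,                        # at least one successful rollout
        "critique_ft": rate == 0,              # all-zero-reward failures are recycled
    }
```

For example, `route_prompt([0, 0, 0, 0])` marks the prompt as hard, excludes it from RL, and flags it for critique-based recycling, while `route_prompt([1, 0, 1, 0])` lands in the medium bucket and is eligible for both SFT and RL.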

5. Mechanistic and controllability perspectives

Action-Aware SFT also has a mechanistic interpretation. “Supervised Fine-Tuning Achieve Rapid Task Adaption Via Alternating Attention Head Activation Patterns” models SFT adaptation through a task-specific attention-pattern matrix and argues that SFT rapidly reconfigures which pre-existing attention heads are used for a downstream task rather than diffusely relearning the entire model (Zhao et al., 2024). Across Llama3-8B, Gemma-7B, and OPT-6.7B, activation distributions become less concentrated after SFT, and complex-task activation changes can be fit as linear combinations of basic-task changes, both for SGSM from GSM8K plus CodeSearchNet and on the Infinity Instruct decomposition. A plausible implication is that action-aware supervision may work partly by routing the model into the head patterns corresponding to sub-actions or prerequisite skills.

“Crafting Reversible SFT Behaviors in LLMs” provides a stronger structural-control view. It gives a formal representation of the fine-tuning update, then uses Loss-Constrained Dual Descent to compress an SFT-induced behavior into a sparse carrier, and SFT-Eraser to learn a soft prompt that drives carrier activations back toward the base-model state without weight edits (Lin et al., 7 May 2026). The paper is not about action policies in the robotic sense, but it is highly relevant to Action-Aware SFT as a theory of modularization: if a post-trained behavior can be concentrated into a sparse, causally necessary subnetwork, then action-aware supervision can be viewed not only as better labels or losses, but also as an attempt to localize behavior into controllable internal circuits.

These mechanistic works are more suggestive than definitive for action control. The attention-head paper is about tasks rather than explicit motor actions, and the reversible-carrier paper studies behaviors such as safety alignment, fixed responses, and style. Still, they shift the discussion from what supervision is provided to how that supervision is routed and localized inside the model.

6. Empirical profile, limitations, and common misconceptions

Empirically, action-aware variants often improve deployment-relevant behavior more than raw token imitation would predict. BiliVLA reports overall metrics for the full system alongside ablations that remove GRPO and scene-aware modeling (Lin et al., 22 Jun 2026). GUI-Libra shows that long CoT harms GUI grounding and that ASFT partially repairs the reasoning-grounding tradeoff: for Qwen2.5-VL-7B in reason mode, ASFT reaches higher grounding accuracy than mixed-data SFT (Yang et al., 25 Feb 2026). FAN-SFT improves ManiSkill SFT results both in-distribution and on average OOD (Niu et al., 2 Apr 2026). AttenA+ raises OpenVLA-OFT average success on Libero and improves Franka real-world average success (Peng et al., 13 May 2026). Guard Vector shows that prefix-aware SFT can nearly eliminate offline-to-streaming mismatch: on the Harmlessness Evaluation Dataset, full-text SFT shows a gap between offline and streaming F1, whereas prefix SFT brings the two close together (Lee et al., 27 Sep 2025). RLSR demonstrates that even instruction-following can benefit from action-level treatment of responses: for Qwen-7B trained on INFINITY, the paper compares AlpacaEval win rates for SFT, RLSR (SB), and SFT + RLSR (Wang et al., 16 Oct 2025).

The literature also exposes clear limitations. Several papers omit exact SFT losses or optimization details: BiliVLA does not provide a full multitask SFT loss decomposition, and Z-1 does not print the explicit supervised objective for its flow-based action expert (Lin et al., 22 Jun 2026); (Cao et al., 30 Jun 2026). Some action-aware priors are narrow: BiliVLA’s safety prior maps wall-contact frames to a whole-image box and a single backward action; AttenA+ assumes that slow motion indicates criticality, which the paper itself notes may fail for dynamic tasks; FAN-SFT uses a locally unimodal Gaussian prior that may be restrictive for multimodal feasible action sets (Peng et al., 13 May 2026); (Niu et al., 2 Apr 2026). Other methods depend heavily on external teachers or architectural compatibility: VLA-OPD requires a frozen expert teacher on arbitrary visited states, and DoTS requires checkpoints from a shared base model (Zhong et al., 27 Mar 2026); (Yuan et al., 1 May 2026).

Several misconceptions recur. First, Action-Aware SFT does not necessarily mean unifying language and action tokens in one decoder: Z-1 is action-aware even though it uses a separate VLM and Action Expert (Cao et al., 30 Jun 2026). Second, it does not require online interaction: GUI-Libra and Guard Vector are offline SFT methods, but their supervision is action-aware because it targets executable actions or streaming decisions (Yang et al., 25 Feb 2026); (Lee et al., 27 Sep 2025). Third, it is not synonymous with RL or imitation learning; many methods remain purely supervised but modify the loss, labels, or representation to reflect action structure. Finally, not every paper with “SFT” and “action” in the title belongs to this area, because the acronym itself is overloaded, most notably by symbolic dynamics work where SFT means subshift of finite type (Grigorchuk et al., 2024).

In that sense, Action-Aware SFT is best understood as a technical orientation rather than a closed method family. Its central claim is consistent across domains: supervised post-training becomes more effective when the supervision is aligned with the operational semantics of action—what must be grounded, what must be weighted, what states arise from the policy’s own behavior, and what parts of the model must carry the resulting control logic.