See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

Published 1 Jun 2026 in cs.RO, cs.AI, and cs.LG | (2606.02735v1)

Abstract: Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.

Summary

  • The paper introduces S2, which refines high-level goals into explicit subtask instructions and allocated visual budgets, leading to robust generalization.
  • The methodology employs hierarchical language relabeling and control-grounded evidence masking to discard distracting context and improve task-specific focus.
  • Empirical evaluations on LIBERO-PRO and real-robot tasks show that S2 substantially outperforms previous VLA methods in resolving local ambiguities.

See Less, Specify More: Visual Evidence Budgets for Generalizable Vision-Language-Action Models

Introduction

The paper "See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs" (2606.02735) addresses generalization challenges in vision-language-action (VLA) models for robotic manipulation. Standard VLA architectures often entangle coarse natural language planning with low-level visuomotor control, overburdening the policy with resolving both high-level intent ambiguities and visually-guided local execution from broad, potentially misleading perceptual context. The paper introduces S2 (See Less, Specify More), a framework that improves generalization by sharpening the executor’s conditioning interface via explicit hierarchical language refinement and a learned, annotation-free visual evidence budget (VEB). The key insight is that executor generalization critically depends on receiving local, task-preserving guidance—both in terms of language and in what visual evidence is retained for action.

S2 Framework

Hierarchical Language Relabeling ("Specify More")

S2 decomposes supervision by relabeling demonstration trajectories into goal-preserving hierarchical instructions: the original coarse goal is maintained, while each local trajectory phase is annotated with refined, concrete subtask instructions. Unlike prior works that either paraphrase or replace the original instruction, this separation enables the executor to maintain task identity while executing unambiguous local behaviors.
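As a rough illustration of goal-preserving relabeling, the sketch below stores the original goal once and attaches refined instructions to frame spans. The class names, the frame indices, and the "Goal: ... | Current step: ..." string format are hypothetical and not taken from the paper; the instruction strings reuse the mug example given later on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubtaskSpan:
    start: int        # first frame index of the phase (inclusive)
    end: int          # last frame index of the phase (inclusive)
    instruction: str  # refined, local instruction for this phase


@dataclass
class RelabeledTrajectory:
    goal: str                  # original coarse instruction, kept unchanged
    subtasks: List[SubtaskSpan]

    def conditioning_at(self, frame: int) -> str:
        """Return the hybrid language input for one frame: goal + local step."""
        for span in self.subtasks:
            if span.start <= frame <= span.end:
                # Assumed text format; the paper's actual prompt layout may differ.
                return f"Goal: {self.goal} | Current step: {span.instruction}"
        raise ValueError(f"frame {frame} is not covered by any subtask span")


# Hypothetical spans for illustration only.
traj = RelabeledTrajectory(
    goal="place the mug in the rack",
    subtasks=[
        SubtaskSpan(0, 49, "grasp the mug by the handle from the right"),
        SubtaskSpan(50, 119, "place it on the top left slot"),
    ],
)
print(traj.conditioning_at(60))
```

The point of the structure is that the goal string never changes across frames, while the local instruction switches at each span boundary.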

Control-Grounded Visual Evidence Budgeting ("See Less")

Standard self-attention mechanisms passively select features but often remain spatially diffuse and unconstrained, leading to nuisance-variable dependency and impaired generalization. S2 instead learns soft visual evidence masks for each camera view (base, wrist) conditioned on the current language context, employing lightweight gate heads without reliance on region, mask, or box annotations. A regularizer constrains the mean mask value to a configurable soft visual evidence budget, encouraging the policy to discard irrelevant scene content and focus only on task-sufficient tokens for control.

Figure 1: Control-grounded visual evidence budgeting—visual tokens are gated by predicted masks, learned from task loss and a visual budget constraint, entirely without annotations or external segmentations.

This learned bottleneck is not an architectural efficiency mechanism, but a robustness intervention: it forces the policy to discard spurious or distracting evidence and depend on behavior-relevant context, e.g., specific contact surfaces or target-region cues.
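A minimal sketch of the gating idea, under stated assumptions: per-token gate logits are taken as given (in S2 they would come from a language-conditioned gate head, which is not modeled here), and the budget term uses a squared hinge on the mean mask value. The summary only says that the mean mask is constrained to a budget, so the exact penalty form and all function names below are assumptions.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def soft_mask(gate_logits: List[float]) -> List[float]:
    """Map per-token gate logits to soft mask values in (0, 1)."""
    return [sigmoid(z) for z in gate_logits]


def gate_tokens(tokens: List[List[float]], mask: List[float]) -> List[List[float]]:
    """Scale each visual token (a feature vector) by its mask value."""
    return [[m * v for v in tok] for tok, m in zip(tokens, mask)]


def budget_penalty(mask: List[float], budget: float) -> float:
    """Assumed penalty: squared excess of the mean mask value over the budget."""
    mean_mask = sum(mask) / len(mask)
    return max(0.0, mean_mask - budget) ** 2


# Five toy tokens; only the first has a high gate logit.
logits = [4.0, -4.0, -4.0, -4.0, -4.0]
mask = soft_mask(logits)
tokens = [[1.0, 2.0]] * 5
gated = gate_tokens(tokens, mask)
penalty = budget_penalty(mask, budget=0.2)
```

Because the task loss and the penalty are both functions of the mask, gradients from the control objective decide which tokens keep their weight, which is how the mask can be learned without region annotations.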

Empirical Evaluation

LIBERO-PRO Suite: Generalization Under Diverse Perturbations

Benchmarked across LIBERO-PRO, S2 consistently outperforms prior VLA methods (OpenVLA-OFT, X-VLA, VLA-Adapter, π₀.₅), especially on axes that stress local ambiguity resolution (e.g., goal/position swapping, object distractors). Notably, S2 can use different planners (Kimi K2.5, GPT-5.4 nano) with the same low-level executor because its conditioning interface is unchanged, which indicates that the executor is not tied to a particular planner.

Figure 2: (Left) Ablations showing that S2’s success derives from full hybrid conditioning (preserved goal + local guidance + VEB); (Right) Task success as a function of visual evidence budget, peaking at b=0.2.

Ablations support two claims: (1) local language alone does not suffice to retain task identity, and (2) goal preservation without VEB yields weaker gains. Only the combination (the full S2 interface) yields robust generalization, which argues against the assumption that dense instructions or monolithic attention alone are optimal.

Real-Robot Experiments (TX-G2, HSR)

The S2 executor is evaluated on a bimanual TX-G2 (AgiBot G2 variant) and mobile-manipulator HSR across 8 tasks (cutlery, bowl, clothes, dish, coffee, bottles, box, mug). All models are comparably trained and evaluated under identical real-robot and closed-loop conditions.

Across the eight real-robot tasks, S2 raises mean subtask success from 54.2% (π₀.₅) to 79.0%, with the HSR gains most pronounced in scenarios that combine manipulation and locomotion. Competing approaches (e.g., VLA-Adapter) largely fail, which underlines the importance of executor interface design.

Figure 3: Successful S2 rollout in a cluttered TX-G2 scene, maintaining ordered manipulation despite distractors and online perturbations.
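To put the reported subtask-success numbers (54.2% for π₀.₅ versus 79.0% for S2) in perspective, the improvement can be expressed as both an absolute and a relative gain. This is simple arithmetic on the two reported figures, not an additional result from the paper.

```latex
\begin{align*}
\Delta_{\text{abs}} &= 79.0 - 54.2 = 24.8 \text{ percentage points},\\
\Delta_{\text{rel}} &= \frac{79.0 - 54.2}{54.2} \approx 0.458 \quad (\text{about } 45.8\%).
\end{align*}
```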

S2 demonstrates robust target identification and completion-order preservation in out-of-distribution settings with heavy clutter and scene interventions, a direct consequence of its selective evidence grounding.

Figure 4: The S2 visual evidence mask concentrates on manipulated objects, contact regions, and task-specific context, in contrast to the diffuse native attention map.

Qualitative mask comparisons show that VEB masks align tightly with behavioral requirements, whereas conventional self-attention allocates weight broadly and often to irrelevant regions.

Figure 5: TX-G2 evaluation trajectories: S2 enables accurate semantic grounding and precise placement, highlighting arm selection under bimanual constraints.

Figure 6: HSR evaluation trajectories: S2 robustly coordinates locomotion with manipulation, leveraging local context in the learned visual mask.

Implementation and Optimization

S2 is modular and largely backbone-agnostic. In this work, it is implemented atop the π₀.₅ policy, where base and wrist image tokens are gated at the input to the action-prediction module. The visual budget regularization is annealed, gate collapse is avoided by a nonzero mask floor and a coupled ungated path, and both masked and unmasked loss terms are used to drive task-relevant sparsification.

The default visual evidence budget (ρ) is set to 0.2 for both views, and results are empirically robust to moderate budget sweeps. Subtask-phase alignment is handled by hard frame decomposition; the main failure mode is planner-side misalignment of subtask boundaries rather than executor ambiguity.
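The optimization details above (mask floor, annealed budget regularization, masked plus unmasked loss terms) could be combined roughly as follows. This is a sketch under assumptions: the floor value, the linear ramp-up schedule, the loss coefficients, and all function names are hypothetical, and the summary does not state the direction or shape of the annealing.

```python
from typing import List


def floored_mask(mask: List[float], floor: float = 0.05) -> List[float]:
    """Rescale mask values into [floor, 1] so no token is fully zeroed."""
    return [floor + (1.0 - floor) * m for m in mask]


def annealed_weight(step: int, warmup_steps: int, max_weight: float) -> float:
    """Assumed schedule: linearly ramp the budget weight from 0 to max_weight."""
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)


def total_loss(masked_action_loss: float, unmasked_action_loss: float,
               budget_term: float, step: int, warmup_steps: int = 1000,
               max_weight: float = 1.0, unmasked_coef: float = 0.5) -> float:
    """Combine gated-path loss, ungated-path loss, and the annealed budget term."""
    lam = annealed_weight(step, warmup_steps, max_weight)
    return masked_action_loss + unmasked_coef * unmasked_action_loss + lam * budget_term


# Early in training the budget term has no weight; later it is fully active.
early = total_loss(1.0, 2.0, 0.5, step=0)
late = total_loss(1.0, 2.0, 0.5, step=2000)
```

Keeping an ungated loss term means the backbone still receives a training signal from the full image, which is one plausible way to avoid the gates collapsing to zero early on.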

Practical and Theoretical Implications

The results support several design lessons for generalizable VLAs:

  • Executor Conditioning as the Locus of Generalization: Rather than focus solely on end-to-end architectural scaling or data expansion, S2 demonstrates that the executor’s interface—both in local language specificity and selective perceptual input—is the bottleneck for robust transfer and avoidance of supervision aliasing.
  • Native Attention Alone Is Insufficient: Relying on unconstrained attention can lead to spurious dependencies that fail under distribution shift. Explicit evidence budgeting, learned under a control objective, targets actionable information.
  • Annotation-Free Selective Perception: S2’s VEB does not require any manual or external visual annotations, segmentations, or region proposals, facilitating rapid adaptation to new environments and task specifications.
  • Planner-Executor Modularity: The compatibility with off-the-shelf VLM planners (by subtask-level language relabeling) supports future research toward dynamically composed robotic systems, decoupling high-level planning from low-level control.

Future Directions

Prospective research may explore:

  • Mixture-of-Experts and Multi-View Evidence Budgets: Dynamic, context-sensitive allocation of multiple visual evidence budgets could enable further compositional robustness.
  • Closed-Loop Subtask Boundary Inference: Integrating online subtask phase detection would remove the need for a strong planner and further mitigate alignment noise.
  • Transfer to Multimodal and Non-Visual Inputs: Extending the principles of executor conditioning to tactile or proprioceptive cues, aligning evidence budgeting across modalities.

Conclusion

S2 demonstrates that robust, generalizable vision-language-action policies can emerge from an executor trained on local, disambiguated language and a restricted, learned visual evidence budget. This reframes VLA system design toward cleaner modular interfaces, strengthening generalization without the need for dense, ambiguous context or annotation-intensive supervision. The empirical results on benchmark suites and real robots highlight both the practical utility of this principle and its potential for driving future advances in scalable, robust robot learning.

Explain it Like I'm 14

Overview

This paper is about teaching robots to follow instructions better in the real world. The authors focus on a kind of robot brain that uses vision (seeing), language (reading instructions), and action (moving) together—often called a VLA model. Their main idea, called S2 (See Less, Specify More), is simple:

  • Specify More: give the robot clearer, step-by-step guidance about what to do right now.
  • See Less: show the robot only the parts of the camera view that matter for the current step.

By doing both, the robot stops getting confused by distractions (like extra objects) and vague instructions. It learns to act based on the right clues and clearer directions.

What questions did the researchers ask?

The paper explores three easy-to-understand questions:

  • Do vague instructions (like “put the cup on the table”) make robots guess too much about the exact moves they should take?
  • If we add helpful details to the instructions, is it better to replace the original goal entirely, or keep the original goal and add focused step-by-step hints?
  • Can we train the robot to pay attention only to the small parts of the image that matter, even without telling it exactly which pixels are important?

How did they do it?

Think of the robot team as having two roles:

  • A planner (like a coach) that reads the high-level goal and gives local, step-by-step guidance.
  • An executor (like a driver) that actually moves the robot arm(s) and completes the task.

S2 improves the “conversation” between the coach and the driver:

  1. Specify More: clearer, local instructions
  • Instead of only using a broad goal (for example, “place the mug in the rack”), the system adds precise substeps that match what the demonstration shows (like “grasp the mug by the handle from the right” then “place it on the top left slot”).
  • Importantly, the original goal is kept. The robot sees both the main goal and the current detailed step. This keeps the task identity clear while removing ambiguity about “how” to execute the current phase.
  2. See Less: limit visual clutter with a “visual evidence budget”
  • The robot’s camera sees a lot, but only a small portion is relevant (like the mug handle, the rack slot, or the gripper’s contact area).
  • The authors set a “budget,” which is like saying: “You can only pay attention to a small fraction of the image—choose wisely.”
  • The robot learns for itself which parts of the image help it succeed (no human-drawn boxes or masks). It figures this out during training by noticing which visual bits actually lead to good actions.

Technical idea in everyday words:

  • Picture the robot’s camera view as a poster made of many tiles.
  • The robot gets a limit on how many tiles it can focus on (the “budget”).
  • While learning, it tries different choices. If it still completes the step successfully, those chosen tiles were probably the right evidence. Over time, it learns to highlight the tiles that matter—like the target object, the contact point, or the destination spot—and to ignore the rest.

Training and deployment:

  • During training, the robot learns under this cleaner interface: high-level goal + current subtask + limited, focused vision.
  • At test time, an off-the-shelf LLM (the “coach”) generates the local step-by-step hints, and the robot (the “driver”) executes them with its learned focus.

What did they find, and why does it matter?

Main results, explained simply:

  • Clearer steps beat vague goals: Robots trained with both the original goal and precise, local instructions performed better than those given only vague goals or only replacement instructions. Keeping the main goal while adding detailed steps avoids confusion and preserves the true task.
  • Seeing less can be better: Limiting what the robot attends to made it more robust to clutter, distractions, and changes in appearance. It stopped relying on unhelpful background details that don’t generalize.
  • Strong gains across tasks: On eight real-robot tasks (using TX-G2 and HSR robots), the average success on subtasks jumped from about 54% to 79%. That’s a big improvement for practical, real-world manipulation.
  • Works with common planners: The approach doesn’t lock you into a specific “coach” model. Different off-the-shelf language planners still worked well, suggesting the method is flexible.

Why it matters:

  • Real homes and workplaces are messy and unpredictable. Robots that only succeed in clean lab setups aren’t very useful.
  • Teaching robots to follow clear steps and focus on the right visual clues makes them more reliable in the wild.

What’s the bigger impact?

  • Better generalization: Robots become less sensitive to small changes—like different object colors, extra items on the table, or slight camera shifts.
  • Safer and simpler training: The robot learns what to focus on without extra labels or human-drawn regions, reducing the need for time-consuming annotations.
  • Modular design: You can swap in different planners (LLMs acting as coaches) without retraining the executor from scratch, making systems easier to upgrade.
  • Path to real-world helpers: As robots get better at ignoring distractions and following precise guidance, they become more trustworthy for everyday chores—like sorting laundry, placing dishes, or handling groceries.

Key terms in plain language

  • VLA model: A robot system that connects what it sees (vision), what it’s told (language), and what it does (action).
  • Planner-executor split: The “coach” decides the steps; the “driver” executes them smoothly.
  • Specify More: Add precise, right-now instructions to make the current step unambiguous, while keeping the original goal.
  • See Less (visual evidence budget): Give the robot a limit on how much of the image it can rely on, so it learns to focus on the truly important bits.

In short, the paper shows that robots do better when they receive clearer step-by-step guidance and are trained to pay attention only to the important parts of what they see. This combination makes them more reliable and adaptable in real-world settings.

Knowledge Gaps

The paper leaves the following concrete gaps, limitations, and open questions for future work:

  • Planner error handling and detection: the executor assumes subtask guidance is correct; no mechanism exists to detect, reject, or recover from incorrect or stale planner outputs. Evaluate sensitivity to planner mistakes and develop verification/replanning loops.
  • Subtask boundary noise: training uses hard, frame-aligned subtask spans; boundary errors become supervision noise. Explore soft/latent alignment (e.g., HMMs, CTC-style losses) or boundary-uncertainty modeling.
  • Fixed, global evidence budgets: budgets ρb, ρw are constant across tasks, subtasks, and time. Investigate adaptive or learned budgets conditioned on task, subtask, scene complexity, or model uncertainty.
  • Mean-based budget regularizer: penalizing only the mean gate value can be satisfied by diffuse low gates. Compare to top-K constraints, L1 sparsity, hard-concrete gates, differentiable token selection, or KL-to-target-sparsity objectives.
  • Temporal evidence selection: masks are per-frame; temporal sufficiency is not modeled. Study spatiotemporal evidence budgeting over video tokens and memory-aware gates.
  • Cross-modal budgeting: only visual tokens are gated; proprioception and other sensors are untouched. Test cross-modal evidence budgeting and its impact on robustness.
  • Interpretability and causal validation: mask quality is shown qualitatively; no causal tests. Perform counterfactual patch swaps, targeted occlusions, and do-interventions to quantify causal alignment of selected evidence.
  • Efficiency and compute: gating does not reduce compute (tokens are scaled, not pruned). Measure latency/throughput, and explore compute-aware token pruning that preserves performance.
  • Planner diversity and robustness: only two planners (Kimi K2.5, GPT-5.4 nano) are evaluated. Systematically test a broader set of planners and styles, and quantify sensitivity to verbosity, formatting, and error rates.
  • Reliability of VLM relabeling: the QC process, error rates, and inter-rater agreement for trajectory/subtask relabeling are unspecified. Quantify relabeling noise and release audit tools or gold subsets.
  • Phase-vocabulary rigidity: a fixed approach/engage/execute/disengage/transit schema may not fit all tasks. Explore learned subtask grammars and adaptive granularity without hand-fixed phases.
  • Safety considerations: evidence budgeting could suppress safety-critical cues (e.g., human hands). Add “always-keep” channels or safety priors, and evaluate in human-in-the-loop scenarios.
  • Failure mode taxonomy: beyond aggregate metrics, the paper lacks detailed failure categorization. Provide per-perturbation error analyses to target the next set of interface or gating changes.
  • Cross-embodiment and sensor setups: results are on TX-G2 and HSR with base/wrist RGB. Test broader embodiments, camera placements, depth/ToF sensors, and sim-to-real transfer.
  • Sample efficiency: it is unclear if S2 reduces demonstration requirements. Run learning-curve studies to measure data efficiency gains from the interface.
  • RL fine-tuning: only supervised flow-matching is used. Investigate RL/offline RL to optimize gates and actions jointly under success/reward signals.
  • Robustness benchmarks: clutter and perturbations are demonstrated but not standardized. Build systematic robustness suites (lighting, distractors, moving obstacles, camera shifts) with quantitative mask/effectiveness metrics.
  • Planner–executor co-adaptation: the interface is fixed; no co-training is attempted. Study iterative alignment (schema constraints, self-check prompts, feedback tokens) to reduce mismatches.
  • Uncertainty and confidence use: gate scores and instruction-following uncertainty are not calibrated or used. Calibrate them and trigger replans, higher budgets, or sensor shifts when confidence is low.
  • Quantitative mask evaluation: no metricized mask quality is provided. Annotate a small subset with relevance labels or use weak labels to compute precision/recall/IoU vs. ground-truth-relevant regions.
  • Selective perception baselines: comparisons to token-selection/pruning methods (TokenLearner, DynamicViT, ToMe) or object-centric baselines are missing. Add apples-to-apples ablations.
  • End-to-end performance: main results emphasize subtask success; full E2E success, time-to-completion, and compounding-error analyses are deferred. Report and analyze them centrally.
  • Compute/memory footprint: additional gate heads and dual-path losses may add overhead. Provide detailed training/inference cost profiling and scaling behavior.
  • Theoretical underpinnings: claims about reduced spurious correlations lack formal analysis. Develop information-theoretic or generalization bounds for evidence budgeting under distribution shift.
  • Gate placement/design: only single-layer, token-wise gates are studied. Evaluate gating at different layers, multi-layer gating, channel-wise gates, and per-attention-head gating.
  • Budget scheduling: the temperature and annealing schedules are heuristic. Study schedule sensitivity, curriculum strategies, and constrained optimization enforcing exact budgets.
  • Noise in refined instructions: robustness to mis-specified or contradictory s_i is untested. Train with synthetic noise and evaluate recovery strategies (e.g., instruction smoothing or majority voting).
  • Bimanual decisions: arm selection improvements are observed but not formalized. Introduce explicit arm-selection tokens or planner-side commitments and measure their effect.
  • Conflict resolution in language: behavior when g and s_i disagree is unspecified. Define conflict-resolution policies (e.g., prioritize g, request replan) and evaluate outcomes.
  • Long-horizon scaling: cadence for refreshing s_i, memory limits, and drift over many subtasks are not studied. Analyze instruction update frequency, memory mechanisms, and cumulative error.
  • Reproducibility and release: clarity on releasing code, relabeled language, masks, and prompts is missing. Provide artifacts and scripts to reproduce relabeling, training, and evaluation.
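The gap listed above about the mean-based budget regularizer can be made concrete with a toy comparison. In the sketch below, a diffuse mask and a sparse mask have the same mean gate value, so a mean-only penalty cannot tell them apart, while a hard top-K rule (one of the alternatives named in that bullet) fixes the number of retained tokens directly. The function names and toy values are illustrative assumptions.

```python
from typing import List


def mean_gate(mask: List[float]) -> float:
    """Mean gate value, the only quantity a mean-based regularizer sees."""
    return sum(mask) / len(mask)


def top_k_mask(scores: List[float], k: int) -> List[float]:
    """Hard top-K selection: keep exactly the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


# Same mean (about 0.2), very different selectivity.
diffuse = [0.2] * 10                 # every token kept weakly
sparse = [1.0, 1.0] + [0.0] * 8      # two tokens kept fully

same_mean = abs(mean_gate(diffuse) - mean_gate(sparse)) < 1e-9
peak_diffuse, peak_sparse = max(diffuse), max(sparse)

selected = top_k_mask([0.1, 0.9, 0.3, 0.8], k=2)
```

Hard top-K is not differentiable as written, which is why that bullet also lists relaxed options such as hard-concrete gates.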

Practical Applications

Immediate Applications

Below are specific, deployable ways to use the paper’s “See Less, Specify More (S2)” framework today, grounded in the paper’s real-robot results (TX-G2, HSR) and benchmark gains. Each bullet links to sectors, outlines plausible tools/workflows, and notes key dependencies or assumptions.

  • Manufacturing and assembly robotics — Robust pick-and-place, kitting, and light assembly under clutter and appearance shifts by pairing a high-level VLM planner with an S2-trained executor that follows goal-preserving local guidance and learned visual evidence budgets (VEB).
    • Tools/Workflows: S2 fine-tuning wrapper for existing VLAs (e.g., pi0.5/OpenVLA), trajectory/subtask relabeling pipeline, planner prompt templates, budget-tuning and mask visualization.
    • Dependencies/Assumptions: Dual-view cameras (base+wrist) or comparable perception; a competent off-the-shelf VLM planner for local instruction generation; task-specific prompting; basic robot safety protocols.
  • Warehouse and logistics automation — Bin picking, sorting, and packing with improved generalization to new SKUs and shelf layouts by shrinking executor ambiguity (disambiguated subtasks) and suppressing irrelevant visual cues (evidence budgeting).
    • Tools/Workflows: “Planner shim” that converts WMS tasks into refined subtask prompts; quick in-context planner updates for new product lines; VEB dashboards for QA.
    • Dependencies/Assumptions: Clean subtask taxonomies; reliable subtask timing/alignment; planner accuracy for local guidance; periodic relabeling of demos on new SKUs.
  • Service and home-assist robots — More reliable household tasks (e.g., decluttering, dish/basket sorting, putting away items) under heavy distractors, leveraging S2’s improved subtask success and clutter robustness.
    • Tools/Workflows: Mobile manipulator with S2 executor; phone app to set high-level goals; in-context prompts to generate subtasks; on-device gate heads for VEB.
    • Dependencies/Assumptions: Safety in proximity to humans/pets; privacy safeguards; basic home scanning or initial keyframe capture for planner grounding.
  • Hospital operations and logistics — Fetch-deliver, restocking, and room prep where task identity is preserved (hygiene-critical) and local guidance removes ambiguity (route, hand choice, placement strategy) while VEB limits reliance on irrelevant scene factors.
    • Tools/Workflows: “Nurse-assist” mode mapping clinical orders to subtask prompts; S2 executor tuned to hospital carts/shelves; evidence-mask auditing for safety QA.
    • Dependencies/Assumptions: Institutional approvals; sterilization workflows; predictable subtask vocabularies; planner that avoids hallucinating unsupported clinical steps.
  • Bimanual manipulation — Tasks requiring arm selection or handoff (e.g., basket sorting, folding) benefit from S2’s goal-preserving local guidance and VEB to focus on contact-local cues and destination context.
    • Tools/Workflows: Bimanual S2 training on TX-G2-like platforms; per-view budgets for wrist/base cameras; relabeling that encodes which arm executes each subtask.
    • Dependencies/Assumptions: High-quality demonstrations with clear arm usage; tuned budget hyperparameters; synchronized end-effector sensing.
  • Field robotics (agriculture, outdoor maintenance) — Harvesting, pruning, or pick-and-place in visually variable environments by reducing executor’s dependence on nuisance appearance and emphasizing task-sufficient evidence.
    • Tools/Workflows: Seasonal re-prompting of planner; lightweight gate heads for edge inference; small-budget VEB to mitigate background variation.
    • Dependencies/Assumptions: Weather-resistant sensors; sufficient initial demos; periodic planner validation in novel scenes.
  • Retail restocking and store operations — Shelf-facing manipulation amid dense distractors; VEB helps ignore signage/branding while following subtask-level instructions (e.g., exact placement region).
    • Tools/Workflows: Subtask schema aligned with planogram; evidence mask visualizer for compliance audits; rapid planner prompt updates for promotions.
    • Dependencies/Assumptions: Consistent shelf geometry; reliable image capture of target zones; minimal planner errors.
  • Software and GUI automation (RPA) — Apply the “See Less” and “Specify More” ideas to screen-based agents: refine coarse goals into step-level actions and gate visual tokens to relevant UI regions to improve robustness across themes/skins.
    • Tools/Workflows: Screenshot token-gating module; stepwise prompts that preserve task identity; UI element-focused evidence budgets.
    • Dependencies/Assumptions: Adapt VEB to GUI tokens; sufficient labeled demonstrations; reliable OCR/UI detection; data privacy compliance.
  • Academic research and robotics education — Better datasets and benchmarks via automatic goal-preserving trajectory/subtask relabeling; ablation studies on generalization vs. visual budgeting; reproducible planner-executor modularity.
    • Tools/Workflows: Open-source relabeling scripts; shared prompt libraries; gate schedules and budget sweeps; evaluation on LIBERO/CALVIN/LIBERO-PRO.
    • Dependencies/Assumptions: Access to an off-the-shelf VLM; compute for fine-tuning; careful QC of relabeled language to avoid drift.
  • Policy debugging, QA, and interpretability — Use learned evidence masks to audit what the robot “looked at” when acting, improving troubleshooting and safety case documentation.
    • Tools/Workflows: “Evidence budget dashboard” showing mask overlays vs. native attention; failure replay filtered by subtask.
    • Dependencies/Assumptions: Logging infrastructure; human-in-the-loop diagnostics; governance processes for incident review.
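  • Edge deployment efficiency (secondary benefit) — VEB is intended for robustness, and as described the gates scale tokens rather than prune them, so compute is not reduced by default. Learned token suppression could still lower effective vision compute in pipelines that drop low-gate tokens, without bespoke token-pruning logic.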
    • Tools/Workflows: Token throughput monitors; budget-aware batching; fallback to ungated path if performance drops.
    • Dependencies/Assumptions: Budget carefully tuned per scene/view; performance guardrails; on-device profiling.
  • Cross-planner interoperability — Standardize on S2’s goal-preserving local language interface so different planners (e.g., Kimi K2.5 vs. GPT-5.4 nano) can be swapped without retraining the executor.
    • Tools/Workflows: “Planner adapters” that normalize subtask schema; regression tests across planners; prompt-versioning.
    • Dependencies/Assumptions: Stable subtask taxonomy; consistent in-context examples; interface governance to prevent drift.

Long-Term Applications

These applications require additional research, scaling, or standardization—e.g., broader datasets, stronger safety cases, or productization of S2 tooling and interfaces.

  • Cross-embodiment generalist executors — Train large, backbone-agnostic S2 executors across many robots and domains, improving out-of-distribution generalization via standardized interfaces.
    • Tools/Products: Multi-robot datasets with trajectory/subtask relabels; S2 APIs; cloud-scale training.
    • Dependencies/Assumptions: Significant data/compute; shared subtask vocabularies across embodiments; robust planner generalization.
  • Regulatory and standards frameworks — Define interface and interpretability standards for task-identity preservation and visual evidence budgets in safety-critical robotics.
    • Tools/Products: Compliance test suites; “VEB audit” certification; standardized mask logging formats.
    • Dependencies/Assumptions: Industry and regulator consensus; evidence that budgets correlate with risk reduction.
  • Autonomy with “no-code” task setup — Factory and warehouse lines configured via language + local guidance (planner) with S2 executors, minimizing reprogramming for new SKUs or fixtures.
    • Tools/Products: Task setup studios; schema-driven prompt packs; rapid relabeling wizards.
    • Dependencies/Assumptions: High planner reliability; robust subtask alignment under layout changes; fallback supervisors.
  • Surgical and interventional robotics microtasks — Use S2-like interfaces (tight task identity and conservative evidence budgets) for primitives such as passing tools, suture handling, or supply handoffs.
    • Tools/Products: Domain-trained VLM planners with surgical lexicons; validated VEB tuned for sterile fields; sim-to-real pipelines.
    • Dependencies/Assumptions: Extensive clinical validation; liability and safety frameworks; high-fidelity perception.
  • Assistive home-care robots for long-horizon chores — Reliable laundry/kitchen workflows where planners produce ordered subtask sequences and VEB sustains focus despite clutter and interruptions.
    • Tools/Products: Home-user demonstration capture with automatic relabeling; customization UIs; privacy-first on-device inference.
    • Dependencies/Assumptions: Household diversity in objects/layouts; robust failure recovery; social acceptability.
  • Multi-robot teams — Shared subtask language and budgeted perception to coordinate division of labor (e.g., mobile + manipulator) via a common planner.
    • Tools/Products: Inter-robot planning protocols; cross-agent subtask taxonomies; evidence-sharing policies.
    • Dependencies/Assumptions: Reliable communication; time synchronization; conflict resolution mechanisms.
  • AR-guided human-robot collaboration — Human operators provide disambiguating local guidance in AR that plugs into the S2 interface; VEB locks onto operator-indicated ROIs for safer collaboration.
    • Tools/Products: AR UIs that emit structured subtask instructions; ROI-to-gate integration; mixed-reality validation tools.
    • Dependencies/Assumptions: Accurate calibration; low-latency tracking; ergonomic UX.
  • Extreme environments (space, nuclear, subsea) — Robust manipulation under severe shifts using planners for detailed guidance and strict evidence budgets to avoid spurious correlations.
    • Tools/Products: Hardened sensors; long-range planner links; verified mask behavior under harsh conditions.
    • Dependencies/Assumptions: Limited bandwidth/latency constraints; small data regimes; extensive simulation.
  • Education and workforce development — Curricula and capstone projects built on modular planner-executor designs with relabeling and VEB as first-class concepts.
    • Tools/Products: Teaching kits; standardized datasets with subtask spans; cloud notebooks demonstrating budget sweeps.
    • Dependencies/Assumptions: Open tooling and licenses; institutional adoption.
  • Enterprise software agents for complex workflows — Extend S2 concepts to document- and screen-heavy tasks (finance back-office, IT ops), using local step prompts and “evidence budgets” over UI/text regions for reliability and auditability.
    • Tools/Products: Enterprise agent SDKs with gating over screenshots/PDF tokens; compliance logs of “what evidence was used.”
    • Dependencies/Assumptions: Robust UI detection; privacy/security controls; domain-tuned planners.
  • Tooling ecosystem and standards — “S2-compliant” planner-executor SDKs, mask visualizers, budget tuners, dataset relabelers, and benchmarking suites analogous to LIBERO-PRO for real-world sites.
    • Tools/Products: Open standards for subtask schemas and mask telemetry; plug-ins for major VLA backbones.
    • Dependencies/Assumptions: Community buy-in; maintenance and versioning.
  • Formal verification of behaviors — Use explicit subtask specs and bounded evidence to verify slices of behavior and derive safety envelopes.
    • Tools/Products: Subtask-level model-checking; counterexample-guided mask audits; certifiable execution traces.
    • Dependencies/Assumptions: Formal models capturing planner-executor dynamics; tractable abstractions for continuous control.

Cross-cutting assumptions that impact feasibility

  • Planner reliability is a bottleneck: the S2 executor follows the subtask it’s given and does not “undo” planner errors; planner quality and in-context prompting are critical.
  • Data and relabeling quality matter: trajectory/subtask relabeling must preserve task identity while disambiguating execution modes; boundary noise can degrade supervision.
  • Perception setup: Most results assume at least two camera views (base and wrist) plus proprioception; other setups may need adaptation.
  • Hyperparameter sensitivity: Visual evidence budgets (e.g., about 0.2 in the paper) and gate schedules require tuning per task and per camera view; mis-tuning can hurt performance.
  • Backbone agnosticism is claimed but demonstrated with pi0.5; porting to other VLAs should be possible but may require engineering.
  • Safety and compliance: Sectors like healthcare and manufacturing may require rigorous validation and audit trails (which S2’s masks can support, but processes must be built).
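
As a rough illustration of what a tunable "gate schedule" might look like, the sketch below anneals a gating temperature from a smooth starting value to a sharper final value. The function name `annealed_temperature`, the geometric interpolation, and the default values `t_start=2.0` and `t_end=0.5` are all assumptions made for illustration; the paper's actual schedule is not specified in this summary.

```python
def annealed_temperature(step, total_steps, t_start=2.0, t_end=0.5):
    """Hypothetical gate-temperature schedule (not the paper's exact one).

    Interpolates geometrically from t_start (smooth gates early in
    training) to t_end (sharper gates late in training).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if t_start <= 0 or t_end <= 0:
        raise ValueError("temperatures must be positive")
    # Clamp training progress to [0, 1] so out-of-range steps are safe.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start * (t_end / t_start) ** frac


if __name__ == "__main__":
    # Print the temperature at a few points in a 100-step run.
    for step in (0, 25, 50, 75, 100):
        print(step, round(annealed_temperature(step, 100), 4))
```

Sweeping `t_start`, `t_end`, and the budget target per task and per view is one simple way to probe the sensitivity noted above.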

Glossary

  • Action horizon: the number of future actions the policy predicts or considers in a chunk. "and $h$ is the action horizon."
  • Action multimodality: the presence of multiple valid ways to perform a task, leading to diverse action trajectories for the same instruction. "this burden mismatch appears as avoidable ambiguity and apparent action multimodality."
  • Annealed soft-gating schedule: a training schedule that gradually sharpens gating decisions by lowering the gating temperature over time. "an annealed soft-gating schedule that keeps the gate smoother early in training before sharpening later."
  • Appearance shift: changes in visual appearance (e.g., lighting, textures) between training and deployment that can harm performance. "under distractors, appearance shift, embodiment-specific noise, and tight inference constraints."
  • Base view: a camera viewpoint providing broader scene context, often mounted on the robot base. "the base and wrist views."
  • Bimanual: involving or requiring two arms or manipulators. "Because TX-G2 is bimanual, the policy must also infer which arm should execute the current behavior."
  • Budget regularizer: a loss term that penalizes deviations from a target amount of retained visual evidence. "then trains the gated representation with a task loss and a budget regularizer, without any region, box, or mask annotation."
  • Contact dynamics: physical interactions during contact (e.g., friction, compliance) that affect execution outcomes. "arising from embodiment, contact dynamics, or control noise."
  • Control objective: the task-driven training signal used to learn actions, guiding what visual evidence is sufficient. "the executor learns directly from the control objective which evidence is sufficient for successful execution under the current subtask."
  • Distribution shift: a mismatch between training and test data distributions that can degrade generalization. "benchmarks such as LIBERO and CALVIN still expose brittle generalization under distribution shift"
  • Distractors: task-irrelevant objects or features in the scene that can mislead perception or control. "under distractors, appearance shifts, and semantically similar tasks"
  • Elementwise product: a feature interaction computed by multiplying corresponding elements of two vectors. "the pooled language summary, and their elementwise product."
  • End-effector: the tool or gripper at the end of a robot arm that interacts with objects. "behavior-relevant object, end-effector, and destination context"
  • Evidence bottleneck: an architectural constraint limiting the amount of visual evidence available to the policy to encourage focus on task-relevant cues. "introduces an explicit, control-grounded evidence bottleneck"
  • Flow-matching objective: a training objective that learns a velocity field to transform noise into target actions over a synthetic time variable. "the policy predicts an action chunk using a flow-matching objective."
  • Gate collapse: a failure mode where gating shuts off most inputs or keeps all inputs uniformly, preventing meaningful selection. "Early gate collapse or trivial all-keep behavior is avoided"
  • Gate floor: a minimum gating value ensuring tokens retain a small residual contribution even when largely suppressed. "We then apply a nonzero gate floor via"
  • Gate head: a lightweight network that outputs token-wise keep/suppress values for visual evidence. "These masks are produced by lightweight learned gate heads inside the executor."
  • Gating network: the model component that computes gating logits or scores for tokens, often conditioned on language and vision. "where $h_v$ is a small learned gating network for that view."
  • In-context learning: a method where a model adapts behavior based on examples or instructions in the prompt without gradient updates. "remains compatible with off-the-shelf VLM planners through in-context learning."
  • Inference constraints: resource or latency limits during deployment that restrict model complexity or throughput. "tight inference constraints."
  • Latent planner-to-policy interfaces: internal or hidden representations that connect planning modules to policies without explicit symbolic commands. "reasoning-augmented architectures, latent planner-to-policy interfaces, or RL post-training"
  • Low-level continuous actions: fine-grained control signals (e.g., velocities, torques) output at high frequency for robot actuators. "output precise low-level continuous actions"
  • Native attention: the standard attention mechanism of the backbone network before any additional gating or budgeting is applied. "Unlike native attention, See Less imposes an explicit visual evidence budget"
  • Nuisance dependence: reliance on task-irrelevant visual correlations that can hurt robustness under shift. "this bottleneck is introduced not for efficiency but to reduce nuisance dependence and improve generalization"
  • Object-centric methods: approaches that represent or reason about scenes in terms of discrete objects to improve control and robustness. "attention-based and object-centric methods improve robustness"
  • Off-the-shelf VLM: a pretrained vision-language model used without task-specific fine-tuning for planning or guidance. "remains compatible with off-the-shelf VLM planners"
  • Out-of-distribution (OOD): data points that differ significantly from the training distribution (e.g., novel placements). "8 near-ID and 2 OOD placements"
  • Planner-executor: a modular architecture splitting high-level decision-making (planner) from low-level control (executor). "S2 defines the executor-conditioning interface in a modular planner-executor VLA system."
  • Proprioceptive state: internal robot measurements such as joint positions or velocities. "$x_t$ denotes proprioceptive state."
  • Region proposals: candidate spatial regions (e.g., boxes) hypothesized to contain relevant content, often used in detection pipelines. "without external region proposals, mask or box annotations"
  • Robustness: the ability of a model to maintain performance under perturbations or shifts. "provides complementary robustness gains"
  • Saliency map: a visualization or estimate of input regions deemed important by a model. "rather than a generic saliency map."
  • Soft keep value: a continuous gate (between 0 and 1) indicating how much of a token’s information to retain. "predicts a soft keep value"
  • State-specific local guidance: context-dependent instructions tailored to the current scene state for disambiguating execution. "rewrites the same high-level instruction into state-specific local guidance"
  • Subtask: a phase or segment of a larger task with its own localized instruction and time span. "Each subtask is intended to be directly executable by the low-level policy"
  • Supervision aliasing: ambiguity where the same supervision signal corresponds to multiple valid behaviors, confusing learning. "coarse instructions induce avoidable supervision aliasing"
  • Temperature-scaled sigmoid gating: applying a sigmoid with a temperature parameter to control gate sharpness. "followed by temperature-scaled sigmoid gating"
  • Token pruning: removing a subset of tokens to reduce redundancy or computation while preserving performance. "token pruning to reduce redundancy or improve efficiency"
  • Token selection: choosing a subset of tokens deemed most informative for the task. "token selection and token pruning"
  • Trajectory-level instruction: an instruction that describes how a specific demonstration solves the task. "a refined trajectory-level instruction"
  • Visual evidence budget: an explicit limit on how much visual information the policy can retain, encouraging focus on essentials. "See Less imposes an explicit visual evidence budget"
  • Visual evidence budgeting: the process of learning and enforcing constraints on retained visual information during control. "explicit visual evidence budgeting provides complementary robustness gains"
  • Visual evidence masks: token- or patch-wise weights that modulate how much visual information passes to the policy. "introduce task-conditioned visual evidence masks $m_t^{b} \in [0,1]^{P_b}$ and $m_t^{w} \in [0,1]^{P_w}$"
  • Visual tokens: discrete embeddings representing image patches or regions within a transformer pipeline. "over image patches or visual tokens in the base and wrist views"
  • Vision-language-action (VLA): models that map visual observations and language instructions to actions. "Vision-language-action (VLA) models have recently shown strong promise for robot manipulation"
  • Vision-language model (VLM): models that jointly process images and text, often used for high-level planning or instruction refinement. "Modern vision-language models (VLMs) are stronger at open-ended instruction interpretation"
  • Visuomotor learning: learning policies that map visual inputs to motor outputs for control. "exploiting locality in visuomotor learning"
  • Velocity field: a learned vector field over a time variable that guides noisy action states toward targets in flow matching. "the backbone predicts a velocity field"
  • Wrist view: a camera viewpoint near the end-effector capturing contact-local details. "the wrist view often captures contact-local detail"
  • Embodiment-specific noise: variability arising from a robot’s hardware properties and actuation that affects control reliability. "embodiment-specific noise"
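
Several glossary entries (soft keep value, temperature-scaled sigmoid gating, gate floor, budget regularizer) describe pieces of one gating pipeline. The sketch below shows how they could fit together, using only the Python standard library. The specific formulas are assumptions, not the paper's equations: the floor is applied as `floor + (1 - floor) * sigmoid(logit / temperature)`, and the budget regularizer is taken to be the squared gap between the mean keep value and a target budget. The function names and default values are hypothetical.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_gates(logits, temperature=1.0, floor=0.05):
    """Soft keep values in [floor, 1] from per-token gate logits.

    Assumed form: floor + (1 - floor) * sigmoid(logit / temperature).
    A lower temperature pushes gates closer to floor or 1.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0.0 <= floor < 1.0:
        raise ValueError("floor must lie in [0, 1)")
    return [floor + (1.0 - floor) * _sigmoid(z / temperature) for z in logits]


def budget_penalty(gates, budget=0.2):
    """Assumed regularizer: squared gap between mean keep value and budget."""
    mean_keep = sum(gates) / len(gates)
    return (mean_keep - budget) ** 2


def apply_gates(tokens, gates):
    """Scale each visual token (a list of floats) by its keep value."""
    return [[g * x for x in tok] for tok, g in zip(tokens, gates)]


if __name__ == "__main__":
    logits = [3.0, -2.0, -4.0, 0.5]          # hypothetical gate-head outputs
    tokens = [[1.0, 2.0]] * len(logits)       # hypothetical visual tokens
    gates = soft_gates(logits, temperature=0.5, floor=0.05)
    print("gates:", [round(g, 3) for g in gates])
    print("penalty:", round(budget_penalty(gates, budget=0.2), 4))
    print("gated tokens:", apply_gates(tokens, gates))
```

In a real executor the penalty would be added to the task loss with some weight, and the gates would be computed per camera view; both details are left out here because the summary does not specify them.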
