
TVTheseus: Graph-Based TV Navigation

Updated 26 January 2026
  • TVTheseus is a foundation model for TV navigation that represents UI state transitions as directed graphs and uses them for focus localization.
  • The model combines topology-priming supervised fine-tuning with topology-augmented reinforcement learning to achieve state-of-the-art benchmark performance.
  • TVTheseus outperforms point-and-click LVLMs by effectively leveraging UI topology cues for robust, long-horizon navigation and improved focus accuracy.

TVTheseus is a foundation model for remote-control television (TV) navigation, developed within the TVWorld framework to address the specific challenges of focus-based, long-horizon device control via remote interfaces. Whereas prior research in large vision-language models (LVLMs) for device automation focused on point-and-click paradigms, TVTheseus is explicitly designed for topology-aware traversal of TV user interfaces (UIs) modeled as directed graphs. The model leverages a two-stage training regimen—supervised fine-tuning with topology-priming and offline reinforcement learning (RL) with topology-augmented rewards—to achieve state-of-the-art (SOTA) navigation accuracy on comprehensive offline benchmarks. TVTheseus demonstrates significant improvements over general LVLMs and point-and-click specialists, particularly for tasks requiring global UI topology reasoning and robust focus localization (Ma et al., 19 Jan 2026).

1. TVWorld Graph Abstraction and Benchmarking

TVTheseus is built on the TVWorld offline abstraction, which models TV navigation as a deterministic directed labeled graph G = (V, E, λ):

  • V: Finite set of all reachable UI states, each labeled with a screenshot S(u), valid actions A(u) from the fixed key set A = {UP, DOWN, LEFT, RIGHT, OK, EXIT, HOME, SETTING, FINISH}, and optional metadata m(u).
  • E ⊆ V × A × V: Set of labeled edges encoding key-press transitions (u, a, v), realized by the deterministic transition function T(u, a) = v.
  • Global connectivity is quantified via the adjacency matrix A and the random-walk matrix P = D^{-1}A, underpinning notions such as the hitting-time distance d(u, g).
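
The abstraction above can be sketched as a small data structure. This is a minimal illustration, not the TVWorld implementation; the `TVGraph` class and the state names are hypothetical:

```python
from dataclasses import dataclass, field

# Fixed remote-control key set from the TVWorld abstraction.
KEYS = {"UP", "DOWN", "LEFT", "RIGHT", "OK", "EXIT", "HOME", "SETTING", "FINISH"}

@dataclass
class TVGraph:
    """Deterministic directed labeled graph G = (V, E, lambda)."""
    # edges[u][a] = v encodes one labeled edge (u, a, v)
    edges: dict = field(default_factory=dict)

    def add_edge(self, u: str, a: str, v: str) -> None:
        assert a in KEYS, f"unknown key {a}"
        self.edges.setdefault(u, {})[a] = v

    def transition(self, u: str, a: str) -> str:
        """Deterministic transition function T(u, a) = v."""
        return self.edges[u][a]

    def valid_actions(self, u: str) -> set:
        """A(u): keys with an outgoing edge at state u."""
        return set(self.edges.get(u, {}))

# Toy graph with two transitions (state names invented for illustration).
g = TVGraph()
g.add_edge("home", "DOWN", "apps_row")
g.add_edge("apps_row", "OK", "app_detail")
```

Because T is deterministic, an offline benchmark can replay any key sequence exactly, which is what makes fully offline RL over this graph possible.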

Two main benchmarks are derived:

  • TVWorld-N — topology-aware navigation: 500 tasks (text and vision goals) across 5 graphs. Metric: Success Rate (SR) = #successes / #episodes × 100%. Existing LVLMs score SR < 40%.
  • TVWorld-G — focus-aware grounding: 187 UI states. Metric: Acc@0.5, i.e. IoU(b̂, b) ≥ 0.5. LVLM baseline ≈ 65%.

These benchmarks expose LVLMs' limitations in long-horizon compositional navigation, motivating methods that encode UI topology, support trace-level rationales, and leverage graph-structured supervision (Ma et al., 19 Jan 2026).

2. Model Architecture and Input Structure

TVTheseus builds upon the Qwen3-VL visual-language encoder, supplemented with an action-prediction head designed for sequential decision-making:

  • Observation: At each step, the model receives
    • A high-resolution screenshot S_t (1024 × 576, ~576 ViT tokens).
    • Up to four frame histories (prior screenshots).
    • Full action history.
    • A textual or visual goal instruction I.
  • Output: For each timestep, produces
    • A natural-language rationale.
    • A single key action (within a tagged <answer> field).
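
The per-step input/output contract can be sketched as follows. The helper names and the observation fields are illustrative assumptions, not the paper's actual interface; only the four-frame history cap and the tagged `<answer>` field come from the description above:

```python
import re
from collections import deque

MAX_HISTORY = 4  # frame-buffer size reported as optimal in the ablations

def build_observation(screenshot, frame_history: deque, action_history, instruction):
    """Assemble the per-step model input (field names are illustrative)."""
    return {
        "current_frame": screenshot,                        # 1024x576, ~576 ViT tokens
        "history_frames": list(frame_history)[-MAX_HISTORY:],  # up to four prior frames
        "action_history": list(action_history),             # full action history
        "instruction": instruction,                         # textual or visual goal I
    }

def parse_action(model_output: str) -> str:
    """Extract the single key action from the tagged <answer> field."""
    m = re.search(r"<answer>\s*(\w+)\s*</answer>", model_output)
    if m is None:
        raise ValueError("no well-formed <answer> tag")
    return m.group(1).upper()
```

Parsing failures map naturally onto the format-validity reward used in the RL stage: an output without a well-formed `<answer>` tag simply earns no format bonus.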

No UI DOM or view-tree features are provided. During SFT, DeepSpeed ZeRO-1 is used; RL utilizes vLLM for efficient batching and generation.

Empirically, ablations show peak performance at the described resolution and frame-buffer size; both fewer and more visual tokens/history frames degrade results (SR drops from 68.3% to 61–64%) (Ma et al., 19 Jan 2026).

3. Topology-Aware Training Framework

The TVTheseus training protocol is split into two sequential stages:

  1. Topology-Priming Supervised Fine-Tuning (SFT):
    • Uses 500 start–goal pairs from TCL-TV graphs.
    • For each, shortest-path traces p* are sampled; each trace point generates three SFT variants:
      • Geodesic Guidance: Favours actions reducing the topological distance d(u_t, g), supported by rationales.
      • Detour Reflection: Simulates 'farther' actions and corrective returns; rationales justify corrections.
      • Stagnation Escape: Inserts ineffective moves, with rationales identifying and escaping stagnation.
    • Qwen3-VL-8B-Instruct is fine-tuned via cross-entropy loss on rationale-action pairs, yielding TVTheseus-Base.
  2. Topology-Augmented Reinforcement Learning (RL):
    • Employs a fully offline RL environment, exposing the TVWorld transition function.
    • Uses Group-Relative Policy Optimization (GRPO): per state u_t, K candidate outputs are generated and scored.
    • Rewards:
      • Topology-shaping reward R_topo(u_t, a; g), tailored per trace type; e.g. for geodesic steps: 1.0 if d(u′, g) < d(u_t, g), 0.2 if equal, 0.0 otherwise.
      • Format-validity bonus R_form ∈ {0, 1} for well-formed outputs.
      • Aggregate: R = β_topo · R_topo + β_form · R_form, with β_topo = 0.95 and β_form = 0.05.
    • The RL objective standardizes group rewards and optimizes via a clipped surrogate loss with KL regularization.
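
The reward terms above can be written down directly. This is a sketch of the geodesic-step case under the stated coefficients; the function names are illustrative:

```python
def topo_reward(d_next: int, d_curr: int) -> float:
    """Topology-shaping reward for a geodesic step:
    1.0 if the action reduces distance to the goal, 0.2 if unchanged, else 0.0."""
    if d_next < d_curr:
        return 1.0
    if d_next == d_curr:
        return 0.2
    return 0.0

def total_reward(d_next: int, d_curr: int, well_formed: bool,
                 beta_topo: float = 0.95, beta_form: float = 0.05) -> float:
    """Aggregate reward R = beta_topo * R_topo + beta_form * R_form."""
    r_form = 1.0 if well_formed else 0.0
    return beta_topo * topo_reward(d_next, d_curr) + beta_form * r_form
```

With these weights, a distance-reducing, well-formed step earns the maximum reward of 1.0, while a well-formed but distance-increasing step earns only the 0.05 format bonus — the shaping dominates, but format validity is never worth sacrificing.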

Ablation indicates both rationale types (SFT) and topology-driven reward shaping (RL) are critical: removing rationales reduces SR from 68.3% to 36.8%; distance-only rewards cap at 64% (Ma et al., 19 Jan 2026).

4. Quantitative Performance and Benchmark Results

TVTheseus establishes leading accuracy both in topology-aware navigation and focus-aware grounding:

  • TVWorld-N (500 out-of-domain tasks):
    • 68.3% (+3.1/−2.7) overall SR, surpassing Gemini 3 Flash (66.4%), GPT-5 mini (60.2%), Qwen3-VL-32B (39.0%), and specialized point-and-click (PnC) agents (≤ 15.4%).
    • RL is transformative: SFT alone yields 20.0%; after RL, 68.3% is attained.
  • TVWorld-G (187 grounding tasks):
    • Acc@0.5 = 81.8%, outpacing Qwen3-VL-8B (78.1%) and PnC methods (max 65.2%).
    • Notably, TVTheseus achieves this without explicit bounding-box supervision, implying an emergent transfer of topology reasoning to focus localization.

Model            TVWorld-N SR   TVWorld-G Acc@0.5
TVTheseus        68.3%          81.8%
Gemini 3 Flash   66.4%          —
GPT-5 mini       60.2%          —
Qwen3-VL-8B      —              78.1%
PnC specialists  ≤ 15.4%        39.5–65.2%

("—" indicates the metric was not reported for that model.)

Significant performance is attributed to explicit topology cues and multi-stage training optimizing for UI graph traversal rather than surface-level screenshot matching (Ma et al., 19 Jan 2026).

5. Analysis of Key Design Choices and Generalization

Empirical studies within (Ma et al., 19 Jan 2026) isolate principal determinants of TVTheseus effectiveness:

  • Rationale-Driven Imitation: Trace-level rationales encoding UI topology, progress, detour correction, and stagnation significantly improve generalization compared to trajectory-only SFT.
  • Distance Metrics: Hitting-time and shortest-path distances facilitate superior topology shaping compared to personalized PageRank (PPR), with hitting-time distance achieving up to 67.2% SR (vs. 60.8% for PPR).
  • Reward Design: Topology-aware shaping, as opposed to undifferentiated distance rewards, amplifies navigation robustness.
  • Visual Backbone & Context Window: High-resolution image encodings (576 ViT tokens) with a moderate history buffer (4 frames) provide optimal context for long-horizon decision making; excessive or insufficient context reduces SR.
  • Omission of Explicit Metadata: Additional metadata such as view-trees is not essential to achieving SOTA navigation performance.
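
As a concrete illustration of the hitting-time distance used for reward shaping, the expected hitting time of a goal state under the random-walk matrix P = D^{-1}A solves a small linear system: h(g) = 0 and h(u) = 1 + Σ_v P_uv h(v) for u ≠ g. A minimal sketch on a toy graph (the function name and graph are invented for illustration):

```python
import numpy as np

def hitting_times(P: np.ndarray, goal: int) -> np.ndarray:
    """Expected hitting time h(u) of `goal` under random-walk matrix P = D^{-1}A.
    Solves (I - Q) h = 1 over the non-goal states, with h(goal) = 0."""
    n = P.shape[0]
    others = [i for i in range(n) if i != goal]
    Q = P[np.ix_(others, others)]                     # walk restricted to non-goal states
    h_others = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
    h = np.zeros(n)
    h[others] = h_others
    return h

# Toy 3-state chain 0 <-> 1 <-> 2, goal = state 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)                  # row-normalize: P = D^{-1} A
h = hitting_times(P, goal=2)                          # h = [4., 3., 0.]
```

Unlike shortest-path distance, hitting time penalizes states from which a random walk is likely to wander away from the goal, which may explain its edge over PPR-based shaping in the reported ablations.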

These ablations underscore that focus-based TV navigation requires explicit modeling of UI graph structure and agent strategies anchored in topological features, rather than reliance on local visual focus or pointer-based paradigms.

6. Implications for GUI Agent Design

Findings related to TVTheseus suggest several broader insights for device-control agent research:

  1. Graph Search as Core Paradigm: Remote-control navigation fundamentally poses a graph-search problem, necessitating architectures and training protocols that promote internalization of UI topology.
  2. Explicit Topology Cues: Incorporating shortest-path signals, detour reflections, and stagnation recovery—both in rationales and rewards—enhances robustness, sample efficiency, and generalization across unseen interface variations.
  3. Sequential SFT+RL Training: Supervised imitation with rationales primes agents for topology awareness, while offline RL with graph-structured rewards refines policies and enables flexible adaptation to out-of-domain graphs.
  4. Minimal Structural Supervision: Effective policies arise from screenshot histories alone, obviating the need for additional UI metadata in deployment.
  5. Broader Applicability: The TVWorld/TVTheseus approach is readily extensible to other remote-control device domains (e.g., set-top boxes, appliances) wherever interaction graphs can be statically constructed.
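
The graph-search view in point 1 reduces, in its simplest offline form, to breadth-first search over the UI graph for a shortest key-press sequence. A minimal sketch (the UI graph here is invented for illustration):

```python
from collections import deque

def shortest_key_sequence(edges: dict, start: str, goal: str):
    """BFS over the UI graph: return a shortest key sequence from start to goal,
    or None if the goal is unreachable. edges[u][key] = v as in the TVWorld graph."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        u, path = frontier.popleft()
        if u == goal:
            return path
        for key, v in edges.get(u, {}).items():
            if v not in visited:
                visited.add(v)
                frontier.append((v, path + [key]))
    return None

# Toy UI graph (states and transitions invented for illustration).
ui = {
    "home": {"DOWN": "apps", "SETTING": "settings"},
    "apps": {"RIGHT": "player", "UP": "home"},
    "settings": {"EXIT": "home"},
}
```

An agent that has internalized the graph topology is, in effect, approximating this search from pixels alone — which is why topology-aware supervision transfers across unseen interfaces.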

A plausible implication is that graph-aware pretraining and multi-stage training regimens may become standard in the development of future device control agents, particularly as interaction UIs become increasingly complex and heterogeneous (Ma et al., 19 Jan 2026).
