TVTheseus: Graph-Based TV Navigation
- TVTheseus is a foundation model for TV navigation that represents UI state transitions and focus localization as directed graphs.
- The model combines topology-priming supervised fine-tuning with topology-augmented reinforcement learning to achieve state-of-the-art benchmark performance.
- TVTheseus outperforms point-and-click LVLMs by effectively leveraging UI topology cues for robust, long-horizon navigation and improved focus accuracy.
TVTheseus is a foundation model for remote-control television (TV) navigation, developed within the TVWorld framework to address the specific challenges of focus-based, long-horizon device control via remote interfaces. Whereas prior research in large vision-language models (LVLMs) for device automation focused on point-and-click paradigms, TVTheseus is explicitly designed for topology-aware traversal of TV user interfaces (UIs) modeled as directed graphs. The model leverages a two-stage training regimen—supervised fine-tuning with topology-priming and offline reinforcement learning (RL) with topology-augmented rewards—to achieve state-of-the-art (SOTA) navigation accuracy on comprehensive offline benchmarks. TVTheseus demonstrates significant improvements over general LVLMs and point-and-click specialists, particularly for tasks requiring global UI topology reasoning and robust focus localization (Ma et al., 19 Jan 2026).
1. TVWorld Graph Abstraction and Benchmarking
TVTheseus is built on the TVWorld offline abstraction, which models TV navigation as a deterministic directed labeled graph G = (V, E):
- V: Finite set of all reachable UI states s, each labeled with a screenshot o(s), the valid actions from the fixed key set A, and optional metadata m(s).
- E: Set of labeled edges encoding key-press transitions (s, a, s'), realized by a deterministic transition function δ: V × A → V.
- Global connectivity is quantified via adjacency and random-walk matrices, underpinning notions such as the hitting-time distance d_H(s, s').
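The graph abstraction above can be sketched in a few lines of Python; the class and method names (`TVGraph`, `step`, etc.) are illustrative assumptions, not the TVWorld API:

```python
from collections import deque

class TVGraph:
    """Deterministic directed labeled graph G = (V, E) over UI states."""
    def __init__(self):
        # (state, key) -> next_state: the deterministic transition function
        self.delta = {}

    def add_edge(self, s, key, s_next):
        self.delta[(s, key)] = s_next

    def valid_keys(self, s):
        return [k for (u, k) in self.delta if u == s]

    def step(self, s, key):
        # Invalid key presses leave the state unchanged (an assumption here).
        return self.delta.get((s, key), s)

    def shortest_path_dist(self, start, goal):
        """Geodesic distance via BFS; None if goal is unreachable."""
        dist, q = {start: 0}, deque([start])
        while q:
            u = q.popleft()
            if u == goal:
                return dist[u]
            for k in self.valid_keys(u):
                v = self.step(u, k)
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return None
```

Geodesic distances from such a graph are what the topology-priming traces and reward shaping below consume.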
Two main benchmarks are derived:
| Benchmark | Task Type | Evaluation Metric | Notable Statistics |
|---|---|---|---|
| TVWorld-N | Topology-aware navigation; 500 tasks (text, vision) across 5 graphs | Success Rate (SR) | Existing LVLMs: SR 40% |
| TVWorld-G | Focus-aware grounding; 187 UI states | [email protected] | LVLM baseline 65% |
These benchmarks expose LVLMs' limitations in long-horizon compositional navigation, motivating methods that encode UI topology, support trace-level rationales, and leverage graph-structured supervision (Ma et al., 19 Jan 2026).
2. Model Architecture and Input Structure
TVTheseus builds upon the Qwen3-VL visual-language encoder, supplemented with an action-prediction head designed for sequential decision-making:
- Observation: At each step, the model receives
  - A high-resolution screenshot (576 ViT tokens).
- Up to four frame histories (prior screenshots).
- Full action history.
  - A textual or visual goal instruction.
- Output: For each timestep, produces
- A natural-language rationale.
- A single key action (within a tagged <answer> field).
No UI DOM or view-tree features are provided. During SFT, DeepSpeed ZeRO-1 is used; RL utilizes vLLM for efficient batching and generation.
Empirically, ablations show peak performance at the described resolution and frame buffer; both fewer and excess visual tokens/history frames degrade results (SR drops from 68.3% to 61–64%) (Ma et al., 19 Jan 2026).
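A minimal sketch of how the per-step observation might be packed, assuming the 4-frame history cap suggested by the ablations; all field names are hypothetical, not the paper's schema:

```python
MAX_FRAMES = 4  # ablations suggest a four-frame history buffer is optimal

def build_observation(goal, screenshot, frame_history, action_history):
    """Pack the current screenshot, a bounded frame history, the full
    action history, and the goal instruction into one model input."""
    frames = list(frame_history)[-MAX_FRAMES:]  # keep at most 4 prior frames
    return {
        "goal": goal,                             # textual or visual instruction
        "screenshot": screenshot,                 # current high-res frame
        "history_frames": frames,                 # up to four prior screenshots
        "history_actions": list(action_history),  # full key-press history
    }
```

Note the asymmetry: the visual history is truncated (excess frames hurt SR), while the action history is kept in full.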
3. Topology-Aware Training Framework
The TVTheseus training protocol is split into two sequential stages:
- Topology-Priming Supervised Fine-Tuning (SFT):
  - Uses 500 start–goal pairs from TCL-TV graphs.
- For each, shortest-path traces are sampled; each trace-point generates three SFT variants:
    - Geodesic Guidance: Favours actions that reduce topological distance to the goal, supported by rationales.
- Detour Reflection: Simulates 'farther' actions and corrective returns; rationales justify corrections.
- Stagnation Escape: Inserts ineffective moves, with rationales identifying and escaping stagnation.
- Qwen3-VL-8B-Instruct is fine-tuned via cross-entropy loss on rationale-action pairs, yielding TVTheseus-Base.
- Topology-Augmented Reinforcement Learning (RL):
- Employs a fully offline RL environment, exposing the TVWorld transition function.
  - Uses Group-Relative Policy Optimization (GRPO): for each state, a group of candidate outputs is generated and scored.
- Rewards:
    - Topology-shaping rewards tailored per trace type; e.g. for geodesic steps: 1.0 if the action strictly reduces distance to the goal, 0.2 if the distance is unchanged, 0.0 otherwise.
- Format-validity bonus for well-formed outputs.
    - Aggregate: a weighted sum of the topology-shaping reward and the format-validity bonus.
- The RL objective standardizes group rewards and optimizes via a clipped surrogate loss with KL regularization.
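The geodesic reward shaping and GRPO's group-relative standardization described above can be sketched as follows; the aggregate weights are placeholders, since the paper's exact values are not reproduced here:

```python
W_TOPO, W_FMT = 1.0, 0.1  # assumed weighting, for illustration only

def geodesic_reward(d_before, d_after):
    """1.0 if the action strictly reduces goal distance,
    0.2 if the distance is unchanged, 0.0 otherwise."""
    if d_after < d_before:
        return 1.0
    return 0.2 if d_after == d_before else 0.0

def total_reward(d_before, d_after, well_formed):
    # Weighted sum of topology shaping and the format-validity bonus.
    return W_TOPO * geodesic_reward(d_before, d_after) + W_FMT * float(well_formed)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO: standardize rewards within the sampled candidate group."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mu) / (std + eps) for r in rewards]
```

The standardized advantages then feed a clipped surrogate loss with KL regularization, as in standard GRPO.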
Ablations indicate both rationale-driven SFT and topology-driven reward shaping (RL) are critical: removing rationales substantially reduces SR, and distance-only rewards cap performance below the full topology-shaped variant (Ma et al., 19 Jan 2026).
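The three SFT trace variants from stage one can be sketched as data-generation helpers; the rationale templates and function names are illustrative assumptions, not the paper's exact procedure:

```python
def geodesic_example(path, t):
    """Target: the action at step t of a shortest path, with a
    rationale noting that it reduces distance to the goal."""
    state, action = path[t]
    rationale = f"Pressing {action} moves one step closer to the goal."
    return {"state": state, "rationale": rationale, "answer": action}

def detour_example(path, t, wrong_action, undo_action):
    """Simulate a 'farther' action and the corrective return."""
    state, _ = path[t]
    rationale = (f"{wrong_action} increased distance to the goal; "
                 f"{undo_action} returns to the shortest path.")
    return {"state": state, "rationale": rationale, "answer": undo_action}

def stagnation_example(path, t, noop_action):
    """Insert an ineffective move; the rationale identifies stagnation."""
    state, action = path[t]
    rationale = (f"{noop_action} left the UI state unchanged; "
                 f"escaping stagnation by pressing {action}.")
    return {"state": state, "rationale": rationale, "answer": action}
```

Each helper emits a rationale-action pair of the kind used as cross-entropy targets during SFT.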
4. Quantitative Performance and Benchmark Results
TVTheseus establishes leading accuracy both in topology-aware navigation and focus-aware grounding:
- TVWorld-N (500 out-of-domain tasks):
  - 68.3% overall SR, surpassing Gemini 3 Flash (66.4%), GPT-5 mini (60.2%), Qwen3-VL-32B, and specialized PnC agents (≤15.4%).
  - RL is transformative: the SFT-only model scores markedly lower; post-RL, 68.3% SR is attained.
- TVWorld-G (187 grounding tasks):
  - [email protected] of 81.8%, outpacing Qwen3-VL-8B (78.1%) and PnC methods (at most 65.2%).
- Notably, TVTheseus achieves this without explicit bounding-box supervision, implying an emergent transfer of topology reasoning to focus localization.
| Model | TVWorld-N SR | TVWorld-G [email protected] |
|---|---|---|
| TVTheseus | 68.3% | 81.8% |
| Gemini 3 Flash | 66.4% | — |
| GPT-5 mini | 60.2% | — |
| Qwen3-VL-8B | — | 78.1% |
| PnC specialists | ≤15.4% | 39.5–65.2% |
— exact metric not reported for some models
Significant performance is attributed to explicit topology cues and multi-stage training optimizing for UI graph traversal rather than surface-level screenshot matching (Ma et al., 19 Jan 2026).
5. Analysis of Key Design Choices and Generalization
Empirical studies within (Ma et al., 19 Jan 2026) isolate principal determinants of TVTheseus effectiveness:
- Rationale-Driven Imitation: Trace-level rationales encoding UI topology, progress, detour correction, and stagnation significantly improve generalization compared to trajectory-only SFT.
- Distance Metrics: Hitting-time and shortest-path distances facilitate superior topology shaping compared to personalized PageRank (PPR), with hitting-time shaping achieving the highest SR.
- Reward Design: Topology-aware shaping, as opposed to undifferentiated distance rewards, amplifies navigation robustness.
- Visual Backbone & Context Window: High-resolution image encodings (576 ViT tokens) with a moderate history buffer (4 frames) provide optimal context for long-horizon decision making; excessive or insufficient context reduces SR.
- Omission of Explicit Metadata: Additional metadata such as view-trees is not essential to achieving SOTA navigation performance.
These ablations underscore that focus-based TV navigation requires explicit modeling of UI graph structure and agent strategies anchored in topological features, rather than reliance on local visual focus or pointer-based paradigms.
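To make the distance-metric comparison concrete, the sketch below contrasts BFS shortest-path distance with random-walk hitting time on a small toy graph; the graph itself and all names are invented for illustration:

```python
from collections import deque

# Adjacency list of a tiny, assumed UI graph (not from the paper).
edges = {
    "home": ["apps", "search"],
    "apps": ["home", "settings"],
    "search": ["home"],
    "settings": ["apps"],
}

def bfs_dist(src, goal):
    """Shortest-path (geodesic) distance from src to goal."""
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in edges[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist.get(goal)

def hitting_time(goal, iters=500):
    """Expected steps of a uniform random walk to first reach goal:
    h(goal) = 0; h(u) = 1 + mean of h over out-neighbors of u.
    Solved by fixed-point (Gauss-Seidel) iteration."""
    h = {u: 0.0 for u in edges}
    for _ in range(iters):
        for u in edges:
            if u != goal:
                h[u] = 1.0 + sum(h[v] for v in edges[u]) / len(edges[u])
    return h
```

On this toy graph, "home" is two key presses from "settings" but has a hitting time of 8 steps: hitting time penalizes states whose random walks easily wander away from the goal, which is one plausible reason it provides a richer shaping signal than raw geodesic distance.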
6. Implications for GUI Agent Design
Findings related to TVTheseus suggest several broader insights for device-control agent research:
- Graph Search as Core Paradigm: Remote-control navigation fundamentally poses a graph-search problem, necessitating architectures and training protocols that promote internalization of UI topology.
- Explicit Topology Cues: Incorporating shortest-path signals, detour reflections, and stagnation recovery—both in rationales and rewards—enhances robustness, sample efficiency, and generalization across unseen interface variations.
- Sequential SFT+RL Training: Supervised imitation with rationales primes agents for topology awareness, while offline RL with graph-structured rewards refines policies and enables flexible adaptation to out-of-domain graphs.
- Minimal Structural Supervision: Effective policies arise from screenshot histories alone, obviating the need for additional UI metadata in deployment.
- Broader Applicability: The TVWorld/TVTheseus approach is readily extensible to other remote-control device domains (e.g., set-top boxes, appliances) wherever interaction graphs can be statically constructed.
A plausible implication is that graph-aware pretraining and multi-stage training regimens may become standard in the development of future device control agents, particularly as interaction UIs become increasingly complex and heterogeneous (Ma et al., 19 Jan 2026).