
ToG-Bench: Video Grounding Benchmark

Updated 10 December 2025
  • ToG-Bench is a benchmark for task-oriented spatio-temporal video grounding that emphasizes functional reasoning beyond simple object detection.
  • It comprises 100 egocentric videos with 2,704 task instructions and 4,194 object annotations across 177 functional categories, balancing explicit and implicit references.
  • It employs a semi-automated annotation pipeline and hierarchical evaluation metrics to assess recognition accuracy, spatial-temporal IoU, and multi-object coordination.

ToG-Bench is a benchmark for evaluating task-oriented spatio-temporal video grounding (T-STVG) in egocentric video, designed to test the capability of models to identify and temporally localize goal-relevant objects in the context of real-world human tasks. Addressing shortcomings of previous datasets that emphasize object-centric or purely descriptive localization, ToG-Bench focuses on functional, task-driven grounding involving both explicit and implicit references, as well as one-to-many mappings between natural language instructions and visual objects (Xu et al., 3 Dec 2025).

1. Motivation and Conceptual Foundation

The central premise of ToG-Bench is that embodied agents require not only perceptual recognition but also the ability to infer functional relevance—distinguishing “what” is needed to complete a task, not merely “what is present.” Previous STVG benchmarks focus on identifying described or salient objects; however, intelligent agency in real environments relies on understanding the intent behind actions and localizing the ensemble of objects required to accomplish a task. ToG-Bench introduces functional reasoning challenges such as grounding objects referenced only by the task’s implied goal and coping with the combinatorial complexity of multi-object tasks.

2. Dataset Construction and Statistics

ToG-Bench leverages the ScanNet dataset as its video source, extracting 100 egocentric video clips averaging 87.88 seconds each (total ≈8,788 seconds). From these, 2,704 task-oriented instructions were generated, yielding 4,194 object instance annotations across 177 functional categories. Tasks are balanced: 51.3% feature explicit references (object type named directly), while 48.7% are implicit (requiring commonsense inference). Multi-object grounding is core to the benchmark: 50.3% of all tasks require localization of two or more objects, with 4.6% demanding grounding of three or more simultaneously. Implicit tasks ground on 1.97 objects on average, in contrast to explicit tasks (1.14), reflecting greater reasoning demand (Xu et al., 3 Dec 2025).

| Attribute | Quantity | Notes |
|---|---|---|
| Videos | 100 | Egocentric; sourced from ScanNet |
| Avg. video duration | 87.88 s | Total ≈ 8,788 s |
| Task instructions | 2,704 | ≈ 27 per video |
| Object instances annotated | 4,194 | 177 functional categories |
| Explicit / implicit split | 51.3% / 48.7% | Rigorous balancing |
| Multi-object instructions | 50.3% | ≥ 2 objects; 4.6% with ≥ 3 objects |
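
The headline counts are mutually consistent; a quick arithmetic check using only the figures quoted above:

```python
# Consistency check of ToG-Bench's headline statistics (values from the text above).
tasks, objects, videos = 2704, 4194, 100

print(tasks / videos)    # ≈ 27.0 task instructions per video
print(objects / tasks)   # ≈ 1.55 grounded objects per task, on average

# The same per-task average follows from the explicit/implicit split:
# 51.3% explicit tasks at 1.14 objects + 48.7% implicit tasks at 1.97 objects.
print(0.513 * 1.14 + 0.487 * 1.97)   # ≈ 1.54, matching 4194 / 2704
```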

3. Defining Features: Task-Oriented, Dual (Explicit–Implicit), One-to-Many Grounding

ToG-Bench is delineated by three features:

  • Task-Oriented Grounding: Unlike prior STVG queries (“Find the blue cup”), ToG-Bench instructions identify targets by implied utility (“Make coffee,” “Find something to write with”), requiring models to map intent to objects (e.g., “make coffee” → coffee machine, cup, water faucet).
  • Explicit-Implicit Dual Grounding: Explicit tasks list object categories directly; implicit tasks do not, instead embedding the requirement in context or activity, so the model must use commonsense and situational reasoning to infer suitable objects. Task balancing ensures robust challenge across both paradigms.
  • One-to-Many Mapping: Real-world tasks frequently require multiple items; instructions in ToG-Bench often entail identifying and localizing several items collectively necessary for task fulfillment. This structure evaluates simultaneous multi-object reasoning and localization under temporal constraints.
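
The three features can be made concrete with a single hypothetical task record. The field names below are illustrative assumptions for exposition, not the released annotation schema:

```python
# Hypothetical ToG-Bench-style record for one implicit, multi-object task.
# Field names and values are illustrative, not the benchmark's actual schema.
example_task = {
    "video_id": "scannet_scene_0123",         # assumed identifier format
    "instruction": "Make coffee",             # implicit: no object category named
    "reference_type": "implicit",             # "explicit" or "implicit"
    "targets": [                              # one-to-many: several required objects
        {
            "category": "coffee machine",
            "temporal_span_s": [12.0, 34.0],  # seconds during which it is task-relevant
            "tube": [                         # per-frame boxes sampled at 1 fps
                {"t": 12.0, "box_xyxy": [310, 140, 455, 290]},
                {"t": 13.0, "box_xyxy": [305, 138, 452, 288]},
            ],
        },
        {
            "category": "cup",
            "temporal_span_s": [30.0, 41.0],
            "tube": [{"t": 30.0, "box_xyxy": [120, 260, 180, 330]}],
        },
    ],
}
```

An explicit counterpart would name the categories directly (e.g., “Bring me the cup on the table”) and typically grounds a single object.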

4. Semi-Automated Annotation Pipeline

Annotation proceeds in three semi-automated stages:

  1. Instruction Authoring: An advanced multimodal LLM (Gemini 2.5 Pro) processes each video to propose explicit and implicit task instructions, coupling each with a list of relevant objects. Automated filtering enforces that targets are visible, unique, and feasible.
  2. Object Grounding and Tracking: Object descriptions guide Grounding-DINO, which proposes initial bounding boxes. These are then temporally propagated by SAM2, generating “tubes” through video frames at 1 fps to capture the full temporal span of participation in a task.
  3. Human Verification: Manual review is performed in two rounds to eliminate occlusions, correct tracking drifts, and ensure that spatiotemporal boundaries accurately reflect the functional involvement of objects for both single- and multi-object instructions.

This top-down, instruction-first strategy yields high-precision, functionally relevant spatiotemporal object tubes aligned to task intent.
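
A schematic sketch of this instruction-first flow is given below. The helper functions are hypothetical placeholders standing in for the Gemini 2.5 Pro prompting step, the Grounding-DINO detector, the SAM2 tracker, and the manual review rounds; the real APIs differ, and only the ordering of the stages is taken from the paper:

```python
from typing import Any, Dict, List

# Placeholder stubs for the external components; not real library calls.
def propose_task_instructions(frames: List[Any]) -> List[Dict]: ...                   # Stage 1: MLLM drafts + filtering
def propose_initial_box(frames: List[Any], desc: str) -> List[float]: ...             # Stage 2a: open-vocabulary detector
def propagate_tube(frames: List[Any], box: List[float], fps: int) -> List[Dict]: ...  # Stage 2b: tracker
def human_verify(records: List[Dict], rounds: int) -> List[Dict]: ...                 # Stage 3: manual review

def annotate_video(frames: List[Any]) -> List[Dict]:
    """Schematic ordering of the three annotation stages (not the authors' code)."""
    records = []
    for task in propose_task_instructions(frames):        # visible, unique, feasible targets only
        for desc in task["target_descriptions"]:
            box = propose_initial_box(frames, desc)       # initial bounding box proposal
            tube = propagate_tube(frames, box, fps=1)     # temporal propagation at 1 fps
            records.append({"instruction": task["instruction"],
                            "category": desc,
                            "tube": tube})
    return human_verify(records, rounds=2)                # two review rounds
```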

5. Evaluation Metrics

ToG-Bench uses a hierarchical evaluation that separates task-driven recognition from precise spatiotemporal localization:

  • Recognition Accuracy (Acc): Fraction of predicted object categories that match the ground truth, where a match is determined by thresholded cosine similarity between category embeddings.
  • Temporal Localization:
    • Mean temporal IoU: $\mathrm{m\_tIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{|T_{\mathrm{pred}_i} \cap T_{\mathrm{gt}_i}|}{|T_{\mathrm{pred}_i} \cup T_{\mathrm{gt}_i}|}$
    • $\mathrm{R1}@\tau$: Recall at temporal IoU ≥ $\tau$, for $\tau \in \{0.3, 0.5, 0.7\}$
  • Spatial Localization:
    • $\mathrm{m\_vIoU}$: Mean spatial IoU between predicted and ground-truth boxes
    • $\mathrm{AP}@\tau$: Average Precision for IoU ≥ $\tau$

These are reported at both the object (O-Acc, O-R1@τ, O-AP@τ) and task (T-Acc, T-R1@τ, T-AP@τ) levels, with explicit/implicit subtotals (EAcc/IAcc, T-EAP@τ, T-IAP@τ, etc.), enabling fine-grained analysis of model performance on multi-object and inference-heavy scenarios (Xu et al., 3 Dec 2025).
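
A minimal sketch of how the temporal metrics and the embedding-based recognition accuracy could be computed. The one-to-one pairing of predictions with ground truths and the 0.5 cosine-similarity threshold are assumptions for illustration, not values specified by the benchmark:

```python
import numpy as np

def temporal_iou(pred, gt):
    """IoU of two temporal intervals, each given as (start_s, end_s)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def m_tiou_and_r1(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Mean temporal IoU and R1@tau over paired prediction/ground-truth spans."""
    ious = np.array([temporal_iou(p, g) for p, g in zip(preds, gts)])
    return float(ious.mean()), {t: float((ious >= t).mean()) for t in thresholds}

def recognition_accuracy(pred_emb, gt_emb, sim_threshold=0.5):
    """Fraction of predicted categories whose embedding cosine similarity with the
    paired ground-truth category embedding exceeds the (assumed) threshold."""
    pred = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    gt = gt_emb / np.linalg.norm(gt_emb, axis=1, keepdims=True)
    return float(((pred * gt).sum(axis=1) >= sim_threshold).mean())

# Example: two predicted vs. ground-truth temporal spans (seconds)
print(m_tiou_and_r1([(3.0, 10.0), (20.0, 25.0)], [(4.0, 12.0), (26.0, 30.0)]))
# -> (0.333..., {0.3: 0.5, 0.5: 0.5, 0.7: 0.0})
```

$\mathrm{m\_vIoU}$ can be computed analogously, averaging per-frame box IoU over the temporal union of predicted and ground-truth frames (the standard vIoU convention in the STVG literature).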

6. Benchmarking and Empirical Findings

Seven state-of-the-art multimodal LLMs (MLLMs) were assessed in a zero-shot regime, comprising both proprietary (e.g., GPT-5, Gemini 2.5 Pro) and open-source models (Qwen3-VL, VideoLLaMA3, InternVL, etc.). Key empirical results include:

  • Overall Task Accuracy (T-Acc) spans from approximately 28% (VideoLLaMA3) to 89.4% (GPT-5).
  • Temporal mIoU: Peaks at 41.6% (Gemini 2.5 Pro); Spatial mIoU: Peaks at 38.5% (Gemini 2.5 Pro). Even state-of-the-art models typically fall below 50% for fine-grained localization.
  • Explicit–Implicit Split: Large performance gaps; GPT-5 achieves 98.2% accuracy on explicit tasks but only 80.2% on implicit ones. Temporal mIoU for GPT-5 drops from 47.8% (explicit) to 32.6% (implicit), with even wider drops for most open-source models.
  • Multi-Object Complexity: For single-object tasks, top systems reach ≈96% T-Acc; this falls to ≈75% when three or more objects are required, with substantial spatial mIoU losses (15–20 points).

These results highlight persistent deficits in multi-object coordination and commonsense-driven inference, particularly on the implicit, higher-order cognitive tasks central to true embodied intelligence (Xu et al., 3 Dec 2025).

7. Impact and Forward Directions

ToG-Bench constitutes the inaugural large-scale benchmark for task-oriented spatio-temporal video grounding in egocentric settings, defining both evaluation protocols and a minimum challenge specification for next-generation embodied agents. Its findings reveal fundamental model limitations:

  • Gap between appearance-based and intent-based grounding
  • Difficulty generalizing from explicit to implicit task statements
  • Pronounced degradation of performance with multi-object instructions and increased temporal span

Proposed directions for progress include integrating world knowledge and functional reasoning modules to better address implicit grounding, and developing unified architectures that jointly optimize recognition and spatio-temporal coordination under egocentric motion. Curriculum-based, complexity-incremental training paradigms are further plausible avenues for bridging the existing capability gap. By exposing these bottlenecks, ToG-Bench provides a critical reference point for progress toward autonomous agents capable of robust, task-driven interaction with their environment (Xu et al., 3 Dec 2025).

References

  1. Xu et al., “ToG-Bench,” 3 Dec 2025.
