
TimeLens-100K: High-Quality VTG Dataset

Updated 18 December 2025
  • The paper details an automated, LLM-driven pipeline that re-annotates legacy VTG datasets, ensuring precise event boundaries and high-quality query–segment pairs.
  • TimeLens-100K is a large-scale, temporally fine-grained VTG dataset containing around 100,000 annotations derived from 20,000 videos with rigorous quality control.
  • Combined with reinforcement learning training, models trained on the dataset achieve significant gains in recall and mIoU, setting a new standard for VTG evaluation.

TimeLens-100K refers to multiple large-scale datasets in the vision and astrophysics communities, each advancing time-resolved benchmarks and modeling in distinct domains. The most prominent definition, established by "TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs" (Zhang et al., 16 Dec 2025), is a high-quality, large-scale video temporal grounding (VTG) training dataset constructed via multimodal LLMs. Several other works, including those on event-based video frame interpolation and simulated gravitational lens light curves, also use the phrase "TimeLens-100K" to denote datasets of order $10^4$–$10^5$ samples, but these serve very different research purposes. What unifies these usages is the focus on temporally fine-grained, automatically or semi-automatically annotated data at scales useful for both model training and benchmarking.

1. Automated Construction of the TimeLens-100K VTG Dataset

TimeLens-100K (Zhang et al., 16 Dec 2025) is constructed by systematically re-annotating and filtering heterogeneous legacy VTG datasets (e.g., DiDeMo, CosMo-Cap, InternVid-VTime, QuerYD, HiREST) via a multi-stage, LLM-driven pipeline. The pipeline leverages state-of-the-art multimodal LLMs (specifically Gemini-2.5-Pro) for automatic proposal, description, and verification of events in video, ensuring annotation quality and scalability. The steps are as follows:

  1. Video Sampling: Source videos are stratified by duration to yield ≈20,000 videos with a uniform length distribution (up to 240 s, with longer outliers retained).
  2. Frame Extraction & Preprocessing: Frames are extracted at 2 fps; consecutive pairs are merged in the vision encoder to match LLM input requirements, but are retained individually for annotation.
  3. Automated Event Proposal & Query Generation: The LLM is prompted to identify $K$ diverse, non-overlapping events per video, generate for each event a natural-language query $q_i$, and specify precise start/end timestamps $\hat{S}_i = (\hat{t}_i^\text{start}, \hat{t}_i^\text{end})$ spanning the video uniformly.
  4. Self-Verification & Filtering: The LLM is further prompted to assess each (query, segment) pair against stringent quality criteria—existence, clarity, uniqueness, boundary precision, and no temporal leakage. Any pair failing self-consistency is discarded.
  5. Aggregation: Surviving annotated pairs are collected into the final corpus, yielding ≈100,000 query–segment annotations.

This process is formalized via algorithmic pseudocode, and is designed for maximal scalability with minimal manual intervention.
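
The pseudocode itself is given in the paper; as an illustration only, the following Python sketch captures the control flow under assumed interfaces. The objects and callables (`videos`, `extract_frames`, `propose_events`, `verify_pair`) are hypothetical wrappers around the video I/O and Gemini-2.5-Pro prompting described above, not a released API.

```python
def build_timelens_100k(videos, propose_events, verify_pair, fps=2):
    """Sketch of the TimeLens-100K construction loop (hypothetical interfaces).

    videos:         iterable of duration-stratified source clips (step 1),
                    each exposing .id, .duration, and .extract_frames(fps).
    propose_events: assumed wrapper around the LLM prompt returning
                    (query, start, end) proposals for one video (step 3).
    verify_pair:    assumed wrapper around the LLM self-verification prompt (step 4).
    """
    corpus = []
    for video in videos:
        # Step 2: frames are extracted at 2 fps before prompting the LLM.
        frames = video.extract_frames(fps=fps)
        # Step 3: K diverse, non-overlapping events with queries and timestamps.
        for query, start, end in propose_events(frames, video.duration):
            # Step 4: keep only pairs that pass the quality criteria
            # (existence, clarity, uniqueness, boundary precision, no leakage).
            if verify_pair(frames, query, (start, end)):
                corpus.append({
                    "video_id": video.id,
                    "query_text": query,
                    "start_time": float(start),
                    "end_time": float(end),
                })
    # Step 5: aggregation into the ~100K-pair corpus.
    return corpus
```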

2. Annotation Criteria, Quality Control, and Spot-Check Validation

The annotation and filtering stages enforce strict criteria previously formalized in "TimeLens-Bench," demanding:

  • Query Clarity & Specificity: $q$ must unambiguously identify exactly one event.
  • Event Existence: The event described must verifiably occur in the video.
  • Query Uniqueness: No two queries in the same video may reference the same event.
  • No Temporal Leakage: Queries must not contain temporal locators or hints (e.g., “in the ending”).
  • Boundary Precision: The labeled segment $\hat{S}$ must tightly enclose the event and exclude unrelated frames.
  • Exhaustiveness: No other segment outside $\hat{S}$ should satisfy $q$.

Although dual human annotation is not employed at this scale, a manual vendor spot-check of 2,000 re-annotated pairs reports over 95% agreement with expert annotators. Cohen’s $\kappa$ is noted as a possible measure of inter-annotator agreement but is not computed on the released set.
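
For context, Cohen’s $\kappa$ for a two-rater check (e.g., comparing spot-check decisions from the vendor and an expert) could be computed as below. This is a generic sketch with synthetic toy labels, not code or data released with the dataset.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items (generic sketch)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement p_o: fraction of items the raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement p_e from the raters' independent label marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in set(count_a) | set(count_b))
    return (p_o - p_e) / (1 - p_e)

# Synthetic example: 95% raw agreement on imbalanced accept/reject labels
# corresponds to kappa ~= 0.42, illustrating why raw agreement and kappa differ.
rater_a = ["accept"] * 95 + ["reject"] * 5
rater_b = ["accept"] * 93 + ["reject"] * 2 + ["accept"] * 3 + ["reject"] * 2
print(cohens_kappa(rater_a, rater_b))
```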

3. Corpus Statistics and Format

TimeLens-100K contains:

  • ≈20,000 videos ($|V|$), with an aggregate duration of ≈555 hours.
  • ≈100,000 query–segment annotations ($|D|$), averaging 5 annotations per video.
  • Mean segment duration $\mu = 12.3$ s, segment duration variance $\sigma^2 = 65.4$ s$^2$; 80% of segments are 3–20 s.
  • Each annotation is a JSON record:

{
  "video_id": "string",
  "query_text": "string",
  "start_time": float,
  "end_time": float
}

  • Descriptive statistics use:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} d_i, \qquad \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (d_i - \mu)^2$$
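
As an illustration, the per-annotation segment durations (end_time − start_time) and the summary statistics above can be recomputed directly from the JSON records. The JSON-Lines file name below is hypothetical; the actual release format may differ.

```python
import json
import statistics

# Hypothetical JSON-Lines file with one annotation record per line.
with open("timelens_100k.jsonl") as f:
    records = [json.loads(line) for line in f]

durations = [r["end_time"] - r["start_time"] for r in records]

mu = statistics.fmean(durations)              # mean segment duration (s)
var = statistics.pvariance(durations, mu=mu)  # population variance, matching the formula above
share_3_20 = sum(3 <= d <= 20 for d in durations) / len(durations)

print(f"mu = {mu:.1f} s, sigma^2 = {var:.1f} s^2, share in 3-20 s: {share_3_20:.0%}")
```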

4. Integration with Reinforcement Learning for Video Temporal Grounding

TimeLens-100K is central to “thinking-free” Reinforcement Learning with Verifiable Rewards (RLVR) for MLLM-based VTG (Zhang et al., 16 Dec 2025). Given a training tuple (video, query) and a predicted segment, the reward is defined as:

$$R = \begin{cases} \mathrm{tIoU}(s_\mathrm{pred}, s_\mathrm{gt}) & \text{if } \mathrm{tIoU}(s_\mathrm{pred}, s_\mathrm{gt}) > \tau \\ 0 & \text{otherwise} \end{cases}, \qquad \tau = 0.1$$

where tIoU is the temporal Intersection-over-Union between predicted and ground-truth segments.
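
A direct implementation of this verifiable reward on 1-D intervals could look as follows; this is a sketch consistent with the definition above, with segment endpoints given in seconds.

```python
def t_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def vtg_reward(pred, gt, tau=0.1):
    """Verifiable reward: the tIoU itself if it exceeds the threshold tau, else 0."""
    iou = t_iou(pred, gt)
    return iou if iou > tau else 0.0

# Example: a partially overlapping prediction.
print(vtg_reward((4.0, 12.0), (5.0, 15.0)))  # tIoU = 7 / 11 ≈ 0.636 → reward ≈ 0.636
```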

Training employs Group Relative Policy Optimization (GRPO) to maximize the within-minibatch relative advantage:

$$\mathcal{L}_\mathrm{GRPO} = -\sum_{i=1}^{G} \left( r^{(i)} - \bar{r} \right) \log \pi_\theta\left( y^{(i)} \mid v, q \right)$$
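
A minimal PyTorch sketch of this group-relative objective is shown below, assuming `log_probs` holds the sequence log-likelihoods $\log \pi_\theta(y^{(i)} \mid v, q)$ of the $G$ sampled predictions for one (video, query) pair and `rewards` their verifiable rewards. Full GRPO implementations typically add advantage normalization, ratio clipping, and a KL penalty, which this simplified form omits.

```python
import torch

def grpo_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Simplified group-relative policy gradient loss over G rollouts.

    log_probs: shape (G,), log pi_theta(y_i | v, q) for each sampled segment.
    rewards:   shape (G,), verifiable rewards r_i (e.g., thresholded tIoU).
    """
    # Advantage of each rollout relative to the group-mean baseline r_bar.
    advantages = (rewards - rewards.mean()).detach()
    # Negative sum of advantage-weighted log-likelihoods, as in the formula above.
    return -(advantages * log_probs).sum()

# Example with G = 4 rollouts for a single (video, query) pair.
lp = torch.tensor([-2.1, -1.8, -2.5, -2.0], requires_grad=True)
r = torch.tensor([0.62, 0.00, 0.45, 0.71])
grpo_loss(lp, r).backward()
```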

Difficulty-aware sampling is conducted by estimating instance difficulty $d_i = 1 - \mathrm{tIoU}_i$ and resampling according to a Gaussian mixture ($\mu = 0.05$, $\sigma = 0.2$).
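
One plausible reading of this scheme, sketched below under the assumption of a single Gaussian weighting component: score each instance by a Gaussian density over its difficulty $d_i = 1 - \mathrm{tIoU}_i$ with $\mu = 0.05$ and $\sigma = 0.2$, then resample with the normalized scores. The paper’s exact mixture may differ; the function and example names are hypothetical.

```python
import math
import random

def difficulty_weighted_resample(instances, t_ious, mu=0.05, sigma=0.2, k=None, seed=0):
    """Resample instances with Gaussian weights over difficulty d_i = 1 - tIoU_i (sketch)."""
    weights = [math.exp(-0.5 * ((1.0 - iou - mu) / sigma) ** 2) for iou in t_ious]
    rng = random.Random(seed)
    return rng.choices(instances, weights=weights, k=k or len(instances))

# Example: instances with tIoU estimates from a previous training pass.
batch = ["clip_a", "clip_b", "clip_c", "clip_d"]
ious = [0.92, 0.55, 0.10, 0.80]
print(difficulty_weighted_resample(batch, ious, k=8))
```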

5. Performance Impact and Benchmark Re-Ranking

Use of TimeLens-100K in place of noisy legacy corpora produces state-of-the-art results on the re-annotated TimeLens-Bench evaluation suite, including:

| Benchmark | R1@0.5 (Noisy → 100K) | ΔR1 | mIoU (Noisy → 100K) | ΔmIoU |
|---|---|---|---|---|
| Charades-TimeLens | 52.6 → 70.0 | +17.4 | 35.6 → 48.3 | +12.7 |
| ActivityNet-TimeLens | 45.0 → 57.9 | +12.9 | 31.3 → 43.1 | +11.8 |
| QVHighlights-TimeLens | 61.3 → 73.0 | +11.7 | 44.6 → 56.7 | +12.1 |

These improvements are solely attributable to data quality, not changes in model architecture or benchmark curation. Open-source models trained on TimeLens-100K match or exceed proprietary systems such as GPT-5 and Gemini-2.5-Flash on rigorous VTG evaluations.

6. Distinctions from Other "TimeLens-100K" Usages

Other works employing the "TimeLens-100K" moniker refer to frame interpolation and astrophysics datasets:

  • In event-based frame interpolation, "TimeLens-100K" is sometimes used colloquially for the HS-ERGB or BS-ERGB datasets, which consist of high-frame-rate scenes with synchronized event and RGB data spanning up to $\sim 10^5$ frames (Tulyakov et al., 2021, Tulyakov et al., 2022). These datasets are designed for low-level motion analysis, not semantic VTG.
  • In strong lensing cosmology, "TimeLens-100K" can denote a simulated catalog of $10^5$ lensed quasar and supernova light curves, constructed for time-delay recovery challenge experiments (Hojjati et al., 2014, Vernardos, 2021). These datasets follow completely different simulation and annotation pipelines and have no application overlap with video grounding.

7. Summary and Community Significance

TimeLens-100K, as formalized by (Zhang et al., 16 Dec 2025), sets a new standard for scale and annotation rigor in VTG dataset construction. Its automated, LLM-based methodology enables fast, uniform, and high-quality labeling, promoting reproducible evaluation and reliable training for time-aware multimodal LLMs. The dataset's impact is demonstrated through dramatic gains in recall and mIoU on strict evaluation suites, providing the necessary data and training setup for advancing temporal video understanding. By open-sourcing TimeLens-100K and associated RLVR training recipes, the authors facilitate the development of robust, transparent VTG methodologies for the research community.
