TimeLens-100K: High-Quality VTG Dataset
- The paper details an automated, LLM-driven pipeline that re-annotates legacy VTG datasets, ensuring precise event boundaries and high-quality query–segment pairs.
- TimeLens-100K is a large-scale, temporally fine-grained VTG dataset containing around 100,000 annotations derived from 20,000 videos with rigorous quality control.
- Combined with reinforcement learning training (RLVR), models trained on the dataset achieve significant gains in recall and mIoU, setting a new standard for VTG evaluation.
TimeLens-100K refers to multiple large-scale datasets in the vision and astrophysics communities, each advancing time-resolved benchmarks and modeling in distinct domains. The most prominent definition, as established by "TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs" (Zhang et al., 16 Dec 2025), is a high-quality, large-scale video temporal grounding (VTG) training dataset constructed via multimodal LLMs. Several other works, including those on event-based video frame interpolation and simulated gravitational lens light curves, also use the phrase "TimeLens-100K" for datasets of comparable scale, but these serve very different research purposes. What unifies these usages is the focus on temporally fine-grained, automatically or semi-automatically annotated data at scales useful for both model training and benchmarking.
1. Automated Construction of the TimeLens-100K VTG Dataset
TimeLens-100K (Zhang et al., 16 Dec 2025) is constructed by systematically re-annotating and filtering heterogeneous legacy VTG datasets (e.g., DiDeMo, CosMo-Cap, InternVid-VTime, QuerYD, HiREST) via a multi-stage, LLM-driven pipeline. The pipeline leverages state-of-the-art multimodal LLMs (specifically Gemini-2.5-Pro) for automatic proposal, description, and verification of events in video, ensuring annotation quality and scalability. The steps are as follows:
- Video Sampling: Source videos are stratified by duration to yield videos with a uniform length distribution (up to 240 s, with longer outliers retained).
- Frame Extraction & Preprocessing: Frames are extracted at 2 fps; consecutive pairs are merged in the vision encoder to match LLM input requirements, but are retained individually for annotation.
- Automated Event Proposal & Query Generation: The LLM is prompted to identify diverse, non-overlapping events per video, to generate a natural-language query for each event, and to specify precise start/end timestamps, with events spanning the video uniformly.
- Self-Verification & Filtering: The LLM is further prompted to assess each (query, segment) pair against stringent quality criteria—existence, clarity, uniqueness, boundary precision, and no temporal leakage. Any pair failing self-consistency is discarded.
- Aggregation: Surviving annotated pairs are collected into the final corpus, yielding approximately 100,000 query–segment annotations.
This process is formalized via algorithmic pseudocode, and is designed for maximal scalability with minimal manual intervention.
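The released pseudocode is not reproduced here, but the overall control flow can be sketched as follows. This is a hedged approximation only: `call_mllm`, `propose_events`, and `verify_pair` are hypothetical stand-ins for the paper's Gemini-2.5-Pro prompts and API calls, and the real pipeline operates on actual decoded frames.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Annotation:
    video_id: str
    query_text: str
    start_time: float  # seconds
    end_time: float    # seconds

def propose_events(video_id: str, frames: list, call_mllm: Callable) -> List[Annotation]:
    """Prompt the MLLM for diverse, non-overlapping events, each with a
    natural-language query and precise start/end timestamps (hypothetical prompt)."""
    response = call_mllm(task="propose", video_id=video_id, frames=frames)
    return [Annotation(video_id, e["query"], e["start"], e["end"]) for e in response]

def verify_pair(ann: Annotation, frames: list, call_mllm: Callable) -> bool:
    """Self-verification prompt: existence, clarity, uniqueness, boundary
    precision, and no temporal leakage must all pass (hypothetical prompt)."""
    verdict = call_mllm(task="verify", annotation=ann, frames=frames)
    return bool(verdict.get("pass", False))

def reannotate_video(video_id: str, frames: list, call_mllm: Callable) -> List[Annotation]:
    """One propose-then-filter pass for a single video."""
    proposals = propose_events(video_id, frames, call_mllm)
    return [a for a in proposals if verify_pair(a, frames, call_mllm)]
```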
2. Annotation Criteria, Quality Control, and Spot-Check Validation
The annotation and filtering stages enforce strict criteria previously formalized in "TimeLens-Bench," demanding:
- Query Clarity & Specificity: The query must unambiguously identify exactly one event.
- Event Existence: The event described must verifiably occur in the video.
- Query Uniqueness: No two queries in the same video may reference the same event.
- No Temporal Leakage: Queries must not contain temporal locators or hints (e.g., “in the ending”); a simple heuristic check for this criterion is sketched after this list.
- Boundary Precision: The labeled segment must tightly enclose the event and exclude unrelated frames.
- Exhaustiveness: No segment outside the labeled boundaries should satisfy the query.
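In the actual pipeline these criteria are checked by LLM self-verification, not by hand-written rules. As a purely illustrative heuristic, the sketch below flags the "no temporal leakage" criterion by matching explicit temporal locators in the query text; the keyword list is an assumption, not the paper's prompt.

```python
import re

# Illustrative heuristic only: the actual pipeline relies on LLM self-verification.
# Flags queries that hint at where the event occurs in the video.
TEMPORAL_LOCATORS = re.compile(
    r"\b(at the (beginning|start|end)|in the (beginning|middle|ending|end)"
    r"|first|last|before|after|halfway|near the end)\b",
    re.IGNORECASE,
)

def has_temporal_leakage(query_text: str) -> bool:
    """Return True if the query contains an explicit temporal locator."""
    return bool(TEMPORAL_LOCATORS.search(query_text))

print(has_temporal_leakage("A man opens the door in the ending"))  # True
print(has_temporal_leakage("A man opens the red door"))            # False
```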
Although dual human annotation is not employed at this scale, a manual vendor spot-check of 2,000 re-annotated pairs reports above 95% agreement with expert annotators. Cohen’s κ is noted as a candidate inter-annotator agreement measure, but it is not computed on the released set.
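Since Cohen’s κ is mentioned but not computed, a minimal implementation is sketched below; the accept/reject verdict lists are invented toy data, not the actual spot-check results.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' categorical labels (e.g. accept/reject per pair)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Toy example: agreement between the LLM pipeline and a human vendor
llm_verdicts   = ["accept", "accept", "reject", "accept", "reject"]
human_verdicts = ["accept", "accept", "reject", "reject", "reject"]
print(cohens_kappa(llm_verdicts, human_verdicts))  # ≈ 0.615 on this toy example
```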
3. Corpus Statistics and Format
TimeLens-100K contains:
- Approximately 20,000 source videos with a roughly uniform duration distribution (up to 240 s, with longer outliers retained).
- Approximately 100,000 query–segment annotations, averaging about 5 per video.
- Short, fine-grained segments: roughly 80% of segments are 3–20 s in duration.
- Each annotation is a JSON record:
```json
{
  "video_id": "string",
  "query_text": "string",
  "start_time": float,
  "end_time": float
}
```
- Descriptive statistics use the standard sample mean and variance over segment durations (see the loading sketch below).
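A minimal loader and descriptive-statistics helper is sketched below; the JSON-lines file layout and the filename are assumptions, since the exact release format is not specified here.

```python
import json
import statistics

def load_annotations(path):
    """Load TimeLens-100K-style records; one JSON object per line is assumed."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def segment_stats(annotations):
    """Mean/variance of segment duration and the fraction of 3-20 s segments."""
    durations = [a["end_time"] - a["start_time"] for a in annotations]
    return {
        "count": len(durations),
        "mean_s": statistics.mean(durations),
        "variance_s2": statistics.variance(durations),
        "frac_3_to_20_s": sum(3 <= d <= 20 for d in durations) / len(durations),
    }

# Usage (path is illustrative):
# stats = segment_stats(load_annotations("timelens_100k.jsonl"))
```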
4. Integration with Reinforcement Learning for Video Temporal Grounding
TimeLens-100K is central to “thinking-free” Reinforcement Learning with Verifiable Rewards (RLVR) for MLLM-based VTG (Zhang et al., 16 Dec 2025). Given a training tuple (video, query) and a predicted segment $[\hat{t}_s, \hat{t}_e]$, the verifiable reward is defined as

$$r = \mathrm{tIoU}\!\left([\hat{t}_s, \hat{t}_e],\, [t_s, t_e]\right),$$

where tIoU is the temporal Intersection-over-Union between the predicted and ground-truth segments.
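A minimal sketch of this reward follows; the reward is taken here to be exactly the tIoU, and any auxiliary terms the paper may add (e.g., a format reward) are not reproduced.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0

def reward(pred, gt):
    """Verifiable reward for a predicted segment, taken directly as tIoU."""
    return tiou(pred, gt)

print(reward((5.0, 12.0), (6.0, 14.0)))  # 6 / 9 ≈ 0.667
```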
Training employs Group Relative Policy Optimization (GRPO) to maximize the within-group relative advantage

$$A_i = \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)},$$

where $G$ rollouts are sampled per (video, query) pair.
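The group-relative advantage amounts to a per-group z-score of rewards, as sketched below; whether the paper uses the population or sample standard deviation, or adds an epsilon term, is an assumption here.

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: z-score each rollout's reward within its group."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards)  # population std; eps guards identical rewards
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Four rollouts for the same (video, query) pair, rewarded by tIoU
print(grpo_advantages([0.9, 0.4, 0.1, 0.6]))
```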
Difficulty-aware sampling is conducted by estimating instance difficulty and resampling training examples according to a Gaussian-mixture weighting over the estimated difficulty scores.
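The sketch below shows only the resampling mechanics; the two-component mixture parameters are illustrative placeholders, since the paper's values are not given in this summary, and difficulty could be estimated, for example, as one minus the mean tIoU of the current policy's rollouts.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_weights(difficulties, components=((0.4, 0.15, 0.5), (0.7, 0.1, 0.5))):
    """Sampling weight per instance under a 2-component Gaussian mixture over difficulty.
    Component parameters (mu, sigma, pi) are illustrative, not the paper's values."""
    return [sum(pi * gaussian_pdf(d, mu, sigma) for mu, sigma, pi in components)
            for d in difficulties]

def resample(items, difficulties, k, seed=0):
    """Resample k training instances with probability proportional to mixture weight."""
    random.seed(seed)
    return random.choices(items, weights=mixture_weights(difficulties), k=k)
```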
5. Performance Impact and Benchmark Re-Ranking
Use of TimeLens-100K in place of noisy legacy corpora produces state-of-the-art results on the re-annotated TimeLens-Bench evaluation suite, including:
| Benchmark | [email protected] (Noisy → 100K) | ΔR1 | mIoU (Noisy → 100K) | ΔmIoU |
|---|---|---|---|---|
| Charades-TimeLens | 52.6 → 70.0 | +17.4 | 35.6 → 48.3 | +12.7 |
| ActivityNet-TimeLens | 45.0 → 57.9 | +12.9 | 31.3 → 43.1 | +11.8 |
| QVHighlights-TimeLens | 61.3 → 73.0 | +11.7 | 44.6 → 56.7 | +12.1 |
These improvements are solely attributable to data quality, not changes in model architecture or benchmark curation. Open-source models trained on TimeLens-100K match or exceed proprietary systems such as GPT-5 and Gemini-2.5-Flash on rigorous VTG evaluations.
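To make the table's metrics concrete, the following sketch computes [email protected] and mIoU for a single-prediction-per-query setting; the exact TimeLens-Bench evaluation protocol may differ.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, threshold=0.5):
    """[email protected]: fraction of queries whose top-1 prediction has tIoU >= 0.5; mIoU: mean tIoU."""
    ious = [tiou(p, g) for p, g in zip(preds, gts)]
    r1 = sum(i >= threshold for i in ious) / len(ious)
    miou = sum(ious) / len(ious)
    return {"[email protected]": 100 * r1, "mIoU": 100 * miou}

# Toy example with two queries
print(evaluate([(5.0, 12.0), (30.0, 40.0)], [(6.0, 14.0), (31.0, 39.0)]))
```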
6. Distinctions from Other "TimeLens-100K" Usages
Other works employing the "TimeLens-100K" moniker refer to frame interpolation and astrophysics datasets:
- In event-based frame interpolation, "TimeLens-100K" is sometimes used colloquially for the HS-ERGB or BS-ERGB datasets, which consist of high-frame-rate scenes with synchronized event-camera and RGB data (Tulyakov et al., 2021; Tulyakov et al., 2022). These datasets are designed for low-level motion analysis, not semantic VTG.
- In strong lensing cosmology, "TimeLens-100K" can denote a simulated catalog of lensed quasar and supernova light curves, constructed for time-delay recovery challenge experiments (Hojjati et al., 2014; Vernardos, 2021). These datasets follow completely different simulation and annotation pipelines and have no application overlap with video temporal grounding.
7. Summary and Community Significance
TimeLens-100K, as formalized by (Zhang et al., 16 Dec 2025), sets a new standard for scale and annotation rigor in VTG dataset construction. Its automated, LLM-based methodology enables fast, uniform, and high-quality labeling, promoting reproducible evaluation and reliable training for time-aware multimodal LLMs. The dataset's impact is demonstrated through dramatic gains in recall and mIoU on strict evaluation suites, providing the necessary data and training setup for advancing temporal video understanding. By open-sourcing TimeLens-100K and associated RLVR training recipes, the authors facilitate the development of robust, transparent VTG methodologies for the research community.