DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

Published 29 Apr 2026 in cs.CV | (2604.26565v1)

Abstract: Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and LLMs (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. This pipeline yields DenseStep2M, a large-scale dataset comprising approximately 100K videos and 2M detailed instructional steps, designed to support comprehensive long-form video understanding. To rigorously evaluate our pipeline, we curate DenseCaption100, a benchmark of high-quality, human-written captions. Evaluations demonstrate strong alignment between our auto-generated steps and human annotations. Furthermore, we validate the utility of DenseStep2M across three core downstream tasks: dense video captioning, procedural step grounding, and cross-modal retrieval. Models fine-tuned on DenseStep2M achieve substantial gains in captioning quality and temporal localization, while exhibiting robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains. These results underscore the effectiveness of DenseStep2M in facilitating advanced multimodal alignment and long-term activity reasoning. Our dataset is available at https://huggingface.co/datasets/mingjige/DenseStep2M.

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper introduces a training-free pipeline that uses multimodal and reasoning LLMs to achieve dense, temporally precise annotation of instructional videos.
The paper’s method produces 2M annotated steps across 100K videos, significantly improving semantic and temporal alignment (F1: 75.8%, mIoU: 42.9%).
The paper demonstrates enhanced downstream performance in video captioning, step grounding, and cross-modal retrieval, paving the way for scalable video-language models.

DenseStep2M: A Training-Free, Scalable Pipeline for Dense Instructional Video Annotation

Introduction

DenseStep2M introduces a robust, fully automated pipeline for the large-scale annotation of instructional videos, producing temporally grounded procedural step annotations that are essential for advanced long-term video understanding. The authors address fundamental limitations of previous instructional video datasets—in particular noisy ASR, poor temporal alignment between narration and visual activity, and the lack of dense, semantically rich step-level annotations—by leveraging state-of-the-art Multimodal LLMs (MLLMs) and advanced reasoning LLMs without requiring additional supervised training.

The pipeline yields DenseStep2M, a dataset with 100K curated instructional videos annotated with approximately 2 million timestamped procedural steps. The work further releases DenseCaption100, a dense, human-annotated benchmark for downstream evaluation. Evaluations of both the pipeline and dataset demonstrate pronounced improvements over extant approaches in semantic and temporal alignment of procedural steps, with immediate positive impact for fine-tuning and validating V-(L)LMs on long-form video understanding, captioning, procedural step grounding, and cross-modal retrieval.

Figure 1: Three-step pipeline for Stage1 (Video Filtering and Segmentation), utilizing WhisperX for ASR, Qwen2.5 for segmentation, and DeepSeek-R1 plus Qwen2.5-VL for visual-semantic alignment.

Pipeline Architecture

The proposed pipeline is decomposed into three principal stages:

Stage 1: Video Filtering and Segmentation

Initial filtering is performed on the HowTo100M dataset using WhisperX-generated ASR transcripts (significantly improving timestamp and transcript quality over the original YouTube pipeline). Qwen2.5 is used to produce video titles and segment the videos into candidate instructional and non-instructional shots based on transcript semantics. The pipeline then invokes DeepSeek-R1 for further ASR refinement and applies Qwen2.5-VL to verify cross-modal semantic alignment between video and transcript segments. Only videos with over 75% instructional content—coinciding in both text and visual modalities—are retained for annotation.

Figure 3: Example of a HowTo100M video segment dominated by non-instructional content, illustrating the rationale for strict filtering.

Stage 2: Instructional Step Generation

Videos are split into visually coherent shots. For each shot, Qwen2.5-VL is provided with sampled frames, local ASR transcript, and the global video title. The model outputs natural-language procedural steps, each paired with explicit start/end timestamps. This multimodal conditioning ensures that generated steps capture both visual actions and narrated content, addressing the classic missed alignment issue of previous text-only pipelines.

Figure 4: Issues in initial step generation (Qwen2.5-VL): disordered steps, hallucinated actions, redundant steps, and inaccurate timestamps.

The shot-level candidate steps are then globally audited and rectified by DeepSeek-R1. Taking the global ASR and title as context, DeepSeek-R1 eliminates fragmented, hallucinated, or redundant actions, aligns temporal boundaries, and merges semantically repeated or closely related steps into a canonical, globally ordered list per video. Crucially, this stage prioritizes logical coherence and narrative consistency, leveraging the reasoning LLM's causal inference capabilities.

Figure 2: The right panel shows DeepSeek-R1 refinement, merging, and correcting hallucinated and imprecise initial step proposals.

Dataset Analysis

The resulting DenseStep2M dataset comprises 99K high-purity instructional videos with over 1.9M annotated steps, averaging 19 steps per video—2x the step density of previous open-vocabulary datasets, and a mean step length of 10.4 words. The dataset skews towards Food and Entertaining (59.1%), but also contains substantial data from Hobbies and Crafts, and Home and Garden (see summary plots below).

Figure 5: Dataset statistics: category distribution, histogram of step-per-video counts, and step word count.

DenseStep2M annotations showcase:

Abstract and context-aware steps transcending surface visual actions.
Explicit multimodal grounding: durations, object quantities, and inferred purposes.
Systematic filtering for high instructional content and visual-textual alignment.

Furthermore, a manual evaluation demonstrated a dramatic increase in visually and temporally well-aligned steps (from 15.2%/21.3% in HowTo100M to 56.1%/62.0% in DenseStep2M).

Evaluation on Annotation Quality

On the DenseCaption100 benchmark, the full pipeline (Qwen2.5-VL + DeepSeek-R1) outperforms all baselines on F1 (75.8%) and mIoU (42.9%) for step alignment and timestamping, exceeding the prior state-of-the-art by substantial margins. Ablations confirm the necessity of high-capacity VLLMs and the critical synergy between initial visual anchoring and global reasoning-based refinement: the former supplies perceptual precision, the latter ensures logical and narrative coherence.

In addition, the pipeline achieves a METEOR score of 9.4, CIDEr 33.6, and SODA_c 5.6 on YouCook2, outperforming the best non-pretrained models (METEOR +3.3, SODA_c +0.3 absolute).

Figure 6: Illustration of R1-Score computation, demonstrating semantic matching, aggregated timestamp alignment, and computation of precision/recall/F1/mIoU.

Downstream Task Utility

Captioning and Grounding

Models fine-tuned on DenseStep2M demonstrate marked improvements in both dense video captioning (R1-mIoU +4.2%) and procedural step grounding (+8.6% mIoU) compared to vision-agnostic or text-only baselines. Vision-LLMs not fine-tuned on DenseStep2M underperform significantly, confirming the superior utility of this dataset for both semantic and temporal video understanding.

DenseStep2M-finetuned models (Qwen3-VL-8B, Embedding and Instruct variants) exhibit substantial improvements in exocentric domain tasks (YouCook2 R@1 +29.0%) and competitive/peak performance across egocentric, exocentric, and mixed (ego-exo) video-language retrieval tasks, especially when combined with complementary Ego4D pretraining. Notably, the dataset significantly mitigates exocentric blind-spots that limit ego-centric only training, supporting robust, perspective-invariant representations.

Figure 7: Visualization of downstream tasks enabled by DenseStep2M: (a) dense video captioning and (b) procedural step grounding.

Qualitative Results

Pipeline outputs maintain coherence, temporal accuracy, and step granularity for both short and long instructional videos. Step annotations include rich multimodal references and abstract, context-driven actions.

Figure 8: Qualitative examples from short videos demonstrate robust segmentation and step annotation fidelity by the pipeline.

Implications and Future Directions

DenseStep2M demonstrates that advanced, training-free LLM pipelines, properly architected, can achieve large-scale, semantically rich, and temporally precise annotation of in-the-wild video corpora. The shift from heuristic, noisy ASR-based alignment to joint multimodal and reasoning LLM curation supports research into fine-grained procedural understanding, long-horizon activity reasoning, and robust generative video-language modeling.

The dataset and pipeline enable the development of V-(L)LMs capable of truly scalable cross-task and cross-perspective transfer, bridging exocentric and egocentric video understanding, supporting embodied AI, and facilitating human-assistive applications.

Limitations include residual dependence on LLM and ASR errors, partial domain imbalance, and the upstream propagation of source data biases. Future work should integrate causal scene graph construction, unified video and dialog annotation, and further diversify sourcing beyond existing corpora.

Conclusion

DenseStep2M establishes a new paradigm in the automatic, training-free annotation of dense instructional video data, unifying high-density procedural annotation, multimodal temporal grounding, and large-scale scalability. The pipeline sets a new benchmark for dataset quality and enables state-of-the-art performance on a variety of downstream long-form video tasks, significantly advancing the capabilities and generality of video-language modeling systems.

Markdown Report Issue