- The paper presents ARC-Chapter, a scalable multimodal framework that generates fine-grained, hierarchical chapters for hour-long videos using integrated visual and ASR cues.
- It achieves state-of-the-art performance, outperforming competitors by up to +17.5 F1 and doubling SODA scores on benchmark datasets.
- The approach demonstrates strong transferability to downstream tasks, setting a new paradigm for fine-grained video structuring and robust summarization.
Motivation and Problem Statement
The increasing prevalence of hour-long video content in educational, entertainment, and professional contexts necessitates advanced methodologies for efficient structuring and summarization. Current solutions either lack scalability, are constrained to short video durations, or provide only coarse, non-descriptive chapter boundaries. Furthermore, standard benchmarks and evaluation metrics do not capture the granularity and semantic nuances critical for human-aligned chaptering in long-form videos, limiting both the development and assessment of robust models.
ARC-Chapter Framework and Dataset
ARC-Chapter addresses these limitations through a large-scale, bilingual (English-Chinese) annotation and training pipeline. The dataset, VidAtlas, comprises 410,000+ videos spanning 115,000+ hours, with rich semantic and temporal diversity across domains (lectures, reviews, documentaries, tutorials, podcasts, etc.). A novel semi-automatic pipeline extracts multimodal cues (visual frame captions with OCR, plus temporally aligned ASR) and unifies them into a chronological transcript, which an LLM then processes into dense, multilevel chapter annotations.
Figure 1: The multimodal annotation pipeline integrates OCR, visual captions, and ASR, generating hierarchical chapters and aligned narrative descriptions.
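As a rough illustration of the transcript-unification step, the sketch below merges timestamped ASR segments, frame captions, and OCR snippets into a single chronological text stream of the kind that could be handed to an LLM annotator. The `Cue` structure and the text formatting are assumptions for illustration, not the paper's exact implementation.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    start: float      # seconds into the video
    end: float
    source: str       # "asr", "caption", or "ocr"
    text: str

def build_chronological_transcript(asr, captions, ocr):
    """Merge timestamped cues from all modalities into one ordered transcript.

    Each argument is a list of Cue objects; the result is a plain-text
    transcript with [start-end] prefixes, ready to be passed to an LLM
    for dense chapter annotation.
    """
    cues = sorted(asr + captions + ocr, key=lambda c: (c.start, c.end))
    lines = [f"[{c.start:07.1f}-{c.end:07.1f}] ({c.source}) {c.text}" for c in cues]
    return "\n".join(lines)

# Example: one ASR segment and one frame caption interleaved by timestamp.
transcript = build_chronological_transcript(
    asr=[Cue(12.0, 18.5, "asr", "Welcome back, today we cover index funds.")],
    captions=[Cue(15.0, 15.0, "caption", "Slide titled 'Portfolio Basics'.")],
    ocr=[],
)
print(transcript)
```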
Each annotated chapter includes a concise short title, a detailed structural annotation (title, abstract, introduction), and a fine-grained timestamp-aligned video description, forming a hierarchical, information-dense ground truth.
Figure 2: ARC-Chapter generates three-level outputs per chapter: Short Title, Structural Chapter, and timestamp-aligned description.
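A minimal sketch of how one such three-level annotation could be represented in code; the field names mirror the description above but are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TimedDescription:
    start: float          # seconds
    end: float
    text: str             # fine-grained description of this span

@dataclass
class ChapterAnnotation:
    start: float
    end: float
    short_title: str                      # level 1: concise navigation label
    title: str                            # level 2: structural annotation
    abstract: str
    introduction: str
    descriptions: List[TimedDescription]  # level 3: timestamp-aligned details
```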
The duration and topic distributions reflect the dataset's breadth, which is essential for model generalization and robustness across varied video lengths and content types.

Figure 3: Distribution of video and chapter durations within the VidAtlas dataset.
Model Architecture
ARC-Chapter builds upon Qwen2.5-VL-7B with a fixed vision encoder to maximize context-window efficiency. The architecture integrates three key input streams:
- Task-specific textual prompt
- Temporally sampled video frames (up to 768, with dynamic frame-token allocation)
- Timestamp-aligned ASR transcripts
Inputs can flexibly include any subset (ASR-only, video-only, both), enabling robust deployment under limited modality conditions. Visual frames are encoded into tokens through Qwen2.5-VL’s vision module, while the ASR is injected as timestamped text. Adaptive modality dropping during training compels the model to leverage whichever modalities are available.
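The frame-sampling budget and the adaptive modality dropping described above can be sketched roughly as follows; the uniform sampling strategy, drop probabilities, and function names are illustrative assumptions rather than the paper's exact recipe.

```python
import random

def sample_frame_indices(num_frames_total, max_frames=768):
    """Uniformly sample up to max_frames frame indices across the full video."""
    if num_frames_total <= max_frames:
        return list(range(num_frames_total))
    step = num_frames_total / max_frames
    return [int(i * step) for i in range(max_frames)]

def drop_modalities(frames, asr_text, p_drop_video=0.15, p_drop_asr=0.15):
    """Randomly mask one input stream per training example (modality dropping).

    At least one modality is always kept so the sample stays learnable;
    the drop probabilities here are placeholders, not the paper's values.
    """
    drop_video = random.random() < p_drop_video
    drop_asr = random.random() < p_drop_asr
    if drop_video and drop_asr:  # never drop both streams at once
        if random.random() < 0.5:
            drop_video = False
        else:
            drop_asr = False
    return (None if drop_video else frames,
            "" if drop_asr else asr_text)
```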
Figure 4: Schematic of ARC-Chapter's multimodal input integration and generation pipeline.
The model is instruction-tuned to generate outputs in three formats: short titles, structured chapters (title, abstract, intro), and dense timestamp-aligned descriptions. Supervised learning is performed with standard autoregressive loss, complemented by reinforcement learning with a granularity-robust temporal alignment reward.
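The summary does not spell out the reward formula, so the following is only an assumed illustration of a granularity-tolerant temporal reward: each predicted chapter boundary pair is scored by its best temporal IoU against the reference segments, and the scores are averaged.

```python
def tiou(a, b):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def boundary_reward(pred_segments, gt_segments):
    """Mean best-match tIoU of predicted chapters against the reference.

    Scoring each prediction against its closest reference segment keeps the
    reward tolerant to granularity mismatches between model and annotator.
    """
    if not pred_segments or not gt_segments:
        return 0.0
    return sum(max(tiou(p, g) for g in gt_segments)
               for p in pred_segments) / len(pred_segments)

# e.g. boundary_reward([(0, 60), (60, 200)], [(0, 55), (55, 210)]) ≈ 0.91
```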
Evaluation and Metrics
Standard event segmentation (F1, tIoU), dense video captioning metrics (CIDEr, SODA), and the newly proposed GRACE metric are used for comprehensive evaluation. Unlike SODA, which enforces one-to-one matching and penalizes annotation granularity discrepancies, GRACE adopts a many-to-one strategy, computing semantic and temporal consistency between set-matched chapter groupings, aligned via Dynamic Time Warping. This provides granularity-invariant and semantically robust assessment of both prediction quality and chapter boundary alignment.
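Below is a stripped-down sketch of the many-to-one alignment idea behind GRACE: a Dynamic Time Warping pass over a chapter-to-chapter similarity matrix. How that matrix combines semantic and temporal consistency is left abstract here and is an assumption of this sketch.

```python
import numpy as np

def dtw_align(sim):
    """Dynamic Time Warping over a similarity matrix sim[i, j] between
    predicted chapter i and reference chapter j.

    Returns the monotone alignment path maximizing total similarity, so
    several fine-grained predictions can map onto one coarse reference
    chapter (and vice versa) instead of the one-to-one matching of SODA.
    """
    n, m = sim.shape
    acc = np.full((n, m), -np.inf)
    acc[0, 0] = sim[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = max(
                acc[i - 1, j] if i > 0 else -np.inf,      # group predictions
                acc[i, j - 1] if j > 0 else -np.inf,      # group references
                acc[i - 1, j - 1] if i > 0 and j > 0 else -np.inf,
            )
            acc[i, j] = sim[i, j] + best_prev
    # Backtrack from the last cell to recover the alignment path.
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while (i, j) != (0, 0):
        candidates = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
        i, j = max((c for c in candidates if c[0] >= 0 and c[1] >= 0),
                   key=lambda c: acc[c])
        path.append((i, j))
    return path[::-1], float(acc[n - 1, m - 1])
```

Summing similarities along the returned path and normalizing then yields a score that does not punish a model for predicting finer or coarser chapters than the annotator chose.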
Experimental Results
ARC-Chapter demonstrates clear state-of-the-art (SOTA) performance across multiple benchmarks, with gains of up to +17.5 F1 and roughly doubled SODA scores over prior methods.
Ablation studies confirm robust gains from hierarchical annotation and reinforcement learning (GRPO), with temporal precision improvements (F1, tIoU) achieved without sacrificing semantic output quality (CIDEr, SODA, GRACE).
Transferability and Qualitative Analysis
Pretraining ARC-Chapter yields strong positive transfer to downstream dense video captioning tasks, achieving new SOTA on YouCook2 and ActivityNet Captions in both segmentation and captioning metrics.
Visualizations on challenging English and Chinese finance-related videos show the model’s capability to generate structurally consistent, contextually accurate hierarchical chapters and timestamped summaries in both languages.
Figure 6: Example outputs on an English finance/cryptocurrency video, demonstrating the clarity and informativeness of chapter and description generation.
Figure 7: Multilingual generalization: hierarchical chaptering of a Chinese stablecoin discussion video.
Implications and Future Directions
ARC-Chapter establishes a new experimental paradigm for fine-grained, long-form video structuring by leveraging scalable multimodal data curation and robust hierarchical annotation. Its combination of architectural flexibility, versatile training objectives, and granularity-robust evaluation supports real-world deployment for content navigation, educational retrieval, and video summarization. Methodologically, the observed scaling law underscores the value of increasing both data volume and annotation intensity.
Potential future developments include expanding multilingual coverage, integrating specialized temporal compression for raw audio/visual signals, and extending the annotation pipeline for personalized content summarization. The granularity-robust evaluation framework (GRACE) could influence broader benchmarks in event localization and multimodal reasoning where annotation ambiguity and semantic alignment are critical.
Conclusion
ARC-Chapter advances the field of long-form video understanding through a scalable, hierarchically structured annotation and modeling framework. Empirical results across multiple benchmarks, languages, and output modalities confirm its superiority over prior methods. By demonstrating strong scaling behaviors and transferability, ARC-Chapter sets a new baseline for navigable chapter generation and hierarchical summarization of hour-long videos, paving the way for further research in multimodal temporal structuring and interactive content retrieval.