- The paper introduces a text-domain chaptering method that leverages finetuned LLMs to jointly predict chapter boundaries and titles, outperforming prior state-of-the-art techniques.
- It employs a speech-guided frame selection strategy, dramatically reducing processed frames and computational cost while maintaining semantic accuracy.
- Iterative prediction with interleaved ASR transcripts and frame captions scales the method for hour-long videos, achieving notable improvements on VidChapters-7M.
The paper "Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs" (2504.00072) introduces a framework for automatic video chaptering, defined as the dual task of temporally segmenting long videos into semantic units and generating descriptive titles for these segments. The core contribution lies in reformulating the problem within the text domain, enabling the application of LLMs with extensive context windows to process hour-long videos efficiently.
Methodology: Text-Domain Chaptering with LLMs
Chapter-Llama circumvents direct video feature processing by converting multimodal video information into a unified text sequence. This sequence serves as input to a finetuned LLM.
Input Representation: The framework relies on two primary data streams, each augmented with timestamps in HH:MM:SS format:
- Automatic Speech Recognition (ASR) Transcripts: Generated using Whisper-Large-V2 via WhisperX, providing the spoken content along with utterance start times.
- Frame Captions: Descriptive text for specific video frames, generated using the MiniCPM-V model.
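To make the text-domain input concrete, here is a minimal sketch of how timestamped ASR utterances and frame captions could be interleaved into a single chronological sequence (the tags and layout are illustrative assumptions, not the paper's exact template, which is described further under LLM Processing below):

```python
def to_hms(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}"

def build_input_text(asr_utterances, frame_captions):
    """Interleave timestamped ASR utterances and frame captions chronologically.

    asr_utterances: list of (start_seconds, text) from the ASR system.
    frame_captions: list of (frame_seconds, caption) from the image captioner.
    """
    events = [(t, "ASR", txt) for t, txt in asr_utterances]
    events += [(t, "Caption", txt) for t, txt in frame_captions]
    events.sort(key=lambda e: e[0])  # chronological order
    return "\n".join(f"{to_hms(t)} [{kind}] {txt}" for t, kind, txt in events)
```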
Speech-Guided Frame Selection: A key driver of efficiency is the strategy for selecting which frames to caption, avoiding the prohibitive cost of processing every frame. The authors propose a lightweight, speech-guided method (sketched in code below):
- A preliminary, speech-only variant of the chaptering LLM is trained using only ASR transcripts to predict coarse chapter boundaries.
- Video frames are sampled exclusively at the timestamps corresponding to these predicted boundaries. This significantly reduces the number of frames requiring captioning (averaging 10.3 frames per video on VidChapters-7M).
- For videos lacking speech content (approximately 3% in VidChapters-7M), a fallback mechanism samples frames uniformly at 10-second intervals, capped at 100 frames.
This targeted selection contrasts sharply with naive equidistant sampling or dense captioning, drastically lowering the computational burden associated with the vision component.
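A minimal sketch of this selection logic, assuming a callable that wraps the speech-only chaptering LLM (function and parameter names are illustrative, not from the released code):

```python
def select_frame_timestamps(asr_utterances, video_duration_s,
                            predict_boundaries_from_speech,
                            fallback_interval_s=10.0, fallback_cap=100):
    """Return timestamps (in seconds) at which frames are extracted and captioned.

    asr_utterances: list of (start_seconds, text); empty if the video has no speech.
    predict_boundaries_from_speech: callable mapping ASR utterances to a list of
        coarse boundary timestamps (the speech-only LLM variant; assumed interface).
    """
    if asr_utterances:
        # Speech-guided path: caption frames only at boundaries predicted from speech.
        return sorted(set(predict_boundaries_from_speech(asr_utterances)))
    # Fallback for speechless videos: uniform sampling every 10 s, capped at 100 frames.
    n_frames = min(fallback_cap, int(video_duration_s // fallback_interval_s) + 1)
    return [i * fallback_interval_s for i in range(n_frames)]
```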
LLM Processing and Finetuning:
- The timestamped ASR transcripts and selected frame captions are chronologically interleaved and formatted into a single text sequence. A task-specific instruction prompt precedes this sequence.
- The framework employs Llama-3.1-8B-Instruct as the base LLM. Crucially, this model is finetuned specifically for the chaptering task using Low-Rank Adaptation (LoRA); a minimal configuration sketch follows the output-format example below. Finetuning adapts the LLM to discern relevant semantic boundaries and adhere to the structured output format required for chaptering.
- The LLM is trained to jointly predict chapter boundaries and titles as a single output sequence, formatted as:
```
HH:MM:SS - Chapter Title 1
HH:MM:SS - Chapter Title 2
...
```
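As referenced above, the chaptering LLM is obtained by LoRA-finetuning Llama-3.1-8B-Instruct. A minimal configuration sketch using Hugging Face `transformers` and `peft` (assumed tooling; the rank, scaling, and target modules below are illustrative placeholders rather than the paper's reported hyperparameters):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Wrap the frozen base model with low-rank adapters; only adapter weights are trained.
lora_config = LoraConfig(
    r=16,                     # adapter rank (illustrative)
    lora_alpha=32,            # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 8B base parameters

# Training then maximizes the likelihood of the target "HH:MM:SS - Title" sequence
# given the instruction prompt and the interleaved timestamped input text.
```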
Handling Extended Context Lengths: For videos whose textual representation exceeds the LLM's operational context window (e.g., >15k tokens during training, >25k during inference), an iterative prediction strategy is implemented. The input text is segmented into overlapping chunks (e.g., 15k or 20k tokens). The LLM processes each chunk sequentially, and the resulting chapter predictions are merged to produce the final output for the entire video. This allows the model to scale beyond its inherent context length limitation.
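A simplified sketch of this iterative strategy, assuming a `generate_chapters(text) -> str` callable that wraps the finetuned LLM and returns "HH:MM:SS - Title" lines (the token estimate, overlap, and merge rule below are simplifications, not the paper's exact procedure):

```python
import re

CHAPTER_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}) - (.+)$")

def parse_chapters(output_text):
    """Parse 'HH:MM:SS - Title' lines into (seconds, title) pairs."""
    chapters = []
    for line in output_text.splitlines():
        m = CHAPTER_LINE.match(line.strip())
        if m:
            h, mi, s, title = int(m[1]), int(m[2]), int(m[3]), m[4]
            chapters.append((h * 3600 + mi * 60 + s, title))
    return chapters

def chapter_long_video(input_lines, generate_chapters,
                       max_tokens=20_000, tokens_per_line=40, overlap_fraction=0.1):
    """Chapter a video whose text representation exceeds the usable context window.

    input_lines: the interleaved, timestamped text lines for the whole video.
    """
    lines_per_chunk = max(1, max_tokens // tokens_per_line)       # crude token budget
    step = max(1, int(lines_per_chunk * (1 - overlap_fraction)))  # overlapping chunks
    merged = {}
    for start in range(0, len(input_lines), step):
        chunk = "\n".join(input_lines[start:start + lines_per_chunk])
        for t, title in parse_chapters(generate_chapters(chunk)):
            merged.setdefault(t, title)  # keep the first prediction per timestamp
        if start + lines_per_chunk >= len(input_lines):
            break
    return sorted(merged.items())
```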
Experimental Evaluation and Results
The performance of Chapter-Llama was evaluated on the VidChapters-7M benchmark, comparing primarily against the previous state-of-the-art, Vid2Seq. Vid2Seq utilizes a transformer architecture processing sampled frame features (100 equidistant frames) and ASR transcripts.
Quantitative Performance: Chapter-Llama demonstrates substantial improvements across all standard metrics.
| Metric | Task | Vid2Seq (Baseline) | Chapter-Llama (Ours) | Improvement |
|---|---|---|---|---|
| F1 Score | Boundary + Title | 26.7 | 45.3 | +18.6 |
| tIoU (0.5) | Boundary | 40.1 | 58.4 | +18.3 |
| SODA | Title Generation | 38.7 | 52.0 | +13.3 |
| CIDEr | Title Generation | 49.6 | 69.5 | +19.9 |
The gains are particularly notable for longer videos (medium: 15-30 min, long: 30-60 min), where context understanding is more critical.
Zero-Shot Capability: Even without finetuning on the VidChapters-7M dataset, the base Llama-3.1-8B-Instruct model, when prompted appropriately with the multimodal text input, achieved an F1 score of 29.5. This already surpasses the fully finetuned Vid2Seq baseline (26.7 F1), underscoring the inherent capability of modern LLMs for this task, although task-specific finetuning still yields a substantial further gain.
Ablation Studies: Key findings include:
- Finetuning: Essential for optimal performance, significantly improving over the zero-shot configuration.
- Multimodality: Combining ASR transcripts and frame captions yielded the best results, outperforming unimodal variants (speech-only or captions-only).
- Frame Selection: The proposed speech-guided frame selection outperformed alternative strategies (100 equidistant frames, 10-second interval sampling, shot boundary sampling) while being computationally cheaper due to processing far fewer frames.
- Iterative Prediction: Effectively handles videos exceeding the context window, showing improved performance compared to simple input truncation.
Implementation Considerations
Deploying Chapter-Llama involves several components:
- ASR System: Requires a robust ASR model like Whisper for accurate transcript generation and timestamp alignment.
- Image Captioning Model: Needs an efficient image captioning model (e.g., MiniCPM-V) for generating textual descriptions of selected frames. The cost here is mitigated by the sparse frame selection strategy.
- LLM: Utilizes a large context window LLM (Llama-3.1-8B-Instruct). Finetuning via LoRA is necessary for optimal performance, requiring annotated chaptering data (like VidChapters-7M) and appropriate compute resources (GPU memory for training/inference).
- Preprocessing Pipeline: Involves orchestrating ASR, the initial speech-only chapter prediction, frame extraction based on predicted boundaries, image captioning, and finally, formatting the interleaved text sequence for the main LLM.
- Inference Handling: Must implement the iterative prediction logic for videos exceeding the chosen context window length during inference.
The system's efficiency largely hinges on the speech-guided frame selection, which drastically reduces the invocations of the potentially expensive image captioning model compared to dense or uniform sampling approaches.
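Putting these pieces together, a high-level orchestration sketch (every helper name here is a hypothetical placeholder for the corresponding component described above, not an API from the paper's released code; `build_input_text`, `select_frame_timestamps`, and `chapter_long_video` refer to the earlier sketches):

```python
def chapter_video(video_path):
    """End-to-end chaptering pipeline, mirroring the steps described above."""
    # 1. ASR: timestamped utterances (e.g., via a Whisper-based transcriber).
    asr = transcribe_with_timestamps(video_path)              # [(start_s, text), ...]

    # 2. Speech-only pass: coarse boundaries used solely to pick frames to caption.
    frame_times = select_frame_timestamps(asr, video_duration(video_path),
                                          predict_boundaries_from_speech)

    # 3. Caption only the selected frames (e.g., with an image captioner like MiniCPM-V).
    captions = [(t, caption_frame(extract_frame(video_path, t))) for t in frame_times]

    # 4. Build the interleaved timestamped text and run the finetuned LLM,
    #    using iterative chunked prediction when the input exceeds the context window.
    text = build_input_text(asr, captions)
    return chapter_long_video(text.splitlines(), generate_chapters)
```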
Conclusion
Chapter-Llama presents an effective and computationally viable approach for automatic video chaptering in long-form content by reformulating the task within the text domain. Its core innovations include the use of a finetuned large context window LLM processing interleaved ASR transcripts and frame captions, and critically, a highly efficient speech-guided frame selection strategy. The framework establishes a new state-of-the-art on the VidChapters-7M benchmark, significantly outperforming prior methods, particularly on hour-long videos, demonstrating a promising direction for navigating and understanding extensive video repositories.