- The paper introduces a text-domain chaptering method that leverages finetuned LLMs to jointly predict chapter boundaries and titles, outperforming prior state-of-the-art techniques.
- It employs a speech-guided frame selection strategy, dramatically reducing processed frames and computational cost while maintaining semantic accuracy.
- Iterative prediction with interleaved ASR transcripts and frame captions scales the method for hour-long videos, achieving notable improvements on VidChapters-7M.
The paper "Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs" (2504.00072) introduces a framework for automatic video chaptering, defined as the dual task of temporally segmenting long videos into semantic units and generating descriptive titles for these segments. The core contribution lies in reformulating the problem within the text domain, enabling the application of LLMs with extensive context windows to process hour-long videos efficiently.
Methodology: Text-Domain Chaptering with LLMs
Chapter-Llama circumvents direct video feature processing by converting multimodal video information into a unified text sequence. This sequence serves as input to a finetuned LLM.
Input Representation: The framework relies on two primary data streams, each augmented with timestamps in HH:MM:SS format:
- Automatic Speech Recognition (ASR) Transcripts: Generated using Whisper-Large-V2 via WhisperX, providing the spoken content along with utterance start times.
- Frame Captions: Descriptive text for specific video frames, generated using the MiniCPM-V model.
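To make the text-domain input concrete, here is a minimal sketch of how timestamped ASR utterances and frame captions could be interleaved into a single chronological sequence (the tags and layout are illustrative assumptions, not the paper's exact template, which is described further under LLM Processing below):

```python
def to_hms(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS."""
    s = int(seconds)
    return f"{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}"

def build_input_text(asr_utterances, frame_captions):
    """Interleave timestamped ASR utterances and frame captions chronologically.

    asr_utterances: list of (start_seconds, text) from the ASR system.
    frame_captions: list of (frame_seconds, caption) from the image captioner.
    """
    events = [(t, "ASR", txt) for t, txt in asr_utterances]
    events += [(t, "Caption", txt) for t, txt in frame_captions]
    events.sort(key=lambda e: e[0])  # chronological order
    return "\n".join(f"{to_hms(t)} [{kind}] {txt}" for t, kind, txt in events)
```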
Speech-Guided Frame Selection: A key driver of efficiency is the strategy for selecting which frames to caption, avoiding the prohibitive cost of processing every frame. The authors propose a lightweight, speech-guided method (sketched in code below):
- A preliminary, speech-only variant of the chaptering LLM is trained using only ASR transcripts to predict coarse chapter boundaries.
- Video frames are sampled exclusively at the timestamps corresponding to these predicted boundaries. This significantly reduces the number of frames requiring captioning (averaging 10.3 frames per video on VidChapters-7M).
- For videos lacking speech content (approximately 3% in VidChapters-7M), a fallback mechanism samples frames uniformly at 10-second intervals, capped at 100 frames.
This targeted selection contrasts sharply with naive equidistant sampling or dense captioning, drastically lowering the computational burden associated with the vision component.
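A minimal sketch of this selection logic, assuming a callable that wraps the speech-only chaptering LLM (function and parameter names are illustrative, not from the released code):

```python
def select_frame_timestamps(asr_utterances, video_duration_s,
                            predict_boundaries_from_speech,
                            fallback_interval_s=10.0, fallback_cap=100):
    """Return timestamps (in seconds) at which frames are extracted and captioned.

    asr_utterances: list of (start_seconds, text); empty if the video has no speech.
    predict_boundaries_from_speech: callable mapping ASR utterances to a list of
        coarse boundary timestamps (the speech-only LLM variant; assumed interface).
    """
    if asr_utterances:
        # Speech-guided path: caption frames only at boundaries predicted from speech.
        return sorted(set(predict_boundaries_from_speech(asr_utterances)))
    # Fallback for speechless videos: uniform sampling every 10 s, capped at 100 frames.
    n_frames = min(fallback_cap, int(video_duration_s // fallback_interval_s) + 1)
    return [i * fallback_interval_s for i in range(n_frames)]
```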
LLM Processing and Finetuning:
- The timestamped ASR transcripts and selected frame captions are chronologically interleaved and formatted into a single text sequence. A task-specific instruction prompt precedes this sequence.
- The framework employs Llama-3.1-8B-Instruct as the base LLM. Crucially, this model is finetuned specifically for the chaptering task using Low-Rank Adaptation (LoRA); a minimal configuration sketch follows the output-format example below. Finetuning adapts the LLM to discern relevant semantic boundaries and adhere to the structured output format required for chaptering.
- The LLM is trained to jointly predict chapter boundaries and titles as a single output sequence, formatted as:
```
HH:MM:SS - Chapter Title 1
HH:MM:SS - Chapter Title 2
...
```
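As referenced above, the chaptering LLM is obtained by LoRA-finetuning Llama-3.1-8B-Instruct. A minimal configuration sketch using Hugging Face `transformers` and `peft` (assumed tooling; the rank, scaling, and target modules below are illustrative placeholders rather than the paper's reported hyperparameters):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Wrap the frozen base model with low-rank adapters; only adapter weights are trained.
lora_config = LoraConfig(
    r=16,                     # adapter rank (illustrative)
    lora_alpha=32,            # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 8B base parameters

# Training then maximizes the likelihood of the target "HH:MM:SS - Title" sequence
# given the instruction prompt and the interleaved timestamped input text.
```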
Handling Extended Context Lengths: For videos whose textual representation exceeds the LLM's operational context window (e.g., >15k tokens during training, >25k during inference), an iterative prediction strategy is implemented. The input text is segmented into overlapping chunks (e.g., 15k or 20k tokens). The LLM processes each chunk sequentially, and the resulting chapter predictions are merged to produce the final output for the entire video. This allows the model to scale beyond its inherent context length limitation.
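A simplified sketch of this iterative strategy, assuming a `generate_chapters(text) -> str` callable that wraps the finetuned LLM and returns "HH:MM:SS - Title" lines (the token estimate, overlap, and merge rule below are simplifications, not the paper's exact procedure):

```python
import re

CHAPTER_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}) - (.+)$")

def parse_chapters(output_text):
    """Parse 'HH:MM:SS - Title' lines into (seconds, title) pairs."""
    chapters = []
    for line in output_text.splitlines():
        m = CHAPTER_LINE.match(line.strip())
        if m:
            h, mi, s, title = int(m[1]), int(m[2]), int(m[3]), m[4]
            chapters.append((h * 3600 + mi * 60 + s, title))
    return chapters

def chapter_long_video(input_lines, generate_chapters,
                       max_tokens=20_000, tokens_per_line=40, overlap_fraction=0.1):
    """Chapter a video whose text representation exceeds the usable context window.

    input_lines: the interleaved, timestamped text lines for the whole video.
    """
    lines_per_chunk = max(1, max_tokens // tokens_per_line)       # crude token budget
    step = max(1, int(lines_per_chunk * (1 - overlap_fraction)))  # overlapping chunks
    merged = {}
    for start in range(0, len(input_lines), step):
        chunk = "\n".join(input_lines[start:start + lines_per_chunk])
        for t, title in parse_chapters(generate_chapters(chunk)):
            merged.setdefault(t, title)  # keep the first prediction per timestamp
        if start + lines_per_chunk >= len(input_lines):
            break
    return sorted(merged.items())
```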
Experimental Evaluation and Results
The performance of Chapter-Llama was evaluated on the VidChapters-7M benchmark, comparing primarily against the previous state-of-the-art, Vid2Seq. Vid2Seq utilizes a transformer architecture processing sampled frame features (100 equidistant frames) and ASR transcripts.
Quantitative Performance: Chapter-Llama demonstrates substantial improvements across all standard metrics.
| Metric | Task | Vid2Seq (Baseline) | Chapter-Llama (Ours) | Improvement |
|---|---|---|---|---|
| F1 Score | Boundary + Title | 26.7 | 45.3 | +18.6 |
| tIoU (0.5) | Boundary | 40.1 | 58.4 | +18.3 |
| SODA | Title Generation | 38.7 | 52.0 | +13.3 |
| CIDEr | Title Generation | 49.6 | 69.5 | +19.9 |
The gains are particularly notable for longer videos (medium: 15-30 min, long: 30-60 min), where context understanding is more critical.
Zero-Shot Capability: Even without finetuning on the VidChapters-7M dataset, the base Llama-3.1-8B-Instruct model, when prompted appropriately with the multimodal text input, achieved an F1 score of 29.5. This already surpasses the fully finetuned Vid2Seq baseline (26.7 F1), underscoring the inherent capability of modern LLMs for this task, although task-specific finetuning still yields a substantial further gain.
Ablation Studies: Key findings include:
- Finetuning: Essential for optimal performance, significantly improving over the zero-shot configuration.
- Multimodality: Combining ASR transcripts and frame captions yielded the best results, outperforming unimodal variants (speech-only or captions-only).
- Frame Selection: The proposed speech-guided frame selection outperformed alternative strategies (100 equidistant frames, 10-second interval sampling, shot boundary sampling) while being computationally cheaper due to processing far fewer frames.
- Iterative Prediction: Effectively handles videos exceeding the context window, showing improved performance compared to simple input truncation.
Implementation Considerations
Deploying Chapter-Llama involves several components:
- ASR System: Requires a robust ASR model like Whisper for accurate transcript generation and timestamp alignment.
- Image Captioning Model: Needs an efficient image captioning model (e.g., MiniCPM-V) for generating textual descriptions of selected frames. The cost here is mitigated by the sparse frame selection strategy.
- LLM: Utilizes a large context window LLM (Llama-3.1-8B-Instruct). Finetuning via LoRA is necessary for optimal performance, requiring annotated chaptering data (like VidChapters-7M) and appropriate compute resources (GPU memory for training/inference).
- Preprocessing Pipeline: Involves orchestrating ASR, the initial speech-only chapter prediction, frame extraction based on predicted boundaries, image captioning, and finally, formatting the interleaved text sequence for the main LLM.
- Inference Handling: Must implement the iterative prediction logic for videos exceeding the chosen context window length during inference.
The system's efficiency largely hinges on the speech-guided frame selection, which drastically reduces the invocations of the potentially expensive image captioning model compared to dense or uniform sampling approaches.
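Putting these pieces together, a high-level orchestration sketch (every helper name here is a hypothetical placeholder for the corresponding component described above, not an API from the paper's released code; `build_input_text`, `select_frame_timestamps`, and `chapter_long_video` refer to the earlier sketches):

```python
def chapter_video(video_path):
    """End-to-end chaptering pipeline, mirroring the steps described above."""
    # 1. ASR: timestamped utterances (e.g., via a Whisper-based transcriber).
    asr = transcribe_with_timestamps(video_path)              # [(start_s, text), ...]

    # 2. Speech-only pass: coarse boundaries used solely to pick frames to caption.
    frame_times = select_frame_timestamps(asr, video_duration(video_path),
                                          predict_boundaries_from_speech)

    # 3. Caption only the selected frames (e.g., with an image captioner like MiniCPM-V).
    captions = [(t, caption_frame(extract_frame(video_path, t))) for t in frame_times]

    # 4. Build the interleaved timestamped text and run the finetuned LLM,
    #    using iterative chunked prediction when the input exceeds the context window.
    text = build_input_text(asr, captions)
    return chapter_long_video(text.splitlines(), generate_chapters)
```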
Conclusion
Chapter-Llama presents an effective and computationally viable approach for automatic video chaptering in long-form content by reformulating the task within the text domain. Its core innovations include the use of a finetuned large context window LLM processing interleaved ASR transcripts and frame captions, and critically, a highly efficient speech-guided frame selection strategy. The framework establishes a new state-of-the-art on the VidChapters-7M benchmark, significantly outperforming prior methods, particularly on hour-long videos, demonstrating a promising direction for navigating and understanding extensive video repositories.