Multi-Pathway Text-Video Alignment (MPTVA)
- The paper introduces a framework that leverages LLM-based denoising and three complementary alignment pathways to match video segments with procedural steps.
- It employs a combination of step–narration–video, long-term, and short-term text-video similarity measures, optimized with MIL-NCE losses for robust step localization.
- Empirical results on procedural video benchmarks show significant improvements in step grounding, action localization, and narration alignment over previous methods.
Multi-Pathway Text-Video Alignment (MPTVA) is a framework for cross-modal alignment of procedural or instructional video content with structured textual representations of step-level instructions. It addresses the challenges of noisy, misaligned narration in instructional videos and the scarcity of temporally dense annotations by leveraging multiple complementary semantic matching pathways, guided and denoised by LLM-based step extraction. This approach yields robust pseudo-labels for downstream step localization, grounding, and retrieval tasks, surpassing prior methods on established procedural video benchmarks (Chen et al., 2024).
1. Motivation and Problem Setting
Instructional videos on platforms such as YouTube exhibit broad task coverage but rarely provide ground-truth temporal boundaries for procedural steps. Existing models relying on Automatic Speech Recognition (ASR) transcripts face significant limitations: narrations may include task-irrelevant content, and timestamps are frequently unreliable due to misalignment between spoken and performed actions. MPTVA mitigates these problems by:
- Extracting a filtered, procedural step list ("LLM-steps") using a prompt-driven LLM (e.g., LLaMA 2-7B) to summarize and denoise narration.
- Creating robust step-to-segment pseudo-labels by aggregating evidence from three distinct matching pathways, each capturing complementary alignment signals between LLM-steps and video segments.
2. Formal Pathway Definitions
Let $S = \{s_1, \dots, s_{N_s}\}$ denote the LLM-extracted procedural steps, $V = \{v_1, \dots, v_{N_v}\}$ the sequence of short video segments (1 sec/segment), and $N = \{n_1, \dots, n_{N_n}\}$ the set of ASR narration segments with timestamps. Encoders are denoted by $f_t$ (text) and $f_v$ (video), with all outputs L2-normalized. A temperature parameter $\tau$ is used for softmax operations.
2.1. Step–Narration–Video (S→N→V) Alignment
- Compute a soft step–narration alignment matrix $A^{SN} \in \mathbb{R}^{N_s \times N_n}$ by taking, for each step, a softmax over narrations: $A^{SN}_{ij} = \mathrm{softmax}_j\big(f_t(s_i)^\top f_t(n_j)/\tau\big)$.
- Convert narration timestamps to a binary matrix $Y^{NV} \in \{0,1\}^{N_n \times N_v}$: $Y^{NV}_{jk} = 1$ iff segment $v_k$ falls within narration $n_j$'s timestamp interval.
- The S→N→V score is $A^{SNV} = A^{SN} Y^{NV}$ (see the sketch after this list).
- Loss: Apply Multiple Instance Learning Noise-Contrastive Estimation (MIL-NCE):
  $$\mathcal{L}_{SNV} = -\frac{1}{N_s}\sum_{i=1}^{N_s} \log \frac{\sum_{k \in \mathcal{P}(i)} \exp\big(\hat{A}_{ik}/\eta\big)}{\sum_{k=1}^{N_v} \exp\big(\hat{A}_{ik}/\eta\big)},$$
  where $\hat{A}_{ik}$ is the predicted step–segment similarity, $\mathcal{P}(i)$ is the set of pseudo-positive segments for step $s_i$, and $\eta$ is an additional temperature.
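The following is a minimal PyTorch sketch of the S→N→V score computation, not the authors' implementation: tensor names, dimensions, the temperature value, and the synthetic timestamps are illustrative, and all embeddings are assumed to be precomputed and L2-normalized.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: N_s steps, N_n narrations, N_v one-second segments, dim d.
N_s, N_n, N_v, d = 8, 20, 300, 256
tau = 0.07  # softmax temperature (illustrative value)

# Assumed precomputed, L2-normalized text embeddings f_t(s_i) and f_t(n_j).
step_emb = F.normalize(torch.randn(N_s, d), dim=-1)
narr_emb = F.normalize(torch.randn(N_n, d), dim=-1)

# Narration timestamps as (start_sec, end_sec); synthetic here for illustration.
starts = torch.arange(N_n, dtype=torch.float32) * 15.0
ends = starts + 10.0

# Soft step-narration alignment A^{SN}: softmax over narrations for each step.
A_sn = torch.softmax(step_emb @ narr_emb.T / tau, dim=1)          # (N_s, N_n)

# Binary narration-segment matrix Y^{NV}: 1 iff segment center falls in the span.
seg_centers = torch.arange(N_v, dtype=torch.float32) + 0.5        # 1-sec segments
Y_nv = ((seg_centers[None, :] >= starts[:, None]) &
        (seg_centers[None, :] < ends[:, None])).float()           # (N_n, N_v)

# S->N->V pathway score: propagate step-narration alignment through timestamps.
A_snv = A_sn @ Y_nv                                               # (N_s, N_v)
```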
2.2. Direct Step–Video Long-Term (S–V) Similarity
- Use a text–video model pretrained on long instructional videos, with encoders $g_t$, $g_v$: $A^{\text{long}}_{ik} = g_t(s_i)^\top g_v(v_k)$.
- MIL-NCE (long-term) loss:
  $$\mathcal{L}_{\text{long}} = -\frac{1}{N_s}\sum_{i=1}^{N_s} \log \frac{\exp\big(\hat{A}_{ik^*_i}/\eta\big)}{\sum_{k=1}^{N_v} \exp\big(\hat{A}_{ik}/\eta\big)},$$
  where $k^*_i = \arg\max_k A^{\text{long}}_{ik}$ is the index of the top-matching video segment for $s_i$. A generic form of this loss is sketched after this list.
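Below is a minimal sketch of a MIL-NCE-style objective in the form used above, assuming a predicted step–segment similarity matrix and a binary mask of (pseudo-)positive segments per step; the function name, default temperature, and example inputs are illustrative rather than the paper's settings.

```python
import torch

def mil_nce_loss(sim: torch.Tensor, pos_mask: torch.Tensor, eta: float = 0.07) -> torch.Tensor:
    """MIL-NCE-style loss (sketch, not the authors' code).

    sim:      (N_s, N_v) predicted step-segment similarities.
    pos_mask: (N_s, N_v) binary mask of positive segments per step
              (e.g., the top-matching segment and its window neighbors).
    eta:      temperature (illustrative default).
    """
    logits = sim / eta
    # log-sum-exp over positives minus log-sum-exp over all segments.
    pos_logits = logits.masked_fill(pos_mask == 0, float('-inf'))
    log_num = torch.logsumexp(pos_logits, dim=1)
    log_den = torch.logsumexp(logits, dim=1)
    return -(log_num - log_den).mean()

# Example: positives = argmax segment per step, as in the long-term pathway loss.
sim = torch.randn(8, 300)
pos_mask = torch.zeros_like(sim)
pos_mask[torch.arange(8), sim.argmax(dim=1)] = 1.0
loss = mil_nce_loss(sim, pos_mask)
```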
2.3. Direct Step–Video Short-Term (S–V) Fine-Grained Similarity
- Use a short-video foundation model (e.g., InternVideo) with encoders $h_t$, $h_v$: $A^{\text{short}}_{ik} = h_t(s_i)^\top h_v(v_k)$.
- Apply the MIL-NCE loss in the same form as above; both direct pathways are sketched after this list.
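A small sketch of both direct pathways, assuming precomputed, L2-normalized step and segment features from the long-term pretrained model ($g$) and the short-term foundation model ($h$); the feature extractors themselves are not shown, and the feature dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

N_s, N_v, d_long, d_short = 8, 300, 512, 768   # illustrative dimensions

# Assumed precomputed features from the two pretrained text-video models.
g_steps = F.normalize(torch.randn(N_s, d_long), dim=-1)    # g_t(s_i)
g_segs  = F.normalize(torch.randn(N_v, d_long), dim=-1)    # g_v(v_k)
h_steps = F.normalize(torch.randn(N_s, d_short), dim=-1)   # h_t(s_i)
h_segs  = F.normalize(torch.randn(N_v, d_short), dim=-1)   # h_v(v_k)

# Cosine similarities (dot products of L2-normalized features).
A_long  = g_steps @ g_segs.T    # (N_s, N_v) long-term pathway
A_short = h_steps @ h_segs.T    # (N_s, N_v) short-term pathway
```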
3. Pathway Fusion and Pseudo-Label Generation
The final step–video matching matrix is formed by averaging the three pathway scores:
$$A = \tfrac{1}{3}\big(A^{SNV} + A^{\text{long}} + A^{\text{short}}\big).$$
Optionally, pathway weights $w_1, w_2, w_3$ can be learned, subject to $\sum_p w_p = 1$. Pseudo-labels are computed by selecting, for each step $s_i$, the top-matching segment $k^*_i = \arg\max_k A_{ik}$ (possibly including neighbors within a window $W$ around it), provided $A_{ik^*_i} > \gamma$, where $\gamma$ is a similarity threshold, as sketched below.
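A sketch of the fusion and pseudo-label selection described above; the window size and threshold defaults are illustrative placeholders, not the paper's reported settings.

```python
import torch

def fuse_and_label(A_snv, A_long, A_short, window: int = 2, gamma: float = 0.5):
    """Average the three pathway scores and derive per-step pseudo-labels (sketch)."""
    A = (A_snv + A_long + A_short) / 3.0                    # fused matrix, (N_s, N_v)
    N_s, N_v = A.shape
    pos_mask = torch.zeros_like(A)
    for i in range(N_s):
        k_star = int(A[i].argmax())
        if A[i, k_star] > gamma:                            # keep only confident steps
            lo, hi = max(0, k_star - window), min(N_v, k_star + window + 1)
            pos_mask[i, lo:hi] = 1.0                        # top segment + window neighbors
    return A, pos_mask
```

Here `pos_mask` plays the role of the pseudo-positive sets $\mathcal{P}(i)$ in the MIL-NCE losses above.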
4. LLM-Based Step Extraction and Pre-Processing
LLM-based denoising is performed by prompting LLaMA 2-7B to summarize and filter procedure steps from chunked ASR narrations. The prompt instructs the LLM to extract task-relevant steps and discard colloquial or irrelevant content. Narration sentences are chunked into groups of 10, each chunk is prompted, and all resulting step outputs are concatenated, forming the step set $S$ for each video. This yields a structured, denoised step sequence directly suitable for alignment.
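The chunk-and-prompt procedure can be sketched as follows; the prompt wording is hypothetical (not the paper's exact prompt), and the `generate` callable stands in for any LLaMA 2-7B text-generation interface.

```python
from typing import Callable, List

# Hypothetical prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    "Below are ASR sentences from an instructional video. "
    "Summarize only the task-relevant procedure steps, one per line, "
    "and ignore colloquial or irrelevant content.\n\n{sentences}"
)

def extract_llm_steps(narrations: List[str],
                      generate: Callable[[str], str],
                      chunk_size: int = 10) -> List[str]:
    """Chunk narrations into groups of 10, prompt the LLM, and concatenate the steps."""
    steps: List[str] = []
    for i in range(0, len(narrations), chunk_size):
        chunk = "\n".join(narrations[i:i + chunk_size])
        output = generate(PROMPT_TEMPLATE.format(sentences=chunk))
        steps.extend(line.strip() for line in output.splitlines() if line.strip())
    return steps
```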
5. Model Architecture and Optimization
- Video backbone: Frozen S3D network (16 FPS), producing 1-sec segments, projected to 256 dimensions.
- Text backbone: Bag-of-Words Word2Vec, projected to 256 dimensions.
- Unimodal encoders: 2-layer Transformers (256-D, 8 heads) for both video ($f_v$, with positional embeddings) and text ($f_t$).
- Joint encoder: 2-layer Transformer (256 D) operating on concatenated segment and step tokens.
- Similarity computation: Final cosine similarity between 256 D output embeddings for segments and steps.
- Optimization: AdamW with weight decay and a cosine-decay learning-rate schedule, batch size 32, 12 epochs.
- Training loss: Combined MIL-NCE across all three pathways (the alignment head producing $\hat{A}$ is sketched after this list):
  $$\mathcal{L} = \lambda_1 \mathcal{L}_{SNV} + \lambda_2 \mathcal{L}_{\text{long}} + \lambda_3 \mathcal{L}_{\text{short}}.$$
  Equally weighted coefficients ($\lambda_1 = \lambda_2 = \lambda_3$) were found effective in practice.
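The alignment head can be sketched roughly as below, assuming precomputed frozen-S3D segment features and bag-of-words step features as inputs. Layer sizes follow the bullets above, while the input feature dimensions, activation, initialization, and pooling choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Rough sketch of the unimodal + joint Transformer alignment head (not the authors' code)."""

    def __init__(self, vid_in=512, txt_in=300, d=256, heads=8, layers=2, max_len=1024):
        # vid_in / txt_in are assumed input feature dims (S3D and BoW Word2Vec features).
        super().__init__()
        self.vid_proj = nn.Linear(vid_in, d)                   # project segment features to 256-D
        self.txt_proj = nn.Linear(txt_in, d)                   # project step features to 256-D
        self.pos_emb = nn.Parameter(torch.zeros(max_len, d))   # positional embeddings for video tokens

        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=layers)

        self.vid_enc = make_encoder()     # unimodal video encoder
        self.txt_enc = make_encoder()     # unimodal text encoder
        self.joint_enc = make_encoder()   # joint encoder over concatenated tokens

    def forward(self, seg_feats, step_feats):
        # seg_feats: (B, N_v, vid_in) segment features; step_feats: (B, N_s, txt_in) step features.
        v = self.vid_enc(self.vid_proj(seg_feats) + self.pos_emb[: seg_feats.size(1)])
        s = self.txt_enc(self.txt_proj(step_feats))
        joint = self.joint_enc(torch.cat([v, s], dim=1))
        v_out, s_out = joint[:, : v.size(1)], joint[:, v.size(1):]
        # Cosine similarity between 256-D step and segment embeddings: (B, N_s, N_v).
        return F.normalize(s_out, dim=-1) @ F.normalize(v_out, dim=-1).transpose(1, 2)

model = AlignmentHead()
sim = model(torch.randn(2, 300, 512), torch.randn(2, 8, 300))   # similarity matrix of shape (2, 8, 300)
```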
6. Empirical Results and Comparison
MPTVA was evaluated on three key procedural video tasks, outperforming previous state-of-the-art models such as VINA. The table below summarizes the reported gains (Chen et al., 2024):
| Task | Metric | VINA | MPTVA | Δ |
|---|---|---|---|---|
| Procedure Step Grounding (HT-Step) | R@1 (%) | 37.4 | 43.3 | +5.9 |
| Action Step Localization (CrossTask) | Avg R@1 (%) | 44.8 | 47.9 | +3.1 |
| Narration Grounding (HTM-Align) | R@1 (%) | 66.5 | 69.3 | +2.8 |
These improvements establish MPTVA's effectiveness in generating robust pseudo-labels for fine-grained temporal action localization.
7. Related Approaches and Multi-Pathway Variants
Parallel research in multi-stream alignment for video-text retrieval employs complementary perspectives (e.g., Fusion/Entity/Action decomposition) within a dual-softmax, multi-expert architecture (CAMoE), achieving state-of-the-art retrieval performance across large video-text datasets (Cheng et al., 2021). In these settings, multi-pathway modeling mitigates overfitting and content heterogeneity, with each expert capturing distinct semantic granularity. The MPTVA principle similarly exploits complementary alignment signals—here, through timestamped, global, and local alignment modeling—filtered by LLMs for robust cross-modal supervision. This suggests that multi-pathway alignment is consistently advantageous for complex cross-modal retrieval and localization settings.
In sum, Multi-Pathway Text-Video Alignment (MPTVA) advances the state of temporal action localization and procedure understanding in instructional videos by synthesizing complementary alignment mechanisms, leveraging LLM-denoised steps, and providing reliable supervisory signals for dense, fine-grained text-video grounding (Chen et al., 2024).