Multi-Pathway Text-Video Alignment (MPTVA)
- The paper introduces a framework that leverages LLM-based denoising and three complementary alignment pathways to match video segments with procedural steps.
- It employs a combination of step–narration–video, long-term, and short-term text-video similarity measures, optimized with MIL-NCE losses for robust step localization.
- Empirical results on procedural video benchmarks show significant improvements in step grounding, action localization, and narration alignment over previous methods.
Multi-Pathway Text-Video Alignment (MPTVA) is a framework for cross-modal alignment of procedural or instructional video content with structured textual representations of step-level instructions. It addresses the challenges of noisy, misaligned narration in instructional videos and the scarcity of temporally dense annotations by leveraging multiple complementary semantic matching pathways, guided and denoised by LLM-based step extraction. This approach yields robust pseudo-labels for downstream step localization, grounding, and retrieval tasks, surpassing prior methods on established procedural video benchmarks (Chen et al., 2024).
1. Motivation and Problem Setting
Instructional videos on platforms such as YouTube exhibit broad task coverage but rarely provide ground-truth temporal boundaries for procedural steps. Existing models relying on Automatic Speech Recognition (ASR) transcripts face significant limitations: narrations may include task-irrelevant content, and timestamps are frequently unreliable due to misalignment between spoken and performed actions. MPTVA mitigates these problems by:
- Extracting a filtered, procedural step list ("LLM-steps") using a prompt-driven LLM (e.g., LLaMA 2-7B) to summarize and denoise narration.
- Creating robust step-to-segment pseudo-labels by aggregating evidence from three distinct matching pathways, each capturing complementary alignment signals between LLM-steps and video segments.
2. Formal Pathway Definitions
Let $S = \{s_1, \dots, s_{N_s}\}$ denote the LLM-extracted procedural steps, $V = \{v_1, \dots, v_{N_v}\}$ the sequence of short video segments (1 sec/segment), and $N = \{n_1, \dots, n_{N_n}\}$ the set of ASR narration segments with timestamps. Encoders are denoted by $f_t$ (text) and $f_v$ (video), with all outputs L2-normalized. A temperature parameter $\tau$ is used for softmax operations.
2.1. Step–Narration–Video (S→N→V) Alignment
- Compute a soft step–narration alignment matrix $A^{SN} \in \mathbb{R}^{N_s \times N_n}$ by taking, for each step, a softmax over narrations: $A^{SN}_{ij} = \mathrm{softmax}_j\big(f_t(s_i)^\top f_t(n_j)/\tau\big)$.
- Convert narration timestamps to a binary matrix $Y^{NV} \in \{0,1\}^{N_n \times N_v}$: $Y^{NV}_{jk} = 1$ iff segment $v_k$ falls within narration $n_j$'s timestamp interval.
- The S→N→V score is $A^{SNV} = A^{SN} Y^{NV}$ (see the sketch after this list).
- Loss: Apply Multiple Instance Learning Noise-Contrastive Estimation (MIL-NCE):
  $$\mathcal{L}_{SNV} = -\frac{1}{N_s}\sum_{i=1}^{N_s} \log \frac{\sum_{k \in \mathcal{P}(i)} \exp\big(\hat{A}_{ik}/\eta\big)}{\sum_{k=1}^{N_v} \exp\big(\hat{A}_{ik}/\eta\big)},$$
  where $\hat{A}_{ik}$ is the predicted step–segment similarity, $\mathcal{P}(i)$ is the set of pseudo-positive segments for step $s_i$, and $\eta$ is an additional temperature.
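The following is a minimal PyTorch sketch of the S→N→V score computation, not the authors' implementation: tensor names, dimensions, the temperature value, and the synthetic timestamps are illustrative, and all embeddings are assumed to be precomputed and L2-normalized.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: N_s steps, N_n narrations, N_v one-second segments, dim d.
N_s, N_n, N_v, d = 8, 20, 300, 256
tau = 0.07  # softmax temperature (illustrative value)

# Assumed precomputed, L2-normalized text embeddings f_t(s_i) and f_t(n_j).
step_emb = F.normalize(torch.randn(N_s, d), dim=-1)
narr_emb = F.normalize(torch.randn(N_n, d), dim=-1)

# Narration timestamps as (start_sec, end_sec); synthetic here for illustration.
starts = torch.arange(N_n, dtype=torch.float32) * 15.0
ends = starts + 10.0

# Soft step-narration alignment A^{SN}: softmax over narrations for each step.
A_sn = torch.softmax(step_emb @ narr_emb.T / tau, dim=1)          # (N_s, N_n)

# Binary narration-segment matrix Y^{NV}: 1 iff segment center falls in the span.
seg_centers = torch.arange(N_v, dtype=torch.float32) + 0.5        # 1-sec segments
Y_nv = ((seg_centers[None, :] >= starts[:, None]) &
        (seg_centers[None, :] < ends[:, None])).float()           # (N_n, N_v)

# S->N->V pathway score: propagate step-narration alignment through timestamps.
A_snv = A_sn @ Y_nv                                               # (N_s, N_v)
```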
2.2. Direct Step–Video Long-Term (S–V) Similarity
- Use a text–video model pretrained on long instructional videos, with encoders $g_t$, $g_v$: $A^{\text{long}}_{ik} = g_t(s_i)^\top g_v(v_k)$.
- MIL-NCE (long-term) loss:
  $$\mathcal{L}_{\text{long}} = -\frac{1}{N_s}\sum_{i=1}^{N_s} \log \frac{\exp\big(\hat{A}_{ik^*_i}/\eta\big)}{\sum_{k=1}^{N_v} \exp\big(\hat{A}_{ik}/\eta\big)},$$
  where $k^*_i = \arg\max_k A^{\text{long}}_{ik}$ is the index of the top-matching video segment for $s_i$. A generic form of this loss is sketched after this list.
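Below is a minimal sketch of a MIL-NCE-style objective in the form used above, assuming a predicted step–segment similarity matrix and a binary mask of (pseudo-)positive segments per step; the function name, default temperature, and example inputs are illustrative rather than the paper's settings.

```python
import torch

def mil_nce_loss(sim: torch.Tensor, pos_mask: torch.Tensor, eta: float = 0.07) -> torch.Tensor:
    """MIL-NCE-style loss (sketch, not the authors' code).

    sim:      (N_s, N_v) predicted step-segment similarities.
    pos_mask: (N_s, N_v) binary mask of positive segments per step
              (e.g., the top-matching segment and its window neighbors).
    eta:      temperature (illustrative default).
    """
    logits = sim / eta
    # log-sum-exp over positives minus log-sum-exp over all segments.
    pos_logits = logits.masked_fill(pos_mask == 0, float('-inf'))
    log_num = torch.logsumexp(pos_logits, dim=1)
    log_den = torch.logsumexp(logits, dim=1)
    return -(log_num - log_den).mean()

# Example: positives = argmax segment per step, as in the long-term pathway loss.
sim = torch.randn(8, 300)
pos_mask = torch.zeros_like(sim)
pos_mask[torch.arange(8), sim.argmax(dim=1)] = 1.0
loss = mil_nce_loss(sim, pos_mask)
```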
2.3. Direct Step–Video Short-Term (S–V) Fine-Grained Similarity
- Use a short-video foundation model (e.g., InternVideo) with encoders $h_t$, $h_v$: $A^{\text{short}}_{ik} = h_t(s_i)^\top h_v(v_k)$.
- Apply the MIL-NCE loss in the same form as above; both direct pathways are sketched after this list.
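A small sketch of both direct pathways, assuming precomputed, L2-normalized step and segment features from the long-term pretrained model ($g$) and the short-term foundation model ($h$); the feature extractors themselves are not shown, and the feature dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

N_s, N_v, d_long, d_short = 8, 300, 512, 768   # illustrative dimensions

# Assumed precomputed features from the two pretrained text-video models.
g_steps = F.normalize(torch.randn(N_s, d_long), dim=-1)    # g_t(s_i)
g_segs  = F.normalize(torch.randn(N_v, d_long), dim=-1)    # g_v(v_k)
h_steps = F.normalize(torch.randn(N_s, d_short), dim=-1)   # h_t(s_i)
h_segs  = F.normalize(torch.randn(N_v, d_short), dim=-1)   # h_v(v_k)

# Cosine similarities (dot products of L2-normalized features).
A_long  = g_steps @ g_segs.T    # (N_s, N_v) long-term pathway
A_short = h_steps @ h_segs.T    # (N_s, N_v) short-term pathway
```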
3. Pathway Fusion and Pseudo-Label Generation
The final step–video matching matrix is formed by averaging the three pathway scores:
$$A = \tfrac{1}{3}\big(A^{SNV} + A^{\text{long}} + A^{\text{short}}\big).$$
Optionally, pathway weights $w_1, w_2, w_3$ can be learned, subject to $\sum_p w_p = 1$. Pseudo-labels are computed by selecting, for each step $s_i$, the top-matching segment $k^*_i = \arg\max_k A_{ik}$ (possibly including neighbors within a window $W$ around it), provided $A_{ik^*_i} > \gamma$, where $\gamma$ is a similarity threshold, as sketched below.
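A sketch of the fusion and pseudo-label selection described above; the window size and threshold defaults are illustrative placeholders, not the paper's reported settings.

```python
import torch

def fuse_and_label(A_snv, A_long, A_short, window: int = 2, gamma: float = 0.5):
    """Average the three pathway scores and derive per-step pseudo-labels (sketch)."""
    A = (A_snv + A_long + A_short) / 3.0                    # fused matrix, (N_s, N_v)
    N_s, N_v = A.shape
    pos_mask = torch.zeros_like(A)
    for i in range(N_s):
        k_star = int(A[i].argmax())
        if A[i, k_star] > gamma:                            # keep only confident steps
            lo, hi = max(0, k_star - window), min(N_v, k_star + window + 1)
            pos_mask[i, lo:hi] = 1.0                        # top segment + window neighbors
    return A, pos_mask
```

Here `pos_mask` plays the role of the pseudo-positive sets $\mathcal{P}(i)$ in the MIL-NCE losses above.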
4. LLM-Based Step Extraction and Pre-Processing
LLM-based denoising is performed by prompting LLaMA 2-7B to summarize and filter procedure steps from chunked ASR narrations. The prompt instructs the LLM to extract task-relevant steps and discard colloquial or irrelevant content. Narration sentences are chunked into groups of 10, each chunk is prompted, and all resulting step outputs are concatenated, forming the step set $S$ for each video. This yields a structured, denoised step sequence directly suitable for alignment.
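The chunk-and-prompt procedure can be sketched as follows; the prompt wording is hypothetical (not the paper's exact prompt), and the `generate` callable stands in for any LLaMA 2-7B text-generation interface.

```python
from typing import Callable, List

# Hypothetical prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    "Below are ASR sentences from an instructional video. "
    "Summarize only the task-relevant procedure steps, one per line, "
    "and ignore colloquial or irrelevant content.\n\n{sentences}"
)

def extract_llm_steps(narrations: List[str],
                      generate: Callable[[str], str],
                      chunk_size: int = 10) -> List[str]:
    """Chunk narrations into groups of 10, prompt the LLM, and concatenate the steps."""
    steps: List[str] = []
    for i in range(0, len(narrations), chunk_size):
        chunk = "\n".join(narrations[i:i + chunk_size])
        output = generate(PROMPT_TEMPLATE.format(sentences=chunk))
        steps.extend(line.strip() for line in output.splitlines() if line.strip())
    return steps
```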
5. Model Architecture and Optimization
- Video backbone: Frozen S3D network (16 FPS), producing 1-sec segments, projected to 256 dimensions.
- Text backbone: Bag-of-Words Word2Vec, projected to 256 dimensions.
- Unimodal encoders: 2-layer Transformers (256-D, 8 heads) for both video ($f_v$, with positional embeddings) and text ($f_t$).
- Joint encoder: 2-layer Transformer (256 D) operating on concatenated segment and step tokens.
- Similarity computation: Final cosine similarity between 256 D output embeddings for segments and steps.
- Optimization: AdamW with weight decay and a cosine-decay learning-rate schedule, batch size 32, 12 epochs.
- Training loss: Combined MIL-NCE across all three pathways (the alignment head producing $\hat{A}$ is sketched after this list):
  $$\mathcal{L} = \lambda_1 \mathcal{L}_{SNV} + \lambda_2 \mathcal{L}_{\text{long}} + \lambda_3 \mathcal{L}_{\text{short}}.$$
  Equally weighted coefficients ($\lambda_1 = \lambda_2 = \lambda_3$) were found effective in practice.
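The alignment head can be sketched roughly as below, assuming precomputed frozen-S3D segment features and bag-of-words step features as inputs. Layer sizes follow the bullets above, while the input feature dimensions, activation, initialization, and pooling choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Rough sketch of the unimodal + joint Transformer alignment head (not the authors' code)."""

    def __init__(self, vid_in=512, txt_in=300, d=256, heads=8, layers=2, max_len=1024):
        # vid_in / txt_in are assumed input feature dims (S3D and BoW Word2Vec features).
        super().__init__()
        self.vid_proj = nn.Linear(vid_in, d)                   # project segment features to 256-D
        self.txt_proj = nn.Linear(txt_in, d)                   # project step features to 256-D
        self.pos_emb = nn.Parameter(torch.zeros(max_len, d))   # positional embeddings for video tokens

        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=layers)

        self.vid_enc = make_encoder()     # unimodal video encoder
        self.txt_enc = make_encoder()     # unimodal text encoder
        self.joint_enc = make_encoder()   # joint encoder over concatenated tokens

    def forward(self, seg_feats, step_feats):
        # seg_feats: (B, N_v, vid_in) segment features; step_feats: (B, N_s, txt_in) step features.
        v = self.vid_enc(self.vid_proj(seg_feats) + self.pos_emb[: seg_feats.size(1)])
        s = self.txt_enc(self.txt_proj(step_feats))
        joint = self.joint_enc(torch.cat([v, s], dim=1))
        v_out, s_out = joint[:, : v.size(1)], joint[:, v.size(1):]
        # Cosine similarity between 256-D step and segment embeddings: (B, N_s, N_v).
        return F.normalize(s_out, dim=-1) @ F.normalize(v_out, dim=-1).transpose(1, 2)

model = AlignmentHead()
sim = model(torch.randn(2, 300, 512), torch.randn(2, 8, 300))   # similarity matrix of shape (2, 8, 300)
```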
6. Empirical Results and Comparison
MPTVA was evaluated on three key procedural video tasks, outperforming previous state-of-the-art models such as VINA. The table below summarizes the reported gains (Chen et al., 2024):
| Task | Metric | VINA | MPTVA | Δ |
|---|---|---|---|---|
| Procedure Step Grounding (HT-Step) | R@1 (%) | 37.4 | 43.3 | +5.9 |
| Action Step Localization (CrossTask) | Avg R@1 (%) | 44.8 | 47.9 | +3.1 |
| Narration Grounding (HTM-Align) | R@1 (%) | 66.5 | 69.3 | +2.8 |
These improvements establish MPTVA's effectiveness in generating robust pseudo-labels for fine-grained temporal action localization.
7. Related Approaches and Multi-Pathway Variants
Parallel research in multi-stream alignment for video-text retrieval employs complementary perspectives (e.g., Fusion/Entity/Action decomposition) within a dual-softmax, multi-expert architecture (CAMoE), achieving state-of-the-art retrieval performance across large video-text datasets (Cheng et al., 2021). In these settings, multi-pathway modeling mitigates overfitting and content heterogeneity, with each expert capturing distinct semantic granularity. The MPTVA principle similarly exploits complementary alignment signals—here, through timestamped, global, and local alignment modeling—filtered by LLMs for robust cross-modal supervision. This suggests that multi-pathway alignment is consistently advantageous for complex cross-modal retrieval and localization settings.
In sum, Multi-Pathway Text-Video Alignment (MPTVA) advances the state of temporal action localization and procedure understanding in instructional videos by synthesizing complementary alignment mechanisms, leveraging LLM-denoised steps, and providing reliable supervisory signals for dense, fine-grained text-video grounding (Chen et al., 2024).