Multi-Pathway Text-Video Alignment (MPTVA)

  • The paper introduces a framework that leverages LLM-based denoising and three complementary alignment pathways to match video segments with procedural steps.
  • It employs a combination of step–narration–video, long-term, and short-term text-video similarity measures, optimized with MIL-NCE losses for robust step localization.
  • Empirical results on procedural video benchmarks show significant improvements in step grounding, action localization, and narration alignment over previous methods.

Multi-Pathway Text-Video Alignment (MPTVA) is a framework for cross-modal alignment of procedural or instructional video content with structured textual representations of step-level instructions. It addresses the challenges of noisy, misaligned narration in instructional videos and the scarcity of temporally dense annotations by leveraging multiple complementary semantic matching pathways, guided and denoised by LLM-based step extraction. This approach yields robust pseudo-labels for downstream step localization, grounding, and retrieval tasks, surpassing prior methods on established procedural video benchmarks (Chen et al., 2024).

1. Motivation and Problem Setting

Instructional videos on platforms such as YouTube exhibit broad task coverage but rarely provide ground-truth temporal boundaries for procedural steps. Existing models relying on Automatic Speech Recognition (ASR) transcripts face significant limitations: narrations may include task-irrelevant content, and timestamps are frequently unreliable due to misalignment between spoken and performed actions. MPTVA mitigates these problems by:

  • Extracting a filtered, procedural step list ("LLM-steps") using a prompt-driven LLM (e.g., LLaMA 2-7B) to summarize and denoise narration.
  • Creating robust step-to-segment pseudo-labels by aggregating evidence from three distinct matching pathways, each capturing complementary alignment signals between LLM-steps and video segments.

2. Formal Pathway Definitions

Let $S = \{s_1,\ldots,s_L\}$ denote the LLM-extracted procedural steps, $V = \{v_1,\ldots,v_T\}$ the sequence of $T$ short video segments ($\approx 1$ sec per segment), and $N = \{n_1,\ldots,n_K\}$ the set of ASR narration segments with timestamps. Encoders are denoted $E_t(\cdot)$ (text) and $E_v(\cdot)$ (video), with all outputs L2-normalized. A temperature parameter $\tau$ is used for softmax operations.

2.1. Step–Narration–Video (S→N→V) Alignment

  • Compute a soft alignment matrix $A_{SN} \in \mathbb{R}^{L \times K}$ by:

$$A_{SN}[i,k] = \mathrm{Softmax}_k\!\left(\frac{E_t(s_i) \cdot E_t(n_k)}{\tau}\right)$$

  • Convert narration timestamps to a binary matrix $Y^{NV} \in \{0,1\}^{K \times T}$: $Y^{NV}_{k,t} = 1$ iff $v_t$ falls within $n_k$'s timestamp interval.
  • The S→N→V score is $A_{SNV} = A_{SN} \cdot Y^{NV} \in \mathbb{R}^{L \times T}$.
  • Loss: Apply Multiple Instance Learning Noise-Contrastive Estimation (MIL-NCE):

$$L_{SNV} = -\frac{1}{L}\sum_{i=1}^{L}\sum_{t=1}^{T} Y^{SNV}_{i,t} \log \frac{\exp(\hat{A}_{i,t}/\eta)}{\sum_{t'}\exp(\hat{A}_{i,t'}/\eta)}$$

where $\hat{A}$ is the model's predicted step-segment similarity matrix, $Y^{SNV}$ the pseudo-label mask derived from $A_{SNV}$, and $\eta$ an additional temperature.
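
To make the pathway concrete, the following minimal PyTorch sketch computes $A_{SN}$ and propagates it through $Y^{NV}$. The encoder outputs are represented here as pre-computed, L2-normalized embeddings, and the temperature value is an illustrative assumption rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

def step_narration_video_scores(step_emb, narr_emb, narr_to_video, tau=0.07):
    """Step -> narration -> video pathway (illustrative sketch).

    step_emb:      (L, d) L2-normalized LLM-step embeddings E_t(s_i)
    narr_emb:      (K, d) L2-normalized narration embeddings E_t(n_k)
    narr_to_video: (K, T) binary matrix Y^NV; Y[k, t] = 1 iff segment v_t
                   falls inside narration n_k's ASR timestamp interval
    tau:           softmax temperature (value is an assumption)
    """
    # Soft step-narration alignment A_SN, softmax over narrations (dim K)
    a_sn = F.softmax(step_emb @ narr_emb.T / tau, dim=1)   # (L, K)
    # Propagate narration timestamps to video segments: A_SNV = A_SN @ Y^NV
    a_snv = a_sn @ narr_to_video                            # (L, T)
    return a_snv

# Toy usage with random, L2-normalized embeddings
L, K, T, d = 4, 6, 20, 256
steps = F.normalize(torch.randn(L, d), dim=1)
narrs = F.normalize(torch.randn(K, d), dim=1)
y_nv = torch.zeros(K, T)
y_nv[0, 0:3] = 1  # e.g., narration 0 spans segments 0-2
print(step_narration_video_scores(steps, narrs, y_nv).shape)  # torch.Size([4, 20])
```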

2.2. Direct Step–Video Long-Term (S–V$^{long}$) Similarity

  • Use a text-video model pretrained on long instructional videos, with encoders $E_t^L$, $E_v^L$:

$$A_{SV}^{long}[i,t] = E_t^L(s_i) \cdot E_v^L(v_t)$$

  • MIL-NCE (long-term) loss:

$$L_{lt} = -\frac{1}{L}\sum_{i=1}^{L} \log \frac{\exp(\mathrm{sim}_{lt}(s_i, v_{t_i^+})/\tau)}{\sum_{t}\exp(\mathrm{sim}_{lt}(s_i, v_t)/\tau)}$$

where $t_i^+$ is the index of the top-matching video segment for $s_i$.
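
Below is a generic sketch of an MIL-NCE-style objective of this form; it reduces to the single-positive case of $L_{lt}$ when one segment is marked positive per step. The positive-mask construction and the temperature value are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def mil_nce_loss(sim, pos_mask, temperature=0.07):
    """MIL-NCE-style loss over step-segment similarities (illustrative).

    sim:      (L, T) predicted step-video similarities
    pos_mask: (L, T) binary mask of segments treated as positives per step
    """
    logits = sim / temperature
    # log(sum over positive segments of exp) minus log(sum over all segments of exp), per step
    pos = torch.logsumexp(logits.masked_fill(pos_mask == 0, float("-inf")), dim=1)
    total = torch.logsumexp(logits, dim=1)
    return (total - pos).mean()

# Toy usage: 4 steps, 20 segments, one positive segment per step
sim = torch.randn(4, 20)
pos = torch.zeros(4, 20)
pos[torch.arange(4), torch.randint(0, 20, (4,))] = 1
print(mil_nce_loss(sim, pos))
```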

2.3. Direct Step–Video Short-Term (S–V$^{short}$) Fine-Grained Similarity

  • Use short-video foundation models (e.g., InternVideo) with encoders $E_t^S$, $E_v^S$:

$$A_{SV}^{short}[i,t] = E_t^S(s_i) \cdot E_v^S(v_t)$$

  • Apply an MIL-NCE loss $L_{st}$ of the same form as above.

3. Pathway Fusion and Pseudo-Label Generation

The final step-video matching matrix $M \in \mathbb{R}^{L \times T}$ is formed by averaging the three pathway scores:

$$M(s_i, v_t) = \frac{A_{SNV}[i,t] + A_{SV}^{long}[i,t] + A_{SV}^{short}[i,t]}{3}$$

Optionally, weights $w_1, w_2, w_3$ can be learned, subject to $w_1 + w_2 + w_3 = 1$. Pseudo-labels $Y^{SV}$ are computed by selecting, for each $s_i$, the top-matching segment $v_{t_i^*}$ (possibly including neighbors in a window $W$) where $M(s_i, v_{t_i^*}) > \gamma$, with $\gamma$ a similarity threshold.
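
The fusion and thresholding logic can be sketched as follows; the values of the threshold $\gamma$ and window size $W$ are placeholders, and any normalization of pathway scores to a common range is omitted.

```python
import torch

def fuse_and_pseudolabel(a_snv, a_long, a_short, gamma=0.6, window=2):
    """Average the three pathway score matrices and derive pseudo-labels.

    a_snv, a_long, a_short: (L, T) pathway score matrices
    gamma, window:          similarity threshold and neighbor window (assumed values)
    Returns a binary (L, T) pseudo-label matrix Y^SV.
    """
    m = (a_snv + a_long + a_short) / 3.0          # fused matching matrix M
    L, T = m.shape
    y_sv = torch.zeros(L, T)
    best_score, best_t = m.max(dim=1)             # top-matching segment per step
    for i in range(L):
        if best_score[i] > gamma:                 # keep only confident matches
            lo = max(0, best_t[i].item() - window)
            hi = min(T, best_t[i].item() + window + 1)
            y_sv[i, lo:hi] = 1                    # include neighbors in window W
    return y_sv
```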

4. LLM-Based Step Extraction and Pre-Processing

LLM-based denoising is performed by prompting LLaMA 2-7B to summarize and filter procedure steps from chunked ASR narrations. The prompt instructs the LLM to extract task-relevant steps and discard colloquial or irrelevant content. Narration sentences are chunked into groups of 10, each chunk is passed to the LLM, and all resulting step outputs are concatenated, forming $S = \{s_1,\ldots,s_L\}$ for each video. This yields a structured, denoised step sequence directly suitable for alignment.
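
A minimal sketch of this chunk-and-prompt pipeline is shown below. Here `llm_generate` is a hypothetical callable wrapping a model such as LLaMA 2-7B, and the prompt wording is an illustrative paraphrase rather than the paper's exact prompt.

```python
def extract_llm_steps(narrations, llm_generate, chunk_size=10):
    """Chunk ASR narration sentences and prompt an LLM to extract procedure steps.

    narrations:   list of ASR narration sentences for one video
    llm_generate: hypothetical callable mapping a prompt string to LLM output text
    chunk_size:   number of narration sentences per prompt (10 per the text)
    """
    prompt_template = (
        "Summarize the following narration sentences from an instructional video "
        "as a list of task-relevant procedure steps, discarding colloquial or "
        "irrelevant content:\n{chunk}"
    )
    steps = []
    for i in range(0, len(narrations), chunk_size):
        chunk = "\n".join(narrations[i:i + chunk_size])
        response = llm_generate(prompt_template.format(chunk=chunk))
        # Assume one step per non-empty output line (simple cleanup only)
        steps.extend(line.strip("- ").strip() for line in response.splitlines() if line.strip())
    return steps  # S = {s_1, ..., s_L} for this video
```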

5. Model Architecture and Optimization

  • Video backbone: Frozen S3D network (16 FPS), producing 1-sec segments, projected to 256 dimensions.
  • Text backbone: Bag-of-Words Word2Vec, projected to 256 dimensions.
  • Unimodal encoders: 2-layer Transformers (256 D, 8 heads) for video ($f_{trs}$, with positional embeddings) and text ($g_{trs}$).
  • Joint encoder: 2-layer Transformer (256 D) operating on concatenated segment and step tokens.
  • Similarity computation: Final cosine similarity between 256 D output embeddings for segments and steps.
  • Optimization: AdamW (lr $=2\times10^{-4}$, weight decay $=1\times10^{-5}$), batch size 32, 12 epochs, cosine-decay learning rate schedule.
  • Training loss: Combined MIL-NCE across all three pathways:

$$L = \alpha L_{SNV} + \beta L_{lt} + \gamma L_{st}$$

Equally weighted coefficients ($\alpha = \beta = \gamma = 1$) were found effective in practice.
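
The following PyTorch sketch mirrors the stated layer sizes (256-dim, 2-layer, 8-head Transformers with a joint encoder over concatenated tokens); the input feature dimensions, maximum sequence length, and pooling details are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MPTVAAlignmentHead(nn.Module):
    """Minimal sketch of the alignment head described above. Layer sizes follow the
    text; input feature dimensions (S3D video features, Word2Vec text features) and
    the positional-embedding length are assumptions."""

    def __init__(self, video_feat_dim=1024, text_feat_dim=300, d_model=256):
        super().__init__()
        self.video_proj = nn.Linear(video_feat_dim, d_model)   # frozen S3D features in
        self.text_proj = nn.Linear(text_feat_dim, d_model)     # bag-of-words Word2Vec in
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.video_encoder = nn.TransformerEncoder(layer, num_layers=2)   # f_trs
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)    # g_trs
        self.joint_encoder = nn.TransformerEncoder(layer, num_layers=2)   # joint encoder
        self.pos_embed = nn.Parameter(torch.zeros(1, 4096, d_model))      # video positional embeddings (max length assumed)

    def forward(self, video_feats, step_feats):
        # video_feats: (B, T, video_feat_dim); step_feats: (B, L, text_feat_dim)
        v = self.video_proj(video_feats)
        v = self.video_encoder(v + self.pos_embed[:, : v.size(1)])
        s = self.text_encoder(self.text_proj(step_feats))
        # Joint encoder over concatenated segment and step tokens
        joint = self.joint_encoder(torch.cat([v, s], dim=1))
        v_out, s_out = joint[:, : v.size(1)], joint[:, v.size(1):]
        # Cosine similarity between 256-D step and segment embeddings
        v_out = nn.functional.normalize(v_out, dim=-1)
        s_out = nn.functional.normalize(s_out, dim=-1)
        return torch.einsum("bld,btd->blt", s_out, v_out)      # (B, L, T) similarities
```

Under the training setup above, such a head would be optimized with AdamW against the combined MIL-NCE objective $L = \alpha L_{SNV} + \beta L_{lt} + \gamma L_{st}$.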

6. Empirical Results and Comparison

MPTVA was evaluated on three key procedural video tasks, outperforming previous state-of-the-art models such as VINA. The table below summarizes the reported gains (Chen et al., 2024):

| Task | Metric | VINA | MPTVA | Δ |
|------|--------|------|-------|---|
| Procedure Step Grounding (HT-Step) | R@1 (%) | 37.4 | 43.3 | +5.9 |
| Action Step Localization (CrossTask) | Avg. R@1 (%) | 44.8 | 47.9 | +3.1 |
| Narration Grounding (HTM-Align) | R@1 (%) | 66.5 | 69.3 | +2.8 |

These improvements establish MPTVA's effectiveness in generating robust pseudo-labels for fine-grained temporal action localization.

Parallel research in multi-stream alignment for video-text retrieval employs complementary perspectives (e.g., Fusion/Entity/Action decomposition) within a dual-softmax, multi-expert architecture (CAMoE), achieving state-of-the-art retrieval performance across large video-text datasets (Cheng et al., 2021). In these settings, multi-pathway modeling mitigates overfitting and content heterogeneity, with each expert capturing distinct semantic granularity. The MPTVA principle similarly exploits complementary alignment signals—here, through timestamped, global, and local alignment modeling—filtered by LLMs for robust cross-modal supervision. This suggests that multi-pathway alignment is consistently advantageous for complex cross-modal retrieval and localization settings.


In sum, Multi-Pathway Text-Video Alignment (MPTVA) advances the state of temporal action localization and procedure understanding in instructional videos by synthesizing complementary alignment mechanisms, leveraging LLM-denoised steps, and providing reliable supervisory signals for dense, fine-grained text-video grounding (Chen et al., 2024).
