
Causality-Aware Temporal Projector (CATP)

Updated 12 January 2026
  • CATP is a video understanding module applying temporal constraints to enhance causality perception.
  • Conventional bidirectional projectors let later frames influence the representations of earlier ones; CATP prevents this with a block-causal attention mask.
  • CATP offers measurable gains in benchmark accuracy, enhancing temporal and causal reasoning capabilities.

The Causality-Aware Temporal Projector (CATP) is a transformer-based module for video understanding in Video Large Language Models (Video-LLMs) that require precise modeling of temporal ordering and causality. CATP was proposed in the V-CORE framework and explicitly constrains temporal information flow, addressing limitations of the standard bidirectional attention mechanisms used in existing video-language architectures. CATP is parameter-efficient, keeps intra-frame spatial interactions intact, and provides measurable improvements on benchmarks requiring temporal and causal reasoning (Kang et al., 5 Jan 2026).

1. Motivation and Challenges in Temporal Video Modeling

Many Video-LLMs utilize bidirectional transformer projectors to capture inter-frame dependencies. While effective for general multimodal reasoning, these projectors allow future frame features to influence earlier event representations, undermining the “arrow of time” crucial for causal questions, e.g., “what happened first?” or “which event caused X?”. This bidirectional leakage leads to ambiguity in tasks where temporal order and causal attribution are essential, as exemplified by failures on causal and temporal components of the NExT-QA benchmark. CATP directly addresses this by enforcing unidirectional, frame-by-frame aggregation:

  • Full bidirectional processing is preserved within each frame to maintain spatial token interactions.
  • A block-causal attention mask ensures that tokens from frame $t$ can attend only to tokens from the same or previous frames ($t' \leq t$), explicitly preventing backward leakage of information.
  • A learnable terminal summary token, appended to the sequence, acts exclusively as a “causal sink,” aggregating global context without the ability to overwrite past frame representations.

2. Integration of CATP Within the V-CORE Framework

CATP is one of two central modules within the V-CORE architecture; the other is Learnable Spatial Aggregation (LSA). The workflow is as follows:

  1. Frame Encoding: Each input video frame is encoded by a frozen ViT-L/14, yielding $N=256$ patch embeddings per frame, $F_t\in\mathbb{R}^{N\times d_v}$.
  2. Learnable Spatial Aggregation (LSA): Each $F_t$ is pooled to $K$ salient tokens via learnable queries $\mathbf Q$:

$$Q_s = \mathbf Q W^Q, \quad K_s = F_t W^K, \quad V_s = F_t W^V$$

$$H_t = \mathrm{Softmax}\left(\frac{Q_s K_s^\top}{\sqrt{d_v}}\right)V_s \;\in\; \mathbb{R}^{K \times d_v}$$

Typically, $K=16$, significantly reducing patch redundancy.

  3. Sequence Formation: Flatten all $T$ frames’ tokens and append the learnable summary token $\mathbf q_{\mathrm{sum}}$, forming $H_{\mathrm{in}}\in\mathbb{R}^{(TK+1)\times d_v}$.
  4. Spatio-temporal Positional Encoding: Joint positional embeddings $\mathbf P\in\mathbb{R}^{(TK+1)\times d_v}$ are added so each token encodes its global frame index and intra-frame position.
  5. CATP Transformer: The resulting input is processed by stacked CATP transformer layers, embedding explicit temporal constraints.
  6. Projection and LLM Input: The $(TK+1)$ output tokens are projected into the LLM embedding space and prepended to the textual instruction for subsequent language generation or question answering.
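The spatial pooling and sequence-formation steps can be illustrated with a minimal sketch in plain Python. It is not the V-CORE implementation: the toy sizes, the random weights, and the helper names (`matmul`, `softmax_rows`, `rand_matrix`, `lsa_pool`) are assumptions chosen for illustration, and the positional-encoding, CATP, and projection steps are omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def rand_matrix(rows, cols, rng):
    """Random stand-in for learned weights or encoder features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def lsa_pool(frame, queries, w_q, w_k, w_v):
    """Pool N patch embeddings (N x d) into K tokens (K x d) via learnable queries."""
    d = len(frame[0])
    q_s = matmul(queries, w_q)            # K x d
    k_s = matmul(frame, w_k)              # N x d
    v_s = matmul(frame, w_v)              # N x d
    scores = matmul(q_s, list(zip(*k_s))) # K x N, i.e. Q_s K_s^T
    scale = math.sqrt(d)
    weights = softmax_rows([[s / scale for s in row] for row in scores])
    return matmul(weights, v_s)           # K x d

rng = random.Random(0)
T, N, K, d = 3, 8, 2, 4  # toy sizes; the text uses N=256 and K=16
queries = rand_matrix(K, d, rng)
w_q, w_k, w_v = (rand_matrix(d, d, rng) for _ in range(3))
frames = [rand_matrix(N, d, rng) for _ in range(T)]  # stand-in for frozen ViT features

pooled = [lsa_pool(f, queries, w_q, w_k, w_v) for f in frames]
q_sum = rand_matrix(1, d, rng)  # stand-in for the learnable summary token
h_in = [tok for frame_tokens in pooled for tok in frame_tokens] + q_sum
print(len(h_in), len(h_in[0]))  # prints: 7 4, i.e. (T*K + 1) tokens of width d
```

With the sizes quoted in the text, the same bookkeeping gives $16T+1$ tokens for a clip of $T$ frames instead of $256T$ raw patch embeddings.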

3. Block-Causal Attention and Summary Token Dynamics

The core of CATP is its spatio-temporal block-causal masking within the transformer self-attention. The block-causal mask $M$ is constructed such that:

$$M_{ij} = \begin{cases} 0, & \text{if } \left\lfloor \frac{i}{K} \right\rfloor \geq \left\lfloor \frac{j}{K} \right\rfloor \\ -\infty, & \text{otherwise} \end{cases}$$

This configuration yields these properties:

  • Tokens within the same frame (indices $i,j$ such that $\left\lfloor \frac{i}{K}\right\rfloor = \left\lfloor \frac{j}{K} \right\rfloor$) have unconstrained attention.
  • Tokens in any frame $t$ cannot attend to future frame tokens, strictly enforcing the forward flow of temporal information.

The learnable summary token, appended at the end and assigned frame index $T$, can read from all preceding frames but cannot influence them, embodying the role of a “causal anchor” for final video-level context aggregation.
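As a small illustration of the mask rule, the following sketch builds the additive mask for $TK$ frame tokens plus one trailing summary token, assuming 0-based token indices. The function name `block_causal_mask` and the toy sizes are hypothetical, not taken from the paper.

```python
import math

def block_causal_mask(num_frames, tokens_per_frame):
    """Additive mask: 0 where attention is allowed, -inf where it is blocked."""
    size = num_frames * tokens_per_frame + 1  # +1 for the summary token
    mask = []
    for i in range(size):
        row = []
        for j in range(size):
            # Query i may see key j only if j's frame is not later than i's frame.
            allowed = i // tokens_per_frame >= j // tokens_per_frame
            row.append(0.0 if allowed else -math.inf)
        mask.append(row)
    return mask

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if v == 0.0 else "." for v in row))
# xx.....
# xx.....
# xxxx...
# xxxx...
# xxxxxx.
# xxxxxxx
# xxxxxxx
```

Each `x` marks an allowed query-key pair. The square blocks on the diagonal are the unconstrained intra-frame attention, and the last row is the summary token at index $TK$, whose frame index $\lfloor TK/K \rfloor = T$ lets it read every token. The last column shows that no frame token can read the summary token.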

The CATP transformer update for each layer $\ell$ takes the form:

$$\mathbf X_\ell = \mathrm{LayerNorm}(\mathbf X_{\ell-1}) + \mathrm{MHAtt}(\mathbf X_{\ell-1},\mathbf X_{\ell-1},\mathbf X_{\ell-1};M) + \mathrm{FFN}(\mathrm{LayerNorm}(\cdot))$$

For the summary token, this specializes to:

$$s^{(\ell)} = s^{(\ell-1)} + \mathrm{Softmax}\left(\frac{Q_s K_{\leq T}^\top}{\sqrt{d_v}}\right) V_{\leq T}$$

ensuring the summary is a strictly causal aggregation of prior frames.
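The effect of the mask can be checked numerically. The sketch below is a deliberately simplified single-head attention with $Q=K=V=X$ and no LayerNorm, residual, or feedforward path, so it is an assumption-laden toy rather than the CATP layer itself. It perturbs only the last frame and compares outputs before and after.

```python
import math

def block_causal_mask(T, K):
    """Additive block-causal mask for T*K frame tokens plus one summary token."""
    n = T * K + 1
    return [[0.0 if i // K >= j // K else -math.inf for j in range(n)] for i in range(n)]

def masked_attention(x, mask):
    """Single-head self-attention with an additive mask; Q = K = V = x for brevity."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + mask[i][j]
                  for j, k in enumerate(x)]
        peak = max(scores)  # finite, because a token can always attend to itself
        exps = [math.exp(s - peak) for s in scores]  # masked entries become 0.0
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, x)) for c in range(d)])
    return out

T, K = 3, 2
x = [[0.1 * (i + 1), -0.2 * i, 0.05 * i * i] for i in range(T * K + 1)]
mask = block_causal_mask(T, K)
base = masked_attention(x, mask)

# Perturb only the tokens of the last frame (indices 4 and 5).
x_future = [row[:] for row in x]
for i in range((T - 1) * K, T * K):
    x_future[i] = [v + 1.0 for v in x_future[i]]
changed = masked_attention(x_future, mask)

earlier_unchanged = all(base[i] == changed[i] for i in range((T - 1) * K))
summary_changed = base[-1] != changed[-1]
print(earlier_unchanged, summary_changed)  # prints: True True
```

Outputs for the first two frames are identical after the perturbation, while the summary token's output changes, which matches the description of a summary that reads from all frames without feeding back into them.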

4. Parameter and Computational Efficiency

CATP achieves high efficiency with minimal trainable parameters:

  • LSA parameters: $3\times d_v^2$ for $W^Q, W^K, W^V$ ($\sim 3.1$M for $d_v=1024$).
  • CATP transformer: For $L=2$ layers, each layer has $W_Q, W_K, W_V, W_O$ (each $1024\times1024$), yielding $\sim 4.2$M parameters per layer. The feedforward network contributes $\sim 8.4$M parameters.
  • End-to-end: Total added parameters for LSA+CATP are $\leq 13$M.
  • LLM adaptation: The backbone LLM (Vicuna-7B) is frozen. Adaptation uses 4-bit QLoRA for $W_q$ and $W_v$ in attention layers, LoRA rank $r=64$, scaling $\alpha=128$, resulting in an additional $\leq 34$M low-rank adapter parameters on top of the 4-bit quantized backbone.
  • Training: The full trainable parameter count ($<50$M) allows end-to-end training on a single GPU with 24 GB of memory.
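Some of the figures above can be reproduced with a few lines of arithmetic. This sketch counts weight matrices only (no biases). The backbone shape used for the LoRA count (32 decoder layers, hidden size 4096) is an assumption based on the standard 7B LLaMA-family configuration and is not stated in the text. The feedforward and end-to-end totals are not recomputed here because the text does not give the feedforward width.

```python
d_v = 1024  # visual embedding width quoted in the text

lsa_params = 3 * d_v * d_v             # W^Q, W^K, W^V in LSA
attn_params_per_layer = 4 * d_v * d_v  # W_Q, W_K, W_V, W_O in one CATP layer

# Assumed backbone shape (not given in the text): 32 layers, hidden size 4096.
llm_layers, llm_hidden, lora_rank = 32, 4096, 64
lora_per_matrix = lora_rank * (llm_hidden + llm_hidden)  # A is r x d, B is d x r
lora_params = llm_layers * 2 * lora_per_matrix           # adapters on W_q and W_v

print(lsa_params, attn_params_per_layer, lora_params)
# prints: 3145728 4194304 33554432
```

These values round to the quoted $\sim 3.1$M, $\sim 4.2$M, and $\leq 34$M.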

5. Experimental Evaluation and Ablation Results

CATP, within V-CORE, demonstrates empirical improvements on established VideoQA benchmarks, particularly for tasks requiring temporal and causal inference.

Benchmark Performance

  • NExT-QA: V-CORE achieves 61.2% accuracy, surpassing a bidirectional transformer baseline at 58.0%.
  • Temporal reasoning: Absolute performance gain of +3.5% in relevant sub-tasks.
  • Causal reasoning: Absolute gain of +5.2% over the bidirectional baseline.

Ablation Study

| Configuration | NExT-QA Accuracy (%) |
|---|---|
| MeanPool spatial + linear temporal | 53.8 |
| + LSA (Learnable Spatial Aggregation) | 55.6 |
| + Bidirectional Transformer | 58.0 |
| + Block-causal CATP | 60.2 |
| + Terminal Dynamic Summary | 61.2 |

The addition of LSA and the shift to block-causal attention yield progressive improvements, with the terminal dynamic summary contributing the final performance increase. This validates that unidirectional, frame-wise attention and a causal sink summary token enhance the coherence of temporal and causal reasoning.
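The step-by-step contributions can be read off the ablation figures with simple differences. The sketch below uses only the accuracies reported above; the variable names are illustrative.

```python
ablation = [
    ("MeanPool spatial + linear temporal", 53.8),
    ("+ LSA", 55.6),
    ("+ Bidirectional Transformer", 58.0),
    ("+ Block-causal CATP", 60.2),
    ("+ Terminal Dynamic Summary", 61.2),
]

# Gain of each configuration over the one listed before it, in percentage points.
gains = [
    (name, round(acc - prev_acc, 1))
    for (_, prev_acc), (name, acc) in zip(ablation, ablation[1:])
]
total_gain = round(ablation[-1][1] - ablation[0][1], 1)

for name, gain in gains:
    print(f"{name}: +{gain}")
print(f"total: +{total_gain}")
# + LSA: +1.8
# + Bidirectional Transformer: +2.4
# + Block-causal CATP: +2.2
# + Terminal Dynamic Summary: +1.0
# total: +7.4
```

Replacing the bidirectional transformer with block-causal CATP accounts for 2.2 points and the terminal summary for a further 1.0, which together give the 3.2-point margin between 58.0% and 61.2%.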

6. Significance and Implications

CATP corrects a fundamental shortcoming of naive bidirectional attention in Video-LLMs with respect to temporal and causal video understanding. By architecturally enforcing the arrow of time and aggregating temporally ordered information, CATP provides a robust computational substrate for downstream LLMs to answer questions dependent on event chronology and causality. The approach is parameter-efficient, scalable to standard hardware, and preserves performance on tasks that are not exclusively temporal or causal in nature.

A plausible implication is that similar block-causal masking and summary-token strategies could be beneficial in other sequential multimodal reasoning domains where temporal leakage is detrimental to interpretability or task correctness.

CATP builds upon and differentiates itself from standard transformer attention, traditional causal masks (which fail to preserve intra-frame spatial modeling when applied naively), and prior work collapsing video frames via mean pooling or linear projection. The explicit block-causal design allows unconstrained spatial reasoning while preserving the temporal unidirectionality essential for fine-grained event attribution. Its integration with parameter-efficient LLM adaptation techniques, such as QLoRA, further aligns it with contemporary trends towards resource-efficient large model adaptation and inference (Kang et al., 5 Jan 2026).
