Causality-Aware Temporal Projector (CATP)

Updated 12 January 2026

CATP is a video understanding module that applies temporal constraints to improve causality perception.
Bidirectional projectors let future frames influence earlier ones; CATP prevents this with a block-causal attention mask.
CATP yields measurable gains in benchmark accuracy on temporal and causal reasoning tasks.

The Causality-Aware Temporal Projector (CATP) is a transformer-based module designed for video understanding within video large language models (Video-LLMs) that require precise modeling of temporal ordering and causality. CATP was proposed in the V-CORE framework and explicitly constrains temporal information flow, addressing limitations of standard bidirectional attention mechanisms in existing video-language architectures. CATP is parameter-efficient, keeps intra-frame spatial interactions intact, and provides measurable improvements on benchmarks requiring temporal and causal reasoning (Kang et al., 5 Jan 2026).

1. Motivation and Challenges in Temporal Video Modeling

Many Video-LLMs utilize bidirectional transformer projectors to capture inter-frame dependencies. While effective for general multimodal reasoning, these projectors allow future frame features to influence earlier event representations, undermining the “arrow of time” crucial for causal questions, e.g., “what happened first?” or “which event caused X?”. This bidirectional leakage leads to ambiguity in tasks where temporal order and causal attribution are essential, as exemplified by failures on causal and temporal components of the NExT-QA benchmark. CATP directly addresses this by enforcing unidirectional, frame-by-frame aggregation:

Full bidirectional processing is preserved within each frame to maintain spatial token interactions.
A block-causal attention mask ensures tokens from frame $t$ can attend only to the same or previous frames ($\leq t$), explicitly preventing backward leakage of information.
A learnable terminal summary token, appended to the sequence, acts exclusively as a “causal sink,” aggregating global context without the ability to overwrite past frame representations.

2. Integration of CATP Within the V-CORE Framework

CATP is one of two central modules within the V-CORE architecture; the other is Learnable Spatial Aggregation (LSA). The workflow is as follows:

Frame Encoding: Each input video frame is encoded by a frozen ViT-L/14, yielding $N=256$ patch embeddings per frame, $F_t\in\mathbb{R}^{N\times d_v}$.
Learnable Spatial Aggregation (LSA): Each $F_t$ is pooled to $K$ salient tokens via learnable queries:

$Q_s = \mathbf Q W^Q, \quad K_s = F_t W^K, \quad V_s = F_t W^V$

$H_t = \mathrm{Softmax}\left(\frac{Q_s K_s^T}{\sqrt{d_v}}\right)V_s \;\in\; \mathbb{R}^{K \times d_v}$

Typically, $K=16$, significantly reducing patch redundancy.
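The pooling step above can be sketched in NumPy as follows. This is a minimal single-head illustration of the two LSA equations, not the authors' implementation: the function name `lsa_pool`, the random initialization, and the toy width `d_v = 32` are assumptions made only for the example, while $N=256$ and $K=16$ follow the text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lsa_pool(F_t, Q, W_q, W_k, W_v):
    """Pool N patch embeddings (N, d_v) into K tokens (K, d_v) via learnable queries."""
    d_v = F_t.shape[-1]
    Q_s = Q @ W_q      # (K, d_v) projected learnable queries
    K_s = F_t @ W_k    # (N, d_v) keys from the frame's patches
    V_s = F_t @ W_v    # (N, d_v) values from the frame's patches
    attn = softmax(Q_s @ K_s.T / np.sqrt(d_v))  # (K, N), each row sums to 1
    return attn @ V_s, attn

rng = np.random.default_rng(0)
N, K, d_v = 256, 16, 32  # d_v = 32 is a toy width chosen for this sketch
F_t = rng.normal(size=(N, d_v))
Q = rng.normal(size=(K, d_v))
W_q, W_k, W_v = (rng.normal(size=(d_v, d_v)) / np.sqrt(d_v) for _ in range(3))

H_t, attn = lsa_pool(F_t, Q, W_q, W_k, W_v)
print(H_t.shape)  # (16, 32)
```

Each of the $K$ output tokens is a convex combination of the frame's value vectors, so the frame shrinks from 256 tokens to 16 without a fixed pooling grid.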

Sequence Formation: Flatten all $T$ frames’ tokens and append the learnable summary token, forming a sequence of $TK+1$ tokens.
Spatio-temporal Positional Encoding: Joint positional embeddings are added so each token encodes its global frame index and intra-frame position.
CATP Transformer: The resulting input is processed by stacked CATP transformer layers, embedding explicit temporal constraints.
Projection and LLM Input: The output tokens are projected into the LLM embedding space and prepended to the textual instruction for subsequent language generation or question answering.
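The sequence-formation and positional-encoding steps can be illustrated with a short NumPy sketch. Several details here are assumptions, not statements about V-CORE: the joint embedding is modeled as the sum of a frame-index table and an intra-frame position table, frames are numbered from 0, and the summary token is given frame index $T$ and position 0. The names `build_sequence`, `E_frame`, and `E_pos` are hypothetical.

```python
import numpy as np

def build_sequence(H, summary, E_frame, E_pos):
    """Flatten per-frame tokens, append the summary token, add positional embeddings.

    H:       (T, K, d) pooled tokens for T frames
    summary: (d,)      learnable summary token
    E_frame: (T+1, d)  frame-index embeddings (last row used by the summary token)
    E_pos:   (K, d)    intra-frame position embeddings
    """
    T, K, d = H.shape
    tokens = np.concatenate([H.reshape(T * K, d), summary[None, :]], axis=0)
    # Frame index of every token; the summary token comes after all frames.
    frame_ids = np.concatenate([np.repeat(np.arange(T), K), [T]])
    # Position inside the frame; position 0 for the summary token is an assumption.
    pos_ids = np.concatenate([np.tile(np.arange(K), T), [0]])
    X = tokens + E_frame[frame_ids] + E_pos[pos_ids]
    return X, frame_ids, pos_ids

rng = np.random.default_rng(0)
T, K, d = 4, 16, 8  # toy sizes for illustration
H = rng.normal(size=(T, K, d))
summary = rng.normal(size=d)
E_frame = rng.normal(size=(T + 1, d))
E_pos = rng.normal(size=(K, d))

X, frame_ids, pos_ids = build_sequence(H, summary, E_frame, E_pos)
print(X.shape)  # (65, 8), i.e. T*K + 1 tokens
```

The `frame_ids` vector is the only bookkeeping a block-causal mask needs, since the mask depends on frame membership rather than on raw token position.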

3. Block-Causal Attention and Summary Token Dynamics

The core of CATP is its spatio-temporal block-causal masking within the transformer self-attention. Writing $f(i)$ for the frame index of token $i$, the block-causal mask $M$ is constructed such that:

$M_{ij} = \begin{cases} 0, & f(j) \leq f(i) \\ -\infty, & f(j) > f(i) \end{cases}$

This configuration yields these properties:

Tokens within the same frame (indices $i, j$ such that $f(i) = f(j)$) have unconstrained attention.
Tokens in any frame $t$ cannot attend to future frame tokens, strictly enforcing the forward flow of temporal information.

The learnable summary token, appended at the end and assigned a frame index later than every video frame ($T+1$ when frames are numbered $1, \dots, T$), can read from all preceding frames but cannot influence them, embodying the role of a “causal anchor” for final video-level context aggregation.
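A block-causal mask of this kind can be built directly from per-token frame indices. The sketch below assumes 0-based frame numbering, $K$ tokens per frame, and a summary token placed one frame index after the last video frame; the function name `block_causal_mask` is hypothetical.

```python
import numpy as np

def block_causal_mask(T, K):
    """Additive attention mask of shape (T*K+1, T*K+1).

    Entry (i, j) is 0 when query token i may attend to key token j
    (frame of j <= frame of i) and -inf otherwise.
    """
    frame_ids = np.concatenate([np.repeat(np.arange(T), K), [T]])
    allowed = frame_ids[:, None] >= frame_ids[None, :]
    return np.where(allowed, 0.0, -np.inf)

M = block_causal_mask(T=3, K=2)
print((M == 0).astype(int))
# Rows are queries, columns are keys, 1 means attention is allowed:
# 1 1 0 0 0 0 0   frame 0
# 1 1 0 0 0 0 0   frame 0
# 1 1 1 1 0 0 0   frame 1
# 1 1 1 1 0 0 0   frame 1
# 1 1 1 1 1 1 0   frame 2
# 1 1 1 1 1 1 0   frame 2
# 1 1 1 1 1 1 1   summary token
```

The diagonal $K \times K$ blocks are fully open, which is what separates this from a token-level causal mask: tokens of the same frame see each other in both directions. The last column is closed for every video token, so the summary token reads everything but is read by nothing except itself.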

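The CATP transformer update for each layer $l$ applies the block-causal mask $M$ additively to the self-attention logits. With $Q^{(l)}$, $K^{(l)}$, and $V^{(l)}$ the query, key, and value projections of the layer input and $d$ the key dimension, it takes the form:

$\mathrm{Attn}^{(l)} = \mathrm{Softmax}\left(\frac{Q^{(l)} {K^{(l)}}^T}{\sqrt{d}} + M\right) V^{(l)}$

For the summary token, with query $q_s^{(l)}$ and a mask row that is zero everywhere, this specializes to:

$\mathrm{Attn}_s^{(l)} = \mathrm{Softmax}\left(\frac{q_s^{(l)} {K^{(l)}}^T}{\sqrt{d}}\right) V^{(l)}$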

ensuring the summary is a strictly causal aggregation of prior frames.
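The causal behavior described here can be checked numerically. The following is a minimal sketch of one masked self-attention step, assuming a single head and omitting residual connections, normalization, and the feedforward network; the function name `masked_self_attention` and all sizes are illustrative. It perturbs the last video frame and confirms that earlier frames are unaffected while the summary token changes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Stable softmax; rows always contain at least one finite logit here.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, W_q, W_k, W_v, M):
    """Single-head self-attention with an additive mask M on the logits."""
    d = X.shape[-1]
    Q, K_, V = X @ W_q, X @ W_k, X @ W_v
    logits = Q @ K_.T / np.sqrt(d) + M  # -inf entries get zero attention weight
    return softmax(logits) @ V

T, K, d = 3, 2, 8  # toy sizes for illustration
rng = np.random.default_rng(0)
frame_ids = np.concatenate([np.repeat(np.arange(T), K), [T]])
M = np.where(frame_ids[:, None] >= frame_ids[None, :], 0.0, -np.inf)

X = rng.normal(size=(T * K + 1, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
Y = masked_self_attention(X, W_q, W_k, W_v, M)

# Perturb only the tokens of the last video frame.
X2 = X.copy()
X2[(T - 1) * K : T * K] += 1.0
Y2 = masked_self_attention(X2, W_q, W_k, W_v, M)

earlier_unchanged = np.allclose(Y[: (T - 1) * K], Y2[: (T - 1) * K])
summary_changed = not np.allclose(Y[-1], Y2[-1])
print(earlier_unchanged, summary_changed)  # True True
```

With a bidirectional projector (an all-zero mask) the first check would fail, because every token's output depends on every frame.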

4. Parameter and Computational Efficiency

CATP achieves high efficiency with minimal trainable parameters:

LSA parameters: the learnable queries $\mathbf Q$ and the projections $W^Q$, $W^K$, and $W^V$.
CATP transformer: each of the stacked layers contributes its attention projection matrices and a feedforward network.
End-to-end: the parameters added to the frozen vision encoder and LLM are those of LSA and CATP combined.
LLM adaptation: The backbone LLM (Vicuna-7B) is frozen. Adaptation uses 4-bit QLoRA on two projection matrices in the attention layers, adding only low-rank adapter parameters on top of the 4-bit backbone.
Training: The resulting trainable parameter count allows end-to-end training on a single GPU with 24 GB of memory.
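The bookkeeping behind such parameter counts can be sketched as follows. The formulas are the standard ones for bias-free linear maps (query, key, value, and output projections of size $d \times d$, a two-matrix feedforward network, and rank-$r$ LoRA factors). The helper names and every numeric argument are illustrative placeholders, not the configuration reported for CATP.

```python
def lsa_params(K, d):
    # K learnable queries of width d plus three d x d projections (W^Q, W^K, W^V).
    return K * d + 3 * d * d

def attention_params(d):
    # Query, key, value, and output projections, each d x d, no biases.
    return 4 * d * d

def ffn_params(d, d_ff):
    # Two linear maps: d -> d_ff and d_ff -> d, no biases.
    return 2 * d * d_ff

def lora_params(d_in, d_out, r):
    # A rank-r adapter on one d_in x d_out matrix adds factors of size
    # d_in x r and r x d_out.
    return r * (d_in + d_out)

# Placeholder sizes chosen only to exercise the formulas.
d, d_ff, K, n_layers = 1024, 4096, 16, 2
projector_total = lsa_params(K, d) + n_layers * (attention_params(d) + ffn_params(d, d_ff))
print(f"LSA: {lsa_params(K, d) / 1e6:.2f}M")
print(f"per-layer attention: {attention_params(d) / 1e6:.2f}M")
print(f"per-layer FFN: {ffn_params(d, d_ff) / 1e6:.2f}M")
print(f"projector total: {projector_total / 1e6:.2f}M")
print(f"one LoRA adapter (4096 x 4096, r=16): {lora_params(4096, 4096, 16) / 1e6:.3f}M")
```

Because attention and feedforward costs grow quadratically in the width while LoRA grows linearly in it, the adapter contribution per adapted matrix is small relative to a full projection of the same shape.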

5. Experimental Evaluation and Ablation Results

CATP, within V-CORE, demonstrates empirical improvements on established VideoQA benchmarks, particularly for tasks requiring temporal and causal inference.

Benchmark Performance

NExT-QA: V-CORE achieves 61.2% accuracy, surpassing a bidirectional transformer baseline at 58.0%.
Temporal reasoning: Absolute gain of 3.5 percentage points on the relevant sub-tasks.
Causal reasoning: Absolute gain of 5.2 percentage points over the bidirectional baseline.

Ablation Study

Configuration	NExT-QA Accuracy (%)
MeanPool spatial + linear temporal	53.8
+ LSA (Learnable Spatial Aggregation)	55.6
+ Bidirectional Transformer	58.0
+ Block-causal CATP	60.2
+ Terminal Dynamic Summary	61.2

Each step improves accuracy: LSA adds 1.8 points over mean pooling, the bidirectional transformer a further 2.4, the switch to block-causal attention 2.2, and the terminal dynamic summary the final 1.0. This supports the claim that unidirectional, frame-wise attention and a causal sink summary token improve the coherence of temporal and causal reasoning.
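The per-step gains can be recomputed from the ablation table with a few lines of Python. The accuracies are copied from the table; the variable names are illustrative.

```python
# (configuration, NExT-QA accuracy in %) rows from the ablation table.
ablation = [
    ("MeanPool spatial + linear temporal", 53.8),
    ("+ LSA (Learnable Spatial Aggregation)", 55.6),
    ("+ Bidirectional Transformer", 58.0),
    ("+ Block-causal CATP", 60.2),
    ("+ Terminal Dynamic Summary", 61.2),
]

# Gain of each configuration over the previous row, in percentage points.
gains = [
    (name, round(acc - prev_acc, 1))
    for (_, prev_acc), (name, acc) in zip(ablation, ablation[1:])
]
total_gain = round(ablation[-1][1] - ablation[0][1], 1)

for name, gain in gains:
    print(f"{name}: +{gain}")
print(f"total: +{total_gain}")  # total: +7.4
```

Of the 7.4-point total, 3.2 points come from the two causal components (block-causal masking and the terminal summary), which matches the gap between the 58.0% bidirectional baseline and the 61.2% final result.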

6. Significance and Implications

CATP corrects a fundamental shortcoming of naive bidirectional attention in Video-LLMs with respect to temporal and causal video understanding. By architecturally enforcing the arrow of time and aggregating temporally ordered information, CATP provides a robust computational substrate for downstream LLMs to answer questions dependent on event chronology and causality. The approach is parameter-efficient, scalable to standard hardware, and preserves performance on tasks that are not exclusively temporal or causal in nature.

A plausible implication is that similar block-causal masking and summary-token strategies could be beneficial in other sequential multimodal reasoning domains where temporal leakage is detrimental to interpretability or task correctness.

CATP builds upon and differentiates itself from standard transformer attention, traditional causal masks (which fail to preserve intra-frame spatial modeling when applied naively), and prior work collapsing video frames via mean pooling or linear projection. The explicit block-causal design allows unconstrained spatial reasoning while preserving the temporal unidirectionality essential for fine-grained event attribution. Its integration with parameter-efficient LLM adaptation techniques, such as QLoRA, further aligns it with contemporary trends towards resource-efficient large model adaptation and inference (Kang et al., 5 Jan 2026).


References (1)

Kang et al. (2026). Causality-Aware Temporal Projection for Video Understanding in Video-LLMs.
