Temporal Attention Signatures (T-Sigs)
- Temporal Attention Signatures (T-Sigs) are a mathematically defined construct that quantifies unique temporal attention patterns in transformer-based video models.
- They are computed by averaging multi-head self-attention weights from the penultimate block across frame displacements, facilitating model-agnostic forensic analysis.
- Experimental validations show high intra-class consistency and strong inter-class discrimination, making T-Sigs effective for tracking video provenance and detecting novel generators.
Temporal Attention Signatures (T-Sigs) are a mathematically defined interpretability construct introduced in the SAGA framework for source attribution of generative AI videos. T-Sigs capture the unique temporal attention patterns that transformer-based video models learn when distinguishing between real and AI-generated videos, as well as between different synthesis models. By quantifying and visualizing frame-to-frame attention across multiple levels of attribution—including authenticity, generation type, model version, development team, and specific generator—T-Sigs provide direct insight into the temporal artifacts characteristic of each video source. Rather than relying on manual feature engineering, the approach draws on the internal behavior of the attribution transformer itself, enabling model-agnostic, data-driven forensic analysis (Kundu et al., 16 Nov 2025).
1. Mathematical Formulation of Temporal Attention Signatures
Let a video comprise $T$ frames, processed by a transformer with multi-head self-attention (MHSA). For each head $h$, the second-to-last (penultimate) temporal transformer block produces an attention matrix $A^{(h)} \in \mathbb{R}^{T \times T}$, with $h = 1, \dots, H$ ($H$ being the number of heads). Each row of $A^{(h)}$ sums to 1.
The per-video, per-displacement temporal attention profile quantifies average attention from a given frame to its $d$-ahead successor:

$$p(d) = \frac{1}{H\,(T - d)} \sum_{h=1}^{H} \sum_{t=1}^{T-d} A^{(h)}_{t,\,t+d}, \qquad d = 1, \dots, T-1.$$

For a given class $c$ (e.g., "Stable Diffusion v2.1", "Real", or a specific video generator), the Temporal Attention Signature is defined as:

$$S_c(d) = \frac{1}{N_c} \sum_{v \in c} p_v(d),$$

where $N_c$ is the number of videos in class $c$. $S_c$ is the characteristic vector for class $c$. Optional normalization (e.g., unit sum or $\ell_2$-norm) is possible but not required, owing to softmax row-normalization and the resulting inter-class comparability.
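The two definitions above can be checked with a short NumPy sketch. The function names (`temporal_profile`, `class_signature`) and the random row-stochastic test data are illustrative, not part of SAGA:

```python
import numpy as np

def temporal_profile(attn):
    """Compute p(d), d = 1..T-1, from one video's penultimate-block
    attention stack `attn` of shape (H, T, T); each row is row-stochastic."""
    H, T, _ = attn.shape
    return np.array([
        np.mean([attn[h, t, t + d] for h in range(H) for t in range(T - d)])
        for d in range(1, T)
    ])

def class_signature(profiles):
    """S_c: mean of the per-video profiles p_v over all videos in class c."""
    return np.mean(np.stack(profiles), axis=0)

# Toy check: random softmax-normalized attention with H=4 heads, T=8 frames.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 8))
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)   # rows sum to 1, as in MHSA
p = temporal_profile(attn)
print(p.shape)  # (7,)
```

Since every attention weight is a softmax output, each entry of `p` lies strictly between 0 and 1 without any extra normalization.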
2. End-to-End Pipeline for T-Sig Computation
The T-Sig extraction pipeline in SAGA is outlined as follows:
- Frame extraction and preprocessing: Uniformly sample $T$ frames from each video, resizing each (e.g., to $224 \times 224$).
- Tokenization with frozen vision encoder: Pass each frame through a pretrained CLIP- or ViT-style encoder, yielding $P$ tokens per frame of dimension $D$, producing a tensor $X \in \mathbb{R}^{T \times P \times D}$.
- Per-frame spatial encoding: Apply a transformer block with self-attention across the $P$ tokens, then average-pool to a single $D$-dimensional vector per frame. Concatenate across frames to form $Z \in \mathbb{R}^{T \times D}$.
- Temporal transformer encoding: Add sinusoidal positional embeddings of shape $T \times D$, then process with a stack of temporal transformer blocks. Each block includes MHSA ($H$ heads, model dimension $D$), LayerNorm, residual connections, dropout, and an MLP.
- Penultimate attention extraction: Retrieve $A^{(h)} \in \mathbb{R}^{T \times T}$ from each head $h$ in the penultimate block.
- Per-video temporal profile computation: For each displacement $d$, average $A^{(h)}_{t,\,t+d}$ over positions $t$ and heads $h$ to yield $p_v(d)$.
- Class-level aggregation: Average $p_v$ over all videos in class $c$ to obtain $S_c$.
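The pipeline's shape flow can be sketched end to end in NumPy. This is a deliberately reduced stand-in: the frozen encoder is stubbed with random tokens, the temporal stack is collapsed to a single self-attention head, the positional embedding is a simplified sine-only variant, and $D$ is shrunk from 768:

```python
import numpy as np

rng = np.random.default_rng(1)
T, P, D = 8, 197, 64                     # frames, tokens/frame, reduced dim

def stub_encoder(frame):                 # stand-in for a frozen CLIP/ViT encoder
    return rng.normal(size=(P, D))

frames = [None] * T                      # placeholders for T preprocessed frames
tokens = np.stack([stub_encoder(f) for f in frames])   # (T, P, D)
Z = tokens.mean(axis=1)                                # spatial pooling -> (T, D)
pos = np.sin(np.arange(T)[:, None] / 10000 ** (np.arange(D) / D))
Z = Z + pos                                            # simplified positions

# One self-attention head as a stand-in for the penultimate temporal block.
Wq, Wk = rng.normal(size=(D, D)), rng.normal(size=(D, D))
scores = (Z @ Wq) @ (Z @ Wk).T / np.sqrt(D)
A = np.exp(scores - scores.max(axis=1, keepdims=True)) # stable softmax
A /= A.sum(axis=1, keepdims=True)                      # row-stochastic (T, T)

# Per-video profile: mean of the d-th superdiagonal of A for each offset d.
p = np.array([A[np.arange(T - d), np.arange(d, T)].mean() for d in range(1, T)])
print(A.shape, p.shape)  # (8, 8) (7,)
```

A real implementation would additionally average over all $H$ heads and all penultimate-block attention maps, exactly as in the formula for $p(d)$.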
3. Tensor Dimensions and Normalization Properties
The SAGA pipeline maintains precise tensor shapes throughout the computation:
| Stage | Tensor shape | Notes |
|---|---|---|
| Input frames | $T$ | Typically 8 or 16 |
| Tokens per frame | $P$ | E.g., 197 |
| Token dimension | $D$ | E.g., 768 |
| Spatial encoder out | $1 \times D$ | Per frame |
| Stacked embeddings | $T \times D$ | Full video |
| MHSA head count | $H$ | Per temporal block |
| Attention matrices | $T \times T$ | Per head |
| Per-video profile | $T - 1$ | Offsets $d = 1, \dots, T-1$ |
| Class-level T-Sig | $T - 1$ | Mean over class videos |
All MHSA attention weights are row-normalized via softmax, yielding inherently comparable and interpretable T-Sig vectors. Optional normalization (unit sum or unit norm) is permissible for visualization, but not essential for interpretive or discriminative utility.
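The optional normalizations mentioned above are one-liners; a minimal sketch (the helper name `normalize_tsig` is illustrative):

```python
import numpy as np

def normalize_tsig(s, mode="sum"):
    """Optional T-Sig normalization for visualization only; softmax
    row-normalization already makes raw profiles comparable."""
    s = np.asarray(s, dtype=float)
    if mode == "sum":                       # unit sum
        return s / s.sum()
    if mode == "l2":                        # unit Euclidean norm
        return s / np.linalg.norm(s)
    raise ValueError(f"unknown mode: {mode}")

s = np.array([0.40, 0.25, 0.15, 0.10])      # hypothetical T-Sig, T = 5
print(round(normalize_tsig(s, "sum").sum(), 6))           # 1.0
print(round(np.linalg.norm(normalize_tsig(s, "l2")), 6))  # 1.0
```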
4. Visualization of Temporal Attention Signatures
T-Sigs can be illustrated as curves of attention weight versus frame displacement $d$; different sources produce visibly different curve shapes.

(Figure: example T-Sig curves for Generator A, e.g., a diffusion model with frame jitter, and Generator B, e.g., a VideoGAN with smooth transitions.)
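Such curves can also be sketched directly in a terminal. A small helper (hypothetical, not part of SAGA) that renders a T-Sig as one ASCII bar per displacement, fed illustrative values:

```python
def render_tsig(sig, width=40):
    """Render a T-Sig as ASCII bars: one row per displacement d,
    bar length proportional to the attention weight."""
    peak = max(sig)
    lines = []
    for d, w in enumerate(sig, start=1):
        bar = "#" * max(1, round(width * w / peak))
        lines.append(f"d={d:<2} {bar} {w:.3f}")
    return "\n".join(lines)

# Hypothetical signature with attention decaying over displacement.
print(render_tsig([0.42, 0.21, 0.11, 0.06]))
```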
5. Interpretability Mechanisms
Temporal Attention Signatures provide several interpretability advantages:
- Temporal artifact fingerprinting: Distinct video synthesis models leave characteristic temporal signatures (e.g., over-smooth transitions, inter-frame jitter), which are exposed by T-Sigs.
- Model-agnostic visualization: No manual feature engineering is required. The SAGA transformer’s learned attention weights directly reveal the discriminative patterns.
- Unseen-generator detection: Averaged T-Sig curves for unobserved generators differ from those of known models, enabling open-set and novelty detection.
- Classifier-aligned explanations: T-Sigs summarize the precise temporal cues used by the classifier, since the final decision layer consumes representations built atop the very same attention patterns.
A plausible implication is that, since T-Sigs are observable at inference time for any input video, they also offer a route toward forensic transparency in regulatory contexts.
6. Experimental Validation and Discriminative Power
Key experiments demonstrate the validity and utility of Temporal Attention Signatures:
- Stable intra-class signatures: Each class produces a uniquely shaped signature $S_c$, and repeated subsampling yields curves with consistently high intra-class correlation.
- Inter-class separation: Pairwise correlation between T-Sigs of different classes is low, confirming discriminative power.
- Open-set capability: Unseen generators not present in training produce T-Sigs unmatched by known classes. Nearest-neighbor matching raises correct novelty flags in over 85% of cases.
- Layer ablation study: Extracting T-Sigs from the final transformer block (rather than the penultimate) degrades curve quality (intra-class correlation drops by 0.15) and separation.
- Human interpretability: In a user study, forensic analysts assigned videos to generators using only T-Sig curves, achieving 78% accuracy (chance = 33%).
Collectively, these results show that T-Sigs are highly reproducible, discriminative, and easily interpreted signatures directly tied to generative model provenance (Kundu et al., 16 Nov 2025).
7. Applications and Forensic Implications
Temporal Attention Signatures, as deployed in SAGA, enable multi-granular video provenance tracing—spanning basic authenticity checks, generator family grouping, specific version identification, attribution to development teams, and oracle-level discrimination of individual generator instances. They provide essential transparency for forensic and regulatory applications, particularly as video synthesis models proliferate and evolve. The method is compatible with both known and novel generators, facilitating both closed- and open-set forensic scenarios.
This approach establishes a new benchmark for interpretable synthetic video attribution, providing insight into the precise temporal artifacts leveraged by transformer-based architectures for model-agnostic forensic analysis (Kundu et al., 16 Nov 2025).