
Temporal Attention Signatures (T-Sigs)

Updated 23 November 2025
  • Temporal Attention Signatures (T-Sigs) are a mathematically defined construct that quantifies unique temporal attention patterns in transformer-based video models.
  • They are computed by averaging multi-head self-attention weights from the penultimate block across frame displacements, facilitating model-agnostic forensic analysis.
  • Experimental validations show high intra-class consistency and strong inter-class discrimination, making T-Sigs effective for tracking video provenance and detecting novel generators.

Temporal Attention Signatures (T-Sigs) are a mathematically defined interpretability construct introduced in the SAGA framework for source attribution of generative AI videos. T-Sigs capture the unique temporal attention patterns that transformer-based video models learn when distinguishing between real and AI-generated videos, as well as between different synthesis models. By quantifying and visualizing frame-to-frame attention across multiple levels of attribution—including authenticity, generation type, model version, development team, and specific generator—T-Sigs provide direct insight into the temporal artifacts characteristic of each video source. Unlike manual feature engineering, this approach relies on the internal behavior of the attribution transformer itself, enabling model-agnostic, data-driven forensic analysis (Kundu et al., 16 Nov 2025).

1. Mathematical Formulation of Temporal Attention Signatures

Let a video $x_k$ comprise $L$ frames, processed by a transformer with multi-head self-attention (MHSA). For each head $h$, the second-to-last temporal transformer block produces an attention matrix $A_k^{(h)} \in \mathbb{R}^{L \times L}$ such that:

$$A_k^{(h)}[i, j] = \mathrm{softmax}_j\left( \frac{Q_{k,i}^{(h)} \cdot K_{k,j}^{(h)\,T}}{\sqrt{d_h}} \right)$$

with $d_h = d_{\text{model}} / H$, where $H$ is the number of heads. Each row of $A_k^{(h)}$ sums to 1.

The per-video, per-displacement temporal attention profile quantifies the average attention from a given frame $t$ to its $d$-ahead successor:

$$\alpha_k(d) = \frac{1}{H} \sum_{h=1}^{H} \frac{1}{L - d} \sum_{t=1}^{L - d} A_k^{(h)}[t, t + d] \quad \text{for } d = 1, \ldots, L-1$$

For a given class $c$ (e.g., "Stable Diffusion v2.1", "Real", or a specific video generator), the Temporal Attention Signature is defined as:

$$\mathrm{TSig}_c(d) = \frac{1}{N_c} \sum_{k:\, y_k = c} \alpha_k(d)$$

where $N_c$ is the number of videos in class $c$. $\mathrm{TSig}_c \in \mathbb{R}^{L-1}$ is the characteristic vector for class $c$. Optional normalization (e.g., unit sum or $\ell_2$-norm) is possible but not required, since softmax row-normalization already makes signatures comparable across classes.
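The two formulas above can be sketched directly in NumPy. The attention tensor here is random and the shapes are illustrative; in practice $A_k^{(h)}$ would come from the trained attribution model's penultimate block:

```python
import numpy as np

def temporal_profile(A):
    """Per-video temporal attention profile alpha_k(d).

    A: attention tensor of shape (H, L, L) from the penultimate temporal
       transformer block (one row-softmaxed matrix per head).
    Returns a vector of length L-1 whose entry d-1 averages the attention
    from frame t to frame t+d over all valid t and all heads.
    """
    H, L, _ = A.shape
    alpha = np.empty(L - 1)
    for d in range(1, L):
        # A[h, t, t+d] for t = 0..L-d-1 is the d-th superdiagonal of head h
        diags = np.stack([np.diagonal(A[h], offset=d) for h in range(H)])
        alpha[d - 1] = diags.mean()  # average over heads and positions
    return alpha

def class_signature(profiles):
    """TSig_c: average the per-video profiles of all videos in class c."""
    return np.mean(profiles, axis=0)

# Toy example: 2 heads, 4 frames, row-softmaxed random attention.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 4))
A = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
alpha = temporal_profile(A)  # shape (3,): displacements d = 1, 2, 3
```

Because each attention row sums to 1, the resulting profile entries are bounded attention masses, which is what makes the averaged signatures comparable across videos and classes.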

2. End-to-End Pipeline for T-Sig Computation

The T-Sig extraction pipeline in SAGA is outlined as follows:

  1. Frame extraction and preprocessing: Uniformly sample $L$ frames from each video, resizing each (e.g., to $224 \times 224$).
  2. Tokenization with frozen vision encoder: Pass each frame through a pretrained CLIP- or ViT-style encoder, yielding $l_t$ tokens per frame of dimension $d_t$, producing $z_m \in \mathbb{R}^{l_t \times d_t}$.
  3. Per-frame spatial encoding: Apply a transformer block with self-attention across tokens, then average-pool to a single vector $f_m \in \mathbb{R}^{d_t}$ per frame. Concatenate across frames to form $\zeta_k \in \mathbb{R}^{L \times d_t}$.
  4. Temporal transformer encoding: Add sinusoidal positional embeddings of shape $L \times d_t$, then process with $D$ transformer blocks. Each block includes MHSA ($H = 12$ heads, $d_h = d_t / H$), LayerNorm, residual connections, dropout, and an MLP.
  5. Penultimate attention extraction: Retrieve $A_k^{(h)} \in \mathbb{R}^{L \times L}$ from each head $h$ in the penultimate block.
  6. Per-video temporal profile computation: For each $d = 1, \ldots, L-1$, average $A_k^{(h)}[t, t+d]$ over $t$ and heads to yield $\alpha_k(d)$.
  7. Class-level aggregation: Average $\alpha_k(d)$ over all $N_c$ videos in class $c$ to obtain $\mathrm{TSig}_c(d)$.
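Steps 4–6 can be sketched with a single MHSA attention computation standing in for the penultimate temporal block. The projection weights below are random placeholders, not trained SAGA parameters, and the frame embeddings are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
L, d_t, H = 8, 768, 12      # frames, token dim, heads (dims as in the paper)
d_h = d_t // H

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Step 3 output: one pooled embedding per frame, stacked into zeta_k (L x d_t).
zeta = rng.normal(size=(L, d_t))

# Step 4 (sketch): random Q/K projections stand in for the trained block.
W_q = rng.normal(size=(H, d_t, d_h)) / np.sqrt(d_t)
W_k = rng.normal(size=(H, d_t, d_h)) / np.sqrt(d_t)

# Step 5: per-head attention matrices A_k^{(h)}, each row-softmaxed.
Q = np.einsum('ld,hde->hle', zeta, W_q)               # (H, L, d_h)
K = np.einsum('ld,hde->hle', zeta, W_k)               # (H, L, d_h)
A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_h))  # (H, L, L)

# Step 6: average the d-th superdiagonal over heads and positions.
alpha = np.array([np.diagonal(A, offset=d, axis1=1, axis2=2).mean()
                  for d in range(1, L)])              # (L-1,)
```

Step 7 is then a plain mean of these $\alpha_k$ vectors over all videos of a class.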

3. Tensor Dimensions and Normalization Properties

The SAGA pipeline maintains precise tensor shapes throughout the computation:

| Stage | Tensor shape | Notes |
| --- | --- | --- |
| Input frames | $L$ | Typically 8 or 16 |
| Tokens per frame | $l_t$ | E.g., 197 |
| Token dimension | $d_t$ | E.g., 768 |
| Spatial encoder output | $f_m \in \mathbb{R}^{d_t}$ | Per frame |
| Stacked embeddings | $\zeta_k \in \mathbb{R}^{L \times d_t}$ | Full video |
| MHSA heads | $H = 12$, $d_h = d_t / H$ | Per temporal block |
| Attention matrices | $A_k^{(h)} \in \mathbb{R}^{L \times L}$ | Per head |
| Per-video profile | $\alpha_k \in \mathbb{R}^{L-1}$ | Offsets $d = 1, \ldots, L-1$ |
| Class-level T-Sig | $\mathrm{TSig}_c \in \mathbb{R}^{L-1}$ | Average over class |

All MHSA attention weights are row-normalized via softmax, yielding inherently comparable and interpretable T-Sig vectors. Optional normalization (unit sum or unit norm) is permissible for visualization, but not essential for interpretive or discriminative utility.
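The optional normalizations mentioned above are one-liners; the signature values here are illustrative, not measured:

```python
import numpy as np

# An example T-Sig vector (L-1 entries); the values are illustrative only.
tsig = np.array([0.40, 0.22, 0.12, 0.07, 0.05, 0.03, 0.02])

# Optional unit-sum normalization: turns the curve into a distribution over d,
# convenient for side-by-side visualization of classes with different scales.
tsig_sum = tsig / tsig.sum()

# Optional unit l2-norm: convenient when comparing curves by cosine similarity.
tsig_l2 = tsig / np.linalg.norm(tsig)
```

Neither rescaling changes the shape of the curve, which carries the discriminative information.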

4. Visualization of Temporal Attention Signatures

T-Sigs can be illustrated as curves of attention weight versus frame displacement $d$. ASCII diagrams convey the characteristic patterns:

Generator A (e.g., diffusion model with frame jitter); bar length is proportional to attention weight:

    d=1  ●●●●●●●●●●
    d=2  ●●●●
    d=3  ●●
    d=4  ●
    ...

This pattern shows rapid decay: frames attend almost exclusively to their immediate successors, consistent with temporal instability.

Generator B (e.g., VideoGAN with smooth transitions):

    d=1  ●●●●●●●●●●
    d=2  ●●●●●●●●
    d=3  ●●●●●●
    d=4  ●●●●●
    ...

Here, attention persists for larger $d$, reflecting greater temporal coherence. When multiple T-Sig curves are plotted together, each generator class yields a distinct, reproducible signature.
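Diagrams like these can be rendered from any T-Sig vector; `plot_tsig` and the curve values below are illustrative helpers, not part of the SAGA codebase:

```python
def plot_tsig(tsig, width=40, label="T-Sig"):
    """Render a T-Sig curve as one bar row per frame displacement d."""
    peak = max(tsig)
    print(label)
    for d, w in enumerate(tsig, start=1):
        bar = "#" * max(1, round(width * w / peak))  # bar length ~ weight
        print(f"d={d:<2} {bar} {w:.3f}")

# Illustrative curves (not measured values): rapid decay vs. slow decay.
jittery  = [0.45, 0.20, 0.08, 0.03]   # attends mostly to immediate successor
coherent = [0.25, 0.22, 0.19, 0.15]   # attention persists across larger d
plot_tsig(jittery, label="Generator A (rapid decay)")
plot_tsig(coherent, label="Generator B (slow decay)")
```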

5. Interpretability Mechanisms

Temporal Attention Signatures provide several interpretability advantages:

  • Temporal artifact fingerprinting: Distinct video synthesis models leave characteristic temporal signatures (e.g., over-smooth transitions, inter-frame jitter), which are exposed by T-Sigs.
  • Model-agnostic visualization: No manual feature engineering is required. The SAGA transformer’s learned attention weights directly reveal the discriminative patterns.
  • Unseen-generator detection: Averaged T-Sig curves for unobserved generators differ from those of known models, enabling open-set and novelty detection.
  • Classifier-aligned explanations: T-Sigs summarize the precise temporal cues used by the classifier, since the final decision layer consumes representations built atop the very same attention patterns.

A plausible implication is that, since T-Sigs are observable at inference time for any input video, they also offer a route toward forensic transparency in regulatory contexts.

6. Experimental Validation and Discriminative Power

Key experiments demonstrate the validity and utility of Temporal Attention Signatures:

  • Stable intra-class signatures: Each class produces a uniquely shaped $\mathrm{TSig}$, and repeated subsampling yields curves with intra-class correlation $\geq 0.90$.
  • Inter-class separation: Pairwise correlation between T-Sigs of different classes is low ($\leq 0.25$), confirming discriminative power.
  • Open-set capability: Unseen generators not present in training produce T-Sigs unmatched by known classes. Nearest-neighbor matching raises correct novelty flags in over 85% of cases.
  • Layer ablation study: Extracting T-Sigs from the final transformer block (rather than the penultimate) degrades curve quality (intra-class correlation drops by $\sim 0.15$) and separation.
  • Human interpretability: In a user study, forensic analysts assigned videos to generators using only T-Sig curves, achieving 78% accuracy (chance = 33%).
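The nearest-neighbor novelty check described above can be sketched as correlation matching against the known class signatures. `match_or_flag`, its threshold, and the signature values are hypothetical illustrations, not the paper's exact procedure:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two T-Sig curves."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_or_flag(tsig, known_sigs, threshold=0.9):
    """Attribute a query T-Sig to its best-correlated known class,
    or flag it as novel when no class correlates above the threshold."""
    best = max(known_sigs, key=lambda c: pearson(tsig, known_sigs[c]))
    r = pearson(tsig, known_sigs[best])
    return (best, r) if r >= threshold else ("novel", r)

# Illustrative class signatures (values made up for the sketch):
known = {
    "gen_A": np.array([0.45, 0.20, 0.08, 0.03]),
    "gen_B": np.array([0.25, 0.22, 0.19, 0.15]),
}
query = np.array([0.44, 0.21, 0.09, 0.02])  # closely resembles gen_A
label, r = match_or_flag(query, known)
```

The same correlation measure underlies the reported intra-class ($\geq 0.90$) and inter-class ($\leq 0.25$) statistics.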

Collectively, these results show that T-Sigs are highly reproducible, discriminative, and easily interpreted signatures directly tied to generative model provenance (Kundu et al., 16 Nov 2025).

7. Applications and Forensic Implications

Temporal Attention Signatures, as deployed in SAGA, enable multi-granular video provenance tracing—spanning basic authenticity checks, generator family grouping, specific version identification, attribution to development teams, and oracle-level discrimination of individual generator instances. They provide essential transparency for forensic and regulatory applications, particularly as video synthesis models proliferate and evolve. The method is compatible with both known and novel generators, facilitating both closed- and open-set forensic scenarios.

This approach establishes a new benchmark for interpretable synthetic video attribution, providing insight into the precise temporal artifacts leveraged by transformer-based architectures for model-agnostic forensic analysis (Kundu et al., 16 Nov 2025).
