Temporal Attention Signatures (T-Sigs)
- Temporal Attention Signatures (T-Sigs) are a mathematically defined construct that quantifies unique temporal attention patterns in transformer-based video models.
- They are computed by averaging multi-head self-attention weights from the penultimate block across frame displacements, facilitating model-agnostic forensic analysis.
- Experimental validations show high intra-class consistency and strong inter-class discrimination, making T-Sigs effective for tracking video provenance and detecting novel generators.
Temporal Attention Signatures (T-Sigs) are a mathematically defined interpretability construct introduced in the SAGA framework for source attribution of generative AI videos. T-Sigs capture the unique temporal attention patterns that transformer-based video models learn when distinguishing between real and AI-generated videos, as well as between different synthesis models. By quantifying and visualizing frame-to-frame attention across multiple levels of attribution—including authenticity, generation type, model version, development team, and specific generator—T-Sigs provide direct insight into the temporal artifacts characteristic of each video source. Rather than relying on manual feature engineering, the approach draws on the internal behavior of the attribution transformer itself, enabling model-agnostic, data-driven forensic analysis (Kundu et al., 16 Nov 2025).
1. Mathematical Formulation of Temporal Attention Signatures
Let a video comprise $T$ frames, processed by a transformer with multi-head self-attention (MHSA). For each head $h$, the second-to-last (penultimate) temporal transformer block produces an attention matrix $A^{(h)} \in \mathbb{R}^{T \times T}$, with $h = 1, \dots, H$ ($H$ being the number of heads). Each row of $A^{(h)}$ sums to 1.
The per-video, per-displacement temporal attention profile quantifies average attention from a given frame to its $d$-ahead successor:

$$p(d) = \frac{1}{H\,(T - d)} \sum_{h=1}^{H} \sum_{t=1}^{T-d} A^{(h)}_{t,\,t+d}, \qquad d = 1, \dots, T-1.$$

For a given class $c$ (e.g., "Stable Diffusion v2.1", "Real", or a specific video generator), the Temporal Attention Signature is defined as:

$$S_c(d) = \frac{1}{N_c} \sum_{v \in c} p_v(d),$$

where $N_c$ is the number of videos in class $c$. $S_c$ is the characteristic vector for class $c$. Optional normalization (e.g., unit sum or $\ell_2$-norm) is possible but not required, owing to softmax row-normalization and the resulting inter-class comparability.
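The two definitions above can be checked with a short NumPy sketch. The function names (`temporal_profile`, `class_signature`) and the random row-stochastic test data are illustrative, not part of SAGA:

```python
import numpy as np

def temporal_profile(attn):
    """Compute p(d), d = 1..T-1, from one video's penultimate-block
    attention stack `attn` of shape (H, T, T); each row is row-stochastic."""
    H, T, _ = attn.shape
    return np.array([
        np.mean([attn[h, t, t + d] for h in range(H) for t in range(T - d)])
        for d in range(1, T)
    ])

def class_signature(profiles):
    """S_c: mean of the per-video profiles p_v over all videos in class c."""
    return np.mean(np.stack(profiles), axis=0)

# Toy check: random softmax-normalized attention with H=4 heads, T=8 frames.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 8))
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)   # rows sum to 1, as in MHSA
p = temporal_profile(attn)
print(p.shape)  # (7,)
```

Since every attention weight is a softmax output, each entry of `p` lies strictly between 0 and 1 without any extra normalization.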
2. End-to-End Pipeline for T-Sig Computation
The T-Sig extraction pipeline in SAGA is outlined as follows:
- Frame extraction and preprocessing: Uniformly sample $T$ frames from each video, resizing each (e.g., to $224 \times 224$).
- Tokenization with frozen vision encoder: Pass each frame through a pretrained CLIP- or ViT-style encoder, yielding $P$ tokens per frame of dimension $D$, producing a tensor $X \in \mathbb{R}^{T \times P \times D}$.
- Per-frame spatial encoding: Apply a transformer block with self-attention across the $P$ tokens, then average-pool to a single $D$-dimensional vector per frame. Concatenate across frames to form $Z \in \mathbb{R}^{T \times D}$.
- Temporal transformer encoding: Add sinusoidal positional embeddings of shape $T \times D$, then process with a stack of temporal transformer blocks. Each block includes MHSA ($H$ heads, model dimension $D$), LayerNorm, residual connections, dropout, and an MLP.
- Penultimate attention extraction: Retrieve $A^{(h)} \in \mathbb{R}^{T \times T}$ from each head $h$ in the penultimate block.
- Per-video temporal profile computation: For each displacement $d$, average $A^{(h)}_{t,\,t+d}$ over positions $t$ and heads $h$ to yield $p_v(d)$.
- Class-level aggregation: Average $p_v$ over all videos in class $c$ to obtain $S_c$.
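The pipeline's shape flow can be sketched end to end in NumPy. This is a deliberately reduced stand-in: the frozen encoder is stubbed with random tokens, the temporal stack is collapsed to a single self-attention head, the positional embedding is a simplified sine-only variant, and $D$ is shrunk from 768:

```python
import numpy as np

rng = np.random.default_rng(1)
T, P, D = 8, 197, 64                     # frames, tokens/frame, reduced dim

def stub_encoder(frame):                 # stand-in for a frozen CLIP/ViT encoder
    return rng.normal(size=(P, D))

frames = [None] * T                      # placeholders for T preprocessed frames
tokens = np.stack([stub_encoder(f) for f in frames])   # (T, P, D)
Z = tokens.mean(axis=1)                                # spatial pooling -> (T, D)
pos = np.sin(np.arange(T)[:, None] / 10000 ** (np.arange(D) / D))
Z = Z + pos                                            # simplified positions

# One self-attention head as a stand-in for the penultimate temporal block.
Wq, Wk = rng.normal(size=(D, D)), rng.normal(size=(D, D))
scores = (Z @ Wq) @ (Z @ Wk).T / np.sqrt(D)
A = np.exp(scores - scores.max(axis=1, keepdims=True)) # stable softmax
A /= A.sum(axis=1, keepdims=True)                      # row-stochastic (T, T)

# Per-video profile: mean of the d-th superdiagonal of A for each offset d.
p = np.array([A[np.arange(T - d), np.arange(d, T)].mean() for d in range(1, T)])
print(A.shape, p.shape)  # (8, 8) (7,)
```

A real implementation would additionally average over all $H$ heads and all penultimate-block attention maps, exactly as in the formula for $p(d)$.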
3. Tensor Dimensions and Normalization Properties
The SAGA pipeline maintains precise tensor shapes throughout the computation:
| Stage | Tensor shape | Notes |
|---|---|---|
| Input frames | $T$ | Typically 8 or 16 |
| Tokens per frame | $P$ | E.g., 197 |
| Token dimension | $D$ | E.g., 768 |
| Spatial encoder out | $1 \times D$ | Per frame |
| Stacked embeddings | $T \times D$ | Full video |
| MHSA head count | $H$ | Per temporal block |
| Attention matrices | $T \times T$ | Per head |
| Per-video profile | $T - 1$ | Offsets $d = 1, \dots, T-1$ |
| Class-level T-Sig | $T - 1$ | Mean over class videos |
All MHSA attention weights are row-normalized via softmax, yielding inherently comparable and interpretable T-Sig vectors. Optional normalization (unit sum or unit norm) is permissible for visualization, but not essential for interpretive or discriminative utility.
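The optional normalizations mentioned above are one-liners; a minimal sketch (the helper name `normalize_tsig` is illustrative):

```python
import numpy as np

def normalize_tsig(s, mode="sum"):
    """Optional T-Sig normalization for visualization only; softmax
    row-normalization already makes raw profiles comparable."""
    s = np.asarray(s, dtype=float)
    if mode == "sum":                       # unit sum
        return s / s.sum()
    if mode == "l2":                        # unit Euclidean norm
        return s / np.linalg.norm(s)
    raise ValueError(f"unknown mode: {mode}")

s = np.array([0.40, 0.25, 0.15, 0.10])      # hypothetical T-Sig, T = 5
print(round(normalize_tsig(s, "sum").sum(), 6))           # 1.0
print(round(np.linalg.norm(normalize_tsig(s, "l2")), 6))  # 1.0
```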
4. Visualization of Temporal Attention Signatures
T-Sigs can be illustrated as curves of attention weight versus frame displacement $d$; different sources produce visibly different curve shapes.

(Figure: example T-Sig curves for Generator A, e.g., a diffusion model with frame jitter, and Generator B, e.g., a VideoGAN with smooth transitions.)
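Such curves can also be sketched directly in a terminal. A small helper (hypothetical, not part of SAGA) that renders a T-Sig as one ASCII bar per displacement, fed illustrative values:

```python
def render_tsig(sig, width=40):
    """Render a T-Sig as ASCII bars: one row per displacement d,
    bar length proportional to the attention weight."""
    peak = max(sig)
    lines = []
    for d, w in enumerate(sig, start=1):
        bar = "#" * max(1, round(width * w / peak))
        lines.append(f"d={d:<2} {bar} {w:.3f}")
    return "\n".join(lines)

# Hypothetical signature with attention decaying over displacement.
print(render_tsig([0.42, 0.21, 0.11, 0.06]))
```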
5. Interpretability Mechanisms
Temporal Attention Signatures provide several interpretability advantages:
- Temporal artifact fingerprinting: Distinct video synthesis models leave characteristic temporal signatures (e.g., over-smooth transitions, inter-frame jitter), which are exposed by T-Sigs.
- Model-agnostic visualization: No manual feature engineering is required. The SAGA transformer’s learned attention weights directly reveal the discriminative patterns.
- Unseen-generator detection: Averaged T-Sig curves for unobserved generators differ from those of known models, enabling open-set and novelty detection.
- Classifier-aligned explanations: T-Sigs summarize the precise temporal cues used by the classifier, since the final decision layer consumes representations built atop the very same attention patterns.
A plausible implication is that, since T-Sigs are observable at inference time for any input video, they also offer a route toward forensic transparency in regulatory contexts.
6. Experimental Validation and Discriminative Power
Key experiments demonstrate the validity and utility of Temporal Attention Signatures:
- Stable intra-class signatures: Each class produces a uniquely shaped signature $S_c$, and repeated subsampling yields curves with consistently high intra-class correlation.
- Inter-class separation: Pairwise correlation between T-Sigs of different classes is low, confirming discriminative power.
- Open-set capability: Unseen generators not present in training produce T-Sigs unmatched by known classes. Nearest-neighbor matching raises correct novelty flags in over 85% of cases.
- Layer ablation study: Extracting T-Sigs from the final transformer block (rather than the penultimate) degrades curve quality (intra-class correlation drops by 0.15) and separation.
- Human interpretability: In a user study, forensic analysts assigned videos to generators using only T-Sig curves, achieving 78% accuracy (chance = 33%).
Collectively, these results show that T-Sigs are highly reproducible, discriminative, and easily interpreted signatures directly tied to generative model provenance (Kundu et al., 16 Nov 2025).
7. Applications and Forensic Implications
Temporal Attention Signatures, as deployed in SAGA, enable multi-granular video provenance tracing—spanning basic authenticity checks, generator family grouping, specific version identification, attribution to development teams, and oracle-level discrimination of individual generator instances. They provide essential transparency for forensic and regulatory applications, particularly as video synthesis models proliferate and evolve. The method is compatible with both known and novel generators, facilitating both closed- and open-set forensic scenarios.
This approach establishes a new benchmark for interpretable synthetic video attribution, providing insight into the precise temporal artifacts leveraged by transformer-based architectures for model-agnostic forensic analysis (Kundu et al., 16 Nov 2025).