
Learning Visual Representations From Pure Causality

This lightning talk introduces Temporal Difference in Vision (TDV), a self-supervised learning approach that drops the conventional tricks: no augmentations, no cropping, no masking, no contrastive objectives. TDV learns visual representations from video using only one assumption: the past causes the future. By modeling the temporal difference between consecutive frames, it achieves competitive performance on dense spatial tasks such as semantic segmentation and optical flow, while pointing toward a future where representation learning scales purely on data and causality, not hand-crafted biases.

Script

Every breakthrough in visual AI has stripped away assumptions. Convolutional networks needed spatial grids and labels. Self-supervised methods like SimCLR traded labels for augmentations. But what if we could learn representations using nothing but the arrow of time?

The authors propose Temporal Difference in Vision, or TDV. It learns by predicting the next frame's representation from the current one plus an abstract motion encoding derived from pixel differences. The causal structure of video becomes the only teacher.
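As a rough illustration of the objective described above, the sketch below predicts the next frame's representation from the current representation plus a motion code computed from the pixel difference. It is a toy under stated assumptions, not the authors' implementation: the linear encoders, the additive predictor, the squared-error loss, and every function name are hypothetical, and the attention over the current frame's context is omitted.

```python
import random

def frame_difference(frame_t, frame_next):
    """Per-pixel difference between two consecutive frames (flat lists of floats)."""
    return [b - a for a, b in zip(frame_t, frame_next)]

def linear(weights, vector):
    """Apply a dense linear map stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def predict_next_representation(z_t, motion_code):
    """Assumed additive predictor: next representation = current + motion code."""
    return [z + m for z, m in zip(z_t, motion_code)]

def prediction_loss(predicted, target):
    """Mean squared error between predicted and target representations."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def random_matrix(rows, cols, rng):
    """Small random weight matrix standing in for a learned encoder."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
num_pixels, dim = 16, 4
frame_t = [rng.random() for _ in range(num_pixels)]
frame_next = [rng.random() for _ in range(num_pixels)]

student_weights = random_matrix(dim, num_pixels, rng)  # hypothetical frame encoder
motion_weights = random_matrix(dim, num_pixels, rng)   # hypothetical motion encoder
teacher_weights = [row[:] for row in student_weights]  # teacher starts as a copy

z_t = linear(student_weights, frame_t)
motion_code = linear(motion_weights, frame_difference(frame_t, frame_next))
predicted = predict_next_representation(z_t, motion_code)
target = linear(teacher_weights, frame_next)  # treated as a fixed target (no gradient)
loss = prediction_loss(predicted, target)
```

In a real training loop the loss would be minimized with respect to the student and motion encoders only, while the target comes from a separate teacher network.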

The architecture is elegantly minimal. A motion encoder compresses frame differences while attending to the current frame's context. An exponential moving average teacher, borrowed from DINO, prevents collapse without any augmentations. Temporal consistency and additivity are the only constraints.
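The two mechanisms named here can be sketched in a few lines. The exponential moving average update below is the standard form used for momentum teachers. The additivity check is only one plausible reading of the constraint (the motion code from frame a to frame c should match the sum of the codes for a to b and b to c), and the function names and momentum value are assumptions rather than details from the paper.

```python
def ema_update(teacher_params, student_params, momentum=0.99):
    """Move each teacher parameter a small step toward the student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def additivity_gap(motion_ab, motion_bc, motion_ac):
    """Squared gap between the composed code (a->b plus b->c) and the direct code (a->c)."""
    return sum((ab + bc - ac) ** 2
               for ab, bc, ac in zip(motion_ab, motion_bc, motion_ac))

# Toy usage: the teacher drifts toward a fixed student over three updates.
teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
```

Because the teacher changes slowly and receives no gradient, it provides a stable target for the student, which is the usual argument for why such a teacher helps avoid a collapsed, constant representation.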

On semantic segmentation, TDV matches DINO and iBOT with broad region coverage, though boundary precision lags slightly. But on optical flow, it wins decisively with locally consistent, spatially coherent motion fields. The explicit modeling of temporal differences pays off exactly where you'd expect.

Without augmentations, TDV never learns the object-centric attention that DINO develops. Its representations are spatially coherent and boundary-aligned, but less semantically discriminative. It trades semantic invariance for pure temporal structure, revealing the cost of removing every inductive bias.

TDV shows that you can learn visual representations from causality alone. As datasets grow and compute scales, these minimal assumptions may become the path forward, not just for vision but for any modality with temporal structure. Dive deeper into this work and create your own video summaries at EmergentMind.com.