An Examination of Video-LLMs' Temporal Reasoning Capabilities
The paper "Time Blindness: Why Video-LLMs Can't See What Humans Can?" addresses a notable deficiency in current video-LLMs (VLMs) – their inability to interpret temporal patterns devoid of spatial cues. While recent advancements in VLMs have equipped these models with formidable capabilities for recognizing spatial features and performing tasks across numerous video processing applications, their effectiveness in purely temporal understanding remains limited.
Core Contributions and Findings
The authors propose a novel benchmark, SpookyBench, specifically designed to evaluate the temporal reasoning abilities of VLMs. Unlike existing benchmarks, which mix spatial and temporal cues, SpookyBench isolates temporal comprehension by presenting information solely through sequences of noise-like frames. The benchmark draws on natural and engineered systems, such as the temporal signaling of fireflies and timing-based communication protocols, in which meaning is conveyed through timing rather than through distinct spatial patterns.
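To make the setup concrete, here is a minimal sketch of one plausible temporal-noise encoding in this spirit. The function names and the counter-phase flicker scheme are illustrative assumptions, not SpookyBench's published construction: any single frame is indistinguishable from random noise, and the hidden shape is recoverable only from frame-to-frame dynamics.

```python
import numpy as np

def make_temporal_stimulus(mask: np.ndarray, num_frames: int = 32,
                           seed: int = 0) -> np.ndarray:
    """Hypothetical encoder: `mask` (H x W bool) is invisible in any single
    frame but recoverable from temporal dynamics. Masked pixels flip on
    every frame (counter-phase flicker); background pixels are resampled
    independently, so both regions look like pure noise spatially."""
    rng = np.random.default_rng(seed)
    h, w = mask.shape
    frames = rng.integers(0, 2, size=(num_frames, h, w)).astype(np.uint8)
    for t in range(1, num_frames):
        # Force a deterministic flip inside the mask; keep fresh noise outside.
        frames[t][mask] = 1 - frames[t - 1][mask]
    return frames

def decode(frames: np.ndarray, threshold: float = 0.75) -> np.ndarray:
    """Purely temporal decoder: masked pixels change on every frame
    (rate ~1.0), background pixels change about half the time (~0.5)."""
    change_rate = np.abs(np.diff(frames.astype(np.int8), axis=0)).mean(axis=0)
    return change_rate > threshold

# Example: a square emerges from per-pixel timing statistics alone.
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
frames = make_temporal_stimulus(mask)
assert (decode(frames) == mask).mean() > 0.95
```

A decoder this simple succeeds precisely because it ignores spatial appearance; a model that summarizes each frame independently sees only noise.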
The authors quantify the extent of this shortcoming. Humans recognize the shapes, text, and patterns in these temporally encoded sequences with over 98% accuracy, whereas state-of-the-art VLMs achieve 0% accuracy on the same task, underscoring their pervasive reliance on spatial features and a fundamental deficiency in processing temporal information.
Technical and Architectural Analysis
The paper identifies key architectural shortcomings in current VLMs. Existing models typically process videos by extracting spatial features frame by frame with Vision Transformers (ViTs) and only subsequently integrating these features over time. This dependence on spatial feature extraction relegates temporal reasoning to a secondary role: when spatial information is minimized or absent, VLMs cannot derive meaning from temporal cues.
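The following schematic sketches this frame-centric pipeline. It is a hedged illustration rather than any particular model's code; the module names and the assumption that the spatial encoder returns one pooled embedding per frame are ours. It shows where temporal reasoning enters and why pixel-level dynamics between frames are already lost by that point.

```python
import torch
import torch.nn as nn

class TypicalVideoPipeline(nn.Module):
    """Schematic of the frame-centric design the paper critiques: a spatial
    encoder runs on each frame independently, and temporal structure is
    introduced only afterwards by a lightweight aggregator."""

    def __init__(self, spatial_encoder: nn.Module, dim: int = 768):
        super().__init__()
        self.spatial_encoder = spatial_encoder  # e.g., a ViT backbone
        self.temporal_agg = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)            # (b*t, c, h, w)
        feats = self.spatial_encoder(frames)    # (b*t, dim), one vector per frame
        feats = feats.view(b, t, -1)            # (b, t, dim)
        # Temporal reasoning happens only here, over per-frame summaries;
        # inter-frame pixel dynamics have already been discarded.
        return self.temporal_agg(feats).mean(dim=1)  # (b, dim)

# Example with a stand-in "spatial encoder" (a flatten + linear projection):
dummy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 768))
model = TypicalVideoPipeline(dummy_encoder)
out = model(torch.randn(2, 8, 3, 32, 32))  # -> shape (2, 768)
```

On a SpookyBench-style input, every per-frame embedding summarizes what looks like identical noise, so the aggregator has nothing temporal left to reason over.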
The paper reports evaluations across the video categories in the SpookyBench dataset, including text, shapes, object images, and dynamic scenes. Notably, models perform uniformly poorly regardless of pre-training scale or parameter count, and even temporally specialized architectures such as TimeChat and LongVLM fail to interpret temporal patterns in the absence of spatial cues.
Implications for Future Research in AI
The implications of these findings are multifaceted. Practically, VLMs must evolve beyond spatial dependence to perform reliably in settings where temporal reasoning is critical. The inability to capture temporal patterns affects real-world applications such as autonomous driving and security surveillance, where temporal cues can be essential indicators of intent or behavior.
Moreover, from a theoretical standpoint, the authors argue for a paradigm shift. Drawing on insights from cognitive science and neuroscience, they suggest that future models adopt architectural innovations that treat temporal information as a primary processing component rather than as an adjunct to spatial recognition. This could involve neural architectures that model distributed representations of temporal dynamics, inspired by the human brain's mechanisms for temporal perception.
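As one hypothetical illustration of what temporal-first processing could look like (our assumption, not a design proposed in the paper), an input stem might filter each pixel's time-series before any spatial mixing, so that purely temporal signals survive even when individual frames are pure noise:

```python
import torch
import torch.nn as nn

class TemporalFirstStem(nn.Module):
    """Illustrative 'temporal-first' input stem: per-pixel temporal filters
    are applied before any spatial mixing, inverting the usual order."""

    def __init__(self, channels: int = 3, temporal_dim: int = 16):
        super().__init__()
        # A Conv3d kernel extended only along time acts as a bank of
        # per-pixel temporal filters (no spatial receptive field yet).
        self.temporal_filters = nn.Conv3d(
            channels, temporal_dim, kernel_size=(5, 1, 1), padding=(2, 0, 0))
        # Spatial mixing is applied only after temporal features exist.
        self.spatial_mix = nn.Conv3d(
            temporal_dim, temporal_dim, kernel_size=(1, 3, 3), padding=(0, 1, 1))

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, time, height, width)
        x = torch.relu(self.temporal_filters(video))  # temporal features first
        return self.spatial_mix(x)

stem = TemporalFirstStem()
features = stem(torch.randn(2, 3, 32, 16, 16))  # -> (2, 16, 32, 16, 16)
```

Whether such a stem scales to full video-language models is an open question, but it captures the ordering the authors advocate: timing first, appearance second.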
Conclusion
The paper "Time Blindness" elucidates a critical gap in the field of video-LLMing by systematically exploring VLMs' shortcomings in temporal reasoning. By introducing SpookyBench, the authors not only highlight the limitations but also provide a platform for advancing research towards bridging the perceptual gap between human and machine understanding. This work encourages the development of next-generation models capable of matching human-like temporal reasoning, thus expanding the potential applications and understanding of AI in dynamic, complex environments.