TIM: A Time Interval Machine for Audio-Visual Action Recognition (2404.05559v2)

Published 8 Apr 2024 in cs.CV

Abstract: Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder that ingests a long video input. The encoder then attends to the specified interval, as well as the surrounding context in both modalities, in order to recognise the ongoing action. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we show that TIM can be adapted for action detection, using dense multi-scale interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and showing strong performance on the Perception Test. Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance. Code and models at: https://github.com/JacobChalk/TIM

Authors (5)
  1. Jacob Chalk
  2. Jaesung Huh
  3. Evangelos Kazakos
  4. Andrew Zisserman
  5. Dima Damen

Summary

Enhancing Audio-Visual Action Recognition with Time Interval Queries in Long Videos

Introduction to Time Interval Machine (TIM)

In audio-visual action recognition, understanding the interplay between audio and visual signals in long videos is essential. Different actions give rise to rich audio-visual cues whose temporal extents, and often labels, differ between the two modalities, which makes accurate recognition challenging. The Time Interval Machine (TIM) addresses this by explicitly modelling the temporal extents of audio and visual events: a modality-specific time interval is posed as a query to a transformer encoder that ingests a long video input, and the encoder attends to the queried interval, as well as the surrounding context in both modalities, to recognise the ongoing action.

Modality-Specific Time Interval Representation

Traditional techniques typically operate on trimmed clips or on the exact temporal span of an action, without considering the untrimmed, long-video context. TIM instead treats time intervals as first-class entities, combining each interval with a modality-specific representation to form a query. This allows TIM to exploit correlations between the auditory and visual modalities, including their background context, when recognising an ongoing action. For instance, TIM can differentiate between events such as "Rinse Sponge" and the accompanying audio of water flowing, even though the two overlap in time, because each is queried with its own interval and modality.
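
To make the query mechanism concrete, the snippet below is a minimal PyTorch-style sketch of how a modality-specific time interval could be encoded into a query token and fed, together with audio and visual feature tokens, through a transformer encoder. The module names, the MLP interval encoding, and all dimensions are illustrative assumptions rather than the authors' exact implementation; the official code is linked in the abstract.

```python
import torch
import torch.nn as nn


class IntervalQueryEncoder(nn.Module):
    """Sketch: encode a (start, end) interval plus a modality tag into a query token."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        # MLP over normalised (start, end) times -- an assumed design choice.
        self.interval_mlp = nn.Sequential(
            nn.Linear(2, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        # Learned embeddings distinguishing a visual query from an audio query.
        self.modality_embed = nn.Embedding(2, d_model)  # 0 = visual, 1 = audio

    def forward(self, start, end, modality):
        interval = torch.stack([start, end], dim=-1)  # (B, 2), times normalised to [0, 1]
        return self.interval_mlp(interval) + self.modality_embed(modality)  # (B, d_model)


class TIMSketch(nn.Module):
    """Sketch: a query token attends over concatenated audio + visual context tokens."""

    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 n_layers: int = 2, n_classes: int = 100):
        super().__init__()
        self.query_encoder = IntervalQueryEncoder(d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, visual_feats, audio_feats, start, end, modality):
        # visual_feats: (B, Tv, D), audio_feats: (B, Ta, D) -- pre-extracted features
        # covering the whole long-video input, not just the queried interval.
        query = self.query_encoder(start, end, modality).unsqueeze(1)  # (B, 1, D)
        tokens = torch.cat([query, visual_feats, audio_feats], dim=1)  # query + context
        out = self.encoder(tokens)
        return self.classifier(out[:, 0])  # read the prediction off the query position


# Usage: ask "what visual action occupies [0.30, 0.45] of this window?" and
# "what audio event occupies [0.10, 0.20]?" over the same long-video context.
model = TIMSketch()
B, Tv, Ta, D = 2, 50, 50, 256
logits = model(
    torch.randn(B, Tv, D), torch.randn(B, Ta, D),
    start=torch.tensor([0.30, 0.10]), end=torch.tensor([0.45, 0.20]),
    modality=torch.tensor([0, 1]),
)
```

Reading the prediction off the query position is what lets a single encoder answer many different interval and modality queries over the same long-video context.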

Empirical Validation and Results

TIM has been evaluated on three long audio-visual video datasets, EPIC-KITCHENS, the Perception Test, and AVE, reporting state-of-the-art recognition performance on all three. Notably, TIM improves top-1 action recognition accuracy on EPIC-KITCHENS by 2.9% over the previous state of the art. Its versatility further extends to action detection via dense multi-scale interval queries, outperforming prior state-of-the-art methods on most metrics of EPIC-KITCHENS-100 and showing strong performance on the Perception Test.

  • On EPIC-KITCHENS: TIM outperforms competing methods by a clear margin in action recognition accuracy, surpassing models that rely on LLMs and significantly larger pre-training.
  • Adaptation for Action Detection: by issuing dense multi-scale interval queries, TIM extends to action detection, outperforming existing state-of-the-art methods on most EPIC-KITCHENS-100 metrics and generalising well to the Perception Test (see the sketch after this list).
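
As a rough illustration of the detection setup described above, the helper below enumerates dense candidate intervals at several temporal scales over a long video; each candidate would then be posed as a query and scored by the recognition model. The function name, scales, and stride ratio are hypothetical choices, not values taken from the paper.

```python
def dense_multiscale_intervals(video_duration: float,
                               scales=(1.0, 2.0, 4.0, 8.0),
                               stride_ratio: float = 0.25):
    """Enumerate candidate (start, end) query intervals at multiple temporal scales.

    Purely illustrative: the scale and stride values are assumptions, not the paper's.
    """
    intervals = []
    for window in scales:                    # window length in seconds
        stride = window * stride_ratio       # dense overlap between neighbouring windows
        start = 0.0
        while start + window <= video_duration:
            intervals.append((start, start + window))
            start += stride
        # make sure the tail of the video is also covered at this scale
        intervals.append((max(0.0, video_duration - window), video_duration))
    return intervals


# e.g. a 30-second input yields a few hundred candidates, each of which is posed
# as a query (per modality) and classified, before overlapping detections are merged.
candidates = dense_multiscale_intervals(30.0)
```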

Theoretical Contributions and Practical Implications

TIM introduces several innovations and advancements in audio-visual action recognition:

  • The concept of modality-specific time interval queries enriches the model's understanding of long videos, accommodating the distinct temporal characteristics of audio and visual events.
  • The incorporation of context from both modalities, including periods of inactivity, contributes to a more nuanced recognition of events.
  • Achieving new state-of-the-art results across multiple datasets underscores TIM's effectiveness and potential for real-world applications in surveillance, content management, and interactive systems.

Speculating on Future Developments

The impressive results achieved by TIM pave the way for further exploration into the integration of temporal dynamics with audio-visual data. Future research could delve into:

  • The exploration of more granular time interval queries to capture subtler distinctions and overlaps in actions.
  • Leveraging the model's insights for tasks beyond recognition and detection, such as event prediction and temporal segmentation.
  • Investigating the fusion of TIM with other modalities, such as depth or tactile sensors, to enrich the model's perception of physical interactions.

Conclusion

The introduction of the Time Interval Machine (TIM) represents a significant advance in audio-visual action recognition, particularly in the context of long videos. Through its innovative use of modality-specific time interval queries, TIM not only achieves state-of-the-art results but also opens new avenues for research in video understanding and multimodal signal processing.
