TIM: A Time Interval Machine for Audio-Visual Action Recognition (2404.05559v2)

Published 8 Apr 2024 in cs.CV

Abstract: Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder that ingests a long video input. The encoder then attends to the specified interval, as well as the surrounding context in both modalities, in order to recognise the ongoing action. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we show that TIM can be adapted for action detection, using dense multi-scale interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and showing strong performance on the Perception Test. Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance. Code and models at: https://github.com/JacobChalk/TIM

Authors (5)
  1. Jacob Chalk
  2. Jaesung Huh
  3. Evangelos Kazakos
  4. Andrew Zisserman
  5. Dima Damen

Summary

Enhancing Audio-Visual Action Recognition with Time Interval Queries in Long Videos

Introduction to Time Interval Machine (TIM)

In audio-visual action recognition, understanding the interplay between audio and visual signals in long videos is essential. Different actions give rise to rich audio-visual cues whose temporal extents, and often labels, differ between the two modalities, which makes accurate recognition challenging. The Time Interval Machine (TIM) addresses this by explicitly modelling the temporal extents of audio and visual events: a modality-specific time interval is posed as a query to a transformer encoder that ingests a long video input, and the encoder attends to the queried interval, as well as the surrounding context in both modalities, to recognise the ongoing action.

Modality-Specific Time Interval Representation

Traditional techniques typically operate on trimmed clips or on the exact temporal span of an action, without considering the untrimmed, long-video context. TIM instead treats time intervals as first-class entities, combining each interval with a modality-specific representation to form a query. This allows TIM to exploit correlations between the auditory and visual modalities, including their background context, when recognising an ongoing action. For instance, TIM can differentiate between events such as "Rinse Sponge" and the accompanying audio of water flowing, even though the two overlap in time, because each is queried with its own interval and modality.
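
To make the query mechanism concrete, the snippet below is a minimal PyTorch-style sketch of how a modality-specific time interval could be encoded into a query token and fed, together with audio and visual feature tokens, through a transformer encoder. The module names, the MLP interval encoding, and all dimensions are illustrative assumptions rather than the authors' exact implementation; the official code is linked in the abstract.

```python
import torch
import torch.nn as nn


class IntervalQueryEncoder(nn.Module):
    """Sketch: encode a (start, end) interval plus a modality tag into a query token."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        # MLP over normalised (start, end) times -- an assumed design choice.
        self.interval_mlp = nn.Sequential(
            nn.Linear(2, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        # Learned embeddings distinguishing a visual query from an audio query.
        self.modality_embed = nn.Embedding(2, d_model)  # 0 = visual, 1 = audio

    def forward(self, start, end, modality):
        interval = torch.stack([start, end], dim=-1)  # (B, 2), times normalised to [0, 1]
        return self.interval_mlp(interval) + self.modality_embed(modality)  # (B, d_model)


class TIMSketch(nn.Module):
    """Sketch: a query token attends over concatenated audio + visual context tokens."""

    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 n_layers: int = 2, n_classes: int = 100):
        super().__init__()
        self.query_encoder = IntervalQueryEncoder(d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, visual_feats, audio_feats, start, end, modality):
        # visual_feats: (B, Tv, D), audio_feats: (B, Ta, D) -- pre-extracted features
        # covering the whole long-video input, not just the queried interval.
        query = self.query_encoder(start, end, modality).unsqueeze(1)  # (B, 1, D)
        tokens = torch.cat([query, visual_feats, audio_feats], dim=1)  # query + context
        out = self.encoder(tokens)
        return self.classifier(out[:, 0])  # read the prediction off the query position


# Usage: ask "what visual action occupies [0.30, 0.45] of this window?" and
# "what audio event occupies [0.10, 0.20]?" over the same long-video context.
model = TIMSketch()
B, Tv, Ta, D = 2, 50, 50, 256
logits = model(
    torch.randn(B, Tv, D), torch.randn(B, Ta, D),
    start=torch.tensor([0.30, 0.10]), end=torch.tensor([0.45, 0.20]),
    modality=torch.tensor([0, 1]),
)
```

Reading the prediction off the query position is what lets a single encoder answer many different interval and modality queries over the same long-video context.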

Empirical Validation and Results

TIM has been evaluated on three long audio-visual video datasets, EPIC-KITCHENS, the Perception Test, and AVE, reporting state-of-the-art recognition performance on all three. Notably, TIM improves top-1 action recognition accuracy on EPIC-KITCHENS by 2.9% over the previous state of the art. Its versatility further extends to action detection via dense multi-scale interval queries, outperforming prior state-of-the-art methods on most metrics of EPIC-KITCHENS-100 and showing strong performance on the Perception Test.

  • On EPIC-KITCHENS: TIM outperforms competing methods by a clear margin in action recognition accuracy, surpassing models that rely on LLMs and significantly larger pre-training.
  • Adaptation for Action Detection: by issuing dense multi-scale interval queries, TIM extends to action detection, outperforming existing state-of-the-art methods on most EPIC-KITCHENS-100 metrics and generalising well to the Perception Test (see the sketch after this list).
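
As a rough illustration of the detection setup described above, the helper below enumerates dense candidate intervals at several temporal scales over a long video; each candidate would then be posed as a query and scored by the recognition model. The function name, scales, and stride ratio are hypothetical choices, not values taken from the paper.

```python
def dense_multiscale_intervals(video_duration: float,
                               scales=(1.0, 2.0, 4.0, 8.0),
                               stride_ratio: float = 0.25):
    """Enumerate candidate (start, end) query intervals at multiple temporal scales.

    Purely illustrative: the scale and stride values are assumptions, not the paper's.
    """
    intervals = []
    for window in scales:                    # window length in seconds
        stride = window * stride_ratio       # dense overlap between neighbouring windows
        start = 0.0
        while start + window <= video_duration:
            intervals.append((start, start + window))
            start += stride
        # make sure the tail of the video is also covered at this scale
        intervals.append((max(0.0, video_duration - window), video_duration))
    return intervals


# e.g. a 30-second input yields a few hundred candidates, each of which is posed
# as a query (per modality) and classified, before overlapping detections are merged.
candidates = dense_multiscale_intervals(30.0)
```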

Theoretical Contributions and Practical Implications

TIM introduces several innovations and advancements in audio-visual action recognition:

  • The concept of modality-specific time interval queries enriches the model's understanding of long videos, accommodating the distinct temporal characteristics of audio and visual events.
  • The incorporation of context from both modalities, including periods of inactivity, contributes to a more nuanced recognition of events.
  • Achieving new state-of-the-art results across multiple datasets underscores TIM's effectiveness and potential for real-world applications in surveillance, content management, and interactive systems.

Speculating on Future Developments

The impressive results achieved by TIM pave the way for further exploration into the integration of temporal dynamics with audio-visual data. Future research could delve into:

  • The exploration of more granular time interval queries to capture subtler distinctions and overlaps in actions.
  • Leveraging the model's insights for tasks beyond recognition and detection, such as event prediction and temporal segmentation.
  • Investigating the fusion of TIM with other modalities, such as depth or tactile sensors, to enrich the model's perception of physical interactions.

Conclusion

The introduction of the Time Interval Machine (TIM) represents a significant advance in audio-visual action recognition, particularly in the context of long videos. Through its innovative use of modality-specific time interval queries, TIM not only achieves state-of-the-art results but also opens new avenues for research in video understanding and multimodal signal processing.
