TIM: A Time Interval Machine for Audio-Visual Action Recognition (2404.05559v2)
Abstract: Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder that ingests a long video input. The encoder then attends to the specified interval, as well as the surrounding context in both modalities, in order to recognise the ongoing action. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we show that TIM can be adapted for action detection, using dense multi-scale interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and showing strong performance on the Perception Test. Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance. Code and models at: https://github.com/JacobChalk/TIM
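To make the interval-query mechanism concrete, here is a minimal sketch of the idea described in the abstract: a time interval (start, end) plus a learned modality tag is embedded into a single query token, a transformer encoder attends jointly over that query and a long sequence of pre-extracted audio-visual features, and the output at the query position is classified. This is a hypothetical PyTorch rendering, not the authors' released code (see the linked repository for that); the dimensions, class count, and the choice to prepend the query as an extra token are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IntervalQueryEncoder(nn.Module):
    """Hypothetical sketch of a TIM-style time-interval query.

    All sizes here are arbitrary placeholders, not the paper's settings.
    """

    def __init__(self, dim=256, num_classes=97, num_layers=2):
        super().__init__()
        # Embed the normalised (start, end) interval into the model dimension.
        self.interval_mlp = nn.Sequential(
            nn.Linear(2, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        # One learned tag per modality: 0 = visual query, 1 = audio query.
        self.modality_tag = nn.Embedding(2, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats, interval, modality):
        # feats:    (B, T, dim)  pre-extracted audio + visual context features
        # interval: (B, 2)       normalised (start, end) in [0, 1]
        # modality: (B,)         0 for a visual query, 1 for an audio query
        query = self.interval_mlp(interval) + self.modality_tag(modality)
        # Prepend the query token so self-attention can gather context for it.
        tokens = torch.cat([query.unsqueeze(1), feats], dim=1)
        out = self.encoder(tokens)
        # Classify from the query position only.
        return self.classifier(out[:, 0])


# Toy usage: a 50-token context, querying the visual action in [0.2, 0.5].
model = IntervalQueryEncoder()
feats = torch.randn(1, 50, 256)
logits = model(feats, torch.tensor([[0.2, 0.5]]), torch.tensor([0]))
print(logits.shape)  # torch.Size([1, 97])
```

Under this reading, detection follows naturally: instead of one ground-truth interval, densely sampled multi-scale intervals are each posed as queries and scored, which matches the abstract's description of adapting TIM to action detection.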
Authors: Jacob Chalk, Jaesung Huh, Evangelos Kazakos, Andrew Zisserman, Dima Damen