
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation (2301.09209v4)

Published 22 Jan 2023 in cs.CV and cs.CL

Abstract: We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects, coined action context. We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. TransFusion leverages pre-trained image captioning and vision-language models to extract the action context from past video frames. This action context together with the next video frame is processed by the multimodal fusion module to forecast the next object interaction. Our model enables more efficient end-to-end learning. The large pre-trained language models add common sense and a generalisation capability. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model. They also highlight the benefits of using language-based context summaries in a task where vision seems to suffice. Our method outperforms state-of-the-art approaches by 40.4% in relative terms in overall mAP on the Ego4D test set. We validate the effectiveness of TransFusion via experiments on EPIC-KITCHENS-100. Video and code are available at https://eth-ait.github.io/transfusion-proj/.
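To make the fusion idea concrete, here is a minimal NumPy sketch of the pattern the abstract describes: embedded tokens from a language summary of past actions are concatenated with patch embeddings of the next frame and jointly attended before a prediction head. All dimensions, token counts, the single-head attention, and the 5-way head are illustrative assumptions, not the paper's actual TransFusion architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # Single-head scaled dot-product attention; random projections
    # stand in for learned weights, purely for illustration.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

d = 64
# Hypothetical inputs: 10 word tokens from the action-context summary,
# 16 patch tokens from the next video frame.
context_tokens = rng.standard_normal((10, d))
frame_tokens = rng.standard_normal((16, d))

# Fuse both modalities in one joint attention pass over the
# concatenated token sequence.
fused = self_attention(np.concatenate([context_tokens, frame_tokens]), d)

# Pool and project to a hypothetical 5-way interaction classifier.
pooled = fused.mean(axis=0)
logits = pooled @ rng.standard_normal((d, 5))
```

The key design point this sketch mirrors is that language and vision tokens share one attention space, so the frame representation can be conditioned directly on the summarized action context.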

Authors (8)
  1. Razvan-George Pasca
  2. Alexey Gavryushin
  3. Yen-Ling Kuo
  4. Luc Van Gool
  5. Otmar Hilliges
  6. Xi Wang
  7. Muhammad Hamza
  8. Kaichun Mo
Citations (12)