
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering (2401.01529v1)

Published 3 Jan 2024 in cs.CV

Abstract: Video Question Answering (VideoQA) has emerged as a vital tool for evaluating agents' ability to understand human daily behaviors. Despite the recent success of large vision-language models on many multi-modal tasks, complex situation reasoning over videos involving multiple human-object interaction events remains challenging. In contrast, humans tackle it easily by using a series of episode memories as anchors to quickly locate the question-related key moments for reasoning. To mimic this effective reasoning strategy, we propose the Glance-Focus model. One simple approach is to apply an action detection model to predict a set of actions as key memories; however, actions drawn from a closed vocabulary are hard to generalize to various video domains. Instead, we train an encoder-decoder to generate a set of dynamic event memories at the glancing stage. Besides using supervised bipartite matching to obtain the event memories, we further design an unsupervised memory generation method to remove the dependence on event annotations. Next, at the focusing stage, these event memories act as a bridge between the high-level event concepts in the question and the low-level, lengthy video content. Given a question, the model first focuses on the generated key event memory, then focuses on the most relevant moment for reasoning through our multi-level cross-attention mechanism. We conduct extensive experiments on four multi-event VideoQA benchmarks: STAR, EgoTaskQA, AGQA, and NExT-QA. Our model achieves state-of-the-art results, surpassing current large models on various challenging reasoning tasks. The code and models are available at https://github.com/ByZ0e/Glance-Focus.
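The abstract describes a two-stage pipeline: a glancing stage that decodes a set of event memories from the whole video, and a focusing stage that routes the question first through those memories and then to the relevant low-level moments via multi-level cross-attention. Below is a minimal PyTorch sketch of that flow; the module layout, feature dimensions, layer counts, and number of memory queries are illustrative assumptions, not the authors' released implementation (see the linked repository for the actual code).

```python
# A minimal sketch of the glance-focus idea, assuming DETR-style learnable
# queries for the glancing stage and two stacked cross-attention hops for
# the focusing stage. All names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class GlanceFocusSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_memories=10):
        super().__init__()
        # Glancing stage: learnable queries decoded against the video
        # features to produce a set of dynamic event memories.
        self.memory_queries = nn.Parameter(torch.randn(n_memories, d_model))
        self.glance_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        # Focusing stage: the question attends to the event memories first,
        # then to the raw frame features (multi-level cross-attention).
        self.q_to_memory = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.q_to_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, video_feats, question_feats):
        # video_feats:    (B, T, d) frame-level features
        # question_feats: (B, L, d) token-level question features
        B = video_feats.size(0)

        # Glance: generate event memories from the full video.
        queries = self.memory_queries.unsqueeze(0).expand(B, -1, -1)
        event_memories = self.glance_decoder(queries, video_feats)  # (B, M, d)

        # Focus, level 1: locate the question-relevant event memory.
        q_evt, _ = self.q_to_memory(question_feats, event_memories, event_memories)

        # Focus, level 2: use the memory-conditioned question to attend
        # to the most relevant low-level video moments.
        q_vid, _ = self.q_to_video(q_evt, video_feats, video_feats)
        return q_vid  # fused representation passed to an answer head
```

In the supervised variant the generated event memories would be matched to annotated events with bipartite (Hungarian) matching, e.g. `scipy.optimize.linear_sum_assignment`, as in DETR-style set prediction; the unsupervised memory generation method mentioned in the abstract removes that annotation dependence.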

Authors (3)
  1. Ziyi Bai (1 paper)
  2. Ruiping Wang (32 papers)
  3. Xilin Chen (119 papers)