OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog (2402.13146v1)

Published 20 Feb 2024 in cs.CV

Abstract: We present the Object Language Video Transformer (OLViT) - a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns. In stark contrast to previous works, our approach is generic by nature and is therefore capable of learning continuous multi-modal dialog state representations of the most relevant objects and rounds. As a result, these representations can be seamlessly integrated into LLMs and offer high flexibility in dealing with different datasets and tasks. Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance across both datasets.
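The abstract describes two attention-based trackers (OST over video objects, LST over previous dialog turns) whose continuous state vectors are combined into a global dialog state. The following is a minimal sketch of that idea using plain scaled dot-product attention in NumPy; all class and function names are illustrative assumptions, not the paper's actual implementation, and the real model uses learned transformer layers rather than this parameter-free update.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # scaled dot-product attention: weight each value by its relevance to the query
    scores = keys @ query / np.sqrt(query.shape[-1])
    return softmax(scores) @ values

class StateTracker:
    """Keeps a continuous state vector, updated each turn by attending over
    candidate embeddings: detected objects for an OST-like tracker, or
    embeddings of previous dialog turns for an LST-like tracker."""

    def __init__(self, dim):
        self.state = np.zeros(dim)

    def update(self, question_emb, candidates):
        # summarize the candidates most relevant to the current question,
        # then blend that summary into the running state (simple average
        # here; a learned gate would be used in practice)
        context = attend(question_emb, candidates, candidates)
        self.state = 0.5 * self.state + 0.5 * context
        return self.state

def global_dialog_state(ost, lst):
    # the global dialog state concatenates both trackers' states
    return np.concatenate([ost.state, lst.state])
```

Under this sketch, the concatenated global state is what would be handed to a downstream language model for response classification or generation.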

Authors (3)
  1. Adnen Abdessaied
  2. Manuel von Hochmeister
  3. Andreas Bulling
Citations (1)
