
Sequential Compositional Generalization in Multimodal Models (2404.12013v1)

Published 18 Apr 2024 in cs.CL

Abstract: The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address this by examining sequential compositional generalization using CompAct (Compositional Activities; project page: http://cyberiada.github.io/CompAct), a carefully constructed, perceptually grounded dataset set within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset is represented with a combination of raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. More importantly, our setup ensures that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain.

References (70)
  1. Ekin Akyurek and Jacob Andreas. 2021. Lexicon learning for few shot sequence modeling. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4934–4946, Online. Association for Computational Linguistics.
  2. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
  3. Openflamingo: An open-source framework for training large autoregressive vision-language models. ArXiv, abs/2308.01390.
  4. Systematic generalization: What is required and can it be learned? In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  5. Moshe Bar. 2007. The proactive brain: using analogies and associations to generate predictions. Trends in cognitive sciences, 11(7):280–289.
  6. Marco Baroni. 2020. Linguistic generalization and compositionality in modern artificial neural networks. Phil. Trans. R. Soc. B, 375(1791):20190307.
  7. COVR: A test-bed for visually grounded compositional generalization with real images. CoRR, abs/2109.10613.
  8. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE.
  9. Distilling audio-visual knowledge by compositional contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7016–7025.
  10. Measures of distance between probability distributions. Journal of mathematical analysis and applications, 138(1):280–292.
  11. Andy Clark. 2015. Surfing uncertainty: Prediction, action, and the embodied mind. Oxford University Press.
  12. The devil is in the detail: Simple tricks improve systematic generalization of transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 619–634. Association for Computational Linguistics.
  13. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55.
  14. Evaluating compositionality in sentence embeddings. CoRR, abs/1802.04302.
  15. CLOSURE: assessing systematic generalization of CLEVR models. In Visually Grounded Interaction and Language (ViGIL), NeurIPS 2019 Workshop, Vancouver, Canada, December 13, 2019.
  16. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
  17. Assessing composition in sentence vector representations. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 1790–1801. Association for Computational Linguistics.
  18. Predicting the future: A jointly learnt model for action anticipation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5562–5571.
  19. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190.
  20. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012.
  21. Audioclip: Extending clip to image, text and audio. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 976–980. IEEE.
  22. Deep residual learning for image recognition. In Proceedings of the IEEE conference on CVPR, pages 770–778.
  23. Emergent systematic generalization in a situated agent. CoRR, abs/1910.00571.
  24. Visually grounded continual learning of compositional semantics. CoRR, abs/2005.00785.
  25. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1988–1997. IEEE Computer Society.
  26. Time-conditioned action anticipation in one shot. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9925–9934.
  27. Measuring compositional generalization: A comprehensive method on realistic data. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  28. Najoung Kim and Tal Linzen. 2020. COGS: A compositional generalization challenge based on semantic interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9087–9105, Online. Association for Computational Linguistics.
  29. Uncontrolled lexical exposure leads to overestimation of compositional generalization in pretrained models. arXiv preprint arXiv:2212.10769.
  30. Brenden M. Lake. 2019. Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 9788–9798.
  31. Brenden M. Lake and Marco Baroni. 2017. Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks. CoRR, abs/1711.00350.
  32. Brenden M. Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 2879–2888. PMLR.
  33. Human few-shot learning of compositional instructions. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society, CogSci 2019: Creativity + Cognition + Computation, Montreal, Canada, July 24-27, 2019, pages 611–617. cognitivesciencesociety.org.
  34. Obelisc: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527.
  35. Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems, 34:29348–29363.
  36. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726.
  37. Maqa: A multimodal qa benchmark for negation. In NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research.
  38. Gain: On the generalization of instructional action understanding. In The Eleventh International Conference on Learning Representations.
  39. Microsoft COCO: common objects in context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, volume 8693 of Lecture Notes in Computer Science, pages 740–755. Springer.
  40. Streamingqa: A benchmark for adaptation to new knowledge over time in question answering models. In International Conference on Machine Learning, pages 13604–13622. PMLR.
  41. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online. Association for Computational Linguistics.
  42. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  43. Crepe: Can vision-language foundation models reason compositionally? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10910–10921.
  44. Compositional generalization in image captioning. In Proceedings of the 23rd Conference on Computational Natural Language Learning, CoNLL 2019, Hong Kong, China, November 3-4, 2019, pages 87–98. Association for Computational Linguistics.
  45. On "scientific debt" in nlp: A case for more rigour in language model pre-training research. arXiv preprint arXiv:2306.02870.
  46. OpenAI. 2023. Gpt-4 technical report.
  47. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
  48. Improving compositional generalization with latent structure and data augmentation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4341–4362, Seattle, United States. Association for Computational Linguistics.
  49. Evaluating the impact of model scale for compositional generalization in semantic parsing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9157–9179.
  50. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137–1149.
  51. A benchmark for systematic generalization in grounded language understanding. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  52. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252.
  53. Look before you speak: Visually contextualized utterances. CoRR, abs/2012.05710.
  54. Learning to learn words from visual scenes. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIX, volume 12374 of Lecture Notes in Computer Science, pages 434–452. Springer.
  55. Shifting the baseline: Single modality performance on visual navigation & QA. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1977–1983, Minneapolis, Minnesota. Association for Computational Linguistics.
  56. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248.
  57. Llama 2: Open foundation and fine-tuned chat models.
  58. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy. ACL.
  59. Iterated learning for emergent systematicity in VQA. In International Conference on Learning Representations.
  60. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
  61. Understanding multimodal procedural knowledge by sequencing multimodal instructional manuals. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4525–4542.
  62. Reascan: Compositional reasoning in language grounding. arXiv preprint arXiv:2109.08994.
  63. Zero-shot compositional concept learning. In Proceedings of the 1st Workshop on Meta Learning and Its Applications to Natural Language Processing, pages 19–27, Online. Association for Computational Linguistics.
  64. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  65. Do vision-language pretrained models learn composable primitive concepts? Trans. Mach. Learn. Res., 2023.
  66. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387.
  67. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
  68. Non-sequential graph script induction via multimedia grounding. arXiv preprint arXiv:2305.17542.
  69. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3537–3545.
  70. ViLPAct: A benchmark for compositional generalization on multimodal human activities. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2192–2207, Dubrovnik, Croatia. Association for Computational Linguistics.
Authors (6)
  1. Semih Yagcioglu
  2. Osman Batur İnce
  3. Aykut Erdem
  4. Erkut Erdem
  5. Desmond Elliott
  6. Deniz Yuret

Summary

Evaluation of Multimodal Models on Sequential Compositional Generalization in Egocentric Videos

Introduction to the Study

In this paper, the researchers explore how well multimodal models handle sequential compositional generalization, a task built on the CompAct dataset, which comprises egocentric kitchen activity videos paired with audio cues and textual descriptions. The focus is on how well unimodal and multimodal approaches generate and understand novel combinations of previously learned elements.

Description and Significance of the CompAct Dataset

The CompAct dataset was developed specifically for this paper and builds on sequences from EPIC-KITCHENS-100 (EK-100). Each video captures unscripted kitchen activities from a first-person perspective and comes with accompanying audio and textual data. Importantly, the train and test sets exhibit similar distributions of verbs and objects (the atoms), but different combinations of these atoms, setting the stage for a rigorous evaluation of compositional generalization.

Key Features

  • Multimodal Composition: Each instance combines video, audio, and text.
  • Compositional Splits: Atoms (verbs/objects) are shared across training and test sets, but their combinations in the test set are novel (see the sketch after this list).
  • Rich Annotations: Linguistic descriptions connect closely with the visual and auditory data, offering a trio of synchronized modalities for analysis.
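
To make the split design concrete, here is a minimal sketch of how such a compositional split could be constructed. The instance schema (dicts with "verb" and "object" keys) and the rarest-compositions heuristic are illustrative assumptions, not the paper's actual pipeline.

```python
from collections import Counter

def compositional_split(instances, test_fraction=0.2):
    """Hold out (verb, object) compositions for testing while keeping the
    individual atoms visible during training. Schema is illustrative."""
    comps = Counter((ex["verb"], ex["object"]) for ex in instances)

    # Hold out the rarest compositions: their atoms are still likely to
    # appear in other, more frequent compositions in the training split.
    n_test = max(1, int(len(comps) * test_fraction))
    test_comps = {c for c, _ in comps.most_common()[-n_test:]}

    train = [ex for ex in instances
             if (ex["verb"], ex["object"]) not in test_comps]
    test = [ex for ex in instances
            if (ex["verb"], ex["object"]) in test_comps]

    # Sanity check: every test atom must occur in training, otherwise the
    # gap measures unseen atoms rather than unseen compositions.
    train_verbs = {ex["verb"] for ex in train}
    train_objs = {ex["object"] for ex in train}
    assert all(ex["verb"] in train_verbs and ex["object"] in train_objs
               for ex in test), "test split contains unseen atoms"
    return train, test
```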

Methodology and Model Evaluation

The researchers undertook a comprehensive assessment of several models, ranging from unimodal text-only baselines to multimodal systems that integrate video, audio, and text.

Tasks Designed for Assessment

  • Next Utterance Prediction: Models predict a textual description of the next, unseen video segment from the preceding sequence (a scoring sketch follows this list).
  • Atom Classification: Direct classification of verbs and objects, focusing on recognizing elements in isolation.
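
Generated descriptions in setups like this are typically scored with surface-overlap and embedding-based metrics (the paper's reference list includes BLEU and BERTScore). Below is a minimal BLEU-based scoring sketch using NLTK; the whitespace tokenization and the smoothing choice are simplifying assumptions.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def score_next_utterance(predictions, references):
    """Corpus-level BLEU for predicted next-step descriptions.
    Whitespace tokenization is a simplification."""
    smoothing = SmoothingFunction().method1  # avoids zero scores on short outputs
    hyps = [p.split() for p in predictions]
    refs = [[r.split()] for r in references]  # one reference per instance
    return corpus_bleu(refs, hyps, smoothing_function=smoothing)

# Toy usage:
preds = ["cut the onion on the board"]
golds = ["cut the onion on the chopping board"]
print(f"BLEU: {score_next_utterance(preds, golds):.3f}")
```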

Models Tested

Several types of models were considered:

  • Baseline Models: Unimodal and various multimodal configurations (e.g., text-only, video-text, audio-text); a fusion sketch follows this list.
  • Pretrained Models: Large-scale models such as LLaMA2 and ImageBind, which bring extensive pretraining (text-only in the case of LLaMA2, multimodal in the case of ImageBind).
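
The paper's exact baseline architectures are not detailed in this summary, but a common recipe for such tri-modal baselines is early fusion: project per-modality features into a shared space, concatenate them, and encode jointly with a transformer. The sketch below is one illustrative configuration with assumed feature dimensions, not the paper's actual model.

```python
import torch
import torch.nn as nn

class TriModalFusion(nn.Module):
    """Early-fusion baseline: project video, audio, and text features into a
    shared space and encode the concatenated sequence jointly.
    All dimensions are illustrative assumptions."""

    def __init__(self, d_video=768, d_audio=128, d_text=512, d_model=512):
        super().__init__()
        self.proj_v = nn.Linear(d_video, d_model)
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_t = nn.Linear(d_text, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, video, audio, text):
        # Each input: (batch, seq_len_modality, d_modality)
        tokens = torch.cat(
            [self.proj_v(video), self.proj_a(audio), self.proj_t(text)], dim=1)
        fused = self.encoder(tokens)  # (batch, total_seq, d_model)
        return fused.mean(dim=1)      # pooled joint representation

# Toy forward pass with random features:
model = TriModalFusion()
out = model(torch.randn(2, 8, 768), torch.randn(2, 8, 128), torch.randn(2, 8, 512))
print(out.shape)  # torch.Size([2, 512])
```

Dropping a modality (e.g., audio) yields the corresponding bi-modal variants mentioned above.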

Findings and Implications

The results suggest that multimodal models generally outperform their unimodal counterparts, especially when combining video, audio, and textual modalities. Pretrained models like ImageBind performed notably well, likely benefiting from their extensive multimodal pretraining. Notably, however, all models struggled to some degree with true compositional generalization, that is, with interpreting entirely novel combinations of familiar elements.

Models' Generalization Capabilities

  • Although there were improvements with multimodal inputs, genuine compositional tasks remained challenging.
  • Pretrained models did not always perform consistently, indicating possible limitations in their training or adaptation phases.

Future Research and Theoretical Contributions

This paper raises several questions for future research:

  • Role of Grounding: How does grounding in real-world audio-visual data affect the learning dynamics and capabilities of generative models?
  • Model Architectures: What architectural innovations are necessary to better support compositional generalization?

Final Thoughts

The results underscore the complexity of compositional generalization and highlight the need for further investigation into how multimodal models can be better engineered and trained to handle such tasks effectively. As AI systems increasingly move towards real-world applications, the ability to generalize over novel yet logically related combinations of learned components will be crucial.
