Navigating Hallucinations for Reasoning of Unintentional Activities (2402.19405v2)
Abstract: In this work we present a novel task of understanding unintentional human activities in videos. We formalize this problem as a reasoning task under a zero-shot scenario: given a video of an unintentional activity, we want to know why it transitioned from intentional to unintentional. We first evaluate the effectiveness of current state-of-the-art Large Multimodal Models on this reasoning task and observe that they suffer from hallucination. We then propose a novel prompting technique, termed Dream of Thoughts (DoT), which allows the model to navigate through hallucinated thoughts to achieve better reasoning. To evaluate performance on this task, we also introduce three specialized metrics designed to quantify the model's reasoning capability. We perform our experiments on two datasets, Oops and UCF-Crimes, and our findings show that DoT prompting outperforms standard prompting while minimizing hallucinations.
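The abstract does not specify how DoT navigates hallucinated thoughts, so the sketch below is only a loose, hypothetical illustration of the general idea of sampling several candidate reasoning "thoughts" and discarding likely hallucinations via majority agreement (a self-consistency-style vote). It is not the paper's actual DoT algorithm, and `toy_model`, `generate_thoughts`, and `select_consistent_answer` are all invented names for this sketch.

```python
from collections import Counter

def generate_thoughts(model, prompt, n_samples=5):
    """Sample several candidate reasoning chains ("thoughts") from the model."""
    return [model(prompt, seed=i) for i in range(n_samples)]

def select_consistent_answer(thoughts):
    """Keep the answer most thoughts agree on; lone outliers are treated
    as likely hallucinations and discarded."""
    answers = [t.strip().lower() for t in thoughts]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

# Hypothetical stand-in for a multimodal model call: most sampled
# thoughts agree, while one sample hallucinates an unrelated cause.
def toy_model(prompt, seed=0):
    return "the skier lost balance" if seed != 3 else "a bird flew past"

thoughts = generate_thoughts(toy_model, "Why did the activity become unintentional?")
answer, agreement = select_consistent_answer(thoughts)
print(answer, agreement)  # → the skier lost balance 0.8
```

In this toy run, four of five sampled thoughts agree, so the outlier is filtered out; the agreement score could serve as a simple confidence signal alongside the paper's dedicated hallucination metrics.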