PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain (2402.15527v1)
Abstract: We present PCA-Bench, a multimodal decision-making benchmark for evaluating the integrated capabilities of Multimodal LLMs (MLLMs). Departing from previous benchmarks that focus on simplistic tasks and individual model capabilities, PCA-Bench introduces three complex scenarios: autonomous driving, domestic robotics, and open-world games. Given task instructions and diverse contexts, the model is required to seamlessly integrate multiple capabilities of Perception, Cognition, and Action in a reasoning chain to make accurate decisions. Moreover, PCA-Bench features error localization, attributing model inaccuracies to areas such as perception, knowledge, or reasoning, which improves the reliability of deploying MLLMs. To balance accuracy and efficiency in evaluation, we propose PCA-Eval, an automatic evaluation protocol, and assess 10 prevalent MLLMs. The results reveal significant performance disparities between open-source models and powerful proprietary models like GPT-4 Vision. To address this, we introduce Embodied-Instruction-Evolution (EIE), an automatic framework for synthesizing instruction tuning examples in multimodal embodied environments. EIE generates 7,510 training examples in PCA-Bench and enhances the performance of open-source MLLMs, occasionally surpassing GPT-4 Vision (+3% in decision accuracy), thereby validating the effectiveness of EIE. Our findings suggest that robust MLLMs like GPT-4 Vision show promise for decision-making in embodied agents, opening new avenues for MLLM research.
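To make the Perception-Cognition-Action evaluation idea concrete, the sketch below shows one plausible way an automatic, PCA-Eval-style scorer with error localization could be structured: the model's answer is split into perception, cognition, and action fields, the action is checked against ground truth, and failures are attributed to the earliest stage that went wrong. All names here (`PCAExample`, `score_example`, the keyword-matching heuristic) are illustrative assumptions, not the paper's actual protocol or prompts.

```python
from dataclasses import dataclass, field

@dataclass
class PCAExample:
    """One benchmark instance (hypothetical schema); the image context
    is omitted and only the text fields needed for scoring are kept."""
    question: str
    ground_truth_action: str                          # e.g. "slow down"
    key_concepts: list = field(default_factory=list)  # perception targets, e.g. ["speed limit sign"]

@dataclass
class ModelOutput:
    perception: str   # what the model reports seeing
    cognition: str    # the knowledge/reasoning it applies
    action: str       # the final decision

def score_example(example: PCAExample, output: ModelOutput) -> dict:
    """Hypothetical PCA-Eval-style scoring: action accuracy plus
    coarse error localization via keyword matching."""
    action_correct = example.ground_truth_action.lower() in output.action.lower()
    perception_hit = all(c.lower() in output.perception.lower()
                         for c in example.key_concepts)
    result = {"action_correct": action_correct, "error_stage": None}
    if not action_correct:
        # Attribute the failure to the earliest stage that went wrong.
        result["error_stage"] = "perception" if not perception_hit else "cognition/action"
    return result

if __name__ == "__main__":
    ex = PCAExample(
        question="You are driving. What should you do next?",
        ground_truth_action="slow down",
        key_concepts=["speed limit sign"],
    )
    out = ModelOutput(
        perception="I see a speed limit sign showing 30 km/h ahead.",
        cognition="The current speed exceeds the limit, so the car must decelerate.",
        action="Slow down to comply with the speed limit.",
    )
    print(score_example(ex, out))  # {'action_correct': True, 'error_stage': None}
```

In practice such a scorer would more likely use an LLM judge than keyword matching, but the three-stage attribution shown here mirrors the perception/knowledge/reasoning error localization the benchmark describes.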
Authors: Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi Meng, Tianyu Liu, Baobao Chang