BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
Abstract: Multimodal reasoning is a pivotal capability for large vision-language models (LVLMs). Integration with domain-specific languages (DSLs), which offer precise representations of visual information, gives these models the opportunity to perform more accurate reasoning in complex and professional domains. However, vanilla Chain-of-Thought (CoT) prompting struggles to leverage the distinct strengths of visual and DSL representations, primarily because their reasoning mechanisms differ, and it often falls short on critical steps in multi-step reasoning tasks. To mitigate these challenges, we introduce the \underline{B}i-Modal \underline{B}ehavioral \underline{A}lignment (BBA) prompting method, designed to maximize the potential of DSLs in augmenting complex multimodal reasoning. BBA first guides LVLMs to create separate reasoning chains for the visual and DSL representations, then aligns these chains by resolving any inconsistencies between them, achieving a cohesive integration of behaviors from the two modalities. Our experiments demonstrate that BBA substantially improves the performance of GPT-4V(ision) on geometry problem solving ($28.34\% \to 34.22\%$), chess positional advantage prediction ($42.08\% \to 46.99\%$), and molecular property prediction ($77.47\% \to 83.52\%$).
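The two-stage procedure described in the abstract can be sketched as a simple prompting pipeline. This is a minimal illustration, not the paper's implementation: `query_lvlm` is a hypothetical placeholder for a real LVLM API call (e.g., to GPT-4V), and the exact prompt wording is assumed.

```python
# Sketch of the BBA prompting pipeline.
# Stage 1: elicit one reasoning chain per modality (image vs. DSL).
# Stage 2: ask the model to align the chains by resolving inconsistencies.
# `query_lvlm` is a hypothetical stand-in for a real LVLM API client.

def query_lvlm(prompt: str) -> str:
    """Placeholder LVLM call; replace with a real API client."""
    return f"[model response to: {prompt[:40]}...]"

def bba_prompt(image_desc: str, dsl_repr: str, question: str) -> str:
    # Stage 1a: reasoning chain grounded only in the visual input.
    vision_chain = query_lvlm(
        "Using only the image, reason step by step.\n"
        f"Image: {image_desc}\nQuestion: {question}"
    )
    # Stage 1b: reasoning chain grounded only in the DSL representation
    # (e.g., Asymptote for geometry, PGN for chess, SMILES for molecules).
    dsl_chain = query_lvlm(
        "Using only the DSL representation, reason step by step.\n"
        f"DSL: {dsl_repr}\nQuestion: {question}"
    )
    # Stage 2: behavioral alignment — reconcile inconsistent steps and
    # produce a single integrated answer.
    return query_lvlm(
        "Compare the two reasoning chains below, resolve any inconsistent "
        "steps between them, and give a final answer.\n"
        f"Vision chain: {vision_chain}\nDSL chain: {dsl_chain}"
    )
```

The key design choice is that inconsistencies between the two chains are surfaced explicitly and resolved in a dedicated step, rather than being left implicit in a single interleaved CoT.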