Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models (2312.08762v1)

Published 14 Dec 2023 in cs.AI

Abstract: Chain-of-thought (CoT) reasoning has exhibited impressive performance in LLMs for solving complex tasks and answering questions. However, many real-world questions require multi-modal information, such as text and images. Previous research on multi-modal CoT has primarily focused on extracting fixed image features from off-the-shelf vision models and then fusing them with text using attention mechanisms. This approach has limitations because these vision models were not designed for complex reasoning tasks and do not align well with language thoughts. To overcome this limitation, we introduce a novel approach for multi-modal CoT reasoning that utilizes latent space learning via diffusion processes to generate effective image features that align with language thoughts. Our method fuses image features and text representations at a deep level and improves the complex reasoning ability of multi-modal CoT. We demonstrate the efficacy of our proposed method on multi-modal ScienceQA and machine translation benchmarks, achieving state-of-the-art performance on ScienceQA. Overall, our approach offers a more robust and effective solution for multi-modal reasoning in LLMs, enhancing their ability to tackle complex real-world problems.
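The core idea the abstract describes, generating image features through a diffusion-style denoising process conditioned on the text and then fusing them with text representations at a deep level, can be illustrated with a minimal NumPy sketch. Everything here is a simplifying assumption for illustration (a single denoising step, a random projection as the noise predictor, single-head cross-attention, toy dimensions), not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(noisy_latent, text_summary, W):
    """One toy denoising step: predict noise from [latent; text condition]
    and subtract it, so the image latent is shaped by the language thought."""
    cond = np.concatenate([noisy_latent, text_summary], axis=-1)  # [patches, 2d]
    eps = np.tanh(cond @ W)                                       # [patches, d]
    return noisy_latent - eps

def cross_attention_fuse(text_tokens, image_feats):
    """Single-head cross-attention: text tokens (queries) attend over the
    generated image features (keys/values), with a residual connection."""
    scores = text_tokens @ image_feats.T / np.sqrt(text_tokens.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return text_tokens + weights @ image_feats

d = 16
text_tokens = rng.standard_normal((10, d))   # 10 text token representations
noisy_latent = rng.standard_normal((4, d))   # 4 image-patch latents (pure noise)
text_summary = np.tile(text_tokens.mean(axis=0), (4, 1))  # condition per patch
W = rng.standard_normal((2 * d, d)) * 0.1    # toy noise-prediction weights

image_feats = denoise_step(noisy_latent, text_summary, W)
fused = cross_attention_fuse(text_tokens, image_feats)
print(fused.shape)  # (10, 16): text tokens enriched with image information
```

The contrast with prior work is in where the image features come from: instead of fixed features from an off-the-shelf vision model, the latent is produced by a generative process conditioned on the language representation, so the two modalities are aligned before fusion.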

Authors (4)
  1. Liqi He (2 papers)
  2. Zuchao Li (76 papers)
  3. Xiantao Cai (13 papers)
  4. Ping Wang (289 papers)
Citations (11)
