Introduction
Vision-language models (VLMs) are known for their capacity to handle and interpret multimodal tasks, where inputs can be both textual and visual. They perform complex reasoning by integrating information from different sources, such as images and text, often outperforming text-only LLMs. But when VLMs are tasked with unimodal challenges, particularly math and general-purpose reasoning questions, their visual capabilities go unused because these problems are presented exclusively as text.
Self-Imagination in VLMs
A recent technique known as SELF-IMAGINE seeks to bridge this gap. It mimics the human strategy of solving a problem by first visualizing it and then using the visual aid to deduce a solution. A single VLM transforms a textual query into a visual representation: the model converts the query into HTML code, the HTML is rendered into an image, and the image is combined with the original text query so that the VLM can leverage both text and visual information. Remarkably, this method requires no additional training data or training.
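As a rough illustration of that loop, here is a minimal Python sketch. The `vlm_generate` function is a hypothetical wrapper around whatever VLM API is in use (it is not part of the paper), and the `html2image` library stands in for whichever HTML renderer the authors used; the sketch assumes it is installed along with a compatible headless browser.

```python
from html2image import Html2Image  # renders HTML to a PNG via a headless browser

# Hypothetical prompt template; the actual SELF-IMAGINE prompts are not shown here.
HTML_PROMPT = (
    "Convert the following question into HTML that visually lays out "
    "its key quantities and relationships. Return only the HTML.\n\n"
    "Question: {q}"
)

def self_imagine(query: str, vlm_generate) -> str:
    """Answer a text-only query with a SELF-IMAGINE-style loop:
    text -> HTML -> rendered image -> (image + text) -> answer.

    `vlm_generate(text, image_path=None)` is a hypothetical callable that
    invokes the same VLM for both steps and returns its string output.
    """
    # Step 1: ask the VLM to express the query as an HTML diagram.
    html = vlm_generate(HTML_PROMPT.format(q=query))

    # Step 2: render the generated HTML into an image file.
    hti = Html2Image(output_path=".")
    hti.screenshot(html_str=html, save_as="diagram.png")

    # Step 3: answer the original query using both the text and the image.
    return vlm_generate(
        f"Use the attached diagram to answer: {query}",
        image_path="diagram.png",
    )
```

In practice the HTML-generation prompt would likely include few-shot examples and the rendering step some error handling; the sketch only fixes the data flow of query, generated HTML, rendered image, and final multimodal answer.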
Experimental Findings
The efficacy of SELF-IMAGINE was evaluated on mathematics and general-purpose reasoning tasks. Improvements were observed across all tested mathematical reasoning tasks and the majority of general-purpose reasoning tasks, with gains ranging from marginal to substantial, demonstrating that self-generated imagery can reliably boost VLM performance. However, some tasks showed a decrease in performance when the generated image failed to capture the necessary information, underscoring the importance of visual representations that accurately align with the problem-solving process.
Conclusions
SELF-IMAGINE exemplifies how well-crafted visual representations can enhance VLM reasoning on text-only tasks. The results substantiate the importance of quality in the image generation step, revealing that performance improvements are contingent on the images' ability to accurately reflect and simplify the reasoning sequence. The findings suggest that while images can be remarkably beneficial for reasoning in VLMs, further research on image generation techniques is needed to fully harness their potential in problem-solving scenarios.