Introduction
Vision-language models (VLMs) are known for their capacity to handle and interpret multimodal tasks, where inputs can be both textual and visual. They perform complex reasoning by integrating information from different sources, such as images and text, often outperforming text-only LLMs. But when VLMs are tasked with unimodal challenges, particularly math and general-purpose reasoning questions, their visual capabilities go unused because these problems are presented exclusively as text.
Self-Imagination in VLMs
A recent technique known as SELF-IMAGINE seeks to bridge this gap. It mimics the human strategy of solving a problem by first visualizing it and then using the visual aid to deduce a solution. A single VLM transforms a textual query into a visual representation: the model converts the query into HTML code, the HTML is rendered into an image, and the image is combined with the original text query so that the VLM can leverage both text and visual information. Remarkably, this method requires no additional training data or training.
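As a rough illustration of that loop, here is a minimal Python sketch. The `vlm_generate` function is a hypothetical wrapper around whatever VLM API is in use (it is not part of the paper), and the `html2image` library stands in for whichever HTML renderer the authors used; the sketch assumes it is installed along with a compatible headless browser.

```python
from html2image import Html2Image  # renders HTML to a PNG via a headless browser

# Hypothetical prompt template; the actual SELF-IMAGINE prompts are not shown here.
HTML_PROMPT = (
    "Convert the following question into HTML that visually lays out "
    "its key quantities and relationships. Return only the HTML.\n\n"
    "Question: {q}"
)

def self_imagine(query: str, vlm_generate) -> str:
    """Answer a text-only query with a SELF-IMAGINE-style loop:
    text -> HTML -> rendered image -> (image + text) -> answer.

    `vlm_generate(text, image_path=None)` is a hypothetical callable that
    invokes the same VLM for both steps and returns its string output.
    """
    # Step 1: ask the VLM to express the query as an HTML diagram.
    html = vlm_generate(HTML_PROMPT.format(q=query))

    # Step 2: render the generated HTML into an image file.
    hti = Html2Image(output_path=".")
    hti.screenshot(html_str=html, save_as="diagram.png")

    # Step 3: answer the original query using both the text and the image.
    return vlm_generate(
        f"Use the attached diagram to answer: {query}",
        image_path="diagram.png",
    )
```

In practice the HTML-generation prompt would likely include few-shot examples and the rendering step some error handling; the sketch only fixes the data flow of query, generated HTML, rendered image, and final multimodal answer.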
Experimental Findings
The efficacy of SELF-IMAGINE was evaluated on mathematics and general-purpose reasoning tasks. Improvements were observed across all tested mathematical reasoning tasks and the majority of general-purpose reasoning tasks, with gains ranging from marginal to substantial, demonstrating that self-generated imagery can reliably boost VLM performance. However, some tasks showed a decrease in performance when the generated image failed to capture the necessary information, underscoring the importance of visual representations that accurately align with the problem-solving process.
Conclusions
SELF-IMAGINE exemplifies how well-crafted visual representations can enhance VLM reasoning on text-only tasks. The results substantiate the importance of quality in the image generation step, revealing that performance improvements are contingent on the images' ability to accurately reflect and simplify the reasoning sequence. The findings suggest that while images can be remarkably beneficial for reasoning in VLMs, further research on image generation techniques is needed to fully harness their potential in problem-solving scenarios.