Towards Efficient Visual Language Reasoning: An Analysis of DEPLOT
The paper "DEPLOT: One-shot visual language reasoning by plot-to-table translation" addresses the challenge of reasoning over visual language mediums, such as charts and plots, with a focus on improving the efficiency and capability of understanding complex queries. The authors present a novel approach that deconstructs visual language reasoning into two primary processes: translation of plot images into tables and leveraging the capabilities of LLMs for reasoning over these converted tables. This approach is embodied in DEPLOT, a modality conversion module, that significantly enhances processing proficiency with one-shot examples, circumventing the data-intensive demands of previous state-of-the-art models.
Earlier chart question-answering models required copious amounts of training data and still struggled with intricate human-written queries. The proposed DEPLOT+LLM framework distinguishes itself by achieving a 24.0% improvement over the finetuned state-of-the-art model MATCHA on human-written queries from the ChartQA benchmark. By decomposing the task into modality conversion followed by linguistic reasoning, DEPLOT uniquely harnesses the inherent few-shot reasoning abilities of contemporary LLMs.
Methodological Innovations and Technical Merits
A pivotal facet of this research is the DEPLOT modality-conversion module, which translates plots and charts into linearized tables. The module is obtained by training an image-to-text Transformer on the plot-to-table task, which the authors standardize with a unified input-output format and evaluation metrics. The training corpus combines synthetic and real-world plot-table pairs to ensure robustness; an example linearization scheme is sketched below.
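To make the linearized target concrete, the helper below sketches one plausible scheme, with cells joined by " | " and rows by newlines, modeled on the markdown-style format DEPLOT emits; the exact token-level conventions of the released model may differ.

```python
def linearize_table(title: str, header: list[str], rows: list[list[str]]) -> str:
    """Flatten a 2D table into a single string suitable as an
    image-to-text training target (format is an assumption modeled
    on DEPLOT's markdown-style output)."""
    lines = [f"TITLE | {title}"]
    lines.append(" | ".join(header))
    lines.extend(" | ".join(str(cell) for cell in row) for row in rows)
    return "\n".join(lines)

# Example: a bar chart's underlying data as a linearized table.
print(linearize_table(
    "Revenue by year",
    ["Year", "Revenue"],
    [["2020", "1.2"], ["2021", "1.8"]],
))
```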
The work also capitalizes on sophisticated LLM prompting techniques: Chain-of-Thought (CoT), Self-Consistency (SC), and Program of Thoughts (PoT). These techniques provide structural guidance in problem-solving: CoT elicits intermediate reasoning steps before the final answer, SC samples multiple reasoning paths and takes a majority vote over their answers, and PoT offloads numerical computation to generated code executed by an interpreter. An SC sketch follows this paragraph.
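As a concrete instance of one of these techniques, the sketch below implements Self-Consistency over chain-of-thought samples; `llm_sample` is a hypothetical stand-in for any sampling-based LLM call, and the answer-extraction heuristic assumes completions end with an "Answer:" marker.

```python
import collections

def extract_final_answer(completion: str) -> str:
    """Take the text after the last 'Answer:' marker in a chain-of-thought
    completion (a simplifying assumption about the prompt format)."""
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistent_answer(prompt: str, llm_sample, n: int = 10) -> str:
    """Self-Consistency: sample n reasoning paths at temperature > 0 and
    return the majority-vote final answer."""
    answers = [
        extract_final_answer(llm_sample(prompt, temperature=0.7))
        for _ in range(n)
    ]
    return collections.Counter(answers).most_common(1)[0][0]
```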
Implications and Future Directions
The results on ChartQA, especially on human-authored queries, reveal the strength of DEPLOT+LLM on reasoning tasks that require comprehension of complex, compositional human questions. The modular decomposition into visual translation and textual reasoning exemplifies a shift towards more flexible, robust AI systems, with implications for applications that extract structured data from visual formats, such as automated reporting and data digitization.
However, the DEPLOT+LLM framework underperforms finetuned models on synthetic datasets such as PlotQA, whose template-generated queries reward models that exploit the structural biases of the training data, an advantage a one-shot framework cannot replicate. This highlights the framework's dependence on diverse, real-world queries for peak performance. Future research could explore encoding visual attributes such as color, orientation, and style in the plot-to-table translation to mitigate information loss during conversion.
Conclusion
The paper marks a substantial advance in visual language reasoning, charting a path towards a plug-and-play framework that minimizes the need for extensive labeled datasets. DEPLOT+LLM not only performs remarkably well on complex real-world queries but also demonstrates a modular approach that multimodal AI research can build on. This indicates a promising trajectory for integrating LLMs into multimodal reasoning applications, aligning with ongoing efforts to deepen AI's understanding of sophisticated visual and linguistic information.