- The paper introduces CHARTOM, a benchmark that assesses MLLMs’ ability to extract data from charts and predict human misinterpretation using paired FACT and MIND questions.
- It employs a Human Misleadingness Index based on human experiments to quantitatively evaluate the misleading potential of visual chart elements.
- The study advances theory-of-mind research in AI by integrating multimodal inputs, enhancing applications in journalism, medical communications, and public policy.
An Essay on "CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal LLMs"
The paper "CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal LLMs" introduces CHARTOM, a novel benchmark for evaluating the theory-of-mind capabilities of multimodal LLMs (MLLMs) on visual charts. The benchmark is designed to assess not merely factual comprehension of data visualizations but also the ability to predict human misinterpretation caused by misleading visual elements within those charts. This dual focus captures the interplay between the facts a chart encodes and the cognitive biases that shape how humans actually read it.
Core Concepts
- Theory of Mind in AI: The paper expands on the notion of theory-of-mind (ToM) in artificial intelligence, which entails an understanding and anticipation of human thought processes, rather than just factual data interpretation. The authors posit that for AI to effectively assist humans, it must comprehend both what is factually correct and how humans are likely to perceive said facts within varied contexts.
- Visual Chart Misleadingness: By drawing on well-documented instances of misleading visual data in media and scientific reports, the paper argues for the necessity of AI systems that can not only parse visual data accurately but also gauge the likelihood of human misinterpretation. Contemporary AI ToM tasks have predominantly concentrated on text; this paper uniquely positions itself by focusing on visual perception, particularly in how charts can mislead audiences.
Benchmark Design and Methodology
The CHARTOM benchmark comprises 112 pairs of charts, each pairing an original version with a manipulated one, designed to test both factual comprehension and perceived misleadingness. Key aspects include:
- FACT and MIND Questions: Each chart is paired with two question types. FACT questions assess the model's ability to accurately extract the presented data, while MIND questions evaluate its prediction of how misleading the chart would be to a typical human observer.
- Human Misleadingness Index (HMI): The authors derive HMI by conducting experiments with human subjects to establish a ground truth for the benchmark's MIND questions. HMI quantifies what percentage of humans are likely to be misled by a particular chart, forming a basis for evaluating AI predictions against human interpretations.
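The HMI-based evaluation described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's exact protocol: the binary misled/not-misled encoding of human responses and the absolute-error scoring rule for model predictions are assumptions made for the example.

```python
def human_misleadingness_index(responses):
    """HMI sketch: fraction of human subjects (0.0-1.0) who gave the
    misled reading of a chart rather than the correct one.
    `responses` is a list of booleans, True meaning the subject was misled."""
    if not responses:
        raise ValueError("need at least one human response")
    return sum(1 for misled in responses if misled) / len(responses)

def score_mind_prediction(model_estimate, true_hmi):
    """Illustrative scoring rule: compare the model's predicted
    misleadingness (0.0-1.0) against the human-derived HMI as
    1 minus absolute error, so higher is better."""
    return 1.0 - abs(model_estimate - true_hmi)

# Example: 7 of 10 human subjects were misled by a manipulated chart.
hmi = human_misleadingness_index([True] * 7 + [False] * 3)  # 0.7
score = score_mind_prediction(0.6, hmi)                     # ~0.9
```

Under this framing, a MIND answer is scored by how closely it tracks the empirically measured human error rate, which is what distinguishes it from a FACT answer graded against the chart's true data.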
Implications and Future Directions
This research introduces significant implications for both theoretical and practical domains in AI. From a theoretical perspective, the paper invites a reevaluation of ToM paradigms in artificial intelligence, particularly those that incorporate multimodal inputs. Unlike prior benchmarks focused purely on textual logic, CHARTOM stresses the multimodal components of human cognition, acknowledging the qualitative discrepancies between human perception and factual information processing in AI systems.
On the practical side, adopting this benchmark could substantially improve AI applications deployed in domains prone to misinformation via visual data, such as journalism, medical publications, and public policy communications. By equipping AI with better predictive models of human misinterpretation, the field could make strides toward more trustworthy human-AI collaborations.
Speculations on AI Development
Given the rising capabilities of MLLMs, this benchmark anticipates further research on end-to-end multimodal processing systems. It posits that future developments in AI will incorporate sophisticated reasoning architectures that integrate visual and textual data streams into cohesive interpretative frameworks. By advancing visual ToM capabilities, AI systems could become pivotal tools in counteracting the spread of misinformation in an increasingly data-driven society.
The CHARTOM benchmark represents a critical advancement in the study of theory-of-mind for artificial intelligence, emphasizing the nuanced interplay between factual accuracy and human perception. The insights derived from this research could significantly enhance the development and deployment of AI systems in domains requiring a careful understanding of human cognition. As AI continues to permeate human decision-making processes, benchmarks such as CHARTOM serve as foundational steps toward aligning AI interpretation closely with human reasoning.