A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity (2302.04023v4)

Published 8 Feb 2023 in cs.CL and cs.AI

Abstract: This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code generation step. Moreover, we find that ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs and it generates more extrinsic hallucinations from its parametric memory as it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e, 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion. We also release codebase for evaluation set extraction.

Summary

The paper presents a comprehensive evaluation framework that benchmarks ChatGPT on 23 datasets spanning eight NLP tasks in multitask, multilingual, and multimodal settings.
It demonstrates ChatGPT's strengths in deductive reasoning and high-resource language performance, while noting limitations in inductive reasoning and low-resource language support.
The study highlights ChatGPT's interactive potential through iterative refinements in summarization, machine translation, and image generation tasks.

Overview of Multitask, Multilingual, and Multimodal Evaluation of ChatGPT

The paper, titled "A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity," presents a systematic framework for evaluating the capabilities of ChatGPT. The evaluation leverages 23 datasets encompassing eight diverse NLP tasks, namely question answering, reasoning, summarization, machine translation, sentiment analysis, language identification, task-oriented dialogue, and misinformation detection. Recognizing the absence of previous benchmarking results for ChatGPT, this paper provides a comprehensive third-party evaluation to assess the multitask, multilingual, and multimodal aspects of this model, alongside its reasoning abilities and interactive features.

Evaluation Framework

The framework tests three facets of ChatGPT's capabilities:

Multitask Evaluation: The evaluation spans multiple standard NLP tasks using datasets like CNN/DM, SAMSum, FLoRes-200, bAbI, EntailmentBank, and more. ChatGPT's performance is juxtaposed with prior state-of-the-art (SOTA) models in both fine-tuned and zero-shot settings.
Multilingual Evaluation: ChatGPT's understanding and generation abilities are tested across high-resource and low-resource languages. Sentiment analysis, language identification, and machine translation are used to probe these linguistic capabilities.
Multimodal Evaluation: The evaluation explores how ChatGPT handles text-to-image generation, where images are produced from textual descriptions via an intermediate code representation.
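
As an illustration of how a benchmark of this kind can be organized, the following is a minimal sketch of a zero-shot evaluation loop. It is not the paper's released codebase: the names `evaluate`, `exact_match`, and `toy_model`, as well as the sample prompts, are hypothetical, and the lookup-table "model" stands in for a real call to a chat model.

```python
from typing import Callable, Iterable, Tuple


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the normalized strings agree, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(
    model: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
    metric: Callable[[str, str], float] = exact_match,
) -> float:
    """Average a per-sample metric over (prompt, reference) pairs."""
    scores = [metric(model(prompt), reference) for prompt, reference in samples]
    if not scores:
        raise ValueError("no samples to evaluate")
    return sum(scores) / len(scores)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat model: a fixed lookup table."""
    canned = {
        "Sentiment of 'I love it'?": "positive",
        "Sentiment of 'I hate it'?": "negative",
    }
    return canned.get(prompt, "unknown")


# Made-up samples, used only to exercise the loop.
toy_samples = [
    ("Sentiment of 'I love it'?", "Positive"),
    ("Sentiment of 'I hate it'?", "negative"),
    ("Sentiment of 'It is fine'?", "neutral"),
]
score = evaluate(toy_model, toy_samples)
print(f"accuracy: {score:.3f}")
```

Passing the metric as a parameter lets the same loop serve classification-style tasks (exact match) and generation tasks (an overlap metric such as ROUGE or ChrF++), which is one simple way to cover several tasks with one harness.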

Numerical Results and Performance Insights

Multitask Performance:
- ChatGPT surpasses previous zero-shot LLMs on 9 out of 13 datasets and even exceeds certain fine-tuned task-specific models. However, limitations are noted in task-oriented and knowledge-grounded dialogue tasks.
- It handles summarization on datasets such as CNN/DM and SAMSum, although its performance varies relative to fine-tuned models such as BART.
Multilingual Capabilities:
- ChatGPT performs well in high-resource languages such as French and Chinese but struggles with low-resource languages like Javanese and Buginese.
- It displays better understanding than generation ability in non-Latin scripts, indicating a gap in equivalently handling tasks across diverse scripts.
Multimodal Abilities:
- The flag drawing task demonstrates ChatGPT's potential to convert text into visual code (e.g., SVG), though it highlights basic limitations in accurately depicting complex shapes and sizes.
- The quality of the generated images improves considerably with iterative refinement across multiple turns.
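
To make the idea of "text to visual code" concrete, here is a minimal sketch of the kind of SVG a model might emit when asked to draw a simple flag. The function `horizontal_tricolor_svg`, its default dimensions, and the colour values are assumptions chosen for illustration; they are not taken from the paper's flag drawing task.

```python
from typing import Sequence


def horizontal_tricolor_svg(
    colors: Sequence[str], width: int = 300, height: int = 180
) -> str:
    """Build an SVG string with three equal horizontal bands, top to bottom."""
    if len(colors) != 3:
        raise ValueError("exactly three colors are required")
    band = height / 3
    rects = [
        f'  <rect x="0" y="{band * i:g}" width="{width}" '
        f'height="{band:g}" fill="{color}"/>'
        for i, color in enumerate(colors)
    ]
    header = (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    )
    return "\n".join([header, *rects, "</svg>"])


# Illustrative colours only; any three CSS colour values would work.
svg = horizontal_tricolor_svg(["#000000", "#dd0000", "#ffce00"])
print(svg)
```

A plain tricolor reduces to three rectangles, which is why such flags are easy to express as code. Flags with emblems, stars, or curved shapes need precise coordinates and proportions, which is where the limitations noted above become visible.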

Reasoning and Hallucination Evaluations

Reasoning Skills:
- Detailed evaluations span deductive, inductive, abductive, and commonsense reasoning, using datasets like $\alpha$NLI, CommonsenseQA, and HotpotQA.
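- ChatGPT performs notably well in deductive reasoning but is less reliable in tasks demanding inductive and multi-hop reasoning. Its average accuracy across 10 reasoning categories is 63.41%, which the authors take as evidence that it is an unreliable reasoner.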
Hallucination and Factuality:
- ChatGPT hallucinates like other LLMs. Most of its hallucinations are extrinsic, meaning that they cannot be verified against the input; these include both factual and non-factual statements and appear in tasks such as machine translation and summarization.
- Evaluations with TruthfulQA reveal that ChatGPT sometimes replicates human falsehoods, emphasizing the need for improved factuality controls.
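
For the reasoning results above, a single average over categories can hide large differences between them. The sketch below shows two common ways of aggregating per-category accuracy. The counts are made up for illustration and are not the paper's results, and this summary does not state which form of averaging the authors used.

```python
from typing import Dict, Tuple

# category -> (number correct, number of samples); hypothetical counts.
Results = Dict[str, Tuple[int, int]]


def macro_average_accuracy(results: Results) -> float:
    """Average the per-category accuracies, weighting each category equally."""
    if not results:
        raise ValueError("no categories given")
    if any(total <= 0 for _, total in results.values()):
        raise ValueError("every category needs at least one sample")
    per_category = [correct / total for correct, total in results.values()]
    return sum(per_category) / len(per_category)


def micro_average_accuracy(results: Results) -> float:
    """Pool all samples, so larger categories carry more weight."""
    total = sum(n for _, n in results.values())
    if total <= 0:
        raise ValueError("no samples given")
    return sum(correct for correct, _ in results.values()) / total


toy_results: Results = {
    "deductive": (27, 30),
    "inductive": (15, 30),
    "abductive": (8, 10),
}
macro = macro_average_accuracy(toy_results)
micro = micro_average_accuracy(toy_results)
print(f"macro: {macro:.4f}  micro: {micro:.4f}")
```

When categories have different sizes the two averages differ, so a reported overall figure is easier to interpret alongside the per-category breakdown.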

Interactivity and Iterative Improvements

ChatGPT's interactive abilities are a significant differentiator from its predecessors:

In summarization tasks, iterative prompts help refine outputs to be more concise, with the paper reporting a gain of 8% ROUGE-1.
Machine translation benefits from multi-turn interactions, where post-editing helps correct and improve translations, especially for low-resource languages; the reported gain is 2% ChrF++.
Multimodal interactions allow for iterative refinements in image generation, akin to human interaction in creative tasks.
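
The summarization gains above are measured with ROUGE, which scores word overlap between a generated summary and a reference. The following is a simplified sketch of ROUGE-1 F1 using lowercased whitespace tokens; standard implementations typically add their own tokenization and optional stemming, so scores will not match them exactly. The sentences are invented to mimic a verbose first turn followed by a more concise second turn.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example: one reference and two successive model outputs.
reference = "the cat sat on the mat"
first_turn = "a cat was sitting on a mat in the house today"
second_turn = "the cat sat on a mat"

first_score = rouge1_f1(first_turn, reference)
second_score = rouge1_f1(second_turn, reference)
print(f"turn 1: {first_score:.3f}  turn 2: {second_score:.3f}")
```

Because precision penalizes extra words, asking for a shorter summary in a follow-up turn can raise the F1 score even when no new content is added.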

Implications and Future Directions

ChatGPT's evaluation highlights several practical and theoretical implications:

Practical: The model's utility in multilingual contexts and interactive applications is promising for real-world deployments. However, low-resource language support and iterative correction strategies both need further work.
Theoretical: Future research should address strengthening deductive and inductive reasoning, refining RLHF (Reinforcement Learning from Human Feedback) for improved factual accuracy, and developing robust mechanisms to mitigate hallucinations.

The framework proposed by the paper provides a broad benchmark for assessing the evolving capabilities of LLMs and can guide future developments in AI. The findings underscore the importance of diverse and iterative evaluation methods in understanding and advancing the performance of LLMs like ChatGPT.