
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

Published 16 Jan 2025 in cs.CL and cs.AI | arXiv:2501.09775v2

Abstract: One of the most widely used methods to evaluate LLMs is Multiple Choice Question (MCQ) tests. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale, as the results can be processed automatically. To help the LLM answer, a few examples, known as few shots, can be included in the prompt. Moreover, the LLM can be asked either to answer the question directly with the selected option or to first provide its reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of the model's confidence in that response. In this paper, we study how the LLM's confidence in its answer depends on whether the model has been asked to answer directly or to provide its reasoning before answering. The results of evaluating questions on a wide range of topics across seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior arises because the reasoning modifies the probability of the selected answer: the LLM predicts the answer based on the input question and the reasoning that supports the selection made. Therefore, LLM-estimated probabilities seem to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.

Summary

  • The paper shows that incorporating chain-of-thought reasoning significantly increases LLM confidence, even when answers are wrong.
  • The study conducts a comparative analysis using seven models tested on the MMLU dataset with both direct and reasoning-based prompts.
  • The findings underscore the risk of over-reliance on self-assessed confidence metrics, prompting a call for refined LLM evaluation frameworks.

Evaluation of Confidence in LLMs Using Multiple Choice Questions

The paper "Multiple Choice Questions: Reasoning Makes LLMs More Self-Confident Even When They Are Wrong" investigates the confidence dynamics of LLMs when subjected to multiple-choice questions (MCQs). This study is centered on understanding whether the confidence level of LLMs is influenced by the presence or absence of reasoning before providing an answer, which is commonly known as the "chain of thought" (CoT) approach.

Research Background

LLMs are frequently evaluated using MCQs because they allow standardized, scalable assessment across a wide range of topics. Including a few examples, or few-shot demonstrations, in the prompt can further refine the evaluation. Models can be prompted either to choose an answer directly or to first articulate a reasoning process, as in CoT prompting. The authors study the effect of reasoning on LLM confidence, measured through the probability the model assigns to the answer it selects.
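To make the confidence measurement concrete, the sketch below reads the probability an LLM assigns to its selected option via token log-probabilities. It assumes an OpenAI-compatible chat completions API that exposes log-probabilities; the model name, question, and prompt wording are illustrative, not the paper's exact setup.

```python
# Sketch: probability of the selected MCQ option in the "direct answer" setting.
# Assumes an OpenAI-compatible API that returns token log-probabilities.
import math
from openai import OpenAI

client = OpenAI()

question = (
    "Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer with a single letter."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder; the paper evaluates seven different models
    messages=[{"role": "user", "content": question}],
    max_tokens=1,          # force a single-letter answer (direct setting)
    logprobs=True,         # request token log-probabilities
)

first_token = resp.choices[0].logprobs.content[0]
answer = first_token.token.strip()
confidence = math.exp(first_token.logprob)  # probability of the chosen option token
print(f"answer={answer}  confidence={confidence:.3f}")
```

In the CoT setting, the same probability is read from the answer letter emitted after the generated reasoning, so the confidence is conditioned on both the question and the model's own explanation.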

Methodology

The paper's methodology compares responses generated by LLMs when prompted to answer directly and when encouraged to provide step-by-step reasoning first. Seven models were tested on the Massive Multitask Language Understanding (MMLU) benchmark, which spans a broad range of topics and domains. To ensure comprehensive coverage, a mixture of open-source and proprietary models of varying sizes from different developers (such as Meta, Mistral, Google, 01.AI, and OpenAI) was used.
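The two prompting regimes can be summarized with templates like the ones below. The wording is illustrative; the authors' exact prompts and few-shot formatting may differ.

```python
# Illustrative prompt templates for the two settings compared in the paper.

def format_options(options: list[str]) -> str:
    """Label options A, B, C, ... one per line."""
    return "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))

def direct_prompt(question: str, options: list[str]) -> str:
    """Direct setting: ask only for the letter of the selected option."""
    return (
        f"{question}\n{format_options(options)}\n"
        "Answer with the letter of the correct option only."
    )

def cot_prompt(question: str, options: list[str]) -> str:
    """Chain-of-thought setting: ask for reasoning first, then the letter."""
    return (
        f"{question}\n{format_options(options)}\n"
        "Think step by step, then give the letter of the correct option on the last line."
    )
```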

Key Findings

  1. Increased Confidence with Reasoning: LLMs generally exhibited higher confidence in their answers when reasoning was included, regardless of whether the answer was correct.
  2. Confidence in Incorrect Answers: Notably, the confidence boost was more pronounced when the models arrived at incorrect answers after providing reasoning, suggesting that reasoning can bolster a false sense of certainty (a minimal sketch of how such a comparison can be tabulated follows this list).
  3. Consistent Trends Across Models and Topics: The increase in confidence appeared across models and question topics, with subjects requiring substantial reasoning showing the most marked increases. At the same time, reasoning sometimes overrode the model's intuitive direct answer, occasionally leading to poorer accuracy in areas such as history or moral disputes.
  4. Consistency with Human Behavior: The paper draws parallels with human cognitive behavior, noting that, like humans, LLMs also demonstrate increased confidence after explaining their reasoning—an observation supported by prior psychological studies on human judgment.
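The comparison behind findings 1 and 2 can be tabulated by averaging the selected-option probability per (prompting mode, correctness) cell and comparing the direct and CoT rows. The records and numbers below are placeholders, not the paper's results.

```python
from statistics import mean

# Placeholder per-question records; in practice there is one per (question, model,
# prompting mode), with "confidence" being the probability of the selected option.
records = [
    {"mode": "direct", "correct": True,  "confidence": 0.90},
    {"mode": "direct", "correct": False, "confidence": 0.60},
    {"mode": "cot",    "correct": True,  "confidence": 0.97},
    {"mode": "cot",    "correct": False, "confidence": 0.85},
]

def mean_confidence(mode: str, correct: bool) -> float:
    """Average confidence for one (prompting mode, correctness) cell."""
    vals = [r["confidence"] for r in records if r["mode"] == mode and r["correct"] == correct]
    return mean(vals) if vals else float("nan")

for mode in ("direct", "cot"):
    for correct in (True, False):
        label = "correct" if correct else "wrong"
        print(f"{mode:>6} | {label:>7} | mean confidence = {mean_confidence(mode, correct):.3f}")
```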

Implications and Future Research

The study underlines the importance of using LLM-estimated confidence cautiously as a metric for model evaluation. Where reasoning inflates confidence without improving correctness, relying on such metrics becomes problematic. To address this, the paper calls for more granular studies to delineate when confidence metrics serve as reliable indicators of model performance.

There is also potential for this research to inform the development of more refined LLM evaluation frameworks that consider confidence not only as a measure of response probability but as an interaction of several cognitive-like processes within the models.

Moving forward, understanding how LLM behaviors align with human cognitive processes can inform work on AI interpretability and teach us more about the interplay between reasoning strategies and confidence across diverse AI applications. This could eventually lead to training approaches that balance reasoning and intuitive responses more effectively, thereby enhancing the accuracy and reliability of LLM-generated answers.
