Empirical Assessment of Commonsense Intelligence in LLMs
This paper presents an in-depth empirical examination of commonsense intelligence in LLMs, focusing on how closely these models align with human judgment across varied contexts. The authors argue against static benchmarks, which traditionally assess AI commonsense by comparing LLM outputs against a fixed set of human-prescribed labels and therefore fail to capture the diversity of perspectives people hold about what counts as common sense.
Methodology and Key Findings
The researchers propose a novel framework for evaluating AI commonsense that explicitly accounts for heterogeneity in human judgment. The framework uses large-scale human ratings to measure how faithfully LLMs reflect human consensus, quantified along two distinct paradigms:
- Individual-level commonsensicality: Each LLM is treated as an independent survey respondent and scored on two abilities: agreeing with the majority human judgment and predicting that consensus, without reliance on predefined ground-truth labels (a minimal scoring sketch follows this list). Results show substantial variance across models, with the most capable LLMs performing at a level comparable to roughly 64.5% of human respondents. Notably, models across scales, from smaller open-weight models such as Flan-T5-XXL to larger proprietary ones such as GPT-4, display varying degrees of commonsense competence, with no straightforward relationship to model size.
- Collective-level commonsensicality: LLMs are evaluated as simulators of human populations, prompted to generate hypothetical panels of 'silicon samples' whose judgments are compared against the human consensus on commonsensical statements (see the second sketch below). The results show only a modest correlation with human judgment (up to r=.43), well below the r=.60 baseline obtained from randomly split human samples.
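
To make the individual-level paradigm concrete, the sketch below scores a single respondent (human or LLM) on the two components described above: agreement with the panel's majority judgment and accuracy at predicting that majority. This is a minimal sketch under assumed conventions; the binary 0/1 rating encoding, the function name, and the geometric-mean combination of the two components are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def individual_commonsensicality(own_ratings, own_predictions, panel_ratings):
    """Score one respondent against a reference human panel.

    own_ratings:     (S,) 0/1 agree ratings from the respondent being scored
    own_predictions: (S,) 0/1 guesses of the majority answer for each statement
    panel_ratings:   (R, S) 0/1 ratings from R reference human respondents
    """
    # Majority human judgment per statement (ties broken toward "agree").
    majority = (panel_ratings.mean(axis=0) >= 0.5).astype(int)

    agreement = (own_ratings == majority).mean()      # shares the consensus?
    awareness = (own_predictions == majority).mean()  # anticipates the consensus?

    # Geometric mean: high only when the respondent both shares and predicts consensus.
    return float(np.sqrt(agreement * awareness))


# Toy usage: score an LLM "respondent" against a synthetic human panel.
rng = np.random.default_rng(0)
panel = rng.integers(0, 2, size=(100, 50))   # 100 humans x 50 statements
llm_ratings = rng.integers(0, 2, size=50)
llm_predictions = rng.integers(0, 2, size=50)
print(individual_commonsensicality(llm_ratings, llm_predictions, panel))
```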
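For the collective-level paradigm, a similarly hedged sketch: correlate per-statement agreement rates between the human panel and an LLM-generated 'silicon' sample, and compare against a split-half human baseline. The function name and the Pearson-correlation setup are assumptions about one reasonable way to compute r values of the kind reported, not the authors' code.

```python
import numpy as np

def collective_alignment(human_ratings, silicon_ratings, seed=0):
    """Compare population-level agreement rates.

    human_ratings:   (H, S) 0/1 ratings from H humans on S statements
    silicon_ratings: (N, S) 0/1 ratings from N LLM-simulated respondents
    Returns (r_silicon, r_split): correlation of the silicon sample with the
    human panel, and a random split-half human baseline for reference.
    """
    human_rates = human_ratings.mean(axis=0)
    silicon_rates = silicon_ratings.mean(axis=0)
    r_silicon = np.corrcoef(human_rates, silicon_rates)[0, 1]

    # Baseline: split the human panel in half and correlate the halves' rates.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(human_ratings.shape[0])
    half = len(perm) // 2
    r_split = np.corrcoef(
        human_ratings[perm[:half]].mean(axis=0),
        human_ratings[perm[half:]].mean(axis=0),
    )[0, 1]
    return float(r_silicon), float(r_split)
```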
Implications and Future Directions
This research contributes to AI evaluation by empirically grounding assessments in the behavioral patterns and variability observed in human cognition. The findings suggest that current LLMs, though powerful, do not yet capture the nuances of human commonsense reasoning across contexts. This misalignment may stem from the selective nature of the models' training data or from biases introduced during instruction fine-tuning intended to align models with human-preferred outputs.
For future research and development, the authors highlight two main avenues: (i) broadening the human sample to cover a wider range of demographics and thereby offset cultural or regional biases, and (ii) improving model representativeness through more nuanced, persona-driven simulations or through changes to training that mitigate overconfidence and distortion in the simulated ('silicon') populations.
In conclusion, by departing from conventional benchmark approaches, the paper points toward AI systems that are more personalized and socially aware, adapting to diverse human communities. This direction brings the technology closer to societal needs and to the varied ways people frame common sense, fostering AI that is more inclusive and more reflective of human experience.