Empirical Assessment of Commonsense Intelligence in LLMs
This paper presents an in-depth empirical examination of commonsense intelligence in LLMs, focusing on how closely these models align with human judgment across varied contexts. The authors argue against static benchmarks, which traditionally assess AI commonsense by comparing LLM outputs against a fixed set of human-prescribed labels and therefore fail to capture the diversity of perspectives people hold about what counts as common sense.
Methodology and Key Findings
The researchers propose a novel framework for evaluating AI commonsense that explicitly accounts for heterogeneity in human judgment. The framework uses large-scale human ratings to measure how faithfully LLMs reflect human consensus, quantified along two distinct paradigms:
- Individual-level commonsensicality: Each LLM is treated as an independent survey respondent and scored on two abilities: agreeing with the majority human judgment and predicting that consensus, without reliance on predefined ground-truth labels (a minimal scoring sketch follows this list). Results show substantial variance across models, with the most capable LLMs performing at a level comparable to roughly 64.5% of human respondents. Notably, models across scales, from smaller open-weight models such as Flan-T5-XXL to larger proprietary ones such as GPT-4, display varying degrees of commonsense competence, with no straightforward relationship to model size.
- Collective-level commonsensicality: LLMs are evaluated as simulators of human populations, prompted to generate hypothetical panels of 'silicon samples' whose judgments are compared against the human consensus on commonsensical statements (see the second sketch below). The results show only a modest correlation with human judgment (up to r=.43), well below the r=.60 baseline obtained from randomly split human samples.
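
To make the individual-level paradigm concrete, the sketch below scores a single respondent (human or LLM) on the two components described above: agreement with the panel's majority judgment and accuracy at predicting that majority. This is a minimal sketch under assumed conventions; the binary 0/1 rating encoding, the function name, and the geometric-mean combination of the two components are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def individual_commonsensicality(own_ratings, own_predictions, panel_ratings):
    """Score one respondent against a reference human panel.

    own_ratings:     (S,) 0/1 agree ratings from the respondent being scored
    own_predictions: (S,) 0/1 guesses of the majority answer for each statement
    panel_ratings:   (R, S) 0/1 ratings from R reference human respondents
    """
    # Majority human judgment per statement (ties broken toward "agree").
    majority = (panel_ratings.mean(axis=0) >= 0.5).astype(int)

    agreement = (own_ratings == majority).mean()      # shares the consensus?
    awareness = (own_predictions == majority).mean()  # anticipates the consensus?

    # Geometric mean: high only when the respondent both shares and predicts consensus.
    return float(np.sqrt(agreement * awareness))


# Toy usage: score an LLM "respondent" against a synthetic human panel.
rng = np.random.default_rng(0)
panel = rng.integers(0, 2, size=(100, 50))   # 100 humans x 50 statements
llm_ratings = rng.integers(0, 2, size=50)
llm_predictions = rng.integers(0, 2, size=50)
print(individual_commonsensicality(llm_ratings, llm_predictions, panel))
```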
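For the collective-level paradigm, a similarly hedged sketch: correlate per-statement agreement rates between the human panel and an LLM-generated 'silicon' sample, and compare against a split-half human baseline. The function name and the Pearson-correlation setup are assumptions about one reasonable way to compute r values of the kind reported, not the authors' code.

```python
import numpy as np

def collective_alignment(human_ratings, silicon_ratings, seed=0):
    """Compare population-level agreement rates.

    human_ratings:   (H, S) 0/1 ratings from H humans on S statements
    silicon_ratings: (N, S) 0/1 ratings from N LLM-simulated respondents
    Returns (r_silicon, r_split): correlation of the silicon sample with the
    human panel, and a random split-half human baseline for reference.
    """
    human_rates = human_ratings.mean(axis=0)
    silicon_rates = silicon_ratings.mean(axis=0)
    r_silicon = np.corrcoef(human_rates, silicon_rates)[0, 1]

    # Baseline: split the human panel in half and correlate the halves' rates.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(human_ratings.shape[0])
    half = len(perm) // 2
    r_split = np.corrcoef(
        human_ratings[perm[:half]].mean(axis=0),
        human_ratings[perm[half:]].mean(axis=0),
    )[0, 1]
    return float(r_silicon), float(r_split)
```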
Implications and Future Directions
This research contributes to AI evaluation by empirically grounding assessments in the behavioral patterns and variability observed in human cognition. The findings suggest that current LLMs, though powerful, do not yet capture the nuances of human commonsense reasoning across contexts. This misalignment may stem from the selective nature of the models' training data or from biases introduced during instruction fine-tuning intended to align models with human-preferred outputs.
For future research and development, the authors highlight two main avenues: (i) broadening the human sample to cover a wider range of demographics and thereby offset cultural or regional biases, and (ii) improving model representativeness through more nuanced, persona-driven simulations or through changes to training that mitigate overconfidence and distortion in the simulated ('silicon') populations.
In conclusion, by departing from conventional benchmark approaches, the paper points toward AI systems that are more personalized and socially aware, adapting to diverse human communities. This direction brings the technology closer to societal needs and to the varied ways people frame common sense, fostering AI that is more inclusive and more reflective of human experience.