What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning (2409.17172v1)

Published 19 Sep 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown. We propose a novel evaluation framework that evaluates this capability. This framework prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person when facing the statement for the first time. We score the qualities of the generated questions, thereby evaluating the knowledge acquisition potential of the LLM. We apply controlled ablation studies to validate our scoring procedures. Additionally, we created a synthetic dataset consisting of 1101 statements in physics, chemistry, and maths with distinct levels of difficulty, 300 general knowledge statements, and 567 incorrect statements. Human evaluations were conducted to validate our model assessments, achieving an approximate weighted Cohen's kappa of 0.7 on all three metrics considered. We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model's knowledge acquisition potential. The proposed framework quantifies a critical model capability that was commonly overlooked and opens up research opportunities for developing more knowledgeable AI systems.

Summary

  • The paper introduces a novel curiosity-driven question generation framework that evaluates LLMs using relevance, coherence, and diversity metrics.
  • Experimental results on a diverse dataset reveal that smaller models like Phi-2 can sometimes match or exceed larger models such as GPT-4 in key performance areas.
  • Controlled ablation studies and human evaluations, which reached a weighted Cohen’s kappa of approximately 0.7, validate the scoring procedure and underscore the framework’s potential for advancing AI knowledge acquisition.

Evaluating LLMs on Curiosity-Driven Questioning

The paper "What Would You Ask When You First Saw a2+b2=c2a^2 + b^2 = c^2? Evaluating LLM on Curiosity-Driven Questioning" explores an innovative approach to assessing the knowledge acquisition capabilities of LLMs. Unlike conventional evaluation methods that focus on answering questions or following instructions, this paper introduces a new framework based on curiosity-driven question generation (CDQG). The framework is designed to mimic the human process of learning by asking questions, a fundamental cognitive skill.

Framework and Methodology

The proposed evaluation framework assesses LLMs by prompting them to generate questions about a given statement, simulating the curiosity of a person encountering new information. The quality of these questions is scored based on three key metrics: relevance, coherence, and diversity. Relevance measures how directly the questions pertain to the statement, coherence evaluates the logical flow between questions, and diversity assesses the range of topics covered by the questions.
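The mechanics of this generate-then-score loop can be illustrated with a short sketch. The example below assumes the OpenAI Python client, GPT-4 acting as both question generator and judge, and illustrative prompt wording with a 1–5 rating scale; these choices are assumptions for illustration, not the paper's exact prompts or rubric.

```python
# Sketch of a curiosity-driven question generation (CDQG) loop.
# Assumptions: OpenAI Python client, GPT-4 as generator and judge,
# illustrative prompt wording -- not the paper's exact prompts.
from openai import OpenAI

client = OpenAI()

def generate_questions(statement: str, n: int = 5) -> str:
    """Ask the model to act as a curious learner seeing the statement for the first time."""
    prompt = (
        f"You are seeing the following statement for the first time:\n\n{statement}\n\n"
        f"Ask {n} questions a curious person would ask to understand it better."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def judge(statement: str, questions: str, metric: str) -> str:
    """Score one metric (relevance, coherence, or diversity) on a 1-5 scale via an LLM judge."""
    prompt = (
        f"Statement:\n{statement}\n\nQuestions:\n{questions}\n\n"
        f"Rate the {metric} of these questions from 1 to 5 and reply with the number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

statement = "In a right triangle, a^2 + b^2 = c^2."
questions = generate_questions(statement)
scores = {m: judge(statement, questions, m) for m in ("relevance", "coherence", "diversity")}
print(questions)
print(scores)
```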

To validate the scoring procedures, controlled ablation studies were employed. The dataset used for evaluation consists of 1,101 physics, chemistry, and mathematics statements spanning various difficulty levels, plus 300 general knowledge statements and 567 deliberately incorrect statements designed to test the models' critical inquiry capabilities. To ensure robustness, the scores assigned by LLMs were compared with human evaluations, achieving a weighted Cohen’s kappa of approximately 0.7.
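The agreement check amounts to computing a weighted kappa over paired ordinal ratings. A minimal sketch, assuming hypothetical 1–5 ratings from one human annotator and the model judge, and scikit-learn's quadratic weighting (the paper's exact weighting scheme may differ):

```python
# Agreement between human and model-assigned scores on one metric.
# The ratings below are hypothetical; quadratic weighting is an
# illustrative choice, not necessarily the paper's exact scheme.
from sklearn.metrics import cohen_kappa_score

human_scores = [5, 4, 4, 3, 5, 2, 4, 3]   # human ratings (1-5)
model_scores = [5, 4, 3, 3, 5, 2, 4, 4]   # LLM-judge ratings (1-5)

kappa = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```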

Key Findings

The paper evaluated several models, including GPT-4, Mistral 8x7b, and Phi-2 2.7b. The results revealed that while larger models like GPT-4 excelled in generating coherent and relevant questions, the smaller Phi-2 model demonstrated comparable or even superior performance in some cases. This indicates that model size alone does not necessarily determine knowledge acquisition potential.

Performance Highlights

  1. GPT-4: Generally led in most metrics and subjects, particularly in advanced tasks, displaying superior knowledge and reasoning capabilities.
  2. Mistral 8x7b: Showed strong performance, especially in Chemistry and Mathematics, often matching or exceeding GPT-4 in specific areas.
  3. Phi-2 2.7b: Despite its smaller size, Phi-2 performed exceptionally well in generating relevant and coherent questions, challenging the notion that larger models are inherently superior.
  4. Llama2 Models: Consistently scored lower, indicating potential limitations in training or architecture for CDQG tasks.
  5. Gemini: Although generally scoring lower, it showed exceptional diversity, likely due to its multi-modal training approach.

Practical and Theoretical Implications

The findings have significant implications for the development of more knowledgeable AI systems. The proposed CDQG framework addresses a previously overlooked capability of LLMs: their potential for novel knowledge discovery through inquiry. By quantifying this capability, the framework opens avenues for developing LLMs that not only store information but also actively seek new knowledge.

Future Developments in AI

Future research could explore expanding this framework to assess other cognitive abilities like reasoning depth, creativity, and factual accuracy. Furthermore, integrating metrics that capture human-centric qualities, such as emotional intelligence and adaptability, could provide a more comprehensive understanding of LLMs' cognitive processes. Additionally, there's potential for models that not only generate questions but also critically reflect on their outputs, paving the way for self-improving AI systems.

Conclusion

The paper introduces a novel evaluation framework for LLMs focused on curiosity-driven question generation. It demonstrates that while larger models like GPT-4 excel at generating high-quality questions, smaller models such as Phi-2 also show remarkable potential. This research highlights the importance of considering multiple dimensions, including relevance, coherence, and diversity, when evaluating AI's ability to simulate human-like inquiry. The proposed framework serves as a foundational step towards developing more inquisitive and knowledgeable AI systems that can autonomously drive innovation and intellectual growth.