- The paper introduces a novel curiosity-driven question generation framework that evaluates LLMs using relevance, coherence, and diversity metrics.
- Experimental results on a diverse dataset reveal that smaller models like Phi-2 can sometimes match or exceed larger models such as GPT-4 in key performance areas.
- Controlled ablation studies and strong LLM-human scoring agreement (a weighted Cohen's kappa of approximately 0.7) underscore the framework's potential for advancing AI knowledge acquisition.
Evaluating LLMs on Curiosity-Driven Questioning
The paper "What Would You Ask When You First Saw a2+b2=c2? Evaluating LLM on Curiosity-Driven Questioning" explores an innovative approach to assessing the knowledge acquisition capabilities of LLMs. Unlike conventional evaluation methods that focus on answering questions or following instructions, this paper introduces a new framework based on curiosity-driven question generation (CDQG). The framework is designed to mimic the human process of learning by asking questions, a fundamental cognitive skill.
Framework and Methodology
The proposed evaluation framework assesses LLMs by prompting them to generate questions about a given statement, simulating the curiosity of a person encountering new information. The quality of these questions is scored based on three key metrics: relevance, coherence, and diversity. Relevance measures how directly the questions pertain to the statement, coherence evaluates the logical flow between questions, and diversity assesses the range of topics covered by the questions.
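The paper's exact prompts and scoring rubric are not reproduced in this summary, but the evaluation loop can be sketched roughly as follows. The helper names (`generate_questions`, `score_questions`), the 1-10 rating scale, and the generic `llm`/`judge` callables are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a CDQG-style evaluation loop (illustrative only).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class QuestionScores:
    relevance: float   # how directly the questions pertain to the statement
    coherence: float   # logical flow between consecutive questions
    diversity: float   # breadth of topics the questions cover

def generate_questions(llm: Callable[[str], str], statement: str, n: int = 5) -> List[str]:
    """Prompt the model as if it were seeing the statement for the first time."""
    prompt = (
        "Imagine you are encountering this statement for the first time:\n"
        f'"{statement}"\n'
        f"Ask {n} curious questions you would want answered, one per line."
    )
    reply = llm(prompt)
    # Assume the model returns one question per line.
    return [line.strip() for line in reply.splitlines() if line.strip()][:n]

def score_questions(judge: Callable[[str], str], statement: str,
                    questions: List[str]) -> QuestionScores:
    """Ask a judge model to rate the question set on the three metrics (1-10 assumed)."""
    rubric = (
        "Rate the following questions about the statement on relevance, coherence, "
        "and diversity, each from 1 to 10. Answer with three numbers separated by spaces.\n"
        f"Statement: {statement}\nQuestions:\n" + "\n".join(questions)
    )
    rel, coh, div = [float(x) for x in judge(rubric).split()[:3]]
    return QuestionScores(rel, coh, div)
```

Any chat-capable model can be plugged in as the `llm` callable, with a stronger model serving as the `judge`; this mirrors the general idea of scoring generated questions along the three metrics described above.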
To validate the scoring procedures, controlled ablation studies were employed. The evaluation dataset comprises 1,101 statements in physics, chemistry, and mathematics spanning various difficulty levels, plus 300 general-knowledge statements and 567 deliberately incorrect statements designed to test the models' capacity for critical inquiry. To ensure robustness, the scores assigned by the LLMs were compared with human evaluations, yielding a weighted Cohen's kappa of approximately 0.7.
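For a sense of how such an agreement figure is computed, the sketch below calculates a weighted Cohen's kappa with scikit-learn. The quadratic weighting and the example ratings are assumptions made for illustration; only the roughly 0.7 agreement level comes from the paper.

```python
# Illustrative agreement check between LLM-assigned and human-assigned scores.
# The quadratic weighting and the example ratings below are assumptions for
# demonstration purposes; they are not the paper's actual data.
from sklearn.metrics import cohen_kappa_score

# Hypothetical integer ratings (1-10) given to the same question sets.
human_scores = [8, 6, 9, 7, 5, 8, 7, 6]
llm_scores   = [8, 7, 9, 6, 5, 7, 7, 6]

kappa = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```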
Key Findings
The paper evaluated several models, including GPT-4, Mistral 8x7b, and Phi-2 2.7b. The results revealed that while larger models like GPT-4 excelled in generating coherent and relevant questions, the smaller Phi-2 model demonstrated comparable or even superior performance in some cases. This indicates that model size alone does not necessarily determine knowledge acquisition potential.
- GPT-4: Generally led in most metrics and subjects, particularly in advanced tasks, displaying superior knowledge and reasoning capabilities.
- Mistral 8x7b: Showed strong performance, especially in chemistry and mathematics, often matching or exceeding GPT-4 in specific areas.
- Phi-2 2.7b: Despite its smaller size, Phi-2 performed exceptionally well in generating relevant and coherent questions, challenging the notion that larger models are inherently superior.
- Llama2 Models: Consistently scored lower, indicating potential limitations in training or architecture for CDQG tasks.
- Gemini: Although generally scoring lower, it showed exceptional diversity, likely due to its multi-modal training approach.
Practical and Theoretical Implications
The findings have significant implications for the development of more knowledgeable AI systems. The proposed CDQG framework addresses a previously overlooked capability of LLMs: their potential for novel knowledge discovery through inquiry. By quantifying this capability, the framework opens avenues for developing LLMs that not only store information but also actively seek new knowledge.
Future Developments in AI
Future research could explore expanding this framework to assess other cognitive abilities like reasoning depth, creativity, and factual accuracy. Furthermore, integrating metrics that capture human-centric qualities, such as emotional intelligence and adaptability, could provide a more comprehensive understanding of LLMs' cognitive processes. Additionally, there's potential for models that not only generate questions but also critically reflect on their outputs, paving the way for self-improving AI systems.
Conclusion
The paper introduces a novel evaluation framework for LLMs focused on curiosity-driven question generation, and demonstrates that while larger models like GPT-4 excel at generating high-quality questions, smaller models such as Phi-2 also show remarkable potential. This research highlights the importance of considering multiple dimensions, including relevance, coherence, and diversity, when evaluating AI's ability to simulate human-like inquiry. The proposed framework serves as a foundational step towards developing more inquisitive and knowledgeable AI systems that can autonomously drive innovation and intellectual growth.