Open Problems in Building and Evaluating LLM Research Agents: Self-Evaluation and Diversity
Characterize and address the open problems in building and evaluating large language model-based research agents, specifically the unreliability of LLM self-evaluation and the lack of diversity in generated ideas, which together limit inference-time scaling and reliable automatic assessment.
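One minimal way to see why limited diversity caps inference-time scaling is to deduplicate sampled ideas by semantic similarity and track how many distinct ideas survive as more candidates are generated. The sketch below is illustrative only: it assumes a sentence-transformers embedding model, an arbitrary 0.8 cosine-similarity threshold, and a greedy deduplication rule, not the paper's exact protocol.

```python
# Hypothetical sketch: quantify duplication among LLM-generated ideas to see
# how quickly "effective" diversity saturates as more ideas are sampled.
# Assumes sentence-transformers is installed; the model name, the 0.8
# threshold, and the greedy dedup rule are illustrative choices.
from sentence_transformers import SentenceTransformer, util


def count_unique_ideas(ideas: list[str], threshold: float = 0.8) -> int:
    """Greedily keep an idea only if its cosine similarity to every
    previously kept idea is below `threshold`."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(
        ideas, convert_to_tensor=True, normalize_embeddings=True
    )
    kept = []
    for i in range(len(ideas)):
        if all(
            util.cos_sim(embeddings[i], embeddings[j]).item() < threshold
            for j in kept
        ):
            kept.append(i)
    return len(kept)


if __name__ == "__main__":
    ideas = [
        "Use retrieval-augmented prompting to reduce hallucination in QA.",
        "Retrieve supporting documents before answering to cut hallucinations.",
        "Train a small verifier model to filter unfaithful chain-of-thought steps.",
    ]
    # If the unique count plateaus while the number of sampled ideas grows,
    # further inference-time scaling yields diminishing returns.
    print(count_unique_ideas(ideas))
```

If the number of surviving ideas plateaus while the sample budget grows, over-generating and reranking candidates stops paying off, which is the scaling failure the diversity problem points to.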
References
Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation.
— Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
(2409.04109 - Si et al., 6 Sep 2024) in Abstract