
Open Problems in Building and Evaluating LLM Research Agents: Self-Evaluation and Diversity

Characterize and address the open problems in building and evaluating large language model (LLM)-based research agents, specifically the failures of LLM self-evaluation and the lack of diversity in generated ideas, both of which limit inference-time scaling and reliable automatic assessment.


Background

The authors over-generate LLM ideas and rank them, and find two major limitations: idea-generation diversity plateaus quickly (only about 200 unique ideas remain out of 4,000 seed ideas per topic after deduplication), and LLM-as-a-judge evaluation agrees poorly with expert reviewers (e.g., pairwise rankers reach roughly 53.3% accuracy, well below human-human review consistency).
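
The diversity plateau can be made concrete by tracking how many ideas survive deduplication as generation scales. Below is a minimal sketch, not the authors' exact pipeline: it assumes embedding-based greedy deduplication, and the encoder name ("all-MiniLM-L6-v2") and the 0.8 cosine-similarity threshold are illustrative placeholders rather than values from the paper.

```python
# Sketch: measure how the count of unique ideas grows (and plateaus) with the
# number of generated seed ideas, using greedy embedding-based deduplication.
# Assumptions: encoder and similarity threshold are illustrative, not from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer


def diversity_curve(ideas, sim_threshold=0.8, step=500):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Normalized embeddings make the dot product equal to cosine similarity.
    embs = model.encode(ideas, normalize_embeddings=True)
    kept = []   # indices of ideas retained as unique so far
    curve = []  # (num_generated, num_unique) points
    for i, emb in enumerate(embs):
        # Keep an idea only if it is not too similar to any already-kept idea.
        if all(float(np.dot(emb, embs[j])) < sim_threshold for j in kept):
            kept.append(i)
        if (i + 1) % step == 0:
            curve.append((i + 1, len(kept)))
    return curve
```

A flat tail in this curve (e.g., on the order of 200 unique ideas out of 4,000 seeds) indicates that further over-generation adds few genuinely new ideas, which is the plateau the authors report.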

These issues undermine both inference-time scaling via over-generation and automatic evaluation via LLM judges, motivating focused work to diagnose, measure, and mitigate self-evaluation failures and diversity shortcomings in research agents.
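
One way to measure the self-evaluation failure is to score a pairwise LLM ranker against expert preferences. The sketch below is illustrative rather than the paper's protocol: `judge_fn` stands in for any LLM call that returns "A" or "B", and the expert-labeled pairs are assumed to come from human reviews.

```python
# Sketch: quantify agreement between an LLM pairwise ranker and expert reviewers.
# `judge_fn` is a hypothetical stand-in for an LLM judge call returning "A" or "B".
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pair:
    idea_a: str
    idea_b: str
    expert_prefers_a: bool  # ground-truth preference from expert reviews


def pairwise_judge_accuracy(pairs: List[Pair], judge_fn: Callable[[str, str], str]) -> float:
    """Fraction of pairs where the LLM judge matches the expert preference."""
    correct = 0
    for p in pairs:
        # Query in both presentation orders and count a preference only if the
        # judge is consistent, which screens out position-biased answers.
        first = judge_fn(p.idea_a, p.idea_b)
        second = judge_fn(p.idea_b, p.idea_a)
        prefers_a = first == "A" and second == "B"
        prefers_b = first == "B" and second == "A"
        if (prefers_a and p.expert_prefers_a) or (prefers_b and not p.expert_prefers_a):
            correct += 1
    return correct / len(pairs)
```

Accuracy near chance level (around 50%), compared with the higher agreement typically observed between two human reviewers, is the kind of signal that marks the LLM judge as an unreliable substitute for expert evaluation.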

References

Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation.