Evaluation of LLMs in Complex Search Space Optimization with OPT-BENCH
The paper "OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems" presents a comprehensive framework for evaluating LLMs in the context of optimization problems that demand iterative reasoning and the ability to learn from historical feedback. Named OPT-BENCH, this benchmark consists of 30 challenging tasks, split into 20 ML tasks from Kaggle competitions and 10 classical NP-complete problems. The paper develops an evaluation framework, OPT-Agent, which rigorously tests the model’s capacity to improve solutions iteratively, akin to human cognitive processes of learning from past results.
Overview of OPT-BENCH and OPT-Agent
OPT-BENCH is designed to assess the capabilities of LLMs across a wide spectrum of optimization challenges. By including both ML tasks, such as house price prediction and sentiment analysis, and NP problems such as graph coloring and the traveling salesman problem (TSP), the benchmark provides a rigorous and diverse evaluation environment. Each task is used to measure a model's ability to iteratively refine solutions rather than produce a one-shot answer.
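To make the task structure concrete, the following is a hypothetical sketch of how such a benchmark task might be represented. The field names and schema are illustrative assumptions, not OPT-BENCH's actual data format.

```python
from dataclasses import dataclass
from typing import Callable, Literal

# Illustrative only: a hypothetical task record, not OPT-BENCH's actual schema.
@dataclass
class OptTask:
    name: str                          # e.g. "house-price-prediction" or "graph-coloring"
    kind: Literal["ml", "np"]          # Kaggle-style ML task vs. classical NP problem
    description: str                   # natural-language task statement given to the LLM
    evaluate: Callable[[str], float]   # scores a candidate solution on the task's metric
    validate: Callable[[str], bool]    # checks that a candidate solution is well-formed / feasible
```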
OPT-Agent is introduced as a framework that mirrors human iterative reasoning. It supports three key operations: drafting an initial solution, improving solutions by leveraging historical context, and debugging invalid solutions. This design emphasizes the use of historical information for solution refinement, enabling a holistic assessment of LLMs' optimization abilities; a minimal sketch of the loop follows.
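The sketch below assumes a higher-is-better score, and all object and method names (`llm.draft`, `llm.improve`, `llm.debug`, `task.validate`, `task.evaluate`) are illustrative rather than the paper's actual API.

```python
def optimize(task, llm, max_steps=10):
    """Minimal sketch of an OPT-Agent-style draft/improve/debug loop; names are illustrative."""
    history = []                                   # (solution, feedback) pairs reused as context
    best, best_score = None, float("-inf")         # assumes a higher score is better
    solution = llm.draft(task.description)         # drafting: produce an initial candidate
    for _ in range(max_steps):
        if not task.validate(solution):            # debugging: repair an invalid candidate
            history.append((solution, "invalid solution"))
            solution = llm.debug(task.description, solution, history)
            continue
        score = task.evaluate(solution)
        if score > best_score:
            best, best_score = solution, score
        history.append((solution, f"score={score:.4f}"))
        solution = llm.improve(task.description, history)   # improving: refine using the full history
    return best, best_score
```

The key design point is that every candidate and its feedback are appended to the history, so later drafting steps condition on the full trajectory rather than only the most recent attempt.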
Experimental Setup and Results
The evaluation covers nine LLMs spanning several proprietary and open-source model families. Extensive experiments investigate how historical information affects iterative optimization, varying factors such as the number of optimization steps, temperature settings, and model architecture.
On the ML tasks, historical context consistently improved optimization performance for most models. Improvements were quantified with metrics such as Win Count, Improvement Rate (IR), and Buggy Rate. The proprietary model o3-mini, for instance, improved markedly at every step budget when given historical feedback, highlighting the role of context in speeding convergence. Open-source models such as Qwen2.5-72B-Instruct, by contrast, produced buggy solutions more often, pointing to clear room for improvement.
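One plausible way to compute such run-level metrics is sketched below; the paper's exact definitions of Win Count, IR, and Buggy Rate may differ, and the `runs` record layout is a hypothetical assumption.

```python
def summarize_runs(runs):
    """Hedged sketch of run-level metrics. `runs` is a hypothetical list of dicts with keys
    'draft_score', 'final_score', 'buggy_steps', and 'total_steps'; the paper's exact
    definitions of Win Count, Improvement Rate, and Buggy Rate may differ."""
    buggy_rate = sum(r["buggy_steps"] for r in runs) / sum(r["total_steps"] for r in runs)
    rel_gains = [
        (r["final_score"] - r["draft_score"]) / abs(r["draft_score"])
        for r in runs if r["draft_score"] != 0
    ]
    improvement_rate = sum(rel_gains) / len(rel_gains) if rel_gains else 0.0
    win_count = sum(r["final_score"] > r["draft_score"] for r in runs)  # runs where the final beats the draft
    return {"buggy_rate": buggy_rate, "improvement_rate": improvement_rate, "win_count": win_count}
```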
On the NP tasks, incorporating historical information also improved performance, albeit with limitations. Models such as gpt-4o-2024-08-06 and gemini-2.0-flash produced better solutions with fewer buggy outputs as iterations increased, while open-source models lagged behind. These results underscore how hard it remains for LLMs to apply iterative learning to combinatorial problems, where even producing valid solutions is difficult.
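To illustrate what validity means on an NP task, here is a small, self-contained feasibility check for graph coloring; it is a generic example rather than OPT-BENCH's own checker.

```python
def is_valid_coloring(edges, coloring, num_colors):
    """Checks a graph-coloring candidate: every vertex gets one of `num_colors` colors
    and no edge joins two vertices of the same color. Illustrates the kind of
    feasibility test an NP task can run on an LLM-produced solution."""
    if any(c < 0 or c >= num_colors for c in coloring.values()):
        return False
    return all(coloring[u] != coloring[v] for u, v in edges)

# Example: a triangle needs three colors.
triangle = [(0, 1), (1, 2), (0, 2)]
assert not is_valid_coloring(triangle, {0: 0, 1: 1, 2: 0}, 2)
assert is_valid_coloring(triangle, {0: 0, 1: 1, 2: 2}, 3)
```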
Implications and Future Prospects
The development of OPT-BENCH and OPT-Agent offers significant insight into how LLMs can be harnessed for problems with large search spaces. Practically, the benchmark can help advance optimization in real-world applications such as decision-making systems and automated reasoning. Its emphasis on iterative refinement and context integration highlights themes that could guide future LLM development strategies centered on exploiting model feedback.
From a theoretical standpoint, the findings underline the need for better ways of applying historical context, especially on NP tasks where exploration of the solution space remains inefficient. Future research could investigate more sophisticated mechanisms for context utilization, for example reinforcement learning or advanced memory architectures, to address the limitations of current open-source models.
In conclusion, OPT-BENCH emerges as a robust tool for pushing the boundaries of LLM capabilities on optimization problems, and it lays a solid foundation for further research on dynamic, ongoing solution improvement. Experimentation and model refinement built on this benchmark could pave the way for more adaptive, context-aware AI systems capable of tackling complex real-world applications.