OPT-BENCH: LLM Optimization Benchmark

Updated 30 July 2025
  • OPT-BENCH is a large-scale benchmarking suite that evaluates LLMs' ability to solve complex optimization tasks in both machine learning and NP problem domains.
  • It employs a modular OPT-Agent architecture and an iterative workflow—drafting, improving, and debugging—with historical feedback to refine solutions.
  • The benchmark uses strict quantitative metrics such as Improvement Rate and Buggy Rate to assess performance, robustness, and solution quality systematically.

OPT-BENCH is a large-scale benchmarking suite constructed to evaluate the ability of LLMs to solve complex optimization problems in extensive search spaces through iterative reasoning and solution refinement. It is distinguished by its diverse problem coverage—including real-world ML tasks and classical NP problems—and its explicit incorporation of iterative optimization workflows that leverage historical feedback. The OPT-BENCH framework introduces rigorous quantitative metrics and a modular agent architecture (OPT-Agent) to emulate human-like incremental problem-solving, and provides an open-source platform for reproducible evaluation and further research (Li et al., 12 Jun 2025).

1. Benchmark Composition and Problem Types

OPT-BENCH comprises a curated set of 30 optimization tasks drawn from two domains:

  • Machine Learning Tasks (20 total): These involve canonical tasks from Kaggle competitions, including regression, classification, and time-series forecasting. Each task is defined by a dataset, an explicit submission format (typically Python script-based solutions), and clear evaluation metrics such as mean squared error (MSE) or classification accuracy.
  • Classical NP Problems (10 total): This subset targets combinatorial and discrete optimization scenarios, including graph coloring, Hamiltonian cycle, knapsack, and related classes. Each NP task is specified by an instance definition, accepted solution format, and validation rules.

This dual focus enables comprehensive assessment of both continuous and discrete optimization abilities in LLMs across varied domains.
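
To make the NP-task specification concrete, the following is a minimal sketch of an instance definition paired with validation rules for graph coloring; the function name and data formats are illustrative assumptions, not OPT-BENCH's actual interfaces.

```python
# Hypothetical sketch: an NP task pairs an instance definition with
# validation rules. Names/formats are illustrative, not OPT-BENCH's API.

def validate_coloring(edges: list[tuple[int, int]],
                      num_colors: int,
                      coloring: dict[int, int]) -> tuple[bool, str]:
    """Check a proposed coloring against the instance's rules."""
    for node, color in coloring.items():
        if not 0 <= color < num_colors:
            return False, f"node {node} uses out-of-range color {color}"
    for u, v in edges:
        if u not in coloring or v not in coloring:
            return False, f"edge ({u}, {v}) has an uncolored endpoint"
        if coloring[u] == coloring[v]:
            return False, f"edge ({u}, {v}) joins same-colored nodes"
    return True, "valid coloring"

# Example instance: a triangle requires three colors.
edges = [(0, 1), (1, 2), (0, 2)]
print(validate_coloring(edges, 3, {0: 0, 1: 1, 2: 2}))  # (True, 'valid coloring')
```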

2. Optimization Workflow and OPT-Agent Architecture

OPT-BENCH evaluates LLMs in an end-to-end optimization loop that mimics human iterative reasoning. The central component is the OPT-Agent framework, which operates in three linked stages for each task:

  1. Drafting: The agent generates an initial candidate solution—such as a Python script for ML problems or a structured direct answer for NP problems.
  2. Improving: Given feedback (including historical solutions, error logs, and validation summaries), the agent refines its most recent solution. This can involve adjusting model architectures, tuning hyperparameters, and modifying code logic for ML tasks, or altering discrete parameters for NP tasks.
  3. Debugging: When the validation stage finds an error—such as invalid outputs or rule violations—the agent is prompted to analyze the error and correct its implementation.

Crucially, at each iteration, the workflow incorporates historical context: previous drafts, feedback, and error traces. This enables the agent to learn from its prior attempts, adaptively fine-tune its approach, and incrementally improve solution quality.
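
A minimal sketch of this loop appears below; the `llm` callable, prompt strings, and history record are illustrative assumptions rather than the actual OPT-Agent implementation.

```python
# Sketch of a draft/improve/debug loop with historical feedback.
# `llm` is any text-in/text-out callable; the prompts and history
# record are hypothetical stand-ins for OPT-Agent's interfaces.
from typing import Callable

def optimize(llm: Callable[[str], str],
             task: str,
             validate: Callable[[str], tuple[bool, str, float]],
             max_steps: int = 20) -> str:
    history: list[dict] = []  # prior drafts, feedback, error traces
    best_solution, best_score = "", float("-inf")

    # Stage 1: drafting an initial candidate solution.
    solution = llm(f"Draft a solution for this task:\n{task}")

    for step in range(max_steps):
        valid, feedback, score = validate(solution)
        history.append({"step": step, "valid": valid, "feedback": feedback})
        context = "\n".join(
            f"[step {h['step']}] valid={h['valid']}: {h['feedback']}"
            for h in history)
        if valid:
            if score > best_score:
                best_solution, best_score = solution, score
            # Stage 2: improving, conditioned on the full history.
            solution = llm(f"Task:\n{task}\nHistory:\n{context}\n"
                           f"Improve this solution:\n{solution}")
        else:
            # Stage 3: debugging, prompted with the error trace.
            solution = llm(f"Task:\n{task}\nHistory:\n{context}\n"
                           f"Fix this invalid solution:\n{solution}")
    return best_solution  # empty string if no draft ever validated
```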

3. Evaluation Methodology and Metrics

OPT-BENCH introduces a strict multi-step protocol for quantitative and qualitative assessment:

  • Improvement Rate (IR):

$$\mathrm{IR}(\alpha, \beta) = \frac{1}{n}\sum_{i=1}^{n} \frac{\alpha_i}{\beta_i}$$

Where $\alpha_i$ is the performance metric (e.g., higher accuracy or lower error) after improvement and $\beta_i$ is the initial or baseline value, averaged over $n$ tasks. IR quantifies the relative gain attributable to iterative refinement.

  • Win Count: The number of tasks for which incorporating historical context (as opposed to stateless iteration) yields superior outcomes.
  • Buggy Rate: The proportion of outputs or runs in which the agent returns an invalid or non-conforming solution, capturing robustness and reliability.
  • Rank: The agent's average ordinal position relative to competing models/strategies based on prescribed task evaluation metrics.

Separate metrics are maintained for ML and NP tasks to account for domain-specific notions of correctness and solution quality.
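
Under these definitions the metrics reduce to a few lines of arithmetic; the sketch below uses illustrative function names and assumes the task metric is oriented so that higher is better.

```python
# Minimal sketch of the metrics above; names are illustrative.

def improvement_rate(alphas: list[float], betas: list[float]) -> float:
    """IR = (1/n) * sum(alpha_i / beta_i) over n tasks."""
    return sum(a / b for a, b in zip(alphas, betas)) / len(alphas)

def buggy_rate(invalid_flags: list[bool]) -> float:
    """Fraction of runs yielding an invalid or non-conforming solution."""
    return sum(invalid_flags) / len(invalid_flags)

def win_count(with_history: list[float], stateless: list[float]) -> int:
    """Tasks where history-aware iteration beats the stateless baseline."""
    return sum(1 for h, s in zip(with_history, stateless) if h > s)

# Example: accuracy improved from 0.80 -> 0.88 and 0.70 -> 0.77 on two tasks.
print(improvement_rate([0.88, 0.77], [0.80, 0.70]))  # 1.1
```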

4. Experimental Findings and Model Comparisons

Experiments in OPT-BENCH are conducted with nine LLMs from six distinct model families. Key findings include:

  • Historical context is highly beneficial: Agents using historical information yield systematically higher Improvement Rates and Win Counts compared to stateless baselines. This effect is especially strong for ML tasks where solution refinement can progress smoothly via hyperparameter or architecture tuning.
  • Iteration depth matters: Increasing the number of optimization steps (from 5 to 20) generally improves performance, but some models exhibit diminishing returns, suggesting a saturation point where further iterations confer limited additional benefit.
  • Divergent behavior on NP tasks: For NP problems, while iterative refinement has value, LLMs sometimes favor complete regeneration of solutions, which can limit the effectiveness of feedback-driven incrementality.
  • Temperature and architecture effects: Lower or moderate temperature settings tend to favor more stable and valid refinements; exploration-exploitation tradeoffs are nontrivial and model-dependent.

5. Design Principles and Task Structure

Each task in OPT-BENCH provides:

  • Formal definitions and constraints: Including dataset splits, solution interface specifications, and evaluators for correctness, feasibility, and performance.
  • Iterative feedback protocol: After each agent submission, feedback includes both numerical measures and diagnostic logs, providing the context needed for meaningful iterative refinement.
  • Open-source reproducibility: All datasets, evaluation scripts, and baseline implementations are made available, supporting community benchmarking, comparison, and extension.

This ensures experiments are reproducible and solution quality is directly comparable across models and research efforts.
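
One hypothetical shape for such a task record and its feedback payload is sketched below; the field names are assumptions for illustration, not the benchmark's published schema.

```python
# Hypothetical task record and feedback payload; field names are
# illustrative, not OPT-BENCH's actual schema.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskSpec:
    name: str                         # e.g., a Kaggle regression task
    description: str                  # formal definition and constraints
    metric: str                       # e.g., "mse" or "accuracy"
    validate: Callable[[str], bool]   # format/feasibility rules
    evaluate: Callable[[str], float]  # scores a valid submission

@dataclass
class Feedback:
    score: float   # numerical measure for the task metric
    valid: bool    # whether the submission passed validation
    log: str = ""  # diagnostic log or error trace

def run_submission(task: TaskSpec, submission: str) -> Feedback:
    """Package the numeric score and diagnostics returned to the agent."""
    if not task.validate(submission):
        return Feedback(score=float("nan"), valid=False,
                        log="submission violated the task's format rules")
    return Feedback(score=task.evaluate(submission), valid=True)
```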

6. Implications for LLM-driven Optimization and Iterative Reasoning

OPT-BENCH highlights several important trends in LLM-based optimization:

  • Iterative solution refinement, driven by feedback, is critical for complex search spaces. The ability to learn from errors and historical trajectories distinguishes more capable LLM agents.
  • Problem domain impacts the utility of iterative context: ML tasks see clear gains from this approach due to the smoothness of the search space; combinatorial NP tasks pose additional challenges for history-aware optimization.
  • Benchmarking strategies must track both performance and robustness: Buggy rate and Win Count illuminate practical reliability of agent solutions, not just raw optimality.

The release of OPT-BENCH, including the OPT-Agent framework and complete evaluation suite, establishes a new standard for assessing how LLMs can be harnessed to solve large-scale, complex optimization problems through iterative, feedback-driven approaches.

7. Future Directions and Open Resources

All resources (datasets, code, tools) are openly released (Li et al., 12 Jun 2025), providing a platform for future algorithmic innovation and meta-evaluation. Prospective extensions include:

  • Enlargement of the benchmark with harder problem classes and multi-objective optimization scenarios
  • Deeper comparative studies of feedback mechanisms and memory architectures
  • Exploration of automated agent selection, adaptation, and self-tuning strategies

OPT-BENCH positions itself as a canonical testbed for the study and development of optimization-oriented LLMs, iterative reasoning agents, and human-in-the-loop optimization protocols in both academic and practical contexts.

References

  1. Li et al., 12 Jun 2025.