HeuriGym Framework: Agentic LLM Benchmarking
- HeuriGym is an open-source framework that evaluates LLMs by having them autonomously generate, debug, and refine complete heuristic algorithms for combinatorial optimization.
- It employs an agentic, iterative workflow in which the model synthesizes full heuristic programs and improves them through feedback-driven refinement.
- The framework introduces the Quality-Yield Index (QYI) to objectively benchmark LLM performance against expert-engineered baselines in real-world problem domains.
HeuriGym is an open-source agentic evaluation framework for systematically benchmarking LLMs as generators of heuristic algorithms in combinatorial optimization (CO). It introduces an agentic, iterative workflow that requires LLMs to autonomously produce, debug, and refine complete heuristic programs for non-canonical, real-world scientific and engineering problems with vast solution spaces. Central to its methodology is the Quality-Yield Index (QYI), a metric that quantifies both the feasibility and utility of LLM-synthesized solutions against strong human-engineered baselines.
1. Framework Motivation and Approach
HeuriGym was created to address the inadequacies of prevailing LLM evaluation methodologies, which typically fall into either closed-ended (and thus saturable) tasks or open-ended, subjective comparisons. The framework enforces rigorous, objective assessment by requiring complete heuristic program synthesis rather than code completion, and leverages computation-based verification rather than human grading. Its core goal is to probe the extent of autonomous reasoning, code synthesis, constraint handling, and adaptive improvement achievable by contemporary LLMs when tasked with agentic problem-solving in combinatorial optimization.
Key features:
- End-to-end program synthesis from problem statement to executable solution.
- Iterative agentic loop: synthesis, execution, failure/success analysis, refinement.
- Use of detailed, realistic problem instances with public datasets and expert baselines for quantitative benchmarking.
- Emphasis on tool use, planning, and progressive self-improvement.
2. Architecture and Agentic Workflow
The HeuriGym evaluation protocol follows an explicit pipeline:
- Problem Specification: Each benchmark presents a full problem description, formal objectives, mathematical constraints, and I/O interface requirements. This is designed to replicate documentation provided to engineers in realistic scientific settings.
- Prompting:
- System prompt defines environment parameters (e.g., Python 3.12, permitted libraries, hardware/time limits).
- User prompt delivers the problem specification alongside a minimal code skeleton—no starter implementations or hints.
- Program Generation and Execution:
- The LLM must synthesize the entire heuristic algorithm.
- Generated programs are executed in a controlled environment (compiled or interpreted as appropriate), with logs, errors, output files, and system resource usage recorded.
- Automated Verification and Feedback:
- Three-stage outcome checks: successful execution and I/O handling; feasibility and format of output; satisfaction of all problem constraints validated by a domain-specific checker.
- All feedback (including verification status and logs) is looped back as context for the next agentic iteration.
- Iterative Refinement:
- The process repeats up to 10 times per instance. The LLM receives explicit feedback and demonstrations, iteratively adjusting its strategy, logic, and code to satisfy the problem requirements and improve performance.
This architecture enables a testbed in which LLMs are evaluated not only for immediate code generation capability but for their ability to engage in agentic problem-solving and self-improvement akin to algorithm engineering workflows.
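The control flow of this protocol can be summarized in a short sketch. The callables (`generate`, `execute`, `verify`) and the report fields below are illustrative placeholders, not HeuriGym's actual API:

```python
from typing import Any, Callable, Dict

MAX_ROUNDS = 10  # per-instance refinement budget described above

def evaluate_instance(
    problem_spec: str,
    generate: Callable[[str], str],                      # LLM call: prompt/context -> complete program text
    execute: Callable[[str], Dict[str, Any]],            # sandboxed run: program -> logs, outputs, resource usage
    verify: Callable[[Dict[str, Any]], Dict[str, Any]],  # 3-stage check: execution, output format, constraints
) -> Dict[str, Any]:
    """One agentic episode: synthesize, execute, verify, and loop all feedback back as context."""
    context = problem_spec
    for round_idx in range(MAX_ROUNDS):
        program = generate(context)   # the model writes the whole heuristic; no starter implementation
        run_log = execute(program)    # stdout/stderr, output files, and resource usage are recorded
        report = verify(run_log)      # did it run? is the output well-formed? are all constraints met?
        if report.get("verified"):
            return {"passed": True, "rounds": round_idx + 1, "objective": report.get("objective")}
        # Verification status and logs become additional context for the next refinement round.
        context += f"\n\n--- Round {round_idx + 1} feedback ---\n{report.get('feedback', '')}"
    return {"passed": False, "rounds": MAX_ROUNDS, "objective": None}
```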
3. Benchmark Domains and Problem Selection
In contrast to benchmarks rooted in canonical “textbook” CO problems, HeuriGym’s task set is curated for real-world relevance, resistance to memorization, and evaluation of creative heuristic synthesis. Selection criteria require tasks to have a clearly defined solution structure, rigorous objective metrics, and limited prior literature exposure.
Notable problem areas include:
| Domain | Example Problem | Core Challenge |
|---|---|---|
| Electronic Design Automation (EDA) | Operator scheduling, technology mapping, global routing | Resource-constrained placement and assignment in circuit design |
| Compilers | E-graph extraction, intra-operator parallelism | Optimal expression extraction, scheduling of computations |
| Biology | Protein sequence design, Mendelian error detection | Combinatorial design and integrity checking in biological data |
| Logistics | Airline crew pairing, pickup/delivery with windows | Large-scale scheduling, routing under constraints |
All problems come with demonstration sets (few-shot context) and held-out evaluation sets using realistic data and public expert baselines.
4. Evaluation Metrics and the Quality-Yield Index (QYI)
Central to HeuriGym is its quantitative multi-stage metric suite:
- $\text{solve}_s@i$: for each pipeline stage $s \in \{\text{execution}, \text{output}, \text{verification}\}$, the proportion of instances that pass stage $s$ within $i$ refinement rounds.
- Quality: for verified solutions in iteration $i$, $Q = \frac{1}{|P_{\text{pass}}|} \sum_{p \in P_{\text{pass}}} \min\!\left(\tfrac{c_p^{\text{expert}}}{c_p^{\text{LLM}}},\, 1\right)$, with $c_p^{\text{LLM}}$ the LLM's cost (objective value) on instance $p$ and $c_p^{\text{expert}}$ the expert baseline's, for a minimization objective.
- Yield: the fraction of passed instances, $Y = |P_{\text{pass}}| / |P|$.
- Quality-Yield Index (QYI): the harmonic mean, $\mathrm{QYI} = \frac{2\,Q\,Y}{Q + Y}$.
QYI captures the tradeoff between finding any valid solutions (yield) and the comparative quality of those solutions. It reaches 1 only when every instance is solved at expert-level quality, and drops to 0 if either dimension vanishes.
This multi-dimensional approach provides rigorous accounting for both LLM feasibility and solution efficacy, penalizing both trivial/invalid outputs and poor optimization.
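A back-of-the-envelope computation of QYI from per-instance results might look like the following sketch, which assumes a minimization objective and caps per-instance quality at 1; the exact aggregation in HeuriGym may differ in detail:

```python
def quality_yield_index(passed, total_instances):
    """passed: (llm_cost, expert_cost) pairs for verified instances; total_instances: evaluation set size."""
    if not passed or total_instances == 0:
        return 0.0  # no valid solutions -> QYI vanishes
    # Per-instance quality for a minimization objective: expert cost / LLM cost, capped at 1.
    quality = sum(min(c_expert / c_llm, 1.0) for c_llm, c_expert in passed) / len(passed)
    yld = len(passed) / total_instances  # fraction of instances with a verified solution
    return 2 * quality * yld / (quality + yld)  # harmonic mean of quality and yield

# Example: two of four instances verified, one at expert parity and one 20% worse.
print(quality_yield_index([(100.0, 100.0), (125.0, 100.0)], total_instances=4))  # ~0.64
```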
5. Empirical Findings and LLM Performance
Nine contemporary LLMs, including OpenAI o4-mini-high, Anthropic Claude-3.7-Sonnet, Google Gemini-2.5-Pro/Flash, DeepSeek, Meta Llama-3/4, and Alibaba Qwen3-235B, were evaluated under strict, reproducible settings (8 CPU cores per sample, limited wall-clock time, API-based access).
Key observations:
- Synthesis robustness: Most LLMs fail to produce valid, executable heuristics ab initio for >50% of test cases. Iterative agentic refinement (with up to 10 feedback rounds) significantly increases success rates, but many hard instances remain unsolved even after multiple corrections.
- Solution quality: Top-performing models attain aggregate QYI scores of around 0.6 (versus 1.0 for expert parity), with many models remaining below 0.3 on the most challenging problems.
- Error sources: Failures arise from tool use hallucinations (wrong/broken API calls), planning errors (suboptimal decomposition, lack of strategic reasoning), misinterpretation/violation of hard constraints, and computational unreliability (timeouts, infinite loops).
- Prompt and temperature effects: Higher temperature values increase algorithmic diversity but sharply reduce yield (more invalid outputs). The inclusion of demonstration cases (few-shot) consistently improves both QYI and convergence.
- Case dynamics: Illustrative studies (e.g., in EDA technology mapping) show LLMs progressing from naive, intractable enumeration-based solutions to more sophisticated, dynamic-programming (DP)-style approaches through the feedback loop, but seldom reaching domain-specific expert efficiency.
A plausible implication is that while agentic prompting and feedback significantly enhance LLM performance, limitations in generalization, long-horizon reasoning, and robust tool use persist even at scale.
6. Limitations and Analytical Insights
HeuriGym’s results expose important structural deficiencies in current LLMs for combinatorial optimization:
- Tool ecosystem awareness: LLMs routinely hallucinate or misuse available libraries despite explicit system prompts, indicating incomplete system modeling and limited code retrieval integration.
- Algorithmic planning: Ability to decompose tasks into sequential, logically coherent modules is frequently lacking, especially for multi-stage or resource-constrained optimization.
- Adaptive learning gaps: Feedback from failed executions is often incorporated only superficially; models may overfit to prior errors without genuine generalization or abstraction.
- Constraint compliance: Even explicit, testable resource, topology, or dependency constraints are systematically misinterpreted or ignored, impacting both output feasibility and optimization.
- Execution reliability: Rates of non-compiling, incorrect-running, or infinite-looping code are non-trivial, highlighting reliability gaps in current generation.
These results indicate that reliable, agentic synthesis of CO heuristics remains beyond current state-of-the-art LLMs, particularly in scientific and domain-engineering regimes.
7. Illustrative Example and Practical Dynamics
A representative workflow in the technology mapping task demonstrates the agentic feedback and refinement process. The LLM is provided with a problem description and a code skeleton:
```python
def solve(input_file: str, solution_file: str):
    # ... parse the input graph ...
    # ... enumerate all cuts of up to K nodes ...
    # ... for each node, select a covering based on DP ...
    # ... write the result in the specified output format ...
    ...
```
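To illustrate the kind of refinement observed in this task, the sketch below shows what a DP-style cut enumeration and covering pass might look like once a model moves past naive enumeration. It is a hedged illustration rather than the benchmark's reference solution: the graph representation (a dict mapping each node to its fanins), the function names, and the area-only cost model are assumptions made for this example.

```python
from itertools import product

def topological_order(graph):
    """graph maps each node to its list of fanins; returns nodes in fanin-before-fanout order."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for fanin in graph.get(node, []):
            visit(fanin)
        order.append(node)
    for node in graph:
        visit(node)
    return order

def enumerate_cuts(graph, primary_inputs, k, max_cuts=16):
    """Bottom-up K-feasible cut enumeration, pruned to keep per-node cut sets small."""
    cuts = {}
    for node in topological_order(graph):
        if node in primary_inputs:
            cuts[node] = [frozenset([node])]
            continue
        candidates = {frozenset([node])}  # trivial cut
        for combo in product(*(cuts[f] for f in graph[node])):
            leaves = frozenset().union(*combo)
            if len(leaves) <= k:
                candidates.add(leaves)
        cuts[node] = sorted(candidates, key=len)[:max_cuts]
    return cuts

def select_covering(graph, primary_inputs, outputs, cuts):
    """DP pass: pick an (approximately) area-minimal cut per node, then cover from the outputs."""
    cost, best = {}, {}
    for node in topological_order(graph):
        if node in primary_inputs:
            cost[node] = 0
            continue
        nontrivial = [c for c in cuts[node] if c != frozenset([node])]
        best[node] = min(nontrivial, key=lambda c: 1 + sum(cost[l] for l in c))
        cost[node] = 1 + sum(cost[l] for l in best[node])
    covering, stack, visited = {}, list(outputs), set()
    while stack:
        node = stack.pop()
        if node in visited or node in primary_inputs:
            continue
        visited.add(node)
        covering[node] = best[node]  # one mapped cell rooted at `node`, with the cut leaves as its inputs
        stack.extend(best[node])
    return covering
```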
This agentic loop operationalizes adaptive algorithm engineering and surfaces both LLM strengths and structural weaknesses in a manner not revealed by static code-completion or QA benchmarks.
HeuriGym inaugurates a new paradigm in LLM benchmarking by enforcing comprehensive, agentic engagement with challenging combinatorial optimization tasks. Its architecture, agentic workflow, and quantifiable metrics set a rigorous standard for measuring progress in LLM scientific reasoning and autonomous program synthesis. Empirical evidence suggests substantial limitations remain in tool use, reasoning, adaptive learning, and constraint adherence for deployed models, underscoring significant research challenges and opportunities in agentic LLM development for scientific and engineering domains.