AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

Published 9 May 2026 in cs.AI and cs.NE | (2605.08756v1)

Abstract: Automatic heuristic design (AHD) has emerged as a promising paradigm for solving NP-hard combinatorial optimization problems (COPs). Recent works show that LLMs, when integrated into well-designed frameworks (i.e., LLM-AHD), can autonomously discover high-performing heuristics. However, existing LLM-AHD frameworks typically treat LLMs as passive generators within fixed workflows, where the model generates heuristics from manually designed, limited context. Such context may fail to capture state-dependent information (e.g., specific failure modes), leading to inefficient trial-and-error exploration. To overcome these limitations, we propose AHD Agent, a novel tool-integrated, multi-turn framework that empowers LLMs to proactively decide whether to generate heuristics or invoke tools to retrieve targeted evidence from the solving environment. To effectively train such a dynamic decision-making agent, we introduce an agentic reinforcement learning (RL) system, which leverages a novel environment synthesis pipeline to optimize a compact model's generalizable AHD capabilities. Experiments across eight diverse domains, including four held-out tasks, demonstrate that our 4B-parameter agent matches or surpasses state-of-the-art baselines using much larger models, while requiring significantly fewer evaluations. Model and inference scaling analysis further reveals that AHD Agent offers an effective trajectory toward truly autonomous heuristic design.


Authors (4)

Summary

The paper proposes an agentic RL framework that enables LLMs to dynamically generate, revise, and evaluate heuristics for NP-hard combinatorial optimization problems.
It leverages state-dependent decision-making with integrated diagnostic tools to adaptively refine heuristics, achieving sample efficiency and strong cross-domain generalization.
Experimental results show that the AHD Agent outperforms fixed-workflow and hand-crafted heuristic methods while using significantly fewer evaluator calls.

Agentic Reinforcement Learning for Autonomous Heuristic Discovery

Motivation and Context

Automatic heuristic design (AHD) seeks to automate the labor-intensive process of engineering heuristics for NP-hard combinatorial optimization problems (COPs). Prior approaches—including those leveraging LLMs as heuristic generators—operate within rigid, externally defined workflows, where LLMs play a passive role and receive fixed contextual inputs. This paradigm constrains LLMs’ ability to adaptively acquire state-dependent information, identify failure modes, or efficiently revise heuristics using feedback, thus limiting heuristic quality and search efficiency. Recent RL-based improvements have enhanced candidate generation but remain bound to fixed process templates or specific problem families.

The AHD Agent Framework

AHD Agent introduces a multi-turn, tool-integrated agentic framework that empowers an LLM to control the heuristic design process through state-dependent reasoning and feedback-driven adaptation. Crucially, the agent can:

Decide at each interaction whether to generate a new heuristic, invoke diagnostic tools, or revise prior candidates, rather than progressing through a prescribed workflow.
Adaptively acquire information via a library of diagnostic tools. These include instance analysis tools—querying characteristics (e.g., clustering, density, demand patterns) of the training set—and AST-based code novelty analyzers, allowing the agent to identify redundant attempts and prioritize structurally novel solutions.
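The AST-based novelty idea can be illustrated with a minimal sketch. The summary does not describe how the paper's analyzer works, so everything below is an assumption: the function names `structural_signature` and `novelty` are hypothetical, and the choice of parent/child node-type pairs compared by Jaccard similarity is just one simple way to measure structural overlap using Python's standard `ast` module.

```python
import ast


def structural_signature(source: str) -> frozenset:
    """Return the set of (parent, child) AST node-type pairs in `source`.

    Identifier names and constant values are ignored, so renaming a
    variable does not change the signature.
    """
    tree = ast.parse(source)
    pairs = set()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            pairs.add((type(parent).__name__, type(child).__name__))
    return frozenset(pairs)


def novelty(candidate: str, archive: list) -> float:
    """Hypothetical novelty score: 1 minus the highest Jaccard similarity
    between the candidate's signature and any archived heuristic's."""
    if not archive:
        return 1.0
    cand = structural_signature(candidate)
    best_similarity = 0.0
    for previous in archive:
        sig = structural_signature(previous)
        union = cand | sig
        similarity = len(cand & sig) / len(union) if union else 1.0
        best_similarity = max(best_similarity, similarity)
    return 1.0 - best_similarity


# Two heuristics that differ only in naming, and one with a different structure.
inverse_distance = "def h(d):\n    return 1.0 / d\n"
renamed_copy = "def score(dist):\n    return 1.0 / dist\n"
loop_based = (
    "def h(d):\n"
    "    total = 0\n"
    "    for x in d:\n"
    "        total += x\n"
    "    return total\n"
)

print(novelty(renamed_copy, [inverse_distance]))  # structurally identical
print(novelty(loop_based, [inverse_distance]))    # structurally different
```

A score near zero flags a candidate as a redundant attempt, while a higher score suggests a structurally new design. A production analyzer would likely need something more discriminating than node-type pairs.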

AHD is formulated as a finite-horizon MDP over the agent’s decisions. The state is the evolving session history (diagnostic calls, evaluation results); the action space comprises token-level code editing, tool calls, and termination; and the reward is tied to the final heuristic’s scored performance.
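One way to write this formulation down is sketched below. The notation is an assumption introduced for illustration and does not come from the paper: $x$ is the task description, $o_t$ the observation returned after action $a_t$, $T$ the final turn, $h_{\text{final}}$ the heuristic selected at termination, and $R$ its scored performance. Actions are shown at turn level for readability, whereas the summary describes code editing as token-level.

```latex
\begin{aligned}
s_t &= (x,\; a_1, o_1,\; \dots,\; a_{t-1}, o_{t-1})
  && \text{task plus session history} \\
a_t &\in \mathcal{A}_{\text{code}} \cup \mathcal{A}_{\text{tool}} \cup \{\texttt{stop}\}
  && \text{edit code, call a tool, or terminate} \\
r_t &= \begin{cases} 0, & t < T \\ R(h_{\text{final}}), & t = T \end{cases}
  && \text{reward only at the end} \\
J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=1}^{T} r_t\Big]
  = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(h_{\text{final}})\big]
\end{aligned}
```

Because every intermediate reward is zero under this reading, the return of a trajectory collapses to the score of the final heuristic.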

Agentic Reinforcement Learning and Environment Synthesis

To train such an agent, the authors fine-tune Qwen3-4B-Instruct-2507 with GRPO in a scalable, multi-domain AHD environment. The synthesized training environments vary along three axes:

Seed heuristic diversity: Both seed-guided and seed-free starts, exposing the agent to a variety of failure modes and initial solutions.
Instance diversity: Training sets span diverse structural features, promoting generalization.
Solver-backbone diversity: Solvers vary between constructive and ACO-based, decoupling acquired policies from problem-specific assumptions.
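The three diversity axes can be pictured as a grid of environment configurations. The sketch below is only an illustration: `EnvSpec` and `synthesize_environments` are hypothetical names, the domain and instance-profile values are placeholders, and the summary does not say which combinations the paper's pipeline actually generates or whether every domain supports every solver backbone.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EnvSpec:
    """One training environment configuration (hypothetical structure)."""
    domain: str
    seed_mode: str
    instance_profile: str
    solver_backbone: str


def synthesize_environments(domains, seed_modes, instance_profiles, solver_backbones):
    """Enumerate the full cross product of the diversity axes."""
    return [
        EnvSpec(*combo)
        for combo in product(domains, seed_modes, instance_profiles, solver_backbones)
    ]


# Placeholder values chosen for illustration only.
envs = synthesize_environments(
    domains=["TSP", "CVRP"],
    seed_modes=["seed-guided", "seed-free"],
    instance_profiles=["uniform", "clustered"],
    solver_backbones=["constructive", "ACO"],
)
print(len(envs))
```

In practice some combinations would probably be filtered out or sampled rather than fully enumerated.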

Reward shaping penalizes non-executable code and infeasible solutions while directly incentivizing improvements over baseline heuristics. Intermediate rewards are zero; all reward is attributed to the final selected code.
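A minimal sketch of such a terminal reward, paired with the group-relative advantage normalization that GRPO is generally described as using, is given below. The exact reward terms and constants are not stated in this summary, so the penalty value, the relative-improvement form, the assumption that lower cost is better, and the function names are all hypothetical.

```python
from statistics import mean, pstdev


def terminal_reward(executable, feasible, candidate_cost, baseline_cost, penalty=-1.0):
    """Hypothetical terminal reward for the final selected heuristic.

    Non-executable code or infeasible solutions receive a fixed penalty;
    otherwise the reward is the relative improvement over the baseline
    heuristic, assuming a minimization problem.
    """
    if not executable or not feasible:
        return penalty
    return (baseline_cost - candidate_cost) / baseline_cost


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by its group's mean and standard
    deviation, as in the usual description of GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for the same task: one improves by 10%, one crashes,
# one improves by 30%, and one ties the baseline.
group = [
    terminal_reward(True, True, 90.0, 100.0),
    terminal_reward(False, False, 0.0, 100.0),
    terminal_reward(True, True, 70.0, 100.0),
    terminal_reward(True, True, 100.0, 100.0),
]
print(group_relative_advantages(group))
```

Since only the final heuristic is scored, each rollout contributes a single scalar, and the group normalization turns those scalars into a relative ranking signal among rollouts for the same task.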

Experimental Results

Sample Efficiency and Search Quality

Across extensive benchmarks on combinatorial (e.g., TSP, CVRP, OP, MKP) and continuous (cost-aware Bayesian optimization, CAF) domains, the 4B-parameter RL-trained agent:

Matches or surpasses state-of-the-art (SOTA) LLM-based AHD systems (ReEvo, EOH, MCTS-AHD, CALM) while using a model an order of magnitude smaller and fewer evaluator calls.
Demonstrates robust generalization: when evaluated on domains and protocols excluded from RL training (e.g., OVRP-Constructive, MKP-ACO, CAF), the agent matches or exceeds the best previous results with minimal evaluations.
Outperforms both hand-crafted heuristics and standard acquisition strategies in CAF, illustrating strong cross-protocol transfer.

Noteworthy empirical claims include:

The agent often achieves lower mean validation gaps than SOTA LLM-based or evolutionary methods while using a fraction of the evaluator calls (e.g., 10–20 versus 100–150).
Adding diagnostic tools significantly improves the agentic framework, whereas statically injecting tool outputs into fixed-workflow methods yields inconsistent or even negative results, underscoring the value of agency in tool use (2605.08756, ablation studies).
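The summary does not define "mean validation gap". A common convention for minimization problems, given here as an assumption rather than the paper's definition, measures the relative excess cost of heuristic $h$ over a reference solution on each validation instance and averages it. The symbols $c_h(i)$ (cost obtained with $h$ on instance $i$), $c^{\ast}(i)$ (reference or best-known cost), and $\mathcal{D}_{\text{val}}$ (validation set) are introduced only for illustration.

```latex
\text{gap}(h, i) = \frac{c_h(i) - c^{\ast}(i)}{c^{\ast}(i)} \times 100\%,
\qquad
\overline{\text{gap}}(h) = \frac{1}{|\mathcal{D}_{\text{val}}|}
  \sum_{i \in \mathcal{D}_{\text{val}}} \text{gap}(h, i)
```

Under this convention a smaller mean gap is better, and a gap of zero means the heuristic matches the reference on every instance.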

Scaling Effects

Model and inference scaling analyses reveal:

The agentic multi-turn framework consistently benefits from stronger LLM backbones, exploiting enhanced reasoning when it controls the interaction, unlike fixed-workflow approaches that show non-monotonic backbone trends.
Increasing the RL training domain count boosts both in-domain and out-of-domain (held-out) generalization, with clear reductions in mean gap as RL training expands beyond single-domain specialization.

The paper also compares two strategies for spending a larger evaluator budget, parallel sampling and sequential refinement, and reports that iterative refinement further improves solution quality as the budget grows.
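The difference between the two budget strategies can be sketched on a toy problem. Nothing below reflects the paper's implementation: the candidate is a single number instead of heuristic code, `evaluate` and `propose` are hypothetical stand-ins for the evaluator and the agent, and both strategies are written so that they consume exactly `budget` evaluator calls.

```python
import random
from typing import Optional


def evaluate(x: float) -> float:
    """Toy evaluator (hypothetical): lower is better, optimum at x = 3."""
    return (x - 3.0) ** 2


def propose(previous: Optional[float], rng: random.Random) -> float:
    """Toy proposer (hypothetical): a fresh draw, or a perturbation of `previous`."""
    if previous is None:
        return rng.uniform(-10.0, 10.0)
    return previous + rng.gauss(0.0, 1.0)


def parallel_sampling(budget: int, rng: random.Random, evaluate_fn=evaluate):
    """Draw `budget` independent candidates and keep the best one."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    candidates = [propose(None, rng) for _ in range(budget)]
    scored = [(evaluate_fn(c), c) for c in candidates]
    score, best = min(scored)
    return best, score


def sequential_refinement(budget: int, rng: random.Random, evaluate_fn=evaluate):
    """Start from one candidate and repeatedly refine the best one found so far."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    best = propose(None, rng)
    best_score = evaluate_fn(best)
    for _ in range(budget - 1):
        candidate = propose(best, rng)
        score = evaluate_fn(candidate)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score


print(parallel_sampling(8, random.Random(0)))
print(sequential_refinement(8, random.Random(0)))
```

Parallel sampling never uses feedback from earlier evaluations, whereas sequential refinement conditions each new proposal on the best result so far. That use of feedback is the property the reported comparison attributes the gains to.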

Theoretical and Practical Implications

This work establishes that endowing LLMs with agentic control—proactive state-dependent reasoning, dynamic tool use, and RL-based meta-optimization—enables effective AHD, challenging the longstanding paradigm of fixed-process evolutionary or programmatic search. The agent’s learned policy internalizes strategies for efficient exploration, code novelty management, and failure-mode diagnosis not encoded in static templates.

Theoretical implications:

Demonstrates that RL-trained LLMs can generalize design policies across problem distributions and solver protocols, supporting a move toward universal heuristic design agents.
Suggests that scaling RL training mixtures and model sizes can drive continued gains, providing a viable trajectory for LLM-based scientific and algorithmic discovery.

Practical implications:

Efficiently produces high-quality heuristics with reduced compute and inference cost, directly benefiting use cases in logistics, network design, and scientific optimization.
Practical deployment is enabled by the agent’s sample efficiency, backbone independence, and tool-augmented design adaptability.

Future Directions

Prospective advances include:

Scaling to larger base models and broader domain mixtures, potentially capturing more complex transfer behaviors and further reducing the RL-agent/SOTA model size gap.
Extension of the agentic RL framework to multi-modal or real-world optimization settings, integrating richer diagnostic toolchains and physical simulators.
Deeper study of interpretability and transparency, leveraging the agent’s interaction histories for insight into design strategies.

Conclusion

AHD Agent substantiates the case for agentic RL in automatic heuristic design, demonstrating that compact, RL-trained LLMs—when given control over multi-turn revision and dynamic tool use—yield state-of-the-art performance, broad generalization, and strong sample efficiency. The proposed methodology defines a new standard for autonomy and adaptability in LLM-driven scientific discovery and optimization (2605.08756).
