Dynamic Heuristic Biasing in Planning

Updated 16 May 2026

Dynamic Heuristic Biasing is a planning approach that adapts heuristic estimates by integrating historical and contextual search data for improved efficiency.
It relies on dynamic admissibility, consistency, and monotonicity principles to ensure optimality while minimizing redundant node expansions.
Integration with learning techniques and LLMs enables real-time heuristic updates, enhancing applications in symbolic, neuromorphic, and risk-aware planning.

Dynamic heuristic biasing in planning refers to any approach in which the heuristic function used to guide search is made sensitive to history- or context-dependent information that evolves as search progresses. Instead of assigning a static estimate $h(s)$ based solely on the state $s$, dynamic heuristics compute an estimate $h_d(s,h)$, where $h$ encodes additional knowledge such as encountered landmarks, historic transitions, trajectory information, or recently learned patterns. This allows the planning algorithm (most commonly an A* variant) to improve efficiency, informativeness, or robustness by adapting heuristic estimates on the fly in response to search dynamics. Rigorous formalizations, optimality guarantees, and practical algorithms for dynamic heuristic biasing have emerged in both the symbolic and the learning-based planning literature, as well as in recent LLM-guided and neuromorphic planning contexts.

1. Formalization and Theoretical Foundations

Dynamic heuristic biasing generalizes classical static heuristic search by explicitly introducing a history- or information-dependent heuristic function $h_d: S \times H \to \mathbb{R}_{\geq 0} \cup \{\infty\}$, where $S$ is the finite set of search states, $H$ is the set of information ("history") objects obtainable by sequences of updates and refinements along search trajectories, and $h_d(s,h)$ provides the cost-to-go estimate for state $s$ given current information $h$ (Christen et al., 29 Apr 2025). Information updates and refinements are realized as $h' = \mathrm{update}(h, (s,a,s'))$ for transitions $(s,a,s')$ and $h' = \mathrm{refine}(h, s)$ for explicit information improvements at state $s$.

A dynamic heuristic must satisfy analogues of standard admissibility/consistency properties, parameterized over reachable $(s,h)$ pairs:

Dynamic safety: $h_d(s,h) = \infty$ implies no path to goal from $s$.
Dynamic admissibility: $h_d(s,h) \leq h^*(s)$ (true minimum cost-to-go).
Dynamic goal-awareness: $h_d(s,h) = 0$ for all goal states $s$.
Dynamic consistency: $h_d(s,h) \leq c(s,a,s') + h_d(s',h)$ for transitions $(s,a,s')$.
Dynamic monotonicity: $h_d(s,\mathrm{update}(h,(s',a,s''))) \geq h_d(s,h)$ and $h_d(s,\mathrm{refine}(h,s')) \geq h_d(s,h)$.

This structure underpins a generic forward search framework (Algorithm G), and when specialized to dynamic A*, yields rigorous soundness, completeness, and optimality theorems when the invariants above are met. Dynamic monotonicity and consistency guarantee zero reopens if re-evaluation of improved heuristic values is performed (Christen et al., 29 Apr 2025).
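The conditions above can be checked mechanically on a small explicit graph. The following is a minimal sketch, not the formalization from the cited work: it reads safety, admissibility, goal-awareness, and consistency as analogues of their static counterparts for one fixed information object. The toy graph, the dictionary-based information object, and the names `true_cost_to_go`, `check_properties`, and `h_d` are all assumptions made for illustration.

```python
import heapq
import math

def true_cost_to_go(graph, goals):
    """Compute h*(s) for every state via Dijkstra on the reversed graph."""
    reverse = {s: [] for s in graph}
    for s, edges in graph.items():
        for t, cost in edges:
            reverse[t].append((s, cost))
    dist = {s: math.inf for s in graph}
    queue = []
    for goal in goals:
        dist[goal] = 0
        heapq.heappush(queue, (0, goal))
    while queue:
        d, s = heapq.heappop(queue)
        if d > dist[s]:
            continue  # stale queue entry
        for p, cost in reverse[s]:
            if d + cost < dist[p]:
                dist[p] = d + cost
                heapq.heappush(queue, (d + cost, p))
    return dist

def check_properties(graph, goals, h_d, info):
    """Evaluate four properties of h_d for one fixed information object."""
    h_star = true_cost_to_go(graph, goals)
    return {
        "safe": all(h_star[s] == math.inf
                    for s in graph if h_d(s, info) == math.inf),
        "admissible": all(h_d(s, info) <= h_star[s] for s in graph),
        "goal_aware": all(h_d(g, info) == 0 for g in goals),
        "consistent": all(h_d(s, info) <= cost + h_d(t, info)
                          for s, edges in graph.items() for t, cost in edges),
    }

# Hypothetical toy problem: "D" is a dead end, "G" is the only goal.
GRAPH = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)],
         "C": [("G", 1)], "G": [], "D": []}
GOALS = {"G"}

def h_d(state, info):
    """Assumed heuristic: look up a learned lower bound, default 0."""
    return info.get(state, 0)

sparse = check_properties(GRAPH, GOALS, h_d, {"A": 3})
full = check_properties(GRAPH, GOALS, h_d,
                        {"A": 3, "B": 2, "C": 1, "D": math.inf})
```

With only the bound for `"A"` known, the heuristic stays admissible but is inconsistent across the edge from `"A"` to `"B"`; once bounds for the remaining states are added, all four checks pass. This illustrates why the properties are stated over state and information pairs rather than over states alone.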

2. Generic and Specialized Dynamic Planning Algorithms

In the canonical dynamic A* instantiation, open and closed lists are maintained as usual, but each open-list entry pairs a state $s$ with its heuristic value $h_d(s,h)$, which reflects the current, potentially refined heuristic derived from the evolving information context (Christen et al., 29 Apr 2025). Upon each node expansion:

The dynamic heuristic is refined (potentially using expensive computations such as Bellman-error checks or abstraction refinement).
If the refined $h_d(s,h)$ increases, the node is re-inserted with its updated priority.
For each successor, both parent and heuristic information are updated, and successors are inserted or reinserted as appropriate.
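The expansion steps above can be sketched as follows. This is an illustrative simplification rather than the algorithm from the cited work: information is only refined when a node is popped (transition-based updates are omitted), refinement is assumed to converge, and the refinement rule is a one-step Bellman lookahead chosen for brevity. The names `dynamic_astar`, `bellman_refine`, `h_d`, and the toy graph are hypothetical.

```python
import heapq
import itertools
import math

def dynamic_astar(start, is_goal, successors, h_d, refine, info):
    """A* that refines the heuristic on pop and re-inserts if it rose."""
    tie = itertools.count()          # tie-breaker so states are never compared
    g = {start: 0}
    parent = {start: None}
    h0 = h_d(start, info)
    open_list = [(h0, next(tie), start, 0, h0)]
    closed = set()
    reinsertions = 0
    while open_list:
        _, _, s, g_s, h_s = heapq.heappop(open_list)
        if g_s > g[s] or s in closed:
            continue                 # stale or already expanded entry
        info = refine(info, s)       # possibly expensive refinement step
        h_new = h_d(s, info)
        if h_new == math.inf:
            continue                 # refined value proves a dead end
        if h_new > h_s:
            # The estimate increased: put the node back with its new priority.
            reinsertions += 1
            heapq.heappush(open_list, (g_s + h_new, next(tie), s, g_s, h_new))
            continue
        if is_goal(s):
            path = []
            while s is not None:
                path.append(s)
                s = parent[s]
            return g_s, path[::-1], reinsertions
        closed.add(s)
        for t, cost in successors(s):
            if g_s + cost < g.get(t, math.inf):
                g[t] = g_s + cost
                parent[t] = s
                closed.discard(t)    # reopen if a cheaper path was found
                h_t = h_d(t, info)
                heapq.heappush(open_list, (g[t] + h_t, next(tie), t, g[t], h_t))
    return None

# Hypothetical toy problem and information object (dict of lower bounds).
GRAPH = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)], "C": [("G", 1)], "G": []}
GOALS = {"G"}

def h_d(state, info):
    return info.get(state, 0)

def bellman_refine(info, s):
    """Raise the bound for s to the best one-step lookahead value."""
    if s in GOALS:
        return info
    lookahead = min((cost + h_d(t, info) for t, cost in GRAPH[s]),
                    default=math.inf)
    if lookahead > h_d(s, info):
        info = {**info, s: lookahead}
    return info

result = dynamic_astar("A", lambda s: s in GOALS, lambda s: GRAPH[s],
                       h_d, bellman_refine, {})
```

Starting from an empty information object (a blind heuristic), the search still returns the cheapest path, and the `reinsertions` counter shows how often a refined estimate sent a node back to the open list instead of expanding it immediately.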

Variations include multi-queue A*, where multiple dynamically learned or symbolic heuristics are balanced (via alternation, RL policies, or learned residuals), and best-first or portfolio schemes where the ordering of expansions is controlled by an adaptive policy over the current heuristic landscape (Brugnara et al., 19 May 2025, Speck et al., 2020).

Notable dynamic heuristic approaches include:

Landmark progression (LM-A*): The heuristic depends on the state plus the set of remaining landmarks; updates remove achieved landmarks, yielding dynamically admissible but not statically consistent heuristics, so the search is A* with reopening (Christen et al., 29 Apr 2025).
LTL and PDB-based dynamic heuristics: History- (or abstraction-) aware progressions are updated or refined in response to new path discoveries or explicit abstraction refinement, with consistency/monotonicity conditions ensuring optimality and efficient search (Christen et al., 29 Apr 2025).
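Landmark progression can be illustrated with a small sketch in which the information object is the set of landmarks still to be achieved. Several simplifications are assumed here and are not taken from the cited work: landmarks are plain facts, the heuristic is a simple count of remaining landmarks (admissible only under the assumption of unit action costs with at most one landmark achieved per action), and information from different paths to the same state is merged by set union. The function names are hypothetical.

```python
def progress(remaining, successor_facts):
    """Update on a transition: drop landmarks that hold in the successor."""
    return frozenset(lm for lm in remaining if lm not in successor_facts)

def merge(remaining_a, remaining_b):
    """Combine information from two paths reaching the same state.

    A landmark counts as achieved only if it was achieved on every known
    path, so the remaining sets are united (assumed merge rule).
    """
    return remaining_a | remaining_b

def lm_count(remaining):
    """Heuristic value: number of landmarks still to be achieved."""
    return len(remaining)

# Hypothetical example: the state is just a location.
landmarks = frozenset({"at_shop", "at_goal"})

# Path 1 reaches "at_home" via the shop, path 2 goes there directly.
via_shop = progress(progress(landmarks, {"at_shop"}), {"at_home"})
direct = progress(landmarks, {"at_home"})

merged = merge(via_shop, direct)
h_single_path = lm_count(via_shop)
h_merged = lm_count(merged)
```

The same state receives a different estimate depending on the recorded history: one remaining landmark along the first path alone, two once the second path is known. Under this merge rule the value can only grow as more paths are discovered.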

3. Integration with Learning: Policy- and Value-Based Biasing

In learning-based approaches, dynamic heuristic biasing is realized by treating heuristic selection or estimation as a stateful decision process:

Dynamic heuristic selection as MDP: The choice of heuristic at each expansion is formalized as a Markov Decision Process (MDP) where the planner state encodes open-list statistics for each heuristic (e.g., the minimum and maximum heuristic values in each list). An RL agent (e.g., DQN) learns a policy for which heuristic's open list to pop from, yielding empirically and theoretically superior scaling compared to static or periodic alternation (Speck et al., 2020).
Learning residuals over symbolic heuristics: In temporal planning, RL can be used to learn a correction term $h_{\Delta}$ (residual) to a classical symbolic heuristic $h_{\mathrm{sym}}$, giving $h = h_{\mathrm{sym}} + h_{\Delta}$, where $h_{\mathrm{sym}}$ encodes known heuristics and $h_{\Delta}$ is learned by TD methods. The planning phase balances systematic and learned heuristics via multiple queues, dynamically alternating between best-first and A* expansions (Brugnara et al., 19 May 2025).
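Both ideas in this list, a learned residual on top of a symbolic heuristic and a policy that decides which queue to pop from, can be combined in one small sketch. It is a simplified stand-in, not the method of either cited paper: the residual is a fixed lookup table rather than a TD-trained model, the default selection policy is plain alternation, and the search returns the first goal it pops, so it is satisficing rather than optimal in general. All names (`multi_queue_search`, `choose_queue`, `h_sym`, `residual`) are hypothetical.

```python
import heapq
import itertools
import math

def multi_queue_search(start, is_goal, successors, h_sym, residual,
                       choose_queue=None):
    """Search with two open lists over the same nodes.

    Queue 0 is ordered A*-style by g + h_sym; queue 1 is ordered greedily
    by the corrected heuristic h_sym + residual. `choose_queue` is the hook
    where a learned selection policy would go; by default queues alternate.
    """
    def h_learned(s):
        return max(0.0, h_sym(s) + residual(s))

    tie = itertools.count()
    queues = [[], []]

    def push(s, g):
        heapq.heappush(queues[0], (g + h_sym(s), next(tie), s, g))
        heapq.heappush(queues[1], (h_learned(s), next(tie), s, g))

    best_g = {start: 0}
    expanded = set()
    push(start, 0)
    step = 0
    while queues[0] or queues[1]:
        q = choose_queue(step, queues) if choose_queue else step % 2
        step += 1
        if not queues[q]:
            q = 1 - q                      # fall back to the non-empty queue
        _, _, s, g = heapq.heappop(queues[q])
        if s in expanded or g > best_g[s]:
            continue                       # duplicate or stale entry
        if is_goal(s):
            return g
        expanded.add(s)
        for t, cost in successors(s):
            if g + cost < best_g.get(t, math.inf):
                best_g[t] = g + cost
                expanded.discard(t)
                push(t, g + cost)
    return None

# Hypothetical toy problem: blind symbolic heuristic, tabulated residual.
GRAPH = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)], "C": [("G", 1)], "G": []}
RESIDUAL = {"A": 3, "B": 2, "C": 1, "G": 0}

def run(policy=None):
    return multi_queue_search("A", lambda s: s == "G", lambda s: GRAPH[s],
                              lambda s: 0, lambda s: RESIDUAL[s], policy)

alternating_cost = run()
astar_only_cost = run(lambda step, queues: 0)
```

Always selecting queue 0 reduces the search to A* with the symbolic heuristic, while alternation lets the greedy queue commit to a goal early. Comparing the two costs on a given instance shows the trade-off that a learned selection policy is meant to manage.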

In LLM-driven planning, dynamic biasing is applied in guiding the search towards both feasible and cost-effective plans via learned heuristics for affordance and expected payoff, with history-sensitive evaluation at every expansion (Hazra et al., 2023).

4. LLM-Based Dynamic Heuristic Biasing

Dynamic heuristic biasing has emerged as a crucial strategy in planning frameworks integrating LLM-generated priors:

BT Expansion with LLM Guidance (HBTP): The LLM predicts task-specific action sequences and related predicates/objects. During search, actions and action-space are pruned using LLM predictions, and real-time sequence scores reweight expansion order; mechanisms for feedback and correction (action re-expansion, LLM re-query on failures) yield robust, near-optimal BT generation even in the presence of LLM errors (Cai et al., 2024).
SayCanPay: LLM action proposals are scored by dynamically re-evaluated feasibility ("Can") and payoff ("Pay") heuristics. Each expansion's history-sensitive heuristic evaluation steers search away from infeasible or low-reward branches, producing plans that are both feasible and cost-effective (Hazra et al., 2023).
SRAH for Risk-Aware Navigation: LLM reasoning about environmental risk is distilled into real-time computable semantic cost maps. These semantic risks dynamically bias the edge-cost function in A* (favoring open corridors, penalizing bottlenecks), with rapid local updating as new obstacles are sensed and closed-loop replanning is triggered in response (Durrani et al., 4 May 2026). The result is robust, low-latency path planning in dynamic, uncertain environments.
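Biasing the edge-cost function with a risk map, as in the last item, can be sketched on a grid. This is a generic illustration rather than the SRAH implementation: the risk map is a hand-written array standing in for an LLM-derived semantic cost map, the step cost is assumed to be `1 + risk_weight * risk`, and replanning is modeled by calling the planner again with a set of newly sensed blocked cells. The function `plan` and its parameters are hypothetical.

```python
import heapq
import math

def plan(risk, start, goal, risk_weight=5.0, blocked=frozenset()):
    """Grid A* whose step cost is 1 plus a weighted risk of the entered cell.

    Manhattan distance remains admissible because every step costs at
    least 1 when risk values are non-negative.
    """
    rows, cols = len(risk), len(risk[0])

    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0.0}
    open_list = [(h(start), 0.0, start, (start,))]
    while open_list:
        _, g, cell, path = heapq.heappop(open_list)
        if cell == goal:
            return list(path), g
        if g > best[cell]:
            continue                      # stale entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if not inside or nxt in blocked:
                continue
            g2 = g + 1.0 + risk_weight * risk[nxt[0]][nxt[1]]
            if g2 < best.get(nxt, math.inf):
                best[nxt] = g2
                heapq.heappush(open_list, (g2 + h(nxt), g2, nxt, path + (nxt,)))
    return None

# Hypothetical 3x3 risk map with a risky bottleneck in the centre cell.
RISK = [[0.0, 0.0, 0.0],
        [0.0, 0.9, 0.0],
        [0.0, 0.0, 0.0]]
START, GOAL = (1, 0), (1, 2)

shortest_path, shortest_cost = plan(RISK, START, GOAL, risk_weight=0.0)
safe_path, safe_cost = plan(RISK, START, GOAL, risk_weight=5.0)
# Closed-loop replanning: a newly sensed obstacle blocks the upper detour.
replanned_path, replanned_cost = plan(RISK, START, GOAL, risk_weight=5.0,
                                      blocked=frozenset({(0, 1)}))
```

With the risk weight set to zero the planner cuts through the centre cell; with a positive weight it detours around it, and blocking one detour makes it fall back to the other. The risk weight is the knob that trades path length against exposure.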

5. Algorithmic and Computational Considerations

Dynamic heuristic biasing generally entails both algorithmic and computational trade-offs:

Refinement cost: Each transition or expansion may require an expensive or nonlocal update (e.g., abstraction refinement, PDB enlargement, LTL progression), dramatically increasing per-node computation (Christen et al., 29 Apr 2025).
Memory overhead: Large or complex history objects (e.g., per-state predicate caches, high-dimensional BCPNN states, or massive LLM commonsense libraries) can drive up memory usage, motivating "progression-based" schemes and selective policy learning (Christen et al., 29 Apr 2025, Zhang et al., 1 Feb 2026, Cai et al., 2024).
Stability and convergence: Many methods assume convergence of repeated refinement actions at each state; in practice, this is often enforced by bounding the number of refinements per state or using staged expansion budgets (Christen et al., 29 Apr 2025, Cai et al., 2024).
Re-evaluation and reopening: Mechanisms to avoid node reopening (enabled by enforcing monotonicity and consistency) can double priority-queue operations, but eliminate redundant expansions (Christen et al., 29 Apr 2025).
Empirical complexity: For instance, in model-based A* with dynamic energy heuristics, integrating accurate domain-dependent models for auxiliary loads or drag can halve node expansions compared to static heuristics (Ajanovic et al., 2017).

6. Applications and Empirical Results

Dynamic heuristic biasing has been validated across a spectrum of planning tasks:

Domain/Method	Speedup vs. Static	Quality/Optimality	Key Mechanism
Symbolic classical planning (Christen et al., 29 Apr 2025)	2–3x fewer expansions	Optimal/zero-reopen (if conditions met)	Landmark progression, abstraction, etc.
BT planning with LLM (Cai et al., 2024)	2–3 orders of magnitude (planning time)	Near-optimal (+0.3% BT cost)	LLM-based action pruning, dynamic cost
RL-based heuristic selection (Speck et al., 2020)	Exponential (theory); 10–20% empirical	Strictly dominates best static/alternating	DQN policy, MDP selection
Neuromorphic EUA (Zhang et al., 1 Feb 2026)	Real-time convergence (<1000 timesteps)	[…]% from Gurobi optimal	BCPNN dynamic bias, WTA, fill/capacity
LLM risk-aware navigation (Durrani et al., 4 May 2026)	[…]% relative to greedy BFS	Small trade-off, robust in cluttered env	Real-time semantic cost update, closed-loop
LLM feasibility/cost (SayCanPay) (Hazra et al., 2023)	+40pp success, +97% plan optimality	Maintains feasibility/cost-effectiveness	Dynamic history-dependent scoring

Empirical benchmarks confirm that dynamic heuristic biasing can dramatically reduce search expansions and runtime, with larger gains in settings where the base heuristic is weak, there is high structural diversity across instances, or informative external priors (LLM predictions, domain-specific heuristics) are available.

7. Limitations and Future Directions

While dynamic heuristic biasing unifies a broad class of planning enhancements under a common formalism, challenges remain:

Learning overhead and generalization: RL-based dynamic policies require significant training data, online adaptation, and careful hyperparameter tuning (Speck et al., 2020, Brugnara et al., 19 May 2025).
Heuristic and information explosion: Rich history spaces may create substantial memory and computational burdens. Design of compact, progression-based, or learnable representations is essential (Christen et al., 29 Apr 2025).
Convergence and stability: Dynamic refinement and feedback (especially in LLM-in-the-loop systems) can lead to oscillatory or overly conservative behavior if not adequately stabilized (Cai et al., 2024).
Integration with continuous control, stochastic domains, and partial observability: Existing frameworks primarily target deterministic, discrete-state planning; extensions are needed for continuous, partially observed, or adversarial settings (Ajanovic et al., 2017, Durrani et al., 4 May 2026).
Guarantees vs. flexibility: Many dynamic biasing schemes achieve near-optimality but not strict optimality unless additional structural invariants (consistency, monotonicity) are enforced with computational cost (Christen et al., 29 Apr 2025).