ISR-LLM: Iterative Self-Refinement for LLM Planning

Updated 2 May 2026

ISR-LLM is an LLM-centric planning framework that enhances long-horizon, discrete decision-making by iteratively refining natural language instructions into validated action plans.
It converts instructions to PDDL, generates initial plans with chain-of-thought reasoning, and employs self or external validators for targeted plan correction.
Empirical evaluations in domains like Cooking, Blocksworld, and BallMoving demonstrate up to a 50 percentage point improvement in success over direct LLM-based planners.

ISR-LLM is an LLM-centric task planning framework designed to improve the feasibility and correctness of plans for long-horizon, discrete, sequential decision-making problems through an iterative self-refinement process. Its application domain is robotics and automated planning, where translating natural-language instructions into executable, validated multi-step action plans is essential. ISR-LLM introduces explicit feedback-driven plan correction and achieves substantially higher success rates than conventional LLM-based planners that work directly from natural-language instructions (Zhou et al., 2023).

1. Formal Problem Setting

ISR-LLM addresses deterministic, fully observable, discrete long-horizon sequential planning tasks. The planning problem is modeled as the 5-tuple

$P = \langle S, A, T, s_{\mathrm{init}}, G\rangle,$

where $S$ is a set of states, $A$ is a set of actions, $T: S \times A \rightarrow S$ is a deterministic transition function, $s_{\mathrm{init}} \in S$ is the initial state, and $G \subseteq S$ is a set of goal states.

A plan $\pi = (a_1, a_2, \ldots, a_n)$ is a solution if, starting from $s_0 = s_{\mathrm{init}}$, the states $s_i = T(s_{i-1}, a_i)$ for $1 \leq i \leq n$ satisfy $s_n \in G$. The challenge for LLM-based planners in this setting is the combinatorial complexity induced by a large state space $|S|$ and a long horizon $n$, with the additional constraint that only actions $a_i$ whose preconditions hold in state $s_{i-1}$ are applicable (Zhou et al., 2023).
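The solution condition can be made concrete with a small checker. The following is a minimal sketch, not code from the original work: it assumes the transition function returns `None` when an action's preconditions do not hold, and the counter domain `toy_T` is a hypothetical example chosen only to keep the code short.

```python
from typing import Callable, Hashable, Optional, Sequence, Set

State = Hashable
Action = Hashable
# Assumption: T returns None when the action's preconditions do not hold.
Transition = Callable[[State, Action], Optional[State]]


def is_solution(T: Transition, s_init: State, goals: Set[State],
                plan: Sequence[Action]) -> bool:
    """True if every action is applicable in turn and the final state is a goal."""
    state = s_init
    for action in plan:
        nxt = T(state, action)
        if nxt is None:  # precondition violated: the plan is infeasible
            return False
        state = nxt
    return state in goals


# Hypothetical toy domain: a counter that may move between 0 and 3.
def toy_T(state: int, action: str) -> Optional[int]:
    if action == "inc" and state < 3:
        return state + 1
    if action == "dec" and state > 0:
        return state - 1
    return None


print(is_solution(toy_T, 0, {2}, ["inc", "inc"]))  # True
print(is_solution(toy_T, 0, {2}, ["dec", "inc"]))  # False: "dec" is not applicable at 0
```

The check is linear in the plan length; the difficulty described above lies in finding such a plan, not in verifying one.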

2. Architecture and Methodology

ISR-LLM operationalizes a three-stage pipeline:

Preprocessing (LLM Translator): Converts natural language instructions into formal PDDL (Planning Domain Definition Language) specifications. Using in-context few-shot prompting, the LLM outputs both a domain file (types, predicates, action schemas) and a problem file (objects, initial state, goals).
Planning (LLM Planner): Employs chain-of-thought (CoT) prompting and few-shot examples to generate an initial action plan $\pi$ from the translated PDDL input. Steps are enumerated, with CoT-style reasoning about intermediate states interleaved between them.
Iterative Self-Refinement: Introduces a validator, which may be LLM-based (self-validation) or a classical symbolic PDDL validator (external), to assess plan feasibility. The validator pinpoints the index of the first incorrect action and provides an error reason. The planner then refines the plan using both the error feedback and the prior plan as context, iterating until a valid plan is obtained or a maximum number of iterations is reached. Schematically, each iteration follows a Plan–Validate–Refine cycle.

Pseudocode provided in the original work details LLM_Translate, LLM_Plan, Validate_and_Feedback, LLM_Refine, and the main pipeline (Zhou et al., 2023).
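The control flow of the refinement stage can be sketched as follows. This is an illustrative outline under stated assumptions, not the pseudocode from the original work: the function name `refine_until_valid`, the `Feedback` record, and the three `fake_*` callables are hypothetical stand-ins for the LLM planner, validator, and refiner, and the translation step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Plan = List[str]


@dataclass
class Feedback:
    valid: bool
    error_index: Optional[int] = None  # index of the first incorrect action
    reason: str = ""                   # textual cause reported by the validator


def refine_until_valid(
    problem: str,
    plan_fn: Callable[[str], Plan],
    validate_fn: Callable[[str, Plan], Feedback],
    refine_fn: Callable[[str, Plan, Feedback], Plan],
    max_iters: int,
) -> Tuple[Optional[Plan], int]:
    """Return (valid plan or None, number of refinement calls used)."""
    plan = plan_fn(problem)
    for used in range(max_iters + 1):
        feedback = validate_fn(problem, plan)
        if feedback.valid:
            return plan, used
        if used < max_iters:
            plan = refine_fn(problem, plan, feedback)
    return None, max_iters


# Deterministic stand-ins for the LLM calls (purely illustrative).
TARGET = ["a", "b", "c"]


def fake_plan(problem: str) -> Plan:
    return ["a", "x", "y"]


def fake_validate(problem: str, plan: Plan) -> Feedback:
    for i, (got, want) in enumerate(zip(plan, TARGET)):
        if got != want:
            return Feedback(False, i, f"step {i}: expected {want}, got {got}")
    return Feedback(True)


def fake_refine(problem: str, plan: Plan, fb: Feedback) -> Plan:
    fixed = list(plan)
    fixed[fb.error_index] = TARGET[fb.error_index]  # repair only the flagged step
    return fixed


print(refine_until_valid("demo", fake_plan, fake_validate, fake_refine, max_iters=3))
# (['a', 'b', 'c'], 2)
print(refine_until_valid("demo", fake_plan, fake_validate, fake_refine, max_iters=1))
# (None, 1)
```

Passing the planner, validator, and refiner as callables keeps the loop independent of whether validation is done by the LLM itself or by an external tool.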

3. Validation Strategies and Feedback Loops

Two validator strategies are considered:

Self-validator: The LLM critiques its own output using few-shot examples and error identification prompts. This validator is “plug-and-play” but may hallucinate or provide imprecise feedback.
External validator: An engineered symbolic PDDL validator (e.g., VAL) executes and formally checks each proposed plan, returning precise failure indices and reasons when infeasibility arises.

Feedback from the validator takes the form of an error index and textual cause, facilitating targeted plan correction in subsequent planning iterations. Iteration proceeds until successful validation or exhaustion of the iteration budget.
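To illustrate the kind of feedback an external validator produces, here is a toy simulator for a four-operator Blocksworld. It is a hedged stand-in written for this example, not VAL and not the validator used in the original work; the action names (`pickup`, `putdown`, `stack`, `unstack`) and the dictionary state encoding are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Action = Tuple[str, ...]
Result = Tuple[bool, Optional[int], str]  # (valid, first error index, reason)


def validate(init_on: Dict[str, str], goal_on: Dict[str, str],
             plan: List[Action]) -> Result:
    """Simulate the plan and report the first action whose preconditions fail."""
    on = dict(init_on)             # block -> "table" or the block beneath it
    holding: Optional[str] = None  # block currently in the gripper, if any

    def clear(b: str) -> bool:
        return b in on and b not in on.values()

    for i, (name, *args) in enumerate(plan):
        if name in ("pickup", "unstack"):
            b = args[0]
            support = "table" if name == "pickup" else args[1]
            if holding is not None:
                return False, i, "hand is not empty"
            if on.get(b) != support:
                return False, i, f"{b} is not on {support}"
            if not clear(b):
                return False, i, f"{b} is not clear"
            del on[b]
            holding = b
        elif name in ("putdown", "stack"):
            b = args[0]
            support = "table" if name == "putdown" else args[1]
            if holding != b:
                return False, i, f"not holding {b}"
            if support != "table" and not clear(support):
                return False, i, f"{support} is not clear"
            on[b] = support
            holding = None
        else:
            return False, i, f"unknown action {name}"

    if all(on.get(b) == s for b, s in goal_on.items()):
        return True, None, ""
    return False, len(plan), "goal not reached after the last action"


init = {"A": "table", "B": "A", "C": "table"}
goal = {"A": "B"}
bad = [("pickup", "A"), ("stack", "A", "B")]
good = [("unstack", "B", "A"), ("putdown", "B"), ("pickup", "A"), ("stack", "A", "B")]

print(validate(init, goal, bad))   # (False, 0, 'A is not clear')
print(validate(init, goal, good))  # (True, None, '')
```

The returned index and reason are exactly the two pieces of feedback described above, and they can be pasted into the refinement prompt unchanged.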

4. Experimental Results and Quantitative Evaluation

ISR-LLM was benchmarked on three domains: Cooking, Blocksworld, and BallMoving, each with increasing numbers of pots, blocks, or balls (3 and 4 objects in the results reported below). The evaluation covered three planning methods (LLM-direct, ISR-LLM-self, ISR-LLM-external) and two LLMs (GPT-3.5 and GPT-4) over 30 randomly seeded instances per configuration. Success was defined as producing a valid plan according to the formal specification.

A summary of domain-wise success rates is shown below:

Domain	GPT3.5-direct	GPT3.5-self	GPT3.5-external	GPT4-direct	GPT4-self	GPT4-external
Cooking (3 pots)	47%	67%	100%	100%	100%	100%
Cooking (4 pots)	40%	53%	63%	100%	100%	100%
Blocksworld (3)	20%	37%	70%	43%	60%	97%
Blocksworld (4)	10%	17%	53%	40%	60%	80%
BallMoving (3)	33%	50%	70%	93%	100%	100%
BallMoving (4)	17%	27%	57%	90%	93%	97%

Key findings include:

ISR-LLM-self increases success over LLM-direct by 7–20 percentage points (pp) with GPT-3.5 and by up to 20 pp with GPT-4, which already reaches 100% on Cooking without refinement.
ISR-LLM-external yields larger gains, about 40–54 pp over LLM-direct in Blocksworld, the most challenging domain, reflecting the utility of symbolic validation.
Success degrades as the plan horizon and search space grow.
Paired t-tests are reported to confirm that all ISR enhancements are statistically significant relative to LLM-direct (Zhou et al., 2023).
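The percentage-point gains can be recomputed directly from the success-rate table. The snippet below is a small arithmetic check; the numbers are copied from the table above, and the helper name `gains` is introduced here only for illustration.

```python
# Success rates (%) copied from the table above.
# Column order: direct, self, external for GPT-3.5, then the same for GPT-4.
RESULTS = {
    "Cooking (3 pots)": (47, 67, 100, 100, 100, 100),
    "Cooking (4 pots)": (40, 53, 63, 100, 100, 100),
    "Blocksworld (3)": (20, 37, 70, 43, 60, 97),
    "Blocksworld (4)": (10, 17, 53, 40, 60, 80),
    "BallMoving (3)": (33, 50, 70, 93, 100, 100),
    "BallMoving (4)": (17, 27, 57, 90, 93, 97),
}
OFFSET = {"GPT-3.5": 0, "GPT-4": 3}


def gains(model: str) -> dict:
    """Map each task to (self - direct, external - direct) in percentage points."""
    k = OFFSET[model]
    return {task: (row[k + 1] - row[k], row[k + 2] - row[k])
            for task, row in RESULTS.items()}


for model in OFFSET:
    print(model)
    for task, (self_gain, ext_gain) in gains(model).items():
        print(f"  {task:<17} self +{self_gain:>2} pp   external +{ext_gain:>2} pp")
```

On these figures the self-validation gains for GPT-3.5 range from 7 to 20 pp, and the external-validator gains in Blocksworld range from 40 to 54 pp across the two models.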

5. Limitations and Observed Challenges

ISR-LLM’s bottlenecks and critical observations are:

Validator tradeoff: The external validator achieves higher success but requires manual engineering of domain validators. The self-validator is flexible and easily deployable but less precise.
Domain sensitivity: Domains demanding deep logical inference, such as Blocksworld with four blocks, remain challenging for LLMs enhanced only by self-refinement.
Scalability: The approach is constrained by the LLM context window and prompt length as the size of the state space $|S|$ or the plan horizon $n$ increases.
Stochasticity and repeatability: LLMs exhibit non-deterministic responses even at zero temperature, potentially leading to instabilities in solution quality.
Lack of formal guarantees: Unlike symbolic solvers, plans are empirically but not formally verified and cannot be guaranteed feasible for all instances or safety-critical applications.

6. Implications, Extensions, and Future Research

ISR-LLM points to several lines of future inquiry:

Integrating parametric fine-tuning of LLMs using large-scale planning corpora for improved domain-specific accuracy.
Hybrid symbolic–neural approaches where LLM-generated plans serve as heuristics or seeds for formal classical planners.
Extension to Task and Motion Planning (TAMP), where discrete symbolic planning over PDDL is grounded into continuous robot execution.
Developing more robust and expressive verification loops employing Answer Set Programming (ASP) or Satisfiability Modulo Theories (SMT) validators.

A plausible implication is that the iterative self-refinement architecture generalizes to broad LLM planning settings, not only within discrete modular domains but potentially to hybrid symbolic–subsymbolic pipelines in robotics and AI planning. Current limitations around scalability, formal verifiability, and domain generalization remain open research questions (Zhou et al., 2023).

7. Comparative Assessment and Position within LLM Planning Ecosystem

ISR-LLM’s iterative, validator-driven design contrasts with “single-shot” LLM planners, which translate natural language into a plan in one pass. Empirical results indicate that ISR-LLM narrows the gap between generic LLM-based planners and domain-specific symbolic solvers, especially as task complexity and horizon increase.

This position is orthogonal to frameworks such as the LLM–VLM fusion architecture, which integrates symbolic planners and vision-language models for autonomous ISR in robotics but does not center on explicit self-refinement or Plan–Validate–Refine cycles (Din et al., 2026). Both paradigms represent trajectories for bringing flexible natural-language planning to long-horizon, high-reliability tasks in robotics.

ISR-LLM therefore establishes a programmatic, LLM-agnostic approach for reliable sequential task planning in settings where classical planners struggle with ambiguous instructions and LLMs alone are insufficiently grounded (Zhou et al., 2023).


References

Zhou et al. (2023). ISR-LLM: Iterative Self-Refined Large Language Model for Long-Horizon Sequential Task Planning.

Din et al. (2026). LLM-VLM Fusion Framework for Autonomous Maritime Port Inspection using a Heterogeneous UAV-USV System.
