Graph-based Self-Imitation Learning

Updated 11 May 2026

Graph-based self-imitation is a reinforcement and imitation learning approach that encodes agent trajectories into graphs to extract and replay high-value sub-sequences.
It employs planning, value propagation, and retrieval mechanisms to guide policy updates efficiently without relying solely on external expert demonstrations.
Empirical studies demonstrate improved sample efficiency, behavioral diversity, and policy robustness across robotics, goal-conditioned control, and resource orchestration domains.

Graph-based self-imitation is a class of reinforcement and imitation learning algorithms that leverage graph-structured representations of agent experience to drive efficient policy improvement by reusing an agent’s own best prior behaviors. These frameworks encode visited states and transitions, often augmented with similarity or domain priors, into a graph structure. By applying planning, value propagation, or retrieval within this graph, policies are optimized to imitate the most promising sub-trajectories or strategies identified from the agent’s own interaction dataset—eliminating sole reliance on external expert demonstrations or hand-crafted reward signals. This approach has demonstrated substantial gains in sample efficiency, behavioral diversity, and learning stability across robot manipulation, goal-conditioned control, and complex resource orchestration domains.

1. Fundamentals of Graph-based Self-Imitation

Graph-based self-imitation synthesizes two core ideas: (1) explicit or implicit graph construction over the agent’s trajectory data and (2) the systematic reuse of high-value or optimal sub-sequences discovered via planning or retrieval on this graph.

Given a set of agent experiences (either from online interaction or offline logs), states, goals, or visual embeddings are interpreted as vertices, and transitions or similarity links as edges, yielding a graph $G=(V,E)$. Planning or search algorithms are then applied to extract paths, subgoals, or values. These outputs are used to bias or supervise the policy, usually via auxiliary behavioral cloning or policy distillation losses, or via selective replay of high-reward trajectories. Crucially, the approach supports self-supervised improvement, as it need not rely on external expert demonstrations.
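The core idea of stitching experience through a graph can be illustrated with a minimal sketch. It assumes discrete, hashable states and a known per-transition cost; the function names `build_graph` and `shortest_path` are illustrative and are not taken from any of the cited works. Two logged trajectories share the state B, and a shortest-path search recovers a route from A to D that is cheaper than either trajectory alone.

```python
import heapq
import itertools

def build_graph(trajectories):
    """Directed graph {state: {next_state: cost}} from logged transitions.

    Each trajectory is a list of (state, next_state, cost) tuples
    (an assumed log format); the cheapest observed cost per edge is kept.
    """
    graph = {}
    for trajectory in trajectories:
        for state, next_state, cost in trajectory:
            out_edges = graph.setdefault(state, {})
            graph.setdefault(next_state, {})  # register the vertex even without out-edges
            if cost < out_edges.get(next_state, float("inf")):
                out_edges[next_state] = cost
    return graph

def shortest_path(graph, start, goal):
    """Dijkstra search; returns (path, cost) or (None, inf) if goal is unreachable."""
    tie = itertools.count()  # tie-breaker so states themselves are never compared
    dist = {start: 0.0}
    parent = {}
    frontier = [(0.0, next(tie), start)]
    while frontier:
        d, _, node = heapq.heappop(frontier)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale queue entry
        for nxt, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                parent[nxt] = node
                heapq.heappush(frontier, (nd, next(tie), nxt))
    if goal not in dist:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1], dist[goal]

# Two toy trajectories that overlap at state "B".
logged = [
    [("A", "B", 1.0), ("B", "D", 5.0)],
    [("B", "C", 1.0), ("C", "D", 1.0)],
]
graph = build_graph(logged)
path, cost = shortest_path(graph, "A", "D")
print(path, cost)  # ['A', 'B', 'C', 'D'] 3.0
```

The recovered path mixes the first step of one trajectory with the remainder of the other, which is the kind of recomposed sub-trajectory that a self-imitation loss can then use as a training target.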

2. Graph Construction and Planning Mechanisms

Graph construction varies based on task structure and learning paradigm:

In motion planning contexts, such as in Self-Imitation Learning by Planning (SILP), all collision-free states visited during a rollout become the vertex set, and edges are created between nearby states (within a distance threshold), typically weighted by Euclidean transition cost (Luo et al., 2021). Planning (A* or Dijkstra) is invoked to discover optimal paths from past data.
In goal-conditioned reinforcement learning, as in PIG (Planning-based Imitation Guidance), landmark graphs over state embeddings are constructed by farthest-point sampling from the replay buffer. Edges are added between landmarks within a fixed distance and weighted by value estimates, enabling multi-step optimal subgoal sequencing for planning (Kim et al., 2023).
In offline imitation with high-dimensional sensory data, GSR (Graph Search and Retrieval) first computes representations from a pretrained encoder, subsamples states to form nodes, connects via demonstration transitions, and then augments with similarity-based links in the embedding space. The graph supports exact shortest-path queries and value propagation for each segment (Yin et al., 2024).
In edge-AI orchestration and high-level planning (SIL-GPO), multiple graph modalities (deployment topologies, routing subgraphs, invocation graphs) are encoded via graph attention networks, providing a relational embedding for complex system state and control (Yang et al., 3 Mar 2026).
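As a rough illustration of landmark-graph construction, the sketch below selects landmarks by farthest-point sampling and connects those within a distance threshold. It assumes low-dimensional points and Euclidean distance as the edge weight; PIG weights edges by value estimates instead, so the distance here is only a stand-in. The helper names `farthest_point_sampling` and `connect_within` are hypothetical.

```python
import math

def farthest_point_sampling(points, k, first=0):
    """Return indices of up to k landmarks, greedily maximizing spread."""
    chosen = [first]
    # Distance from every point to its nearest already-chosen landmark.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

def connect_within(points, nodes, threshold):
    """Directed edges {(a, b): distance} between landmarks closer than threshold."""
    edges = {}
    for a in nodes:
        for b in nodes:
            if a != b:
                d = math.dist(points[a], points[b])
                if d <= threshold:
                    edges[(a, b)] = d
    return edges

# Toy "replay buffer" of 2-D state embeddings.
buffer_states = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
landmarks = farthest_point_sampling(buffer_states, k=4)
edges = connect_within(buffer_states, landmarks, threshold=1.0)
print(sorted(landmarks), len(edges))
```

In this toy buffer the four corners of the unit square are selected, and only the square's sides (not its diagonals) fall within the threshold, so a planner must route through an intermediate landmark to cross the square.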

3. Self-Imitation Objectives and Policy Distillation

Policy improvement in graph-based self-imitation occurs via several mechanisms:

Behavior Cloning from Self-Planned Paths: After planning on the per-episode state graph, optimal trajectories are relabeled with computed actions (potentially via inverse kinematics or dynamics) and fed into a demonstration buffer. Behavioral cloning gradients are then taken only when these actions yield higher value than the current policy, as measured by a learned critic with Q-filtering (Luo et al., 2021).
Policy Distillation via Multi-Subgoal Imitation: Target-goal policies are explicitly encouraged to agree, in output space, with subgoal-conditioned policies selected along the planner’s multi-subgoal path. The auxiliary objective

$L_\mathrm{PIG}(\theta) = \mathbb{E}_{(s,\tau_g,g)\sim B}\left[\frac{1}{N-1} \sum_{k=2}^N \left\| \pi_\theta(s,g) - \text{StopGrad}(\pi_\theta(s, \ell^k)) \right\|_2^2\right]$

is linearly combined with the standard actor loss to drive target-goal policies toward subgoal expertise (Kim et al., 2023).

Retrieval-weighted Behavior Cloning: In GSR, each node’s K-nearest feature-space neighbors are weighted by both similarity and graph-propagated value, via a softmax allocation. These re-weighted transitions define a weighted maximum-likelihood objective for policy fitting, extending classic behavior cloning to incorporate global structure and task performance (Yin et al., 2024).
Prioritized Self-Imitation Replay: In SIL-GPO, entire high-return trajectories, as measured by global episodic reward, are stored. Subsequent policy updates are augmented with an extra imitation loss focused only on transitions associated with positive advantage relative to the value baseline (Yang et al., 3 Mar 2026).
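Two of the objectives above can be sketched in a few lines of plain Python. The first function applies a Q-filter to a squared-error behavior cloning term; the second evaluates the $L_\mathrm{PIG}$ formula for a single sample. Both are simplified illustrations under stated assumptions: actions are tuples of floats, the Q-values are supplied by the caller rather than a learned critic, the subgoal-conditioned outputs are treated as constants (standing in for StopGrad), and averaging the filtered loss over the whole batch is one possible normalization choice.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two action tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def q_filtered_bc_loss(policy_actions, demo_actions, q_demo, q_policy):
    """Behavior cloning loss that only counts samples where the critic
    rates the demonstrated action above the policy's own action."""
    total = 0.0
    for a_pi, a_demo, qd, qp in zip(policy_actions, demo_actions, q_demo, q_policy):
        if qd > qp:  # Q-filter: skip demonstrations the critic considers worse
            total += squared_distance(a_pi, a_demo)
    return total / max(len(policy_actions), 1)

def pig_distillation_loss(policy, state, goal, landmarks):
    """Single-sample version of L_PIG.

    landmarks = [l^1, ..., l^N] is the planned subgoal sequence; the sum
    runs over k = 2..N and is normalized by N - 1, as in the formula.
    """
    subgoals = landmarks[1:]
    if not subgoals:
        raise ValueError("need at least two landmarks (N >= 2)")
    target_goal_action = policy(state, goal)
    return sum(
        squared_distance(target_goal_action, policy(state, subgoal))
        for subgoal in subgoals
    ) / len(subgoals)

# Toy deterministic policy (hypothetical): act in the direction of the goal.
def toy_policy(state, goal):
    return tuple(g - s for s, g in zip(state, goal))

bc = q_filtered_bc_loss(
    policy_actions=[(0.0,), (1.0,)],
    demo_actions=[(1.0,), (3.0,)],
    q_demo=[2.0, 0.5],
    q_policy=[1.0, 1.0],
)
pig = pig_distillation_loss(
    toy_policy, state=(0.0, 0.0), goal=(4.0, 0.0),
    landmarks=[(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)],
)
print(bc, pig)  # 0.5 2.0
```

In the Q-filter example only the first sample contributes, because the critic scores the second demonstrated action below the policy's action. In a real implementation these terms would be computed with an automatic differentiation framework and added to the actor loss.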

4. Algorithmic Structures and Computational Considerations

The graph-based self-imitation pipeline includes the following canonical steps:

Graph Extraction: Construct state/goal/observation graphs online (per rollout) or offline (over replay/history).
Path or Value Extraction: Apply A*, Dijkstra, or soft value iteration to extract optimal paths, subgoal sequences, or value estimates.
Transition Relabeling and Retrieval: Reconstruct state-action(-next-state) tuples along planned paths, possibly with motion decomposition or action inversion; in retrieval settings, allocate clone targets based on neighbor value.
Policy Update Loop: For on-policy or off-policy RL, combine these imitation/retrieval updates with standard actor-critic (e.g., DDPG, SAC, PPO) or pure supervised updates.
Buffer and Hyperparameter Management: Maintain demonstration/high-return buffers and tune imitation coefficients. Sampling schemes, graph update frequency, and batch structure must all be managed for scalable learning (Luo et al., 2021; Kim et al., 2023; Yin et al., 2024; Yang et al., 3 Mar 2026).

For instance, SILP alternates between rollouts, graph construction, A* planning, and episodic relabeling for every epoch, with policy gradients blended from RL and behavior cloning sources. GSR applies a full sequence of embedding, graph build, Dijkstra value propagation, neighbor retrieval, and re-weighted maximum likelihood, all as a batch preprocessing step for offline learning.
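The retrieval and re-weighting step of an offline pipeline of this kind might look like the following sketch. The exact scoring rule used by GSR is not given here, so the form below (graph-propagated value minus a scaled feature-space distance, passed through a softmax) is an assumption, as are the parameter names `temperature` and `beta`.

```python
import math

def retrieval_weights(distances, values, temperature=1.0, beta=1.0):
    """Softmax weights over retrieved neighbors.

    Assumed score: higher propagated value and smaller feature-space
    distance both increase a neighbor's weight.
    """
    scores = [(v - beta * d) / temperature for d, v in zip(distances, values)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]

def weighted_bc_loss(log_probs, weights):
    """Weighted negative log-likelihood of the retrieved actions."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))

# Three retrieved neighbors of a query state (toy numbers).
neighbor_distances = [0.1, 0.1, 0.5]
neighbor_values = [1.0, 2.0, 2.0]
weights = retrieval_weights(neighbor_distances, neighbor_values)
loss = weighted_bc_loss(log_probs=[-1.2, -0.3, -0.8], weights=weights)
print(weights, loss)
```

With these numbers the second neighbor, which is both close and high-value, receives the largest weight, so its action dominates the cloning target for the query state.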

5. Empirical Outcomes and Practical Advantages

Graph-based self-imitation has yielded pronounced empirical gains:

In long-horizon robotic control (e.g., multi-stage motion planning with obstacles), SILP achieves success rates exceeding 98% on complex test tasks, both in simulation and sim-to-real transfer, while requiring 30–35% fewer environment steps than pure RL or HER-based methods. Online relabeling with graph planning reduces demonstration mismatch and streamlines convergence (Luo et al., 2021).
Within goal-conditioned RL benchmarks (e.g., AntMaze, Pusher), PIG achieves up to 2× faster learning and 3× higher asymptotic success rates compared to HER, GCSL, L3P and other non-graphical or non-imitation-augmented baselines. Policy generalization is evidenced by strong performance even if the planner is removed at test time (Kim et al., 2023).
In offline dexterous robot learning, GSR achieves 10–30% higher success rates and 30% improved proficiency across a range of manipulation tasks with complex visual observations, without requiring Q-learning or risking the deadly triad (function approximation, bootstrapping, and off-policy training). The graph-based retrieval and value propagation process offers inherent stability and performance guarantees (Yin et al., 2024).
In orchestration of edge-AI microservices, SIL-GPO delivers 19–35% lower end-to-end latency versus advanced metaheuristic and RL baselines, with robust gains in resource utilization across diverse workloads and topologies (Yang et al., 3 Mar 2026).

6. Variations, Extensions, and Limitations

Multiple instantiations of graph-based self-imitation are distinguished by technical choices:

Edge and Graph Semantics: Edges may be transition-based, similarity-based, or determined by environment constraints; some frameworks employ multi-graph encoders for complex relational reasoning (Yang et al., 3 Mar 2026).
Loss Weighting and Filtering: Q-filtering or advantage thresholding prevents the self-imitation loss from propagating suboptimal or outdated behaviors. Hyperparameter ablations show that learning remains stable for moderate imitation weights (e.g., $\lambda_2 \lesssim 0.05$ in SILP) (Luo et al., 2021).
Stochastic Subgoal Skipping: Skipping subgoals according to a value- or imitation-gap-adaptive probability further increases exploration, accelerating learning and improving policy robustness (Kim et al., 2023).
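The interaction between advantage thresholding and the imitation weight can be made concrete with a small sketch. It uses the generic self-imitation form in which each stored transition contributes its negative log-probability scaled by the clipped advantage $\max(R - V(s), 0)$; this is not necessarily the exact loss of any cited method, and the default weight `lam=0.05` is only an illustrative value in the range mentioned above.

```python
def self_imitation_loss(log_probs, returns, values):
    """Mean of -log pi(a|s) * max(R - V(s), 0) over stored transitions.

    Transitions whose return does not exceed the value baseline
    contribute nothing, so outdated or poor behavior is not imitated.
    """
    terms = [
        -lp * max(ret - val, 0.0)
        for lp, ret, val in zip(log_probs, returns, values)
    ]
    return sum(terms) / len(terms)

def total_loss(rl_loss, imitation_loss, lam=0.05):
    """Standard RL loss plus a weighted imitation term."""
    return rl_loss + lam * imitation_loss

sil = self_imitation_loss(
    log_probs=[-1.0, -2.0],
    returns=[3.0, 1.0],
    values=[1.0, 2.0],
)
combined = total_loss(rl_loss=0.4, imitation_loss=sil)
print(sil, combined)
```

Here the second transition has a negative advantage and is filtered out, so only the first transition shapes the imitation gradient.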

Key limitations cited in the literature include reliance on model-based planning or accurate dynamics (SILP), dependency on the quality and coverage of either the learned embeddings or suboptimal demonstrations (GSR), and computational overhead associated with graph construction, nearest-neighbor search, or maintenance of large high-return buffers (SIL-GPO). Failed or suboptimal self-demos can degrade performance if not properly filtered.

A plausible implication is that future research may benefit from integrating model-free and model-based planning, hierarchical or dynamically pruned graph structures, learned representations for edge weighting, or automated adaptation of imitation and retrieval schemes to scale beyond tens of thousands of nodes and to domains with unstructured or high-dimensional observation spaces.

Graph-based self-imitation addresses weaknesses of both conventional imitation learning, which is bottlenecked by the need for laborious human demonstrations and limited coverage, and pure RL, which struggles with sparse rewards and inefficient exploration. Unlike methods reliant solely on replay buffers or classic behavior cloning, these approaches inject optimality and diversity by recomposing trajectories over the graph, enabling the discovery of multiple valid strategies and robust adaptation.

Distinctive aspects of major algorithms include:

SILP: On-the-fly per-episode graph planning and Q-filtered self-imitation (Luo et al., 2021).
PIG: Graph-based multi-subgoal policy distillation and stochastic subgoal-skipping for long-horizon tasks (Kim et al., 2023).
GSR: Purely offline graph search with value-propagated, retrieval-weighted behavior cloning in high-dimensional state spaces (Yin et al., 2024).
SIL-GPO: Graph neural network-encoded state abstractions and prioritized high-return trace replay for combinatorial resource orchestration (Yang et al., 3 Mar 2026).

Graph-based self-imitation has thus emerged as a general paradigm for exploiting relational structure in sequential decision tasks, offering principled avenues for sample-efficient, stable, and scalable agent training.


References (4)

Luo et al. (2021). Self-Imitation Learning by Planning.

Kim et al. (2023). Imitating Graph-Based Planning with Goal-Conditioned Policies.

Yin et al. (2024). Offline Imitation Learning Through Graph Search and Retrieval.

Yang et al. (2026). Hybrid Orchestration of Edge AI and Microservices via Graph-based Self-Imitation Learning.
