Iterative Distillation of Experience

Updated 4 July 2026

Iterative distillation of experience is a process that transforms raw interaction traces into structured, reusable artifacts such as chain-of-thought rationales and optimized trajectories.
It integrates approaches from self-play, knowledge distillation, and reinforcement learning to iteratively refine policy performance and enhance agent behavior.
The method involves systematic artifact extraction, consolidation, and reuse while addressing challenges like computational overhead and memory management.

Iterative distillation of experience denotes a family of procedures in which interaction traces, teacher rationales, self-play episodes, intermediate checkpoints, or memory artifacts are repeatedly transformed into reusable supervision and then fed back into a later policy, student, or agent. In this literature, the distilled object is often richer than labels or final answers: it can include chain-of-thought rationales, optimized trajectories plus abstractions, textual skills, prompt–critique–reward tuples, shortcut memories, or hierarchical memory items. The common motif is a closed loop in which behavior produces experience, experience is compressed or refined, and the compressed artifact changes subsequent behavior (Jain et al., 3 Apr 2025, Sarch et al., 2024, Ye et al., 17 Mar 2026, Fan et al., 11 May 2026).

1. Historical lineage and conceptual scope

One important lineage comes from self-play and policy improvement. Expert Iteration (ExIt) already instantiated an iterative search-to-policy loop in which a learned apprentice policy $\pi_{\boldsymbol{\theta}}$ guided MCTS, MCTS induced a target policy from visit counts, and the resulting self-play experience was replayed to improve the apprentice; the paper then studied weighting by episode duration, prioritized replay, and exploratory data collection as ways of changing which self-play experience is most strongly distilled into the policy (Soemers et al., 2020).
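The visit-count step in this loop can be illustrated with a minimal Python sketch. It assumes the common convention of normalizing root visit counts into a target distribution and scoring the apprentice by cross-entropy against it; the function names and the `temperature` parameter are illustrative assumptions, not details taken from the cited paper.

```python
import math

def visit_counts_to_target(counts, temperature=1.0):
    """Turn MCTS root visit counts into a target policy (assumed convention)."""
    scaled = [c ** (1.0 / temperature) for c in counts]
    total = sum(scaled)
    if total == 0:
        raise ValueError("at least one action must have been visited")
    return [s / total for s in scaled]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of the apprentice policy against the search-induced target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

counts = [30, 10, 0, 60]                  # hypothetical visit counts for four actions
target = visit_counts_to_target(counts)   # search-induced target policy
apprentice = [0.25, 0.25, 0.25, 0.25]     # untrained apprentice: uniform over actions
loss = cross_entropy(target, apprentice)
```

Minimizing this loss over replayed self-play states is what pulls the apprentice toward the search policy; reweighting or prioritizing which states are replayed changes which experience dominates that average.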

A second lineage comes from knowledge distillation in supervised learning. “Experience Ensemble Knowledge Distillation” defines the teacher’s “experience” as the sequence of intermediate teacher models saved during optimization, and distills an adaptive ensemble of those checkpoints rather than only the final converged teacher (Wang et al., 2022). “Iterative Self Knowledge Distillation” then makes the recurrence explicit: the previous student becomes the next teacher, the next student is reinitialized from ImageNet rather than warm-started, and the process repeats until there is no obvious accuracy gain (Peng, 2022). In these works, experience is not an external trajectory but the optimization path of the teacher itself.

Recent LLM and agent work generalizes the same idea from checkpoint histories to reasoning traces, prompt revisions, deployment trajectories, and memory operations. UNDO reframes rationale distillation as iterative teacher–student optimization over student errors (Jain et al., 3 Apr 2025). ICAL converts noisy multimodal demonstrations into optimized, annotated “programs of thought” (Sarch et al., 2024). OEL extracts transferable experiential knowledge from deployment traces and consolidates it into weights (Ye et al., 17 Mar 2026). Evolving-RL trains both experience extraction and experience utilization as a coupled RL process (Fan et al., 11 May 2026). This suggests that the phrase now covers a broader class of systems in which the central design variable is not merely what model to train, but which compressed form of prior experience should supervise the next round.

2. Canonical loop and distilled artifacts

Across domains, the loop has three recurrent components. First, the current system produces evidence about its competence: self-play states, failed tool trajectories, validation-set rationales, deployment logs, or prompt outcomes. Second, that evidence is converted into a more reusable artifact than the raw trace itself. Third, the artifact is reused through supervised fine-tuning, policy optimization, retrieval-augmented execution, or memory-conditioned prompting.
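The three components can be sketched as a small Python skeleton. This is a schematic under stated assumptions rather than any surveyed system's implementation: `DistillationLoop`, `toy_act`, and `toy_distill` are hypothetical names, and artifacts are plain strings for simplicity.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DistillationLoop:
    act: Callable[[List[str]], List[str]]      # produces raw traces given current artifacts
    distill: Callable[[List[str]], List[str]]  # compresses traces into reusable artifacts
    artifacts: List[str] = field(default_factory=list)

    def run(self, rounds: int) -> List[str]:
        for _ in range(rounds):
            traces = self.act(self.artifacts)   # 1. behavior produces experience
            new_items = self.distill(traces)    # 2. experience is compressed
            for item in new_items:              # 3. artifacts condition the next round
                if item not in self.artifacts:
                    self.artifacts.append(item)
        return self.artifacts

def toy_act(artifacts):
    # Hypothetical agent: each stored lesson removes one failure mode.
    failures = ["forgot key", "hit wall", "wrong door"]
    remaining = [f for f in failures if f"avoid: {f}" not in artifacts]
    return [f"trace: {f}" for f in remaining[:1]]

def toy_distill(traces):
    # Hypothetical distiller: rewrite each failure trace as a reusable lesson.
    return [t.replace("trace: ", "avoid: ") for t in traces]

loop = DistillationLoop(act=toy_act, distill=toy_distill)
lessons = loop.run(rounds=4)
```

In this toy run the fourth round produces no new trace, which is a crude stand-in for the saturation that several of the surveyed methods use as a stopping signal.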

The variety lies in what counts as the distilled artifact. UNDO treats the artifact as refined chain-of-thought rationales targeted to student weaknesses rather than generic teacher behavior (Jain et al., 3 Apr 2025). ICAL stores an optimized trajectory plus abstractions such as summary, abstracted state, step-by-step reasoning, predicted state change, and abstraction comments (Sarch et al., 2024). OEL extracts “experiential knowledge” from trajectories and later uses it as privileged context for distillation (Ye et al., 17 Mar 2026). OPD-Evolver treats selection, writing, and maintenance of memory as trainable behaviors in their own right (Zhang et al., 16 Jun 2026). Prompt-policy optimization stores prompt, critique, and reward tuples in a contrastive experience buffer so that iterative prompt refinement can be amortized into policy weights (Sayana et al., 14 May 2026).

Setting	Source experience	Distilled artifact
Reasoning distillation	Student errors and prior rationales	Refined teacher rationales
Embodied or multimodal agents	Noisy demonstrations	Optimized trajectories plus abstractions
Online deployment	Interaction trajectories	Experiential knowledge
Self-evolving agents	Trajectories and critiques	Memory items or skills
Prompt optimization	Prompt trials with critiques	Buffer entries and policy weights

A recurrent claim in this literature is that raw experience is rarely the optimal transfer object. ICAL argues that raw demonstrations are poor memories because they can be noisy, incomplete, inefficient, or overly scene-specific (Sarch et al., 2024). OEL shows that extracted experiential knowledge is much more effective than raw trajectories, both in context and after consolidation (Ye et al., 17 Mar 2026). UNDO makes the analogous point for rationales: a rationale that is valid for solving the problem is not necessarily the rationale most useful for this particular student to learn from (Jain et al., 3 Apr 2025). The general pattern is therefore not mere replay, but repeated experience-to-artifact transformation.

3. Optimization formulations

UNDO gives one of the clearest optimization statements. In the one-shot baseline, the student $p_{sm}^{\theta}$ is fine-tuned by maximum likelihood on teacher rationales,

$\mathcal{L}_{L}(\theta)= - \mathbb{E}_{(q_i,\, r_i)\sim\mathcal{D}_{LLM}} \Bigg[\sum_{t=1}^{M_i}\log p_{sm}^{\theta}(r_{i,t}\mid r_{i,<t},q_i,I)\Bigg].$

Its main claim is that this imitation loss is reasonable, but the imitated data distribution is wrong because it is not conditioned on the student’s weaknesses. The iterative version changes the teacher distribution itself by conditioning on gap information $\Delta_i^{(k)}$, and the student repeatedly minimizes the KL divergence between that refined teacher distribution and its own rationale distribution (Jain et al., 3 Apr 2025).
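As a concrete reading of the one-shot loss, the following Python sketch computes the negative log-likelihood from per-token probabilities. It assumes the student's probabilities $p_{sm}^{\theta}(r_{i,t}\mid r_{i,<t},q_i,I)$ are already available as plain floats; the function names and the toy numbers are illustrative only.

```python
import math

def rationale_nll(token_probs):
    """Negative log-likelihood of one rationale from its per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)

def dataset_loss(batch):
    """Monte Carlo estimate of the expectation: mean NLL over sampled rationales."""
    return sum(rationale_nll(probs) for probs in batch) / len(batch)

# Two hypothetical rationales with 3 and 2 tokens respectively.
batch = [[0.9, 0.8, 0.95], [0.5, 0.6]]
loss = dataset_loss(batch)
```

The iterative variant leaves this computation unchanged; what changes between rounds is which rationales end up in `batch`.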

OEL formalizes the same idea in a deployment setting with a two-stage loop. Knowledge extraction is recursive,

$e_i' \sim \pi_{\mathrm{extract}}(\cdot \mid \tau_i, e_{i-1}), \qquad e_i=[e_{i-1};e_i'],$

and consolidation is on-policy context distillation via reverse KL on student-generated outputs,

$\mathcal{L}(\theta)= \mathbb{E}_{x \sim \mathcal{D},\, e \sim \mathcal{C},\, y \sim \pi_\theta(\cdot \mid x)} \left[ \frac{1}{|y|}\sum_{t=1}^{|y|} D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x,y_{<t}) \Big\| \pi_{\mathrm{teacher}}(\cdot \mid e,x,y_{<t}) \right) \right].$

Here the teacher is the pre-update model with privileged access to experiential knowledge $e$, while the student must internalize that effect into its parameters (Ye et al., 17 Mar 2026).
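Both stages can be made concrete with a short Python sketch. It assumes next-token distributions are given as small probability lists over a shared vocabulary, and it uses a hypothetical `toy_extract` function in place of $\pi_{\mathrm{extract}}$; none of the names come from the paper.

```python
import math

def reverse_kl(student, teacher, eps=1e-12):
    """D_KL(student || teacher) for two distributions over the same vocabulary."""
    return sum(s * math.log((s + eps) / (t + eps))
               for s, t in zip(student, teacher) if s > 0)

def sequence_loss(student_steps, teacher_steps):
    """Length-normalized sum of per-token reverse KL terms for one sampled output y."""
    assert len(student_steps) == len(teacher_steps)
    total = sum(reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps))
    return total / len(student_steps)

def accumulate_knowledge(trajectories, extract):
    """Recursive extraction: e_i = [e_{i-1}; e_i'], with e_i' produced from (tau_i, e_{i-1})."""
    knowledge = []
    for tau in trajectories:
        knowledge = knowledge + [extract(tau, knowledge)]
    return knowledge

def toy_extract(tau, prior):
    # Hypothetical extractor: label each new item with its position and source trajectory.
    return f"lesson {len(prior) + 1} from {tau}"

knowledge = accumulate_knowledge(["tau_1", "tau_2"], toy_extract)
student_steps = [[0.7, 0.3], [0.5, 0.5]]   # student without access to e
teacher_steps = [[0.6, 0.4], [0.5, 0.5]]   # teacher conditioned on e
loss = sequence_loss(student_steps, teacher_steps)
```

Because the KL is taken with the student distribution first, the loss is evaluated on outputs the student itself generates, which is what makes the consolidation on-policy.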

Evolving-RL casts experience distillation directly as RL over extraction and utilization. From a source trajectory, the extractor samples $N$ candidate skills $e_i$; each skill is evaluated on $K$ retrieved related tasks, and the extractor reward for $e_i$ is computed from the outcomes of those evaluations. Group-normalized advantages then drive a clipped GRPO-style extractor loss. A separate solver loss compares skill-conditioned trajectories on each retrieved task, and the two losses are combined into a single training objective.

The intended effect is coordinated co-evolution: better extraction changes the skill distribution the solver sees, and better solver behavior sharpens the transfer signal available to extraction (Fan et al., 11 May 2026).
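The extractor side of this update can be sketched in Python. Two assumptions are made loudly here: each skill's reward is taken to be its mean success over the $K$ retrieved tasks, and the clipped term follows the standard PPO-style surrogate with a clip range of 0.2. Neither is confirmed as the paper's exact formulation, and the solver loss is not covered.

```python
import statistics

def group_normalized_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of N candidate skills."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO/GRPO-style clipped objective term for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Assumption: reward of a skill = mean success over its K retrieved tasks.
outcomes = [[1, 1, 0], [0, 0, 0], [1, 0, 0], [1, 1, 1]]   # N = 4 skills, K = 3 tasks
rewards = [sum(o) / len(o) for o in outcomes]
advantages = group_normalized_advantages(rewards)
```

Normalizing within the group means a skill is rewarded only for transferring better than its sibling candidates from the same source trajectory, not for the absolute difficulty of the retrieved tasks.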

A related but distinct formulation appears in alignment. Faster WIND shows that iterative best-of-$N$ distillation can be recast as a regularized self-play win-rate game, replacing repeated sample-level best-of-$N$ selection with parameter-space optimization of a dominant-policy objective. The paper’s main algorithmic claim is that two sampled responses per prompt can suffice for the practical approximation, giving a sample-efficient route to the limiting policy of iterative BOND (Yang et al., 2024). In black-box prompt optimization, the same theme appears as a KL-regularized policy objective over prompts, with historical prompts, critiques, and rewards supplied through a context-specific buffer (Sayana et al., 14 May 2026).
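A buffer of this kind can be sketched as a small Python data structure. The sketch assumes that "contrastive" means pairing the highest- and lowest-reward entries as context for the next prompt proposal; that reading, the class and method names, the capacity rule, and the example prompts are all illustrative assumptions rather than the paper's design.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Entry:
    prompt: str
    critique: str
    reward: float

class ExperienceBuffer:
    def __init__(self, capacity: int = 50):
        self.capacity = capacity
        self.entries: List[Entry] = []

    def add(self, prompt: str, critique: str, reward: float) -> None:
        self.entries.append(Entry(prompt, critique, reward))
        self.entries = self.entries[-self.capacity:]  # keep only the newest entries

    def contrastive_context(self, k: int = 1) -> str:
        """Render the k best and k worst entries as text for the prompt policy."""
        if k < 1:
            raise ValueError("k must be at least 1")
        ranked = sorted(self.entries, key=lambda e: e.reward, reverse=True)
        best = ranked[:k]
        worst = [e for e in ranked[-k:] if e not in best]
        lines = [f"GOOD reward={e.reward:.2f}: {e.prompt} | critique: {e.critique}" for e in best]
        lines += [f"BAD reward={e.reward:.2f}: {e.prompt} | critique: {e.critique}" for e in worst]
        return "\n".join(lines)

buffer = ExperienceBuffer(capacity=3)
buffer.add("Answer directly.", "skips bracket matching", 0.42)
buffer.add("Track open brackets on a stack.", "handles nesting", 0.91)
buffer.add("Guess the closing bracket.", "fails on depth > 2", 0.30)
context = buffer.contrastive_context(k=1)
```

The returned string would be prepended to the prompt policy's input, so that earlier trials inform the next proposal without being re-run.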

4. Representation, retrieval, and memory maintenance

The representational choice is often the decisive factor. ICAL uses highly structured, domain-specific artifacts. In TEACh, experiences can be executable Python-like programs with comments, subgoals, object-state updates such as change_state(), and abstraction comments; in web and video settings, experiences contain plan, summary, abstracted state, predicted next state, action, and causal commentary. Retrieval uses a weighted combination of instruction, textual-state, and visual similarity, with cosine similarity computed from text-embedding-ada-002 for text and CLIP ViT-B/32 for images (Sarch et al., 2024).
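The retrieval rule can be illustrated with a minimal Python sketch. It assumes embeddings are already computed and stored as plain vectors; the weights `(0.4, 0.3, 0.3)`, the dictionary keys, and the toy two-dimensional vectors are hypothetical stand-ins for the real text and image embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, memory, weights=(0.4, 0.3, 0.3), top_k=1):
    """Rank stored experiences by a weighted sum of three cosine similarities."""
    keys = ("instruction", "state", "visual")

    def score(item):
        return sum(w * cosine(query[k], item[k]) for w, k in zip(weights, keys))

    return sorted(memory, key=score, reverse=True)[:top_k]

memory = [
    {"name": "slice tomato", "instruction": [1.0, 0.0], "state": [0.9, 0.1], "visual": [1.0, 0.2]},
    {"name": "boil water", "instruction": [0.0, 1.0], "state": [0.2, 0.8], "visual": [0.1, 1.0]},
]
query = {"instruction": [0.9, 0.1], "state": [1.0, 0.0], "visual": [0.8, 0.3]}
best = retrieve(query, memory)[0]["name"]
```

How the weight is split across the three channels decides whether retrieval favors experiences with a similar task description or a similar scene, which is one reason the representational choice matters so much.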

Mem$^2$Evolve separates Asset Memory from Experience Memory. The latter is defined formally as a set of distilled experience items and is updated after each task by set union with the items distilled from that task.

The paper’s central systems claim is that distilled experience should guide later asset creation, while new assets create broader experience to distill in later rounds (Cheng et al., 13 Apr 2026).

OPD-Evolver pushes this further by treating memory lifecycle operations as trainable behaviors. Its repository is four-level, with retrieval, selection, writing, and maintenance all controlled by the same policy. A central construct is outcome-calibrated memory attribution, which scores whether a memory actually improved outcomes when selected rather than merely retrieved. Slow-loop distillation then uses privileged hindsight to teach the deployable policy how to select useful memories, write reusable ones, and maintain repository quality (Zhang et al., 16 Jun 2026).

The software-agent literature provides an external-memory analogue. IER represents experience as shortcut triples extracted from non-adjacent states in a solution chain and studies two inheritance schemes: a successive pattern that propagates only the nearest predecessor’s experience pool and a cumulative pattern that propagates the union of all prior pools. It then adds heuristic elimination based on information gain and retrieval frequency, showing that memory-space management is not ancillary but part of the refinement problem itself (Qian et al., 2024).
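The shortcut extraction, the two inheritance patterns, and the pruning step can be sketched in Python. Several details are assumptions made for illustration: the triple layout `(source, target, hint)`, the thresholds `min_gain` and `min_retrievals`, and the rule that an entry must pass both criteria to survive are not taken from the paper.

```python
from itertools import combinations

def extract_shortcuts(chain):
    """Shortcut triples (source, target, hint) between non-adjacent states of a chain."""
    return {
        (chain[i], chain[j], f"go from {chain[i]} to {chain[j]}")
        for i, j in combinations(range(len(chain)), 2)
        if j - i >= 2
    }

def inherited_pool(batch_pools, pattern):
    """Experience handed to the next batch under each inheritance pattern."""
    if not batch_pools:
        return set()
    if pattern == "successive":
        return set(batch_pools[-1])       # nearest predecessor only
    if pattern == "cumulative":
        return set().union(*batch_pools)  # union of all prior pools
    raise ValueError(f"unknown pattern: {pattern}")

def eliminate(pool, info_gain, retrievals, min_gain=0.5, min_retrievals=1):
    """Heuristic pruning: keep entries that are both informative and actually retrieved."""
    return {
        s for s in pool
        if info_gain.get(s, 0.0) >= min_gain and retrievals.get(s, 0) >= min_retrievals
    }

pool_1 = extract_shortcuts(["s0", "s1", "s2", "s3"])
pool_2 = extract_shortcuts(["t0", "t1", "t2"])
successive = inherited_pool([pool_1, pool_2], "successive")
cumulative = inherited_pool([pool_1, pool_2], "cumulative")
```

The cumulative pool grows with every batch, which is why some form of elimination becomes necessary once irrelevant shortcuts start to crowd out useful ones at retrieval time.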

5. Empirical behavior, transfer, and efficiency

UNDO reports that iterative rationale refinement consistently beats one-shot distillation on mathematical reasoning. For both Qwen-2.5-1.5B and Llama-1B, average accuracy across GSM8K, MATH, MMLU-Pro, and SVAMP rises from iteration 1 to iteration 3. The same paper also shows that simply training longer on the original rationale data is not equivalent: extending standard distillation to 10 epochs degrades Qwen’s average, whereas UNDO at the same effective training budget reaches a higher average. Final refined teacher data are partly reusable across students, and the method also improves StrategyQA and TheoremQA despite the in-domain math focus (Jain et al., 3 Apr 2025).

ICAL provides analogous evidence in embodied and multimodal settings. On TEACh it reaches 35.1 SR / 49.3 GC, compared with 17.2 / 26.6 for raw visual demos and 26.5 / 29.5 for raw kinesthetic demos. On VisualWebArena it reaches 22.7% average success, versus 14.3% for GPT4V+SoM. On Ego4D it improves over few-shot GPT-4V on noun and action edit distance while remaining competitive with supervised models. Ablations removing the abstraction phase or the human-in-the-loop phase reduce TEACh performance to 29.4 / 44.9 and 29.9 / 41.0, respectively, indicating that both iterative cleanup and iterative correction matter (Sarch et al., 2024).

In black-box prompt optimization, the experience-buffer variant improves both final performance and convergence speed on reasoning tasks. On Dyck Languages, the baseline is 63.33%, GEPA reaches 79.16%, and the prompter policy with buffer reaches 91.25%. On Web of Lies, the same progression is 52.50% $\to$ 67.50% $\to$ 90.12%. The buffer yields markedly faster convergence: on Dyck, the no-buffer policy converges at step 195, whereas the buffered variant converges at 102; on DQA, the corresponding numbers are 180 and 75 (Sayana et al., 14 May 2026).

OEL gives the clearest evidence that extracted experiential knowledge is more effective than raw traces. On Sokoban with Qwen3-4B-Instruct-2507, raw trajectory context reaches 10.9% pass rate and drops to 7.8% after consolidation, whereas extracted knowledge reaches 18.2% in context and 21.4% after consolidation. On Frozen Lake for Qwen3-1.7B, self-derived knowledge yields 23.8% in-context pass rate and 31.1% after consolidation, outperforming knowledge extracted by the larger Qwen3-4B. The paper also reports that average response length drops to roughly 70% of the initial level by the third iteration, indicating improved token efficiency as experience is internalized (Ye et al., 17 Mar 2026).

Evolving-RL shows that the gains from co-evolution are not confined to explicit memory use. On ALFWorld unseen tasks, “Ours (w/ skills)” reaches 88.6% success, versus 44.6% for GRPO with skills; “Ours (w/o skills)” still reaches 81.1%, versus 33.7% for GRPO. On Mind2Web, overall action accuracy rises from 22.83% for GRPO to 30.87% with skills and 28.05% without skills. Ablations show that extractor-only training improves seen performance but not unseen transfer, while solver-only training internalizes some experience yet largely ignores explicit skill injection; the full co-evolution is needed to unlock both internalization and test-time reuse (Fan et al., 11 May 2026).

Alignment and compression studies show the same pattern in different form. WIND improves consistently over three iterations on GSM8K and MT-Bench while SPPO and a J-BOND variant show regressions or non-monotonic behavior, and it does so with two responses per prompt rather than repeated best-of-$N$ sampling (Yang et al., 2024). Iterative layer-wise distillation for Qwen2.5-3B reaches 36 $\to$ 28 layers with only 9.7% quality loss and 36 $\to$ 24 layers with about 18% loss, whereas static or non-distilled pruning baselines collapse (Kovalev et al., 7 Nov 2025).

6. Limitations, controversies, and open problems

The literature also makes clear that iterative distillation is not synonymous with guaranteed improvement. UNDO is computationally expensive on the teacher side—about 2,800 GH200 GPU hours per iteration for teacher generation—and its characterization of student weakness comes from only 20 validation examples because the full history must fit into the teacher prompt. The paper also uses a coarse binary correctness signal and provides no theoretical convergence guarantee beyond empirical saturation around iteration 3 (Jain et al., 3 Apr 2025).

ICAL depends on similar environments or tasks so that retrieval can find useful analogies, uses a fixed action API, and may fail on extremely misleading demonstrations or feedback. In passive settings such as Ego4D, the absence of a human-in-the-loop verification phase weakens the safeguard against bad abstraction (Sarch et al., 2024). Mem$^2$Evolve makes the co-evolutionary thesis explicit, but its experience-memory update is simply set union, with no specified deduplication, conflict resolution, forgetting policy, or exact retrieval formula for experience items; the framework also depends on a sandbox environment for executing autonomously generated code (Cheng et al., 13 Apr 2026). OPD-Evolver’s attribution mechanism is outcome-calibrated rather than causal, its maintenance component is less concretely specified than selection and execution, and the slow-loop hindsight teacher is privileged in ways unavailable at deployment (Zhang et al., 16 Jun 2026).

A recurring controversy is that stronger teachers or larger memory pools do not necessarily yield better students or better evolution. EEKD reports the “surprising conclusion” that strong ensemble teachers do not necessarily produce strong students (Wang et al., 2022). ISKD shows, conversely, that a student can still benefit from a moderately trained teacher, though all of its evidence is confined to supervised image classification and some runs plateau or regress after several rounds (Peng, 2022). IER makes the same trade-off concrete for external memory: the successive pattern may yield superior results but is vulnerable to bad batch-to-batch refinement, whereas the cumulative pattern is more stable but suffers from memory dilution unless aggressively pruned (Qian et al., 2024).

These results suggest a broader open problem. Iterative distillation works best when the distilled artifact is simultaneously transferable, student- or policy-aligned, cheap enough to regenerate, and structured enough to support retrieval or optimization. The harder question is how to satisfy all four requirements at once, especially when evaluation is noisy, the environment is inaccessible, or the memory/teacher distribution changes after each update. The literature surveyed here treats that question not as a single algorithmic trick, but as a systems problem spanning representation, retrieval, credit assignment, optimization, and stopping criteria.