Goal-Relabelling Techniques
- Goal-relabelling techniques are methods that retroactively adjust goals or labels to transform suboptimal data into valuable learning experiences.
- They employ strategies like full parameter relabelling, allocation-based methods, and hindsight experience replay to enhance sample efficiency in both mixture models and RL.
- These techniques connect to inverse reinforcement learning, imitation learning, and divergence minimization, broadening their impact on robust and efficient learning.
Goal-relabelling techniques are a central class of algorithms in statistical learning and reinforcement learning that retroactively reassign new goals, labels, or task specifications to existing data or agent trajectories, typically to maximize the value of otherwise unsuccessful or suboptimal experiences. The strategy exploits the invariance, redundancy, or compositionality of the underlying task structure—be it mixture model components, supervised targets, or reward functions. Originally developed to resolve the label switching problem in Bayesian mixture models, goal-relabelling is now pervasive in sample-efficient reinforcement learning (notably sparse reward or multi-goal contexts), imitation learning, and robust evaluation of deep neural networks. The range of approaches is diverse, spanning full parameter or allocation space relabelling in mixture models (Zhu et al., 2014), adversarial data relabelling in vision (Robinson et al., 2015), language-guided experience relabelling (Chan et al., 2019), model-based rollout and foresight relabelling (Zhu et al., 2021, Yang et al., 2021), causality-driven selection (Chuck et al., 6 May 2025), diversity-based sampling (Dai et al., 2021), as well as recent foundations relating relabelling to inverse RL, imitation learning, and divergence minimization (Eysenbach et al., 2020, Zhang et al., 2022). This article surveys these methodologies and their technical underpinnings.
1. Foundations and Motivations
Early goal-relabelling approaches originated from the need to resolve label switching in mixture models, where posteriors are invariant under permutations of component indicators or labels (Zhu et al., 2014). In this context, the relabelling problem ensures coherence of parameter estimates across Markov chain Monte Carlo (MCMC) samples, as mixture likelihoods, e.g. $p(y_i \mid w, \theta) = \sum_{k=1}^{K} w_k\, f(y_i \mid \theta_k)$, are symmetric under label permutation.
In reinforcement learning, goal-relabelling became essential as a means of reusing agent experiences that do not accomplish the original goal, thereby dramatically improving performance in sparse reward or multi-goal environments (Chan et al., 2019, Yang et al., 2021). The trajectory or transition is relabelled with an alternative goal (frequently a goal achieved later in the trajectory or a fictitious goal) such that the experience provides meaningful learning signal. In both fields, relabelling addresses fundamental issues of identifiability (mixture models) or credit assignment under reward sparsity (RL).
Recent theoretical work has recast goal-relabelling frameworks as instances of divergence minimization, inverse reinforcement learning, or imitation learning, thus broadening their justification and opening new algorithmic designs (Eysenbach et al., 2020, Zhang et al., 2022).
2. Major Methodological Classes
2.1. Mixture Models: Full Parameter and Allocation Space Relabelling
In Bayesian mixture models, relabelling is required to correct for permutation-induced label switching. Two principal classes exist (Zhu et al., 2014):
- Full Parameter Space Relabelling: Operates directly on the full vector of component parameters $\theta = (\theta_1, \dots, \theta_K)$, $\theta_k \in \mathbb{R}^m$. Approaches such as the Celeux et al. and Marin et al. methods align each MCMC sample by minimizing a distance (e.g., Euclidean, scalar product) between current and reference parameter configurations. The newly proposed minimum variance algorithm iteratively seeks, for each draw $\theta^{(t)}$, the label permutation minimizing the running posterior variance plus the squared error to the running mean, schematically
$$\nu_t = \arg\min_{\nu \in \mathcal{S}_K} \sum_{j} \Big[ \widehat{\operatorname{Var}}_j^{(1:t-1)} + \big( \nu(\theta^{(t)})_j - \bar{\theta}_j^{(t-1)} \big)^2 \Big],$$
with efficient online tracking of the mean $\bar{\theta}$ and variance $\widehat{\operatorname{Var}}$ for large datasets (a schematic implementation is sketched after this list).
- Allocation Space Relabelling: Acts on the latent allocation vector $z = (z_1, \dots, z_n)$, using references (e.g., modal allocations) and assignment optimizations (e.g., Hungarian algorithm, equivalence class maximization). Allocation-based methods are often preferable when the component parameter space is high-dimensional.
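As a concrete illustration, the following is a minimal sketch of a full-parameter-space relabelling pass in the spirit of the minimum variance criterion above; the brute-force permutation search, the running-mean surrogate objective, and the array layout are simplifying assumptions rather than the reference procedure of Zhu et al. (2014).

```python
import itertools
import numpy as np

def relabel_min_distance(samples):
    """Align MCMC samples of mixture parameters by permuting component labels.

    samples: array of shape (T, K, m) -- T MCMC draws of K component
    parameter vectors. Each draw is permuted so that it is as close as
    possible (squared error) to the running mean of the draws aligned so
    far, a simple surrogate for the minimum-variance criterion.
    """
    samples = np.asarray(samples, dtype=float)
    T, K, _ = samples.shape
    aligned = np.empty_like(samples)
    aligned[0] = samples[0]
    running_mean = samples[0].copy()

    for t in range(1, T):
        best_perm, best_cost = None, np.inf
        # Brute-force search over label permutations (feasible for small K).
        for perm in itertools.permutations(range(K)):
            cost = np.sum((samples[t][list(perm)] - running_mean) ** 2)
            if cost < best_cost:
                best_perm, best_cost = list(perm), cost
        aligned[t] = samples[t][best_perm]
        # Incremental update of the running mean over the aligned draws.
        running_mean += (aligned[t] - running_mean) / (t + 1)
    return aligned
```

For realistic numbers of components the exhaustive permutation search would be replaced by an assignment solver (e.g., the Hungarian algorithm), mirroring the allocation-space methods above.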
2.2. Reinforcement Learning: Hindsight and Model-Based Relabelling
In goal-conditioned RL, several relabelling frameworks exist, often motivated by the challenge of learning under reward sparsity:
- Hindsight Experience Replay (HER): Retroactively assigns the achieved goal to failed episodes, so that experience originally gathered for one goal can contribute to learning for another (Chan et al., 2019). HER can operate with hand-crafted, state-derived, or (in ACTRCE) natural language goal representations; a minimal relabelling pass in this style is sketched after this list.
- Model-Based Foresight and Virtual Goal Relabelling: Techniques such as MapGo’s Foresight Goal Inference (FGI) (Zhu et al., 2021) and Model-based Hindsight Experience Replay (MHER) (Yang et al., 2021) use a learned dynamics model to simulate future states/goals or to generate virtual trajectories under the current policy. In FGI, simulated rollouts produce diverse and policy-relevant relabelled goals.
- Causal/Interaction-Guided Relabelling: The HInt framework (Chuck et al., 6 May 2025) filters trajectories for relabelling based on inferred causality: only those that involve causal interaction (detected using null counterfactual inference—NCII) are relabelled as successful, thereby increasing the informativeness of relabelled data in object-centric domains.
- Self-Supervised and Dense Reward Shaping: Some recent approaches combine relabelling with self-supervised representation learning and dense reward shaping, computing rewards as a function of the distance between states and goals in a learned latent space, e.g. $r(s_t, g) = -\lVert \phi(s_t) - \phi(g) \rVert$ for a learned encoder $\phi$ (Mezghani et al., 2023).
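The core mechanism shared by these methods is easiest to see in a HER-style relabelling pass. The sketch below assumes a particular transition layout, a "future" goal-selection strategy, and a sparse {-1, 0} reward test; all of these are illustrative choices rather than the exact recipe of any one paper.

```python
import random

def her_relabel(trajectory, k=4, eps=0.05):
    """Hindsight relabelling of one goal-conditioned trajectory.

    trajectory: list of dicts with keys 'obs', 'action', 'next_obs',
    'achieved_goal', 'goal'. For every transition we also emit up to k
    copies whose goal is an achieved goal from a *later* step ("future"
    strategy), so that failed episodes still yield successful transitions.
    """
    def sparse_reward(achieved, goal):
        # 0 on success, -1 otherwise (a common sparse-reward convention).
        dist = sum((a - g) ** 2 for a, g in zip(achieved, goal)) ** 0.5
        return 0.0 if dist < eps else -1.0

    relabelled = []
    for t, tr in enumerate(trajectory):
        # Keep the original transition with its original goal.
        relabelled.append({**tr, 'reward': sparse_reward(tr['achieved_goal'], tr['goal'])})
        # Add up to k relabelled copies using goals achieved later in the episode.
        for _ in range(min(k, len(trajectory) - t)):
            g_new = trajectory[random.randint(t, len(trajectory) - 1)]['achieved_goal']
            relabelled.append({**tr, 'goal': g_new,
                               'reward': sparse_reward(tr['achieved_goal'], g_new)})
    return relabelled
```

Model-based variants differ mainly in where the relabelled goals come from (simulated rollouts rather than the logged future of the episode), while causal filtering decides which transitions are eligible for relabelling in the first place.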
2.3. Advanced Sampling: Diversity and Selectivity
- Diversity-based Sampling: Methods such as DTGSH (Dai et al., 2021) prioritize diversity in both trajectory and transition sampling, using determinantal point processes (DPPs) to sample minibatches that maximally cover the achieved (relabellable) goal space; a greedy selection sketch follows this list.
- Demonstration-anchored and Task-Specific Relabelling: Approaches for long-horizon sequential manipulation anchor relabelling to task-specific goal distributions derived from demonstration databases, thereby improving exploration in complex tasks with “narrow passages” (Davchev et al., 2021).
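To make the diversity-based idea concrete, the following sketch greedily selects a subset of achieved goals that approximately maximizes the log-determinant of an RBF similarity kernel, a cheap stand-in for the DPP machinery used in DTGSH; the kernel choice, the greedy MAP procedure, and the function name are assumptions made for illustration.

```python
import numpy as np

def greedy_diverse_subset(goals, k, bandwidth=1.0):
    """Greedily pick k goals that approximately maximize the log-determinant
    of an RBF similarity kernel -- a simple surrogate for DPP MAP inference.

    goals: array of shape (N, d) of achieved goals in the replay buffer,
    with k <= N. Returns indices of a diverse subset suitable for relabelling.
    """
    goals = np.asarray(goals, dtype=float)
    sq_dists = np.sum((goals[:, None, :] - goals[None, :, :]) ** 2, axis=-1)
    L = np.exp(-sq_dists / (2.0 * bandwidth ** 2))  # similarity kernel

    selected = [0]  # arbitrary starting point (all self-similarities equal 1)
    for _ in range(k - 1):
        best_i, best_gain = None, -np.inf
        for i in range(len(goals)):
            if i in selected:
                continue
            sub = L[np.ix_(selected + [i], selected + [i])]
            # Log-determinant as the diversity score of the candidate subset.
            sign, logdet = np.linalg.slogdet(sub)
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
    return selected
```

In practice the minibatch would be drawn stochastically from the DPP rather than by greedy MAP selection, but the determinant-based diversity score is the same ingredient.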
3. Mathematical Formulations
Goal-relabelling approaches are characterized by their manner of assignment and optimization:
- Parameter Alignments: Use minimization over permutations of distances, losses, or variances; for example, the minimum variance relabelling step selects, for each MCMC draw, the label permutation minimizing the running posterior variance plus the squared error to the running mean (Section 2.1).
- Relabelling in RL: It is common to relabel a transition $(s_t, a_t, s_{t+1}, g)$ as $(s_t, a_t, s_{t+1}, g')$, where the alternative goal $g'$ is achieved later in the trajectory or predicted via a model and the reward is recomputed accordingly. This is used as the basis for off-policy Bellman backups or for supervised learning losses (see the sketch after this list).
- Interaction-based Filtering: Null counterfactual interaction is determined by simulating a masked state in which a candidate factor is replaced by its "null" state, and testing whether the conditional distribution over the target variable changes, i.e. whether $p(y \mid x) \neq p(y \mid x_{\setminus i}, x_i = \varnothing)$ for candidate factor $i$.
- Regularization via Action Priors: Recent approaches generate an action prior over state–goal pairs by aggregating behavior over all achieved goals from a trajectory (Lei et al., 8 Aug 2025).
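A hedged sketch of how a relabelled transition enters an off-policy Bellman backup is given below; the goal-conditioned Q-network interface, the success threshold, and the sparse {-1, 0} reward convention are assumptions made for illustration.

```python
import torch

def relabelled_td_target(q_target, next_obs, new_goal, achieved_next,
                         gamma=0.99, eps=0.05):
    """One-step Bellman target for a transition whose goal was relabelled.

    q_target: a goal-conditioned target Q-network taking (obs, goal) and
    returning Q-values over discrete actions. The reward is recomputed
    under the relabelled goal with the sparse {-1, 0} convention discussed
    in Section 5.
    """
    # Recompute the reward under the relabelled goal.
    success = (torch.norm(achieved_next - new_goal, dim=-1) < eps).float()
    reward = success - 1.0                      # 0 on success, -1 otherwise

    with torch.no_grad():
        next_q = q_target(next_obs, new_goal).max(dim=-1).values
        # Stop bootstrapping once the (relabelled) goal is reached.
        target = reward + gamma * (1.0 - success) * next_q
    return target
```

The same recomputed-reward transition can instead feed a supervised (behavioral cloning) loss, which is the route taken by the imitation-learning interpretations discussed in Section 5.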
4. Empirical Analyses and Comparative Benchmarks
Empirical results consistently demonstrate that goal-relabelling improves learning efficiency and stability in both mixture modeling and reinforcement learning contexts.
- In mixture models, minimum variance and Marin et al. relabelling methods offer the best trade-off between accuracy and computational cost for large datasets (Zhu et al., 2014), especially as measured by KL divergence and variance estimation.
- In RL, policy sample efficiency and asymptotic performance are significantly increased by advanced relabelling (Chan et al., 2019, Yang et al., 2021, Zhu et al., 2021), particularly when model-based or diversity-guided selection is used. Causal or interaction-based selection (HInt+NCII (Chuck et al., 6 May 2025)) further enhances efficiency in object-centric tasks, empirically yielding up to 4× improvements over standard hindsight (Chuck et al., 6 May 2025).
- Theoretical work shows that gradient-based relabelling and knowledge distillation approaches may reduce the number of required samples relative to classical relabelling by a factor that scales with $d$, the dimensionality of the state or goal (Levine et al., 2022).
Empirical ablations show that integrating both self-imitation and action prior regularization yields improvements over using either alone (Lei et al., 8 Aug 2025). Careful attention is required when mixing behavioral cloning with value learning: naive mixtures can be detrimental unless selective imitation based on learned value signals is used (Zhang et al., 2022).
5. Theoretical Perspectives and Connections
Recent research positions goal-relabelling as a special case of:
- Inverse Reinforcement Learning (IRL): Retrospectively assigning the “task” (or reward function) that makes a trajectory optimal is a form of posterior inference over tasks. For a joint distribution over (trajectory, task) pairs $(\tau, \psi)$, the relabelling distribution is the task posterior under a soft-optimality model,
$$q(\psi \mid \tau) \;\propto\; p(\psi)\, \exp\!\Big( \sum_{t} r_\psi(s_t, a_t) - \log Z(\psi) \Big),$$
where $Z(\psi)$ is the per-task partition function, encompassing goal-reaching, tasks with discrete/linear reward functions, and beyond (Eysenbach et al., 2020). A small relabelling routine based on this posterior is sketched after this list.
- Imitation Learning and Divergence Minimization: Relabelling can be understood as constructing an expert distribution (from hindsight-achieved goals) and minimizing an $f$-divergence to align the policy distribution with it, as in behavioral cloning on relabelled data (Zhang et al., 2022). This connection explains the superior performance of Q-learning over vanilla behavior cloning in goal-reaching under relabelling and clarifies the impact of reward structures ({-1, 0} is superior to {0, 1} in this regime).
- Knowledge Distillation: Goal-conditioned Q-function updates augmented by gradient-matching penalties propagate richer information and accelerate learning in high-dimensional goal spaces (Levine et al., 2022).
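The inverse-RL view suggests a simple relabelling procedure over a finite candidate task set, sketched below; the uniform task prior, the softmax temperature, and the omission of the per-task partition-function correction are simplifying assumptions relative to the full treatment in Eysenbach et al. (2020).

```python
import numpy as np

def relabel_by_soft_posterior(trajectory, candidate_goals, reward_fn,
                              temperature=1.0):
    """Sample a relabelled goal from a soft posterior over candidate tasks.

    trajectory: list of (state, action) pairs.
    candidate_goals: list of goal vectors defining the task family.
    reward_fn(state, action, goal): task-conditioned reward.
    Goals under which the trajectory collects high return are treated as
    tasks the trajectory was (soft-)optimal for, so goals are sampled in
    proportion to the exponentiated return.
    """
    returns = np.array([
        sum(reward_fn(s, a, g) for s, a in trajectory)
        for g in candidate_goals
    ])
    # Softmax over returns = approximate posterior over tasks.
    logits = returns / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx = np.random.choice(len(candidate_goals), p=probs)
    return candidate_goals[idx], probs
```

Standard hindsight relabelling is recovered as the zero-temperature limit restricted to goals actually achieved along the trajectory.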
6. Practical Recommendations and Limitations
The choice of relabelling technique depends critically on the modeling context, data dimensionality, and computational resources:
- For large-scale mixture models: Prefer full parameter relabelling (minimum variance or Marin et al.) for accuracy when the dimensionality of the component parameter space is moderate; allocation-based methods become preferable as it increases (Zhu et al., 2014).
- For RL with sparse rewards: Advanced relabelling methods (model-based foresight, causal filtering, action prior regularization) yield higher sample efficiency (Zhu et al., 2021, Yang et al., 2021, Chuck et al., 6 May 2025, Lei et al., 8 Aug 2025).
- For compositional or long-horizon tasks: Use demonstration-anchored or diversity-driven mechanisms for relabelling to focus exploration and improve credit assignment (Davchev et al., 2021, Dai et al., 2021).
- For high-dimensional or nonstandard goal spaces: Techniques exploiting gradient-based distillation or dense reward shaping are increasingly essential (Levine et al., 2022, Mezghani et al., 2023).
- Limitation: Computational overhead in some advanced methods (e.g., model rollouts, null counterfactual inference) remains nontrivial and model accuracy becomes critical when virtual trajectories are used for relabelling. Choice of reward structure and selective application of imitation can have substantial qualitative effects (Zhang et al., 2022).
7. Future Directions
Recent innovations emphasize several promising directions:
- Causal reasoning and interaction-aware relabelling are expected to further close the gap between task specification and useful goal distribution alignment, especially in object-rich or multi-agent environments (Chuck et al., 6 May 2025).
- Integration of self-supervised representations for more informative, dense reward and progress signals in offline RL or hierarchical setups (Mezghani et al., 2023).
- Automated selection and curriculum learning: Diversity-based sampling and automatic curriculum mechanisms (based on coverage or progress measures) will likely be expanded, with DPPs and other principles already demonstrating improved sample efficiency (Dai et al., 2021).
- Unified theoretical underpinnings: Continued development of divergence-based, imitation, and inverse RL connections promises to regularize and inform algorithm choice across domains (Eysenbach et al., 2020, Zhang et al., 2022).
In summary, goal-relabelling techniques now span a rich spectrum from classical mixture modeling to advanced reinforcement learning. They combine permutation-invariant statistical principles, off-policy credit assignment, and causal reasoning to maximize sample efficiency, generalization, and robustness—especially where data or reward signals are sparse, high dimensional, or structure-sensitive.