Emergent Textual Hindsight Experience Replay
- The paper introduces ETHER, an RL framework that extends HER to language-conditioned tasks by generating structured linguistic descriptions without oracle predicates.
- The methodology combines emergent communication via referential games and semantic co-occurrence grounding to align artificial language with natural instruction semantics.
- Experimental results in modified BabyAI tasks show that ETHER achieves higher sample efficiency, improved success ratios, and superior semantic alignment compared to prior methods.
Emergent Textual Hindsight Experience Replay (ETHER) is a reinforcement learning (RL) framework designed for natural language-conditioned instruction following. ETHER extends Hindsight Experience Replay (HER) to settings where goals, policies, and feedback are expressed in language, and overcomes limitations of previous textual HER formulations by integrating emergent communication protocols and semantic alignment strategies. The ETHER approach leverages both successful and unsuccessful RL trajectories, using unsupervised referential games and semantic co-occurrence grounding losses to generate artificial, structured linguistic descriptions that can be directly aligned with natural instruction formats. This enables agents to learn from all experiences—including failures—without reliance on human-annotated predicate functions, thereby substantially improving data efficiency and generalization for language-directed RL tasks (Denamganaï et al., 2023).
1. Motivation and Theoretical Foundations
Traditional HER addresses sample inefficiency in sparse-reward goal-conditioned RL tasks by relabeling failed episodes: goals are retroactively substituted with goals actually achieved during the trajectory, and the associated rewards are recomputed so that failures become learning opportunities (Andrychowicz et al., 2017). In classical spatial domains, this is tractable because both goals and achievements are elements of a structured, explicitly parameterized space.
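For concreteness, here is a minimal sketch of the classical relabeling step, assuming generic goal-conditioned transitions and a user-supplied `reward_fn`; the names and the "future" sampling strategy are illustrative rather than taken from any specific implementation:

```python
import random
from dataclasses import dataclass

@dataclass
class Transition:
    state: object
    action: object
    goal: object           # goal the agent was originally pursuing
    achieved_goal: object  # goal actually realised at the next state
    next_state: object

def her_relabel(episode, reward_fn, k=4):
    """'Future'-strategy HER: substitute each transition's goal with goals
    achieved later in the same episode and recompute the reward, turning
    failed trajectories into useful supervision."""
    relabeled = []
    for t, tr in enumerate(episode):
        future = episode[t:]
        for fut in random.sample(future, min(k, len(future))):
            new_goal = fut.achieved_goal
            reward = reward_fn(tr.next_state, new_goal)  # now often a success
            relabeled.append(
                (tr.state, tr.action, new_goal, reward, tr.next_state))
    return relabeled
```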
In natural language instruction following, HER is more challenging. The linguistic nature of goals introduces ambiguity and a high-dimensional representational space. Early textual HER variants, such as HIGhER, require an oracle predicate—a function or external supervisor that determines whether a state satisfies a given instruction—to label successful outcomes. This reliance on external annotation for relabeling limits scalability, applicability, and data efficiency (Cideron et al., 2019).
ETHER addresses these limitations by:
- Removing dependence on oracle predicates,
- Employing emergent communication to construct predicate functions,
- Generating linguistic labels for both failed and successful episodes,
- Introducing an auxiliary discriminative visual referential game to bootstrap semantic alignment.
2. Methodological Components
ETHER architecturally combines several innovations to generalize textual hindsight experience replay:
Emergent Communication via Referential Games
In ETHER, a discriminative visual referential game is played in parallel with RL training. Two agents—a speaker and a listener—observe different views of a shared object or state. The speaker generates a message (a sequence of discrete tokens chosen from a learnable vocabulary, sampled with a Straight-Through Gumbel-Softmax estimator), and the listener must identify the target among distractors based on the message. This process induces the emergence of an artificial language with compositional properties (Denamganaï et al., 2023).
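A compact PyTorch sketch of one step of such a discriminative game follows; the encoder sizes, vocabulary, and message length are illustrative assumptions, and only the Straight-Through Gumbel-Softmax sampling and the speaker/listener roles mirror the mechanism described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MSG_LEN, HID = 16, 4, 64  # illustrative sizes, not from the paper

class Speaker(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(128, HID)             # stand-in visual encoder
        self.head = nn.Linear(HID, MSG_LEN * VOCAB)

    def forward(self, target_view, tau=1.0):
        logits = self.head(torch.relu(self.enc(target_view)))
        logits = logits.view(-1, MSG_LEN, VOCAB)
        # Straight-Through Gumbel-Softmax: discrete one-hot tokens on the
        # forward pass, continuous gradients on the backward pass.
        return F.gumbel_softmax(logits, tau=tau, hard=True)

class Listener(nn.Module):
    def __init__(self):
        super().__init__()
        self.msg_enc = nn.Linear(MSG_LEN * VOCAB, HID)
        self.img_enc = nn.Linear(128, HID)

    def forward(self, message, candidate_views):
        m = self.msg_enc(message.flatten(1))        # (B, HID)
        c = self.img_enc(candidate_views)           # (B, N, HID)
        return torch.einsum('bh,bnh->bn', m, c)     # match score per candidate

# One training step: the listener must pick the target among distractors.
speaker, listener = Speaker(), Listener()
opt = torch.optim.Adam([*speaker.parameters(), *listener.parameters()], lr=1e-3)
target = torch.randn(8, 128)                        # speaker's view of target
candidates = torch.randn(8, 5, 128)                 # listener's candidates
candidates[:, 0] = target + 0.1 * torch.randn(8, 128)  # index 0 = target view
scores = listener(speaker(target), candidates)
loss = F.cross_entropy(scores, torch.zeros(8, dtype=torch.long))
opt.zero_grad(); loss.backward(); opt.step()
```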
Predicate Function Bootstrapping
The learned communication protocol from the referential game is leveraged to automatically construct an approximate predicate function. The listener agent, initially trained to interpret speaker messages, is repurposed to score whether a given state matches a relabeled goal phrase—effectively replacing the need for a manually specified state–goal predicate.
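A hypothetical wrapper illustrates how a trained listener can be repurposed this way; `encode_text` (mapping a goal phrase into the listener's message format) and the decision `threshold` are illustrative assumptions, and the listener follows the interface sketched above:

```python
import torch

@torch.no_grad()
def listener_predicate(listener, encode_text, state_encoding, goal_phrase,
                       threshold=0.5):
    """Approximate state-goal predicate bootstrapped from the listener.
    Both `encode_text` and `threshold` are illustrative assumptions."""
    message = encode_text(goal_phrase)               # text -> token one-hots
    # Score the single candidate state against the relabeled goal message.
    score = listener(message, state_encoding.unsqueeze(1)).squeeze(1)
    return torch.sigmoid(score) > threshold          # True iff state matches
```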
Semantic Co-occurrence Grounding
To ensure the artificial language aligns with the natural instruction semantics used in the benchmark, ETHER introduces a semantic co-occurrence grounding loss that encourages generated messages to share token co-occurrence statistics with natural language instructions. In the notation below, it can be written as a masked token-occurrence prediction objective:

$$\mathcal{L}_{\text{co}} = -\sum_{k=1}^{|V|} m_k \Big[ \tilde{o}_k \log \sigma\big(f(s)^{\top} v_k\big) + \big(1-\tilde{o}_k\big) \log\big(1-\sigma\big(f(s)^{\top} v_k\big)\big) \Big]$$

Here, $f(s)$ is the state encoding, $v_k$ is the prior semantic vector for token $k$, $\tilde{o}_k$ indicates (noisy) token occurrence in the paired instruction, $\sigma$ is the logistic function, and $m_k$ entropy-masks non-informative tokens. This loss pulls artificial language tokens toward visual features present in target instructions, promoting compositional grounding.
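Under this reconstruction, the loss is a masked binary cross-entropy over the vocabulary; the following PyTorch sketch makes the shapes explicit (all of them, like the BCE form itself, are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def cooccurrence_grounding_loss(state_enc, token_priors, occ_targets, mask):
    """Masked token-occurrence prediction loss in the form given above.
    state_enc:    (B, D) state encodings f(s)
    token_priors: (V, D) prior semantic vectors v_k
    occ_targets:  (B, V) noisy 0/1 token-occurrence indicators (float)
    mask:         (V,)   entropy-based mask over non-informative tokens
    (All shapes and the BCE form are illustrative assumptions.)"""
    logits = state_enc @ token_priors.T             # (B, V): f(s)^T v_k
    per_token = F.binary_cross_entropy_with_logits(
        logits, occ_targets, reduction="none")      # elementwise BCE
    return (per_token * mask).sum(dim=1).mean()     # mask, sum over vocab
```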
Trajectory Relabeling and Replay Buffer Augmentation
Both successful and unsuccessful RL episodes are relabeled using the emergent language. For successful episodes, final states and instructions are added to a supervised dataset used to train the speaker. Failures are relabeled by the speaker's generated post-hoc message, and the predicate (listener) is used to assign structured reward along the trajectory. This process fills the replay buffer with semantically rich, hindsight-modified transitions, applicable regardless of the original reward signal.
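Putting the pieces together, a high-level sketch of this relabeling loop is shown below; `speaker.describe`, the `predicate`, and the buffer/dataset interfaces are hypothetical stand-ins for the components sketched earlier:

```python
def relabel_episode(episode, succeeded, speaker, predicate,
                    replay_buffer, speaker_dataset):
    """Hindsight relabeling with emergent language (illustrative pipeline)."""
    final_state = episode[-1].next_state
    if succeeded:
        # Successful episodes supervise the speaker: (final state, instruction).
        speaker_dataset.append((final_state, episode[0].goal))
        relabeled_goal = episode[0].goal
    else:
        # Failures: the speaker produces a post-hoc description of the
        # final state in the emergent language, which becomes the new goal.
        relabeled_goal = speaker.describe(final_state)
    for tr in episode:
        # The listener-derived predicate assigns structured hindsight reward
        # along the trajectory, regardless of the original reward signal.
        reward = float(predicate(tr.next_state, relabeled_goal))
        replay_buffer.add(tr.state, tr.action, relabeled_goal,
                          reward, tr.next_state)
```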
3. Empirical Findings and Comparative Performance
Experimental evaluation of ETHER has focused on modified BabyAI environments (notably the PickUpDist task), where agents must follow natural language instructions in a sparse-reward gridworld (Denamganaï et al., 2023). The central findings include:
- Substantially improved end-task performance and sample efficiency over HIGhER and vanilla DQN baselines. Under a stringent observation budget (~200k samples), ETHER achieves final success ratios near 27%, compared to <18% in prior methods.
- Enhanced coverage: By leveraging both failed and successful trajectories, ETHER relabels a larger fraction of experience, providing richer feedback during early RL phases when true successes are rare.
- Superior semantic alignment: The semantic co-occurrence loss increases the Any-Colour metric (a proxy for alignment to natural color descriptors in BabyAI) to 32% for ETHER+, compared to 9% for ungrounded emergent protocols.
- Robustness to lack of early successes: ETHER is less dependent on early successful episodes, improving stability and applicability in very sparse reward settings.
4. Connections to Related Hindsight and Replay Techniques
ETHER builds upon and generalizes a continuum of experience replay frameworks:
- HER: Retrospective goal substitution to densify reward signals in goal-conditioned RL (Andrychowicz et al., 2017).
- HIGhER: Extension of HER to language-conditioned RL using a learned instruction generator, but requiring an oracle predicate (Cideron et al., 2019).
- Emergent Communication: Use of referential games and unsupervised communication to bootstrap language grounding and predicate formulation.
- Alternative Replay Methods: Novel sampling techniques (e.g., Introspective Experience Replay) and prioritization strategies focus on recency, error magnitude, or surprise, but do not directly address the challenge of aligning emergent communication with natural language feedback in RL (Kumar et al., 2022).
- HER in Symbolic Domains: Adaptations of hindsight replay in theorem proving show that textual and structural representations can facilitate learning even in highly abstract, non-grounded settings (Aygün et al., 2021).
5. Practical Implications and Application Domains
ETHER provides a methodology for learning interpretable, language-aligned representations in RL agents without external supervision. Implications include:
- Scalability to real-world agents requiring language-grounded policies and feedback, such as domestic robots or collaborative agents in human–machine dialogue scenarios.
- The ability of an agent to “explain” both its successes and failures in a compositional, human-interpretable form, enhancing transparency and collaborative debugging.
- Application to any environment where explicit goal achievement is difficult to verify (e.g., open-ended dialog, program synthesis, or conceptual learning) by learning an intrinsic predicate via emergent protocols.
6. Limitations and Prospective Research Directions
ETHER's current framework focuses on affirmative goal structures and relies on discriminative referential games for emergent communication. Extensions and open research areas include:
- Broadening linguistic expressivity to handle negations, exclusions, and compositional disjunctions in instructions, beyond positive commands.
- Integrating larger pre-trained LLMs for richer transfer and zero-shot generalization, and for improved semantic grounding.
- Exploring more complex referential games and grounding functions for environments with multi-modal state spaces or higher-order logic.
- Investigating the convergence properties and limits of predicate fidelity when artificial language is aligned with highly ambiguous or diverse instruction spaces.
7. Significance in the Context of Emergent Communication Research
ETHER demonstrates that unsupervised emergent communication can serve not only as an auxiliary task to accelerate goal-conditioned RL, but also as a mechanism to autonomously produce structured, language-aligned predicates for challenging instruction-following problems. This bridges the gap between symbolic language grounding, reinforcement learning, and multi-agent emergent communication, suggesting new directions for aligning artificial and natural languages in interactive learning systems (Denamganaï et al., 2023).
The table below summarizes how ETHER compares with HIGhER and a plain DQN baseline:

| Attribute | HIGhER | ETHER | DQN Baseline |
|---|---|---|---|
| Oracle predicate required | Yes | No | N/A |
| Emergent communication | No | Yes | No |
| Handles failed trajectories | Limited | Yes | No |
| Natural language alignment | No | Yes (semantic grounding) | N/A |
| Sample efficiency | Low/Medium | High | Low |
In summary, ETHER marks a substantive advance in sample-efficient, interpretable RL by reconciling emergent artificial languages with structured natural language objectives, thereby making HER broadly applicable to linguistically mediated learning settings.