Provably Learning from Language Feedback (2506.10341v1)
Abstract: Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of LLM agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a complexity measure to characterize the hardness of LLF problems. We show that transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.
Summary
- The paper introduces a framework where agents learn from natural language feedback instead of scalar rewards to guide sequential decision-making.
- It presents the HELiX algorithm, which integrates LLMs to balance exploration and exploitation, achieving provable regret bounds.
- The study defines the Transfer Eluder Dimension to quantify how informative feedback is, highlighting the advantage of language feedback over learning from scalar rewards alone.
This paper introduces a formal framework for Learning from Language Feedback (LLF), a scenario where an agent learns to make sequential decisions by receiving natural language feedback instead of traditional scalar rewards. The core idea is to understand when and how agents can learn effectively in such settings, particularly with the advent of LLMs that can interpret and generate rich textual feedback.
Formal Setup of LLF
The LLF problem is defined as follows:
- At each time step $t$, the agent takes an action $A_t$ from a finite action set $\mathcal{A}$.
- It receives language feedback $O_t$ from a feedback space $\mathcal{O}$ (sequences of tokens), sampled from the true feedback distribution $f^*(A_t)$.
- A reward $R_t = r^*(A_t)$ is incurred according to a true reward function $r^*$, but this reward is never observed by the agent.
- The agent's goal is to minimize regret: $\mathrm{Regret}(T) = \sum_{t=0}^{T-1}\big(R^*_{\max} - \mathbb{E}_{\pi_t}[R_t]\big)$, where $R^*_{\max} = \max_{a\in\mathcal{A}} r^*(a)$. (A minimal code sketch of this interaction loop appears after this list.)
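To make the protocol concrete, here is a minimal sketch of the LLF interaction loop in Python. The `NumberGuessEnv` toy environment, its feedback strings, and the `LLFAgent` interface are hypothetical illustrations, not constructions from the paper; the key point is that the agent only ever sees the feedback string, while regret is computed against the hidden reward.

```python
from typing import Protocol

class LLFAgent(Protocol):
    def act(self) -> str: ...
    def observe(self, action: str, feedback: str) -> None: ...

class NumberGuessEnv:
    """Toy LLF environment: the latent reward is never shown to the agent;
    only a natural-language feedback string is returned."""
    def __init__(self, target: int = 7, n_actions: int = 10):
        self.target = target
        self.actions = [str(i) for i in range(n_actions)]

    def reward(self, action: str) -> float:   # r*(a), hidden from the agent
        return 1.0 if int(action) == self.target else 0.0

    def feedback(self, action: str) -> str:   # one sample from f*(a)
        if int(action) == self.target:
            return "Correct."
        return ("Too low, try a larger number." if int(action) < self.target
                else "Too high, try a smaller number.")

def run_llf(env: NumberGuessEnv, agent: LLFAgent, T: int) -> float:
    """Run T rounds of the LLF protocol and return the agent's (latent) regret."""
    r_max = max(env.reward(a) for a in env.actions)   # R*_max, for bookkeeping only
    regret = 0.0
    for _ in range(T):
        a = agent.act()
        agent.observe(a, env.feedback(a))   # the agent only ever sees language
        regret += r_max - env.reward(a)     # regret is tallied outside the agent
    return regret
```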
To model the environment, the paper introduces:
- Text Hypotheses ($\mathcal{H}$): A (potentially very large) space of text descriptions $\eta \in \mathcal{H} \subset \mathcal{T}^+$. Each hypothesis $\eta$ can describe the problem, the mechanism generating feedback, or the underlying rules of the environment.
- Reward Mapping ($\eta \mapsto r_\eta$): A known mapping that takes a hypothesis $\eta$ and outputs a reward function $r_\eta : \mathcal{A} \to [0,1]$. The agent has access to this mapping (Assumption 1), which can be implemented by an LLM that processes $\eta$ to predict rewards for actions. The true environment operates under an unknown true hypothesis $\eta^*$.
- Feedback Mapping ($\eta \mapsto f_\eta$): A mapping from a hypothesis $\eta$ to a feedback function $f_\eta : \mathcal{A} \to \Delta(\mathcal{O})$. The agent does not know this mapping. (A sketch of an LLM-based reward mapping follows this list.)
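As a rough illustration of Assumption 1, the reward mapping $\eta \mapsto r_\eta$ could be realized by prompting an LLM to score an action under a text hypothesis. The `llm_complete` function and the prompt wording below are hypothetical placeholders, not the paper's implementation:

```python
def reward_mapping(llm_complete, hypothesis: str, action: str) -> float:
    """Sketch of r_eta(a): ask an LLM to score `action` in [0, 1] assuming that
    `hypothesis` correctly describes the environment. `llm_complete` is any
    text-in/text-out completion function (a hypothetical stand-in, not a specific API)."""
    prompt = (
        "Assume the following description of the environment is true:\n"
        f"{hypothesis}\n\n"
        f"On a scale from 0 to 1, how good is the action '{action}'? "
        "Answer with a single number."
    )
    try:
        return min(1.0, max(0.0, float(llm_complete(prompt).strip())))
    except ValueError:
        return 0.0  # fall back if the model does not return a parseable number
```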
Measuring Information in Feedback and Key Assumptions
To enable learning, the following are crucial:
- Verifier (Assumption 2): The agent has access to a verifier, defined by a loss function $\ell : \mathcal{A}\times\mathcal{O}\times\mathcal{H} \to [0,1]$. Here $\ell(a,o,\eta)$ quantifies how well hypothesis $\eta$ aligns with feedback $o$ for action $a$: it is $0$ when they are consistent and positive otherwise. This can be implemented by prompting an LLM to assess semantic consistency (a toy programmatic instantiation is sketched after this list).
- Unbiased Feedback (Assumption 3): For any action $a$ and hypothesis $\eta$, it is assumed that $\eta \in \arg\min_{\eta'\in\mathcal{H}} \mathbb{E}_{O\sim f_\eta(a)}[\ell(a,O,\eta')]$. In other words, on average a hypothesis $\eta$ best explains the feedback $f_\eta(a)$ it generates, even when that feedback is noisy.
- Hypothesis Equivalence: Two hypotheses $\eta, \eta'$ are equivalent if $d_\mathcal{H}(\eta,\eta') = \sup_{a,o} |\ell(a,o,\eta) - \ell(a,o,\eta')| = 0$.
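Continuing the toy number-guessing example, the verifier loss $\ell(a,o,\eta)$ can be instantiated programmatically when hypotheses have a known structure; in the paper's general setting an LLM prompt would play this role. The string-matching rule below is an illustrative assumption:

```python
def verifier_loss(action: str, feedback: str, hypothesis_target: int) -> float:
    """l(a, o, eta) for the toy number-guessing domain: 0 if the hypothesis
    'the target is hypothesis_target' is consistent with the feedback string
    observed for `action`, and 1 otherwise."""
    a = int(action)
    if "Correct" in feedback:
        return 0.0 if a == hypothesis_target else 1.0
    if "Too low" in feedback:
        return 0.0 if a < hypothesis_target else 1.0
    if "Too high" in feedback:
        return 0.0 if a > hypothesis_target else 1.0
    return 1.0  # unrecognized feedback: treat as inconsistent
```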
Transfer Eluder Dimension
To quantify learning difficulty, the paper introduces the Transfer Eluder Dimension, $\mathrm{TE}(\mathcal{H},\ell,\epsilon)$.
- An action $a$ is $\epsilon$-transfer dependent on prior actions $\{a_1,\dots,a_n\}$ if any two hypotheses $\eta, \eta'$ that are "close" in terms of expected verifier loss on $\{a_1,\dots,a_n\}$, i.e., $\sum_{i=1}^{n}\big(\mathbb{E}_{o\sim f_{\eta'}(a_i)}[\ell(a_i,o,\eta)] - \ell^{\min}_{\eta'}(a_i)\big) \le \epsilon^2$, also predict similar rewards for $a$, i.e., $|r_\eta(a) - r_{\eta'}(a)| \le \epsilon$.
- $\mathrm{TE}(\mathcal{H},\ell,\epsilon)$ is the length of the longest sequence of actions in which each action is $\epsilon'$-transfer independent of its predecessors for some $\epsilon' \ge \epsilon$.
- A smaller transfer eluder dimension implies that feedback is more efficient at reducing uncertainty about rewards.
- Informative feedback can exponentially reduce learning complexity. For example, when guessing an $L$-bit string (checked concretely in the script after this list):
- Reward-only feedback: $O(2^L)$ complexity.
- Bitwise correctness feedback: $\mathrm{TE}(\mathcal{H},\ell,\epsilon) = 1$.
- If feedback is reward-informative (Definition 3: reward differences are lower-bounded by expected verifier-loss differences), then $\mathrm{TE}(\mathcal{H},\ell,\epsilon) \le \dim_E(\mathcal{R}_\mathcal{H},\epsilon)$, where $\dim_E$ is the standard eluder dimension of the reward class $\mathcal{R}_\mathcal{H}$. This means LLF is no harder than RL whenever the feedback contains reward information.
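The bit-string example can be checked directly: bitwise correctness feedback from a single guess pins down every bit, whereas reward-only feedback forces enumeration in the worst case. The script below is a toy illustration of this gap, not the paper's formal construction:

```python
import itertools
import random

L = 12
target = tuple(random.randint(0, 1) for _ in range(L))

# Reward-only feedback: the agent only learns "right/wrong" per guess,
# so in the worst case it must enumerate all 2^L strings.
queries = 0
for guess in itertools.product((0, 1), repeat=L):
    queries += 1
    if guess == target:
        break
print(f"reward-only: found after {queries} guesses (worst case {2**L})")

# Bitwise correctness feedback: one guess's feedback reveals which bits are
# wrong, so the agent can flip them and succeed on the second guess.
first_guess = tuple(0 for _ in range(L))
feedback = tuple(int(g == t) for g, t in zip(first_guess, target))  # 1 = bit is correct
second_guess = tuple(g if ok else 1 - g for g, ok in zip(first_guess, feedback))
print(f"bitwise feedback: solved in 2 guesses, correct = {second_guess == target}")
```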
HELiX Algorithm
The paper proposes HELiX (Hypothesis Elimination using Language-informed Exploration), a UCB-style algorithm.
Theoretical Version (Algorithm 1):
- Initialization: Start with the full hypothesis space $\mathcal{H}_0 = \mathcal{H}$.
- Iteration $t$:
- Observe feedback $O_{t-1}$ for action $A_{t-1}$.
- Update Confidence Set: $\mathcal{H}_t \leftarrow \mathcal{H}_{t-1} \cap \big\{\eta \in \mathcal{H} : \tfrac{1}{t}\sum_{i} \ell(A_i,O_i,\eta) - \min_{\eta'\in\mathcal{H}} \tfrac{1}{t}\sum_{i} \ell(A_i,O_i,\eta') \le \epsilon_t\big\}$. This retains hypotheses consistent with the observed feedback history.
- Check for Exploitation: Compute a pessimistic policy $\pi^p$ by solving $\min_{\pi} \max_{\eta\in\mathcal{H}_t} \big[r_\eta(\pi_\eta) - r_\eta(\pi)\big]$, where $\pi_\eta$ is the optimal policy for hypothesis $\eta$.
- If the minimax regret is $0$ (i.e., there is a consensus optimal action across all $\eta\in\mathcal{H}_t$), play $A_t \sim \pi^p(\cdot)$ (Exploitation).
- Else (Exploration): Find an optimistic policy $\pi^o$ and hypothesis $\eta^o$ via $\arg\max_{\pi} \max_{\eta\in\mathcal{H}_t} r_\eta(\pi)$. Play $A_t \sim \pi^o(\cdot)$.
Theoretical Guarantee (Theorem 1): HELiX achieves regret $O\big(T^{3/4}\,(\log N(\mathcal{H},\epsilon_T^{\mathcal{H}},d_{\mathcal{H}}))^{1/4}\,\mathrm{TE}(\mathcal{H},\ell,\epsilon_T^{\mathcal{H}})\big)$, where $N(\cdot)$ is a covering number of the hypothesis class. The $T^{3/4}$ rate reflects the minimal assumptions placed on $\ell$; under stronger assumptions (e.g., convexity), an $O(\sqrt{T})$ rate is achievable. A minimal code sketch of the loop follows.
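Below is a sketch of one round of the theoretical algorithm over a finite hypothesis class, assuming the verifier loss, the reward mapping, and per-hypothesis optimal actions can all be evaluated exactly (as Algorithm 1 presumes); the function and argument names are hypothetical:

```python
import random

def helix_step(hypotheses, history, actions, loss, reward, eps_t):
    """One round of (theoretical) HELiX over a finite hypothesis set.
    history: list of (action, observation) pairs seen so far.
    loss(a, o, eta): verifier loss; reward(eta, a): reward mapping r_eta(a)."""
    # 1. Confidence set: keep hypotheses whose average verifier loss is
    #    within eps_t of the best hypothesis's average loss.
    def avg_loss(eta):
        return sum(loss(a, o, eta) for a, o in history) / max(len(history), 1)
    best = min(avg_loss(eta) for eta in hypotheses)
    confident = [eta for eta in hypotheses if avg_loss(eta) - best <= eps_t]

    # 2. Exploitation check: is some action optimal under every surviving hypothesis?
    def optimal_actions(eta):
        m = max(reward(eta, a) for a in actions)
        return {a for a in actions if reward(eta, a) == m}
    consensus = set(actions)
    for eta in confident:
        consensus &= optimal_actions(eta)
    if consensus:
        return random.choice(sorted(consensus)), confident   # exploit

    # 3. Otherwise, optimism: play the action with the highest reward under
    #    the most optimistic surviving hypothesis.
    action = max(actions, key=lambda a: max(reward(eta, a) for eta in confident))
    return action, confident
```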
Practical Implementation with LLMs (Algorithm 2, Figure 1 & 4):
This version uses an LLM for several components:
- $\pi_{\mathrm{LLM}}$: An LLM that generates a thought (hypothesis $\eta$) and a corresponding action $A$, given the interaction history.
- $\pi_{\mathrm{ref}}$: A reference policy (e.g., random valid actions) used for exploration diversity.
- $R_{\mathrm{LLM}}(\eta, A)$: An LLM-based reward mapping that scores action $A$ under hypothesis $\eta$. (The interfaces of these three components are sketched below.)
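In code, these components reduce to a small set of callables; the signatures below are one hypothetical reading of the setup, not the paper's actual implementation:

```python
from typing import Callable, Protocol

class PolicyLLM(Protocol):
    def __call__(self, history: list[tuple[str, str]]) -> tuple[str, str]:
        """Given the (action, feedback) history, return a (thought, action) pair."""
        ...

# pi_ref: any cheap source of exploratory actions, e.g. uniformly random valid moves.
ReferencePolicy = Callable[[], str]

# R_LLM(eta, a): scores action `a` under hypothesis (thought) `eta`, in [0, 1].
RewardModel = Callable[[str, str], float]
```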
Steps:
- Sample Hypotheses and Actions:
- Use $\pi_{\mathrm{LLM}}$ to generate $N$ diverse thought-action pairs (giving the hypothesis set $\hat{\mathcal{H}}_t$ and initial actions).
- Use $\pi_{\mathrm{ref}}$ to generate $M$ additional random/exploratory actions. Let $\hat{\mathcal{A}}_t$ be the set of all $N+M$ actions.
- Thought Cross-Verify:
- Construct a score matrix $[S_t]_{\eta,a} = R_{\mathrm{LLM}}(\eta, a)$ for all $\eta \in \hat{\mathcal{H}}_t$, $a \in \hat{\mathcal{A}}_t$.
- Exploitation Step:
- Find $\hat{\mathcal{A}}_t^* = \bigcap_{\eta\in\hat{\mathcal{H}}_t} \arg\max_{a} [S_t]_{\eta,a}$.
- If $\hat{\mathcal{A}}_t^* \neq \emptyset$ (a consensus optimal action exists), choose $A_{t+1}$ from $\hat{\mathcal{A}}_t^*$ (e.g., via tie-breaking).
- Exploration Step (if no consensus):
- Optimistic Hypothesis Selection: Keep the hypotheses $\tilde{\mathcal{H}}_t \subseteq \hat{\mathcal{H}}_t$ whose best actions achieve the highest scores: $\tilde{\mathcal{H}}_t = \arg\max_{\eta\in\hat{\mathcal{H}}_t} \big(\max_{a} [S_t]_{\eta,a}\big)$.
- Action Selection with Re-scoring (Advantage): Select $A_{t+1} = \arg\max_{a\in\hat{\mathcal{A}}_t} \Big(\max_{\eta\in\tilde{\mathcal{H}}_t} \big([S_t]_{\eta,a} - \mathbb{E}_{\tilde{a}\sim\pi_{\mathrm{ref}}}[[S_t]_{\eta,\tilde{a}}]\big)\Big)$. This re-centers scores by subtracting the average score of actions drawn from $\pi_{\mathrm{ref}}$ under each hypothesis, favoring actions with high advantage (a code sketch of this selection logic follows the list).
- Tie-breaking prefers earlier-generated hypotheses/actions.
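Given the score matrix $S_t$, the exploitation/exploration selection reduces to a few lines of array logic. The sketch below assumes the scores are already computed (rows indexed by hypotheses, columns by actions) and that the last `m_ref` columns correspond to actions drawn from $\pi_{\mathrm{ref}}$; this layout is an implementation assumption for illustration.

```python
import numpy as np

def select_action(S: np.ndarray, m_ref: int) -> int:
    """Choose the next action index given score matrix S of shape
    (num_hypotheses, num_actions), where the last m_ref columns are
    reference-policy actions used only for re-centering scores."""
    # Exploitation: actions that are argmax for every hypothesis simultaneously.
    per_hyp_argmax = [set(np.flatnonzero(row == row.max())) for row in S]
    consensus = set.intersection(*per_hyp_argmax)
    if consensus:
        return min(consensus)                      # tie-break toward earlier actions

    # Exploration: keep the optimistic hypotheses (highest best-action score) ...
    best_per_hyp = S.max(axis=1)
    optimistic = np.flatnonzero(best_per_hyp == best_per_hyp.max())

    # ... and pick the action with the highest re-centered score (advantage),
    # subtracting each hypothesis's average score over reference actions.
    baseline = S[:, -m_ref:].mean(axis=1, keepdims=True)
    advantage = (S - baseline)[optimistic].max(axis=0)
    return int(np.argmax(advantage))               # argmax also prefers earlier ties
```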
Experiments
- Environments: Wordle (with modified feedback: only the first incorrect character is reported), Battleship (2D grid, locate 3 ships), Minesweeper (2D grid, avoid mines).
- Baselines:
- Greedy: ReAct-style, generates one hypothesis and one action.
- HELiX (no exploitation step): Only optimistic exploration.
- HELiX (no $\pi_{\mathrm{ref}}$): Omits re-scoring with a reference policy.
- LLM Used: Claude-Sonnet-3.5 v2.
- Results (Figure 2):
- HELiX generally outperformed baselines, especially in Battleship and Minesweeper where strategic exploration is crucial.
- The full HELiX (with the exploitation step and $\pi_{\mathrm{ref}}$) performed best.
- Greedy LLM performed worse, highlighting the need for structured exploration.
- Implementation Discussion: The practical algorithm relies on the LLM's ability to:
- Select optimal actions under a given hypothesis.
- Produce fair scores across hypotheses.
- Generate diverse and faithful hypotheses reflecting interaction history.
- These are strong assumptions needing further validation.
Relationship to Existing Paradigms (Figure 3)
LLF is presented as a general framework subsuming:
- Reinforcement Learning (RL): Feedback is the scalar reward (a minimal sketch of this reduction follows the list).
- Partial Monitoring: Agent observes abstract feedback; LLF handles unknown feedback mappings.
- Interaction-Guided Learning (IGL): A rich feedback vector decodes the latent reward; LLF models this via $r_\eta$ and $f_\eta$.
- Reward-Informative LLF: Latent reward is a function of observed (language) feedback.
- Multi-Objective RL (MORL): Hypotheses can represent vector-valued rewards.
- Preference-Based RL (PbRL): Feedback is a pairwise preference; $f_\eta$ can be a comparator.
- Imitation Learning (IL): Expert actions are feedback; verifier loss measures deviation.
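For instance, the RL special case can be written down directly: the "language" feedback is just the reward rendered as text, and the verifier compares it with the hypothesis's predicted reward. The squared-error loss below is an illustrative choice, not the paper's:

```python
def rl_as_llf_feedback(reward_fn, action: str) -> str:
    """f*(a) for the RL special case: the 'language' feedback is just the reward."""
    return f"reward: {reward_fn(action)}"

def rl_as_llf_verifier(action: str, feedback: str, predicted_reward: float) -> float:
    """l(a, o, eta): how far the hypothesis's predicted reward r_eta(a) is from
    the reward reported in the feedback string (squared error, clipped to [0, 1])."""
    observed = float(feedback.split(":")[1])
    return min(1.0, (observed - predicted_reward) ** 2)
```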
Limitations and Open Questions
- Transfer eluder dimension is not a lower bound: A problem might have unbounded TE but be trivially solvable (e.g., feedback always reveals an optimal action, but not full reward details). HELiX's exploitation step handles this, unlike naive UCB.
- The true complexity of LLF might lie between worst-case reward identification and optimal behavior learning.
- Developing a complexity measure that both lower-bounds regret and informs practical LLM-based algorithm design is an open question (e.g., adapting Decision-Estimation Coefficient (DEC) to LLF).
In summary, this paper provides a significant first step towards a principled understanding of learning from general language feedback. It formalizes the problem, introduces a relevant complexity measure (transfer eluder dimension), proposes a provably efficient algorithm (HELiX), and demonstrates its practical viability with LLMs. The work highlights how informative language feedback can be substantially more powerful than scalar rewards and lays theoretical groundwork for designing more sophisticated LLM agents that learn interactively.