Provably Learning from Language Feedback (2506.10341v1)

Published 12 Jun 2025 in cs.LG and cs.CL

Abstract: Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of LLM agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a complexity measure to characterize the hardness of LLF problems. We show that transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.

Summary

  • The paper introduces a framework where agents learn from natural language feedback instead of scalar rewards to guide sequential decision-making.
  • It presents the HELiX algorithm, which integrates LLMs to balance exploration and exploitation, achieving provable regret bounds.
  • The study defines the Transfer Eluder Dimension to quantify feedback efficiency, highlighting language feedback’s advantage over traditional methods.

This paper introduces a formal framework for Learning from Language Feedback (LLF), a scenario where an agent learns to make sequential decisions by receiving natural language feedback instead of traditional scalar rewards. The core idea is to understand when and how agents can learn effectively in such settings, particularly with the advent of LLMs that can interpret and generate rich textual feedback.

Formal Setup of LLF

The LLF problem is defined as follows:

  • At each time step $t$, an agent takes an action $A_t$ from a finite action set $A$.
  • It receives language feedback $O_t$ from a feedback space $\mathcal{O}$ (sequences of tokens), sampled from a true feedback distribution $f^*(A_t)$.
  • A reward $R_t = r^*(A_t)$ is incurred according to a true reward function $r^*$, but this reward is not observed by the agent.
  • The agent's goal is to minimize regret: $\mathrm{Regret}(T) = \sum_{t=0}^{T-1} (R_{\max}^* - \mathbb{E}_{\pi_t}[R_t])$, where $R_{\max}^* = \max_{a \in A} r^*(a)$ (a minimal sketch of this interaction loop appears below).
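
The protocol is a plain interaction loop in which the reward stream is hidden from the learner. Below is a minimal Python sketch under assumed interfaces (`env.feedback`, `env.latent_reward`, and `agent.act` are illustrative names, not from the paper); the latent reward is used only to compute regret for analysis.

```python
def run_llf(env, agent, T):
    """Run T steps of Learning from Language Feedback (illustrative sketch).

    env.feedback(a) -> str samples O_t ~ f*(a); env.latent_reward(a) -> float is r*(a),
    visible to the analyst but never to the agent; agent.act(history) -> action.
    """
    history, regret = [], 0.0
    r_max = max(env.latent_reward(a) for a in env.actions)  # R*_max, for regret accounting only
    for t in range(T):
        a_t = agent.act(history)           # A_t, chosen from past (action, feedback) pairs
        o_t = env.feedback(a_t)            # O_t: a piece of language, not a scalar reward
        history.append((a_t, o_t))         # the reward R_t = r*(A_t) is never appended
        regret += r_max - env.latent_reward(a_t)
    return history, regret
```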

To model the environment, the paper introduces:

  • Text Hypotheses ($\mathcal{H}$): A (potentially very large) space of text descriptions $\eta \in \mathcal{H} \subset \mathcal{T}^+$. Each hypothesis $\eta$ can describe the problem, the mechanism for generating feedback, or the underlying rules of the environment.
  • Reward Mapping ($\eta \mapsto r_\eta$): A known mapping that takes a hypothesis $\eta$ and outputs a reward function $r_\eta: A \to [0,1]$. The agent has access to this mapping (Assumption 1), which can be implemented by an LLM that processes $\eta$ to predict rewards for actions. The true environment operates under an unknown true hypothesis $\eta^*$.
  • Feedback Mapping ($\eta \mapsto f_\eta$): A mapping from a hypothesis $\eta$ to a feedback function $f_\eta: A \to \Delta(\mathcal{O})$. The agent does not know this mapping. (An illustrative interface sketch of these components follows this list.)
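
In code, the components above might look like the following. This is an assumed rendering: the `Protocol` classes and the `llm_reward_mapping` helper are my own names, and the prompt is a placeholder, not the paper's.

```python
from typing import Callable, Protocol

Hypothesis = str   # eta: a free-form text description of the environment's rules
Action = str
Feedback = str

class RewardMapping(Protocol):
    """Known to the agent (Assumption 1): eta -> r_eta, with r_eta(a) in [0, 1]."""
    def __call__(self, eta: Hypothesis, a: Action) -> float: ...

class FeedbackMapping(Protocol):
    """Unknown to the agent: eta -> f_eta, a distribution over language feedback."""
    def __call__(self, eta: Hypothesis, a: Action) -> Feedback: ...

def llm_reward_mapping(llm: Callable[[str], str]) -> RewardMapping:
    """One plausible instantiation: ask an LLM to score an action under a text
    hypothesis and parse the number (prompt and parsing are illustrative)."""
    def r_eta(eta: Hypothesis, a: Action) -> float:
        reply = llm(f"Environment description: {eta}\nRate action {a!r} from 0 to 1.")
        try:
            return min(max(float(reply.strip()), 0.0), 1.0)
        except ValueError:
            return 0.0  # unparseable score: fall back to the lowest reward
    return r_eta
```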

Measuring Information in Feedback and Key Assumptions

To enable learning, the following are crucial:

  1. Verifier (Assumption 2): The agent has access to a verifier, defined by a loss function $\ell: A \times \mathcal{O} \times \mathcal{H} \to [0,1]$. The value $\ell(a, o, \eta)$ quantifies how well hypothesis $\eta$ aligns with feedback $o$ for action $a$: if they are consistent, $\ell(a,o,\eta) = 0$; otherwise it is positive. This can be implemented by prompting an LLM to assess semantic consistency (see the sketch after this list).
  2. Unbiased Feedback (Assumption 3): For any action $a$ and hypothesis $\eta$, it is assumed that $\eta \in \argmin_{\eta' \in \mathcal{H}} \mathbb{E}_{O \sim f_\eta(a)}[\ell(a, O, \eta')]$. This means that, on average, a hypothesis $\eta$ best explains the feedback $f_\eta(a)$ it generates, even if the feedback is noisy.
  3. Hypothesis Equivalence: Two hypotheses $\eta, \eta'$ are equivalent if $d_\mathcal{H}(\eta,\eta') = \sup_{a,o} |\ell(a,o,\eta) - \ell(a,o,\eta')| = 0$.
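
A plausible LLM-judge realization of the verifier is sketched below; the prompt wording and the score parsing are my own illustrative choices, not taken from the paper.

```python
from typing import Callable

def llm_verifier(llm: Callable[[str], str]) -> Callable[[str, str, str], float]:
    """Build a verifier loss ell(a, o, eta) in [0, 1]: 0 when hypothesis eta is
    semantically consistent with receiving feedback o after action a."""
    def ell(a: str, o: str, eta: str) -> float:
        prompt = (
            f"Hypothesis about the environment: {eta}\n"
            f"Action taken: {a}\nFeedback received: {o}\n"
            "From 0 (fully consistent) to 1 (clearly contradicts the hypothesis), "
            "how inconsistent is this feedback with the hypothesis? Reply with a number."
        )
        try:
            return min(max(float(llm(prompt).strip()), 0.0), 1.0)
        except ValueError:
            return 1.0  # treat an unparseable judgment as maximally inconsistent
    return ell
```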

Transfer Eluder Dimension

To quantify learning difficulty, the paper introduces the Transfer Eluder Dimension, $TE(\mathcal{H}, \ell, \epsilon)$.

  • An action $a$ is $\epsilon$-transfer dependent on prior actions $\{a_1, \dots, a_n\}$ if any two hypotheses $\eta, \eta'$ that are "close" in terms of expected verifier loss on $\{a_1, \dots, a_n\}$ (i.e., $\sum_{i=1}^n (\mathbb{E}_{o \sim f_{\eta'}(a_i)}[\ell(a_i, o, \eta)] - \ell_{\eta'}^{\min}(a_i)) \le \epsilon^2$) also predict similar rewards for $a$ (i.e., $|r_\eta(a) - r_{\eta'}(a)| \le \epsilon$).
  • $TE(\mathcal{H}, \ell, \epsilon)$ is the length of the longest sequence of actions in which each action is $\epsilon'$-transfer independent of its predecessors for some $\epsilon' \ge \epsilon$.
  • A smaller transfer eluder dimension implies that feedback is more efficient at reducing uncertainty about rewards.
  • Informative feedback can exponentially reduce learning complexity. For example, in guessing an $L$-bit string (a toy simulation appears after this list):
    • Reward-only feedback: $O(2^L)$ complexity.
    • Bitwise correctness feedback: $TE(\mathcal{H}, \ell, \epsilon) = 1$.
  • If feedback is reward-informative (Definition 3: expected verifier loss gaps lower-bound reward differences, up to a constant $C_F$), then $TE(\mathcal{H}, C_F \ell, \epsilon) \le \dim_E(\mathcal{R}_\mathcal{H}, \epsilon)$, where $\dim_E$ is the standard eluder dimension of the reward class $\mathcal{R}_\mathcal{H}$. This means LLF is no harder than RL whenever the feedback contains the reward information.
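
A toy simulation of the $L$-bit string example (an illustrative sketch, not the paper's construction): with bitwise-correctness feedback a single guess identifies the secret, matching the intuition behind $TE(\mathcal{H}, \ell, \epsilon) = 1$, while a reward-only learner may need up to $2^L$ guesses.

```python
import itertools

L = 4
secret = "1010"  # the unknown L-bit string (illustrative)

def bitwise_feedback(guess: str) -> str:
    """Language-like feedback naming which positions are correct."""
    return "".join("Y" if g == s else "N" for g, s in zip(guess, secret))

def reward(guess: str) -> float:
    return 1.0 if guess == secret else 0.0

# Reward-only learner: enumerate all 2^L strings until the reward fires.
reward_queries = 0
for bits in itertools.product("01", repeat=L):
    reward_queries += 1
    if reward("".join(bits)) == 1.0:
        break

# Feedback learner: one arbitrary guess plus its bitwise feedback pins down the secret.
guess = "0" * L
marks = bitwise_feedback(guess)
recovered = "".join(g if m == "Y" else str(1 - int(g)) for g, m in zip(guess, marks))

print(f"reward-only queries: {reward_queries}, recovered after 1 guess: {recovered == secret}")
```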

HELiX Algorithm

The paper proposes HELiX (Hypothesis Elimination using Language-informed Exploration), a UCB-style algorithm.

Theoretical Version (Algorithm 1):

  1. Initialization: Start with the full hypothesis space $\mathcal{H}_0 = \mathcal{H}$.
  2. Iteration $t$:
    • Observe feedback $O_{t-1}$ for action $A_{t-1}$.
    • Update Confidence Set: $\mathcal{H}_t \gets \mathcal{H}_{t-1} \cap \{ \eta \in \mathcal{H} : \frac{1}{t} \sum_i \ell(A_i, O_i, \eta) - \min_{\eta' \in \mathcal{H}} \frac{1}{t} \sum_i \ell(A_i, O_i, \eta') \le \epsilon_t \}$. This keeps hypotheses consistent with the observed feedback history.
    • Check for Exploitation: Compute a pessimistic policy $\pi_p$ by solving $\min_{\pi} \max_{\eta \in \mathcal{H}_t} [r_\eta(\pi_\eta) - r_\eta(\pi)]$, where $\pi_\eta$ is the optimal policy for hypothesis $\eta$.
      • If the minimax regret is 0 (i.e., there is a consensus optimal action for all $\eta \in \mathcal{H}_t$), play $A_t \sim \pi_p(\cdot)$ (Exploitation).
      • Else (Exploration): Find an optimistic policy $\pi_o$ and hypothesis $\eta_o$ via $\argmax_{\pi} \max_{\eta \in \mathcal{H}_t} r_\eta(\pi)$. Play $A_t \sim \pi_o(\cdot)$ (see the sketch below).
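
A simplified sketch of one iteration of this loop for finite hypothesis and action sets; the confidence radius `eps_t` and the function names are illustrative, and the actual algorithm is stated over general policy classes.

```python
import numpy as np

def helix_step(H, A, history, ell, r, eps_t):
    """One theoretical HELiX iteration over finite sets (illustrative sketch).

    H: list of hypotheses; A: list of actions; history: list of (action, feedback);
    ell(a, o, eta): verifier loss; r(eta, a): known reward mapping; eps_t: radius.
    """
    # Confidence set: keep hypotheses whose average verifier loss is near-optimal.
    avg_loss = {eta: (np.mean([ell(a, o, eta) for a, o in history]) if history else 0.0)
                for eta in H}
    best = min(avg_loss.values())
    H_t = [eta for eta in H if avg_loss[eta] - best <= eps_t]

    # Exploitation check: is some action optimal under every surviving hypothesis?
    argmax_sets = [{a for a in A if r(eta, a) == max(r(eta, b) for b in A)} for eta in H_t]
    consensus = set.intersection(*argmax_sets)
    if consensus:
        return sorted(consensus)[0], H_t   # minimax regret is zero: exploit

    # Otherwise explore: the optimistic (hypothesis, action) pair promising the largest reward.
    _, a_opt = max(((eta, a) for eta in H_t for a in A), key=lambda pair: r(*pair))
    return a_opt, H_t
```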

Theoretical Guarantee (Theorem 1): HELiX achieves regret $\widetilde{O}\big( T^{3/4} (\log N(\mathcal{H},\epsilon_T^\mathcal{H},d_\mathcal{H}))^{1/4} \sqrt{TE(\mathcal{H},\ell,\epsilon_T^\mathcal{H})} \big)$. The $T^{3/4}$ rate is due to minimal assumptions on $\ell$; with stronger assumptions (e.g., convexity), $\widetilde{O}(\sqrt{T})$ is possible.

Practical Implementation with LLMs (Algorithm 2, Figure 1 & 4):

This version uses an LLM for several components:

  1. $\pi_{\mathrm{LLM}}$: An LLM that generates a thought (hypothesis $\eta$) and a corresponding action $A$, given the interaction history.
  2. $\pi_{\mathrm{ref}}$: A reference policy (e.g., random valid actions) for exploration diversity.
  3. $R_{\mathrm{LLM}}(\eta, A)$: An LLM-based reward mapping that scores action $A$ under hypothesis $\eta$.

Steps:

  1. Sample Hypotheses and Actions:
    • Use $\pi_{\mathrm{LLM}}$ to generate $N$ diverse thought-action pairs (the hypotheses $\hat{\mathcal{H}}_t$ and their initial actions).
    • Use $\pi_{\mathrm{ref}}$ to generate $M$ additional random/exploratory actions. Let $\hat{A}_t$ be the set of all $N+M$ actions.
  2. Thought Cross-Verify:
    • Construct a score matrix $[S_t]_{\eta,a} = R_{\mathrm{LLM}}(\eta, a)$ for all $\eta \in \hat{\mathcal{H}}_t$, $a \in \hat{A}_t$.
  3. Exploitation Step:
    • Find $\hat{A}_t^* = \bigcap_{\eta \in \hat{\mathcal{H}}_t} \arg\max_a [S_t]_{\eta,a}$.
    • If $\hat{A}_t^* \neq \emptyset$ (a consensus optimal action exists), choose $A_{t+1}$ from $\hat{A}_t^*$ (e.g., via tie-breaking).
  4. Exploration Step (if no consensus):
    • Optimistic Hypothesis Selection: Keep the hypotheses $\tilde{\mathcal{H}}_t \subseteq \hat{\mathcal{H}}_t$ whose best actions achieve the highest scores: $\tilde{\mathcal{H}}_t = \arg\max_{\eta \in \hat{\mathcal{H}}_t} (\max_a [S_t]_{\eta,a})$.
    • Action Selection with Re-scoring (Advantage): Select $A_{t+1} = \arg\max_{a \in \hat{A}_t} \big(\max_{\eta \in \tilde{\mathcal{H}}_t} ([S_t]_{\eta,a} - \mathbb{E}_{\tilde{a} \sim \pi_{\mathrm{ref}}}[[S_t]_{\eta,\tilde{a}}])\big)$. This re-centers scores by subtracting, under each hypothesis, the average score of actions drawn from $\pi_{\mathrm{ref}}$, favoring actions with high advantage.
    • Tie-breaking prefers earlier-generated hypotheses/actions (see the numpy sketch below).
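
Steps 2-4 reduce to simple operations on the score matrix. Below is a numpy sketch of the selection logic, assuming the last $M$ columns of the matrix correspond to actions drawn from $\pi_{\mathrm{ref}}$; the function and variable names are illustrative.

```python
import numpy as np

def helix_select(S: np.ndarray, M: int) -> int:
    """Pick the next action index from a score matrix S of shape
    (num_hypotheses, num_actions), whose last M columns come from the
    reference policy. Illustrative sketch of Steps 2-4 only."""
    # Exploitation: intersect each hypothesis's set of best actions.
    best_per_hyp = [set(np.flatnonzero(row == row.max())) for row in S]
    consensus = set.intersection(*best_per_hyp)
    if consensus:
        return int(min(consensus))  # tie-break toward earlier-generated actions

    # Exploration: keep only the most optimistic hypotheses ...
    optimistic = S[S.max(axis=1) == S.max()]

    # ... re-center each row by the mean score of the reference-policy actions
    # (an "advantage"), and pick the action with the largest advantage.
    advantage = optimistic - optimistic[:, -M:].mean(axis=1, keepdims=True)
    return int(np.argmax(advantage.max(axis=0)))  # np.argmax ties break toward earlier indices
```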

Experiments

  • Environments: Wordle (with modified feedback that reports only the first incorrect character), Battleship (2D grid, locate 3 ships), Minesweeper (2D grid, avoid mines).
  • Baselines:
    • Greedy: ReAct-style, generates one hypothesis and one action.
    • HELiX (No exploitation step): Only optimistic exploration.
    • HELiX (No $\pi_{\mathrm{ref}}$): Omits re-scoring with a reference policy.
  • LLM Used: Claude-Sonnet-3.5 v2.
  • Results (Figure 2):
    • HELiX generally outperformed baselines, especially in Battleship and Minesweeper where strategic exploration is crucial.
    • The full HELiX (with the exploitation step and $\pi_{\mathrm{ref}}$) performed best.
    • Greedy LLM performed worse, highlighting the need for structured exploration.
  • Implementation Discussion: The practical algorithm relies on the LLM's ability to:
    • Select optimal actions under a given hypothesis.
    • Produce fair scores across hypotheses.
    • Generate diverse and faithful hypotheses reflecting interaction history.
    • These are strong assumptions needing further validation.

Relationship to Existing Paradigms (Figure 3)

LLF is presented as a general framework subsuming:

  • Reinforcement Learning (RL): Feedback is the scalar reward.
  • Partial Monitoring: Agent observes abstract feedback; LLF handles unknown feedback mappings.
  • Interaction-Guided Learning (IGL): A rich feedback vector decodes a latent reward; LLF models this via $r_\eta$ and $f_\eta$.
  • Reward-Informative LLF: Latent reward is a function of observed (language) feedback.
  • Multi-Objective RL (MORL): Hypotheses can represent vector-valued rewards.
  • Preference-Based RL (PbRL): Feedback is a pairwise preference; $f_\eta$ can be a comparator.
  • Imitation Learning (IL): Expert actions are feedback; verifier loss measures deviation.

Limitations and Open Questions

  • Transfer eluder dimension is not a lower bound: A problem might have unbounded $TE$ but be trivially solvable (e.g., feedback always reveals an optimal action, but not full reward details). HELiX's exploitation step handles this, unlike naive UCB.
  • The true complexity of LLF might lie between worst-case reward identification and optimal behavior learning.
  • Developing a complexity measure that both lower-bounds regret and informs practical LLM-based algorithm design is an open question (e.g., adapting Decision-Estimation Coefficient (DEC) to LLF).

In summary, this paper provides a significant first step towards a principled understanding of learning from general language feedback. It formalizes the problem, introduces a relevant complexity measure (transfer eluder dimension), proposes a provably efficient algorithm (HELiX), and demonstrates its practical viability with LLMs. The work highlights how informative language feedback can be substantially more powerful than scalar rewards and lays theoretical groundwork for designing more sophisticated LLM agents that learn interactively.
