Provably Learning from Language Feedback (2506.10341v1)
Abstract: Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of LLM agents. While impressive empirical demonstrations have been shown, so far a principled framing of these decision problems remains lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a complexity measure to characterize the hardness of LLF problems. We show that transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.
Summary
- The paper introduces a framework where agents learn from natural language feedback instead of scalar rewards to guide sequential decision-making.
- It presents the HELiX algorithm, which integrates LLMs to balance exploration and exploitation, achieving provable regret bounds.
- The study defines the Transfer Eluder Dimension to quantify how informative feedback is, highlighting the advantage of language feedback over learning from scalar rewards alone.
This paper introduces a formal framework for Learning from Language Feedback (LLF), a scenario where an agent learns to make sequential decisions by receiving natural language feedback instead of traditional scalar rewards. The core idea is to understand when and how agents can learn effectively in such settings, particularly with the advent of LLMs that can interpret and generate rich textual feedback.
Formal Setup of LLF
The LLF problem is defined as follows:
- At each time step $t$, the agent takes an action $A_t$ from a finite action set $\mathcal{A}$.
- It receives language feedback $O_t$ from a feedback space $\mathcal{O}$ (sequences of tokens), sampled from the true feedback distribution $f^*(A_t)$.
- A reward $R_t = r^*(A_t)$ is incurred according to a true reward function $r^*$, but this reward is never observed by the agent.
- The agent's goal is to minimize regret: $\mathrm{Regret}(T) = \sum_{t=0}^{T-1}\big(R^*_{\max} - \mathbb{E}_{\pi_t}[R_t]\big)$, where $R^*_{\max} = \max_{a\in\mathcal{A}} r^*(a)$. (A minimal code sketch of this interaction loop appears after this list.)
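To make the protocol concrete, here is a minimal sketch of the LLF interaction loop in Python. The `NumberGuessEnv` toy environment, its feedback strings, and the `LLFAgent` interface are hypothetical illustrations, not constructions from the paper; the key point is that the agent only ever sees the feedback string, while regret is computed against the hidden reward.

```python
from typing import Protocol

class LLFAgent(Protocol):
    def act(self) -> str: ...
    def observe(self, action: str, feedback: str) -> None: ...

class NumberGuessEnv:
    """Toy LLF environment: the latent reward is never shown to the agent;
    only a natural-language feedback string is returned."""
    def __init__(self, target: int = 7, n_actions: int = 10):
        self.target = target
        self.actions = [str(i) for i in range(n_actions)]

    def reward(self, action: str) -> float:   # r*(a), hidden from the agent
        return 1.0 if int(action) == self.target else 0.0

    def feedback(self, action: str) -> str:   # one sample from f*(a)
        if int(action) == self.target:
            return "Correct."
        return ("Too low, try a larger number." if int(action) < self.target
                else "Too high, try a smaller number.")

def run_llf(env: NumberGuessEnv, agent: LLFAgent, T: int) -> float:
    """Run T rounds of the LLF protocol and return the agent's (latent) regret."""
    r_max = max(env.reward(a) for a in env.actions)   # R*_max, for bookkeeping only
    regret = 0.0
    for _ in range(T):
        a = agent.act()
        agent.observe(a, env.feedback(a))   # the agent only ever sees language
        regret += r_max - env.reward(a)     # regret is tallied outside the agent
    return regret
```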
To model the environment, the paper introduces:
- Text Hypotheses ($\mathcal{H}$): A (potentially very large) space of text descriptions $\eta \in \mathcal{H} \subset \mathcal{T}^+$. Each hypothesis $\eta$ can describe the problem, the mechanism generating feedback, or the underlying rules of the environment.
- Reward Mapping ($\eta \mapsto r_\eta$): A known mapping that takes a hypothesis $\eta$ and outputs a reward function $r_\eta : \mathcal{A} \to [0,1]$. The agent has access to this mapping (Assumption 1), which can be implemented by an LLM that processes $\eta$ to predict rewards for actions. The true environment operates under an unknown true hypothesis $\eta^*$.
- Feedback Mapping ($\eta \mapsto f_\eta$): A mapping from a hypothesis $\eta$ to a feedback function $f_\eta : \mathcal{A} \to \Delta(\mathcal{O})$. The agent does not know this mapping. (A sketch of an LLM-based reward mapping follows this list.)
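As a rough illustration of Assumption 1, the reward mapping $\eta \mapsto r_\eta$ could be realized by prompting an LLM to score an action under a text hypothesis. The `llm_complete` function and the prompt wording below are hypothetical placeholders, not the paper's implementation:

```python
def reward_mapping(llm_complete, hypothesis: str, action: str) -> float:
    """Sketch of r_eta(a): ask an LLM to score `action` in [0, 1] assuming that
    `hypothesis` correctly describes the environment. `llm_complete` is any
    text-in/text-out completion function (a hypothetical stand-in, not a specific API)."""
    prompt = (
        "Assume the following description of the environment is true:\n"
        f"{hypothesis}\n\n"
        f"On a scale from 0 to 1, how good is the action '{action}'? "
        "Answer with a single number."
    )
    try:
        return min(1.0, max(0.0, float(llm_complete(prompt).strip())))
    except ValueError:
        return 0.0  # fall back if the model does not return a parseable number
```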
Measuring Information in Feedback and Key Assumptions
To enable learning, the following are crucial:
- Verifier (Assumption 2): The agent has access to a verifier, defined by a loss function $\ell : \mathcal{A}\times\mathcal{O}\times\mathcal{H} \to [0,1]$. Here $\ell(a,o,\eta)$ quantifies how well hypothesis $\eta$ aligns with feedback $o$ for action $a$: it is $0$ when they are consistent and positive otherwise. This can be implemented by prompting an LLM to assess semantic consistency (a toy programmatic instantiation is sketched after this list).
- Unbiased Feedback (Assumption 3): For any action $a$ and hypothesis $\eta$, it is assumed that $\eta \in \arg\min_{\eta'\in\mathcal{H}} \mathbb{E}_{O\sim f_\eta(a)}[\ell(a,O,\eta')]$. In other words, on average a hypothesis $\eta$ best explains the feedback $f_\eta(a)$ it generates, even when that feedback is noisy.
- Hypothesis Equivalence: Two hypotheses $\eta, \eta'$ are equivalent if $d_\mathcal{H}(\eta,\eta') = \sup_{a,o} |\ell(a,o,\eta) - \ell(a,o,\eta')| = 0$.
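Continuing the toy number-guessing example, the verifier loss $\ell(a,o,\eta)$ can be instantiated programmatically when hypotheses have a known structure; in the paper's general setting an LLM prompt would play this role. The string-matching rule below is an illustrative assumption:

```python
def verifier_loss(action: str, feedback: str, hypothesis_target: int) -> float:
    """l(a, o, eta) for the toy number-guessing domain: 0 if the hypothesis
    'the target is hypothesis_target' is consistent with the feedback string
    observed for `action`, and 1 otherwise."""
    a = int(action)
    if "Correct" in feedback:
        return 0.0 if a == hypothesis_target else 1.0
    if "Too low" in feedback:
        return 0.0 if a < hypothesis_target else 1.0
    if "Too high" in feedback:
        return 0.0 if a > hypothesis_target else 1.0
    return 1.0  # unrecognized feedback: treat as inconsistent
```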
Transfer Eluder Dimension
To quantify learning difficulty, the paper introduces the Transfer Eluder Dimension, $\mathrm{TE}(\mathcal{H},\ell,\epsilon)$.
- An action $a$ is $\epsilon$-transfer dependent on prior actions $\{a_1,\dots,a_n\}$ if any two hypotheses $\eta, \eta'$ that are "close" in terms of expected verifier loss on $\{a_1,\dots,a_n\}$, i.e., $\sum_{i=1}^{n}\big(\mathbb{E}_{o\sim f_{\eta'}(a_i)}[\ell(a_i,o,\eta)] - \ell^{\min}_{\eta'}(a_i)\big) \le \epsilon^2$, also predict similar rewards for $a$, i.e., $|r_\eta(a) - r_{\eta'}(a)| \le \epsilon$.
- $\mathrm{TE}(\mathcal{H},\ell,\epsilon)$ is the length of the longest sequence of actions in which each action is $\epsilon'$-transfer independent of its predecessors for some $\epsilon' \ge \epsilon$.
- A smaller transfer eluder dimension implies that feedback is more efficient at reducing uncertainty about rewards.
- Informative feedback can exponentially reduce learning complexity. For example, when guessing an $L$-bit string (checked concretely in the script after this list):
- Reward-only feedback: $O(2^L)$ complexity.
- Bitwise correctness feedback: $\mathrm{TE}(\mathcal{H},\ell,\epsilon) = 1$.
- If feedback is reward-informative (Definition 3: reward differences are lower-bounded by expected verifier-loss differences), then $\mathrm{TE}(\mathcal{H},\ell,\epsilon) \le \dim_E(\mathcal{R}_\mathcal{H},\epsilon)$, where $\dim_E$ is the standard eluder dimension of the reward class $\mathcal{R}_\mathcal{H}$. This means LLF is no harder than RL whenever the feedback contains reward information.
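The bit-string example can be checked directly: bitwise correctness feedback from a single guess pins down every bit, whereas reward-only feedback forces enumeration in the worst case. The script below is a toy illustration of this gap, not the paper's formal construction:

```python
import itertools
import random

L = 12
target = tuple(random.randint(0, 1) for _ in range(L))

# Reward-only feedback: the agent only learns "right/wrong" per guess,
# so in the worst case it must enumerate all 2^L strings.
queries = 0
for guess in itertools.product((0, 1), repeat=L):
    queries += 1
    if guess == target:
        break
print(f"reward-only: found after {queries} guesses (worst case {2**L})")

# Bitwise correctness feedback: one guess's feedback reveals which bits are
# wrong, so the agent can flip them and succeed on the second guess.
first_guess = tuple(0 for _ in range(L))
feedback = tuple(int(g == t) for g, t in zip(first_guess, target))  # 1 = bit is correct
second_guess = tuple(g if ok else 1 - g for g, ok in zip(first_guess, feedback))
print(f"bitwise feedback: solved in 2 guesses, correct = {second_guess == target}")
```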
HELiX Algorithm
The paper proposes HELiX (Hypothesis Elimination using Language-informed Exploration), a UCB-style algorithm.
Theoretical Version (Algorithm 1):
- Initialization: Start with the full hypothesis space $\mathcal{H}_0 = \mathcal{H}$.
- Iteration $t$:
- Observe feedback $O_{t-1}$ for action $A_{t-1}$.
- Update Confidence Set: $\mathcal{H}_t \leftarrow \mathcal{H}_{t-1} \cap \big\{\eta \in \mathcal{H} : \tfrac{1}{t}\sum_{i} \ell(A_i,O_i,\eta) - \min_{\eta'\in\mathcal{H}} \tfrac{1}{t}\sum_{i} \ell(A_i,O_i,\eta') \le \epsilon_t\big\}$. This retains hypotheses consistent with the observed feedback history.
- Check for Exploitation: Compute a pessimistic policy $\pi^p$ by solving $\min_{\pi} \max_{\eta\in\mathcal{H}_t} \big[r_\eta(\pi_\eta) - r_\eta(\pi)\big]$, where $\pi_\eta$ is the optimal policy for hypothesis $\eta$.
- If the minimax regret is $0$ (i.e., there is a consensus optimal action across all $\eta\in\mathcal{H}_t$), play $A_t \sim \pi^p(\cdot)$ (Exploitation).
- Else (Exploration): Find an optimistic policy $\pi^o$ and hypothesis $\eta^o$ via $\arg\max_{\pi} \max_{\eta\in\mathcal{H}_t} r_\eta(\pi)$. Play $A_t \sim \pi^o(\cdot)$.
Theoretical Guarantee (Theorem 1): HELiX achieves regret $O\big(T^{3/4}\,(\log N(\mathcal{H},\epsilon_T^{\mathcal{H}},d_{\mathcal{H}}))^{1/4}\,\mathrm{TE}(\mathcal{H},\ell,\epsilon_T^{\mathcal{H}})\big)$, where $N(\cdot)$ is a covering number of the hypothesis class. The $T^{3/4}$ rate reflects the minimal assumptions placed on $\ell$; under stronger assumptions (e.g., convexity), an $O(\sqrt{T})$ rate is achievable. A minimal code sketch of the loop follows.
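Below is a sketch of one round of the theoretical algorithm over a finite hypothesis class, assuming the verifier loss, the reward mapping, and per-hypothesis optimal actions can all be evaluated exactly (as Algorithm 1 presumes); the function and argument names are hypothetical:

```python
import random

def helix_step(hypotheses, history, actions, loss, reward, eps_t):
    """One round of (theoretical) HELiX over a finite hypothesis set.
    history: list of (action, observation) pairs seen so far.
    loss(a, o, eta): verifier loss; reward(eta, a): reward mapping r_eta(a)."""
    # 1. Confidence set: keep hypotheses whose average verifier loss is
    #    within eps_t of the best hypothesis's average loss.
    def avg_loss(eta):
        return sum(loss(a, o, eta) for a, o in history) / max(len(history), 1)
    best = min(avg_loss(eta) for eta in hypotheses)
    confident = [eta for eta in hypotheses if avg_loss(eta) - best <= eps_t]

    # 2. Exploitation check: is some action optimal under every surviving hypothesis?
    def optimal_actions(eta):
        m = max(reward(eta, a) for a in actions)
        return {a for a in actions if reward(eta, a) == m}
    consensus = set(actions)
    for eta in confident:
        consensus &= optimal_actions(eta)
    if consensus:
        return random.choice(sorted(consensus)), confident   # exploit

    # 3. Otherwise, optimism: play the action with the highest reward under
    #    the most optimistic surviving hypothesis.
    action = max(actions, key=lambda a: max(reward(eta, a) for eta in confident))
    return action, confident
```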
Practical Implementation with LLMs (Algorithm 2, Figure 1 & 4):
This version uses an LLM for several components:
- $\pi_{\mathrm{LLM}}$: An LLM that generates a thought (hypothesis $\eta$) and a corresponding action $A$, given the interaction history.
- $\pi_{\mathrm{ref}}$: A reference policy (e.g., random valid actions) used for exploration diversity.
- $R_{\mathrm{LLM}}(\eta, A)$: An LLM-based reward mapping that scores action $A$ under hypothesis $\eta$. (The interfaces of these three components are sketched below.)
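In code, these components reduce to a small set of callables; the signatures below are one hypothetical reading of the setup, not the paper's actual implementation:

```python
from typing import Callable, Protocol

class PolicyLLM(Protocol):
    def __call__(self, history: list[tuple[str, str]]) -> tuple[str, str]:
        """Given the (action, feedback) history, return a (thought, action) pair."""
        ...

# pi_ref: any cheap source of exploratory actions, e.g. uniformly random valid moves.
ReferencePolicy = Callable[[], str]

# R_LLM(eta, a): scores action `a` under hypothesis (thought) `eta`, in [0, 1].
RewardModel = Callable[[str, str], float]
```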
Steps:
- Sample Hypotheses and Actions:
- Use $\pi_{\mathrm{LLM}}$ to generate $N$ diverse thought-action pairs (giving the hypothesis set $\hat{\mathcal{H}}_t$ and initial actions).
- Use $\pi_{\mathrm{ref}}$ to generate $M$ additional random/exploratory actions. Let $\hat{\mathcal{A}}_t$ be the set of all $N+M$ actions.
- Thought Cross-Verify:
- Construct a score matrix $[S_t]_{\eta,a} = R_{\mathrm{LLM}}(\eta, a)$ for all $\eta \in \hat{\mathcal{H}}_t$, $a \in \hat{\mathcal{A}}_t$.
- Exploitation Step:
- Find $\hat{\mathcal{A}}_t^* = \bigcap_{\eta\in\hat{\mathcal{H}}_t} \arg\max_{a} [S_t]_{\eta,a}$.
- If $\hat{\mathcal{A}}_t^* \neq \emptyset$ (a consensus optimal action exists), choose $A_{t+1}$ from $\hat{\mathcal{A}}_t^*$ (e.g., via tie-breaking).
- Exploration Step (if no consensus):
- Optimistic Hypothesis Selection: Keep the hypotheses $\tilde{\mathcal{H}}_t \subseteq \hat{\mathcal{H}}_t$ whose best actions achieve the highest scores: $\tilde{\mathcal{H}}_t = \arg\max_{\eta\in\hat{\mathcal{H}}_t} \big(\max_{a} [S_t]_{\eta,a}\big)$.
- Action Selection with Re-scoring (Advantage): Select $A_{t+1} = \arg\max_{a\in\hat{\mathcal{A}}_t} \Big(\max_{\eta\in\tilde{\mathcal{H}}_t} \big([S_t]_{\eta,a} - \mathbb{E}_{\tilde{a}\sim\pi_{\mathrm{ref}}}[[S_t]_{\eta,\tilde{a}}]\big)\Big)$. This re-centers scores by subtracting the average score of actions drawn from $\pi_{\mathrm{ref}}$ under each hypothesis, favoring actions with high advantage (a code sketch of this selection logic follows the list).
- Tie-breaking prefers earlier-generated hypotheses/actions.
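Given the score matrix $S_t$, the exploitation/exploration selection reduces to a few lines of array logic. The sketch below assumes the scores are already computed (rows indexed by hypotheses, columns by actions) and that the last `m_ref` columns correspond to actions drawn from $\pi_{\mathrm{ref}}$; this layout is an implementation assumption for illustration.

```python
import numpy as np

def select_action(S: np.ndarray, m_ref: int) -> int:
    """Choose the next action index given score matrix S of shape
    (num_hypotheses, num_actions), where the last m_ref columns are
    reference-policy actions used only for re-centering scores."""
    # Exploitation: actions that are argmax for every hypothesis simultaneously.
    per_hyp_argmax = [set(np.flatnonzero(row == row.max())) for row in S]
    consensus = set.intersection(*per_hyp_argmax)
    if consensus:
        return min(consensus)                      # tie-break toward earlier actions

    # Exploration: keep the optimistic hypotheses (highest best-action score) ...
    best_per_hyp = S.max(axis=1)
    optimistic = np.flatnonzero(best_per_hyp == best_per_hyp.max())

    # ... and pick the action with the highest re-centered score (advantage),
    # subtracting each hypothesis's average score over reference actions.
    baseline = S[:, -m_ref:].mean(axis=1, keepdims=True)
    advantage = (S - baseline)[optimistic].max(axis=0)
    return int(np.argmax(advantage))               # argmax also prefers earlier ties
```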
Experiments
- Environments: Wordle (with modified feedback: only the first incorrect character is reported), Battleship (2D grid, locate 3 ships), Minesweeper (2D grid, avoid mines).
- Baselines:
- Greedy: ReAct-style, generates one hypothesis and one action.
- HELiX (no exploitation step): Only optimistic exploration.
- HELiX (no $\pi_{\mathrm{ref}}$): Omits re-scoring with a reference policy.
- LLM Used: Claude-Sonnet-3.5 v2.
- Results (Figure 2):
- HELiX generally outperformed baselines, especially in Battleship and Minesweeper where strategic exploration is crucial.
- The full HELiX (with the exploitation step and $\pi_{\mathrm{ref}}$) performed best.
- Greedy LLM performed worse, highlighting the need for structured exploration.
- Implementation Discussion: The practical algorithm relies on the LLM's ability to:
- Select optimal actions under a given hypothesis.
- Produce fair scores across hypotheses.
- Generate diverse and faithful hypotheses reflecting interaction history.
- These are strong assumptions needing further validation.
Relationship to Existing Paradigms (Figure 3)
LLF is presented as a general framework subsuming:
- Reinforcement Learning (RL): Feedback is the scalar reward (a minimal sketch of this reduction follows the list).
- Partial Monitoring: Agent observes abstract feedback; LLF handles unknown feedback mappings.
- Interaction-Guided Learning (IGL): A rich feedback vector decodes the latent reward; LLF models this via $r_\eta$ and $f_\eta$.
- Reward-Informative LLF: Latent reward is a function of observed (language) feedback.
- Multi-Objective RL (MORL): Hypotheses can represent vector-valued rewards.
- Preference-Based RL (PbRL): Feedback is a pairwise preference; $f_\eta$ can be a comparator.
- Imitation Learning (IL): Expert actions are feedback; verifier loss measures deviation.
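For instance, the RL special case can be written down directly: the "language" feedback is just the reward rendered as text, and the verifier compares it with the hypothesis's predicted reward. The squared-error loss below is an illustrative choice, not the paper's:

```python
def rl_as_llf_feedback(reward_fn, action: str) -> str:
    """f*(a) for the RL special case: the 'language' feedback is just the reward."""
    return f"reward: {reward_fn(action)}"

def rl_as_llf_verifier(action: str, feedback: str, predicted_reward: float) -> float:
    """l(a, o, eta): how far the hypothesis's predicted reward r_eta(a) is from
    the reward reported in the feedback string (squared error, clipped to [0, 1])."""
    observed = float(feedback.split(":")[1])
    return min(1.0, (observed - predicted_reward) ** 2)
```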
Limitations and Open Questions
- Transfer eluder dimension is not a lower bound: A problem might have unbounded TE but be trivially solvable (e.g., feedback always reveals an optimal action, but not full reward details). HELiX's exploitation step handles this, unlike naive UCB.
- The true complexity of LLF might lie between worst-case reward identification and optimal behavior learning.
- Developing a complexity measure that both lower-bounds regret and informs practical LLM-based algorithm design is an open question (e.g., adapting Decision-Estimation Coefficient (DEC) to LLF).
In summary, this paper provides a significant first step towards a principled understanding of learning from general language feedback. It formalizes the problem, introduces a relevant complexity measure (transfer eluder dimension), proposes a provably efficient algorithm (HELiX), and demonstrates its practical viability with LLMs. The work highlights how informative language feedback can be substantially more powerful than scalar rewards and lays theoretical groundwork for designing more sophisticated LLM agents that learn interactively.