
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities (2504.16078v1)

Published 22 Apr 2025 in cs.LG and cs.AI

Abstract: The success of LLMs has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe, demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as $\epsilon$-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.

Authors (5)
  1. Thomas Schmied (9 papers)
  2. Jörg Bornschein (8 papers)
  3. Jordi Grau-Moya (25 papers)
  4. Markus Wulfmeier (46 papers)
  5. Razvan Pascanu (138 papers)

Summary

This paper investigates why LLMs often perform suboptimally in decision-making tasks despite the potential afforded by their pre-trained knowledge and reasoning capabilities such as Chain-of-Thought (CoT) (Wei et al., 2022). The authors identify and systematically study three prevalent failure modes in small-to-medium scale LLMs (Gemma2 2B, 9B, 27B):

  1. Greediness: LLMs tend to prematurely commit to the best-performing action seen so far, even if only a small portion of the action space has been explored. This leads to stagnating action coverage (up to 55% of actions unexplored in multi-armed bandits (MABs)) and suboptimal cumulative regret; both metrics are sketched after this list. Larger models and CoT reasoning help but do not fully resolve this.
  2. Frequency Bias: Smaller LLMs (e.g., 2B) often copy the most frequent action present in their input context history, regardless of its associated reward. Larger models (e.g., 27B) largely overcome this bias but remain prone to greediness. This bias is suspected to be an artifact of supervised pre-training.
  3. Knowing-Doing Gap: LLMs can often correctly reason about or describe the optimal strategy (e.g., generate a correct CoT rationale for the UCB algorithm, 87% correct in experiments) but fail to translate this knowledge into corresponding actions (e.g., selecting a greedy action 58% of the time even with a correct rationale).
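
To make these metrics concrete, below is a minimal sketch of how cumulative regret and action coverage can be computed for a bandit run; the function and variable names are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def evaluate_bandit_run(chosen_arms, arm_means):
    """Compute cumulative regret and action coverage for one bandit episode.

    chosen_arms: sequence of arm indices selected by the agent (assumed interface).
    arm_means:   true mean reward of each arm, known only to the evaluator.
    """
    arm_means = np.asarray(arm_means, dtype=float)
    best_mean = arm_means.max()

    # Per-step regret is the gap between the best arm's mean and the chosen arm's mean.
    per_step_regret = best_mean - arm_means[np.asarray(chosen_arms)]
    cumulative_regret = per_step_regret.cumsum()

    # Action coverage: fraction of arms tried at least once up to each step.
    seen, coverage = set(), []
    for arm in chosen_arms:
        seen.add(arm)
        coverage.append(len(seen) / len(arm_means))

    return cumulative_regret, np.array(coverage)

# A greedy agent that locks onto the first decent arm leaves most arms unexplored
# and accumulates regret linearly.
regret, cov = evaluate_bandit_run([0, 1, 2, 2, 2, 2, 2, 2], [0.1, 0.3, 0.5, 0.7, 0.9])
print(regret[-1], cov[-1])  # large final regret, only 60% of arms covered
```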

To mitigate these shortcomings, the paper proposes Reinforcement Learning Fine-Tuning (RLFT) using self-generated CoT rationales. The core idea is to fine-tune the LLM policy $\pi_\theta$ using environment rewards obtained from interaction.
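
To ground this setup, here is a minimal sketch of the interaction loop; `llm_generate`, the environment interface, and the context length are illustrative assumptions rather than the paper's implementation, while the generation budget (256 tokens) and validity penalty (-5) follow the values described under RLFT Implementation below.

```python
import re

# Permissive output template: the action is expected as "ACTION=X" in the generation.
ACTION_RE = re.compile(r"ACTION\s*=\s*(\w+)")

def parse_action(z_t: str):
    """g(z_t): extract the action token from the generated CoT + action sequence."""
    match = ACTION_RE.search(z_t)
    return match.group(1) if match else None

def collect_rollout(llm_generate, env, task_instr, out_instr, horizon,
                    gen_budget=256, invalid_penalty=-5.0, context_len=16):
    """Roll out one fixed-horizon episode and return (context, generation, reward) triples."""
    history, trajectory = [], []
    state = env.reset()
    for _ in range(horizon):
        # c_t: task instructions + output-format instructions + recent history + current state.
        c_t = "\n".join([task_instr, out_instr, *map(str, history[-context_len:]), str(state)])
        z_t = llm_generate(c_t, max_tokens=gen_budget)   # CoT rationale z_t^CoT plus action a_t
        a_t = parse_action(z_t)
        if a_t is None:
            reward = invalid_penalty                     # r_t^valid: penalize unparsable output
        else:
            state, reward = env.step(a_t)                # r_t^env from the environment
            history.append((state, a_t, reward))
        trajectory.append((c_t, z_t, reward))
    return trajectory
```

Each collected trajectory is then scored (reward normalization, advantage estimation) and used for a PPO-style update of $\pi_\theta$, as described below.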

RLFT Implementation:

  • Context: The model input $c_t$ concatenates task-specific instructions ($c_t^{in}$), output format instructions ($c_t^{out}$), and the recent interaction history ($c_t^{\tau_{t-C:t}}$, including states, actions, and rewards).
  • Action Generation: The model generates a sequence $z_t$ containing both CoT reasoning ($z_t^{CoT}$) and the actual action $a_t$. A parsing function $g(z_t)$ (using regular expressions) extracts $a_t$. A permissive output template (e.g., ACTION=X) is used, and a generation budget $G$ (default 256 tokens) limits the length of $z_t$.
  • Reward Shaping: Besides the environment reward $r_t^{env}$, a penalty $r_t^{valid}$ (e.g., -5) is applied if $g(z_t)$ fails to extract a valid action from the generated sequence. Environment rewards are normalized.
  • Objective: The fine-tuning uses a PPO-style (Schulman et al., 2017) clipping objective with a KL divergence penalty against a reference policy $\pi_{ref}$ (the frozen pre-trained model) to maintain stability:

    $$L = \min\left(\frac{\pi_\theta(z|c)}{\pi_{\theta_{old}}(z|c)}A_{adv},\ \text{clip}_{\epsilon}\left(\frac{\pi_\theta(z|c)}{\pi_{\theta_{old}}(z|c)}\right)A_{adv}\right) - \beta\, D_{KL}\left(\pi_\theta(\cdot|c)\,\|\,\pi_{ref}(\cdot|c)\right)$$

    • Advantage estimation $A_{adv}$ uses Monte Carlo returns (rewards-to-go) for fixed-horizon tasks (bandits) and Generalized Advantage Estimation (GAE) (Schulman et al., 2015) with a learned value head for variable-length tasks (Tic-tac-toe). A code sketch of this objective follows the list.
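
Below is a sketch of this objective written as a training loss, assuming per-sequence log-probabilities have already been computed under the current, old (rollout), and frozen reference policies; the sample-based KL estimate and the hyperparameter values are simplifying assumptions, not the paper's exact implementation.

```python
import torch

def rlft_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, kl_beta=0.05):
    """PPO-style clipped surrogate with a KL penalty toward the frozen reference policy.

    logp_new:   log pi_theta(z|c) under the policy being trained (requires grad).
    logp_old:   log pi_theta_old(z|c) from the rollout policy (detached).
    logp_ref:   log pi_ref(z|c) from the frozen pre-trained model (detached).
    advantages: A_adv from Monte Carlo returns (bandits) or GAE (Tic-tac-toe).
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)

    # Sample-based estimate of KL(pi_theta || pi_ref); the exact penalty would use
    # the full token-level distributions.
    kl_penalty = logp_new - logp_ref

    # The paper's objective L is maximized, so the loss negates the surrogate
    # and adds the weighted KL term.
    return (kl_beta * kl_penalty - surrogate).mean()
```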

Experiments & Findings:

  • Environments: Gaussian/Bernoulli MABs (5, 10, 20 arms), contextual bandits (MovieLens), and text-based Tic-tac-toe (Ruoss et al., 2 Dec 2024).
  • RLFT Effectiveness: RLFT significantly improves decision-making performance across environments and model sizes (Gemma2 2B, 9B).
    • It lowers cumulative regret compared to in-context learning (ICL) baselines.
    • It mitigates greediness by increasing action coverage (+12% for 2B on 10-arm MABs).
    • It counteracts frequency bias, reducing the selection of frequent suboptimal actions, although the bias isn't entirely eliminated at high repetition counts.
  • Exploration Mechanisms: While RLFT improves exploration, it remains suboptimal compared to specialized algorithms like UCB. Various mechanisms were tested:
    • Try-all: Initial exploration of all arms (like UCB) yielded significant gains, suggesting LLMs perform well if given sufficient information but struggle with exploration itself.
    • Exploration Bonus: A simple reward-shaping term (+1 reward for untried actions during RLFT) significantly improved exploration and reduced regret, highlighting the importance of explicit rewards for desired behaviors; a sketch of this shaping follows the list.
    • Other methods ($\epsilon$-greedy, self-consistency (Wang et al., 2022), self-correction (Kumar et al., 19 Sep 2024)) showed varied effects.
  • Ablations:
    • Tic-tac-toe: RLFT substantially increased win rates against random and MCTS opponents, demonstrating effectiveness in stateful environments. Providing legal actions in the prompt was crucial for high performance.
    • Importance of CoT: RLFT without CoT performed poorly, barely matching ICL with CoT, confirming CoT's role as a vital mechanism for exploration and rationalization during RLFT.
    • Supervised Fine-Tuning (SFT): SFT on expert UCB trajectories (Behavior Cloning - actions only; Thought Cloning - actions + CoT) achieved near-expert performance, showing the effectiveness of expert data when available.
    • "Thinking" Time: Increasing the generation budget GG (e.g., from 256 to 512 tokens) improved performance, allowing the model more "time" to rationalize, but significantly increased computational cost due to longer rollouts in multi-step decision tasks.

Conclusion: The paper demonstrates that LLMs exhibit systematic failures (greediness, frequency bias, knowing-doing gap) in decision-making. RLFT on self-generated CoT rationales effectively mitigates these issues, enhancing exploration and overall performance. However, LLM exploration remains a challenge, often requiring explicit mechanisms or reward shaping for near-optimal behavior. The work underscores the importance of CoT and sufficient generation budget ("thinking time") for RLFT in decision-making contexts.
