LEARN-Opt: LLM Reward Function Optimization
- LEARN-Opt is a modular system for automated reward function design in RL, using LLMs to generate reward candidates directly from natural language descriptions.
- It integrates code synthesis, unsupervised metric derivation, preference-based feedback, and multi-objective optimization to streamline reward design without human-crafted criteria.
- Experimental validations across continuous control, robotics, and multi-objective settings demonstrate its capability to match or outperform traditional manually designed rewards.
LEARN-Opt (LLM-based Evaluator and Analyzer for Reward functioN Optimization) is a modular, end-to-end system for autonomously generating, evaluating, and refining reward functions in reinforcement learning (RL). It leverages LLMs to create reward function candidates and appropriate evaluation metrics directly from high-level natural language task descriptions, removing the traditional reliance on hand-crafted evaluation criteria, preference labels, or access to environment source code. LEARN-Opt integrates code synthesis, unsupervised or preference-based feedback, automated metric derivation, and multi-objective optimization strategies to deliver robust reward design pipelines across various RL settings, including custom continuous control, multi-objective environments, robotics, and preference-based RL.
1. Formal Problem Statement and Mathematical Framework
LEARN-Opt addresses the problem of automated reward function design in RL, automating the traditionally manual process of mapping user requirements or task objectives to RL reward signals. The environment is modeled as a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P$, and discount factor $\gamma$. For multi-objective settings, user or task requirements are individually converted into reward components $r_i(s, a)$, $i = 1, \dots, K$, leading to a composite scalar reward function:
$$R(s, a) = \sum_{i=1}^{K} w_i \, r_i(s, a).$$
The agent's optimization objective is formulated as:
$$\pi^* = \arg\max_{\pi} \, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right],$$
subject to potential constraints on reward composition or task metrics, e.g., $m_j(\pi) \le \tau_j$ for $j = 1, \dots, J$, where each metric $m_j$ encodes requirements such as average delay or collision counts (Xie et al., 4 Sep 2024).
Alternatively, LEARN-Opt is instantiated as a bilevel optimization in settings where the reward function parameters $\theta$ are meta-learned to induce policies maximizing a user- or environment-defined utility:
$$\max_{\theta} \, J_{\text{meta}}\big(\pi^*_{\theta}\big) \quad \text{s.t.} \quad \pi^*_{\theta} = \arg\max_{\pi} \, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t R_{\theta}(s_t, a_t)\right],$$
with the meta-objective $J_{\text{meta}}$ typically operationalized as true environment success (possibly defined via a sparse success signal) (Li et al., 2023).
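To make the additive formulation concrete, the following is a minimal sketch (not the published implementation) of a weighted-sum composite reward with a task-metric constraint check; the component functions, observation layout, and thresholds are hypothetical placeholders.

```python
import numpy as np

# Hypothetical reward components r_i(s, a) derived from two task requirements;
# in LEARN-Opt the actual components are synthesized by the LLM from the task description.
def r_progress(state, action):
    # Placeholder: negative distance of the first two state dimensions to the origin.
    return -float(np.linalg.norm(state[:2]))

def r_safety(state, action):
    # Placeholder: penalty when a hypothetical collision flag (last state entry) is set.
    return -1.0 if state[-1] > 0.5 else 0.0

COMPONENTS = [r_progress, r_safety]
WEIGHTS = np.array([1.0, 5.0])  # w_i, e.g., initialized from LLM-estimated reward statistics

def composite_reward(state, action):
    """R(s, a) = sum_i w_i * r_i(s, a)."""
    return float(sum(w * r(state, action) for w, r in zip(WEIGHTS, COMPONENTS)))

def constraints_satisfied(metrics, thresholds):
    """Check task-level constraints m_j(pi) <= tau_j on rollout metrics
    (e.g., collision counts, average delays)."""
    return all(metrics[name] <= tau for name, tau in thresholds.items())
```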
2. Architecture and Core Modules
LEARN-Opt decomposes into several interacting modules, with variant architectures emphasizing different aspects depending on the target RL setting (a condensed sketch of their interaction follows this list):
- Reward Component or Candidate Generator: Synthesizes reward function candidates (either as weighted sums or as holistic code stubs) from environment/task descriptions using LLM prompting. For multi-objective or modular rewards, individual components are generated for each requirement via contextually-specific prompts (Xie et al., 4 Sep 2024). For black-box settings, an LLM-based mapping agent extracts variable names and constructs vector-to-identifier mappings, and a "generation agent" synthesizes Python code for reward candidates at each iteration (Cardenoso et al., 24 Nov 2025).
- Code Critic or Sanity Checker: Verifies code for syntactic correctness and semantic/functional appropriateness, typically via unit tests and explicit LLM-guided debugging prompts. This mechanism isolates and corrects ambiguous or incorrect reward code (Xie et al., 4 Sep 2024, Li et al., 2023).
- Weight/Parameter Optimizer: For additive reward formulations, LEARN-Opt initializes component weights using LLM-estimated reward statistics and iteratively adjusts by guided mutations and crossovers (akin to genetic algorithms), with step sizes informed by the degree of constraint satisfaction or log-derived feedback (Xie et al., 4 Sep 2024).
- Training Log Analyzer / Unsupervised Metric Deriver: After candidate execution, this module extracts key empirical metrics (e.g., collision counts, average delays) and compares them to threshold criteria, suggesting targeted weight updates (Xie et al., 4 Sep 2024). Other instantiations synthesize evaluation metrics directly from text via "planner" and "coder" LLMs, which generate both metrics and ranking functions for candidate assessment (Cardenoso et al., 24 Nov 2025).
- Multi-Run/Variance Control Loop: Given the high variance in stochastic reward search, LEARN-Opt employs multi-run pipelines, executing the candidate generation–evaluation–selection sequence $N$ times with varied LLM sampling temperatures. The system reports both "best-of-$N$" and "average-of-$N$" test performances to ensure robust policy selection (Cardenoso et al., 24 Nov 2025).
- Preference-Based Modules: In settings lacking direct reward supervision, LEARN-Opt can integrate LLM-based preference ranking of trajectories. Here, LLMs compare trajectory summaries or abstractions, providing training data for learning reward predictors that are then used as shaping functions in downstream RL (Shen et al., 28 Jun 2024, Sun et al., 18 Oct 2024).
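A condensed sketch of how these modules might interact within a single search run; `llm`, `train_policy`, and `analyze_logs` are assumed stand-ins for the LLM generation agent, the RL trainer, and the log analyzer, not the published implementation.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    code: str           # LLM-generated reward function source
    score: float = 0.0  # score assigned by the analyzer

def run_single_search(task_description, llm, train_policy, analyze_logs,
                      n_iterations=5, n_candidates=4):
    """Illustrative generation -> sanity check -> training -> analysis loop.
    `llm`, `train_policy`, and `analyze_logs` are assumed callables."""
    best = None
    feedback = ""
    for _ in range(n_iterations):
        # 1. Candidate generation from the task description (plus prior feedback).
        candidates = [
            Candidate(code=llm(f"Write a reward function.\nTask: {task_description}\nFeedback: {feedback}"))
            for _ in range(n_candidates)
        ]
        for cand in candidates:
            # 2. Sanity check: reject syntactically invalid code early.
            try:
                compile(cand.code, "<reward>", "exec")
            except SyntaxError:
                continue
            # 3. Train a policy under the candidate reward and collect logs.
            logs = train_policy(cand.code)
            # 4. The analyzer derives metrics from logs, scores the candidate,
            #    and produces summarized feedback for the next iteration.
            cand.score, feedback = analyze_logs(logs, task_description)
            if best is None or cand.score > best.score:
                best = cand
    return best
```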
3. Unsupervised Metric Derivation and Evaluation Strategies
LEARN-Opt’s principal innovation compared to prior work is the derivation of performance evaluation criteria directly from task descriptions, eliminating the need for human-specified metrics or internal simulator access. The unsupervised evaluation loop comprises:
- Text-to-Metric Synthesis: A "planner" LLM reasons from task descriptions about suitable numerical metrics (e.g., "minimize distance to [0,0,0]"), and a companion LLM generates corresponding code (e.g., mean position distance function) (Cardenoso et al., 24 Nov 2025).
- Council/Ensemble Analyzer Architecture: Multiple independent LLM-based analyzers are instantiated, each producing metrics and candidate ranking methods. Majority voting across this council is used to robustly select the best candidate reward function (Cardenoso et al., 24 Nov 2025).
- Preference-Based Evaluators: For RL tasks where only high-level success/failure or human preferences are available, LLMs rank or rate agent-generated trajectories. These ranked preferences are used to learn reward predictors via cross-entropy losses aligned to the LLM-predicted preference ordering (Shen et al., 28 Jun 2024, Sun et al., 18 Oct 2024).
- Trajectory Preference Evaluation (TPE): An order-preservation criterion is established, requiring that the lowest-scoring successful trajectory (by per-step return) exceeds the highest-scoring failed trajectory under the candidate reward function. Fractional satisfaction of this criterion guides iterative candidate refinement (Sun et al., 18 Oct 2024); see the sketch below.
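A minimal sketch of the TPE order-preservation check; the pairwise reading of "fractional satisfaction" is an assumption based on the description above, and trajectories are represented simply as lists of per-step rewards under the candidate reward.

```python
def trajectory_score(per_step_rewards):
    """Per-step (average) return of a trajectory under the candidate reward."""
    return sum(per_step_rewards) / max(len(per_step_rewards), 1)

def tpe_strict(successful, failed):
    """Strict criterion from the text: the lowest-scoring successful trajectory
    must exceed the highest-scoring failed trajectory."""
    return min(map(trajectory_score, successful)) > max(map(trajectory_score, failed))

def tpe_satisfaction(successful, failed):
    """Assumed fractional version: the fraction of (success, failure) pairs whose
    ordering is preserved; 1.0 recovers the strict criterion."""
    if not successful or not failed:
        return 1.0
    pairs = [(s, f) for s in successful for f in failed]
    preserved = sum(trajectory_score(s) > trajectory_score(f) for s, f in pairs)
    return preserved / len(pairs)
```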
4. Experimental Validation and Performance
LEARN-Opt has been validated across a spectrum of RL benchmarks, including continuous control (IsaacLab Cartpole, Quadcopter, Ant, Humanoid, Franka-Cabinet), discrete MiniGrid tasks, manipulation (Meta-World, ManiSkill2), and game domains:
| Benchmark Domain | Baseline | Best-of-N GP | Findings and Highlights |
|---|---|---|---|
| IsaacLab continuous tasks | EUREKA | +0.15–+0.80 | Outperforms or matches EUREKA for Cartpole, Ant, Franka; comparable on others |
| MiniGrid (sparse RL) | PPO | +45–+70 pts | LEARN-Opt reaches 90% success in <50k steps (vs. 120k for PPO), handles constraints |
| Meta-World, ManiSkill2 | Human | 85% | CARD/LEARN-Opt outperform or match oracle on 10/12 tasks, far fewer LLM tokens |
| Robotics (Isaac Sim) | Manual | 85–100% | LLM-refined reward functions match or exceed manually designed ones |
Best-of-N statistics consistently outperform mean-of-N (the "average" GP is slightly negative), demonstrating the necessity of multiple independent search runs given the inherent variance of reward search (Cardenoso et al., 24 Nov 2025). Detailed per-task metrics, role-specific ablations (e.g., removal of the council modules), and comparisons of small (e.g., gpt-4.1-nano) versus large LLMs are all reported in the original studies.
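The reporting protocol behind these statistics reduces to a simple aggregation over the $N$ independent runs; a minimal sketch, assuming each run yields one final test score for its selected policy:

```python
import statistics

def aggregate_runs(run_scores):
    """Aggregate per-run test performance into the two reported statistics.
    `run_scores` holds one final test score per independent search run."""
    return {
        "best_of_N": max(run_scores),
        "average_of_N": statistics.mean(run_scores),
        "std_of_N": statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0,
    }

# Example: high run-to-run variance makes best-of-N much stronger than the mean.
scores = [0.12, -0.05, 0.81, 0.33]  # hypothetical per-run scores
print(aggregate_runs(scores))
```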
5. Prompting Strategies and Design Principles
Prompt engineering in LEARN-Opt is highly modular and self-contained. Key design patterns include:
- Explicit, Numerically-Grounded Prompts: Instructions always specify API field access, reward sign, and return type to reduce ambiguity and ensure verifiability (Xie et al., 4 Sep 2024).
- Separate System/User Roles: System provides environment and API context; user prompts solicit reward code or corrections for specific components.
- Feedback Summarization: Instead of pasting full logs, concise metric deviations (e.g., "component 2 is 35% above target") keep contexts small and actionable.
- Preference Ranking: LLMs are always queried with clear options (e.g., "If first is better return 0, else 1, if equal 2"), and loss functions encode softmax or cross-entropy over these discrete choices (Shen et al., 28 Jun 2024).
- Mutational Search and Refinement: Iterative weight mutation steps (±10%–100%), directed crossovers, and explicit per-metric scaling suggestions drive efficient convergence (Xie et al., 4 Sep 2024); see the sketch after this list.
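A minimal sketch of the mutation-and-crossover weight search described above; the population handling and step sizes are illustrative, not the published optimizer.

```python
import random

def mutate_weights(weights, scale=0.1):
    """Perturb each weight by a relative step in [-scale, +scale]
    (the text describes steps of roughly +/-10% up to +/-100%)."""
    return [w * (1.0 + random.uniform(-scale, scale)) for w in weights]

def crossover(parent_a, parent_b):
    """Directed crossover: take each weight from one of the two parents."""
    return [random.choice(pair) for pair in zip(parent_a, parent_b)]

def search_step(population, fitness, keep=2):
    """One illustrative generation: keep the fittest weight vectors, then refill
    the population with mutated crossovers of the survivors."""
    survivors = sorted(population, key=fitness, reverse=True)[:keep]
    children = []
    while len(survivors) + len(children) < len(population):
        a, b = random.sample(survivors, 2) if len(survivors) > 1 else (survivors[0], survivors[0])
        children.append(mutate_weights(crossover(a, b)))
    return survivors + children
```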
LLM temperature and sampling hyperparameters are specified (e.g., temperature=0.5, top_p=1.0) to balance creativity and reproducibility (Xie et al., 4 Sep 2024). The prompt structure explicitly accommodates task-specific adaptation, enabling domain transfer across robotics, games, and simulated control environments (Cardenoso et al., 24 Nov 2025, Li et al., 2023).
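Putting these conventions together, a hedged sketch of how a component-level prompt might be assembled; the message fields, environment API names, and `call_llm` helper are illustrative assumptions, not the published prompts.

```python
# Hypothetical system prompt: environment/API context lives in the system role.
SYSTEM_PROMPT = """You are a reward-design assistant for an RL environment.
Environment API (illustrative): obs['ee_pos'] (3-vector), obs['collision'] (bool).
Return a Python function `reward(obs, action) -> float`."""

def build_user_prompt(task_description, feedback_summary=None):
    """User prompt soliciting reward code for one component, with summarized
    (not raw-log) training feedback appended on later iterations."""
    prompt = f"Task: {task_description}\nWrite the reward component as specified."
    if feedback_summary:
        prompt += f"\nFeedback from the last run: {feedback_summary}"
    return prompt

def generate_reward_code(call_llm, task_description, feedback_summary=None):
    # `call_llm` stands in for whatever chat-completion client is used,
    # with the sampling settings quoted in the text.
    return call_llm(
        system=SYSTEM_PROMPT,
        user=build_user_prompt(task_description, feedback_summary),
        temperature=0.5,
        top_p=1.0,
    )
```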
6. Strengths, Limitations, and Future Directions
Strengths:
- Fully autonomous reward synthesis and evaluation from natural language only, with no need for preliminary evaluation metrics, environment source, or human-in-the-loop feedback (Cardenoso et al., 24 Nov 2025).
- Model-agnostic: applicable with both large and "lite" LLMs; council ensemble architectures permit effective deployment on smaller, cheaper LLMs (Cardenoso et al., 24 Nov 2025).
- Efficient convergence in both well-balanced and severely imbalanced weight-initialization settings, with recovery from extreme initial weight assignments within 5–6 iterations (Xie et al., 4 Sep 2024).
Limitations:
- Performance may degrade on extremely high-dimensional or complex dynamical systems, particularly where zero-shot metric derivation is ambiguous (Cardenoso et al., 24 Nov 2025).
- LLM hallucination remains a risk for both candidate reward code and evaluation metric synthesis; partially mitigated by code sanity checks and council voting (Cardenoso et al., 24 Nov 2025).
- High variance in candidate quality; average candidate may fail, requiring "multi-run" best-of-N search (Cardenoso et al., 24 Nov 2025).
- Some bias or instability in LLM-guided preference ranking, particularly under ambiguous trajectory summaries or domain shift (Shen et al., 28 Jun 2024).
Future Directions:
- Integration with vision-LLMs to provide richer, multimodal evaluation and reward shaping.
- Hierarchical, curriculum-based reward decomposition for extremely complex or long-horizon tasks.
- Bayesian optimization or active learning for prompt tuning and to minimize required LLM queries (Cardenoso et al., 24 Nov 2025).
- Inclusion of structured expert constraints and safety analysis directly into the reward design loop (Sun et al., 18 Oct 2024).
7. Comparative Analysis with Prior Work
LEARN-Opt generalizes and extends recently proposed LLM-based reward design frameworks such as EUREKA, Auto MC-Reward, OCALM, and CARD:
| System | Input | Metric Derivation | Feedback/Refinement | Black-box Eval. | Human-in-Loop |
|---|---|---|---|---|---|
| LEARN-Opt (Cardenoso et al., 24 Nov 2025, Xie et al., 4 Sep 2024) | Natural language task, no source code | LLM synthesizes metrics/code | LLM analytic council, iterative search | Yes | No |
| CARD (Sun et al., 18 Oct 2024) | Python-style abstraction + NL | Offline TPE, sub-rewards | RL+TPE skip, preference prompt | Yes | No |
| OCALM (Kaufmann et al., 24 Jun 2024) | Object-centric abstraction | Direct via LLM, relational | Symbolic program, interpretable | No | No |
| Auto MC-Reward (Li et al., 2023) | Task description + obs schema | RL trajectory analyzer | LLM reviewer/analyzer | Yes | No |
| Preference LLM4PG (Shen et al., 28 Jun 2024) | Env/task + constraints (NL) | LLM preference ranking | Reward learned to match preferences | Yes | No |
By unifying code generation, metric synthesis, iterative optimization, and council-based evaluation in a general framework, LEARN-Opt represents the current state of the art in fully autonomous, scalable reward engineering for RL, spanning custom continuous control, complex multi-objective settings, and black-box game environments.
References: (Xie et al., 4 Sep 2024, Cardenoso et al., 24 Nov 2025, Shen et al., 28 Jun 2024, Kaufmann et al., 24 Jun 2024, Li et al., 2023, Sun et al., 18 Oct 2024).