Example-Guided Reinforcement Learning

Updated 10 February 2026
  • Example-guided reinforcement learning is a methodology that leverages external examples—such as trajectories, temporal logic specifications, and language model outputs—to guide agents in tackling complex, reward-sparse tasks.
  • It integrates diverse forms of guidance to overcome challenges in manual reward engineering and autonomous exploration, thereby enhancing sample efficiency and interpretability.
  • Empirical studies in robotics, locomotion, and continual adaptation demonstrate significant improvements in convergence speed, robustness, and hierarchical task decomposition.

Example-Guided Reinforcement Learning (EGRL) encompasses a set of methodologies in which reinforcement learning (RL) agents are guided by external examples—such as successful trajectories, demonstrations, temporal logic specifications, or policy suggestions from LLMs—to facilitate or accelerate policy acquisition for complex, temporally-structured, or reward-sparse tasks. EGRL serves as an effective alternative or supplement to manual reward engineering, imitation learning in the strict sense, and purely autonomous exploration, enabling sample-efficient, interpretable, and robust solutions across domains including robotics, dynamic environments, and sequential decision making.

1. Formal Problem Settings and General Methodological Variants

EGRL methods are instantiated over a Markov Decision Process $\mathcal{M}=(\mathcal{S},\mathcal{A},p,r,\gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, environment dynamics $p(\cdot \mid s,a)$ (potentially unknown), and reward $r(s,a,s')$ (which may be sparse, hard to specify, or even replaced with outcome examples). The agent’s goal is to learn a policy $\pi$ that maximizes the expected sum of discounted rewards or, in the absence of explicit rewards, maximizes the likelihood of achieving a set of success examples $\mathcal{E}=\{s^*\}$ or satisfying a formal task specification.
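
Schematically, the two regimes correspond to the objectives below; the success indicator $e_t \in \{0,1\}$ (marking whether $s_t$ matches an example in $\mathcal{E}$) is introduced here only for illustration and is not quoted from any single cited paper.

```latex
% Standard reward-maximization objective
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi,\,p}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t, s_{t+1})\right]

% Outcome-example variant: maximize the discounted probability of success,
% where e_t = 1 indicates that s_t matches a success example in \mathcal{E}
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi,\,p}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, p(e_t = 1 \mid s_t)\right]
```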

Variants of EGRL arise by the type and role of the example:

  • Demonstration-guided RL: The agent is provided trajectories collected by a human or optimized controller, often leveraged for bootstrapping or curriculum (e.g., behavior cloning and fine-tuning) (Pertsch et al., 2021, Apostolides et al., 20 Mar 2025).
  • Temporal logic/task-structure-guided RL: The agent is equipped with formally-specified goals (such as temporal logic formulas) that are converted into reward signals or automata-augmented states (Li et al., 2018).
  • Outcome-based RL: The agent is supplied with states exemplifying desirable terminal conditions, and learns to maximize the discounted probability of future success, bypassing explicit reward learning (Eysenbach et al., 2021).
  • Language/model-based example guidance: Examples are generated in situ, e.g., by querying an LLM for plausible action sequences or completions, and are mixed with the agent’s own policy (Karimpanal et al., 2023).

These variants support learning under sparse, delayed, or ambiguous reward conditions, hierarchical or compositional task structure, and settings where explicit reward/goal specification is infeasible.

2. Algorithmic Design: Incorporation of Examples

The operationalization of example guidance varies with example type:

  • Demonstration or trajectory example: Trajectories are incorporated via behavior cloning to initialize the policy (minimizing $\sum_k \|\pi_\theta(s_k) - a_k\|^2$), followed by fine-tuning with RL objectives (see the behavior-cloning sketch after this list). In demonstration-guided continual RL, a dynamic demonstration repository is maintained, where each task leverages the most relevant demonstration, and new high-performing trajectories are added as guide examples (Yang et al., 21 Dec 2025). For skill-based demonstration guidance, reusable skill embeddings are inferred from large datasets and high-level policies select skill latents, guided by a demonstration-conditioned skill posterior (Pertsch et al., 2021).
  • Temporal logic or automaton example: Temporal goals specified as syntactically co-safe TLTL formulas are converted into deterministic finite automata. The automaton state $q$ augments the agent’s observation, imposing hierarchical task decomposition, and intrinsic rewards are generated automatically based on automaton transitions (Li et al., 2018).
  • Success outcome examples: Successful outcome states $\mathcal{E}$ are used in place of rewards, and the agent trains a classifier $C_\theta(s,a)$ representing the discounted future success probability, with updates derived from a data-driven Bellman equation involving recursive classification (Eysenbach et al., 2021).
  • LLM or programmatic examples: Example actions are provided online via mixed policies, blending the agent’s own action distribution with the example policy from an LLM. The proportion of reliance on the example policy is controlled (e.g., by a query-deciding secondary RL agent) to balance efficiency and computational/resource costs (Karimpanal et al., 2023).
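
As a concrete illustration of the first item above, a minimal behavior-cloning pre-training loop might look like the following sketch. The network architecture, tensor shapes, and hyperparameters are illustrative assumptions (PyTorch-style), not an implementation drawn from any cited paper; the pre-trained policy would subsequently be fine-tuned with an RL objective.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Deterministic policy pi_theta(s) -> a (illustrative architecture)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def behavior_cloning_pretrain(policy, demo_states, demo_actions,
                              epochs: int = 100, lr: float = 1e-3):
    """Minimize sum_k ||pi_theta(s_k) - a_k||^2 over demonstration pairs."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        pred = policy(demo_states)
        loss = ((pred - demo_actions) ** 2).sum(dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy

# Usage with synthetic stand-in data for real demonstration trajectories.
if __name__ == "__main__":
    states = torch.randn(512, 17)   # e.g. proprioceptive observations
    actions = torch.randn(512, 6)   # e.g. expert joint commands
    policy = PolicyNet(state_dim=17, action_dim=6)
    behavior_cloning_pretrain(policy, states, actions, epochs=10)
    # The pre-trained policy would then be fine-tuned with an RL objective.
```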

3. Architectural and Representation Innovations

Crucial architectural features emerge in EGRL:

  • Hierarchical and interpretable structure: Temporal-logic- or automaton-guided RL introduces an automaton state $q$ that decomposes the global policy into sub-policies aligned with sub-tasks or temporal progress. This separation entails a “manager-worker” abstraction, with high-level symbolic control modulating the lower-level continuous policy (see the augmentation sketch after this list) (Li et al., 2018).
  • Skill representation: Skill-based EGRL models define skills as $H$-step temporal segments, encode them into a latent variable $z\in\mathbb{R}^d$, and factorize the agent into high-level selection and low-level skill execution policies. Skill priors and posteriors enable flexible adaptation and reuse of previously learned behaviors (Pertsch et al., 2021).
  • Network structures: Policies are typically realized via deep neural networks adapted to the input (state, automaton state, skill latent), with additional discriminators for demonstration support detection, or multiple value/policy heads for multi-skill integration (Peng et al., 2018, Pertsch et al., 2021).
  • Demonstration repositories: In continual RL, an evolving repository enables dynamic retrieval and expansion of guide examples, facilitating adaptation and knowledge retention without catastrophic forgetting (Yang et al., 21 Dec 2025).
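
To make the automaton-augmentation idea in the first item concrete, the sketch below wraps a gym-style environment (classic four-tuple `step` API) so that the agent observes the environment state concatenated with a one-hot encoding of the automaton state $q$, and receives an intrinsic reward whenever the automaton advances. The `DFA` interface and the labeling function are simplified assumptions, not the exact construction used by the cited work.

```python
import numpy as np

class DFA:
    """Minimal deterministic finite automaton over atomic-proposition labels."""
    def __init__(self, n_states: int, transitions: dict, accepting: set):
        self.n_states = n_states
        self.transitions = transitions   # {(q, label): q_next}
        self.accepting = accepting
        self.q = 0

    def reset(self):
        self.q = 0

    def step(self, label: str) -> int:
        # Stay in place if no transition is defined for this label.
        self.q = self.transitions.get((self.q, label), self.q)
        return self.q

class AutomatonAugmentedEnv:
    """Wraps an environment so the agent observes (env_obs, one_hot(q)).

    An intrinsic reward is emitted whenever the automaton advances,
    mirroring the automaton-transition rewards described above.
    """
    def __init__(self, env, dfa: DFA, label_fn):
        self.env, self.dfa, self.label_fn = env, dfa, label_fn

    def _augment(self, obs):
        one_hot = np.zeros(self.dfa.n_states, dtype=np.float32)
        one_hot[self.dfa.q] = 1.0
        return np.concatenate([obs, one_hot])

    def reset(self):
        obs = self.env.reset()
        self.dfa.reset()
        return self._augment(obs)

    def step(self, action):
        obs, env_reward, done, info = self.env.step(action)
        q_prev = self.dfa.q
        q_next = self.dfa.step(self.label_fn(obs))
        intrinsic = 1.0 if q_next != q_prev else 0.0
        done = done or (q_next in self.dfa.accepting)
        return self._augment(obs), env_reward + intrinsic, done, info
```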

4. Training Protocols and Curriculum

EGRL typically employs a two- or multi-stage training schedule tailored to the nature of example guidance:

  • Imitation/pre-training: Policy initialization through behavioral cloning using demonstration data or imitation objectives relative to reference trajectories (Peng et al., 2018, Apostolides et al., 20 Mar 2025, Pertsch et al., 2021).
  • Example-guided exploration/curriculum: Rollout policies blend example guidance and self-exploration, with the reliance on examples decayed according to a curriculum schedule (e.g., buffer-mixing or staged roll-in). Curriculum coefficients (e.g., $\alpha_t$) control the transition from fully guided to autonomous RL (a minimal mixing sketch follows this list) (Yang et al., 21 Dec 2025).
  • Fine-tuning with environment reward and/or intrinsic objectives: RL fine-tuning may utilize intrinsic rewards derived from automata, adversarial imitation rewards from discriminators, or the environment reward itself (Li et al., 2018, Pertsch et al., 2021). Additional objectives include KL-divergence terms regularizing to skill priors/posteriors, or the “success classifier” for outcome-based EGRL (Eysenbach et al., 2021).
  • Progressive task complexity: Progressive training schemes reduce reliance on demonstration (imitation reward weights decreased, randomization increased) and advance from demonstration-regime, to generalization, to robustness stages (Apostolides et al., 20 Mar 2025).
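
A minimal sketch of the curriculum-blended rollout mentioned above, assuming a gym-style environment and a per-step mixing rule with a linearly decayed coefficient $\alpha_t$; the schedule, the guide-policy interface, and the mixing granularity are illustrative choices rather than the protocol of any single cited paper.

```python
import random

def blended_rollout(env, agent_policy, guide_policy, alpha: float, max_steps: int = 1000):
    """Collect one episode, following the guide policy with probability alpha
    at each step and the agent's own policy otherwise."""
    trajectory, obs, done, t = [], env.reset(), False, 0
    while not done and t < max_steps:
        if random.random() < alpha:
            action = guide_policy(obs)   # example / demonstration guidance
        else:
            action = agent_policy(obs)   # autonomous exploration
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        obs, t = next_obs, t + 1
    return trajectory

def alpha_schedule(step: int, total_steps: int, alpha_0: float = 1.0, alpha_min: float = 0.0):
    """Linearly decay the curriculum coefficient alpha_t from alpha_0 to alpha_min."""
    frac = min(step / max(total_steps, 1), 1.0)
    return alpha_0 + frac * (alpha_min - alpha_0)
```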

5. Theoretical Guarantees and Interpretability

Several EGRL frameworks provide formal guarantees or data-driven Bellman equations reflecting the underlying control objective:

  • Automata-guided reward alignment: For TL-guided RL, maximizing the discounted intrinsic automaton-based reward is equivalent to maximizing the probability of achieving the temporal-logic specification within the horizon; the discount factor favors minimal-step solutions (Li et al., 2018).
  • Recursive classification value iteration: Example-based RL via recursive classification defines a data-driven Bellman recursion for the success-probability classifier (see the schematic recursion after this list). Under tabular settings, this procedure provably converges to the optimal policy with monotonic improvement (Eysenbach et al., 2021).
  • Forward transfer and forgetting: Demonstration-guided continual RL is evaluated for forward transfer, forgetting, and overall average performance, with demonstration guidance yielding superior area-under-curve (AUC), rapid convergence, and negligible forgetting across non-stationary task sequences (Yang et al., 21 Dec 2025).
  • Policy transparency: Automaton-augmented policies, skill-selection models, and repository-based approaches yield interpretable execution traces, as the sequence of automaton or skill states directly encodes subtask completion or knowledge utilization (Li et al., 2018, Pertsch et al., 2021).
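
For intuition about the recursive-classification item, the success-probability quantity the classifier approximates satisfies a Bellman-style fixed-point relation of the following schematic form; the cited algorithm's actual cross-entropy losses and sample weights differ in detail.

```latex
% e_t \in \{0,1\} indicates that s_t matches a success example in \mathcal{E};
% p^{\pi}(e_{t+} = 1 \mid s_t, a_t) is the discounted probability of future success.
p^{\pi}(e_{t+} = 1 \mid s_t, a_t)
  \;=\; (1-\gamma)\, p(e_t = 1 \mid s_t)
  \;+\; \gamma\; \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t),\; a_{t+1} \sim \pi}
        \!\left[ p^{\pi}(e_{(t+1)+} = 1 \mid s_{t+1}, a_{t+1}) \right],
\qquad
C_\theta(s,a) \;\approx\; p^{\pi}(e_{t+} = 1 \mid s, a).
```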

6. Empirical Results and Application Domains

EGRL has been empirically validated in diverse benchmarks and robotic platforms:

  • Robotic manipulation: FSA-guided RL with demonstration achieves >90% success on temporally complex sequential tasks, requiring only formal task specification and minimal reward shaping (Li et al., 2018). Skill-based demonstration guidance in long-horizon kitchen manipulation and office cleanup yields 2–5× improvements in sample efficiency over baselines (Pertsch et al., 2021).
  • Locomotion and compliant control: Example-guided policy search from trajectory-optimized examples enables robust, energy-efficient jumping in quadrupeds under variable terrain and robot morphologies, with sample efficiency and generalization to perturbations (Apostolides et al., 20 Mar 2025).
  • Continual adaptation: Demonstration-guided continual RL in dynamic navigation and MuJoCo locomotion tasks produces consistent forward transfer, strong performance (+0.80 FT in Hopper), rapid convergence (2–5× speedup), and virtually zero catastrophic forgetting (Yang et al., 21 Dec 2025).
  • Pattern completion and multimodal control: Language-guided example querying accelerates RL on cube-stacking, image completion, and manipulation with up to a 3× reduction in total sample requirements and a 30–60% reduction in teacher/oracle queries (Karimpanal et al., 2023).
  • Physics-based animation: Motion-clip imitation fused with task objectives results in humanoids, robots, and fantastical characters performing acrobatic, multi-modal behaviors, robust to perturbations, environmental change, and task composition (Peng et al., 2018).

7. Limitations and Extensibility

Limitations of EGRL approaches include:

  • Restricted coverage of example priors: Skill posterior guidance can fail outside the support of demonstrations, requiring discriminators to gate off-policy samples (Pertsch et al., 2021).
  • Noisy or brittle demonstration signals: Discriminator/imitator rewards can be unstable, and reliance on hand-tuned examples may introduce bias or insufficient generalization.
  • Manual burden in example or prompt generation: For language-based or formal-task guidance, carefully crafted examples or prompts may be essential; pattern coverage by LLMs is limited to semantically regular tasks (Karimpanal et al., 2023).
  • Absence of reward engineering may limit fine-grained behavior: Outcome-based or automaton-based intrinsic objectives remove manual reward specification, but may lack granularity or fail to resolve ambiguities in complex settings (Eysenbach et al., 2021).
  • Partial ability to jointly optimize all layers of hierarchy: Fixing low-level skill decoders may limit adaptability for novel compositions or unseen dynamic regimes (Pertsch et al., 2021).

Active research explores the joint finetuning of hierarchies, extension to visual/partially observed domains, zero-shot generalization via goal-conditioned skills, model-based skill sequencing, and scalable prompt- or example-selection heuristics.


These dimensions form the technical landscape of Example-Guided Reinforcement Learning, with rigorous algorithmic, theoretical, and empirical validations covering demonstration-guided curriculum learning (Yang et al., 21 Dec 2025), hierarchical temporal logic guidance (Li et al., 2018), outcome-based Bellman recursion (Eysenbach et al., 2021), skill compositionality (Pertsch et al., 2021), and diverse embodiment/interaction platforms (Peng et al., 2018, Apostolides et al., 20 Mar 2025, Karimpanal et al., 2023).
