IB-GRPO: Pareto-Optimal Learning Path Framework
- IB-GRPO is a multi-objective learning path recommendation framework that uses vector-valued rewards to optimize pedagogical objectives such as learning effect and ZPD alignment.
- The framework integrates genetic algorithms and offline RL for expert warm-start, followed by indicator-based policy optimization without manual scalarization.
- Empirical evaluations on the ASSIST09 and Junyi datasets demonstrate that it achieves a superior balance among learning effect, path diversity, and operational constraints.
IB-GRPO (Indicator-Based Group Relative Policy Optimization) is a learning path recommendation (LPR) framework that aligns LLM-based policies with pedagogical objectives such as learning effect maximization, zone of proximal development (ZPD) alignment, operational constraints, and path diversity, by leveraging a vector-valued reward structure and direct Pareto frontier optimization without manual scalarization (Wang et al., 21 Jan 2026).
1. Motivation and Pedagogical Objectives
Long-horizon LPR requires generating sequences of learning items personalized to individual students in order to:
- Maximize long-term learning effect: Enhance a student’s post-instruction proficiency.
- Schedule exercise difficulty to match the ZPD: Adjust item difficulty to maintain challenge within a beneficial proficiency band.
- Respect operational constraints: Adhere to desired path lengths and session constraints.
- Maintain diversity: Avoid repetitive recommendations by supporting a wide range of plausible learning trajectories.
Standard LLMs, trained to optimize next-token likelihood, exhibit myopic planning and do not natively conform to these multi-faceted pedagogical needs. Simple scalar reduction of multi-objective signals risks “locking in” suboptimal trade-offs. IB-GRPO introduces a vector reward formulation and directly targets the Pareto-optimal frontier, sidestepping ad hoc weighting and promoting adaptive balancing of objectives.
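The scalarization pitfall can be made concrete with a toy example (the reward numbers below are illustrative, not from the paper): on a non-convex Pareto front, no fixed linear weighting ever selects the balanced non-dominated solution, so a scalarized learner can never be steered toward it.

```python
# Illustrative (toy numbers): linear scalarization can never select a
# non-dominated solution that sits on a non-convex part of the Pareto front.

def dominates(u, v):
    """True if u weakly dominates v with at least one strict improvement."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

# Three trajectories scored on two objectives; B is the balanced trade-off.
A, B, C = (0.0, 1.0), (0.4, 0.4), (1.0, 0.0)
candidates = [A, B, C]

# B is Pareto-optimal: no other candidate dominates it.
assert not any(dominates(x, B) for x in candidates if x != B)

# Yet for every weight w in [0, 1], the scalar score w*r1 + (1-w)*r2
# strictly prefers A or C over B, so B is "locked out" by any fixed weights.
never_best = all(
    max(w * r[0] + (1 - w) * r[1] for r in (A, C)) > w * B[0] + (1 - w) * B[1]
    for w in [i / 100 for i in range(101)]
)
print(never_best)  # True
```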
2. Formal Framework and Multi-Objective MDP
The LPR setting is formalized as an episodic Markov Decision Process (MDP):
- State ($s_t$): Encapsulates the student's latent proficiency ($\theta_t$), the interaction history $H_t$, and prompt features (e.g., target path length, prior recommendations).
- Action ($a_t$): Choosing the next knowledge concept or exercise from a candidate set.
- Trajectory ($\tau = (a_1, \dots, a_T)$): An ordered sequence generated auto-regressively by the parameterized LLM policy $\pi_\theta$.
The reward function is a vector $\mathbf{R}(\tau) = \big(R_{\mathrm{eff}}, R_{\mathrm{zpd}}, R_{\mathrm{len}}, R_{\mathrm{div}}\big)$ comprising:
- Learning Effect $R_{\mathrm{eff}} = \frac{s_{\mathrm{post}} - s_{\mathrm{pre}}}{s_{\max} - s_{\mathrm{pre}}}$, where $s_{\mathrm{pre}}, s_{\mathrm{post}}$ are pre/post-test scores and $s_{\max}$ the maximum attainable score.
- ZPD Alignment $R_{\mathrm{zpd}}$: Penalizes deviation of the item difficulty $d(a_t)$ from $c(\theta_t)$, the optimal-difficulty center for proficiency $\theta_t$.
- Length Constraint $R_{\mathrm{len}}$: Penalizes mismatch between the realized path length $|\tau|$ and the target length $L$.
- Diversity $R_{\mathrm{div}}$: Rewards low similarity of $\tau$ to the other trajectories in the sampled path group $G$, with similarity measured over $n$-gram overlaps.
The learning objective is to train $\pi_\theta$ so that non-dominated trajectories sampled from the policy closely approximate the true Pareto frontier of the vectorized reward, without scalar collapse.
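A minimal sketch of the vector-reward view, with hypothetical component forms and toy inputs (the paper's exact functional forms are not given above, so the formulas inside `vector_reward` are assumptions): assemble a four-component reward per trajectory, then filter a sampled group down to its non-dominated set.

```python
# Sketch with assumed reward forms: build the vector reward described above
# and keep only the Pareto-optimal (non-dominated) members of a group.

def vector_reward(s_pre, s_post, s_max, zpd_dev, length, target_len, sim):
    """Four reward components; the exact functional forms are illustrative."""
    r_eff = (s_post - s_pre) / (s_max - s_pre)      # normalized learning gain
    r_zpd = -abs(zpd_dev)                            # penalize ZPD deviation
    r_len = -abs(length - target_len) / target_len   # penalize length mismatch
    r_div = 1.0 - sim                                # low n-gram similarity is good
    return (r_eff, r_zpd, r_len, r_div)

def pareto_front(rewards):
    """Keep reward vectors not dominated by any other vector in the group."""
    def dominated_by(u, v):  # v weakly dominates u with a strict improvement
        return all(b >= a for a, b in zip(u, v)) and any(b > a for a, b in zip(u, v))
    return [u for u in rewards if not any(dominated_by(u, v) for v in rewards if v != u)]

group = [
    vector_reward(0.3, 0.8, 1.0, 0.05, 20, 20, 0.2),  # strong gain, exact length
    vector_reward(0.3, 0.6, 1.0, 0.01, 18, 20, 0.1),  # tighter ZPD, more diverse
    vector_reward(0.3, 0.5, 1.0, 0.20, 25, 20, 0.6),  # dominated on all axes
]
front = pareto_front(group)
print(len(front))  # 2: the first two trade off against each other
```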
3. IB-GRPO Algorithmic Procedure
IB-GRPO operates in two distinct stages: a hybrid expert warm-start via supervised fine-tuning (SFT), followed by indicator-based group relative policy optimization.
3.1 Stage I: Hybrid Expert Warm-Start
- Genetic Algorithm Expert: Each trajectory is encoded as a chromosome; tournament selection, crossover, and mutation drive exploration toward high-learning-effect trajectories while preserving diversity.
- Policy-based Teacher (Offline RL): Pre-trained LPR agents (e.g., CSEAL, GEPKSD) complement genetic search, especially in sparse or low-performing regions.
- Behavior Cloning SFT: Expert trajectories from both sources comprise the dataset $\mathcal{D}_E$. The LLM is warm-started by maximizing the expert log-likelihood $\mathcal{L}_{\mathrm{SFT}}(\theta) = \sum_{\tau \in \mathcal{D}_E} \sum_t \log \pi_\theta(a_t \mid s_t)$.
- Outcome: Warm-started policy $\pi_{\theta_0}$, which initializes Stage II.
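The behavior-cloning objective can be illustrated on a toy tabular policy standing in for the LLM (state names, concepts, and probabilities below are invented): SFT minimizes the negative log-likelihood of expert-chosen actions.

```python
import math

# Toy behavior cloning: a tabular policy stands in for the LLM, and an
# expert trajectory pairs each state with the action the expert chose.
policy = {  # state -> action-probability table (illustrative values)
    "s0": {"add_fractions": 0.7, "decimals": 0.3},
    "s1": {"add_fractions": 0.2, "decimals": 0.8},
}

expert_trajectory = [("s0", "add_fractions"), ("s1", "decimals")]

# Warm-start SFT maximizes sum_t log pi(a_t | s_t) over expert pairs,
# i.e. it minimizes the cross-entropy (negative log-likelihood) below.
nll = -sum(math.log(policy[s][a]) for s, a in expert_trajectory)
print(round(nll, 4))  # -(ln 0.7 + ln 0.8) ~= 0.5798
```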
3.2 Stage II: Indicator-Based Group Relative Policy Optimization
Training Iteration Procedure:
- Group Sampling: For each state $s$ in the batch, sample a group of $G$ trajectories $\{\tau_1, \dots, \tau_G\}$ using the current policy.
- Reward Vectorization: Compute the reward vector $\mathbf{R}(\tau_i)$ for each trajectory.
- Dominance Indicator: For each pair $(\tau_i, \tau_j)$, the additive $\epsilon$-indicator $I_\epsilon(\tau_i, \tau_j) = \max_k \big(R_k(\tau_j) - R_k(\tau_i)\big)$ quantifies the minimum uniform augmentation needed for $\tau_i$ to weakly dominate $\tau_j$.
- Pareto Fitness $F(\tau_i)$: $F(\tau_i) = \sum_{j \neq i} -\exp\!\big(-I_\epsilon(\tau_j, \tau_i)/\kappa\big)$, with scaling parameter $\kappa > 0$.
- Group-Relative Advantage $A_i$: Standardize fitness within the group, $A_i = \big(F(\tau_i) - \mathrm{mean}_j\, F(\tau_j)\big) \,/\, \mathrm{std}_j\, F(\tau_j)$.
- Policy Update: Optimize $\pi_\theta$ with the importance-sampled, asymmetrically clipped surrogate $\mathcal{J}(\theta) = \mathbb{E}\big[\min\big(r_t(\theta)\, A_i,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon_{\mathrm{low}},\, 1+\epsilon_{\mathrm{high}})\, A_i\big)\big]$, where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and the thresholds $\epsilon_{\mathrm{low}}, \epsilon_{\mathrm{high}}$ clip the ratio asymmetrically.
This pairwise dominance approach facilitates direct discovery of Pareto-optimal policies and circumvents the drawbacks of manual scalarization.
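The Stage II computation chain, from indicator through fitness to group-relative advantage, can be sketched as follows (the reward vectors and the scale `kappa` below are illustrative, not the paper's settings):

```python
import math

def eps_indicator(r_i, r_j):
    """Minimum uniform addition to r_i so that it weakly dominates r_j."""
    return max(b - a for a, b in zip(r_i, r_j))

def pareto_fitness(rewards, kappa=0.05):
    """Indicator-based fitness: sum of negative exponentials of pairwise
    indicators; dominated vectors receive strongly negative fitness."""
    return [
        sum(-math.exp(-eps_indicator(r_j, r_i) / kappa)
            for j, r_j in enumerate(rewards) if j != i)
        for i, r_i in enumerate(rewards)
    ]

def group_advantages(fitness):
    """GRPO-style standardization of fitness within the sampled group."""
    mean = sum(fitness) / len(fitness)
    std = (sum((f - mean) ** 2 for f in fitness) / len(fitness)) ** 0.5
    return [(f - mean) / (std + 1e-8) for f in fitness]

# Toy group of 2-objective reward vectors; the last one is dominated.
group = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.2, 0.2)]
adv = group_advantages(pareto_fitness(group))
# The dominated trajectory (index 3) receives the lowest advantage,
# so the policy update pushes probability mass away from it.
print([round(a, 3) for a in adv])
```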
4. Training Workflow and Pseudocode
The core training protocol can be summarized as follows:
| Stage | Key Steps |
|---|---|
| Warm-start SFT | Generate expert trajectories (GA, RL agents); aggregate dataset $\mathcal{D}_E$; fine-tune LLM via cross-entropy loss on experts. |
| IB-GRPO Optimization | For each epoch: sample state batches; for each state, sample $G$ trajectories, compute vector rewards, all pairwise indicators $I_\epsilon$, fitness $F$, and advantages $A_i$; update $\pi_\theta$ on the clipped surrogate; update the reference policy. |
This approach synthesizes search-based and offline RL-generated data for broad solution coverage, then proceeds with indicator-based policy optimization.
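The workflow in the table can be rendered as a runnable skeleton; every component below is a stub (the sampler, reward, and update functions are placeholders, not the paper's implementations), so only the control flow carries over.

```python
import random

random.seed(0)

def sample_trajectory(state):
    """Stub for an auto-regressive rollout of the LLM policy."""
    return [random.choice(["c1", "c2", "c3"]) for _ in range(5)]

def vector_reward(traj):
    """Stub for the 4-component pedagogical reward vector."""
    return tuple(random.random() for _ in range(4))

def ib_grpo_update(groups):
    """Stub for indicator -> fitness -> advantage -> clipped policy update;
    here it only reports how many trajectories were consumed."""
    return sum(len(g) for g in groups)

G, epochs, states = 4, 2, ["s0", "s1"]  # toy sizes, not paper settings
for epoch in range(epochs):
    groups = []
    for s in states:                          # group sampling per state
        trajs = [sample_trajectory(s) for _ in range(G)]
        rewards = [vector_reward(t) for t in trajs]
        groups.append(list(zip(trajs, rewards)))
    n = ib_grpo_update(groups)                # Stage II update step
print(n)  # trajectories per epoch = |states| * G = 8
```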
5. Empirical Evaluation
Experiments are conducted on the ASSIST09 and Junyi datasets with a DKT-based Knowledge Evolution Simulator (KES). Key experimental characteristics:
- Datasets:
- ASSIST09: 167 concepts, 4,217 learners, 346,860 records.
- Junyi: 835 concepts, 525,061 learners, 21,460,249 records.
- Simulator: KES (DKT environment) for counterfactual student response simulation.
- Model backbone: Qwen2.5-7B LLM (8 × A800-40GB GPUs).
- Key hyperparameters: Group size $G$, indicator scale $\kappa$ tuned per validation split, and asymmetric clipping thresholds $\epsilon_{\mathrm{low}}, \epsilon_{\mathrm{high}}$.
Baselines:
- Non-RL: DKTRec
- General RL: DQN, Actor-Critic, PPO
- Education-specific RL: CSEAL, GEPKSD
- LLM-based: GenAL, ReAL
Metrics:
- Learning Effect ($R_{\mathrm{eff}}$): normalized gain in simulated post-instruction proficiency (higher is better)
- LenScore: agreement between the realized and target path lengths
- Path Diversity ($R_{\mathrm{div}}$): $n$-gram-based dissimilarity across recommended paths
- ZPD Alignment ($R_{\mathrm{zpd}}$): closeness of scheduled difficulty to the ZPD band
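One plausible instantiation of an $n$-gram-based diversity score (the paper's exact definition may differ; the form below is an assumption): one minus the mean pairwise bigram Jaccard overlap across a group of recommended paths.

```python
# Assumed diversity measure: 1 - mean pairwise n-gram Jaccard similarity.

def ngrams(path, n=2):
    """Set of consecutive n-grams (here: concept bigrams) in a path."""
    return {tuple(path[i:i + n]) for i in range(len(path) - n + 1)}

def diversity(paths, n=2):
    sims, m = [], len(paths)
    for i in range(m):
        for j in range(i + 1, m):
            a, b = ngrams(paths[i], n), ngrams(paths[j], n)
            sims.append(len(a & b) / len(a | b))  # Jaccard similarity
    return 1.0 - sum(sims) / len(sims)

group = [
    ["c1", "c2", "c3", "c4"],
    ["c1", "c2", "c4", "c3"],   # shares only the (c1, c2) bigram with path 1
    ["c5", "c6", "c7", "c8"],   # disjoint concepts
]
print(round(diversity(group), 3))  # 0.933
```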
Main Results:
| Dataset | Best baseline (L=20) | IB-GRPO (L=20) |
|---|---|---|
| Junyi | ReAL: 0.5724 | 0.7743 |
| ASSIST09 | GEPKSD: 0.5837 | 0.5911 |
IB-GRPO surpasses all baselines across metrics and path lengths. Ablations reveal that removing the ZPD reward degrades long-horizon planning, and that alternative fitness assignments such as HVO or GDPO underperform in achieving balanced trade-offs. Hybrid GA+RL demonstration data enables the most comprehensive Pareto coverage.
Diagnostic plots confirm that IB-GRPO trajectories are more tightly distributed around the ZPD optimal band and exhibit reduced late-stage difficulty variance. Achieved solutions are superior in balancing learning effect, ZPD compliance, path length accuracy, and diversity.
6. Significance and Extensions
IB-GRPO demonstrates that direct Pareto-efficient policy optimization for LPR, leveraging indicator-guided group advantages and pedagogical alignment, yields substantial improvement over both generic reinforcement learning and simple LLM-based approaches. By explicitly incorporating a differentiable ZPD reward and circumventing fixed scalarization, IB-GRPO promotes more robust and pedagogically sound learning path recommendations that generalize across datasets and path horizons. This suggests that indicator-based group relative methods offer a scalable foundation for multi-objective alignment in complex educational inference tasks.
For a detailed description of the framework, see (Wang et al., 21 Jan 2026).