
IB-GRPO: Pareto-Optimal Learning Path Framework

Updated 2 February 2026
  • IB-GRPO is a multi-objective learning path recommendation framework that uses vector-valued rewards to optimize pedagogical objectives such as learning effect and ZPD alignment.
  • The framework integrates genetic algorithms and offline RL for expert warm-start, followed by indicator-based policy optimization without manual scalarization.
  • Empirical evaluations on ASSIST09 and Junyi datasets demonstrate its superior balance in optimizing learning effect, path diversity, and operational constraints.

IB-GRPO (Indicator-Based Group Relative Policy Optimization) is a learning path recommendation (LPR) framework that aligns LLM-based policies with pedagogical objectives such as learning effect maximization, zone of proximal development (ZPD) alignment, operational constraints, and path diversity, by leveraging a vector-valued reward structure and direct Pareto frontier optimization without manual scalarization (Wang et al., 21 Jan 2026).

1. Motivation and Pedagogical Objectives

Long-horizon LPR requires generating sequences of learning items personalized to individual students in order to:

  • Maximize long-term learning effect: Enhance a student’s post-instruction proficiency.
  • Schedule exercise difficulty to match the ZPD: Adjust item difficulty to maintain challenge within a beneficial proficiency band.
  • Respect operational constraints: Adhere to desired path lengths and session constraints.
  • Maintain diversity: Avoid repetitive recommendations by supporting a wide range of plausible learning trajectories.

Standard LLMs, trained to optimize next-token likelihood, exhibit myopic planning and do not natively conform to these multi-faceted pedagogical needs. Simple scalar reduction of multi-objective signals risks “locking in” suboptimal trade-offs. IB-GRPO introduces a vector reward formulation and directly targets the Pareto-optimal frontier, sidestepping ad hoc weighting and promoting adaptive balancing of objectives.

2. Formal Framework and Multi-Objective MDP

The LPR setting is formalized as an episodic Markov Decision Process (MDP):

  • State ($s$): Encapsulates the student's latent proficiency ($a$), interaction history $H=\{(c_1, y_1),\dots,(c_k, y_k)\}$, and prompt features (e.g., target path length, prior recommendations).
  • Action ($\pi_t$): Choosing the next knowledge concept or exercise from a candidate set.
  • Trajectory ($\pi$): An ordered sequence $(\pi_1,\dots,\pi_L)$ generated auto-regressively by the parameterized LLM policy $\mu_\theta$.

The reward function $r(s, \pi) \in \mathbb{R}^4$ is a vector comprising:

  1. Learning Effect $E_p(\pi)$:

$$E_p(\pi) = (E_e - E_s)/(E_{sup} - E_s)$$

where $E_s, E_e$ are pre- and post-test scores and $E_{sup}$ the maximum attainable score.

  2. ZPD Alignment $S_{ZPD}(\pi)$:

$$S_{ZPD}(\pi) = \frac{1}{L}\sum_t \exp\left(-\frac{(d(\pi_t) - z(a))^2}{2\sigma^2}\right)$$

where $d(\pi_t)$ is the item difficulty and $z(a)$ the optimal-difficulty center for proficiency $a$.

  3. Length Constraint $R_{Len}(\pi)$:

$$R_{Len}(\pi) = \begin{cases} 1.0, & |\Delta| \leq \tau \\ -\lambda (|\Delta| - \tau), & |\Delta| > \tau \end{cases}$$

where $\Delta = L - L_{target}$.

  4. Diversity $D_{Div}(\pi; \mathcal{B})$:

$$D_{Div}(\pi; \mathcal{B}) = 1 - \frac{1}{|\mathcal{B}|-1} \sum_{\pi' \neq \pi} \mathrm{Sim}_{Jaccard}(\pi, \pi')$$

where $\mathcal{B}$ is the sampled path group and similarity is measured over $n$-gram overlaps.

The learning objective is to train $\mu_\theta$ so that non-dominated trajectories sampled from the policy closely approximate the true Pareto frontier of the vectorized reward, without scalar collapse.
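
As an illustration, the four reward components can be sketched in plain Python. This is a minimal sketch: function names and default parameter values are illustrative, and Jaccard similarity is computed over item sets rather than the paper's $n$-gram overlaps.

```python
import math

def zpd_alignment(difficulties, z_center, sigma=1.0):
    """Mean Gaussian proximity of item difficulties to the ZPD center z(a)."""
    return sum(math.exp(-((d - z_center) ** 2) / (2 * sigma ** 2))
               for d in difficulties) / len(difficulties)

def length_reward(L, L_target, tau=1, lam=0.1):
    """1.0 inside the tolerance band |Delta| <= tau, linear penalty outside."""
    delta = abs(L - L_target)
    return 1.0 if delta <= tau else -lam * (delta - tau)

def jaccard(a, b):
    """Set-level Jaccard similarity (unigram simplification)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def diversity(path, group):
    """1 minus mean Jaccard similarity to the other paths in the group."""
    others = [p for p in group if p is not path]
    return 1.0 - sum(jaccard(path, p) for p in others) / len(others)
```

A fully dominated-by-repetition group (identical paths) yields diversity 0, while fully disjoint paths yield diversity 1.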

3. IB-GRPO Algorithmic Procedure

IB-GRPO operates in two distinct stages: a hybrid expert warm-start via supervised fine-tuning (SFT), followed by indicator-based group relative policy optimization.

3.1 Stage I: Hybrid Expert Warm-Start

  • Genetic Algorithm Expert: Each trajectory is encoded as a chromosome; tournament selection, crossover, and mutation drive exploration toward high-learning-effect paths while preserving diversity.
  • Policy-based Teacher (Offline RL): Pre-trained LPR agents (e.g., CSEAL, GEPKSD) complement genetic search, especially in sparse or low-performing regions.
  • Behavior Cloning SFT: Expert trajectories from both sources comprise the dataset $\mathcal{D}_{sft} = \{(s, \pi_{expert})\}$. The LLM is warm-started by minimizing the negative log-likelihood:

$$L_{SFT}(\theta) = -\mathbb{E}_{(s, \pi_{expert}) \sim \mathcal{D}_{sft}} \left[ \sum_{t=1}^L \log \mu_\theta(\pi_{expert, t} \mid s, \pi_{expert, <t}) \right]$$

  • Outcome: Warm-started policy $\mu_{sft}(\pi \mid s)$.
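
The behavior-cloning objective is an ordinary autoregressive negative log-likelihood over expert tokens. A minimal sketch, where `logprob_fn` is a hypothetical stand-in for the policy's per-token log-probability:

```python
def sft_loss(logprob_fn, state, expert_path):
    """Negative log-likelihood of the expert path under the policy,
    summed over autoregressive steps (behavior cloning)."""
    return -sum(logprob_fn(state, expert_path[:t], expert_path[t])
                for t in range(len(expert_path)))
```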

3.2 Stage II: Indicator-Based Group Relative Policy Optimization

Training Iteration Procedure:

  1. Group Sampling: For each state in the batch, sample $K$ trajectories $\{\pi_1, \dots, \pi_K\}$ using the current policy.
  2. Reward Vectorization: Compute the reward vector $r_i = r(s, \pi_i) \in \mathbb{R}^4$ for each trajectory.
  3. $I_{\epsilon+}$ Dominance Indicator: For each pair $(\pi_j, \pi_i)$,

$$I_{\epsilon+}(r_j, r_i) = \max_{m \in \{1, \dots, 4\}} \left[ r_{i,m} - r_{j,m} \right]$$

quantifies the minimum uniform augmentation needed for $r_j$ to weakly dominate $r_i$.
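
The additive epsilon-indicator is a one-liner over the reward vectors; a sketch (function name illustrative):

```python
def eps_indicator(r_j, r_i):
    """Additive epsilon-indicator: the smallest uniform shift that, added to
    every component of r_j, makes it weakly dominate r_i (componentwise >=).
    Negative values mean r_j already dominates r_i."""
    return max(ri_m - rj_m for rj_m, ri_m in zip(r_j, r_i))
```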

  4. Pareto Fitness $R_i$:

$$R_i = -\sum_{j \neq i} \exp \left( -\frac{I_{\epsilon+}(r_j, r_i)}{\kappa} \right)$$

with scaling parameter $\kappa > 0$.

  5. Group-Relative Advantage $A_i$: Standardize $R_i$ within the group:

$$A_i = \frac{R_i - \overline{R}}{\operatorname{std}(R) + \epsilon}$$

  6. Policy Update: Optimize using importance sampling and asymmetric clipping:

$$L_{IB}(\theta) = \mathbb{E} \left[ \frac{1}{K} \sum_{i=1}^K \mathrm{clipped} \left( \frac{\mu_\theta}{\mu_{old}} \right) \cdot A_i \right]$$

where $\mathrm{clipped}(r) = \min( \max(r, 1-\epsilon_{low}),\, 1+\epsilon_{high} )$, with $\epsilon_{low} = 0.2$ and $\epsilon_{high} = 0.28$.

This pairwise dominance approach facilitates direct discovery of Pareto-optimal policies and circumvents the drawbacks of manual scalarization.
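
The fitness, standardization, and clipping steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions: `kappa=0.5` is a placeholder (the paper tunes $\kappa$ per validation), and function names are illustrative.

```python
import math
from statistics import mean, pstdev

def pareto_fitness(rewards, kappa=0.5):
    """R_i = -sum_{j != i} exp(-I_eps(r_j, r_i) / kappa): trajectories that
    are (nearly) dominated by others receive strongly negative fitness."""
    def ind(r_j, r_i):
        return max(b - a for a, b in zip(r_j, r_i))
    return [-sum(math.exp(-ind(r_j, r_i) / kappa)
                 for j, r_j in enumerate(rewards) if j != i)
            for i, r_i in enumerate(rewards)]

def group_advantages(R, eps=1e-8):
    """Standardize fitness within the sampled group of K trajectories."""
    mu, sd = mean(R), pstdev(R)
    return [(r - mu) / (sd + eps) for r in R]

def clipped(ratio, eps_low=0.2, eps_high=0.28):
    """Asymmetric clipping of the importance-sampling ratio."""
    return min(max(ratio, 1 - eps_low), 1 + eps_high)
```

With a group of two trajectories where one dominates the other, the dominant one receives the higher fitness and a positive advantage, as intended.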

4. Training Workflow and Pseudocode

The core training protocol can be summarized as follows:

| Stage | Key Steps |
|---|---|
| Warm-start SFT | Generate expert trajectories (GA, RL agents); aggregate dataset $\mathcal{D}_{sft}$; fine-tune the LLM via cross-entropy loss on expert paths. |
| IB-GRPO Optimization | For $N$ epochs: sample state batches; for each state, sample $K$ trajectories; compute vector rewards, all pairwise $I_{\epsilon+}$, $R_i$, and $A_i$; update $\theta$ on $L_{IB}$; update the reference policy. |

This approach synthesizes search-based and offline RL-generated data for broad solution coverage, then proceeds with indicator-based policy optimization.

5. Empirical Evaluation

Experiments are conducted on the ASSIST09 and Junyi datasets with a DKT-based Knowledge Evolution Simulator (KES). Key experimental characteristics:

  • Datasets:
    • ASSIST09: 167 concepts, 4,217 learners, 346,860 records.
    • Junyi: 835 concepts, 525,061 learners, 21,460,249 records.
  • Simulator: KES (DKT environment) for counterfactual student response simulation.
  • Model backbone: Qwen2.5-7B LLM (8 × A800-40GB GPUs).
  • Key hyperparameters: Group size $K=8$, indicator scale $\kappa$ tuned per validation, clipping range $[1-0.2,\, 1+0.28]$.

Baselines:

  • Non-RL: DKTRec
  • General RL: DQN, Actor-Critic, PPO
  • Education-specific RL: CSEAL, GEPKSD
  • LLM-based: GenAL, ReAL

Metrics:

  • Learning Effect ($E_p$): higher is better
  • LenScore: $\max\{0,\, 1 - |L - L_{target}| / L_{target}\} \in [0,1]$
  • Path Diversity ($Div_{path}$): $|\mathrm{uniq}(\pi)|/L$
  • ZPD alignment ($S_{ZPD}$)
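
For concreteness, the LenScore and path-diversity metrics can be computed as follows (a sketch; function names are illustrative):

```python
def len_score(L, L_target):
    """max(0, 1 - |L - L_target| / L_target), bounded in [0, 1]."""
    return max(0.0, 1.0 - abs(L - L_target) / L_target)

def path_diversity(path):
    """Fraction of unique items in the recommended path: |uniq(pi)| / L."""
    return len(set(path)) / len(path)
```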

Main Results:

| Dataset | Baseline ($L=20$) | IB-GRPO ($L=20$) |
|---|---|---|
| Junyi | ReAL: 0.5724 | 0.7743 |
| ASSIST09 | GEPKSD: 0.5837 | 0.5911 |

IB-GRPO surpasses all baselines across metrics and path lengths. Ablations reveal that removal of ZPD rewards degrades long-horizon planning, and alternatives to $I_{\epsilon+}$ such as HVO or GDPO underperform in achieving balanced trade-offs. Hybrid GA+RL demonstration data enables the most comprehensive Pareto coverage.

Diagnostic plots confirm that IB-GRPO trajectories are more tightly distributed around the ZPD optimal band and exhibit reduced late-stage difficulty variance. Achieved solutions are superior in balancing learning effect, ZPD compliance, path length accuracy, and diversity.

6. Significance and Extensions

IB-GRPO demonstrates that direct Pareto-efficient policy optimization for LPR, leveraging indicator-guided group advantages and pedagogical alignment, yields substantial improvement over both generic reinforcement learning and simple LLM-based approaches. By explicitly incorporating a differentiable ZPD reward and circumventing fixed scalarization, IB-GRPO promotes more robust and pedagogically sound learning path recommendations that generalize across datasets and path horizons. This suggests that indicator-based group relative methods offer a scalable foundation for multi-objective alignment in complex educational inference tasks.

For a detailed description of the framework, see (Wang et al., 21 Jan 2026).
