
IB-GRPO: Pareto-Optimal Learning Path Framework

Updated 2 February 2026
  • IB-GRPO is a multi-objective learning path recommendation framework that uses vector-valued rewards to optimize pedagogical objectives such as learning effect and ZPD alignment.
  • The framework integrates genetic algorithms and offline RL for expert warm-start, followed by indicator-based policy optimization without manual scalarization.
  • Empirical evaluations on ASSIST09 and Junyi datasets demonstrate its superior balance in optimizing learning effect, path diversity, and operational constraints.

IB-GRPO (Indicator-Based Group Relative Policy Optimization) is a learning path recommendation (LPR) framework that aligns LLM-based policies with pedagogical objectives such as learning effect maximization, zone of proximal development (ZPD) alignment, operational constraints, and path diversity, by leveraging a vector-valued reward structure and direct Pareto frontier optimization without manual scalarization (Wang et al., 21 Jan 2026).

1. Motivation and Pedagogical Objectives

Long-horizon LPR requires generating sequences of learning items personalized to individual students in order to:

  • Maximize long-term learning effect: Enhance a student’s post-instruction proficiency.
  • Schedule exercise difficulty to match the ZPD: Adjust item difficulty to maintain challenge within a beneficial proficiency band.
  • Respect operational constraints: Adhere to desired path lengths and session constraints.
  • Maintain diversity: Avoid repetitive recommendations by supporting a wide range of plausible learning trajectories.

Standard LLMs, trained to optimize next-token likelihood, exhibit myopic planning and do not natively conform to these multi-faceted pedagogical needs. Simple scalar reduction of multi-objective signals risks “locking in” suboptimal trade-offs. IB-GRPO introduces a vector reward formulation and directly targets the Pareto-optimal frontier, sidestepping ad hoc weighting and promoting adaptive balancing of objectives.

2. Formal Framework and Multi-Objective MDP

The LPR setting is formalized as an episodic Markov Decision Process (MDP):

  • State ($s$): Encapsulates the student's latent proficiency ($a$), interaction history $H=\{(c_1, y_1),\dots,(c_k, y_k)\}$, and prompt features (e.g., target path length, prior recommendations).
  • Action ($\pi_t$): Choosing the next knowledge concept or exercise from a candidate set.
  • Trajectory ($\pi$): An ordered sequence $(\pi_1,\dots,\pi_L)$ generated auto-regressively by the parameterized LLM policy $\mu_\theta$.

The reward function $r(s, \pi) \in \mathbb{R}^4$ is a vector comprising:

  1. Learning Effect $E_p(\pi)$:

$$E_p(\pi) = (E_e - E_s)/(E_{sup} - E_s)$$

where $E_s, E_e$ are pre- and post-test scores and $E_{sup}$ the maximum attainable score.

  2. ZPD Alignment $S_{ZPD}(\pi)$:

$$S_{ZPD}(\pi) = \frac{1}{L}\sum_t \exp\left(-\frac{(d(\pi_t) - z(a))^2}{2\sigma^2}\right)$$

where $d(\pi_t)$ is the item difficulty and $z(a)$ the optimal-difficulty center for proficiency $a$.

  3. Length Constraint $R_{Len}(\pi)$:

$$R_{Len}(\pi) = \begin{cases} 1.0, & |\Delta| \leq \tau \\ -\lambda (|\Delta| - \tau), & |\Delta| > \tau \end{cases}$$

where $\Delta = L - L_{target}$.

  4. Diversity $D_{Div}(\pi; \mathcal{B})$:

$$D_{Div}(\pi; \mathcal{B}) = 1 - \frac{1}{|\mathcal{B}|-1} \sum_{\pi' \neq \pi} \mathrm{Sim}_{Jaccard}(\pi, \pi')$$

where $\mathcal{B}$ is the sampled path group and similarity is measured over $n$-gram overlaps.

The learning objective is to train $\mu_\theta$ so that non-dominated trajectories sampled from the policy closely approximate the true Pareto frontier of the vectorized reward, without scalar collapse.
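
As an illustration, the four reward components can be sketched in plain Python. This is a minimal sketch: function names and default parameter values are illustrative, and Jaccard similarity is computed over item sets rather than the paper's $n$-gram overlaps.

```python
import math

def zpd_alignment(difficulties, z_center, sigma=1.0):
    """Mean Gaussian proximity of item difficulties to the ZPD center z(a)."""
    return sum(math.exp(-((d - z_center) ** 2) / (2 * sigma ** 2))
               for d in difficulties) / len(difficulties)

def length_reward(L, L_target, tau=1, lam=0.1):
    """1.0 inside the tolerance band |Delta| <= tau, linear penalty outside."""
    delta = abs(L - L_target)
    return 1.0 if delta <= tau else -lam * (delta - tau)

def jaccard(a, b):
    """Set-level Jaccard similarity (unigram simplification)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def diversity(path, group):
    """1 minus mean Jaccard similarity to the other paths in the group."""
    others = [p for p in group if p is not path]
    return 1.0 - sum(jaccard(path, p) for p in others) / len(others)
```

A fully dominated-by-repetition group (identical paths) yields diversity 0, while fully disjoint paths yield diversity 1.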

3. IB-GRPO Algorithmic Procedure

IB-GRPO operates in two distinct stages: a hybrid expert warm-start via supervised fine-tuning (SFT), followed by indicator-based group relative policy optimization.

3.1 Stage I: Hybrid Expert Warm-Start

  • Genetic Algorithm Expert: Each trajectory is encoded as a chromosome; tournament selection, crossover, and mutation drive exploration toward high-learning-effect paths while preserving diversity.
  • Policy-based Teacher (Offline RL): Pre-trained LPR agents (e.g., CSEAL, GEPKSD) complement genetic search, especially in sparse or low-performing regions.
  • Behavior Cloning SFT: Expert trajectories from both sources comprise the dataset $\mathcal{D}_{sft} = \{(s, \pi_{expert})\}$. The LLM is warm-started by minimizing the negative log-likelihood:

$$L_{SFT}(\theta) = -\mathbb{E}_{(s, \pi_{expert}) \sim \mathcal{D}_{sft}} \left[ \sum_{t=1}^L \log \mu_\theta(\pi_{expert, t} \mid s, \pi_{expert, <t}) \right]$$

  • Outcome: Warm-started policy $\mu_{sft}(\pi \mid s)$.
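
The behavior-cloning objective is an ordinary autoregressive negative log-likelihood over expert tokens. A minimal sketch, where `logprob_fn` is a hypothetical stand-in for the policy's per-token log-probability:

```python
def sft_loss(logprob_fn, state, expert_path):
    """Negative log-likelihood of the expert path under the policy,
    summed over autoregressive steps (behavior cloning)."""
    return -sum(logprob_fn(state, expert_path[:t], expert_path[t])
                for t in range(len(expert_path)))
```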

3.2 Stage II: Indicator-Based Group Relative Policy Optimization

Training Iteration Procedure:

  1. Group Sampling: For each state in the batch, sample $K$ trajectories $\{\pi_1, \dots, \pi_K\}$ using the current policy.
  2. Reward Vectorization: Compute the reward vector $r_i = r(s, \pi_i) \in \mathbb{R}^4$ for each trajectory.
  3. $I_{\epsilon+}$ Dominance Indicator: For each pair $(\pi_j, \pi_i)$,

$$I_{\epsilon+}(r_j, r_i) = \max_{m \in \{1, \dots, 4\}} \left[ r_{i,m} - r_{j,m} \right]$$

quantifies the minimum uniform augmentation needed for $r_j$ to weakly dominate $r_i$.
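
The additive epsilon-indicator is a one-liner over the reward vectors; a sketch (function name illustrative):

```python
def eps_indicator(r_j, r_i):
    """Additive epsilon-indicator: the smallest uniform shift that, added to
    every component of r_j, makes it weakly dominate r_i (componentwise >=).
    Negative values mean r_j already dominates r_i."""
    return max(ri_m - rj_m for rj_m, ri_m in zip(r_j, r_i))
```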

  4. Pareto Fitness $R_i$:

$$R_i = -\sum_{j \neq i} \exp \left( -\frac{I_{\epsilon+}(r_j, r_i)}{\kappa} \right)$$

with scaling parameter $\kappa > 0$.

  5. Group-Relative Advantage $A_i$: Standardize $R_i$ within the group:

$$A_i = \frac{R_i - \overline{R}}{\operatorname{std}(R) + \epsilon}$$

  6. Policy Update: Optimize using importance sampling and asymmetric clipping:

$$L_{IB}(\theta) = \mathbb{E} \left[ \frac{1}{K} \sum_{i=1}^K \mathrm{clipped} \left( \frac{\mu_\theta}{\mu_{old}} \right) \cdot A_i \right]$$

where $\mathrm{clipped}(r) = \min( \max(r, 1-\epsilon_{low}),\, 1+\epsilon_{high} )$, with $\epsilon_{low} = 0.2$ and $\epsilon_{high} = 0.28$.

This pairwise dominance approach facilitates direct discovery of Pareto-optimal policies and circumvents the drawbacks of manual scalarization.
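
The fitness, standardization, and clipping steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions: `kappa=0.5` is a placeholder (the paper tunes $\kappa$ per validation), and function names are illustrative.

```python
import math
from statistics import mean, pstdev

def pareto_fitness(rewards, kappa=0.5):
    """R_i = -sum_{j != i} exp(-I_eps(r_j, r_i) / kappa): trajectories that
    are (nearly) dominated by others receive strongly negative fitness."""
    def ind(r_j, r_i):
        return max(b - a for a, b in zip(r_j, r_i))
    return [-sum(math.exp(-ind(r_j, r_i) / kappa)
                 for j, r_j in enumerate(rewards) if j != i)
            for i, r_i in enumerate(rewards)]

def group_advantages(R, eps=1e-8):
    """Standardize fitness within the sampled group of K trajectories."""
    mu, sd = mean(R), pstdev(R)
    return [(r - mu) / (sd + eps) for r in R]

def clipped(ratio, eps_low=0.2, eps_high=0.28):
    """Asymmetric clipping of the importance-sampling ratio."""
    return min(max(ratio, 1 - eps_low), 1 + eps_high)
```

With a group of two trajectories where one dominates the other, the dominant one receives the higher fitness and a positive advantage, as intended.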

4. Training Workflow and Pseudocode

The core training protocol can be summarized as follows:

| Stage | Key Steps |
|---|---|
| Warm-start SFT | Generate expert trajectories (GA, RL agents); aggregate dataset $\mathcal{D}_{sft}$; fine-tune the LLM via cross-entropy loss on expert paths. |
| IB-GRPO Optimization | For $N$ epochs: sample state batches; for each state, sample $K$ trajectories; compute vector rewards, all pairwise $I_{\epsilon+}$, $R_i$, and $A_i$; update $\theta$ on $L_{IB}$; update the reference policy. |

This approach synthesizes search-based and offline RL-generated data for broad solution coverage, then proceeds with indicator-based policy optimization.

5. Empirical Evaluation

Experiments are conducted on the ASSIST09 and Junyi datasets with a DKT-based Knowledge Evolution Simulator (KES). Key experimental characteristics:

  • Datasets:
    • ASSIST09: 167 concepts, 4,217 learners, 346,860 records.
    • Junyi: 835 concepts, 525,061 learners, 21,460,249 records.
  • Simulator: KES (DKT environment) for counterfactual student response simulation.
  • Model backbone: Qwen2.5-7B LLM (8 × A800-40GB GPUs).
  • Key hyperparameters: Group size $K=8$, indicator scale $\kappa$ tuned per validation, clipping range $[1-0.2,\, 1+0.28]$.

Baselines:

  • Non-RL: DKTRec
  • General RL: DQN, Actor-Critic, PPO
  • Education-specific RL: CSEAL, GEPKSD
  • LLM-based: GenAL, ReAL

Metrics:

  • Learning Effect ($E_p$): higher is better
  • LenScore: $\max\{0,\, 1 - |L - L_{target}| / L_{target}\} \in [0,1]$
  • Path Diversity ($Div_{path}$): $|\mathrm{uniq}(\pi)|/L$
  • ZPD alignment ($S_{ZPD}$)
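
For concreteness, the LenScore and path-diversity metrics can be computed as follows (a sketch; function names are illustrative):

```python
def len_score(L, L_target):
    """max(0, 1 - |L - L_target| / L_target), bounded in [0, 1]."""
    return max(0.0, 1.0 - abs(L - L_target) / L_target)

def path_diversity(path):
    """Fraction of unique items in the recommended path: |uniq(pi)| / L."""
    return len(set(path)) / len(path)
```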

Main Results:

| Dataset | Baseline ($L=20$) | IB-GRPO ($L=20$) |
|---|---|---|
| Junyi | ReAL: 0.5724 | 0.7743 |
| ASSIST09 | GEPKSD: 0.5837 | 0.5911 |

IB-GRPO surpasses all baselines across metrics and path lengths. Ablations reveal that removal of ZPD rewards degrades long-horizon planning, and alternatives to $I_{\epsilon+}$ such as HVO or GDPO underperform in achieving balanced trade-offs. Hybrid GA+RL demonstration data enables the most comprehensive Pareto coverage.

Diagnostic plots confirm that IB-GRPO trajectories are more tightly distributed around the ZPD optimal band and exhibit reduced late-stage difficulty variance. Achieved solutions are superior in balancing learning effect, ZPD compliance, path length accuracy, and diversity.

6. Significance and Extensions

IB-GRPO demonstrates that direct Pareto-efficient policy optimization for LPR, leveraging indicator-guided group advantages and pedagogical alignment, yields substantial improvement over both generic reinforcement learning and simple LLM-based approaches. By explicitly incorporating a differentiable ZPD reward and circumventing fixed scalarization, IB-GRPO promotes more robust and pedagogically sound learning path recommendations that generalize across datasets and path horizons. This suggests that indicator-based group relative methods offer a scalable foundation for multi-objective alignment in complex educational inference tasks.

For a detailed description of the framework, see (Wang et al., 21 Jan 2026).
