Text Simplification & Curriculum Learning
- Text simplification is a process that converts complex texts into simpler versions while preserving core meaning using iterative edit-based methods.
- Curriculum learning is a training strategy that gradually introduces more complex editing tasks, enabling models to master text transformations.
- Integrating curriculum learning with edit-based models leads to improved performance metrics, enhanced delete precision, and reduced train-test mismatches.
Text simplification is the task of generating linguistically simpler versions of complex source texts while preserving meaning and core information content. Curriculum learning is a paradigm in machine learning wherein tasks or samples are presented to the model in a meaningful order, typically progressing from easy to difficult. The intersection of text simplification and curriculum learning has led to substantial advances in both task modeling and evaluation, especially where data scarcity, high annotation cost, and sample diversity present optimization challenges.
1. Task Formulations and Model Architectures
In the controllable text simplification setting, the input consists of a complex sentence x and a target reading grade level g; the output is a simplified version y of x whose complexity does not exceed the desired grade. Decoding is initialized from the non-empty state y^0 = x and iteratively refined through editing steps, resulting in the final hypothesis y^T.
Modeling approaches include non-autoregressive edit-based architectures in which each state y^t is an evolving hypothesis. At each iteration, the model predicts an action a, comprising:
- A reposition vector r (selecting source indices for target positions, or marking tokens for deletion)
- A sequence of placeholders (predicted by a placeholder-insertion classifier) and actual insertions (predicted by a token classifier)
The generative model computes the joint probability of an action as

p(a | y^t) = p(r | y^t) · p(m | y^t_r) · ∏_{k=1..m} p(w_k | y^t_m),

where y^t_r is the state after reposition, m is the number of placeholder masks, y^t_m is the state after placeholder insertion, and w_k is the token inserted at the k-th placeholder. Parameters are shared across a Transformer encoder–decoder backbone (Agrawal et al., 2022).
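This reposition → placeholder → fill cycle can be sketched as follows. This is an illustrative toy, not the authors' implementation: all model "predictions" (the reposition indices, mask counts, and inserted tokens) are hard-coded stubs.

```python
MASK = "<mask>"

def reposition(tokens, indices):
    """Apply a reposition vector: each index selects a token of the
    current state to keep (in order); omitted indices are deletions."""
    return [tokens[i] for i in indices]

def insert_placeholders(tokens, mask_counts):
    """mask_counts[j] placeholders are inserted before position j;
    len(mask_counts) == len(tokens) + 1 (last slot = end of sentence)."""
    out = []
    for j, tok in enumerate(tokens):
        out.extend([MASK] * mask_counts[j])
        out.append(tok)
    out.extend([MASK] * mask_counts[-1])
    return out

def fill_tokens(tokens, insertions):
    """Replace each placeholder with the next predicted token."""
    it = iter(insertions)
    return [next(it) if t == MASK else t for t in tokens]

# One editing step on a toy hypothesis y_t (all "predictions" are stubs):
y_t = ["the", "committee", "reached", "a", "unanimous", "decision"]
y_del = reposition(y_t, [0, 1, 3, 5])                 # drop "reached", "unanimous"
y_mask = insert_placeholders(y_del, [0, 0, 1, 0, 0])  # one slot before "a"
y_next = fill_tokens(y_mask, ["made"])                # fill the slot
```

Iterating such steps until the model predicts no further edits yields the final hypothesis.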
In LLM pretraining, paired corpora such as parallel human-written (HW) and LLM-simplified (SIMP) paragraphs have enabled empirical investigation of curriculum-driven data ordering strategies. Model configurations include decoder-only transformers (e.g., MobileLLM, 124M–256M parameter scales) using fixed context windows and consistent optimizer schedules (Roque et al., 29 Sep 2025).
2. Curriculum Learning Methodologies
Curriculum learning is often operationalized by ranking training samples according to their difficulty and exposing the model to increasingly complex samples as its competence grows. In edit-based simplification (Agrawal et al., 2022):
- Each training pair (x, y) is scored by its Levenshtein distance d(x, y), normalized via the empirical CDF to [0, 1].
- A competence schedule c(t) controls pacing, with a budget T defining the curriculum duration.
- At step t, only pairs whose normalized difficulty does not exceed the current competence c(t) are selected, allowing the model to master short-distance edits before confronting more distant (complex) edit pairs.
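A minimal sketch of this selection mechanism, assuming a square-root competence schedule in the style of Platanios et al. (2019) (the schedule form and the initial competence c0 are assumptions, not details from the paper):

```python
import bisect
import math

def levenshtein(a, b):
    """Token-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def empirical_cdf(values):
    """Map each raw difficulty to its empirical CDF value in (0, 1]."""
    sorted_vals = sorted(values)
    return [bisect.bisect_right(sorted_vals, v) / len(values) for v in values]

def competence(t, T, c0=0.1):
    """Square-root competence schedule: the fraction of the difficulty
    range available at step t of a curriculum budget T."""
    return min(1.0, math.sqrt(t * (1 - c0 ** 2) / T + c0 ** 2))

pairs = [("a b c".split(), "a b c".split()),
         ("a b c d".split(), "a x c".split()),
         ("a b c d e f".split(), "z y x".split())]
difficulty = empirical_cdf([levenshtein(x, y) for x, y in pairs])

def batch_at(t, T):
    """Select only the pairs the model is currently competent for."""
    c = competence(t, T)
    return [p for p, d in zip(pairs, difficulty) if d <= c]
```

Early in training only the low-edit-distance pairs pass the filter; by step T the full corpus is available.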
In pretraining with parallel HW and SIMP corpora (Roque et al., 29 Sep 2025), curriculum alternatives include:
- Simple→Complex (SIMP→HW): The model is first trained on all simplified data, then on complex originals.
- Interleaved: Uniform, random-mix scheduling between HW and SIMP exemplars.
- Anti-curriculum (HW→SIMP): Reversed order, presenting the more complex data first.
- Repetition baseline: Two epochs over only the HW data.
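The four orderings above amount to different ways of concatenating or mixing the two corpora under a fixed two-epoch token budget; a minimal sketch (strategy names are ours, not the paper's):

```python
import random

def build_schedule(hw, simp, strategy, seed=0):
    """Return the pretraining example order for one curriculum strategy.
    hw / simp are lists of paragraph examples; budget = two corpus passes."""
    if strategy == "simp_to_hw":    # simple -> complex curriculum
        return simp + hw
    if strategy == "anti":          # complex -> simple (anti-curriculum)
        return hw + simp
    if strategy == "interleaved":   # uniform random mix of both corpora
        mixed = hw + simp
        random.Random(seed).shuffle(mixed)
        return mixed
    if strategy == "repetition":    # baseline: two epochs of HW only
        return hw + hw
    raise ValueError(f"unknown strategy: {strategy}")
```

All strategies see the same number of training examples; only the order (and, for the repetition baseline, the data composition) differs.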
In metric learning, such as with REFeREE (Huang et al., 26 Mar 2024), the curriculum is staged:
- Stage 1: Scalable pretraining with reference-free proxy supervision on synthetic candidates and large corpora.
- Stage 2: Intermediate adaptation, introducing conventional reference-based metrics (e.g., BLEU, SARI) with smaller annotated datasets.
- Stage 3: Fine-tuning on human ratings, using only a handful of annotations, to maximize alignment with expert quality judgments.
3. Imitation Learning, Train–Test Mismatch, and Roll-in Policies
Edit-based models are typically trained using imitation learning. Standard dual-path roll-in approaches (e.g., mixing model predictions with noised references) are well suited to machine translation but create a mismatch for text editing, since inference always starts from the original source x, not from perturbed reference outputs.
To resolve this, the editing roll-in procedure was introduced:
- Produce the initial roll-in state by applying noise (token shuffling and dropping) to the source x, not to the reference y*.
- Compute the oracle optimal reposition action aligning the noised state to y* via Levenshtein alignment.
- Derive subsequent states by deterministically inserting the tokens required for the next state.
These roll-in states yield inputs closely mirroring actual inference, improving the reliability of learned edit trajectories (Agrawal et al., 2022).
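The first two steps can be sketched as below. This is a simplified stand-in: the noising rates are assumptions, and the greedy matching here approximates the paper's Levenshtein-alignment oracle rather than reproducing it.

```python
import random

def noise_source(x, drop_p=0.2, seed=0):
    """Create the roll-in start state by dropping and shuffling tokens
    of the SOURCE x (never the reference), mirroring inference."""
    rng = random.Random(seed)
    kept = [t for t in x if rng.random() >= drop_p]
    rng.shuffle(kept)
    return kept or x[:1]  # never return an empty state

def oracle_edits(state, reference):
    """Greedy stand-in for the oracle alignment: for each reference
    token, reuse a matching position in the current state if one is
    free, otherwise mark it as an insertion."""
    used = set()
    actions = []
    for tok in reference:
        idx = next((i for i, s in enumerate(state)
                    if s == tok and i not in used), None)
        if idx is not None:
            used.add(idx)
            actions.append(("keep", idx))
        else:
            actions.append(("insert", tok))
    return actions
```

Training the policy to reproduce these oracle actions from noised-source states keeps the supervision aligned with how decoding actually begins.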
4. Empirical Results and Impact of Curriculum Learning
Experiments on Newsela-Grade (Agrawal et al., 2022), a paired corpus with explicit reading grade constraints, show that curriculum learning combined with inference-aligned roll-in states leads to improvements in both holistic and fine-grained metrics:
| Model Variant | SARI | ARI-Acc | Delete Precision | Corr (ARI) |
|---|---|---|---|---|
| AR baseline | 43.4 | 34.5% | 6.2 | 0.716 |
| EDITOR (ref roll-in) | 45.5 | 29.7% | 2.2 | 0.656 |
| + From-Input roll-in | 49.3 | 37.7% | 3.6 | 0.733 |
| + Editing roll-in | 51.7 | 39.7% | 5.2 | 0.745 |
| + Curriculum (EDITCL) | 53.3 | 39.8% | 4.9 | 0.747 |
Per the table above, curriculum learning (EDITCL) adds +1.6 SARI over editing-only roll-in and +7.8 SARI over the EDITOR (reference roll-in) baseline. It also recovers much of the delete precision lost by EDITOR (+2.7 points over that baseline) and strengthens both ARI accuracy and the correlation between generated and target grades (Agrawal et al., 2022).
In pretraining settings with constrained tokens, adding LLM-simplified data confers measurable gains over repeated exposure to the original data. For a 124M parameter model, SIMP→HW (simple-to-complex curriculum) yields higher macro-averaged NLU scores under all fine-tuning budgets compared to baseline. For a 256M model, INTERLEAVED scheduling outperforms all ordered curricula (+1.8–1.9 points in fine-tuning, +0.2–0.7 points in zero-shot evaluation) (Roque et al., 29 Sep 2025).
5. Curriculum-Led Model-Based Metrics for Simplification
REFeREE (Huang et al., 26 Mar 2024) formalizes a three-stage curriculum for reference-free evaluation of text simplification:
- Stage 1: Large-scale synthetic pretraining with proxy signals for adequacy, fluency, and simplicity. Data augmentation (scrambling, deletion, source/output swapping) broadens the quality spectrum.
- Stage 2: Intermediate adaptation with classic reference-based metrics (BLEU, SARI, BERTScore) using limited parallel data.
- Stage 3: Fine-tuning on sparse, high-quality human ratings (overall or aspect-specific).
Each curriculum stage minimizes a regression (L2) loss over its relevant signals, with supervision-signal sets S_k and data partitions D_k for each stage k. Epoch-dependent stage weighting functions λ_k(e) control the combined objective:

L(e) = Σ_k λ_k(e) · Σ_{s ∈ S_k} L_s(D_k).
REFeREE achieves the leading Kendall correlation with SimpEval human ratings, outperforming LENS and BLEURT, and maintains robust performance across adequacy, fluency, and simplicity axes. Ablations show stage 1 synthetic pretraining to be particularly impactful; removing it degrades correlation by ~0.12. Data augmentation in stage 1 also broadens generalization, especially for low-quality system outputs (Huang et al., 26 Mar 2024).
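A minimal sketch of epoch-dependent stage weighting over per-stage losses. The hard cumulative schedule (each stage switching on at a fixed epoch boundary and staying on) is an assumption for illustration, not REFeREE's actual weighting functions:

```python
def staged_loss(epoch, stage_losses, schedule):
    """Combine per-stage losses with epoch-dependent weights.
    stage_losses: {stage_index: scalar loss}
    schedule(epoch, stage_index) -> weight in [0, 1]."""
    return sum(schedule(epoch, k) * loss for k, loss in stage_losses.items())

def hard_schedule(epoch, stage, boundaries=(0, 10, 14)):
    """Hypothetical schedule: stage k activates once epoch reaches
    boundaries[k]; earlier stages stay active (cumulative curriculum)."""
    return 1.0 if epoch >= boundaries[stage] else 0.0
```

Under this schedule, early epochs optimize only the synthetic proxy signals; later epochs add the reference-based and human-rating objectives on top.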
6. Analysis, Insights, and Best Practices
The following insights emerge from recent research:
- Inference-aligned roll-in is essential: generating roll-in states by noising the source x ensures that the edit operations learned during training match test-time requirements; rolling in from the reference y* introduces a train–test divergence (Agrawal et al., 2022).
- True edit distance outperforms proxies: ranking pairs by Levenshtein distance d(x, y) provides a more effective curriculum than alternative measures such as sentence-length ratios or grade differences (Agrawal et al., 2022).
- Curriculum benefits are model-size dependent: For smaller models or extremely limited fine-tuning, simple-to-complex ordering (SIMP→HW) is most beneficial; larger models gain more from interleaved exposure, leveraging diverse gradients and avoiding phase-specific local minima (Roque et al., 29 Sep 2025).
- Complementarity: Curriculum and realistic roll-in mechanisms are synergistic; removing either sharply harms edit and overall quality (Agrawal et al., 2022).
- Metric learning generalizes under curriculum: Bootstrapping with large proxy-labeled data enables generalization without human references, grounded further with domain-specific or human-evaluated phases (Huang et al., 26 Mar 2024).
A plausible implication is that, as model scale and task diversity increase, curriculum choice and data representation require problem- and architecture-specific considerations.
7. Limitations and Future Directions
Current limitations include:
- Token-uniform edit distance: Treating all tokens equally in curriculum ranking fails to account for syntactic function or semantic salience. Incorporating token-type weighting or end-to-end learned difficulty remains an open challenge (Agrawal et al., 2022).
- Limited edit operations: Present frameworks exclude full paraphrasing or elaborative transformations. Extending the oracle to capture richer edit grammars is a viable avenue (Agrawal et al., 2022).
- Marginal gains from intermediate metric stages: In REFeREE, intermediate adaptation (stage 2) offers only marginal advantage over progressing directly from synthetic proxy supervision to human-rating objectives (Huang et al., 26 Mar 2024).
- Hyperparameter tuning: Curriculum length, noise rates, and mixing schedules must be adjusted per task and model scale (Agrawal et al., 2022).
Expanding curriculum-based simplification to document-level settings, multilingual corpora, and joint simplification–summarization remains a productive direction. In pretraining, leveraging automatic simplification via LLMs is validated as an effective augmentation strategy under constrained token budgets (Roque et al., 29 Sep 2025).