Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens

Published 2 Apr 2026 in cs.CL | (2604.01779v1)

Abstract: Controllable Automatic Text Simplification (CATS) produces user-tailored outputs, yet controllability is often treated as a decoding problem and evaluated with metrics that are not reflective to the measure of control. We observe that controllability in ATS is significantly constrained by data and evaluation. To this end, we introduce a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, steering open-source models to target readability levels and compression rates. Across three model families with different model sizes (Llama, Mistral, Qwen; 1-14B) and four domains (medicine, public administration, news, encyclopedic text), we find that smaller models (1-3B) can be competitive, but reliable controllability strongly depends on whether the training data encodes sufficient variation in the target attribute. Readability control (FKGL, ARI, Dale-Chall) is learned consistently, whereas compression control underperforms due to limited signal variability in the existing corpora. We further show that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based measures for target-output alignment. Finally, our sampling and stratification experiments demonstrate that naive splits can introduce distributional mismatch that undermines both training and evaluation.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper demonstrates that instruction fine-tuning with discrete control tokens significantly improves controllability in automatic text simplification with low MAE on target metrics.
The methodology leverages manually curated instruction prompts, stratified sampling based on readability and compression, and evaluation using metrics like FKGL, SARI, and COMET.
Findings highlight that data quality and attribute diversity are critical for effective simplification, enabling even smaller models to achieve competitive performance.

Controllable Instruction Fine-Tuning for Automatic Text Simplification with CATS

Problem Formulation and Motivation

Automatic Text Simplification (ATS) seeks to generate text that is easier to read while minimally altering semantic content. User-controllable simplification is critical to satisfy diverse readability needs; however, previous approaches typically conflate control with model decoding strategies and evaluate outputs with task-agnostic metrics, ignoring the necessity for explicit alignment between target attributes and system outputs. The "Taming CATS" paper "Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens" (2604.01779) systematically analyzes these deficiencies and introduces an instruction fine-tuning protocol using control tokens to directly steer open-source LLMs to specific attribute values.

Framework and Methodology

The proposed CATS framework encapsulates control via discrete tokens injected into the instruction prompt, guiding generation toward fixed target readability (FKGL, ARI, Dale-Chall) or length compression (character/word ratios). Data from four domains—medicine (MED-EASI), public administration (SIMPA), encyclopedic (WIKILARGE), and news (NEWSELA)—is harmonized and rigorously stratified to mitigate attribute distribution mismatch between train/validation/test splits.

The fine-tuning process involves:

Manually curated instruction templates: To maximize prompt variability and reduce exposure bias.
Control token embedding: Placed at the assistant's output onset, encoding the desired attribute value or compression goal.
Model family/size sweep: Llama (1–13B), Mistral (3–7B), Qwen (1.7–14B).
Split stratification: Employing FKGL or character count-based stratified sampling (with Kolmogorov-Smirnov divergence minimization) to maintain representativeness.
Filtered monotonic datasets: Subsets wherein all simplifications are strictly simpler than their sources, to examine the impact of data cleaning on model performance.

Evaluation Protocol

Performance is evaluated over three axes:

Controllability: Mean Absolute Error (MAE) between the generated output's attribute value and the target.
Simplification Quality: SARI (edit-centric), LENS (learnable, human-aligned), and COMET (semantic adequacy).
Similarity: BLEU and BERTScore to both source and reference.

Robust inference is ensured through multiple seeds and aggregated results to mitigate LLM stochasticity.

Main Results

Data Experiments

Partitioning by readability (FKGL) or character count consistently minimizes distribution divergence among splits, whereas native partitioning leads to non-representative validation/test sets. Over-cleaning data by enforcing monotonicity sometimes reduces model–target alignment, indicating that excessive pruning can harm the diversity needed for robust controllability.

Attribute Learning and Model Scaling

Readability control: All tested LLMs (including 1–3B models) successfully learn to target absolute readability levels (FKGL, ARI, Dale-Chall) with low MAE when sufficient variation is encoded in the training corpus.
Compression control: Fails to generalize when source–simplification pairs exhibit little length variation (especially in sentence-level datasets). Document-level training signals (as in NEWSELA) are more informative.
Scaling laws: Larger models (up to 14B) yield non-monotonic improvements. Some smaller Qwen and Llama checkpoints match or outperform larger ones in controllability; performance gains plateau without further data signal.

Metric Divergence and Control

Traditional metrics (BLEU, SARI, LENS) do not directly measure compliance to control targets, often rewarding copying or near-paraphrases regardless of attribute deviation. The authors demonstrate strong negative correlation between MAE and SARI/LENS, indicating the necessity for targeted, error-based evaluation in CATS.

Representation Across Models/Domains

Model family and domain effects are non-trivial: Qwen models achieve lowest MAE with strong cross-domain stability, while Mistral and Llama yield higher SARI on certain datasets. Performance differences are mostly attributable to dataset characteristics, not architecture or size.

Implications and Future Directions

Data Cruciality: CATS effectiveness is strongly data-bound. The attribute variation encoded in the train set defines the possible control range; limited compression or readability shifts handicap controllability regardless of model scale or architecture.

Metric Design: There is a critical need for control-specific, error-based automatic metrics in ATS to replace or complement proxy metrics such as BLEU or SARI, as traditional simplification metrics do not penalize deviation from the desired attribute.

Model Selection: The competitive performance of 1–3B parameter models signals that architectural and scale choices can be subordinate to model adaptability to data and fine-tuning pipeline robustness—a key implication for resource-constrained deployment scenarios.

Pipeline and Evaluation Standardization: Uniform, model-native chat templates and stratified sampling/study design are necessary to ensure reproducibility and comparability across CATS systems. Reporting split representativeness via distributional divergence is best practice.

Future Directions:

Construction of attribute-diverse, sentence-aligned corpora and multi-domain training resources.
Extension to multilingual and stylistically diverse simplification.
Integration of human-centric controllability evaluation for end-user-facing explainable simplification systems.
Advanced normalization/cross-attribute comparison protocols for multifactor control.

Conclusion

This work presents strong empirical evidence that instruction fine-tuning with discrete control tokens is a potent and flexible paradigm for CATS, provided the training corpus contains sufficient and balanced attribute signal. MAE-based controllability should be a standard metric for measuring system compliance with user instructions, and data-centric evaluation and split design are as significant as modeling innovations. Further evolution in controllable ATS will require simultaneous advances in data quality, attribute diversity, and error-sensitive automatic evaluation (2604.01779).

Markdown Report Issue