- The paper demonstrates that instruction fine-tuning with discrete control tokens significantly improves controllability in automatic text simplification with low MAE on target metrics.
- The methodology leverages manually curated instruction prompts, stratified sampling based on readability and compression, and evaluation using metrics like FKGL, SARI, and COMET.
- Findings highlight that data quality and attribute diversity are critical for effective simplification, enabling even smaller models to achieve competitive performance.
Controllable Instruction Fine-Tuning for Automatic Text Simplification with CATS
Automatic Text Simplification (ATS) seeks to generate text that is easier to read while minimally altering semantic content. User-controllable simplification is critical to satisfy diverse readability needs; however, previous approaches typically conflate control with model decoding strategies and evaluate outputs with task-agnostic metrics, ignoring the necessity for explicit alignment between target attributes and system outputs. The "Taming CATS" paper "Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens" (2604.01779) systematically analyzes these deficiencies and introduces an instruction fine-tuning protocol using control tokens to directly steer open-source LLMs to specific attribute values.
Framework and Methodology
The proposed CATS framework encapsulates control via discrete tokens injected into the instruction prompt, guiding generation toward fixed target readability (FKGL, ARI, Dale-Chall) or length compression (character/word ratios). Data from four domains—medicine (MED-EASI), public administration (SIMPA), encyclopedic (WIKILARGE), and news (NEWSELA)—is harmonized and rigorously stratified to mitigate attribute distribution mismatch between train/validation/test splits.
The fine-tuning process involves:
- Manually curated instruction templates: To maximize prompt variability and reduce exposure bias.
- Control token embedding: Placed at the assistant's output onset, encoding the desired attribute value or compression goal.
- Model family/size sweep: Llama (1–13B), Mistral (3–7B), Qwen (1.7–14B).
- Split stratification: Employing FKGL or character count-based stratified sampling (with Kolmogorov-Smirnov divergence minimization) to maintain representativeness.
- Filtered monotonic datasets: Subsets wherein all simplifications are strictly simpler than their sources, to examine the impact of data cleaning on model performance.
Evaluation Protocol
Performance is evaluated over three axes:
- Controllability: Mean Absolute Error (MAE) between the generated output's attribute value and the target.
- Simplification Quality: SARI (edit-centric), LENS (learnable, human-aligned), and COMET (semantic adequacy).
- Similarity: BLEU and BERTScore to both source and reference.
Robust inference is ensured through multiple seeds and aggregated results to mitigate LLM stochasticity.
Main Results
Data Experiments
Partitioning by readability (FKGL) or character count consistently minimizes distribution divergence among splits, whereas native partitioning leads to non-representative validation/test sets. Over-cleaning data by enforcing monotonicity sometimes reduces model–target alignment, indicating that excessive pruning can harm the diversity needed for robust controllability.
Attribute Learning and Model Scaling
- Readability control: All tested LLMs (including 1–3B models) successfully learn to target absolute readability levels (FKGL, ARI, Dale-Chall) with low MAE when sufficient variation is encoded in the training corpus.
- Compression control: Fails to generalize when source–simplification pairs exhibit little length variation (especially in sentence-level datasets). Document-level training signals (as in NEWSELA) are more informative.
- Scaling laws: Larger models (up to 14B) yield non-monotonic improvements. Some smaller Qwen and Llama checkpoints match or outperform larger ones in controllability; performance gains plateau without further data signal.
Metric Divergence and Control
Traditional metrics (BLEU, SARI, LENS) do not directly measure compliance to control targets, often rewarding copying or near-paraphrases regardless of attribute deviation. The authors demonstrate strong negative correlation between MAE and SARI/LENS, indicating the necessity for targeted, error-based evaluation in CATS.
Representation Across Models/Domains
Model family and domain effects are non-trivial: Qwen models achieve lowest MAE with strong cross-domain stability, while Mistral and Llama yield higher SARI on certain datasets. Performance differences are mostly attributable to dataset characteristics, not architecture or size.
Implications and Future Directions
Data Cruciality: CATS effectiveness is strongly data-bound. The attribute variation encoded in the train set defines the possible control range; limited compression or readability shifts handicap controllability regardless of model scale or architecture.
Metric Design: There is a critical need for control-specific, error-based automatic metrics in ATS to replace or complement proxy metrics such as BLEU or SARI, as traditional simplification metrics do not penalize deviation from the desired attribute.
Model Selection: The competitive performance of 1–3B parameter models signals that architectural and scale choices can be subordinate to model adaptability to data and fine-tuning pipeline robustness—a key implication for resource-constrained deployment scenarios.
Pipeline and Evaluation Standardization: Uniform, model-native chat templates and stratified sampling/study design are necessary to ensure reproducibility and comparability across CATS systems. Reporting split representativeness via distributional divergence is best practice.
Future Directions:
- Construction of attribute-diverse, sentence-aligned corpora and multi-domain training resources.
- Extension to multilingual and stylistically diverse simplification.
- Integration of human-centric controllability evaluation for end-user-facing explainable simplification systems.
- Advanced normalization/cross-attribute comparison protocols for multifactor control.
Conclusion
This work presents strong empirical evidence that instruction fine-tuning with discrete control tokens is a potent and flexible paradigm for CATS, provided the training corpus contains sufficient and balanced attribute signal. MAE-based controllability should be a standard metric for measuring system compliance with user instructions, and data-centric evaluation and split design are as significant as modeling innovations. Further evolution in controllable ATS will require simultaneous advances in data quality, attribute diversity, and error-sensitive automatic evaluation (2604.01779).