- The paper demonstrates that pretraining language models on simplified texts reduces perplexity degradation, especially for smaller models.
- It employs Llama 3.1 to generate simplified versions of human-written texts while preserving core semantic content, enabling a controlled comparison.
- The findings suggest that text complexity impacts zero-shot and fine-tuned task performance differently, informing data curation strategies.
Rethinking the Role of Text Complexity in LLM Pretraining
Introduction and Research Questions
The importance of data quality and size in model pretraining is well recognized, yet the impact of text complexity remains underexplored. This paper investigates whether pretrained models require complex texts to learn effective representations or whether simpler texts suffice. The authors ask three primary questions: how text complexity affects language modeling across model sizes, whether useful representations can be learned from simpler texts, and how text complexity influences downstream performance on language understanding tasks. To address these questions, human-written texts were simplified using an LLM, after which models of varying sizes (28M–500M parameters) were pretrained independently on the original and simplified datasets.
Methodology
The simplified corpus was created by prompting Llama 3.1 to generate more straightforward versions of human-written texts while preserving their core content. Corpus statistics show that the simplified texts have lower lexical diversity and syntactic complexity while remaining semantically similar to the originals (Figure 1).
Figure 1: Distributions of corpus features show that simplified texts are shorter, use simpler structures, and remain semantically similar to the originals, reflecting systematic differences in readability.
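The paper does not publish its exact simplification prompt or decoding settings, so the following is only a minimal sketch of how such a corpus could be produced with an instruction-tuned Llama 3.1 checkpoint through the Hugging Face transformers pipeline; the model id, prompt wording, and generation parameters are assumptions, not the authors' setup.

```python
from transformers import pipeline

# Assumed checkpoint: the paper only says "Llama 3.1"; any instruct variant would do here.
simplifier = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
)

# Hypothetical prompt: the wording actually used by the authors is not given.
PROMPT = (
    "Rewrite the following passage using simpler words and shorter sentences, "
    "keeping every fact and the original meaning:\n\n{passage}\n\nSimplified version:"
)

def simplify(passage: str) -> str:
    """Return a simplified rewrite of `passage` that aims to preserve its content."""
    out = simplifier(
        PROMPT.format(passage=passage),
        max_new_tokens=512,
        do_sample=False,          # greedy decoding keeps rewrites deterministic
        return_full_text=False,   # return only the generated continuation, not the prompt
    )
    return out[0]["generated_text"].strip()

original = "The committee deferred its verdict pending further deliberation."
print(simplify(original))
```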
Models were pretrained from scratch on each corpus and then evaluated through fine-tuning and zero-shot testing. Fine-tuning used tasks from GLUE and SuperGLUE, while zero-shot evaluation covered linguistic knowledge, entity tracking, and commonsense reasoning.
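The paper does not detail its zero-shot protocol, but linguistic-knowledge benchmarks are commonly scored by comparing the log-likelihood a model assigns to the acceptable versus the unacceptable sentence of a minimal pair. The sketch below assumes a causal LM checkpoint produced by the pretraining runs; the checkpoint path and the example pair are placeholders, not the paper's evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: stands in for one of the pretrained checkpoints (28M-500M parameters).
model = AutoModelForCausalLM.from_pretrained("path/to/pretrained-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/pretrained-checkpoint")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to a sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, the returned loss is the mean per-token negative log-likelihood.
        mean_nll = model(ids, labels=ids).loss.item()
    return -mean_nll * (ids.shape[1] - 1)  # undo the averaging over predicted tokens

# An illustrative minimal pair in the style of linguistic-acceptability benchmarks.
good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print("model prefers the grammatical sentence:", sentence_logprob(good) > sentence_logprob(bad))
```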
Results
The results show that language modeling performance is sensitive to the interaction between model capacity and text complexity: smaller models suffered less perplexity degradation when trained on simplified texts, suggesting that they benefit from the reduced complexity (Figure 2).
Figure 2: As model size decreases, perplexity degrades more sharply for models trained on human-written texts than for those trained on simplified texts, indicating that smaller models handle lower-complexity data better.
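For reference, the perplexity reported in Figure 2 is the exponential of the average per-token negative log-likelihood on held-out text. Below is a minimal sketch of that computation for a causal LM; the checkpoint path and the tiny held-out sample are placeholders, not the paper's evaluation data.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: any of the pretrained checkpoints compared in Figure 2.
model = AutoModelForCausalLM.from_pretrained("path/to/pretrained-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/pretrained-checkpoint")
model.eval()

def perplexity(texts: list[str]) -> float:
    """Corpus perplexity = exp(total NLL / total number of predicted tokens)."""
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        n_predicted = ids.shape[1] - 1          # every token after the first is predicted
        with torch.no_grad():
            mean_nll = model(ids, labels=ids).loss.item()
        total_nll += mean_nll * n_predicted
        total_tokens += n_predicted
    return math.exp(total_nll / total_tokens)

held_out = ["A short held-out passage.", "Another evaluation sentence."]
print(f"perplexity: {perplexity(held_out):.2f}")
```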
For fine-tuned language understanding tasks, text complexity had minimal impact: models trained on human-written and simplified texts achieved comparable results. In zero-shot settings, however, differences emerged: simpler texts improved performance on tasks centered on linguistic knowledge, while more complex texts favored tasks requiring world knowledge and entity tracking.
Implications and Future Directions
These findings imply that the benefits of data diversity vary with the task, depending on whether it draws primarily on linguistic form or on domain and world knowledge. For practitioners, curation strategies should therefore align with the intended application: models destined for fine-tuning can tolerate reduced surface variation without detriment, whereas zero-shot use may require it. In synthetic data generation, variety must extend beyond surface-level differences to include topical and knowledge diversity so that models generalize across tasks.
Future research should explore optimizing dataset diversity along multiple dimensions, balancing computational efficiency against performance gains across a broad range of tasks. Further studies should also investigate simplifying large-scale datasets in ways that carefully preserve semantic content while remaining computationally efficient.
Conclusion
The paper revisits the role of text complexity in LLM pretraining and concludes that surface-level complexity has minimal influence on downstream task performance, particularly when evaluated through traditional NLP benchmarks such as GLUE and SuperGLUE after fine-tuning. Simpler text benefits smaller models in terms of efficiency and, in zero-shot settings, can improve performance on linguistic knowledge tasks, whereas more complex text remains important for tasks demanding broad world knowledge and sophisticated reasoning. The paper thus offers new guidance on when and how to adjust data complexity relative to model capacity and task requirements.