Relationship Between Task Difficulty and Generalization Performance in LLMs

Determine the relationship between task difficulty and generalization performance in large language models, specifically assessing whether training on easier tasks leads to improved performance on harder tasks and whether training on harder tasks improves performance on easier tasks across evaluation benchmarks.

Background

Multiple recent studies report contradictory findings regarding whether easy-to-hard or hard-to-easy generalization holds for LLMs. Some works claim easy training data suffices for hard tasks, others find hard examples are most beneficial, and still others argue matching train-test difficulty yields the best results. This inconsistency motivates a clearer characterization of how performance varies with task difficulty.

The paper proposes measuring difficulty via Item Response Theory (IRT), fitted to responses from thousands of models, in order to avoid human-centric difficulty metrics that may misrepresent what is difficult for LLMs. Even with this analysis, the authors acknowledge that the broader relationship between difficulty and generalization remains an open question.
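To make the idea of model-derived difficulty concrete, here is a minimal sketch of how item difficulties can be estimated from a binary response matrix. It assumes the one-parameter (Rasch) IRT model, in which the probability of a correct answer is sigmoid(ability - difficulty); the summary above does not say which IRT variant the paper uses, so this choice is an assumption. The function name fit_rasch, the simulated data, and the hyperparameters are all hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fit_rasch(responses, steps=200, lr=0.5):
    """Fit a Rasch (1PL) model by gradient ascent on the log-likelihood.

    responses[i][j] is 1 if model i answered item j correctly, else 0.
    Returns (ability per model, difficulty per item).
    This is an illustrative sketch, not the paper's fitting procedure.
    """
    n_models = len(responses)
    n_items = len(responses[0])
    ability = [0.0] * n_models
    difficulty = [0.0] * n_items
    for _ in range(steps):
        grad_a = [0.0] * n_models
        grad_d = [0.0] * n_items
        for i in range(n_models):
            for j in range(n_items):
                # Residual between the observed response and predicted probability.
                resid = responses[i][j] - sigmoid(ability[i] - difficulty[j])
                grad_a[i] += resid
                grad_d[j] -= resid
        ability = [a + lr * g / n_items for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g / n_models for d, g in zip(difficulty, grad_d)]
        # The model is only identified up to a shared shift, so center
        # difficulties at zero and shift abilities by the same amount.
        mean_d = sum(difficulty) / n_items
        difficulty = [d - mean_d for d in difficulty]
        ability = [a - mean_d for a in ability]
    return ability, difficulty


# Simulated data (assumption): 200 models, 12 items ordered from easy to hard.
rng = random.Random(0)
true_difficulty = [-2.0 + 4.0 * j / 11 for j in range(12)]
true_ability = [rng.gauss(0.0, 1.0) for _ in range(200)]
responses = [
    [1 if rng.random() < sigmoid(a - d) else 0 for d in true_difficulty]
    for a in true_ability
]

est_ability, est_difficulty = fit_rasch(responses)
```

In this setup, difficulty is defined entirely by how the population of models responds, with no human labels involved: an item that few models answer correctly receives a high estimated difficulty regardless of how hard a person would judge it to be.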

References

As shown in Table \ref{tab:difficulty_tension}, despite ongoing research in this area, the relationship between generalization performance and task difficulty remains an open question.

— Revisiting Generalization Across Difficulty Levels: It's Not So Easy (2511.21692 - Kordi et al., 26 Nov 2025) in Section 1, Introduction
