
Revisiting Generalization Across Difficulty Levels: It's Not So Easy

Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach

2025-11-27

Summary

This paper investigates how well large language models, like the ones powering chatbots, can handle tasks of varying difficulty, and whether training them on easy or hard examples is more effective.

What's the problem?

It's unclear whether training a language model on easier or harder tasks helps it perform better overall. Some studies point one way, others the opposite, and it's also unclear whether improvements gained from easy training data carry over to hard test tasks, or vice versa. Essentially, we don't know the best way to prepare these models for real-world challenges, which mix easy and hard problems.

What's the solution?

The researchers created a way to objectively measure the difficulty of individual examples within several datasets. Instead of relying on humans to judge difficulty, they scored each example using the performance of *many* different language models. A statistical method called Item Response Theory (IRT), originally developed in educational testing to assess how hard test questions are, then ranked the examples from easiest to hardest based on how well the models performed. This allowed them to analyze how well models generalize across different difficulty levels.
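To make the idea concrete, here is a minimal sketch of an IRT-style difficulty estimate. It is not the authors' actual pipeline (which fits IRT to the outputs of thousands of LLMs on six datasets); it fits a simple one-parameter (Rasch) model with plain numpy gradient ascent on a toy binary matrix of model-by-example correctness. All function names, hyperparameters, and the toy data are illustrative assumptions.

```python
import numpy as np

def fit_rasch(responses, n_steps=2000, lr=0.05):
    """Fit a 1-parameter (Rasch) IRT model by gradient ascent on the
    Bernoulli log-likelihood: P(correct) = sigmoid(ability - difficulty).

    responses: (n_models, n_examples) binary matrix, 1 if a model
    answered an example correctly, 0 otherwise.
    Returns per-model abilities and per-example difficulties
    (higher difficulty = harder example).
    """
    n_models, n_examples = responses.shape
    ability = np.zeros(n_models)       # theta_j, one per model
    difficulty = np.zeros(n_examples)  # b_i, one per example

    for _ in range(n_steps):
        logits = ability[:, None] - difficulty[None, :]
        p = 1.0 / (1.0 + np.exp(-logits))   # predicted P(correct)
        err = responses - p                 # gradient of the log-likelihood w.r.t. logits
        ability += lr * err.mean(axis=1)
        difficulty -= lr * err.mean(axis=0)
        difficulty -= difficulty.mean()     # center difficulties for identifiability

    return ability, difficulty

# Toy usage: 5 hypothetical "models" answering 8 "examples"
rng = np.random.default_rng(0)
responses = (rng.random((5, 8)) > 0.4).astype(float)
ability, difficulty = fit_rasch(responses)
print(np.argsort(difficulty))  # example indices ranked easiest -> hardest
```

The key design point mirrored here is that difficulty comes only from model behavior: an example is "hard" if even high-ability models tend to get it wrong, with no human judgment in the loop.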

Why it matters?

The findings show that simply training on easy or hard examples doesn't guarantee good performance across *all* difficulties. It's important to include a mix of easy, medium, and hard examples in both the training data used to build the model and the evaluation data used to test it. Ignoring the range of difficulty can lead to models that perform poorly on certain types of problems.
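One practical consequence is reporting evaluation results per difficulty band rather than as a single aggregate score. The snippet below is a hedged sketch (not from the paper) of how one might bin examples by an IRT difficulty score, such as the one fit above, and compute accuracy within each bin; the function name and bin count are assumptions for illustration.

```python
import numpy as np

def accuracy_by_difficulty(correct, difficulty, n_bins=3):
    """Accuracy within difficulty bins (e.g., easy / medium / hard)
    instead of one aggregate number.

    correct:    (n_examples,) binary vector of a model's per-example results.
    difficulty: (n_examples,) difficulty scores, e.g. from an IRT fit.
    """
    # Quantile edges split examples into roughly equal-sized bins.
    edges = np.quantile(difficulty, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.digitize(difficulty, edges[1:-1])  # 0 = easiest bin
    return {b: correct[bins == b].mean() for b in range(n_bins) if (bins == b).any()}

# Hypothetical usage with made-up scores
rng = np.random.default_rng(1)
difficulty = rng.normal(size=100)
correct = (rng.random(100) > 0.5).astype(float)
print(accuracy_by_difficulty(correct, difficulty))
```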

Abstract

We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.