Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou
2025-12-23
Summary
This paper investigates whether powerful AI models, like those used for chatbots, can accurately judge how hard a question or task will be for a human student, a judgment that is crucial for creating good tests and learning materials.
What's the problem?
Currently, figuring out how difficult a question is can be tough, especially for brand-new questions that no students have answered yet (the 'cold start problem'). While AI can *solve* problems really well, it's not clear whether it can understand *why* a problem is hard for a person, or predict where a student might struggle. The paper shows that simply making the AI bigger and better at solving problems doesn't automatically make it better at judging difficulty.
What's the solution?
Researchers tested over 20 different AI models on a wide range of subjects, such as medical knowledge and math, and compared the models' difficulty ratings to the difficulty that human learners actually experienced. They found that the models tended to agree with each other, forming a 'machine consensus', but this consensus did not reliably match human perceptions of difficulty. They also tried prompting the models to think like a student with limited knowledge, but even then the models often failed to estimate difficulty accurately and didn't seem to recognize their own limitations.
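To make the comparison concrete, here is a minimal sketch (not the paper's actual pipeline) of how model-estimated difficulties can be checked against human difficulty with a rank correlation; the example numbers and the 1-5 rating scale are illustrative assumptions.

```python
# Minimal sketch of human-AI difficulty alignment, assuming we already have:
#  - human_difficulty: empirical difficulty per item (e.g., share of students who got it wrong)
#  - model_difficulty: the model's difficulty rating for the same items
# Neither the data nor the rating scale is taken from the paper; both are illustrative.
from scipy.stats import spearmanr

# Hypothetical values for five items.
human_difficulty = [0.15, 0.40, 0.55, 0.70, 0.90]   # fraction of students answering incorrectly
model_difficulty = [2, 1, 4, 3, 5]                   # model's 1-5 difficulty rating per item

# Rank correlation ignores scale differences and only asks:
# does the model order the items the way students actually experience them?
rho, p_value = spearmanr(human_difficulty, model_difficulty)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A high rank correlation would mean the model orders items roughly as students experience them, even if its numeric scale differs; the paper's finding is that this alignment is often weak.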
Why it matters?
This research is important because it shows that being good at solving problems isn't the same as understanding how people learn and struggle. It suggests that we can't rely on current AI models to automatically create or assess educational materials, because they lack the ability to truly understand a student's cognitive process and predict where they'll have trouble.
Abstract
Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
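The proficiency-simulation setup mentioned in the abstract can be pictured with a small prompting sketch; the prompt wording, the proficiency labels, and the `call_llm` stub are illustrative stand-ins, not the paper's actual protocol.

```python
# Illustrative proficiency-simulation sketch. `call_llm` is a placeholder for
# whatever chat-completion client is in use; the prompt wording is not the paper's.

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; always answers "3" here so the sketch runs.
    return "3"

def build_difficulty_prompt(item_text: str, proficiency: str) -> str:
    # Ask the model to adopt a given proficiency level before rating difficulty.
    return (
        f"Imagine you are a {proficiency} student in this subject. "
        "Read the question below and, keeping your limited knowledge in mind, "
        "rate how difficult it would feel on a scale from 1 (very easy) to 5 (very hard). "
        "Answer with a single number.\n\n"
        f"Question: {item_text}"
    )

def estimate_difficulty(item_text: str, proficiency: str = "low-proficiency") -> int:
    reply = call_llm(build_difficulty_prompt(item_text, proficiency))
    return int(reply.strip()[0])  # naive parse of the leading digit

if __name__ == "__main__":
    print(estimate_difficulty("What is 12 x 13?", proficiency="beginner"))
```

The paper's observation is that ratings produced this way still track the model's own sense of difficulty rather than the struggles of a genuinely less capable student.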