The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou
2025-02-14
Summary
This paper examines whether large language models (LLMs) truly understand physical concepts or merely repeat patterns they've seen before. The researchers created a benchmark called PhysiCo that checks how well these models grasp physical ideas by using abstract grids instead of regular text.
What's the problem?
Even though LLMs can describe physical concepts in natural language, it's unclear whether they actually understand these ideas or are just mimicking patterns. This is important because understanding physical concepts is key for tasks like reasoning and problem-solving, but current tests often let models rely on memorization rather than real comprehension.
What's the solution?
The researchers designed PhysiCo, a benchmark that uses grid-based inputs to represent physical phenomena at different levels of understanding, from the core phenomenon to application examples and analogies. This format removes the chance for LLMs to rely on word matching or memorization and forces them to demonstrate real understanding. The researchers tested state-of-the-art models such as GPT-4o and found that while these models can describe the concepts well in language, they struggle with the grid-based tasks, lagging humans by roughly 40% and exposing a gap between fluent description and actual understanding.
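To make the grid-based format concrete, here is a minimal sketch of what such a question might look like. The grid contents, answer options, and prompt wording below are illustrative placeholders, not actual PhysiCo items or the paper's exact prompt format.

```python
# A hypothetical PhysiCo-style multiple-choice question: a pair of
# abstract grids depicts a physical concept, and the model must name it.

def grid_to_text(grid):
    """Serialize a 2D grid of cell values into plain text for an LLM prompt."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

# Toy "before/after" grids abstractly depicting gravity:
# filled cells (1) start suspended and end settled at the bottom.
before = [
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
after = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]

options = ["A. gravity", "B. reflection", "C. diffusion", "D. magnetism"]

prompt = (
    "The following pair of grids abstractly illustrates a physical concept.\n\n"
    f"Input grid:\n{grid_to_text(before)}\n\n"
    f"Output grid:\n{grid_to_text(after)}\n\n"
    "Which concept does the transformation illustrate?\n"
    + "\n".join(options)
    + "\nAnswer with a single letter."
)

print(prompt)  # send this to the model under evaluation; the expected answer is "A"
```

Because the concept is conveyed only through the abstract pattern, a model cannot answer by matching surface words from its training data; it has to recognize the underlying physical behavior.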
Why it matters?
This matters because it highlights the limits of current LLMs when it comes to truly understanding the world. By identifying these weaknesses, the research helps guide future improvements in AI, making it better at reasoning and solving complex problems. This could lead to smarter and more reliable AI systems for real-world applications.
Abstract
In a systematic way, we investigate a widely asked question: do LLMs really understand what they say? This question relates to the more familiar term Stochastic Parrot. To answer it, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the usage of grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon and application examples to analogies with other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 Flash Thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to its intrinsic difficulty rather than the unfamiliar grid format, as in-context learning and fine-tuning on data of the same format add little to their performance.