
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles

Vernon Y. H. Toh, Yew Ken Chia, Deepanway Ghosal, Soujanya Poria

2025-02-04


Summary

This paper looks at how OpenAI's GPT-series and o-series models are getting better at reasoning, especially at solving puzzles that combine both images and text. It tracks how these models have improved across successive releases and compares their abilities to human-level thinking.

What's the problem?

While new AI models like o3 are very good at certain kinds of puzzles, they have mostly been tested on symbolic pattern problems, such as those in the ARC-AGI benchmark. In real life, though, we often need to understand pictures and words together to solve a problem. The researchers wanted to see how well these models handle more complex puzzles that combine visual and language information, which is closer to how humans reason.

What's the solution?

The researchers created a set of challenging puzzles that require understanding both images and text. They then tested successive versions of the GPT-series and o-series models on these puzzles, tracking how performance changed with each new release and comparing the models against one another. They also measured how much computing power each model needed to solve the puzzles.
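
To make the methodology concrete, here is a minimal sketch of what such an evaluation loop might look like. It is an illustration only, not the authors' actual harness (their code is in the linked repository); the puzzles.json file and its fields (image, question, answer), the exact-match scoring, and the specific model names in the loop are all assumptions for the example.

    # Illustrative only: a simplified multimodal puzzle evaluation loop.
    # Assumes a hypothetical puzzles.json of the form
    # [{"image": "puzzle_001.png", "question": "...", "answer": "..."}, ...]
    import base64
    import json

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def encode_image(path: str) -> str:
        """Base64-encode an image so it can be sent inline to the API."""
        with open(path, "rb") as f:
            return base64.b64encode(f.read()).decode("utf-8")

    def ask_model(model: str, image_path: str, question: str) -> str:
        """Send one image+text puzzle to a vision-capable model and return its answer."""
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
                ],
            }],
        )
        return response.choices[0].message.content.strip()

    # Track accuracy for each model generation to see how reasoning evolves.
    puzzles = json.load(open("puzzles.json"))
    for model in ["gpt-4-turbo", "gpt-4o", "o1"]:
        correct = sum(
            ask_model(model, p["image"], p["question"]).lower() == p["answer"].lower()
            for p in puzzles
        )
        print(f"{model}: {correct}/{len(puzzles)} puzzles solved")

In practice, the paper's evaluation also distinguishes abstract from algorithmic puzzles and accounts for the computational cost of each model, which a simple accuracy count like this does not capture.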

Why it matters?

This research matters because it helps us understand how close AI is getting to human-level reasoning. By testing AI on more complex, realistic problems, we can see where these models excel and where they still fall short. This information is crucial for developing better AI systems that can help with tasks requiring both visual and language understanding, such as in education, healthcare, or creative fields. It also raises important questions about efficiency, since the best-performing models currently require a lot of computing power.

Abstract

The releases of OpenAI's o1 and o3 mark a significant paradigm shift in Large Language Models towards advanced reasoning capabilities. Notably, o3 outperformed humans in novel problem-solving and skill acquisition on the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns, whereas humans often perceive and reason about multimodal scenarios involving both vision and language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models on challenging multimodal puzzles, requiring fine-grained visual perception with abstract or algorithmic reasoning. The superior performance of o1 comes at nearly 750 times the computational cost of GPT-4o, raising concerns about its efficiency. Our results reveal a clear upward trend in reasoning capabilities across model iterations, with notable performance jumps across GPT-series models and subsequently to o1. Nonetheless, we observe that the o1 model still struggles with simple multimodal puzzles requiring abstract reasoning. Furthermore, its performance in algorithmic puzzles remains poor. We plan to continuously track new models in the series and update our results in this paper accordingly. All resources used in this evaluation are openly available at https://github.com/declare-lab/LLM-PuzzleTest.