OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?
Zhen Huang, Zengzhi Wang, Shijie Xia, Pengfei Liu
2024-06-25
Summary
This report evaluates and ranks the intelligence of various AI models using a new approach inspired by Olympic medal rankings. It focuses on recent models like Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o, assessing their performance across multiple disciplines.
What's the problem?
Determining which AI model is the most intelligent is difficult because different models have different strengths and weaknesses. Existing evaluation methods often lack a comprehensive and fair way to compare models across different subjects and tasks.
What's the solution?
The authors propose a new ranking system that uses an Olympic medal table format to compare AI models based on their results on the OlympicArena benchmark, which tests models across a wide range of subjects, including physics, chemistry, and biology. The findings show that Claude-3.5-Sonnet is highly competitive with GPT-4o overall and even surpasses it in a few subjects (physics, chemistry, and biology), while Gemini-1.5-Pro and GPT-4V rank just behind those two, though with a clear performance gap between them. The report also shows that open-source models generally perform worse than these proprietary models, indicating significant room for improvement. A simplified sketch of the medal-table ranking follows below.
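To make the medal-table idea concrete, here is a minimal Python sketch of how such a ranking could be computed from per-subject scores. The scores, subject list, and tie-breaking rule below are invented for illustration; they are not the paper's actual OlympicArena numbers, and the paper's exact medal-awarding rules may differ.

# Minimal sketch of an Olympic-style medal table for ranking AI models.
# All scores here are hypothetical placeholders, NOT results from the paper.

from collections import Counter

# Hypothetical per-subject accuracy scores: model -> {subject: score}.
scores = {
    "GPT-4o":            {"Math": 0.42, "Physics": 0.30, "Chemistry": 0.35, "Biology": 0.48},
    "Claude-3.5-Sonnet": {"Math": 0.40, "Physics": 0.33, "Chemistry": 0.37, "Biology": 0.50},
    "Gemini-1.5-Pro":    {"Math": 0.36, "Physics": 0.28, "Chemistry": 0.31, "Biology": 0.44},
    "GPT-4V":            {"Math": 0.33, "Physics": 0.25, "Chemistry": 0.29, "Biology": 0.41},
}

subjects = ["Math", "Physics", "Chemistry", "Biology"]
medals = {model: Counter() for model in scores}

# For each subject, award gold/silver/bronze to the top-3 scoring models.
for subject in subjects:
    ranked = sorted(scores, key=lambda m: scores[m][subject], reverse=True)
    for medal, model in zip(["gold", "silver", "bronze"], ranked):
        medals[model][medal] += 1

# Sort like an Olympic medal table: golds first, then silvers, then bronzes.
table = sorted(
    medals.items(),
    key=lambda kv: (kv[1]["gold"], kv[1]["silver"], kv[1]["bronze"]),
    reverse=True,
)

for rank, (model, m) in enumerate(table, start=1):
    print(f"{rank}. {model}: {m['gold']} gold, {m['silver']} silver, {m['bronze']} bronze")

Running this prints a medal table in which a model with more golds outranks one with more total medals, mirroring how Olympic medal tables are conventionally sorted.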
Why it matters?
This research is important because it provides a clearer way to evaluate and compare AI models, helping researchers and developers understand which models excel in specific areas. By identifying strengths and weaknesses, the ranking can guide future improvements in AI technology, even as the results show there is still a long way to go before superintelligent systems are achieved.
Abstract
In this report, we pose the following question: Who is the most intelligent AI model to date, as measured by the OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We specifically focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic medal table approach to rank AI models based on their comprehensive performance across various disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet shows highly competitive overall performance over GPT-4o, even surpassing GPT-4o on a few subjects (i.e., Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) The performance of AI models from the open-source community significantly lags behind that of these proprietary models. (4) The performance of these models on this benchmark has been less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the performance of the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).