U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga

2024-12-05

Summary

This paper introduces U-MATH, a new benchmark designed to evaluate the mathematical skills of large language models (LLMs) using university-level problems.

What's the problem?

Existing benchmarks for testing LLMs' math skills are limited: they are often relatively small, focus mostly on elementary- and high-school-level problems, cover a narrow range of topics, and rarely include visual elements such as graphs or diagrams, which matter for real-world applications. As a result, LLMs may not be accurately assessed on their ability to handle more complex, university-level math problems.

What's the solution?

To address these issues, U-MATH provides 1,100 open-ended, university-level math problems balanced across six core subjects, with 20% of them being multimodal problems that require visual understanding (for example, reading a graph or diagram). Because the problems are open-ended rather than multiple-choice, generated solutions are graded by an LLM acting as a judge, and a companion dataset, mu-MATH, is released to measure how well LLMs perform at this judging task. This comprehensive approach helps ensure that the evaluation is thorough and relevant to real-world scenarios.
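To make the judging step concrete, here is a minimal Python sketch of an LLM-as-judge evaluation loop of the kind the benchmark relies on. The `Problem` class, the `call_llm` stub, and the prompt wording are illustrative assumptions, not the paper's actual grading protocol; the real judge prompts and parsing rules ship with the benchmark itself.

```python
# Minimal sketch of an LLM-as-judge loop for open-ended math answers.
# `call_llm` is a hypothetical stand-in for whatever chat/completion API is used;
# the prompt and verdict parsing are illustrative, not U-MATH's actual protocol.
from dataclasses import dataclass


@dataclass
class Problem:
    statement: str      # the university-level question text
    golden_answer: str  # reference answer from the teaching materials


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError


def judge(problem: Problem, candidate_solution: str) -> bool:
    """Ask a judge LLM whether the candidate's final answer matches the reference."""
    prompt = (
        "You are grading a university-level math solution.\n"
        f"Problem: {problem.statement}\n"
        f"Reference answer: {problem.golden_answer}\n"
        f"Candidate solution: {candidate_solution}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict.startswith("CORRECT")


def accuracy(problems: list[Problem], solutions: list[str]) -> float:
    """Fraction of candidate solutions the judge accepts."""
    verdicts = [judge(p, s) for p, s in zip(problems, solutions)]
    return sum(verdicts) / len(verdicts)
```

Because the final score depends on this automated judge, mu-MATH exists precisely to check how reliable such judges are before trusting the benchmark numbers.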

Why it matters?

This research is significant because it fills a gap in the evaluation of LLMs' mathematical abilities, providing a more accurate assessment of their skills in handling complex problems. By focusing on university-level content and incorporating visual elements, U-MATH can help improve the development of AI systems that need to perform advanced mathematical reasoning, which is essential in fields like engineering, finance, and science.

Abstract

The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of the problems being multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release mu-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions. The evaluation of general domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, with an even lower 45% on visual problems. The solution assessment proves challenging for LLMs, with the best LLM judge having an F1-score of 80% on mu-MATH.
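For reference, the F1-score reported for judges on mu-MATH is the standard harmonic mean of precision and recall over accept/reject verdicts compared against gold labels. The sketch below shows that computation; treating "solution is correct" as the positive class, and the example numbers themselves, are assumptions for illustration, not the paper's data or averaging scheme.

```python
def judge_f1(gold: list[bool], predicted: list[bool]) -> float:
    """Binary F1 treating 'solution is correct' as the positive class.

    F1 = 2 * precision * recall / (precision + recall), where
    precision = TP / (TP + FP) and recall = TP / (TP + FN).
    """
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative example: a judge that accepts 4 of 5 truly correct solutions
# and wrongly accepts 1 of 2 incorrect ones.
gold      = [True, True, True, True, True, False, False]
predicted = [True, True, True, True, False, True, False]
print(round(judge_f1(gold, predicted), 3))  # precision 0.8, recall 0.8 -> F1 0.8
```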