
Law of the Weakest Link: Cross Capabilities of Large Language Models

Ming Zhong, Aston Zhang, Xuewei Wang, Rui Hou, Wenhan Xiong, Chenguang Zhu, Zhengxing Chen, Liang Tan, Chloe Bi, Mike Lewis, Sravya Popuri, Sharan Narang, Melanie Kambadur, Dhruv Mahajan, Sergey Edunov, Jiawei Han, Laurens van der Maaten

2024-10-02


Summary

This paper examines the 'Law of the Weakest Link' in Large Language Models (LLMs): how these models perform when a task requires combining multiple skills at once, as most real-world tasks do.

What's the problem?

Most research on LLMs evaluates individual skills, such as reasoning or coding, in isolation, without testing how those skills work together. This gap matters because real-world tasks typically demand several abilities at once, so a model that looks strong on each skill separately may still struggle when it has to combine them.

What's the solution?

To explore this issue, the authors defined seven core individual capabilities and paired them to form seven common 'cross capabilities.' They then built a benchmark called CrossEval, comprising 1,400 human-annotated prompts (100 for each individual and cross capability). Expert annotators assessed 4,200 model responses, producing 8,400 human ratings with detailed explanations, and the results showed that LLMs often underperform on cross-capability tasks because the weaker of the two skills drags down overall performance.
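
To make this 'weakest link' pattern concrete, here is a minimal Python sketch of the comparison the authors describe. The scores and capability names below are made up for illustration; CrossEval's actual scores and taxonomy are in the paper.

    # Hypothetical scores on a 1-5 scale; not data from the paper.
    def classify(strong: float, weak: float, cross: float) -> str:
        """Place a cross-capability score relative to its two
        individual components. Assumes strong >= weak."""
        if cross < min(strong, weak):
            return "below both individual capabilities"
        if cross <= (strong + weak) / 2:
            return "between the two, but closer to the weaker one"
        return "closer to the stronger capability"

    # Example: a model that is strong at coding but weaker at tool use.
    coding, tool_use = 4.2, 3.1       # individual capability scores
    coding_and_tool_use = 3.0         # score on the paired cross capability
    print(classify(coding, tool_use, coding_and_tool_use))
    # -> below both individual capabilities

In the paper's terms, the first two outcomes are the common ones: most cross-capability scores land below, or only slightly above, the weaker individual skill.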

Why it matters?

This research is significant because it highlights the need to improve the weakest skills in LLMs to enhance their ability to handle complex tasks. Understanding and addressing these weaknesses can lead to better performance in applications that require multiple skills, making LLMs more effective and reliable in various real-world scenarios.

Abstract

The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that are often required for real-world tasks, which we term cross capabilities. To systematically explore this concept, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy. Building on these definitions, we introduce CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100 prompts for each individual and cross capability. To ensure reliable evaluation, we involve expert annotators to assess 4,200 model responses, gathering 8,400 human ratings with detailed explanations to serve as reference examples. Our findings reveal that, in both static evaluations and attempts to enhance specific abilities, current LLMs consistently exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component. Specifically, across 58 cross-capability scores from 17 models, 38 scores are lower than all individual capabilities, while 20 fall between strong and weak, but closer to the weaker ability. These results highlight the under-performance of LLMs in cross-capability tasks, making the identification and improvement of the weakest capabilities a critical priority for future research to optimize performance in complex, multi-dimensional scenarios.
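
Stated compactly (our notation, not the paper's), the pattern the abstract describes is:

    \[
      S_{A \times B} \;\lesssim\; \min(S_A,\, S_B)
    \]

where \(S_A\) and \(S_B\) are a model's scores on two individual capabilities and \(S_{A \times B}\) is its score on the corresponding cross capability; in most of the 58 measured cases, the cross score falls at or below the weaker of the two.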