
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging

Ahmed Elhady, Eneko Agirre, Mikel Artetxe

2025-02-26

Summary

This paper introduces WiCkeD, a new method that makes multiple-choice benchmarks harder for AI models by replacing one of the answer options with 'None of the above', forcing the model to reason rather than pattern-match.

What's the problem?

AI models that answer multiple-choice questions often rely on tricks or patterns in the data rather than truly understanding the questions. Current tests don't always challenge the AI's reasoning abilities enough, making it hard to measure how well these models actually understand the material.

What's the solution?

The researchers created WiCkeD, which randomly replaces one of the multiple-choice options with 'None of the above.' If the correct answer happens to be the one replaced, 'None of the above' becomes the right choice. This makes tests more challenging and reveals how well AI models handle the extra reasoning required. The authors applied WiCkeD to six popular benchmarks and evaluated 18 open-weight LLMs, finding that performance dropped by 12.1 points on average compared to the original datasets, showing that the approach effectively challenges the models.
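The core transformation is simple enough to sketch in a few lines. Below is a minimal illustration in Python, assuming questions come as a list of choice strings plus a 0-based index of the correct answer; this is a hypothetical reconstruction of the idea, not the authors' released code (see their repository for the actual implementation).

```python
import random

def wicked(choices, answer_idx, rng=random):
    """Sketch of the WiCkeD transformation: randomly replace one
    multiple-choice option with 'None of the above'.

    If the replaced option was the correct answer, 'None of the
    above' becomes the new correct answer; otherwise the answer
    index is unchanged.
    """
    choices = list(choices)                  # don't mutate the caller's list
    replaced = rng.randrange(len(choices))   # pick a random option to replace
    choices[replaced] = "None of the above"
    new_answer = answer_idx                  # unchanged unless the answer was replaced
    if replaced == answer_idx:
        new_answer = replaced                # 'None of the above' is now correct
    return choices, new_answer
```

For example, with choices `["Paris", "Rome", "Berlin", "Madrid"]` and correct index 0, replacing option 0 yields `["None of the above", "Rome", "Berlin", "Madrid"]` with 'None of the above' as the correct answer, while replacing any other option leaves index 0 correct.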

Why it matters?

This matters because it helps researchers better evaluate how smart AI models really are by testing their reasoning skills more thoroughly. It also highlights weaknesses in current AI systems, pushing for improvements that could make them more reliable and capable of handling complex tasks in the future.

Abstract

We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a method often used in educational tests. We show that WiCkeD can be automatically applied to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. The performance of the models drops 12.1 points on average with respect to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also uncovers that some models are more sensitive to the extra reasoning required, providing additional information with respect to the original benchmarks. We release our code and data at https://github.com/ahmedselhady/wicked-benchmarks.