
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Nishant Balepur, Rachel Rudinger, Jordan Lee Boyd-Graber

2025-02-20


Summary

This paper looks at the problems with using multiple-choice questions to test large language models (LLMs) and suggests ways to make these tests better. It's like looking at how we test students and realizing that just giving them multiple-choice exams might not show everything they know or can do.

What's the problem?

Multiple-choice questions are easy to use for testing AI, but they have big problems. They don't test whether the AI can write its own answers, they don't match how people actually use AI in real life, and they may not fully check what the AI really knows. On top of that, today's multiple-choice test sets have their own issues: questions whose answers leak into the AI's training data, questions that can't actually be answered, questions that can be solved through shortcuts without really understanding the topic, and questions that have become too easy to tell models apart.

What's the solution?

The researchers suggest tests where the AI has to write out full answers and explain its thinking, much like asking students to write essays instead of filling in bubbles. For the cases where multiple choice still makes sense, they borrow fixes from educational testing: rubrics to guide how questions are written, scoring methods that discourage guessing, and Item Response Theory to build harder questions.
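To make the "discourage guessing" idea concrete, one classic scoring method from educational testing is formula scoring, which subtracts a fraction of a point for each wrong answer so that random guessing averages out to roughly zero. The paper points to scoring fixes from education in general; the sketch below (function name and numbers included) is only an illustration, not the authors' exact method.

```python
def formula_score(num_correct, num_wrong, num_options):
    """Classic corrected-for-guessing score from educational testing.

    Each wrong answer costs 1/(k-1) points on a k-option question,
    so a test-taker who guesses uniformly at random scores ~0 on average.
    """
    penalty = 1.0 / (num_options - 1)
    return num_correct - penalty * num_wrong


# Example: 60 correct and 40 wrong on 100 four-option questions.
# Raw accuracy would be 0.60; the guess-corrected score is lower.
score = formula_score(num_correct=60, num_wrong=40, num_options=4)
print(score / 100)  # 0.4667 -> 60 - (1/3)*40 = 46.67 points out of 100
```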

Why it matters?

This matters because as AI gets smarter and more important in our lives, we need to be sure we're testing it properly. If we don't test AI well, we might think it's smarter than it really is, or we might miss important things it can't do. By making our tests better, we can create AI that's more useful and trustworthy for real-world tasks. It's about making sure our AI helpers are as smart and capable as we think they are.

Abstract

Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, where LLMs construct and explain answers, better capturing user needs and knowledge while remaining easy to score. We then show even when MCQA is a useful format, its datasets suffer from: leakage; unanswerability; shortcuts; and saturation. In each issue, we give fixes from education, like rubrics to guide MCQ writing; scoring methods to bridle guessing; and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful explanations), showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.
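The abstract also names Item Response Theory (IRT) as a tool for building harder MCQs. At its core, IRT models the probability that a test-taker of a given ability answers an item correctly, so items that strong test-takers still miss can be identified as genuinely hard. Below is a minimal sketch of the standard two-parameter logistic (2PL) model; the parameter values are made up for illustration and are not from the paper.

```python
import math

def p_correct_2pl(ability, difficulty, discrimination):
    """Two-parameter logistic (2PL) IRT model: probability of a correct response.

    ability        - latent skill of the test-taker (theta)
    difficulty     - item difficulty (b); higher means harder
    discrimination - item discrimination (a); how sharply the item
                     separates strong from weak test-takers
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Illustrative values: an item with difficulty 1.5 is answered correctly
# only ~38% of the time by a test-taker of ability 1.0, so it would be
# a candidate "harder" MCQ under this model.
print(p_correct_2pl(ability=1.0, difficulty=1.5, discrimination=1.0))
```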