How to Get Your LLM to Generate Challenging Problems for Evaluation

Arkil Patel, Siva Reddy, Dzmitry Bahdanau

2025-02-21

Summary

This paper introduces CHASE, a new system that uses AI to create tough problems for testing other AI models. It's like teaching one computer to come up with really hard questions to see how smart another computer is.

What's the problem?

As AI models get smarter, it’s becoming harder and more expensive for humans to create challenging problems to test them. Traditional methods of writing these problems take a lot of time and effort, especially when the questions need to be very detailed and high-quality.

What's the solution?

The researchers created CHASE, a system that uses AI itself to build difficult problems for testing. It works step by step, starting with simple ideas and combining them into harder questions. The process is broken down into smaller tasks that can be checked to make sure the problems are correct. They used CHASE to make tests in areas like answering questions based on documents, completing code, and solving math problems.

Why it matters?

This matters because it helps us keep up with the rapid improvement of AI by providing a way to test its abilities without relying on humans to write all the questions. The problems created by CHASE are tough enough to challenge even advanced AI models, which is important for understanding their strengths and weaknesses. This could lead to better ways of improving AI and making sure it’s ready for real-world tasks.

Abstract

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.