The Impossible Test: A 2024 Unsolvable Dataset and A Chance for an AGI Quiz
David Noever, Forrest McKee
2024-11-25

Summary
This paper introduces 'The Impossible Test,' a new evaluation framework designed to assess how well large language models (LLMs) can recognize when they don't know the answer to difficult questions.
What's the problem?
Many LLMs can generate answers that sound plausible, but they often fail to admit when they genuinely don't know something, especially on complex or unsolvable problems. This produces misleading or incorrect responses, which is a significant obstacle to evaluating their intelligence and reliability.
What's the solution?
The researchers created a dataset of 675 questions that are fundamentally unsolvable: the correct response is to acknowledge either that humans do not know the answer or that the problem is currently impossible to solve. They tested twelve state-of-the-art LLMs on these questions to measure how often each would admit ignorance rather than supply a confident but incorrect answer. While some models did reasonably well at acknowledging their limitations, performance varied notably with the difficulty of the questions and the subject matter.
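To make the evaluation setup concrete, the sketch below shows one way such responses could be scored for admitting ignorance. The phrase list, data format, and function names are illustrative assumptions, not the authors' actual grading procedure.

```python
# Minimal sketch of scoring whether a model admits ignorance on an
# unsolvable question. Phrase list, dataclass fields, and function
# names are assumptions for illustration, not the paper's code.
from dataclasses import dataclass

# Phrases treated as an admission that the answer is unknown or unsolved.
UNCERTAINTY_MARKERS = (
    "unknown", "unsolved", "no one knows", "remains an open problem",
    "cannot be determined", "i don't know", "currently impossible",
)

@dataclass
class Item:
    question: str     # a grand-challenge question with no known answer
    category: str     # e.g. "mathematics", "philosophy", "invention"
    difficulty: str   # e.g. "easy", "hard"

def admits_ignorance(response: str) -> bool:
    """Return True if the response acknowledges the answer is unknown."""
    text = response.lower()
    return any(marker in text for marker in UNCERTAINTY_MARKERS)

def accuracy(responses: list[str]) -> float:
    """Fraction of responses that admit ignorance (the 'correct' answer here)."""
    if not responses:
        return 0.0
    return sum(admits_ignorance(r) for r in responses) / len(responses)
```

In practice a keyword filter like this would likely be supplemented by an LLM-based or human judge, since models can hedge in many phrasings.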
Why it matters?
This research is important because it highlights a critical aspect of evaluating artificial intelligence: a model's ability to recognize the limits of its own knowledge. By focusing on how models handle uncertainty, the study contributes to the ongoing discussion about artificial general intelligence (AGI) and suggests that future AI systems could be improved by training them to recognize when they lack the information needed to answer.
Abstract
This research introduces a novel evaluation framework designed to assess large language models' (LLMs) ability to acknowledge uncertainty on 675 fundamentally unsolvable problems. Using a curated dataset of graduate-level grand challenge questions with intentionally unknowable answers, we evaluated twelve state-of-the-art LLMs, including both open and closed-source models, on their propensity to admit ignorance rather than generate plausible but incorrect responses. The best models achieved accuracies in the 62-68% range for admitting that a problem's solution was unknown, across fields ranging from biology to philosophy and mathematics. We observed an inverse relationship between problem difficulty and model accuracy, with GPT-4 demonstrating higher rates of uncertainty acknowledgment on more challenging problems (35.8%) compared to simpler ones (20.0%). This pattern indicates that models may be more prone to generate speculative answers when problems appear more tractable. The study also revealed significant variations across problem categories, with models showing difficulty in acknowledging uncertainty on invention and NP-hard problems while performing relatively better on philosophical and psychological challenges. These results contribute to the growing body of research on artificial general intelligence (AGI) assessment by highlighting the importance of uncertainty recognition as a critical component of future machine intelligence evaluation. This impossibility test thus extends previous theoretical frameworks for universal intelligence testing by providing empirical evidence of current limitations in LLMs' ability to recognize their own knowledge boundaries, suggesting new directions for improving model training architectures and evaluation approaches.
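As a rough illustration of the kind of breakdown behind the per-difficulty and per-category figures reported above (e.g., 35.8% on harder problems vs. 20.0% on simpler ones), the sketch below tabulates acknowledgment rates by group. The record field names are hypothetical and chosen only for this example.

```python
# Sketch of tabulating uncertainty-acknowledgment rates by difficulty and
# category. Field names ('difficulty', 'category', 'admitted') are
# assumptions for illustration, not the paper's data schema.
from collections import defaultdict

def breakdown(records):
    """records: iterable of dicts with 'difficulty', 'category', 'admitted' (bool)."""
    by_key = defaultdict(lambda: [0, 0])  # key -> [admitted_count, total_count]
    for rec in records:
        for key in (("difficulty", rec["difficulty"]), ("category", rec["category"])):
            counts = by_key[key]
            counts[0] += int(rec["admitted"])
            counts[1] += 1
    return {key: admitted / total for key, (admitted, total) in by_key.items()}

# Example usage:
# breakdown([
#     {"difficulty": "hard", "category": "mathematics", "admitted": True},
#     {"difficulty": "easy", "category": "invention", "admitted": False},
# ])
# -> {("difficulty", "hard"): 1.0, ("difficulty", "easy"): 0.0,
#     ("category", "mathematics"): 1.0, ("category", "invention"): 0.0}
```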