TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
Qingchen Yu, Shichao Song, Ke Fang, Yunfeng Shi, Zifan Zheng, Hanyu Wang, Simin Niu, Zhiyu Li
2024-10-08

Summary
This paper introduces TurtleBench, a new benchmark designed to evaluate how well large language models (LLMs) can solve real-world yes/no puzzles, focusing on their reasoning abilities rather than just knowledge recall.
What's the problem?
Existing methods for evaluating LLMs often use static datasets that don't accurately reflect how these models perform in dynamic, real-world situations. Many benchmarks focus on whether a model can recall facts instead of testing its ability to reason logically. This makes it hard to understand how well these models can actually think through problems and make decisions based on complex information.
What's the solution?
To solve this issue, the authors developed TurtleBench, which collects real user guesses from an online Turtle Soup Puzzle game that they built and run. In the game, players read a short puzzle story and submit yes/no guesses about its hidden explanation; deciding whether each guess is correct requires logical reasoning over the story rather than background knowledge. Because the data comes from real users and can be refreshed over time, TurtleBench yields a dynamic evaluation dataset (1,532 annotated guesses) that stays relevant and challenging while reducing the risk that models have simply memorized the answers. The authors evaluated nine advanced LLMs on this benchmark and found that even the best-performing models struggled with many of the guesses, and that OpenAI's o1 series did not take the lead, indicating that there is still room for improvement in reasoning capabilities. A minimal sketch of how such a yes/no judgment could be scored appears below.
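To make the evaluation setup concrete, here is a minimal sketch of a TurtleBench-style scoring loop. Everything in it is an illustrative assumption rather than the authors' actual code: the dataset field names, the prompt wording, and the `query_model` helper are placeholders for whatever model client is being tested.

```python
# Minimal sketch of a TurtleBench-style scoring loop.
# Assumptions for illustration only: the field names, prompt wording, and
# query_model() helper are placeholders, not the authors' implementation.

def query_model(prompt: str) -> str:
    """Placeholder for a call to whichever LLM is being evaluated."""
    raise NotImplementedError("plug in your model/API client here")


def judge_guess(surface: str, bottom: str, guess: str) -> str:
    """Ask the model to judge one player guess given the full puzzle."""
    prompt = (
        "You are judging a Turtle Soup (lateral-thinking) puzzle.\n"
        f"Puzzle surface: {surface}\n"
        f"Hidden solution: {bottom}\n"
        f"Player guess: {guess}\n"
        "Is the guess correct? Answer with exactly one word: Correct or Incorrect."
    )
    return query_model(prompt).strip()


def accuracy(entries: list[dict]) -> float:
    """Fraction of guesses where the model's verdict matches the human label."""
    hits = sum(
        judge_guess(e["surface"], e["bottom"], e["guess"]).lower() == e["label"].lower()
        for e in entries
    )
    return hits / len(entries)
```

A real harness would also need to handle malformed model outputs and report results per model, but the core task is just this binary judgment against the human annotation.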
Why it matters?
This research is important because it highlights the need for better evaluation methods for language models that reflect their performance in real-world scenarios. By focusing on reasoning rather than just knowledge recall, TurtleBench helps researchers understand the true strengths and weaknesses of LLMs. This can lead to improvements in how these models are designed and trained, making them more effective for practical applications where logical reasoning is essential.
Abstract
As the application of Large Language Models (LLMs) expands, the demand for reliable evaluations increases. Existing LLM evaluation benchmarks primarily rely on static datasets, making it challenging to assess model performance in dynamic interactions with users. Moreover, these benchmarks often depend on specific background knowledge, complicating the measurement of a model's logical reasoning capabilities. Other dynamic evaluation methods based on strong models or manual efforts may introduce biases and incur high costs and time demands, hindering large-scale application. To address these issues, we propose TurtleBench. TurtleBench collects real user guesses from our online Turtle Soup Puzzle platform that we developed. This approach allows for the relatively dynamic generation of evaluation datasets, mitigating the risk of model cheating while aligning assessments more closely with genuine user needs for reasoning capabilities, thus enhancing the reliability of evaluations. TurtleBench includes 1,532 user guesses along with the correctness of guesses after annotation. Using this dataset, we thoroughly evaluated nine of the most advanced LLMs available today. Notably, the OpenAI o1 series models did not achieve leading results in these evaluations. We propose several hypotheses for further research, such as "the latent reasoning of o1 utilizes trivial Chain-of-Thought (CoT) techniques" and "increasing CoT length not only provides reasoning benefits but also incurs noise costs."
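For concreteness, an annotated entry in a dataset of this kind might be structured as in the following sketch; the field names and the example text are hypothetical placeholders, not records from the released benchmark.

```python
# Hypothetical illustration of one annotated entry; the field names and text
# are placeholders, not actual TurtleBench data.
example_entry = {
    "surface": "A man orders turtle soup at a restaurant, takes one sip, and leaves in tears.",
    "bottom": "(hidden solution text of the puzzle)",
    "guess": "He realized something about a soup he had eaten in the past.",
    "label": "Correct",  # human annotation used as ground truth
}
```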