
UQ: Assessing Language Models on Unsolved Questions

Fan Nie, Ken Ziyu Liu, Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, Huaxiu Yao, Linjun Zhang, Andrew Y. Ng, James Zou, Sanmi Koyejo, Yejin Choi, Percy Liang, Niklas Muennighoff

2025-08-26


Summary

This paper introduces a new way to evaluate AI models, moving away from static benchmarks whose answers are already known. Instead, it tests models by seeing whether they can answer questions that *humans haven't already solved*.

What's the problem?

Current AI benchmarks face a tension: exam-style benchmarks are often artificially difficult and don't reflect real-world problems, while benchmarks built from real user questions skew toward common, easy problems that advanced models handle with ease. It's hard to balance challenging the AI against making sure the questions are useful and representative of what people actually need help with. Existing benchmarks also tend to be 'solved' quickly, at which point they stop pushing the boundaries of AI capabilities.

What's the solution?

The researchers created UQ, a platform built around 500 difficult, diverse questions drawn from Stack Exchange – a network of question-and-answer sites – that the community has left unanswered. Each question passes through a multi-step collection pipeline combining rule-based filters, AI judges, and human reviewers to ensure it is well-defined and genuinely hard. The researchers also built AI 'validators' that pre-screen candidate answers, so human experts only need to verify the answers most likely to be correct. The platform is designed to be continually updated with new unsolved questions.
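The paper doesn't publish its pipeline code here, but the following is a minimal sketch, in Python, of what a collection pipeline like the one described could look like: rule-based filters first, then an LLM judge, with survivors queued for human review. The `Question` fields, the specific filter thresholds, and the `llm_judge_score` stub are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Question:
    title: str
    body: str
    answer_count: int  # answers posted on Stack Exchange
    score: int         # community upvotes


def passes_rule_filters(q: Question) -> bool:
    # Illustrative rule-based filters: keep only unanswered,
    # reasonably upvoted, non-trivial questions.
    return q.answer_count == 0 and q.score >= 5 and len(q.body) > 200


def llm_judge_score(q: Question) -> float:
    # Placeholder for an LLM judge rating whether the question is
    # well-defined and difficult (0.0-1.0). A real pipeline would call
    # a language model API here; this stub returns a fixed value.
    return 0.9


def collect_candidates(questions: list[Question], threshold: float = 0.8) -> list[Question]:
    """Rule-based filters -> LLM judge -> queue for human review."""
    for_human_review = []
    for q in questions:
        if not passes_rule_filters(q):
            continue
        if llm_judge_score(q) >= threshold:
            # Final acceptance into the dataset is decided by human reviewers.
            for_human_review.append(q)
    return for_human_review


if __name__ == "__main__":
    qs = [Question("An open graph-theory question", "details " * 50, 0, 12)]
    print(f"{len(collect_candidates(qs))} question(s) queued for human review")
```

The key design idea the sketch tries to capture is that cheap automatic checks run first, so expensive human review is spent only on questions that already look well-defined and difficult.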

Why it matters?

This is important because it provides a more realistic and ongoing way to measure AI progress. By focusing on unsolved problems, it forces AI to truly *reason* and *discover* new information, rather than just recalling facts it was trained on. Successfully answering these questions actually contributes to human knowledge, making the benchmark directly valuable. It sets a new standard for evaluating AI on complex, open-ended challenges.

Abstract

Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at https://uq.stanford.edu.
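The abstract's UQ-Validators rely on the generator-validator gap: models are often better at checking a candidate answer than at producing one, so validator models can supply useful signal even on questions they cannot solve. As a rough illustration only, the sketch below combines several independent validator checks and forwards an answer to human reviewers only if all of them agree; the placeholder check functions and the unanimous-vote rule are assumptions for illustration, not the paper's actual validation strategy.

```python
from typing import Callable

# A validator takes (question, candidate_answer) and returns True if it
# judges the answer plausible. In practice each would be an LLM prompted
# to verify or critique the answer; here they are placeholders.
Validator = Callable[[str, str], bool]


def checks_self_consistency(question: str, answer: str) -> bool:
    return True  # placeholder: re-derive the answer and compare


def checks_factuality(question: str, answer: str) -> bool:
    return True  # placeholder: verify cited facts, e.g. via browsing


def compound_validate(question: str, answer: str,
                      validators: list[Validator]) -> bool:
    """Unanimous vote across validators; only survivors reach human review."""
    return all(v(question, answer) for v in validators)


if __name__ == "__main__":
    passed = compound_validate(
        "An unsolved Stack Exchange question",
        "A model's candidate answer",
        [checks_self_consistency, checks_factuality],
    )
    print("pre-screen passed" if passed else "rejected before human review")
```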