Qworld: Question-Specific Evaluation Criteria for LLMs
Shanghua Gao, Yuchang Su, Pengwei Sui, Curtis Ginder, Marinka Zitnik
2026-03-26
Summary
This paper introduces a new way to judge how well large language models, like advanced chatbots, answer open-ended questions. It focuses on the fact that a 'good' answer really depends on *what* is being asked, and current methods aren't good at capturing that nuance.
What's the problem?
Evaluating LLMs is tricky because what makes an answer good changes depending on the specific question. Simply giving a score of 'right' or 'wrong' or using a standard set of rules doesn't work well. Existing methods try to create rules for entire datasets at once, or generate them only once per question, which means they miss important details and different angles that each question might require. They don't really explore all the ways a question could be evaluated.
What's the solution?
The researchers developed a method called Qworld, which creates evaluation criteria *specifically* for each question. It works by breaking down a question into different scenarios, viewpoints, and very specific yes-or-no checks that an answer should satisfy. Think of it like building a tree of ideas: the main question sits at the root and branches out into all the related things a good answer needs to address. The result is a detailed checklist for judging the quality of a response to that particular question.
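The tree idea described above can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the `expand` function, and the `toy_propose` stand-in (which would be a language-model call in any real system) are all hypothetical names chosen for illustration, and the fixed question, scenario, perspective, criterion ordering is an assumption based on the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumed level ordering, taken from the prose description (not from the paper's code).
LEVELS = ["question", "scenario", "perspective", "criterion"]


@dataclass
class Node:
    kind: str  # one of LEVELS
    text: str
    children: List["Node"] = field(default_factory=list)


def expand(node: Node, propose: Callable[[str, str], List[str]]) -> Node:
    """Recursively grow the tree below `node`.

    `propose(child_kind, parent_text)` returns the texts of the children to
    create; in practice this would be a model call, here it is a placeholder.
    """
    level = LEVELS.index(node.kind)
    if level == len(LEVELS) - 1:
        return node  # criteria are leaves, nothing to expand
    child_kind = LEVELS[level + 1]
    for text in propose(child_kind, node.text):
        node.children.append(expand(Node(child_kind, text), propose))
    return node


def leaf_criteria(node: Node) -> List[str]:
    """Collect every criterion in the tree as a flat checklist."""
    if node.kind == "criterion":
        return [node.text]
    checklist: List[str] = []
    for child in node.children:
        checklist.extend(leaf_criteria(child))
    return checklist


def toy_propose(child_kind: str, parent_text: str) -> List[str]:
    # Hypothetical stand-in: always proposes two children per node.
    return [f"{child_kind} {i} of ({parent_text})" for i in (1, 2)]


if __name__ == "__main__":
    root = expand(Node("question", "How should a fever in an infant be managed?"), toy_propose)
    for criterion in leaf_criteria(root):
        print(criterion)
```

With two proposals per node, the toy tree has two scenarios, four perspectives, and eight leaf criteria. Each call to `propose` returning several siblings corresponds loosely to widening the tree, while the recursion deepens it; how Qworld actually controls breadth and depth is not specified here.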
Why does it matter?
This is important because Qworld allows for a much more accurate and insightful evaluation of LLMs. It can reveal differences in how well models handle complex issues like long-term consequences, fairness, dealing with errors, and combining knowledge from different fields – things that simpler evaluation methods often miss. Ultimately, it helps us better understand the strengths and weaknesses of these powerful AI systems.
Abstract
Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question's context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those produced by prior methods. When applied to 11 frontier LLMs on HealthBench and Humanity's Last Exam, Qworld reveals capability differences in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish. By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.
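Because the generated criteria are binary, one simple way to turn them into a score is to report the fraction of criteria a response satisfies. The sketch below assumes that unweighted aggregation, which the abstract does not state; `checklist_score` and `keyword_judge` are hypothetical names, and the keyword matcher is only a placeholder for whatever judge (human or model) decides whether a criterion is met.

```python
from typing import Callable, Dict, List


def checklist_score(
    criteria: List[str],
    answer: str,
    judge: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Score an answer as the fraction of binary criteria it satisfies.

    Assumption: every criterion counts equally. The source does not say how
    criteria are weighted or aggregated.
    """
    if not criteria:
        raise ValueError("need at least one criterion")
    verdicts = {c: bool(judge(c, answer)) for c in criteria}
    return {"verdicts": verdicts, "score": sum(verdicts.values()) / len(criteria)}


def keyword_judge(criterion: str, answer: str) -> bool:
    # Placeholder judge: a criterion is "met" if its text appears in the answer.
    return criterion.lower() in answer.lower()


if __name__ == "__main__":
    result = checklist_score(
        ["dosage", "side effects", "follow-up"],
        "Check the dosage and watch for side effects.",
        keyword_judge,
    )
    print(result["verdicts"])
    print(round(result["score"], 2))
```

Keeping the per-criterion verdicts alongside the aggregate makes it possible to compare models on individual dimensions rather than on a single number, which is the kind of fine-grained comparison the abstract describes.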