What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models
Dasol Choi, Guijin Son, Hanwool Lee, Minhyuk Kim, Hyunwoo Ko, Teabin Lim, Ahn Eungyeol, Jungwhan Kim, Seunghyeok Hong, Youngsook Song
2026-01-13
Summary
This research examines how well vision-language models (VLMs) – AI systems that can understand both images and text – handle questions phrased the way people naturally ask them: often vague and missing details.
What's the problem?
Current tests for VLMs use very clear and specific questions. However, when people actually use these models, they tend to ask questions that are more casual and leave out information, assuming the image provides enough context. This means VLMs are being tested on something different from how they're actually used, and they struggle with these real-world, less-defined questions. The researchers found that even the best models answer correctly less than half the time when given these natural, incomplete questions.
What's the solution?
The researchers created a new test dataset called HAERAE-Vision, which includes over 650 real questions taken from Korean online communities. These questions are naturally worded and often lack detail. They also provided a more detailed, rewritten version of each question. By comparing how well models perform on the original versus the rewritten questions, they could see how much the lack of detail affected performance. They tested 39 different VLMs, including very advanced ones like GPT-5 and Gemini 2.5 Pro, and also explored if using web search could help with the vague questions.
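The core measurement described above is a paired comparison: the same model answers each question in its original, under-specified form and again in its explicit rewrite, and the accuracy gap quantifies how much under-specification costs. A minimal sketch of that idea follows; the field names and toy data are illustrative assumptions, not part of the HAERAE-Vision release.

```python
# Hypothetical sketch of paired evaluation on original vs. explicit queries.
# Each record holds correctness flags for the SAME image/question pair
# under its two phrasings. Field names are illustrative assumptions.

def paired_accuracy_gap(results):
    """Return (accuracy on original queries, accuracy on explicit
    rewrites, and the explicitation gain) over paired records."""
    n = len(results)
    acc_original = sum(r["correct_original"] for r in results) / n
    acc_explicit = sum(r["correct_explicit"] for r in results) / n
    return acc_original, acc_explicit, acc_explicit - acc_original

# Toy example with four paired items (made-up outcomes):
toy = [
    {"correct_original": False, "correct_explicit": True},
    {"correct_original": True,  "correct_explicit": True},
    {"correct_original": False, "correct_explicit": False},
    {"correct_original": False, "correct_explicit": True},
]
orig, expl, gap = paired_accuracy_gap(toy)
print(orig, expl, gap)  # → 0.25 0.75 0.5
```

Because each original question and its rewrite share the same image and answer, any accuracy difference isolates the effect of query wording rather than task difficulty.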
Why it matters?
This work shows that a major reason VLMs struggle isn't necessarily because they aren't powerful enough, but because they have trouble understanding what people *mean* when they ask incomplete questions. Simply making the questions more explicit significantly improves performance, even for smaller models. It also reveals that current search technology isn't enough to fill in the gaps left by these underspecified queries, meaning improving how models handle natural language is crucial for making them truly useful in everyday situations.
Abstract
Current vision-language benchmarks predominantly feature well-structured questions with clear, explicit prompts. However, real user queries are often informal and underspecified. Users naturally leave much unsaid, relying on images to convey context. We introduce HAERAE-Vision, a benchmark of 653 real-world visual questions from Korean online communities (0.76% survival from 86K candidates), each paired with an explicit rewrite, yielding 1,306 query variants in total. Evaluating 39 VLMs, we find that even state-of-the-art models (GPT-5, Gemini 2.5 Pro) achieve under 50% on the original queries. Crucially, query explicitation alone yields 8 to 22 point improvements, with smaller models benefiting most. We further show that even with web search, under-specified queries underperform explicit queries without search, revealing that current retrieval cannot compensate for what users leave unsaid. Our findings demonstrate that a substantial portion of VLM difficulty stems from natural query under-specification instead of model capability, highlighting a critical gap between benchmark evaluation and real-world deployment.