LTD-Bench: Evaluating Large Language Models by Letting Them Draw

Liuhao Lin, Ke Li, Zihan Xu, Yuchen Shi, Yulei Qin, Yan Zhang, Xing Sun, Rongrong Ji

2025-11-05

Summary

This paper points out a big problem with how we currently test large language models (LLMs) like ChatGPT: the tests give us numbers, but don't actually show us if the models *understand* space and physical relationships in the real world.

What's the problem?

Right now, we judge LLMs based on scores from tests, but these scores can be misleading. A model might score well on a test, but still struggle with tasks that require understanding how things fit together in space, like imagining what an object would look like from a different angle or following directions to build something. This is a problem because if we don't know if a model understands space, we can't trust it to do things in the real world.

What's the solution?

The researchers created a new testing method called LTD-Bench. Instead of just giving a score, LTD-Bench asks the LLM to *show* its understanding by either drawing a picture using dots or writing code to create an image. This way, you can actually *see* if the model understands the spatial relationships it's being asked about. They tested this on several advanced LLMs with tasks of varying difficulty, looking at both how well the models could create images from descriptions and how well they could understand descriptions of images.
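To make the idea concrete, here is a minimal sketch of how a dot-matrix drawing task could be scored automatically. The grid format ('#' for ink, '.' for blank) and the intersection-over-union metric are illustrative assumptions for this example, not the paper's exact protocol:

```python
def parse_grid(text: str) -> set[tuple[int, int]]:
    """Return the set of (row, col) cells marked '#' in an ASCII grid."""
    return {
        (r, c)
        for r, line in enumerate(text.strip().splitlines())
        for c, ch in enumerate(line)
        if ch == "#"
    }


def iou_score(drawing: str, reference: str) -> float:
    """Intersection-over-union of inked cells; 1.0 means a perfect match."""
    a, b = parse_grid(drawing), parse_grid(reference)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Reference: a plus sign. The model's attempt misses the center cell
# and adds a stray one, so the score falls below 1.0.
reference = ".#.\n###\n.#."
model_output = ".#.\n#.#\n.#."
print(iou_score(model_output, reference))  # → 0.8
```

Because the drawing itself is inspectable, a low score can be traced to a visible failure (a shape rotated, mirrored, or disconnected) rather than hidden inside an aggregate number.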

Why it matters?

This work is important because it reveals that even LLMs that seem smart based on traditional tests actually have a surprisingly poor grasp of spatial reasoning. This is a fundamental limitation that needs to be addressed if we want to build AI that can truly interact with and understand the physical world. LTD-Bench also provides a way to diagnose *why* a model is failing, which can help researchers improve these models in the future.

Abstract

Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research, relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concepts, a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.