Thinking with Drafting: Optical Decompression via Logical Reconstruction
Jingxuan Wei, Honghao He, Caijun Jia, Siyuan Li, Zheng Sun, Yuhang Xu, Yuanyuan Lin, Linzhuang Sun, Yuchen Wu, Bihui Yu, Xiangxiang Zhang, Cheng Tan
2026-02-13
Summary
This paper addresses a key weakness in current AI systems that can 'see' and even 'create' images: they struggle with truly understanding and reasoning about what they see, especially when it involves logic or precise calculations.
What's the problem?
While AI models are good at recognizing objects in images and generating new images, they often fail when asked to solve problems *using* those images, particularly if the problem requires logical thinking or exactness. They can identify symbols, but don't grasp the relationships between them, and generated images can contain errors that a human would immediately notice. Essentially, they see pixels but don't understand the underlying concepts.
What's the solution?
The researchers propose a new approach called 'Thinking with Drafting' (TwD). Instead of directly guessing the answer to a visual problem, the AI first creates a step-by-step plan, written in a simple computer language, to solve it. This plan is like the AI 'showing its work.' Then, it uses this plan to generate visual proofs, which it can check to make sure its reasoning is correct. They also created a new test, VisAlg, specifically designed to challenge these reasoning abilities.
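The loop described above can be sketched in miniature: the model emits a small program in a drawing DSL rather than a bare answer, an interpreter executes it deterministically, and a verifier checks the drafted claim against the rendered geometry. The DSL commands and function names below are hypothetical illustrations for this summary, not the paper's actual DSL; assume a toy language with `point` declarations and a `claim_on_line` assertion.

```python
# Hypothetical sketch of the Thinking-with-Drafting loop: instead of guessing
# an answer, the model drafts a tiny DSL program; an interpreter executes it
# and a verifier checks every visual claim deterministically.
# The DSL ('point', 'claim_on_line') is illustrative, not the paper's DSL.

def parse(program):
    """Parse DSL lines like 'point P 2 3' and 'claim_on_line P A B'."""
    points, claims = {}, []
    for line in program.strip().splitlines():
        tok = line.split()
        if tok[0] == "point":               # point NAME X Y
            points[tok[1]] = (float(tok[2]), float(tok[3]))
        elif tok[0] == "claim_on_line":     # claim that P lies on line AB
            claims.append((tok[1], tok[2], tok[3]))
    return points, claims

def collinear(p, a, b, eps=1e-9):
    """Deterministic check: the cross product of (b-a) and (p-a) is ~0."""
    return abs((b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0])) < eps

def verify(program):
    """Execute the draft and accept only if every drafted claim holds."""
    points, claims = parse(program)
    return all(collinear(points[p], points[a], points[b]) for p, a, b in claims)

# A 'draft' for the midpoint of (0,0)-(4,6): the claim is checked, not guessed.
draft = """
point A 0 0
point B 4 6
point M 2 3
claim_on_line M A B
"""
print(verify(draft))  # True: the drafted midpoint really lies on AB
```

The design point this toy captures is the paper's closed loop: the generated artifact is not a creative output but a checkable object, so a wrong draft (say, `point M 2 4`) is rejected by the verifier rather than silently accepted.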
Why it matters?
This work is important because it moves AI closer to genuine visual understanding. By forcing the AI to explain its reasoning process and verify its answers, it becomes more reliable and less prone to errors. This isn't just about making pretty pictures; it's about building AI that can actually *think* with images, which has huge implications for fields like robotics, scientific discovery, and education.
Abstract
Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.