SketchVLM: Vision language models can annotate images to explain thoughts and guide users

Brandon Collins, Logan Bolton, Hung Huy Nguyen, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen

2026-04-28

Summary

This paper introduces SketchVLM, a new way for image-understanding computer programs to explain their answers. Instead of just giving a text response, these programs can now draw directly on the image to show *why* they think something is true.

What's the problem?

Current image-understanding programs, even very advanced ones, only explain themselves using text. This is a problem because it's hard for people to understand *how* the program arrived at its answer just by reading words. It's like having a friend solve a puzzle and just tell you the answer without showing their work; you don't know if they made a mistake or if you're missing something.

What's the solution?

The researchers created SketchVLM, which doesn't require changing or retraining the original program itself. It adds a layer that lets the program create simple drawings (using a format called SVG) *on top* of the image to highlight important parts and explain its reasoning. Think of it like highlighting and labeling an image to show what the program is focusing on. They tested this on tasks like navigating mazes, predicting where a ball will land, counting objects, and labeling parts of pictures, and it improved accuracy by up to 28.5 percentage points and annotation quality by up to 1.48x compared to the baselines they evaluated.
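
To make the overlay idea concrete, here is a minimal sketch in Python. It is not the authors' code: the `overlay_svg` helper, the example SVG string, and the use of the cairosvg and Pillow libraries are assumptions for illustration. The point is only that an SVG annotation returned by a model can be rasterized and composited over the image for display, without ever modifying the original file.

```python
# Minimal sketch (not the paper's implementation): composite a model-produced
# SVG annotation onto the original image. The original pixels are untouched;
# only the displayed copy carries the overlay, so the annotation stays editable.
import io

import cairosvg
from PIL import Image


def overlay_svg(image_path: str, svg_markup: str) -> Image.Image:
    base = Image.open(image_path).convert("RGBA")
    # Rasterize the SVG at the image's resolution so the annotations line up.
    png_bytes = cairosvg.svg2png(
        bytestring=svg_markup.encode("utf-8"),
        output_width=base.width,
        output_height=base.height,
    )
    overlay = Image.open(io.BytesIO(png_bytes)).convert("RGBA")
    # Alpha-compositing layers the drawing on top without altering the source.
    return Image.alpha_composite(base, overlay)


# Hypothetical SVG a model might emit to circle one object and label it.
svg = """<svg xmlns='http://www.w3.org/2000/svg' width='640' height='480'>
  <circle cx='320' cy='240' r='60' fill='none' stroke='red' stroke-width='4'/>
  <text x='320' y='160' fill='red' font-size='24'
        text-anchor='middle'>ball lands here</text>
</svg>"""

overlay_svg("photo.jpg", svg).show()
```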

Why it matters?

This is important because it makes these powerful image-understanding programs much more trustworthy and useful. If a program can *show* you why it thinks something, you're more likely to believe it and understand its limitations. It also opens the door for better collaboration between humans and AI, where people can give feedback on the program's reasoning and help it learn.

Abstract

When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the model's stated answer. We find that single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.