VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang
2025-11-05
Summary
This paper explores how well AI models can 'code' visuals, specifically by generating SVG code from images. It argues that representing an image as code is a good way to make sure the AI truly understands what it's seeing, rather than just recognizing patterns.
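To make the "image as code" idea concrete, here is a minimal sketch (not an example from the paper or benchmark): a simple scene written as SVG and rasterized back to pixels with the `cairosvg` library. The scene itself is invented for illustration.

```python
# A minimal sketch of "image as code": a toy scene expressed as SVG.
# Requires: pip install cairosvg. The scene is illustrative only.
import cairosvg

svg_scene = """
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">
  <rect x="0" y="80" width="200" height="40" fill="green"/>   <!-- grass -->
  <circle cx="160" cy="30" r="20" fill="yellow"/>             <!-- sun -->
  <rect x="40" y="40" width="60" height="40" fill="brown"/>   <!-- house body -->
  <polygon points="40,40 70,15 100,40" fill="red"/>           <!-- roof -->
</svg>
"""

# Rasterize the symbolic representation back into a normal image file.
cairosvg.svg2png(bytestring=svg_scene.encode("utf-8"), write_to="scene.png")
```

Because every object is an explicit element with coordinates and attributes, a question like "what is to the left of the sun?" can in principle be answered from the code itself, which is what makes SVG a symbolic rather than purely pixel-based representation.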
What's the problem?
Current AI models are very good at language tasks and at generating code from text prompts, but they struggle to understand images and turn them into code. The researchers identified a gap in the ability to translate visual information into a precise, coded representation like SVG. Essentially, AI can write code from words but not from pictures, and this limits its ability to reason about visual scenes.
What's the solution?
The researchers created a new benchmark called VCode, which challenges AI models to generate SVG code from images. They also developed a way to test whether the generated SVG code actually *means* the same thing as the original image: a 'CodeVQA' protocol in which a model answers questions over the rendered SVG, so correct answers indicate that the code preserved the image's meaning (a rough sketch follows below). To improve performance, they built VCoder, an agentic framework that helps AI models work step-by-step: it iteratively revises its SVG code and calls visual tools, such as object detectors and parsers, to identify objects, shapes, and text in the image and guide the coding process.
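A rough sketch of the CodeVQA idea, under the assumption of a generic chat-style policy model; the `answer_question` helper below is hypothetical, standing in for whatever VLM the protocol actually uses, not the paper's API:

```python
import cairosvg

def answer_question(image_path: str, question: str) -> str:
    """Hypothetical policy-model call (e.g., a VLM behind an API).
    Stand-in for the model that answers questions in CodeVQA."""
    raise NotImplementedError("plug in your VLM client here")

def codevqa_check(svg_code: str, question: str, ground_truth: str) -> bool:
    # Render the model-generated SVG back to pixels...
    cairosvg.svg2png(bytestring=svg_code.encode("utf-8"), write_to="render.png")
    # ...then ask the policy model the same question over the rendering.
    prediction = answer_question("render.png", question)
    # A correct answer suggests the SVG preserved the symbolic content.
    return prediction.strip().lower() == ground_truth.strip().lower()
```

The design choice here is that fidelity is measured functionally: the SVG counts as faithful not because it looks pixel-identical, but because a model can still answer the image's questions from it.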
Why it matters?
This work is important because it highlights a weakness in current AI systems – their limited ability to truly understand and represent visual information in a structured way. By focusing on visual coding, the researchers are pushing AI towards a deeper understanding of images, which is crucial for applications like robotics, image editing, and any task where AI needs to reason about the visual world. It shows that representing visuals as code can help ensure the AI isn't just 'seeing' but actually 'understanding'.
Abstract
Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
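The abstract's two axes can be pictured as a generate-render-critique-refine loop seeded with tool-derived cues. Below is a hedged sketch, not the paper's implementation: `vlm_generate_svg`, `vlm_critique`, and `run_detectors` are hypothetical stand-ins for VCoder's actual components.

```python
import cairosvg

def vlm_generate_svg(image_path: str, hints: str, feedback: str | None) -> str:
    """Hypothetical VLM call that drafts SVG code, or revises it given feedback."""
    raise NotImplementedError

def vlm_critique(image_path: str, render_path: str) -> str | None:
    """Hypothetical VLM call: describe discrepancies between the original
    image and the rendered SVG, or return None if none are found."""
    raise NotImplementedError

def run_detectors(image_path: str) -> str:
    """Hypothetical 'Acting with Visual Tools' step: detectors/parsers
    return structured cues (objects, shapes, text) as a hint string."""
    raise NotImplementedError

def vcoder_loop(image_path: str, max_rounds: int = 3) -> str:
    hints = run_detectors(image_path)        # acting with visual tools
    feedback = None
    svg = ""
    for _ in range(max_rounds):
        svg = vlm_generate_svg(image_path, hints, feedback)
        cairosvg.svg2png(bytestring=svg.encode("utf-8"), write_to="draft.png")
        feedback = vlm_critique(image_path, "draft.png")  # thinking with revision
        if feedback is None:                 # no discrepancies: stop revising
            break
    return svg
```

The key structural point, consistent with the abstract, is that revision is driven by rendered output rather than by the code text alone: each round compares pixels against pixels, then feeds the discrepancies back into the next SVG draft.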