
Symbolic Graphics Programming with Large Language Models

Yamei Chen, Haoquan Zhang, Yangyi Huang, Zeju Qiu, Kaipeng Zhang, Yandong Wen, Weiyang Liu

2025-09-08


Summary

This paper investigates how well large language models can write code that renders into images, specifically code in the Scalable Vector Graphics (SVG) format. It explores whether these models truly 'understand' what they are drawing from a text description, and how to improve their ability to do so.

What's the problem?

While large language models are good at writing code in general, they are much weaker at producing SVG code that accurately depicts what a user asks for in plain language. Existing open-source models perform significantly worse than the best proprietary models, and it is unclear why. The challenge is getting a model to generate code that both works (renders into a valid image) and actually *looks* like the description.

What's the solution?

The researchers used a technique called reinforcement learning to train a language model (Qwen-2.5-7B) to generate better SVGs. They gave the model rewards based on two things: first, whether the generated code rendered into a valid image at all, and second, how closely the rendered image matched the original text description, measured with strong vision encoders (SigLIP for text-image similarity and DINO for image-image similarity); a rough sketch of this kind of reward is shown below. This process taught the model to break objects down into simpler, controllable shapes and to add contextual details that make scenes more coherent.
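To make the reward idea concrete, here is a minimal sketch of such a verifiable reward, not the authors' released implementation: an SVG that fails to render earns nothing, and a renderable one is scored by text-image agreement. It assumes the cairosvg renderer and the public google/siglip-base-patch16-224 checkpoint from Hugging Face transformers; the checkpoint choice and the sigmoid squashing are illustrative.

```python
# Minimal sketch of a "format-validity gate + cross-modal reward", assuming
# cairosvg for rendering and a public SigLIP checkpoint; not the paper's code.
import io

import cairosvg
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

CHECKPOINT = "google/siglip-base-patch16-224"  # illustrative choice
processor = AutoProcessor.from_pretrained(CHECKPOINT)
siglip = AutoModel.from_pretrained(CHECKPOINT).eval()


def render_svg(svg_code: str) -> Image.Image | None:
    """Format-validity gate: return the rasterized image, or None if the SVG is invalid."""
    try:
        png_bytes = cairosvg.svg2png(bytestring=svg_code.encode("utf-8"),
                                     background_color="white")
        return Image.open(io.BytesIO(png_bytes)).convert("RGB")
    except Exception:
        return None


def reward(svg_code: str, description: str) -> float:
    """Cross-modal reward: 0 for unrenderable SVG, else a SigLIP match score in (0, 1)."""
    image = render_svg(svg_code)
    if image is None:
        return 0.0
    inputs = processor(text=[description], images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        out = siglip(**inputs)
    # SigLIP is trained with a sigmoid objective, so squashing the logit gives
    # a probability-like agreement score between the text and the rendered image.
    return torch.sigmoid(out.logits_per_image[0, 0]).item()
```

For example, calling reward() on a tiny SVG such as '<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64"><circle cx="32" cy="32" r="20" fill="red"/></svg>' with the description 'a red circle' returns a higher score than a mismatched description, while malformed markup scores 0, which mirrors the kind of verifiable signal the paper's RL setup optimizes.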

Why it matters?

This work is important because it provides a way to test how well AI understands the connection between language and visual concepts. By focusing on SVG code, which is precise and interpretable, the researchers can see exactly *how* the model is trying to represent an image, offering insights into its reasoning and helping to build more visually intelligent AI systems.

Abstract

Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.