DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao, Difan Liu

2025-12-12

Summary

This paper introduces a new approach, called DuetSVG, for creating Scalable Vector Graphics (SVGs) from text descriptions. It's about making computers better at turning words into images that are defined by mathematical shapes, like the ones used for logos or illustrations.

What's the problem?

Current methods that generate SVGs from text often struggle to create complex images that look good and make sense visually. They focus only on the text and don't actually 'see' what the image should look like during generation, leading to errors in how shapes are drawn or how they relate to each other. Basically, they can understand *what* to draw, but not *how* to draw it well.

What's the solution?

DuetSVG solves this by creating a model that generates both the image itself (as a series of visual 'tokens') and the SVG code simultaneously. It learns from examples of both images and their corresponding SVG code. Then, when creating a new SVG, it uses its own 'visual predictions' as a guide to make sure the SVG code creates an image that actually looks right, improving the final result.
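The paper doesn't include implementation details in this summary, but the idea of using the model's own visual predictions to guide SVG decoding can be sketched as a best-of-N selection loop. Everything below is a toy illustration with made-up stand-in functions (`sample_image_tokens`, `sample_svg_candidate`, `visual_consistency`), not the paper's actual architecture or scoring method:

```python
import random

# Hypothetical sketch of visual-guidance test-time scaling: the model's own
# predicted image tokens serve as an internal reference, and each sampled SVG
# candidate is scored by how well its rendered tokens agree with that reference.
# All functions here are toy stand-ins, not DuetSVG's real implementation.

def sample_image_tokens(rng, length=8, vocab=16):
    """Stand-in for the model's native visual prediction (image tokens)."""
    return [rng.randrange(vocab) for _ in range(length)]

def sample_svg_candidate(rng, reference, noise=0.5, vocab=16):
    """Stand-in for one sampled SVG decoding, rendered back to image tokens.
    Each token randomly matches the reference or deviates, simulating
    decoding errors."""
    return [t if rng.random() > noise else rng.randrange(vocab)
            for t in reference]

def visual_consistency(candidate_tokens, reference_tokens):
    """Fraction of rendered tokens that match the visual prediction."""
    matches = sum(a == b for a, b in zip(candidate_tokens, reference_tokens))
    return matches / len(reference_tokens)

def decode_with_visual_guidance(seed=0, num_candidates=8):
    """Best-of-N decoding: sample several SVG candidates and keep the one
    most consistent with the model's internal visual prediction."""
    rng = random.Random(seed)
    reference = sample_image_tokens(rng)
    candidates = [sample_svg_candidate(rng, reference)
                  for _ in range(num_candidates)]
    best = max(candidates, key=lambda c: visual_consistency(c, reference))
    return best, reference
```

The key point this sketch captures is that the "judge" of SVG quality is not an external model but the same model's own visual stream, so no extra rendering or evaluation network is needed at inference time.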

Why does it matter?

This research is important because it improves the quality of automatically generated SVGs. Better SVG generation has applications in many areas, like creating graphics for websites, designing user interfaces, and producing illustrations automatically. By making generated images more visually accurate and structurally sound, it opens up possibilities for more automated design workflows.

Abstract

Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.