InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang

2025-10-14

Summary

This paper introduces a new approach to working with SVG (Scalable Vector Graphics) images using powerful AI models called multimodal large language models, or MLLMs. The authors have built a complete system – a dataset, a benchmark for testing models, and the model itself – designed to handle understanding, editing, and generating SVG images all in one place.

What's the problem?

Currently, working with SVGs is difficult because there aren't many good, organized collections of SVG data available. Existing methods don't easily adapt to different SVG tasks, and it's hard for computers to understand the complex structure of these images, especially when they involve moving parts or detailed designs. Basically, it's a fragmented field lacking good resources and unified tools.
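To make the difficulty concrete, here is a minimal sketch (not from the paper; the function name is invented for illustration) of why SVGs are a natural fit for language models yet still hard to handle: an SVG is plain XML text that a model can read and emit as a token sequence, but real illustrations nest many elements with long path data and attributes.

```python
# Hypothetical illustration: an SVG image is just structured XML text.
# A simple icon like this is short, but complex illustrations and
# animations contain deeply nested elements and long attribute strings,
# which is what makes unified modeling challenging.

def make_circle_icon(radius: int = 10, color: str = "red") -> str:
    """Build a minimal SVG icon as a string."""
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24">'
        f'<circle cx="12" cy="12" r="{radius}" fill="{color}"/>'
        "</svg>"
    )

print(make_circle_icon())
```

Even this toy example shows the structural nesting a model must respect; scientific diagrams and animations multiply that depth many times over.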

What's the solution?

The researchers built a massive dataset called SAgoge, which includes a wide variety of SVGs, from simple icons to complex animations and scientific diagrams. They also created SArena, a standardized benchmark to evaluate how well different models perform on these SVG tasks. Then, they developed InternSVG, an MLLM built specifically for SVGs. It uses SVG-specific special tokens and subword-based embedding initialization to help it understand the SVG format, along with a two-stage training strategy that starts with short, simple images and progresses to long illustrations and complex animations.
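The "subword-based embedding initialization" mentioned above can be sketched as follows. This is a hedged illustration, not the paper's actual implementation: the vocabulary, token names, and helper function are invented. The common idea is to give a new special token an embedding equal to the mean of the embeddings of the subword pieces it would otherwise be split into, so it starts from a sensible point rather than random noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table standing in for a pretrained model.
vocab = {"<": 0, "path": 1, ">": 2, "circle": 3, "rect": 4}
embed_dim = 8
embeddings = rng.normal(size=(len(vocab), embed_dim))

def init_special_token(pieces, vocab, embeddings):
    """Append a new token embedding: the mean of its subword embeddings."""
    ids = [vocab[p] for p in pieces]
    new_row = embeddings[ids].mean(axis=0)
    return np.vstack([embeddings, new_row[None, :]])

# Add a hypothetical SVG-specific token "<path>" built from existing pieces.
embeddings = init_special_token(["<", "path", ">"], vocab, embeddings)
vocab["<path>"] = len(vocab)

print(embeddings.shape)
```

The design intuition: a token like `<path>` inherits whatever the model already learned about `<`, `path`, and `>`, which tends to make fine-tuning on the new vocabulary converge faster than starting from a random embedding.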

Why it matters?

This work is important because it provides a unified way to handle all sorts of SVG tasks with a single model. By creating a large, well-organized dataset and a standardized benchmark, they’re making it easier for researchers to develop and compare new SVG tools. InternSVG significantly outperforms existing models, meaning we’re closer to AI that can reliably understand, edit, and generate high-quality vector graphics.

Abstract

General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.