Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Jooyeol Yun, Jaegul Choo
2025-12-17
Summary
This paper focuses on making it easier for computers to automatically create animations from Scalable Vector Graphics (SVGs), a format commonly used for images on the web.
What's the problem?
Current artificial intelligence models, specifically those that understand both images and language, struggle with animating SVGs because SVGs are built from many tiny, separate shapes. The AI doesn't understand which shapes belong together and should move as a single unit, leading to choppy and unrealistic animations. It's like trying to animate a car when you only see it as a bunch of individual lines and curves instead of a whole vehicle.
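To make this fragmentation concrete, here is a minimal, hypothetical SVG (the shapes and `id`s are invented for illustration). Parsed as XML, the shapes appear as flat siblings of the root, with nothing in the markup indicating that two of them form a wheel:

```python
import xml.etree.ElementTree as ET

# Hypothetical export of a "car": one object flattened into sibling
# paths, with no grouping information in the markup.
SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <path id="p1" d="M0 0 L10 0"/>
  <path id="p2" d="M2 8 A3 3 0 1 0 8 8"/>
  <path id="p3" d="M3 8 A2 2 0 1 0 7 8"/>
</svg>"""

root = ET.fromstring(SVG)
# Every shape is an independent sibling; at the structural level the
# "wheel" arcs and the "body" line are indistinguishable.
shape_ids = [el.attrib["id"] for el in root]
print(shape_ids)  # ['p1', 'p2', 'p3']
```

An animation model working from this markup alone has no basis for moving `p2` and `p3` together as a wheel, which is exactly the gap the paper targets.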
What's the solution?
The researchers developed a new system that first identifies the 'meaningful parts' of an SVG image, grouping individual shapes into larger, logical components such as wheels, bodies, or arms. Because any single prediction of these parts is noisy, the system statistically combines several imperfect predictions to arrive at a more reliable understanding of the image's structure. This allows the AI to animate the grouped parts in a coordinated way, resulting in smoother and more coherent animations.
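The abstract describes this step as a statistical aggregation of multiple weak part predictions. A minimal sketch of one such aggregation, a per-shape majority vote, is shown below; the function name, shape ids, and part labels are hypothetical, and the paper's actual procedure may differ:

```python
from collections import Counter

def aggregate_part_labels(predictions):
    """Combine several noisy shape->part labelings by per-shape majority vote.

    predictions: list of dicts mapping shape id -> predicted part name.
    Returns one consensus dict. (Illustrative only.)
    """
    consensus = {}
    shape_ids = {sid for run in predictions for sid in run}
    for sid in shape_ids:
        # Count the votes each part label received for this shape.
        votes = Counter(run[sid] for run in predictions if sid in run)
        consensus[sid] = votes.most_common(1)[0][0]
    return consensus

# Three imperfect predictions disagree on shape "p3";
# the majority vote resolves it to "wheel".
runs = [
    {"p1": "body", "p2": "wheel", "p3": "wheel"},
    {"p1": "body", "p2": "wheel", "p3": "body"},
    {"p1": "body", "p2": "wheel", "p3": "wheel"},
]
consensus = aggregate_part_labels(runs)
```

Once every shape has a consensus part label, shapes sharing a label can be wrapped in a common SVG group so the animation model moves them as one unit.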
Why it matters?
This work is important because it unlocks the potential for AI to create more complex and natural-looking animations from vector graphics. This could lead to more dynamic and interactive web content, and it helps bridge the gap between how humans understand images and how computers process them, making interactions with AI more intuitive.
Abstract
Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mishandle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance about which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.