Jodi: Unification of Visual Generation and Understanding via Joint Modeling

Yifeng Xu, Zhenliang He, Meina Kan, Shiguang Shan, Xilin Chen

2025-05-27

Jodi: Unification of Visual Generation and Understanding via Joint
Modeling

Summary

This paper talks about Jodi, a new AI system that can both create images and understand them, all within one model, using a special approach called diffusion and a transformer that can switch roles depending on the task.

What's the problem?

The problem is that most AI models are either good at generating images or at understanding them, but not both at the same time. This means you usually need separate systems for creating visuals and for analyzing or interpreting them, which is less efficient and less flexible.

What's the solution?

The researchers built Jodi by combining techniques so the model can switch between making images and understanding them using the same core technology. This joint modeling allows Jodi to handle a wide range of tasks, like creating new visuals, recognizing what's in an image, or even controlling how images are generated, all in one system.

Why it matters?

This is important because it makes AI much more versatile and powerful, allowing it to help with creative projects, visual analysis, and many other tasks without needing multiple separate models, making technology more accessible and useful for everyone.

Abstract

Jodi, a diffusion framework using a linear diffusion transformer and role switch mechanism, unifies visual generation and understanding, performing joint, controllable, and perceptual tasks effectively across multiple visual domains.

View Paper