Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, Nanyun Peng
2026-04-01
Summary
This paper introduces a new approach to creating images from text descriptions, focusing on generating realistic images even when the descriptions involve obscure or complex knowledge.
What's the problem?
Current image generation models produce impressive results, but they struggle when asked to depict things outside their training data, especially concepts that require extensive real-world knowledge. They rely mostly on knowledge frozen into their parameters and cannot easily retrieve or incorporate new information, so they often render less common concepts inaccurately.
What's the solution?
The researchers developed a system called Unify-Agent that works like an agent completing a task. It first interprets the text prompt, then actively searches online for relevant evidence, uses that evidence to rewrite the prompt into a grounded description, and finally generates the image. They also curated a large training set of 143K agent trajectories and a new benchmark called FactIP to specifically test this kind of knowledge-based image generation.
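The four-stage loop described above (understand, search, recaption, synthesize) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: every function name is hypothetical, and each stage is a stub where a real system would call a unified multimodal model or a live search API.

```python
# Hypothetical sketch of a world-grounded generation pipeline:
# understand -> search -> recaption -> synthesize.
# All stages are stubs standing in for model and search-API calls.

def understand_prompt(prompt: str) -> list[str]:
    """Extract concepts the model may not know (stub: capitalized words)."""
    return [w for w in prompt.split() if w[0].isupper()]

def search_evidence(concepts: list[str]) -> dict[str, str]:
    """Fetch external evidence for each unknown concept (stubbed)."""
    return {c: f"retrieved description of {c}" for c in concepts}

def recaption(prompt: str, evidence: dict[str, str]) -> str:
    """Rewrite the prompt so it is grounded in the retrieved evidence."""
    notes = "; ".join(evidence.values())
    return f"{prompt} ({notes})" if notes else prompt

def synthesize(grounded_prompt: str) -> str:
    """Stand-in for the image generator; returns a label here."""
    return f"<image for: {grounded_prompt}>"

def generate(prompt: str) -> str:
    """Run the full agentic loop on one prompt."""
    concepts = understand_prompt(prompt)
    evidence = search_evidence(concepts)
    return synthesize(recaption(prompt, evidence))

print(generate("a cat beside a Taihu stone"))
```

The key design point the sketch captures is that retrieval happens between understanding and generation, so the prompt handed to the synthesizer is already enriched with external knowledge rather than relying on the generator's frozen parameters.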
Why it matters?
This work is important because it shows that giving image generation models the ability to 'think' and actively seek out information can significantly improve their ability to create accurate and detailed images, even about things they haven't seen before. It moves us closer to models that can truly understand and represent the world around us.
Abstract
Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.