X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
Zeyi Sun, Ziyang Chu, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
2024-12-03

Summary
This paper introduces X-Prompt, a method that improves how vision-language models generate images by using in-context examples to help them understand and perform a wide variety of image generation tasks.
What's the problem?
While large language models (LLMs) and the vision-language models built on them have been effective at generating text and images from prompts, there has been little exploration of how these models can learn from examples provided in context when generating images. This limits their ability to adapt to new tasks or styles, especially ones they have not seen during training.
What's the solution?
X-Prompt addresses this issue with a framework that lets the model learn from examples provided in the context of the task. A dedicated compression design distills the most important features from these examples, enabling the model to handle longer sequences of in-context tokens. The model is also trained to predict both text and image tokens in a unified way, which improves its task awareness and its ability to generate images in new situations. Extensive experiments show that X-Prompt performs well on a variety of tasks, including ones it has not encountered before.
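To make the compression idea concrete, here is a minimal, hypothetical sketch of how long in-context example sequences could be distilled into a small set of summary tokens using learned-query cross-attention. The class name, dimensions, and overall design are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class InContextCompressor(nn.Module):
    """Compress a long in-context example sequence into a few summary
    tokens via learned-query cross-attention. (Illustrative sketch;
    X-Prompt's actual compression design may differ.)"""

    def __init__(self, dim: int = 1024, num_summary: int = 64, heads: int = 8):
        super().__init__()
        # Learned queries that "read out" the example's key features.
        self.queries = nn.Parameter(torch.randn(num_summary, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        # ctx: (batch, seq_len, dim) embeddings of an in-context example's
        # interleaved image/text tokens; seq_len may be very long.
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        summary, _ = self.attn(q, ctx, ctx)  # (batch, num_summary, dim)
        return summary
```

In a design like this, the compressed summary tokens would be prepended to the target prompt's tokens, so the backbone attends to a short, fixed-length task description instead of the full example sequence.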
Why it matters?
This research is important because it enhances the capabilities of vision-language models, making them more versatile and effective in generating images across different contexts. By improving how these models learn from examples, X-Prompt can be applied in many fields such as art creation, advertising, and virtual reality, where high-quality image generation is essential.
Abstract
In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large vision-language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.
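The abstract's "unified training task for both text and image prediction" can be read as a single next-token objective over a shared vocabulary containing both text tokens and discrete image tokens. The snippet below is a hedged sketch of that objective under this assumption; the function name and tensor shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def unified_next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """One cross-entropy loss over an interleaved sequence whose vocabulary
    mixes text tokens and discrete image tokens (e.g. from a VQ tokenizer).
    Assumed shapes: logits (batch, seq_len, vocab), tokens (batch, seq_len)."""
    # Shift so position t predicts token t+1, regardless of modality.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
```

Treating both modalities with one objective is what lets the same autoregressive backbone acquire task awareness from in-context examples and apply it to image generation.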