PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Junlin Xie, Yu Qiao, Peng Gao, Hongsheng Li
2024-09-24

Summary
This paper introduces PixWizard, a versatile visual assistant that helps users generate, manipulate, and translate images based on natural language instructions. It unifies diverse image tasks within a single framework, so that generation, restoration, and editing can all be driven by the same model.
What's the problem?
Many existing image processing tools struggle to understand complex language instructions or only handle specific tasks. This limits their usability and makes it difficult for users to achieve their desired outcomes when working with images. There is a need for a more flexible system that can handle multiple visual tasks using simple language commands.
What's the solution?
To solve this problem, the researchers developed PixWizard, which is trained on a large dataset called the Omni Pixel-to-Pixel Instruction-Tuning Dataset, covering a wide range of image tasks paired with natural-language instruction templates. PixWizard builds on Diffusion Transformers (DiT), extended with an any-resolution mechanism that processes images according to their aspect ratio, and incorporates structure-aware and semantic-aware guidance so the model attends to both the layout and the meaning of the input image. This allows it to perform tasks like text-to-image generation, image restoration, and editing while following user instructions effectively.
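The instruction-tuning dataset pairs each vision task with natural-language instruction templates. A minimal sketch of how such templates might be assembled is below; the task names and template wordings are illustrative assumptions, not the paper's actual dataset contents:

```python
import random

# Hypothetical instruction templates per task, loosely in the spirit of
# the Omni Pixel-to-Pixel Instruction-Tuning Dataset (wording invented).
TEMPLATES = {
    "derain": [
        "Remove the rain from this image.",
        "Clean up the rain streaks in the picture.",
    ],
    "canny_to_image": [
        "Generate a photo matching this edge map: {prompt}",
    ],
    "edit": [
        "{prompt}",
    ],
}

def build_instruction(task, prompt="", seed=None):
    """Sample one phrasing for the task and fill in the user prompt."""
    rng = random.Random(seed)
    template = rng.choice(TEMPLATES[task])
    return template.format(prompt=prompt)
```

Sampling several phrasings per task is one plausible way to make a model robust to open-language instructions rather than a single fixed command per task.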
Why it matters?
This research is significant because it provides a powerful tool for anyone working with images, from artists to designers. By enabling users to give simple language commands for complex tasks, PixWizard makes image editing and generation more accessible and efficient, paving the way for innovative applications in creative industries.
Abstract
This paper presents a versatile image-to-image visual assistant, PixWizard, designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we unify a variety of vision tasks within a single image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we comprehensively include a large set of diverse vision tasks such as text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions. The code and related resources are available at https://github.com/AFeng-x/PixWizard
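One common way a diffusion model can handle arbitrary input resolutions is to snap each image to the nearest predefined aspect-ratio bucket with a roughly fixed pixel budget. The sketch below illustrates that general idea; the bucket list is an invented example, not PixWizard's actual configuration:

```python
# Illustrative aspect-ratio buckets (width, height), each ~1M pixels.
# These values are assumptions for demonstration only.
BUCKETS = [(512, 2048), (768, 1344), (1024, 1024), (1344, 768), (2048, 512)]

def nearest_bucket(width, height):
    """Pick the bucket whose aspect ratio is closest to the input's."""
    ratio = width / height
    return min(BUCKETS, key=lambda wh: abs(wh[0] / wh[1] - ratio))
```

For example, a 1920x1080 input (ratio 1.78) would be routed to the 1344x768 bucket, so the model sees the image close to its native aspect ratio instead of a forced square crop.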