
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models

Tong Wu, Yinghao Xu, Ryan Po, Mengchen Zhang, Guandao Yang, Jiaqi Wang, Ziwei Liu, Dahua Lin, Gordon Wetzstein

2024-12-11


Summary

This paper introduces FiVA, a new dataset and framework that improves how AI generates images from text by letting users control specific visual details like lighting and texture.

What's the problem?

Current AI models for generating images from text often struggle to accurately capture and combine different visual attributes, such as colors, lighting, and textures. This makes it difficult for users, especially those who aren't experts in art, to create images that match their specific preferences or needs.

What's the solution?

The authors created the FiVA dataset, which includes around 1 million high-quality generated images with detailed annotations of their visual attributes. They also developed a system called FiVA-Adapter that lets users mix and match attributes from different source images. This means users can customize generated images by selecting specific features, such as how bright or dark the image should be or what textures to include, leading to more personalized results (see the sketch below).
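To make the dataset side concrete, here is a minimal sketch of what one attribute annotation record might look like. This is an illustrative assumption, not the dataset's actual schema: the field names (image_path, subject, attribute_type, attribute_value) and the example labels are invented for the sketch.

```python
from dataclasses import dataclass

# Hypothetical sketch of a FiVA-style annotation record; the field
# names and labels are illustrative assumptions, not the real schema.
@dataclass
class FiVARecord:
    image_path: str        # generated image the annotation refers to
    subject: str           # image content, e.g. "a lighthouse"
    attribute_type: str    # taxonomy category, e.g. "lighting", "texture"
    attribute_value: str   # fine-grained label within that category

# Two records tagging different attributes of the same image:
records = [
    FiVARecord("img_000001.png", "a lighthouse", "lighting", "golden hour backlight"),
    FiVARecord("img_000001.png", "a lighthouse", "texture", "oil-painting brushstrokes"),
]

# Group records by attribute type, e.g. to sample training pairs that
# share an attribute but differ in subject.
by_type = {}
for r in records:
    by_type.setdefault(r.attribute_type, []).append(r)
print(sorted(by_type))
```

Separating the attribute label from the subject in this way is what lets a single image contribute to several attribute categories at once.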

Why it matters?

This research is important because it enhances the capabilities of AI in creative fields like art and design. By providing a way to control fine details in image generation, FiVA empowers users to create unique visuals that better reflect their ideas and intentions. This can lead to more innovative applications in various industries, including advertising, entertainment, and education.

Abstract

Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications. However, accurately describing desired visual attributes can be challenging, especially for non-experts in art and photography. An intuitive solution involves adopting favorable attributes from source images. Current methods attempt to distill identity and style from source images. However, "style" is a broad concept that includes texture, color, and artistic elements, but does not cover other important attributes such as lighting and dynamics. Additionally, a simplified "style" adaptation prevents combining multiple attributes from different sources into one generated image. In this work, we formulate a more effective approach to decompose the aesthetics of a picture into specific visual attributes, allowing users to apply characteristics such as lighting, texture, and dynamics from different images. To achieve this goal, we constructed, to the best of our knowledge, the first fine-grained visual attribute dataset (FiVA). The FiVA dataset features a well-organized taxonomy of visual attributes and includes around 1 million high-quality generated images with visual attribute annotations. Leveraging this dataset, we propose a fine-grained visual attribute adaptation framework (FiVA-Adapter), which decouples visual attributes from one or more source images and adapts them into a generated image. This approach enhances user-friendly customization, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.
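To make the adapter idea concrete, here is a minimal, hypothetical sketch of how an interface in the spirit of FiVA-Adapter could combine attributes from multiple reference images. The class and method names are assumptions for illustration only, not the paper's actual API; a real implementation would encode each reference image and condition the diffusion sampler on the extracted attribute embeddings.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical interface sketch; names are illustrative assumptions,
# not the paper's actual API.
@dataclass
class AttributeReference:
    image_path: str   # source image to borrow an attribute from
    attribute: str    # which attribute to extract, e.g. "lighting"

class FiVAAdapterSketch:
    """Toy stand-in for an attribute-adaptation module: it would
    encode each reference image, keep only the representation tied
    to the requested attribute, and condition generation on the
    combination of those representations plus the text prompt."""

    def generate(self, prompt: str, refs: List[AttributeReference]) -> str:
        conditions = [f"{r.attribute} from {r.image_path}" for r in refs]
        # A real implementation would run the diffusion sampler here;
        # this sketch just reports what generation would be conditioned on.
        return f"image of '{prompt}' conditioned on: " + "; ".join(conditions)

adapter = FiVAAdapterSketch()
print(adapter.generate(
    "a lighthouse on a cliff",
    [AttributeReference("sunset_photo.png", "lighting"),
     AttributeReference("oil_painting.png", "texture")],
))
```

The key point the sketch captures is that each reference contributes only its selected attribute, so lighting from one image and texture from another can be combined in a single generation.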