Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications. However, accurately describing desired visual attributes can be challenging, especially for non-experts in art and photography. An intuitive solution is to adopt favorable attributes from source images. Current methods attempt to distill identity and style from source images. However, "style" is a broad concept that includes texture, color, and artistic elements, yet does not cover other important attributes such as lighting and dynamics. Moreover, collapsing these factors into a single "style" prevents combining attributes from different sources into one generated image. In this work, we formulate a more effective approach that decomposes the aesthetics of a picture into specific visual attributes, letting users apply characteristics like lighting, texture, and dynamics from different images. To achieve this goal, we construct, to the best of our knowledge, the first fine-grained visual attributes dataset (FiVA). FiVA features a well-organized taxonomy of visual attributes and includes around 1M high-quality generated images with visual attribute annotations. Leveraging this dataset, we propose a fine-grained visual attributes adaptation framework (FiVA-Adapter), which decouples visual attributes from one or more source images and adapts them to a generated image. This approach enables user-friendly customization, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.
Visual attributes encompass a broad spectrum, varying across use cases. To address this, we identified general attribute types to cover diverse applications, categorizing them into groups and refining each into detailed subcategories. Redundant or unreasonable entries were filtered out. Similarly, we developed a taxonomy for subjects. Attributes and their augmentations were then paired with specific subjects to generate prompts using a state-of-the-art image generation model, enabling easy pairing of images with shared attributes. Finally, pair accuracy was validated through human evaluation, incorporating a range-sensitive filter introduced below.
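As an illustration of the pairing step, the following Python sketch builds generation prompts by combining attribute phrases with sampled subjects. The taxonomy excerpts, the prompt template, and the build_prompts helper are simplified assumptions for exposition, not the actual FiVA construction code.

import itertools
import random

# Hypothetical excerpts of the attribute and subject taxonomies; the real FiVA
# taxonomy is much larger and organized into groups with detailed subcategories.
ATTRIBUTES = {
    "lighting": ["soft rim lighting", "harsh backlight", "golden-hour glow"],
    "stroke": ["thick impasto strokes", "loose watercolor strokes"],
    "dynamics": ["frozen mid-air motion", "long-exposure motion blur"],
}
SUBJECTS = {
    "animal": ["a running fox", "a perched owl"],
    "scene": ["a coastal village", "a mountain lake at dawn"],
}

def build_prompts(n_subjects_per_attribute: int = 2, seed: int = 0):
    """Pair each attribute phrase with sampled subjects to form generation prompts.

    Images generated from prompts that share the same attribute phrase can later
    be grouped as (approximate) positive pairs for that attribute.
    """
    rng = random.Random(seed)
    all_subjects = list(itertools.chain.from_iterable(SUBJECTS.values()))
    prompts = []
    for attr_type, phrases in ATTRIBUTES.items():
        for phrase in phrases:
            for subject in rng.sample(all_subjects, n_subjects_per_attribute):
                prompts.append({
                    "attribute_type": attr_type,
                    "attribute": phrase,
                    "subject": subject,
                    "prompt": f"{subject}, {phrase}",  # assumed prompt template
                })
    return prompts

if __name__ == "__main__":
    for record in build_prompts()[:4]:
        print(record["prompt"])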
It is worth noting that data constructed using this method does not guarantee precise physical or pixel-level pairing but only ensures rough consistency. Nevertheless, it enables large-scale data construction and supports most customization needs.
Not all generated images with the same attribute exhibit similar visual effects. For example, attributes like "color" and "stroke" transfer easily across different subjects, while others, such as "lighting" and "dynamics," are range-sensitive, producing varying effects depending on the subject's domain. We use powerful multimodal large language models like GPT-4 to automatically define an attribute's application range, ensuring greater visual consistency between images within that range.
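A minimal sketch of such a range-sensitive filter is shown below, querying GPT-4 through the OpenAI Python SDK. The prompt wording, the reply parsing, and the applicable_subject_domains helper are illustrative assumptions, not the exact query used to build FiVA.

from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def applicable_subject_domains(attribute: str, candidate_domains: list[str]) -> list[str]:
    """Ask GPT-4 which subject domains a visual attribute plausibly applies to.

    Hypothetical helper: the prompt text and the parsing of the reply are
    illustrative assumptions rather than the production filter.
    """
    question = (
        f"The visual attribute is: '{attribute}'. From the following subject domains, "
        f"list only those in which this attribute would produce a consistent, clearly "
        f"recognizable visual effect: {', '.join(candidate_domains)}. "
        "Answer with a comma-separated list of domain names only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
        temperature=0,
    )
    answer = response.choices[0].message.content.lower()
    return [d for d in candidate_domains if d.lower() in answer]

# Example: a motion-related attribute should be restricted to domains with movement.
print(applicable_subject_domains(
    "long-exposure motion blur",
    ["portrait", "sports", "still life", "cityscape at night"],
))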
FiVA-Adapter has two key design components: 1) Attribute-specific visual prompt extractor: a Q-former module takes both the reference image and the attribute instruction as inputs, models the semantic relationship between them, and extracts attribute-specific image condition features. 2) Multi-image dual cross-attention module: the Q-former features (from up to N reference images) are zero-padded, concatenated, and shuffled before being fed into the cross-attention module, making the module insensitive to the order of the attribute conditions.
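The PyTorch sketch below illustrates the second component under simplified assumptions: each reference image is assumed to be already encoded into attribute-specific condition tokens, missing reference slots are zero-padded up to a fixed maximum, the reference order is shuffled, and the concatenated tokens are injected through an image cross-attention branch alongside the text branch. The class name, dimensions, and the additive combination of the two branches are illustrative choices, not the released implementation.

import torch
import torch.nn as nn

class MultiImageDualCrossAttention(nn.Module):
    """Illustrative sketch of the multi-image dual cross-attention step.

    Assumes each reference image has already been encoded by the attribute-aware
    Q-former into `num_query_tokens` condition tokens of width `dim`.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8, max_refs: int = 4):
        super().__init__()
        self.max_refs = max_refs
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden, text_tokens, ref_tokens_list, shuffle=True):
        # hidden:          [B, L, dim]   U-Net hidden states (queries)
        # text_tokens:     [B, T, dim]   text-encoder tokens
        # ref_tokens_list: list of up to `max_refs` tensors, each [B, Q, dim]
        batch, _, dim = hidden.shape
        num_query_tokens = ref_tokens_list[0].shape[1]

        # Zero-pad missing reference slots so there are always `max_refs` entries.
        refs = list(ref_tokens_list)
        while len(refs) < self.max_refs:
            refs.append(torch.zeros(batch, num_query_tokens, dim,
                                    device=hidden.device, dtype=hidden.dtype))

        # Shuffle the reference order (during training) so the module stays
        # insensitive to how the attribute conditions are ordered.
        if shuffle:
            order = torch.randperm(self.max_refs).tolist()
            refs = [refs[i] for i in order]
        image_tokens = torch.cat(refs, dim=1)  # [B, max_refs * Q, dim]

        # Dual cross-attention: a text branch and an image-condition branch,
        # combined additively with a residual connection (an assumed fusion rule).
        text_out, _ = self.text_attn(hidden, text_tokens, text_tokens)
        image_out, _ = self.image_attn(hidden, image_tokens, image_tokens)
        return hidden + text_out + image_out

# Toy usage with two attribute-specific reference conditions.
layer = MultiImageDualCrossAttention()
hidden = torch.randn(1, 64, 768)
text_tokens = torch.randn(1, 77, 768)
ref_tokens = [torch.randn(1, 16, 768), torch.randn(1, 16, 768)]
print(layer(hidden, text_tokens, ref_tokens).shape)  # torch.Size([1, 64, 768])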
Our method achieves superior performance in both visual attribute accuracy and text-aligned subject accuracy.
Our framework can combine different attributes from multiple reference images and apply them to a target subject; it can also extract different visual attributes from a single reference image depending on the specified attribute name.
@inproceedings{wu2024fiva,
title={Fi{VA}: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models},
author={Tong Wu and Yinghao Xu and Ryan Po and Mengchen Zhang and Guandao Yang and Jiaqi Wang and Ziwei Liu and Dahua Lin and Gordon Wetzstein},
booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=Vp6HAjrdIg}
}