The customization of text-to-image models has seen significant advancements, yet generating multiple personalized concepts remains a challenging task. Current methods struggle with attribute leakage and layout confusion when handling multiple concepts, leading to reduced concept fidelity and semantic consistency.
In this work, we introduce a novel training-free framework, Concept Conductor, designed to ensure visual fidelity and correct layout in multi-concept customization. Concept Conductor isolates the sampling processes of multiple custom models to prevent attribute leakage between different concepts and corrects erroneous layouts through self-attention-based spatial guidance. Additionally, we present a concept injection technique that employs shape-aware masks to specify the generation area for each concept. This technique injects the structure and appearance of personalized concepts through feature fusion in the attention layers, ensuring harmony in the final image.
Extensive qualitative and quantitative experiments demonstrate that Concept Conductor can consistently generate composite images with accurate layouts while preserving the visual details of each concept. Compared to existing baselines, Concept Conductor shows significant performance improvements. Our method supports the combination of any number of concepts and maintains high fidelity even when dealing with visually similar concepts.
(a) Attribute leakage denotes the application of one concept’s attributes to another (e.g., a cat acquiring the fur and eyes of a dog).
(b) Concept omission indicates that one or more target concepts do not appear in the image (e.g., the absence of the target cat).
(c) Subject redundancy refers to the appearance of extra subjects similar to the target concept (e.g., an extra cat).
(d) Appearance truncation signifies that the target concept's appearance is confined to only part of the subject (e.g., the upper half of a dog and the lower half of a cat).
Our method comprises three key components: Multipath Sampling, Layout Alignment, and Concept Injection. At each denoising step, the input latent is first corrected by the Layout Alignment module; the corrected latent is then sent to the Concept Injection module for denoising. Both Layout Alignment and Concept Injection are built on the Multipath Sampling structure. After denoising, our method generates images that align with the given text prompt and visual concepts.
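To make the control flow concrete, here is a minimal sketch of the sampling loop in PyTorch. It assumes a diffusers-style scheduler (a timesteps attribute and a step method returning prev_sample); layout_alignment and concept_injection are hypothetical placeholders for the modules described below, not the authors' implementation.

import torch

def sample(x_T, scheduler, layout_alignment, concept_injection):
    # Hypothetical outline of one Concept Conductor sampling run.
    x_t = x_T
    for t in scheduler.timesteps:
        x_t = layout_alignment(x_t, t)           # correct the latent's layout
        noise_pred = concept_injection(x_t, t)   # denoise with fused multipath features
        x_t = scheduler.step(noise_pred, t, x_t).prev_sample
    return x_t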
The custom models are created by adding ED-LoRA to the base model. The base prompt and the edited prompts are sent to the base model and the custom models, respectively. All models receive the same latent input and predict different noises. The self-attention features and the output feature maps of the attention layers of each model are recorded during this process.
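A minimal sketch of one multipath step, assuming diffusers-style UNets (called with encoder_hidden_states and returning .sample) and PyTorch forward hooks for recording features; every name here is illustrative rather than taken from the paper's code.

import torch

def record_attention_outputs(model, layer_names):
    # Attach forward hooks that stash the output of each listed attention
    # layer so it can later be used for layout alignment and feature fusion.
    features, handles = {}, []
    for name, module in model.named_modules():
        if name in layer_names:
            def hook(mod, args, output, name=name):
                features[name] = output
            handles.append(module.register_forward_hook(hook))
    return features, handles

@torch.no_grad()
def multipath_step(models, prompt_embeds, x_t, t, layer_names):
    # Every path (the base model plus one ED-LoRA model per concept)
    # receives the same latent x_t but its own prompt embedding.
    noises, features = [], []
    for model, emb in zip(models, prompt_embeds):
        feats, handles = record_attention_outputs(model, layer_names)
        noise = model(x_t, t, encoder_hidden_states=emb).sample
        for h in handles:
            h.remove()
        noises.append(noise)
        features.append(feats)
    return noises, features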
The self-attention features of the layout reference image are extracted through DDIM inversion and then used to compute a loss against the self-attention features recorded from the base and custom models; this loss updates the input latent. For simplicity, the conversion between pixel space and latent space is omitted.
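A sketch of the latent update under stated assumptions: extract_self_attn is a hypothetical helper that runs a model on the latent and returns its self-attention features at the layers matching the reference, and an MSE loss stands in for the paper's alignment loss, whose exact form may differ.

import torch
import torch.nn.functional as F

def align_layout(x_t, t, models, prompt_embeds, ref_attn,
                 extract_self_attn, step_size=0.1, n_iters=1):
    # Gradient-based correction of the input latent: pull each path's
    # self-attention toward the reference features obtained via DDIM inversion.
    x = x_t.detach().requires_grad_(True)
    for _ in range(n_iters):
        loss = sum(F.mse_loss(extract_self_attn(m, x, t, e), ref_attn)
                   for m, e in zip(models, prompt_embeds))
        grad = torch.autograd.grad(loss, x)[0]
        x = (x - step_size * grad).detach().requires_grad_(True)
    return x.detach()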
Concept Injection consists of two parts. (1) Feature Fusion: the output feature maps of the attention layers from the different models are multiplied by their corresponding masks and summed to obtain a fused feature map, which replaces the base model's original feature map. (2) Mask Refinement: segmentation maps are obtained by clustering the self-attention features, and the masks required for feature fusion are extracted from these maps.
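The two parts can be sketched as follows; the tensor shapes and the plain k-means clustering are illustrative assumptions, not the paper's exact procedure.

import torch

def fuse_features(feature_maps, masks):
    # feature_maps: one (B, N, C) attention-layer output per path.
    # masks: one (B, N, 1) binary mask per path, assigning each spatial
    # token to a single concept (or to the base/background path).
    # The result replaces the base model's original feature map.
    fused = torch.zeros_like(feature_maps[0])
    for feats, mask in zip(feature_maps, masks):
        fused = fused + feats * mask
    return fused

def masks_from_self_attention(self_attn, n_clusters, n_iters=10):
    # Cluster spatial tokens by their self-attention rows (B, N, D) to get
    # a rough segmentation, then turn cluster ids into binary masks.
    B, N, _ = self_attn.shape
    init = torch.randperm(N, device=self_attn.device)[:n_clusters]
    centers = self_attn[:, init, :].clone()
    for _ in range(n_iters):
        assign = torch.cdist(self_attn, centers).argmin(dim=-1)   # (B, N)
        for k in range(n_clusters):
            sel = (assign == k).unsqueeze(-1)                     # (B, N, 1)
            denom = sel.sum(dim=1).clamp(min=1)                   # (B, 1)
            centers[:, k] = (self_attn * sel).sum(dim=1) / denom
    return [(assign == k).unsqueeze(-1).float() for k in range(n_clusters)]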
Qualitative comparison on two concepts
Qualitative comparison on more than two concepts
More qualitative comparisons on multi-concept customization
Collage-to-Image Generation. Concept Conductor can also utilize a user-created collage as a layout reference and generate images following the given layout.
Object Placement. Concept Conductor can also replace objects in a given scene or add new objects to it.
@article{yao2024concept,
title={Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis},
author={Yao, Zebin and Feng, Fangxiang and Li, Ruifan and Wang, Xiaojie},
journal={arXiv preprint arXiv:2408.03632},
year={2024}
}