Concept Conductor:
Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis

AAAI 2025

Zebin Yao,  Fangxiang Feng,  Ruifan Li,  Xiaojie Wang
Beijing University of Posts and Telecommunications

We propose Concept Conductor, a novel inference framework for multi-concept customization.

Abstract

The customization of text-to-image models has seen significant advancements, yet generating multiple personalized concepts remains a challenging task. Current methods struggle with attribute leakage and layout confusion when handling multiple concepts, leading to reduced concept fidelity and semantic consistency.

In this work, we introduce a novel training-free framework, Concept Conductor, designed to ensure visual fidelity and correct layout in multi-concept customization. Concept Conductor isolates the sampling processes of multiple custom models to prevent attribute leakage between different concepts and corrects erroneous layouts through self-attention-based spatial guidance. Additionally, we present a concept injection technique that employs shape-aware masks to specify the generation area for each concept. This technique injects the structure and appearance of personalized concepts through feature fusion in the attention layers, ensuring harmony in the final image.

Extensive qualitative and quantitative experiments demonstrate that Concept Conductor can consistently generate composite images with accurate layouts while preserving the visual details of each concept. Compared to existing baselines, Concept Conductor shows significant performance improvements. Our method supports the combination of any number of concepts and maintains high fidelity even when dealing with visually similar concepts.

Attribute Leakage and Layout Confusion


(a) Attribute leakage denotes the application of one concept’s attributes to another (e.g., a cat acquiring the fur and eyes of a dog).

(b) Concept omission indicates one or more target concepts not appearing in the image (e.g., the absence of the target cat).

(c) Subject redundancy refers to the appearance of extra subjects similar to the target concept (e.g., an extra cat).

(d) Appearance truncation means that the target concept's appearance is observed in only part of the subject (e.g., the upper half of a dog combined with the lower half of a cat).


Overview of Concept Conductor


Our method comprises three key components: Multipath Sampling, Layout Alignment, and Concept Injection. At each denoising step, the input latent vector z_t is first corrected to z_t' by the Layout Alignment module. z_t' is then sent to the Concept Injection module for denoising. Both Layout Alignment and Concept Injection utilize the Multipath Sampling structure. After denoising, our method generates images that align with the given text prompt and visual concepts.
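The per-step ordering described above can be sketched as a toy sampling loop. This is an illustrative stand-in only, not the actual implementation: layout_align and denoise_with_injection are hypothetical placeholders for the real modules, and the "latent" is just a list of floats.

```python
# Toy sketch of one denoising trajectory in the assumed pipeline.
# layout_align and denoise_with_injection are hypothetical stand-ins
# for the Layout Alignment and Concept Injection modules.

def layout_align(z_t):
    """Stand-in for the z_t -> z_t' correction via self-attention guidance."""
    return [v * 0.9 for v in z_t]

def denoise_with_injection(z_t_prime):
    """Stand-in for one denoising step with concept injection applied."""
    return [v - 0.1 for v in z_t_prime]

def sample(z_T, num_steps=3):
    z = z_T
    for _ in range(num_steps):               # t = T, ..., 1
        z_prime = layout_align(z)            # first: correct the layout
        z = denoise_with_injection(z_prime)  # then: denoise with injection
    return z

z_0 = sample([1.0, 2.0])
```

The point of the sketch is only the ordering: alignment happens on the latent before each denoising step, and both stages reuse the same multipath structure described next.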


Multipath Sampling


Custom models ε_θ^V1 and ε_θ^V2 are created by adding ED-LoRA to the base model ε_θ^base. The base prompt and the edited prompts are sent to the base model and the custom models, respectively. Different models receive the same latent input z_t and predict different noises. The self-attention features F_t^base, F_t^V1, F_t^V2 and the output feature maps of the attention layers h_t^base, h_t^V1, h_t^V2 are recorded during this process.
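The bookkeeping of multipath sampling can be sketched with toy stand-ins: each "model" below is a simple function simulating a denoiser, but the structure — every path sees the same latent z_t and records its own noise prediction, self-attention features F, and attention outputs h — mirrors the description above. All names and prompts are illustrative.

```python
# Toy sketch of multipath sampling. make_model produces a stand-in
# denoiser; real paths would be the base model and its ED-LoRA variants.

def make_model(scale):
    """Toy denoiser returning (noise prediction, self-attn feature F_t, attn output h_t)."""
    def model(z_t, prompt):
        noise = [scale * v for v in z_t]           # stand-in for eps_theta(z_t, prompt)
        feature = [v + scale for v in z_t]         # stand-in for F_t
        attn_out = [0.5 * v + scale for v in z_t]  # stand-in for h_t
        return noise, feature, attn_out
    return model

paths = {
    "base": (make_model(1.0), "a cat and a dog"),        # base prompt
    "V1":   (make_model(1.1), "a <V1> cat and a dog"),   # edited prompt, concept V1
    "V2":   (make_model(1.2), "a cat and a <V2> dog"),   # edited prompt, concept V2
}

def multipath_step(z_t):
    """Run every path on the SAME latent z_t and record its features."""
    records = {}
    for name, (model, prompt) in paths.items():
        noise, F_t, h_t = model(z_t, prompt)
        records[name] = {"noise": noise, "F": F_t, "h": h_t}
    return records

records = multipath_step([0.0, 1.0, 2.0])
```

Keeping the sampling paths separate is what prevents attribute leakage: each custom model only ever sees its own edited prompt, and cross-path interaction happens solely through the recorded F and h tensors.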


Layout Alignment


The self-attention features F_t^ref of the layout reference image are extracted through DDIM inversion and then used to compute a loss against F_t^base, F_t^V1, and F_t^V2, which updates the input latent vector z_t. For simplicity, the conversion between pixel space and latent space is omitted.
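The update can be illustrated with a toy gradient step. This is an assumed form (one squared-error feature-matching loss and one gradient-descent step on z_t, shown for a single path); the feature extractor here is a made-up linear function, not a real attention layer.

```python
# Toy sketch of the layout-alignment update: pull the model's
# self-attention features toward the reference features F_ref.

def extract_features(z_t):
    """Stand-in for the self-attention features F_t of one sampling path."""
    return [2.0 * v for v in z_t]  # toy linear "attention"

def layout_alignment_step(z_t, F_ref, lr=0.1):
    """One gradient step on z_t for L = sum_i (F_i - F_ref_i)^2."""
    F = extract_features(z_t)
    # dL/dz_i = 2 * (F_i - F_ref_i) * dF_i/dz_i = 4 * (F_i - F_ref_i) here
    grad = [4.0 * (f - r) for f, r in zip(F, F_ref)]
    return [z - lr * g for z, g in zip(z_t, grad)]

z_t = [1.0, -1.0]
F_ref = [0.0, 0.0]  # features inverted from the layout reference (via DDIM inversion)
z_t_prime = layout_alignment_step(z_t, F_ref)
```

In the actual method the loss aggregates terms over F_t^base, F_t^V1, and F_t^V2, and backpropagation through the denoising network supplies the gradient; the single analytic gradient above is only for illustration.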


Concept Injection


Concept Injection consists of two parts: (1) Feature Fusion: the output feature maps of the attention layers from the different models are multiplied by their corresponding masks and summed to obtain the fused feature map h_t, which replaces the original feature map h_t^base. (2) Mask Refinement: segmentation maps are obtained by clustering the self-attention features, and the masks required for feature fusion are extracted from these maps.
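The feature-fusion step can be sketched directly from its definition: h_t = Σ_k mask_k · h_t^k. The toy example below uses flat lists in place of spatial feature maps and hand-written binary masks in place of the refined masks obtained from self-attention clustering, which is omitted.

```python
# Sketch of feature fusion: per-model attention outputs are masked
# and summed; the result replaces h_t^base in the base model.

def fuse_features(h_maps, masks):
    """Compute h = sum_k mask_k * h_k elementwise over all paths."""
    n = len(next(iter(h_maps.values())))
    fused = [0.0] * n
    for name, h in h_maps.items():
        m = masks[name]
        fused = [f + mi * hi for f, mi, hi in zip(fused, m, h)]
    return fused

# Toy attention outputs for each path (flattened "feature maps").
h_maps = {
    "base": [1.0, 1.0, 1.0, 1.0],
    "V1":   [5.0, 5.0, 5.0, 5.0],
    "V2":   [9.0, 9.0, 9.0, 9.0],
}
# Hand-written binary masks: V1 owns positions 0-1, V2 owns position 2,
# and the background (position 3) keeps the base model's features.
masks = {
    "base": [0, 0, 0, 1],
    "V1":   [1, 1, 0, 0],
    "V2":   [0, 0, 1, 0],
}
h_fused = fuse_features(h_maps, masks)
```

Because the masks partition the spatial locations, each concept's appearance is written only into its own region while the base model still governs the background, which is what keeps the composite image harmonious.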


Qualitative Comparison


Qualitative comparison on two concepts



Qualitative comparison on more than two concepts



More qualitative comparisons on multi-concept customization


Further Applications


Collage-to-Image Generation. Concept Conductor can also utilize a user-created collage as a layout reference and generate images following the given layout.



Object Placement. Concept Conductor can also replace objects in a given scene or add new objects to it.


BibTeX


@article{yao2024concept,
  title={Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis},
  author={Yao, Zebin and Feng, Fangxiang and Li, Ruifan and Wang, Xiaojie},
  journal={arXiv preprint arXiv:2408.03632},
  year={2024}
}