The customization of text-to-image models has seen significant advancements, yet generating multiple personalized concepts remains a challenging task. Current methods struggle with attribute leakage and layout confusion when handling multiple concepts, leading to reduced concept fidelity and semantic consistency.
In this work, we introduce a novel training-free framework, Concept Conductor, designed to ensure visual fidelity and correct layout in multi-concept customization. Concept Conductor isolates the sampling processes of multiple custom models to prevent attribute leakage between different concepts and corrects erroneous layouts through self-attention-based spatial guidance. Additionally, we present a concept injection technique that employs shape-aware masks to specify the generation area for each concept. This technique injects the structure and appearance of personalized concepts through feature fusion in the attention layers, ensuring harmony in the final image.
Extensive qualitative and quantitative experiments demonstrate that Concept Conductor can consistently generate composite images with accurate layouts while preserving the visual details of each concept. Compared to existing baselines, Concept Conductor shows significant performance improvements. Our method supports the combination of any number of concepts and maintains high fidelity even when dealing with visually similar concepts.
(a) Attribute leakage denotes the application of one concept’s attributes to another (e.g., a cat acquiring the fur and eyes of a dog).
(b) Concept omission indicates that one or more target concepts do not appear in the image (e.g., the absence of the target cat).
(c) Subject redundancy refers to the appearance of extra subjects similar to the target concept (e.g., an extra cat).
(d) Appearance truncation signifies that the target concept's appearance is confined to only part of the subject (e.g., the upper half of a dog and the lower half of a cat).
Our method comprises three key components: Multipath Sampling, Layout Alignment, and Concept Injection. At each denoising step, the input latent is first corrected by the Layout Alignment module; the corrected latent is then sent to the Concept Injection module for denoising. Both Layout Alignment and Concept Injection are built on the Multipath Sampling structure. After denoising, our method generates images that align with the given text prompt and visual concepts.
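To make the control flow concrete, here is a minimal sketch of the sampling loop in PyTorch. It assumes a diffusers-style scheduler (a timesteps attribute and a step method returning prev_sample); layout_alignment and concept_injection are hypothetical placeholders for the modules described below, not the authors' implementation.

import torch

def sample(x_T, scheduler, layout_alignment, concept_injection):
    # Hypothetical outline of one Concept Conductor sampling run.
    x_t = x_T
    for t in scheduler.timesteps:
        x_t = layout_alignment(x_t, t)           # correct the latent's layout
        noise_pred = concept_injection(x_t, t)   # denoise with fused multipath features
        x_t = scheduler.step(noise_pred, t, x_t).prev_sample
    return x_t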
The custom models are created by adding ED-LoRA to the base model. The base prompt and the edited prompts are sent to the base model and the custom models, respectively. All models receive the same latent input and predict different noises. The self-attention features and the output feature maps of the attention layers of each model are recorded during this process.
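A minimal sketch of one multipath step, assuming diffusers-style UNets (called with encoder_hidden_states and returning .sample) and PyTorch forward hooks for recording features; every name here is illustrative rather than taken from the paper's code.

import torch

def record_attention_outputs(model, layer_names):
    # Attach forward hooks that stash the output of each listed attention
    # layer so it can later be used for layout alignment and feature fusion.
    features, handles = {}, []
    for name, module in model.named_modules():
        if name in layer_names:
            def hook(mod, args, output, name=name):
                features[name] = output
            handles.append(module.register_forward_hook(hook))
    return features, handles

@torch.no_grad()
def multipath_step(models, prompt_embeds, x_t, t, layer_names):
    # Every path (the base model plus one ED-LoRA model per concept)
    # receives the same latent x_t but its own prompt embedding.
    noises, features = [], []
    for model, emb in zip(models, prompt_embeds):
        feats, handles = record_attention_outputs(model, layer_names)
        noise = model(x_t, t, encoder_hidden_states=emb).sample
        for h in handles:
            h.remove()
        noises.append(noise)
        features.append(feats)
    return noises, features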
The self-attention features of the layout reference image are extracted through DDIM inversion and then used to compute a loss against the self-attention features recorded from the base and custom models; this loss updates the input latent. For simplicity, the conversion between pixel space and latent space is omitted.
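A sketch of the latent update under stated assumptions: extract_self_attn is a hypothetical helper that runs a model on the latent and returns its self-attention features at the layers matching the reference, and an MSE loss stands in for the paper's alignment loss, whose exact form may differ.

import torch
import torch.nn.functional as F

def align_layout(x_t, t, models, prompt_embeds, ref_attn,
                 extract_self_attn, step_size=0.1, n_iters=1):
    # Gradient-based correction of the input latent: pull each path's
    # self-attention toward the reference features obtained via DDIM inversion.
    x = x_t.detach().requires_grad_(True)
    for _ in range(n_iters):
        loss = sum(F.mse_loss(extract_self_attn(m, x, t, e), ref_attn)
                   for m, e in zip(models, prompt_embeds))
        grad = torch.autograd.grad(loss, x)[0]
        x = (x - step_size * grad).detach().requires_grad_(True)
    return x.detach()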
Concept Injection consists of two parts. (1) Feature Fusion: the output feature maps of the attention layers from the different models are multiplied by their corresponding masks and summed to obtain a fused feature map, which replaces the base model's original feature map. (2) Mask Refinement: segmentation maps are obtained by clustering the self-attention features, and the masks required for feature fusion are extracted from these maps.
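The two parts can be sketched as follows; the tensor shapes and the plain k-means clustering are illustrative assumptions, not the paper's exact procedure.

import torch

def fuse_features(feature_maps, masks):
    # feature_maps: one (B, N, C) attention-layer output per path.
    # masks: one (B, N, 1) binary mask per path, assigning each spatial
    # token to a single concept (or to the base/background path).
    # The result replaces the base model's original feature map.
    fused = torch.zeros_like(feature_maps[0])
    for feats, mask in zip(feature_maps, masks):
        fused = fused + feats * mask
    return fused

def masks_from_self_attention(self_attn, n_clusters, n_iters=10):
    # Cluster spatial tokens by their self-attention rows (B, N, D) to get
    # a rough segmentation, then turn cluster ids into binary masks.
    B, N, _ = self_attn.shape
    init = torch.randperm(N, device=self_attn.device)[:n_clusters]
    centers = self_attn[:, init, :].clone()
    for _ in range(n_iters):
        assign = torch.cdist(self_attn, centers).argmin(dim=-1)   # (B, N)
        for k in range(n_clusters):
            sel = (assign == k).unsqueeze(-1)                     # (B, N, 1)
            denom = sel.sum(dim=1).clamp(min=1)                   # (B, 1)
            centers[:, k] = (self_attn * sel).sum(dim=1) / denom
    return [(assign == k).unsqueeze(-1).float() for k in range(n_clusters)]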
Qualitative comparison on two concepts
Qualitative comparison on more than two concepts
More qualitative comparisons on multi-concept customization
Collage-to-Image Generation. Concept Conductor can also utilize a user-created collage as a layout reference and generate images following the given layout.
Object Placement. Concept Conductor can also replace objects in a given scene or add new objects to it.
@article{yao2024concept,
title={Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis},
author={Yao, Zebin and Feng, Fangxiang and Li, Ruifan and Wang, Xiaojie},
journal={arXiv preprint arXiv:2408.03632},
year={2024}
}