IP-Composer: Semantic Composition of Visual Concepts (2502.13951v1)

Published 19 Feb 2025 in cs.CV and cs.GR

Abstract: Content creators often draw inspiration from multiple visual sources, combining distinct elements to craft new compositions. Modern computational approaches now aim to emulate this fundamental creative process. Although recent diffusion models excel at text-guided compositional synthesis, text as a medium often lacks precise control over visual details. Image-based composition approaches can capture more nuanced features, but existing methods are typically limited in the range of concepts they can capture, and require expensive training procedures or specialized data. We present IP-Composer, a novel training-free approach for compositional image generation that leverages multiple image references simultaneously, while using natural language to describe the concept to be extracted from each image. Our method builds on IP-Adapter, which synthesizes novel images conditioned on an input image's CLIP embedding. We extend this approach to multiple visual inputs by crafting composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces identified through text. Through comprehensive evaluation, we show that our approach enables more precise control over a larger range of visual concept compositions.

Summary

  • The paper presents a training-free method that leverages a pre-trained text-to-image diffusion model with an image condition input to composite visual concepts.
  • It uses CLIP-based subspace identification and SVD to construct projection matrices, ensuring precise control and reducing unwanted feature leakage.
  • The approach demonstrates robust performance in tasks like pattern transfer and subject insertion, validated by both quantitative metrics and user studies.

IP-Composer addresses the challenge of creating novel images by combining visual concepts from multiple source images. While text-guided diffusion models allow for compositional synthesis through language, they often lack precise control over fine visual details. Existing image-based methods can capture nuances but typically require expensive training or specialized datasets for each new concept, limiting their scalability and practicality.

The core idea behind IP-Composer is a training-free approach that leverages a pre-trained text-to-image diffusion model augmented with an image condition input, specifically building upon IP-Adapter [ye2023ipadapter]. The method relies on the observation that CLIP's embedding space contains semantic subspaces tied to different visual concepts. IP-Composer aims to identify these concept-specific subspaces and then create a composite embedding by selectively taking projections from different source images onto their respective concept subspaces.
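
Concretely, any projection matrix $P_c$ onto a concept subspace splits a CLIP embedding into a concept component and a residual,

$$\mathbf{e} = P_c \mathbf{e} + (I - P_c)\,\mathbf{e},$$

so replacing $P_c \mathbf{e}_{\text{ref}}$ with the corresponding component taken from a concept image swaps only the concept-related information while leaving the rest of the reference embedding untouched.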

The process involves several steps:

  1. Concept Subspace Identification: For each desired concept (e.g., "outfit," "pattern," "age"), a set of texts describing variations of that concept is generated. This is practically achieved by prompting an LLM to produce a diverse list of descriptions (e.g., 150 to 500 prompts depending on concept variability).
  2. Subspace Projection Matrix Construction: The generated texts are encoded with the CLIP text encoder $\text{CLIP}_t$. The resulting text embeddings are arranged into a matrix $E$, and Singular Value Decomposition (SVD) is applied: $E = U\Sigma V^T$. The top $r$ right singular vectors from $V$ are selected, forming $V_r$; these vectors span the estimated concept subspace. The projection matrix $P_c$ for concept $c$ is computed as $P_c = V_r^T V_r$. The rank $r$ is chosen empirically, with defaults such as 30 for concepts like outfit replacement or 120 for more varied concepts like patterns, though it can be tuned for specific tasks.
  3. Composite Embedding Creation: Given a reference image $I_{\text{ref}}$ (typically providing the base scene or subject) and one or more concept images $I_{c_k}$ (each providing a specific instance of concept $c_k$), their CLIP image embeddings $\mathbf{e}_{\text{ref}}$ and $\mathbf{e}_{c_k}$ are obtained. A composite embedding $\mathbf{e}_{\text{comp}}$ is constructed by starting with the reference embedding, subtracting its projections onto the concept subspaces, and adding the corresponding projections from the concept images:

    $$\mathbf{e}_{\text{comp}} = \mathbf{e}_{\text{ref}} - \sum_{k=1}^K P_{c_k} \mathbf{e}_{\text{ref}} + \sum_{k=1}^K P_{c_k} \mathbf{e}_{c_k}$$

    For multiple concepts, the projections are added sequentially without subtracting cross-concept projections.

  4. Image Generation: The composite embedding $\mathbf{e}_{\text{comp}}$ is then used as the image condition input for a pre-trained IP-Adapter model (such as one based on SDXL and OpenCLIP-ViT-H-14), along with an optional text prompt, to generate the final composed image. A minimal code sketch of steps 1-3 follows this list.
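
The core computation reduces to a few linear-algebra operations on CLIP embeddings. The sketch below illustrates steps 1-3 under stated assumptions: it uses Hugging Face transformers CLIP encoders with a placeholder checkpoint (the paper pairs IP-Adapter with OpenCLIP-ViT-H-14), and `concept_texts` stands in for the LLM-generated concept descriptions. The helper names are ours; this is a minimal sketch, not the authors' implementation.

```python
import torch
from transformers import (
    CLIPTokenizer, CLIPTextModelWithProjection,
    CLIPImageProcessor, CLIPVisionModelWithProjection,
)

# Assumption: any CLIP checkpoint whose image embeddings match the IP-Adapter
# you intend to use; the paper uses OpenCLIP-ViT-H-14 with an SDXL IP-Adapter.
CLIP_NAME = "openai/clip-vit-large-patch14"

tokenizer = CLIPTokenizer.from_pretrained(CLIP_NAME)
text_encoder = CLIPTextModelWithProjection.from_pretrained(CLIP_NAME)
image_processor = CLIPImageProcessor.from_pretrained(CLIP_NAME)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(CLIP_NAME)


@torch.no_grad()
def concept_projection(concept_texts, rank=30):
    """Steps 1-2: embed concept descriptions and build P_c = V_r^T V_r."""
    tokens = tokenizer(concept_texts, padding=True, truncation=True, return_tensors="pt")
    E = text_encoder(**tokens).text_embeds               # (N, d) text embeddings
    _, _, Vh = torch.linalg.svd(E, full_matrices=False)  # rows of Vh are right singular vectors
    V_r = Vh[:rank]                                       # top-r directions, (r, d)
    return V_r.T @ V_r                                    # (d, d) projection onto the concept subspace


@torch.no_grad()
def clip_image_embed(pil_image):
    inputs = image_processor(images=pil_image, return_tensors="pt")
    return image_encoder(**inputs).image_embeds.squeeze(0)   # (d,)


@torch.no_grad()
def composite_embedding(ref_image, concept_images, projections):
    """Step 3: swap the reference's concept components for those of the concept images."""
    e = clip_image_embed(ref_image)
    for img, P in zip(concept_images, projections):
        e = e - P @ e + P @ clip_image_embed(img)
    return e
```

For step 4, the composite embedding is then supplied as the IP-Adapter's image condition; recent diffusers releases expose an `ip_adapter_image_embeds` argument on IP-Adapter-enabled pipelines for this purpose, with adapter-specific tensor shapes.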

IP-Composer demonstrates practical applicability across a variety of compositional tasks, including transferring patterns, modifying outfits, changing age or emotion on a face, altering lighting, inserting objects (like vehicles or dogs) into scenes, and transferring materials or fur textures. The method allows for composing concepts from multiple (more than two) images simultaneously, though performance can be affected by the dimensionality of the embedding space and the complexity of the concepts involved. The approach is training-free for new concepts, requiring only the generation of descriptive texts and computation of projection matrices.

The paper provides both qualitative and quantitative evaluations. Qualitatively, IP-Composer is shown to produce results that successfully combine elements from different sources, often outperforming baselines like pOps [richardson2024popsphotoinspireddiffusionoperators], ProSpect [zhang2023prospectpromptspectrumattributeaware], and a simple "Describe and Compose" method. Compared to training-based methods like pOps, IP-Composer achieves comparable quality on specific tasks (like subject insertion) without requiring large, task-specific datasets or model tuning. Compared to optimization-based methods like ProSpect or text-based methods, IP-Composer offers better control and less unwanted feature leakage. Quantitative analysis, using CLIP-space distance metrics to measure concept similarity and leakage, supports these findings. A user study confirms that results from IP-Composer are significantly preferred by users compared to baseline methods.
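
The summary only names the metric family, so the snippet below is an assumed instantiation of a CLIP-space concept-similarity/leakage measure, reusing `clip_image_embed` and a concept projection `P_c` from the earlier sketch; the paper's exact protocol may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def concept_scores(output_image, concept_image, ref_image, P_c):
    """Assumed metric: cosine similarity inside vs. outside the concept subspace."""
    e_out, e_cpt, e_ref = (clip_image_embed(x) for x in (output_image, concept_image, ref_image))
    # Concept transfer: does the output match the concept image within the concept subspace?
    transfer = F.cosine_similarity(P_c @ e_out, P_c @ e_cpt, dim=0)
    # Preservation (low leakage): does the output stay close to the reference outside that subspace?
    I = torch.eye(P_c.shape[0])
    preservation = F.cosine_similarity((I - P_c) @ e_out, (I - P_c) @ e_ref, dim=0)
    return transfer.item(), preservation.item()
```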

An ablation study explores alternative methods for combining IP-Adapter embeddings, such as concatenation and interpolation, as well as using images instead of text to span concept subspaces. These ablations show that IP-Composer's subspace projection method is superior in terms of reducing leakage and providing specific concept control. The paper also considers a multi-step generation process for composing many concepts, where images are generated incrementally, which can sometimes reduce leakage but may also lose fine details.
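
For reference, straightforward readings of the two embedding-combination baselines can be sketched as below (again reusing the earlier helpers); these are assumed interpretations of "interpolation" and "concatenation", not the paper's exact implementations. Neither selects a concept subspace, which is why both tend to leak unintended attributes.

```python
import torch

@torch.no_grad()
def interpolate_embeddings(ref_image, concept_image, alpha=0.5):
    """Ablation baseline: linear interpolation of the two CLIP image embeddings."""
    return (1 - alpha) * clip_image_embed(ref_image) + alpha * clip_image_embed(concept_image)

@torch.no_grad()
def concatenate_embeddings(ref_image, concept_image):
    """Ablation baseline: stack both embeddings so the adapter attends to both, with no concept selection."""
    return torch.stack([clip_image_embed(ref_image), clip_image_embed(concept_image)], dim=0)
```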

Implementation considerations include the choice of the base diffusion model, the IP-Adapter backbone, the LLM for generating concept descriptions, and the empirical selection of the SVD rank $r$. Computational requirements involve encoding images and texts with CLIP, performing SVD, and running the diffusion process with the composite embedding. This is typically less computationally intensive than per-concept training or lengthy per-image optimization.

Limitations discussed include unexpected concept entanglement in CLIP/diffusion spaces, which can lead to unintended feature combinations (e.g., combining zebra body and leopard pattern resulting in giraffe-like features). Another limitation is that some concepts intuitively thought to be entangled (like outfit shape and color) may be more disentangled in CLIP, requiring more specific text prompts to capture all desired attributes. Finally, the method inherits limitations from the underlying IP-Adapter and diffusion model, such as difficulty in preserving exact identity or very fine-grained details.
