
SG-Adapter: Scene Graph Integration

Updated 21 February 2026
  • SG-Adapter is a transformer-based module that refines text embeddings by leveraging scene graphs to enforce correct relational bindings in image synthesis.
  • It employs token-to-triplet parsing, scene graph embedding, and cross-attention mechanisms to integrate explicit structural constraints from captions.
  • Empirical results show large gains in multi-relational metrics (e.g., SG-IoU) over baselines while preserving image quality (FID), ensuring images accurately reflect complex textual descriptions.

The Scene Graph Adapter (SG-Adapter) is a transformer-based intermediary module designed to refine text embeddings in text-to-image diffusion models by leveraging the explicit relational structure of scene graphs. Introduced as a plug-in inserted after the frozen CLIP text encoder in pipelines such as Stable Diffusion, its primary purpose is to inject scene-level structural constraints and correct "relation leakage" that arises from causal attention in standard sequential encoders, especially in scenarios involving multiple objects and relationships (Shen et al., 2024).

1. Motivation and Problem Definition

In text-to-image diffusion pipelines, semantic misalignment often occurs when complex sentences with multiple subject–relation–object triplets are encoded as sequences. This sequential (autoregressive) attention causes relations or attributes (e.g., "holding a cake") to flow onto adjacent, unrelated entities (e.g., a nearby "woman"), resulting in the generation of images that mis-bind objects and relations. Scene graphs, which represent structured sets of predicate triplets, provide a non-fully connected, explicit relational context, making them natural, fine-grained control signals to enforce correct relational bindings during image synthesis. The SG-Adapter is designed to correct these failures without retraining the large backbone or sacrificing generation fidelity.

2. Architecture and Mechanism

The SG-Adapter operates in four principal stages outlined below:

2.1 Token-to-Triplet Parsing

Given a caption containing $N$ tokens, $K$ semantic triplets

$$\mathcal{T}_k = \langle s_k, r_k, o_k \rangle$$

are extracted using an NLP parser or GPT-4. Each token index $i$ is mapped to its corresponding triplet via a deterministic map $\tau(i) \in \{1, \dots, K\}$.
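As a concrete illustration, the token-to-triplet map $\tau$ for a two-triplet caption might look as follows. The tokenization, triplet boundaries, and the handling of connective words like "and" are hypothetical; the paper obtains the parse from an NLP parser or GPT-4.

```python
# Hypothetical parse of "a man holds a cake and a woman holds an apple".
caption_tokens = ["a", "man", "holds", "a", "cake", "and",
                  "a", "woman", "holds", "an", "apple"]

# K = 2 triplets <s_k, r_k, o_k> extracted from the caption.
triplets = [("man", "holds", "cake"), ("woman", "holds", "apple")]

# tau: token index -> triplet index (0-based here; 1..K in the paper).
# The connective "and" is assigned to the preceding triplet for simplicity.
tau = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0,
       6: 1, 7: 1, 8: 1, 9: 1, 10: 1}

# Every token index must be covered by the deterministic map.
assert set(tau) == set(range(len(caption_tokens)))
```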

2.2 Scene Graph Embedding

For each triplet $\mathcal{T}_k$, the frozen CLIP text encoder $E_T$ is applied individually to the subject, relation, and object. The resulting embeddings are concatenated and projected back to the original CLIP dimension $D$:

$$\mathbf{e}_k = l\bigl(\mathrm{concat}(E_T(s_k), E_T(r_k), E_T(o_k))\bigr), \quad \mathbf{e} \in \mathbb{R}^{K \times D}$$
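A minimal numpy sketch of this step, with the frozen encoder $E_T$ mocked by a deterministic pseudo-embedding and the learned projection $l$ initialized randomly (in the paper, $E_T$ is the frozen CLIP text encoder and $l$ is trained):

```python
import numpy as np

D = 8  # CLIP embedding dimension (e.g. 1024 in practice); small here for clarity
rng = np.random.default_rng(0)

def E_T(phrase: str) -> np.ndarray:
    """Mock frozen text encoder: deterministic pseudo-embedding per phrase."""
    seed = sum(ord(ch) for ch in phrase)
    return np.random.default_rng(seed).standard_normal(D)

# Learnable projection l: concat(3D) -> D, randomly initialized here.
W_l = rng.standard_normal((3 * D, D)) / np.sqrt(3 * D)

triplets = [("man", "holds", "cake"), ("woman", "holds", "apple")]
e = np.stack([
    np.concatenate([E_T(s), E_T(r), E_T(o)]) @ W_l
    for (s, r, o) in triplets
])
assert e.shape == (len(triplets), D)  # e in R^{K x D}
```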

2.3 Cross-Attention Adapter

Let $\mathbf{w} = E_T(c) \in \mathbb{R}^{N \times D}$ denote the sequence of original token embeddings. The adapter $f$ is a shallow transformer containing a single cross-attention layer. Queries $\mathbf{Q}$, keys $\mathbf{K}$, and values $\mathbf{V}$ are produced by linear projections:

$$\mathbf{Q} = l_Q(\mathbf{w}), \quad \mathbf{K} = l_K(\mathbf{e}), \quad \mathbf{V} = l_V(\mathbf{e})$$

with attention computed as

$$\mathbf{w}' = f(\mathbf{w}, \mathbf{e}, \mathbf{M}^{\mathrm{sg}}) = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}, \mathbf{M}^{\mathrm{sg}})$$

where

$$\mathrm{Attention}(Q, K, V, M) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$
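The masked attention above can be written directly in numpy. This is a sketch of the single attention computation only; the real adapter wraps it in a transformer layer with learned multi-head projections.

```python
import numpy as np

def attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d_k) + M) V, with an additive mask M."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M                   # (N, K_triplets)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, K_t, d = 5, 2, 8          # N tokens, K_t triplets, head dimension d
Q = rng.standard_normal((N, d))
Kmat = rng.standard_normal((K_t, d))
V = rng.standard_normal((K_t, d))
M = np.zeros((N, K_t))       # all-zero mask: unrestricted attention
out = attention(Q, Kmat, V, M)
assert out.shape == (N, d)
```

With $-\infty$ entries in the mask, the corresponding attention weights become exactly zero after the softmax, which is what enforces the triplet binding.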

2.4 Token–Triplet Attention Mask

Attention is restricted so that each token attends only to its own triplet:

$$M^{\mathrm{sg}}_{i,k} = \begin{cases} 0, & \tau(i) = k \\ -\infty, & \text{otherwise} \end{cases}$$

Replacing the intrinsic causal mask in CLIP itself with this "intra-triplet" mask was found to degrade image quality, so the mask is used only in the adapter, not in the base CLIP encoder.
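Constructing $M^{\mathrm{sg}}$ from the token-to-triplet map is a one-liner in numpy (the token count and $\tau$ values below are illustrative):

```python
import numpy as np

tau = np.array([0, 0, 0, 1, 1])  # hypothetical token -> triplet map (0-based)
K = 2                            # number of triplets

# M_sg[i, k] = 0 if tau(i) == k, else -inf
M_sg = np.where(tau[:, None] == np.arange(K)[None, :], 0.0, -np.inf)
assert M_sg.shape == (len(tau), K)
```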

3. Training Procedure and Objective

The SG-Adapter is trained while keeping both the CLIP encoder and the diffusion model backbone frozen. For each image–caption pair $(x, c)$, the adapter produces refined text embeddings $\mathbf{w}'$. The objective is to minimize the denoising error at every diffusion step $t$:

$$\mathcal{L}_t = \mathbb{E}_{x, t, \epsilon}\left[\left\| \epsilon_t - \epsilon_\theta(x_t, t, \mathbf{w}') \right\|_2^2\right]$$

where $\epsilon_t$ is the ground-truth noise and $\epsilon_\theta(\cdot)$ is the model's predicted noise, conditioned on the scene-graph-refined embeddings. This formulation encourages precise propagation of scene graph structure into the image generation process (Shen et al., 2024).
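Schematically, computing this loss for one training step looks as follows. The noise predictor $\epsilon_\theta$ is mocked by a trivial function here; in reality it is the frozen diffusion U-Net, and only the adapter's parameters would receive gradients.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 8

def eps_theta(x_t, t, w_prime):
    """Stand-in for the frozen diffusion U-Net's noise prediction."""
    return 0.9 * x_t + 0.01 * w_prime.mean()

x0 = rng.standard_normal(16)           # latent of the training image
eps = rng.standard_normal(16)          # ground-truth noise epsilon_t
t = 500                                # diffusion timestep
alpha_bar = 0.5                        # illustrative noise-schedule value
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

w_prime = rng.standard_normal((N, D))  # adapter-refined embeddings w'
loss = np.mean((eps - eps_theta(x_t, t, w_prime)) ** 2)  # L_t
assert loss >= 0.0
```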

4. MultiRels Dataset Construction

To address the paucity and low quality of multi-relational scene graph datasets such as Visual Genome, the MultiRels dataset was curated. It consists of 309 images partitioned as follows:

  • ReVersion Subset: 99 samples focusing on challenging single relations sourced from the ReVersion dataset.
  • Multiple-Relations Subset: 210 images from two sources: 40 images crawled via long-query retrieval and 170 volunteer-collected photos, with up to 4 salient relations per image (objects include people, fruits, furniture, etc.). Annotations consist of scene graphs (lists of triplets) and token-to-triplet mappings; faces were locally redrawn for privacy.
  • Test Split: 20 scenarios generated by regrouping 2–3 relations to support robust evaluation.

This curated dataset supports fine-grained, multi-relational evaluation in diverse visual contexts.

5. Evaluation Metrics and Empirical Results

GPT-4V is used for automated evaluation by extracting predicted triplets and entities from generated images. Three primary intersection-over-union (IoU) based metrics are defined:

  • Scene Graph IoU (SG-IoU): Measures correct prediction of entire triplets.
  • Entity-IoU: Evaluates correct occurrence of object entities.
  • Relation-IoU: Assesses correct occurrence of relational predicates.
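All three metrics reduce to set IoU at different granularities; a minimal sketch, with hypothetical predicted and ground-truth triplets:

```python
def set_iou(pred, gt):
    """|pred ∩ gt| / |pred ∪ gt| over sets of triplets, entities, or relations."""
    p, g = set(pred), set(gt)
    return len(p & g) / len(p | g) if (p | g) else 1.0

gt   = [("man", "holds", "cake"), ("woman", "holds", "apple")]
pred = [("man", "holds", "cake"), ("woman", "holds", "cake")]  # one mis-binding

sg_iou       = set_iou(pred, gt)  # whole <s, r, o> triplets
entity_iou   = set_iou([t[0] for t in pred] + [t[2] for t in pred],
                       [t[0] for t in gt] + [t[2] for t in gt])
relation_iou = set_iou([t[1] for t in pred], [t[1] for t in gt])
```

Note how a single swapped object drops SG-IoU sharply while Entity-IoU and Relation-IoU stay high, which is why the triplet-level metric is the most sensitive to relation leakage.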

On the MultiRels test set, SG-Adapter yields the following (Table 1 in (Shen et al., 2024)):

| Method | SG-IoU | Entity-IoU | Relation-IoU | Rel-Acc | Ent-Acc | FID |
| --- | --- | --- | --- | --- | --- | --- |
| SG-Adapter (ours) | 0.623 | 0.812 | 0.753 | 77.6% | 77.1% | 26.2 |
| Stable Diffusion [46] | 0.157 | 0.673 | 0.526 | 5.38% | 5.48% | 25.0 |
| Finetune CLIP | 0.198 | 0.499 | 0.635 | 5.38% | 6.78% | 58.2 |
| GLIGEN Adapter | 0.141 | 0.689 | 0.546 | 5.72% | 5.58% | 27.4 |
| LoRA Adapter | 0.145 | 0.653 | 0.540 | 5.96% | 5.05% | 27.5 |

Replacing the scene-graph attention mask with a non-masked variant yields a marked drop in SG-IoU (0.623 → 0.316), confirming the necessity of explicit triplet-level binding.

SG-Adapter was further compared to scene-graph-to-image architectures (SG2IM, PasteGAN, SGDiff, SceneGenie), achieving lower FID (25.1 vs. ≥36.2) and higher IS (57.8 vs. ≤21.5).

Qualitatively, SG-Adapter consistently places objects and relations as dictated by the scene graph, eliminating mis-binding seen in baselines for multi-relation prompts such as “a man holds a cake and a woman holds an apple” (Figures 7 and 9 in (Shen et al., 2024)).

6. Implementation Specifics

SG-Adapter was developed for Stable Diffusion v2.1 at 768×768 image size. Training employed AdamW optimization, a learning rate of $1 \times 10^{-5}$, batch size 4, and convergence in 12K–14K iterations on a single NVIDIA A100 GPU. During inference, SG-guided cross-attention is applied for the initial $\tau T$ diffusion steps (with $\tau = 0.3$ balancing relation control and image fidelity) before reverting to standard CLIP guidance for the remainder.
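The inference-time schedule can be sketched as follows. The function and variable names are illustrative; only the per-step choice of conditioning (SG-refined embeddings for the first $\tau T$ steps, plain CLIP afterwards) is taken from the paper.

```python
def conditioning_for_step(step, T, tau, w_sg, w_clip):
    """SG-refined embeddings for the first tau*T denoising steps, CLIP after."""
    return w_sg if step < tau * T else w_clip

T, tau = 50, 0.3  # 50 denoising steps; tau = 0.3 as reported in the paper
schedule = [conditioning_for_step(s, T, tau, "sg", "clip") for s in range(T)]
assert schedule[:15] == ["sg"] * 15     # first 30% of steps use SG guidance
assert schedule[15:] == ["clip"] * 35   # remainder uses standard CLIP guidance
```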

7. Limitations and Prospects

Several constraints are noted:

  • Dataset Anonymization: Manual redrawing of faces may introduce artifacts. More advanced anonymization is proposed for future iterations.
  • Scene Graph Scale: MultiRels is relatively small. Scaling up may require synthetic augmentation or improved semi-automated cleaning of large datasets (e.g., Visual Genome).
  • Expressivity Beyond Triplets: The current formulation binds tokens only to flat relational triplets. Incorporating deeper, hierarchical scene graphs (n-ary relations, attributes, nested subgraphs) is posited as a direction for augmenting the grounding capability in complex scenes.

SG-Adapter demonstrates an effective approach for instilling explicit structural priors in pretrained text-to-image diffusion pipelines, achieving substantial improvements in multi-relation correctness while maintaining synthesis quality.
