
Compositional Visual Genome (ComVG)

Updated 31 January 2026
  • ComVG is an evaluation and augmentation framework for scene graph generation that targets novel or rare triplet predictions.
  • It uses structured compositional splits (zero-shot, 10-shot, 100-shot) alongside a generative pipeline to augment the Visual Genome dataset.
  • The framework employs scene-graph perturbations (Rand, Neigh, GraphN) and adversarial training to enhance model generalization on long-tailed distributions.

Compositional Visual Genome (ComVG) refers to an evaluation and augmentation setup grounded in the Visual Genome (VG) benchmark for scene-graph generation (SGG), emphasizing compositional generalization: the ability of models to correctly predict novel or rare subject–predicate–object triplets that are unobserved or infrequent in the training data. This paradigm addresses a challenge posed by the long-tailed data distribution: standard SGG models become biased toward frequent compositions and fail on zero- and few-shot combinations, e.g., ⟨cup, on, surfboard⟩, even when every constituent element is individually frequent. ComVG provides structured splits and leverages generative augmentations to mitigate these limitations and improve generalization (Knyazev et al., 2020).

1. Compositional Splits in Visual Genome

The ComVG framework employs the canonical Visual Genome dataset, restricted to 150 object classes and 50 predicate classes. Scene graphs extracted from images are organized according to their triplet composition frequency in the training set, and three compositional evaluation splits are defined:

  • Zero-shot (ZS): Test images containing at least one triplet never seen in training (4,519 test images).
  • 10-shot: Images containing at least one triplet observed 1–10 times during training (9,602 test images).
  • 100-shot: Images with at least one triplet observed 11–100 times (16,528 test images).
  • All-triplets: The standard test set, containing all triplets without frequency restriction.

While objects and predicates remain frequently occurring overall, particular combinations (triplets) in ZS and few-shot splits are rare or entirely absent, making this challenge strictly compositional rather than simply low-resource.
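The split definitions above can be sketched as a simple membership test over training-set triplet counts. This is an illustrative helper, not code from ComVG: the name `split_membership` and its arguments are hypothetical, and note that under the definitions above an image may satisfy more than one split condition.

```python
def split_membership(test_triplets, train_counts):
    """Decide which compositional splits a test image qualifies for.

    test_triplets: list of (subject, predicate, object) tuples in the image.
    train_counts:  dict mapping each triplet to its training-set frequency.

    An image belongs to a split if it contains AT LEAST ONE triplet whose
    training frequency falls in that split's range, so memberships can overlap.
    """
    counts = [train_counts.get(t, 0) for t in test_triplets]
    return {
        "zero-shot": any(c == 0 for c in counts),
        "10-shot":   any(1 <= c <= 10 for c in counts),
        "100-shot":  any(11 <= c <= 100 for c in counts),
    }

# Toy usage: the surfboard triplet is unseen, the table triplet was seen 3 times,
# so this image qualifies for both the zero-shot and the 10-shot split.
train_counts = {("cup", "on", "table"): 3}
m = split_membership([("cup", "on", "surfboard"), ("cup", "on", "table")],
                     train_counts)
```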

2. Generative Compositional Augmentation Pipeline

To increase compositional diversity without explicit data collection, ComVG introduces a generative augmentation pipeline:

  • Scene-Graph Perturbations: For each real scene graph G = (O, R), a proportion L (typically 20%) of object nodes is replaced using one of three schemes:
    • Rand: Replacement sampled uniformly from all object classes.
    • Neigh: Replacement sampled from the top-k (k = 10) closest word-embedding (GloVe-based) neighbors.
    • GraphN: Per-edge sampling from the empirical distribution of triplet counts matching the (o_c, e, o_j) or (o_j, e, o_c) patterns, thresholded by α and followed by top-k semantic neighbor selection (k = 5). Higher α favors more frequent compositions; lower values generate rarer combinations.
  • Bounding Box Protocol: Ground-truth bounding boxes B are retained rather than predicted, since learning to predict boxes from perturbed graphs yields poor IoU (≈6% on ZS) and degrades SGG accuracy.
  • Feature Synthesis: Instead of generating pixels, the pipeline produces frozen-detector RoI features (V, E), remapping perturbed scene graphs into a global feature map Ĥ using GraphTripleConv-based GCNs and Upsample-Refine blocks.
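The Rand and Neigh replacement schemes can be sketched as follows. This is an illustrative implementation under assumed interfaces: the function name `perturb_objects`, the embedding matrix layout, and the RNG handling are not from the original pipeline, and the more involved GraphN scheme is omitted for brevity.

```python
import numpy as np

def perturb_objects(labels, embeddings, num_classes, scheme="neigh",
                    frac=0.2, k=10, rng=None):
    """Replace a fraction of object-node labels in a scene graph.

    labels:      list of integer object-class labels (the graph's nodes).
    embeddings:  (num_classes, d) word-embedding matrix (GloVe in ComVG).
    scheme:      "rand" = uniform over all classes;
                 "neigh" = one of the k nearest embedding neighbors.
    frac, k:     ComVG reportedly uses frac=0.2 and k=10 for Neigh.
    """
    rng = rng or np.random.default_rng()
    out = list(labels)
    n_replace = max(1, int(frac * len(out)))
    for i in rng.choice(len(out), size=n_replace, replace=False):
        if scheme == "rand":
            out[i] = int(rng.integers(num_classes))
        else:
            # Distances to every class embedding; index 0 of the argsort
            # is the class itself (distance 0), so skip it.
            dists = np.linalg.norm(embeddings - embeddings[out[i]], axis=1)
            neighbors = np.argsort(dists)[1:k + 1]
            out[i] = int(rng.choice(neighbors))
    return out
```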

Three discriminators, D_node, D_edge (class-label-conditioned), and D_global (unconditional, applied to Ĥ), jointly ensure feature realism.

3. Learning Objectives and Training Regime

ComVG's generative framework is optimized using the following loss components:

  • Scene-Graph Classification Loss (L_CLS):

\mathcal{L}_{\rm CLS} = \mathcal{L}^O(F(V,E), O) + \mathcal{L}^R(F(V,E), R)

where F denotes the SGG model (IMP++ or Motifs++) and the second term implements a density-normalized edge cross-entropy.
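One plausible reading of a density-normalized edge cross-entropy is sketched below: the sparse foreground ("relation present") edges are reweighted against the abundant background ("no relation") edges so neither group dominates the loss. The exact normalization used in IMP++/Motifs++ follows Knyazev et al. (2020) and may differ in detail; this helper is an assumption for illustration.

```python
import numpy as np

def density_normalized_edge_loss(edge_logits, edge_labels, bg_class=0):
    """Cross-entropy over edges, balanced between foreground and background.

    edge_logits: (num_edges, num_predicate_classes) raw scores.
    edge_labels: (num_edges,) integer predicate labels; bg_class marks
                 "no relation" edges, which vastly outnumber real ones.
    """
    # Numerically stable log-softmax.
    m = edge_logits.max(axis=1, keepdims=True)
    logp = edge_logits - (m + np.log(np.exp(edge_logits - m)
                                     .sum(axis=1, keepdims=True)))
    nll = -logp[np.arange(len(edge_labels)), edge_labels]
    fg = edge_labels != bg_class
    # Average each group separately, then average the groups: foreground
    # edges contribute equally no matter how few of them there are.
    loss = 0.0
    if fg.any():
        loss += nll[fg].mean()
    if (~fg).any():
        loss += nll[~fg].mean()
    return loss / 2.0
```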

  • Reconstruction Loss (L_REC): Applied only to synthesized features.

\mathcal{L}_{\rm REC} = \mathcal{L}^O(F(\hat V, \hat E), \hat O) + \mathcal{L}^R(F(\hat V, \hat E), R)

  • Conditional Adversarial Losses: For the node (V), edge (E), and global (H) streams:

\mathcal{L}^D_{\rm ADV}(x, y) = \mathbb{E}_{x \sim p_{\rm data}} [\log D(x \mid y)] + \mathbb{E}_{z \sim p_z} [\log(1 - D(G(z, y) \mid y))]

\mathcal{L}^G_{\rm ADV}(y) = \mathbb{E}_{z \sim p_z} [\log D(G(z, y) \mid y)]

These adversarial terms are summed over all three streams.

  • Total Optimization Target:

\min_{F, G} \, \max_{D} \; \mathcal{L}_{\rm CLS} + \mathcal{L}_{\rm REC} - \gamma (\mathcal{L}^D_{\rm ADV} + \mathcal{L}^G_{\rm ADV})

with γ = 5 controlling the trade-off between the GAN and classification objectives.

Training alternates between updating the SGG model FF on real and synthesized features, discriminator training to distinguish true versus generated features, and generator training to fool discriminators. The detector remains frozen (Faster-RCNN backbone; only features are synthesized).
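The objective composition and the alternating update order can be summarized schematically. The helper names below (`total_fg_objective`, `step_F`, `step_D`, `step_G`) are hypothetical stand-ins for the actual optimizer steps, not the authors' API.

```python
def total_fg_objective(l_cls, l_rec, l_adv_d, l_adv_g, gamma=5.0):
    """Value minimized by F and G in
    min_{F,G} max_D  L_CLS + L_REC - gamma * (L_ADV^D + L_ADV^G),
    with gamma = 5 as in ComVG."""
    return l_cls + l_rec - gamma * (l_adv_d + l_adv_g)

def train_epoch(batches, step_F, step_D, step_G):
    """Schematic alternation over one epoch (hypothetical callables):
    each batch updates the SGG model F on real + synthesized features,
    then the discriminators, then the generator. The detector stays
    frozen throughout: only RoI features are synthesized."""
    for batch in batches:
        step_F(batch)   # classification + reconstruction losses
        step_D(batch)   # maximize L_ADV^D: real vs. generated features
        step_G(batch)   # minimize -L_ADV^G: fool the discriminators
```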

4. Results and Comparative Evaluation

Empirical results, primarily reported with IMP++ as the SGG backbone, indicate marginal but consistent improvements in compositional generalization metrics. Table 1 summarizes main SGG results (Recall@100 for SGCls, Recall@50 for PredCls):

Model                     ZS-SGCls  ZS-PredCls  10-SGCls  10-PredCls  100-SGCls  100-PredCls  All-SGCls  All-PredCls
IMP++ baseline             9.27      28.14       21.8      42.8        40.4       67.8         48.7       77.5
+ GAN (no perturbations)   9.25      28.66       22.2      43.7        41.6       69.2         50.4       79.1
+ GAN + Rand               9.71      28.7        21.9      43.3        41.0       68.9         49.8       78.8
+ GAN + Neigh               9.65      28.7        21.9      43.8        41.3       69.1         50.0       78.9
+ GAN + GraphN (α = 2)     9.89      28.9        22.0      43.8        41.2       69.2         50.1       79.0

The GAN+GraphN augmentation provides the strongest gains across the ZS, few-shot, and all-triplet splits (ZS PredCls: 28.1 → 28.9), with comparable or better performance than contemporaneous baselines such as TDE (Tang et al., 2020) and energy-based models.

5. Ablation Studies, Limitations, and Future Trajectories

A series of ablations confirms the importance of the global adversarial and reconstruction losses; their removal degrades both SGG performance and GAN feature quality, assessed via the Precision/Recall/Density/Coverage metrics (Kynkäänniemi et al., 2019; Naeem et al., 2020). t-SNE visualization shows that generated features are consistent with real ones; however, fidelity under rare (ZS) triplet conditioning drops by roughly 13%. Bounding-box prediction for synthesized graphs yields insufficient IoU (≈6% on ZS), motivating the retention of ground-truth boxes in augmentations.

Freezing the detector constrains the system but ensures stable feature representations. The authors note that an end-to-end, fully differentiable pipeline—scene-graph to image to detector—would enable fine-tuning on rare compositions, but this approach is computationally demanding.

Synthesizing convincing features for genuinely novel graph structures remains challenging. Advanced GAN objectives or diffusion-based alternatives are cited as possible future enhancements.

6. Context and Significance in Scene-Graph Generation

ComVG addresses a central limitation in SGG: inadequate generalization to the compositional tail of the triplet distribution. Standard models (e.g., Motifs++, IMP++) overfit to the head of that distribution. By augmenting the tail with hallucinated but plausible scene graphs, via data-driven perturbation schemes and conditional GANs, ComVG achieves improved recall on zero- and few-shot metrics. This framework points toward more robust scene understanding where direct triplet annotations are scarce or impractical, and lays groundwork for advancing vision-language models in settings confronted by combinatorial composition sparsity (Knyazev et al., 2020).
