Compositional Visual Genome (ComVG)
- ComVG is an evaluation and augmentation framework for scene graph generation that targets novel or rare triplet predictions.
- It uses structured compositional splits (zero-shot, 10-shot, 100-shot) alongside a generative pipeline to augment the Visual Genome dataset.
- The framework employs scene-graph perturbations (Rand, Neigh, GraphN) and adversarial training to enhance model generalization on long-tailed distributions.
Compositional Visual Genome (ComVG) refers to an evaluation and augmentation setup grounded in the Visual Genome (VG) benchmark for scene-graph generation (SGG), emphasizing compositional generalization: the ability of models to correctly predict novel or rare ⟨subject, predicate, object⟩ triplets that are unobserved or infrequent in the training data. This paradigm addresses the challenges of a long-tailed data distribution, under which standard SGG models become biased toward frequent compositions and fail on zero- and few-shot combinations, e.g., ⟨cup, on, surfboard⟩, even though every constituent element is individually frequent. ComVG provides structured splits and leverages generative augmentations to mitigate these limitations and improve generalization (Knyazev et al., 2020).
1. Compositional Splits in Visual Genome
The ComVG framework employs the canonical Visual Genome dataset, restricted to 150 object classes and 50 predicate classes. Scene graphs extracted from images are organized according to their triplet composition frequency in the training set, and three compositional evaluation splits are defined:
- Zero-shot (ZS): Test images containing at least one triplet never seen in training (4,519 test images).
- 10-shot: Images containing at least one triplet observed 1–10 times during training (9,602 test images).
- 100-shot: Images with at least one triplet observed 11–100 times (16,528 test images).
- All-triplets: The standard test set, containing all triplets without frequency restriction.
While objects and predicates remain frequently occurring overall, particular combinations (triplets) in ZS and few-shot splits are rare or entirely absent, making this challenge strictly compositional rather than simply low-resource.
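The split assignment described above amounts to bucketing each test image by the minimum training-set frequency of any triplet it contains. A minimal sketch, with illustrative triplets and counts rather than actual Visual Genome annotations:

```python
# Sketch: assign a test image to the most compositional ComVG-style split
# it qualifies for, based on training-set triplet counts.
# The triplets and counts below are illustrative, not real VG data.
from collections import Counter

def split_of(image_triplets, train_counts):
    """Return the split determined by the rarest triplet in the image."""
    min_count = min(train_counts.get(t, 0) for t in image_triplets)
    if min_count == 0:
        return "zero-shot"   # at least one triplet never seen in training
    if min_count <= 10:
        return "10-shot"     # at least one triplet seen 1-10 times
    if min_count <= 100:
        return "100-shot"    # at least one triplet seen 11-100 times
    return "all"             # only frequent triplets

train_counts = Counter({
    ("cup", "on", "table"): 500,
    ("man", "riding", "surfboard"): 40,
})
test_image = [("cup", "on", "table"), ("cup", "on", "surfboard")]
print(split_of(test_image, train_counts))  # -> zero-shot
```

Note that every image also belongs to the standard "all" evaluation; the zero- and few-shot splits are defined by the rarest composition an image contains.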
2. Generative Compositional Augmentation Pipeline
To increase compositional diversity without explicit data collection, ComVG introduces a generative augmentation pipeline:
- Scene-Graph Perturbations: For each real scene graph, a proportion of object nodes (typically 20%) is replaced using one of three schemes:
- Rand: Replacement sampled uniformly from all object classes.
- Neigh: Replacement sampled from the top-k closest neighbors of the original class in a word-embedding (GloVe-based) space.
- GraphN: Per-edge sampling from the empirical distribution of triplet counts matching the ⟨subject, predicate, ·⟩ or ⟨·, predicate, object⟩ patterns, thresholded by a frequency hyperparameter α and followed by top-k semantic neighbor selection. Higher α favors more frequent compositions; lower values generate rarer combinations.
- Bounding Box Protocol: Ground-truth bounding boxes are retained, as learning to predict boxes from perturbed graphs yields poor IoU (∼6% on ZS) and degrades SGG accuracy.
- Feature Synthesis: Instead of generating pixels, the pipeline synthesizes features that mimic a frozen detector's RoI features, remapping perturbed scene graphs into a global feature map using "GraphTripleConv"-based GCNs and Upsample-Refine blocks.
Three discriminators (node-level and edge-level streams conditioned on class labels, plus an unconditional global stream) jointly ensure feature realism.
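The Neigh scheme above can be sketched as follows. Real GloVe vectors are assumed in the original pipeline; here random toy embeddings stand in, and the class list, replacement fraction, and `k` are illustrative choices:

```python
# Sketch of a Neigh-style scene-graph perturbation: replace a fraction of
# object nodes with semantically close classes ranked by embedding cosine
# similarity. Toy embeddings stand in for GloVe vectors.
import numpy as np

rng = np.random.default_rng(0)
classes = ["cup", "mug", "table", "surfboard", "man"]
emb = {c: rng.normal(size=8) for c in classes}  # stand-in for GloVe

def neigh_perturb(nodes, frac=0.2, k=2):
    """Replace ~frac of the nodes with one of their top-k embedding neighbors."""
    nodes = list(nodes)
    n_replace = max(1, int(round(frac * len(nodes))))
    for i in rng.choice(len(nodes), size=n_replace, replace=False):
        v = emb[nodes[i]]
        others = [c for c in classes if c != nodes[i]]
        # rank candidate classes by cosine similarity to the current node
        sims = [(v @ emb[c]) / (np.linalg.norm(v) * np.linalg.norm(emb[c]))
                for c in others]
        top_k = [others[j] for j in np.argsort(sims)[::-1][:k]]
        nodes[i] = rng.choice(top_k)  # sample uniformly among top-k neighbors
    return nodes

print(neigh_perturb(["cup", "table", "man", "surfboard", "mug"]))
```

Rand would sample the replacement uniformly from `classes` instead, and GraphN would additionally weight candidates by empirical triplet counts before the top-k selection.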
3. Learning Objectives and Training Regime
ComVG's generative framework is optimized using the following loss components:
- Scene-Graph Classification Loss (L_CLS): a cross-entropy over node (object) classes plus a density-normalized cross-entropy over edge (predicate) classes, computed from the predictions of the SGG model (IMP++ or Motifs++); the edge term is normalized per image so that densely annotated scene graphs do not dominate the objective.
- Reconstruction Loss (L_REC): applied only to synthesized features, penalizing their deviation from the corresponding real detector features.
- Conditional Adversarial Losses (L_GAN): GAN losses for the node, edge, and global discriminator streams, summed over all streams.
- Total Optimization Target: a weighted sum of the classification, reconstruction, and adversarial terms, with a coefficient λ controlling the GAN versus classification trade-off.
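One plausible reading of the density-normalized edge cross-entropy, sketched below: each image's edge loss is averaged over its own annotated edges before averaging across the batch, so an image with many labeled relations contributes no more than a sparse one. Shapes and the exact normalization are illustrative assumptions, not the published formulation:

```python
# Hedged sketch of a density-normalized edge cross-entropy: per-image
# averaging over annotated edges, then averaging across images.
import numpy as np

def density_normalized_edge_ce(edge_logits_per_image, edge_labels_per_image):
    """edge_logits_per_image: list of (E_i, P) arrays over P predicate classes;
    edge_labels_per_image: list of (E_i,) integer label arrays."""
    total = 0.0
    for logits, labels in zip(edge_logits_per_image, edge_labels_per_image):
        # numerically stable softmax cross-entropy per edge
        z = logits - logits.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        ce = -log_probs[np.arange(len(labels)), labels]
        total += ce.mean()  # normalize by this image's own edge count
    return total / len(edge_logits_per_image)
```

Without the per-image normalization, summing raw edge losses would let the densest graphs dominate the gradient, which is the failure mode the density-aware loss targets.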
Training alternates among three steps: updating the SGG model on real and synthesized features, training the discriminators to distinguish real from generated features, and training the generator to fool the discriminators. The detector (a Faster R-CNN backbone) remains frozen; only its features are synthesized.
4. Results and Comparative Evaluation
Empirical results, primarily reported with IMP++ as the SGG backbone, indicate marginal but consistent improvements in compositional generalization metrics. Table 1 summarizes main SGG results (Recall@100 for SGCls, Recall@50 for PredCls):
| Model | ZS-SGCls | ZS-PredCls | 10-SGCls | 10-PredCls | 100-SGCls | 100-PredCls | All-SGCls | All-PredCls |
|---|---|---|---|---|---|---|---|---|
| IMP++ baseline | 9.27 | 28.14 | 21.8 | 42.8 | 40.4 | 67.8 | 48.7 | 77.5 |
| + GAN (no perturbs) | 9.25 | 28.66 | 22.2 | 43.7 | 41.6 | 69.2 | 50.4 | 79.1 |
| + GAN+Rand | 9.71 | 28.7 | 21.9 | 43.3 | 41.0 | 68.9 | 49.8 | 78.8 |
| + GAN+Neigh | 9.65 | 28.7 | 21.9 | 43.8 | 41.3 | 69.1 | 50.0 | 78.9 |
| + GAN+GraphN (α = 2) | 9.89 | 28.9 | 22.0 | 43.8 | 41.2 | 69.2 | 50.1 | 79.0 |
The GAN+GraphN augmentation provides the strongest gains across ZS, few-shot, and all-triplet splits (ZS PredCls: 28.1→28.9), with comparable or better performance versus contemporaneous baselines such as TDE [tang2020unbiased] and energy-based models.
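The Recall@K numbers in the table measure the fraction of ground-truth triplets recovered among a model's top-K scored triplet predictions for an image. A minimal sketch of the metric, with purely illustrative triplets rather than the paper's evaluation code:

```python
# Sketch of Recall@K for scene-graph prediction: how many ground-truth
# triplets appear among the model's K highest-scoring predictions.
def recall_at_k(predicted_ranked_triplets, gt_triplets, k):
    """predicted_ranked_triplets: triplets sorted by descending model score."""
    top_k = set(predicted_ranked_triplets[:k])
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / len(gt_triplets)

preds = [("cup", "on", "table"), ("cup", "on", "surfboard")]
gt = [("cup", "on", "surfboard")]
print(recall_at_k(preds, gt, 1))  # -> 0.0 (rare triplet ranked too low)
```

In the SGCls setting the model must classify both objects and predicates from ground-truth boxes, whereas PredCls supplies ground-truth object labels and scores predicates only, which is why PredCls recalls are uniformly higher.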
5. Ablation Studies, Limitations, and Future Trajectories
A series of ablations confirms the importance of global adversarial and reconstruction losses; their removal degrades both SGG performance and GAN-based feature quality, assessed via Precision/Recall/Density/Coverage [kynkaanniemi2019improved, naeem2020reliable]. t-SNE visualization reveals generated features are visually consistent with real ones; however, fidelity under rare (ZS) triplet conditioning drops by ∼13%. Bounding box prediction for synthesized graphs produces insufficient IoU (∼6% on ZS), motivating retention of ground-truth boxes in augmentations.
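The Density and Coverage metrics cited above (Naeem et al., 2020) assess generated-feature quality via k-NN balls around real features: density counts how often fakes fall inside real-sample neighborhoods, and coverage measures how many real samples have at least one fake nearby. A hedged numpy sketch with random stand-in feature vectors:

```python
# Sketch of Density/Coverage (Naeem et al., 2020) over feature vectors.
# Real and fake features here are random stand-ins for detector features.
import numpy as np

def knn_radii(real, k):
    """Distance from each real point to its k-th nearest real neighbor."""
    d = np.linalg.norm(real[:, None] - real[None, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the self-distance (0)

def density_coverage(real, fake, k=3):
    radii = knn_radii(real, k)
    d = np.linalg.norm(fake[:, None] - real[None, :], axis=-1)  # (M, N)
    inside = d <= radii[None, :]              # fake j inside ball of real i
    density = inside.sum() / (k * len(fake))  # avg real-ball membership
    coverage = inside.any(axis=0).mean()      # reals with >= 1 fake inside
    return density, coverage

rng = np.random.default_rng(1)
real = rng.normal(size=(200, 16))
fake = rng.normal(size=(200, 16))
print(density_coverage(real, fake, k=3))
```

Low density or coverage under rare (ZS) triplet conditioning is consistent with the reported ∼13% fidelity drop: the generator struggles to place novel-composition features inside the real feature manifold.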
Freezing the detector constrains the system but ensures stable feature representations. The authors note that an end-to-end, fully differentiable pipeline—scene-graph to image to detector—would enable fine-tuning on rare compositions, but this approach is computationally demanding.
Synthesizing convincing features for genuinely novel graph structures remains challenging. Advanced GAN objectives or diffusion-based alternatives are cited as possible future enhancements.
6. Context and Significance in Scene-Graph Generation
ComVG addresses a central limitation in SGG: inadequate generalization to the compositional tail of the triplet distribution. Standard models (e.g., Motifs++, IMP++) overfit to the head of that distribution. By augmenting the tail with hallucinated but plausible scene graphs, via data-driven perturbation schemes and conditional GANs, ComVG achieves improved recall on zero- and few-shot metrics. The framework points toward more robust scene understanding where direct triplet annotations are scarce or impractical, and lays groundwork for advancing vision-language models in settings marked by combinatorial composition sparsity (Knyazev et al., 2020).