Scene Graph Generation Insights

Updated 22 November 2025
  • Scene graph generation is a process that converts images into structured graphs with nodes for objects and edges for relationships.
  • It employs diverse methodologies such as two-stage pipelines, graph neural networks, and transformers to incorporate context and refine predictions.
  • Applications span visual question answering, image captioning, and navigation, while challenges include long-tail imbalance and spatial grounding precision.

Scene graph generation (SGG) is the task of converting visual input, most commonly an image, into a structured representation that explicates objects (as graph nodes) and their relationships (as directed, labeled edges). This graph-based abstraction provides a compact, machine-readable summary of the scene’s semantic and geometric content, directly serving downstream applications in vision-language modeling, visual question answering, navigation, image captioning, and beyond (Zhu et al., 2022, Mozes et al., 2021, Seymour et al., 2022). The field has evolved rapidly, moving from object-centric, bounding-box–grounded pipelines to relation-centric, iterative, and even modality-universal graph extraction frameworks.

1. Problem Definition and Canonical Formulation

Let a visual scene $S$ be represented by a directed graph $G_S = (V_S, E_S)$, where $V_S = \{o_{S,1}, \ldots, o_{S,n}\}$ are semantic object instances, each with a class label $l_{S,i}$ and often a bounding-box location $b_{S,i}$, and $E_S = \{(o_{S,i}, p_{S,i\to j}, o_{S,j})\}$ is the set of object–predicate–object triplets for $i \neq j$ (Zhu et al., 2022). Predicates $p_{S,i\to j}$ are drawn from a discrete, pre-defined set (typically 20–50 categories such as “on”, “holding”, “carrying”).

The standard probabilistic approach factorizes scene graph generation as
$$p(G_S \mid S) = p(\mathcal{B}_S \mid S)\, p(\mathcal{O}_S \mid \mathcal{B}_S, S)\, p(\mathcal{R}_S \mid \mathcal{O}_S, \mathcal{B}_S, S),$$
where $\mathcal{B}_S$ are region proposals, $\mathcal{O}_S$ are object labels, and $\mathcal{R}_S$ are predicates (Zhu et al., 2022, Liu et al., 2021). The SGG task is evaluated in three sub-tasks: Predicate Classification (PredCls; objects known, predict relationships), Scene Graph Classification (SGCls; boxes known, predict both labels and relationships), and Scene Graph Detection (SGGen; fully unconstrained).
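
As a concrete, hypothetical illustration of this representation, the sketch below encodes a scene graph as plain Python objects; the class names, boxes, and predicates are toy values, not those of any benchmark.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ObjectNode:
    label: str                               # semantic class l_{S,i}, e.g. "person"
    box: Tuple[float, float, float, float]   # bounding box b_{S,i} = (x1, y1, x2, y2)

@dataclass
class SceneGraph:
    objects: List[ObjectNode] = field(default_factory=list)
    # directed, labeled edges: (subject index, predicate, object index)
    triplets: List[Tuple[int, str, int]] = field(default_factory=list)

# Toy graph for "a person riding a horse on grass"
g = SceneGraph(
    objects=[
        ObjectNode("person", (120, 40, 220, 300)),
        ObjectNode("horse",  (80, 150, 330, 420)),
        ObjectNode("grass",  (0, 380, 640, 480)),
    ],
    triplets=[(0, "riding", 1), (1, "on", 2)],
)

for s, p, o in g.triplets:
    print(g.objects[s].label, p, g.objects[o].label)
```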

2. Approaches and Model Architectures

2.1. Two-Stage and Factorized Pipelines

Early SGG architectures adopted a two-stage paradigm: first generate object proposals and classify them, and then for each object pair, predict the predicate (Li et al., 2017, Mozes et al., 2021). Prominent examples include Faster R-CNN–based backbones (for objects), followed by MLP or LSTM-based relation heads operating on concatenated position and feature information. Notable models of this class include MotifNet (BiLSTM context), VTransE (translation embedding for relations), and KERN (knowledge-graph–enhanced message passing) (Zhu et al., 2022, Kumar et al., 2021).
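
A minimal sketch of the second stage, assuming per-object appearance features and box coordinates have already been extracted by a detector; the feature dimensions and the small MLP are illustrative, not the architecture of any cited model.

```python
import torch
import torch.nn as nn

class PairwiseRelationHead(nn.Module):
    """Score a predicate distribution for every ordered object pair."""
    def __init__(self, feat_dim=256, num_predicates=50):
        super().__init__()
        # subject features + object features + both boxes (4 + 4 coords)
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim + 8, 512), nn.ReLU(),
            nn.Linear(512, num_predicates),
        )

    def forward(self, obj_feats, obj_boxes):
        n = obj_feats.size(0)
        idx_s, idx_o = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
        mask = idx_s != idx_o                       # skip self-pairs (i != j)
        s, o = idx_s[mask], idx_o[mask]
        pair = torch.cat([obj_feats[s], obj_feats[o], obj_boxes[s], obj_boxes[o]], dim=-1)
        return self.mlp(pair), s, o                 # predicate logits per ordered pair

# Usage with dummy detector outputs: 4 objects, 256-d features
head = PairwiseRelationHead()
logits, subj, obj = head(torch.randn(4, 256), torch.rand(4, 4))
print(logits.shape)   # (12, 50): 4*3 ordered pairs x 50 predicate classes
```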

2.2. Context- and Message-Passing–Driven Methods

Object-pair–wise classification scales quadratically with the number of objects and can ignore global context. To remedy this, graph neural networks (GNNs) and transformer-based models are used to propagate context (Zhu et al., 2022, Khandelwal et al., 2022, Kundu et al., 2022). Recent models formulate SGG as graph-level message passing: e.g., EdgeSGG explicitly constructs a dual scene graph (with edges as nodes) and applies dual message passing neural networks (object-centric and relation-centric, symmetrically coupled) (Kim et al., 2023). Iterative transformer architectures further enable “refinement” of predictions through layered, MRF-like conditioning, jointly revising subjects, objects, and predicates at each step (Khandelwal et al., 2022).
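
The following is a schematic single round of message passing over object nodes, loosely in the spirit of GNN-based context propagation; the mean aggregation and GRU update are illustrative assumptions, not EdgeSGG's exact dual formulation.

```python
import torch
import torch.nn as nn

class ContextPropagation(nn.Module):
    """One round of mean-aggregated message passing over object nodes."""
    def __init__(self, dim=256):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)   # message from neighbor src to node dst
        self.upd = nn.GRUCell(dim, dim)      # node state update

    def forward(self, node_feats, edges):
        # edges: LongTensor of shape (E, 2) with (src, dst) indices
        src, dst = edges[:, 0], edges[:, 1]
        messages = self.msg(torch.cat([node_feats[src], node_feats[dst]], dim=-1))
        agg = torch.zeros_like(node_feats)
        agg.index_add_(0, dst, messages)                           # sum messages per node
        deg = torch.bincount(dst, minlength=node_feats.size(0)).clamp(min=1)
        agg = agg / deg.unsqueeze(-1)                              # mean aggregation
        return self.upd(agg, node_feats)                           # contextualized node features

layer = ContextPropagation()
feats = torch.randn(5, 256)
edges = torch.tensor([[0, 1], [1, 0], [2, 1], [3, 4]])
print(layer(feats, edges).shape)   # (5, 256)
```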

2.3. Fully Convolutional and Bottom-Up Models

To increase computational efficiency, some works replace region proposal–based detection with fully convolutional inference. FCSGG predicts object centers as heatmaps and relation vector fields (Relation Affinity Fields) directly, constructing the scene graph by localizing peaks and integrating along predicted relation fields (Liu et al., 2021). This eliminates the need for proposal-pairing and allows high-throughput inference without explicit object-ROI enumeration.
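
A rough sketch of the center-peak extraction step for such heatmap-based decoding; the max-pooling trick for local-maximum suppression is a common keypoint-detection idiom, and the thresholds, shapes, and omitted relation-field integration are illustrative assumptions rather than FCSGG's exact procedure.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, k=10, threshold=0.3):
    """Extract up to k object-center peaks from a per-class heatmap.

    heatmap: (C, H, W) tensor of center scores in [0, 1].
    Returns a list of (class_id, y, x, score) tuples.
    """
    # local-maximum suppression: keep cells equal to their 3x3 max
    pooled = F.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    peaks = heatmap * (heatmap == pooled).float()

    C, H, W = peaks.shape
    scores, idx = peaks.view(-1).topk(k)
    keep = scores > threshold
    out = []
    for s, i in zip(scores[keep], idx[keep]):
        c, rem = divmod(i.item(), H * W)
        y, x = divmod(rem, W)
        out.append((c, y, x, s.item()))
    return out

print(decode_centers(torch.rand(5, 64, 64).pow(4)))  # dummy heatmap
```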

2.4. Segmentation- and Pixel-Level Grounding

Standard SGG methods ground objects and predicates to bounding boxes, which are coarse and often include background. “Segmentation-grounded Scene Graph Generation” attaches pixel-level masks to object nodes and leverages a zero-shot transfer mechanism from MS COCO via linguistic matching to provide fine-grained masks—enabling predicate features to attend to precise spatial interfaces via Gaussian attention (Khandelwal et al., 2021). Empirically, integrating segmentation substantially improves mean recall and zero-shot recall, particularly on rare predicates, and fosters better spatial and functional relation modeling.
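
As a simplified stand-in for mask-guided spatial attention, the helper below builds a Gaussian attention map centered on a mask's centroid and uses it to weight a feature map; the parameterization is an assumption for illustration and does not reproduce the cited method.

```python
import torch

def gaussian_attention(mask, sigma=8.0):
    """Soft spatial attention centered on a binary mask's centroid.

    mask: (H, W) float tensor in {0, 1}. Returns an (H, W) attention map.
    """
    H, W = mask.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    area = mask.sum().clamp(min=1.0)
    cy, cx = (ys * mask).sum() / area, (xs * mask).sum() / area
    att = torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return att / att.max()

mask = torch.zeros(64, 64)
mask[20:40, 10:30] = 1.0
weighted = torch.randn(256, 64, 64) * gaussian_attention(mask)  # weight a feature map
print(weighted.shape)
```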

2.5. Location-Free Scene Graph Generation

Location-Free methods forgo explicit spatial localization, generating ungrounded scene graphs directly from pixels. Pix2SG autoregressively decodes object and predicate tokens as sequences, matching predicted and ground-truth graphs via tree search (Özsoy et al., 2023). This approach achieves ∼74% of standard, box-based SGG recall on Visual Genome and recovers even higher F1 in surgical domains, reflecting that explicit localization can be omitted without catastrophic degradation in graph quality.
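
For intuition, a location-free graph can be emitted as a flat token stream and parsed back into triplets; the toy vocabularies and the [subject, predicate, object] sequence layout below are illustrative assumptions, not Pix2SG's exact token scheme (which also encodes instance indices).

```python
# Toy vocabularies (hypothetical; real models learn these from the dataset)
OBJECTS = {0: "person", 1: "horse", 2: "grass"}
PREDICATES = {0: "riding", 1: "on"}
EOS = -1

def parse_triplet_sequence(tokens):
    """Parse an autoregressively-decoded token stream into triplets.

    Layout assumed here: [subj_cls, pred_cls, obj_cls] repeated, then EOS.
    """
    triplets = []
    for i in range(0, len(tokens), 3):
        if tokens[i] == EOS:
            break
        s, p, o = tokens[i:i + 3]
        triplets.append((OBJECTS[s], PREDICATES[p], OBJECTS[o]))
    return triplets

# e.g. a decoder that emitted: person riding horse, horse on grass
print(parse_triplet_sequence([0, 0, 1, 1, 1, 2, EOS]))
```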

2.6. Multimodal and Universal Scene Graph Generation

Recent advances generalize the scene graph paradigm to handle multimodal inputs—images, video, 3D, and text—within a single, universal graph structure (USG) (Wu et al., 19 Mar 2025). USG-Par’s architecture exemplifies this direction by deploying modality-specific encoders (e.g., CLIP, Point-BERT), a shared cross-attention mask decoder, and object associators learning cross-modal correspondences via contrastive learning. USG representations unify intra- and inter-modal relationships, yielding state-of-the-art performance on both single- and multi-modal benchmarks.
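
The cross-modal alignment the paragraph alludes to can be sketched as a standard symmetric InfoNCE objective between paired embeddings from two modalities; the dimensions, temperature, and pooling are assumptions, not USG-Par's exact loss.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(visual_emb, text_emb, temperature=0.07):
    """Contrastive alignment of paired embeddings from two modalities.

    visual_emb, text_emb: (N, D) tensors; row i of each is a matched pair.
    """
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (N, N) similarity matrix
    targets = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = symmetric_infonce(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```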

3. Training Objectives, Losses, and Evaluation

SGG models are trained with combinations of the following losses (Zhu et al., 2022, Kim et al., 2023, Khandelwal et al., 2021); a sketch of a combined objective follows the list:

  • Object classification: cross-entropy on predicted class labels.
  • Box regression: smooth L1 or similar.
  • Predicate classification: cross-entropy (sometimes with inverse-frequency weighting for long-tailed predicates).
  • Auxiliary loss: segmentation mask regression (Dice + BCE), translation embedding (margin ranking), or contrastive (for multimodal or long-tail mitigation).
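
A minimal sketch of how these terms are commonly combined during training, assuming per-batch predictions and targets are already available; the loss weights and the inverse-frequency weighting scheme are illustrative choices, not those of any specific paper.

```python
import torch
import torch.nn.functional as F

def sgg_training_loss(obj_logits, obj_targets,
                      box_preds, box_targets,
                      pred_logits, pred_targets,
                      predicate_freq, lambdas=(1.0, 1.0, 1.0)):
    """Combine standard SGG loss terms; weights are illustrative."""
    # object classification: plain cross-entropy
    l_obj = F.cross_entropy(obj_logits, obj_targets)
    # box regression: smooth L1 on matched boxes
    l_box = F.smooth_l1_loss(box_preds, box_targets)
    # predicate classification: inverse-frequency weighting for the long tail
    w = 1.0 / predicate_freq.clamp(min=1.0)
    l_pred = F.cross_entropy(pred_logits, pred_targets, weight=w / w.sum())
    return lambdas[0] * l_obj + lambdas[1] * l_box + lambdas[2] * l_pred

# Dummy batch: 6 objects over 150 classes, 10 relations over 50 predicates
loss = sgg_training_loss(
    torch.randn(6, 150), torch.randint(0, 150, (6,)),
    torch.rand(6, 4), torch.rand(6, 4),
    torch.randn(10, 50), torch.randint(0, 50, (10,)),
    predicate_freq=torch.randint(1, 1000, (50,)).float(),
)
print(loss.item())
```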

Evaluation typically uses recall@K (R@K), mean recall@K (mR@K), and sometimes harmonic recall (hR@K), the harmonic mean of overall (R@K) and per-class (mR@K) recall (Khandelwal et al., 2022, Kundu et al., 2022), to track both head and tail predicate performance. Zero-shot recall focuses on triplets not seen in training, and OpenImages additionally reports mAP for phrase and relation detection (Kim et al., 2023).
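
For concreteness, a simplified triplet-level Recall@K might be computed as below; real benchmarks additionally require box IoU matching, which is omitted here, and mR@K averages this quantity per predicate class.

```python
def recall_at_k(predicted, ground_truth, k):
    """predicted: list of (subj, pred, obj) triplets sorted by confidence.
    ground_truth: set of (subj, pred, obj) triplets for one image.
    Box-IoU matching used by real benchmarks is omitted for brevity.
    """
    top_k = set(predicted[:k])
    return len(top_k & ground_truth) / max(len(ground_truth), 1)

gt = {("person", "riding", "horse"), ("horse", "on", "grass")}
preds = [("person", "riding", "horse"), ("person", "on", "horse"), ("horse", "on", "grass")]
print(recall_at_k(preds, gt, k=2))   # 0.5: one of two GT triplets recovered in top-2
```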

4. Current Advancements and Specialized Methodologies

4.1. Relation-Centric and Long-Tail Strategies

Object-centric methods are limited in modeling higher-order and rare relationships. Techniques such as EdgeSGG’s dual-graph structure amplify gradients on rare predicates and spread contextual information symmetrically across relation instances (Kim et al., 2023). Class-balanced loss weighting and long-tail–focused sampling have also been shown to improve tail recall (Khandelwal et al., 2022, Kundu et al., 2022).

4.2. Iterative and Generative Transformers

Iterative SGG frameworks unroll Markov Random Field–style message passing via transformers, using both within-step and across-step conditioning to enable joint reasoning over entities and relations. This approach consistently improves mean and harmonic recall, especially for ambiguous or occluded relations (Khandelwal et al., 2022). Generative alternatives, such as IS-GGT, decouple structure sampling (autoregressive adjacency matrix generation) from predicate classification, pruning the combinatorial search space and achieving strong unbiased mean recall at lower computational cost (Kundu et al., 2022).
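
A schematic of the decoupling described above: first select a sparse set of candidate edges (the graph structure), then classify predicates only for those edges. The link scorer and top-k selection below are placeholder choices for illustration, not IS-GGT's autoregressive adjacency generator.

```python
import torch
import torch.nn as nn

class DecoupledSGG(nn.Module):
    """Stage 1: score candidate edges; stage 2: classify predicates on kept edges."""
    def __init__(self, dim=256, num_predicates=50, keep=20):
        super().__init__()
        self.link_scorer = nn.Bilinear(dim, dim, 1)          # edge-existence score
        self.pred_classifier = nn.Linear(2 * dim, num_predicates)
        self.keep = keep

    def forward(self, node_feats):
        n = node_feats.size(0)
        s, o = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
        s, o = s.flatten(), o.flatten()
        valid = s != o
        s, o = s[valid], o[valid]
        scores = self.link_scorer(node_feats[s], node_feats[o]).squeeze(-1)
        kept = scores.topk(min(self.keep, scores.numel())).indices   # sparse structure
        s, o = s[kept], o[kept]
        logits = self.pred_classifier(torch.cat([node_feats[s], node_feats[o]], dim=-1))
        return s, o, logits   # predicates predicted only for sampled edges

model = DecoupledSGG()
s, o, logits = model(torch.randn(8, 256))
print(logits.shape)   # (20, 50): far fewer than the 8*7 = 56 exhaustive pairs
```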

4.3. Continual and Incremental Scene Graph Learning

Continual SGG requires models to expand and refine graph vocabularies over time without catastrophic forgetting. CSEGG formalizes three realistic regimes (relationship-incremental, scene-incremental, and relationship-generalization) and introduces replay-via-synthesis (RAS): generated synthetic scenes (from symbolic triplet clusters) are annotated by the model for replay training (Khandelwal et al., 2023). RAS preserves memory and privacy more effectively than image-buffer–based replay, yielding lower forgetting and improved generalization on new classes or relations.

4.4. Geometric and Multimodal Integration

Spatial relations (e.g., “above”, “near”) can often be inferred almost deterministically from geometric cues. Augmenting co-occurrence–driven models such as KERN with explicit geometric binning modules, using centroid distances and direction, yields significant recall gains on spatial predicates (Kumar et al., 2021). Universal approaches, as in USG-Par, further integrate video, 3D, and text representations into holistically grounded graphs, using cross-modal attention and text-centric contrastive learning as semantic anchors (Wu et al., 19 Mar 2025).
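
As a concrete illustration of such geometric features, the helper below computes a normalized centroid offset, distance, and a discretized direction bin for a subject-object box pair; the bin count and normalization are illustrative choices rather than KERN's exact module.

```python
import math

def geometric_features(subj_box, obj_box, img_w, img_h, num_direction_bins=8):
    """subj_box, obj_box: (x1, y1, x2, y2). Returns offset, distance, direction bin."""
    def centroid(b):
        return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

    (sx, sy), (ox, oy) = centroid(subj_box), centroid(obj_box)
    dx, dy = (ox - sx) / img_w, (oy - sy) / img_h          # normalized offset
    dist = math.hypot(dx, dy)                               # normalized distance
    angle = math.atan2(dy, dx) % (2 * math.pi)              # direction subject -> object
    direction_bin = int(angle / (2 * math.pi / num_direction_bins))
    return dx, dy, dist, direction_bin

# e.g. "cup above table": cup centroid sits higher in the image than the table's
print(geometric_features((100, 50, 160, 110), (60, 200, 300, 330), 640, 480))
```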

5. Applications and Impact

Scene graph representations are leveraged in:

  • Visual question answering—providing structured, context-rich features (Mozes et al., 2021).
  • Image captioning—scene graph–to–text pipelines outperform direct image-to-text baselines by significant BLEU and METEOR margins, due to their explicit modeling of compositional object–relation structure (Mozes et al., 2021).
  • Visual navigation—modular scene-graph encoders (GraphMapper) in 3D environments yield higher task success, explainability, and sample efficiency via explicit topological modeling (Seymour et al., 2022).
  • Structure-based image synthesis, anomaly detection, and partial-graph completion using unconditional, generative models such as SceneGraphGen (Garg et al., 2021).
  • Multimodal knowledge representation—for example, universal scene graphs open new paths for multimodal LLM pre-training and cross-modal reasoning in embodied agents (Wu et al., 19 Mar 2025).

6. Limitations and Open Challenges

  • Long-tail distributions and class imbalance remain, partially addressed but not fully solved by current dual-graph and loss-reweighting techniques (Kim et al., 2023, Khandelwal et al., 2022).
  • Ambiguity in annotation, overlapping predicate semantics, and large intra-class variance hinder precise relation classification (Zhu et al., 2022).
  • Spatial grounding granularity varies—predicates may be better localized at the pixel, box, or even region-caption scale; current SGG evaluation metrics inadequately capture graph-level consistency, penalizing or rewarding certain triplet types unevenly.
  • Scaling to large graphs, efficient multimodal fusion, dynamic scene (video, 3D) graph construction, and continual/online adaptability present active frontiers (Wu et al., 19 Mar 2025, Khandelwal et al., 2023).
  • Privacy and annotation cost: location-free SGG and RAS-style replay address some obstacles, but domain transfer and annotation scarcity for new categories continue to challenge scene-graph generalization (Özsoy et al., 2023, Khandelwal et al., 2023).

7. Future Directions

Anticipated directions in scene graph generation include:

  • Pretraining and instruction fine-tuning of large models on USG representations for robust, multimodal semantic understanding (Wu et al., 19 Mar 2025).
  • Self-/weak-supervised or zero-shot SGG, leveraging language, commonsense, and cross-modal alignments to obviate extensive annotation (Khandelwal et al., 2021, Özsoy et al., 2023).
  • Integration of temporal memory, external knowledge bases, and explainable reasoning into graph-based models.
  • Standardization of 3D scene graph formats and benchmarks for generalist agents.
  • Expanded research on online and lifelong SGG capabilities, including privacy-preserving continual learning architectures (Khandelwal et al., 2023).

Scene graph generation has matured from box-pair–driven pipelines to highly contextual, relation-centric, and universal graph parsers, enabling a range of robust, scalable visual inference systems (Zhu et al., 2022, Kim et al., 2023, Wu et al., 19 Mar 2025).
