Open-Vocabulary Scene Graph Generation
- Open-vocabulary scene graph generation is the task of constructing scene graphs with nodes and edges drawn from an unbounded vocabulary, enabling recognition of novel objects and relations.
- Modern approaches use transformer-based, generative, and diffusion techniques to align visual and textual features via large pre-trained vision-language and language models.
- Evaluation relies on metrics such as Recall@K, mean-Recall@K, and reference-free alignment scores, and key challenges include predicate diversity, spatial generalization, and open-set recognition.
Open-vocabulary scene graph generation (OV-SGG) refers to the task of constructing structured graph representations of complex visual scenes, in which nodes encode objects/entities and edges encode their relationships, such that both object and relation labels are drawn from an unbounded, open vocabulary that encompasses concepts unseen during training. This paradigm is motivated by the need for flexible, extensible visual reasoning in unconstrained environments (e.g., robotic navigation, image retrieval, and captioning), where the traditional closed-set assumption (fixed noun and predicate categories) is inadequate. Modern OV-SGG approaches leverage large pre-trained vision-language models (VLMs), large language models (LLMs), and diffusion models, and employ both discriminative and generative mechanisms to predict scene structure beyond the constraints imposed by finite training label sets.
1. Fundamentals and Problem Formulation
In OV-SGG, given an input image (or 3D scene), the objective is to predict a scene graph with nodes representing entities (e.g., objects, rooms) and directed edges representing relations (predicates) between entities. Unlike in conventional SGG, the object label space $\mathcal{C}_o$ and the relation label space $\mathcal{C}_r$ are defined as $\mathcal{C}_o = \mathcal{C}_o^{\text{base}} \cup \mathcal{C}_o^{\text{novel}}$ and $\mathcal{C}_r = \mathcal{C}_r^{\text{base}} \cup \mathcal{C}_r^{\text{novel}}$, where the base and novel subsets denote classes seen and unseen during training, respectively. OV-SGG thus enables both zero-shot inference of novel objects/relations and fine-grained predicate discovery (Elskhawy et al., 1 Apr 2025, Chen et al., 26 May 2025).
Formally, for a visual scene $I$, the model predicts a set of triplets $\langle s, p, o \rangle$, where $s, o \in \mathcal{C}_o$, $p \in \mathcal{C}_r$, and both $\mathcal{C}_o$ and $\mathcal{C}_r$ may include concepts not present in the training data (a minimal scoring sketch is given after the list below). This combinatorial openness presents unique technical challenges:
- Combinatorial explosion of possible triplets
- Distributional bias toward head classes
- Alignment of visual and textual representations for rare or previously unseen predicates/objects.
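The following is a minimal sketch of how such open-vocabulary triplets can be scored: visual features of candidate subject, object, and union regions are compared against text embeddings of an extensible label vocabulary. All tensor names, dimensions, and the temperature value are illustrative assumptions rather than details from a specific paper.

```python
import torch
import torch.nn.functional as F

def score_open_vocab_triplets(subj_feats, obj_feats, union_feats,
                              entity_text_emb, predicate_text_emb, tau=0.07):
    """Return per-pair class distributions over an open entity/predicate vocabulary.

    subj_feats, obj_feats, union_feats: [N, D] visual features of candidate pairs
    entity_text_emb:    [|C_o|, D] text embeddings of the (extensible) object vocabulary
    predicate_text_emb: [|C_r|, D] text embeddings of the (extensible) predicate vocabulary
    """
    def sim(v, t):
        # Cosine-similarity logits, scaled by a temperature tau.
        return F.normalize(v, dim=-1) @ F.normalize(t, dim=-1).T / tau

    p_subj = sim(subj_feats, entity_text_emb).softmax(dim=-1)      # [N, |C_o|]
    p_obj = sim(obj_feats, entity_text_emb).softmax(dim=-1)        # [N, |C_o|]
    p_rel = sim(union_feats, predicate_text_emb).softmax(dim=-1)   # [N, |C_r|]
    # A triplet <s, p, o> for pair i scores as p_subj[i, s] * p_rel[i, p] * p_obj[i, o];
    # enumerating all combinations is usually avoided by keeping top-k candidates per slot.
    return p_subj, p_rel, p_obj

# Example with random features: 4 candidate pairs, 80 entity classes, 50 predicates.
p_s, p_r, p_o = score_open_vocab_triplets(
    torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512),
    torch.randn(80, 512), torch.randn(50, 512))
```

Because the classifier weights are simply rows of a text-embedding matrix, novel classes can be supported at inference time by appending embeddings, without retraining the visual backbone.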
2. Model Architectures and Core Methodologies
Modern OV-SGG frameworks can be grouped by architectural paradigms:
(a) Transformer-based unified frameworks
DETR-style end-to-end pipelines (e.g., OvSGTR, RAHP, ACC) use frozen visual/text backbones with transformer decoders that align visual tokens with object and predicate text embeddings in a shared feature space (Chen et al., 26 May 2025, Liu et al., 2024, Li et al., 8 Nov 2025). These models build on large pretrained vision and language architectures (e.g., Swin-Transformer, CLIP, BERT) and use cross-attention to jointly predict object boxes, class labels, and relations, supporting open-set generalization through visual-concept alignment and retention losses.
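As a concrete illustration of visual-concept alignment, the sketch below builds open-vocabulary classifier weights from frozen CLIP text embeddings. The Hugging Face checkpoint name and prompt templates are assumptions chosen for illustration and do not correspond to the exact configuration of OvSGTR, RAHP, or ACC.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

# Frozen text encoder used as an open-vocabulary classifier head (checkpoint assumed).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def text_classifier_weights(class_names, template="a photo of a {}"):
    """Embed class names into normalized text features; one row per class."""
    prompts = [template.format(c) for c in class_names]
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_text_features(**inputs)   # [C, D]
    return F.normalize(emb, dim=-1)

# Novel classes are supported by simply extending the name lists at inference time.
obj_weights = text_classifier_weights(["person", "surfboard", "unicycle"])
rel_weights = text_classifier_weights(["riding", "holding", "parked on"],
                                      template="a photo of something {} something")
```

Decoder query embeddings (or region features projected into the same space) can then be classified by cosine similarity against these rows, as in the triplet-scoring sketch in Section 1.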
(b) Generative and prompt-based approaches
Sequence-based generation (e.g., PGSG) reframes SGG as an image-to-text task, where a VLM autoregressively generates scene-graph tokens (entity-predicate-entity triplets) that are later parsed and grounded to boxes via attention-based modules (Li et al., 2024). Prompt-based fine-tuning (e.g., SVRP, HardPro) leverages region-caption pretraining and task-specific text templates to align visual and textual embeddings, freezing backbone parameters for maximal open-vocabulary retention (He et al., 2022).
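A minimal sketch of the parsing step in such generative pipelines is given below; the token format ([sub]/[rel]/[obj]) is hypothetical and does not reproduce PGSG's actual special-token vocabulary.

```python
import re

def parse_generated_graph(sequence: str):
    """Parse a generated scene-graph sequence into (subject, predicate, object) triplets."""
    pattern = r"\[sub\](.+?)\[rel\](.+?)\[obj\](.+?)(?=\[sub\]|$)"
    return [(s.strip(), p.strip(), o.strip())
            for s, p, o in re.findall(pattern, sequence)]

seq = "[sub] man [rel] riding [obj] surfboard [sub] surfboard [rel] floating on [obj] wave"
print(parse_generated_graph(seq))
# [('man', 'riding', 'surfboard'), ('surfboard', 'floating on', 'wave')]
```

In the full pipeline, each parsed entity phrase is subsequently grounded to a bounding box via the attention-based grounding module.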
(c) Role-based and context-adaptive prompting
SDSGG employs LLM role-playing to synthesize multiple, scene-adapted textual classifiers, generating diversified description pools for each relation and integrating a mutual visual adapter to model subject–object interaction directly in the visual feature space. A hierarchical renormalization mechanism dynamically weights each description's relevance per scene (Chen et al., 2024).
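A simplified sketch of the description-weighting idea follows: each predicate is represented by several LLM-generated description embeddings, and per-scene relevance weights are renormalized before aggregation. The tensor shapes and the external relevance logits are assumptions; SDSGG's actual adapter and renormalization are more involved.

```python
import torch
import torch.nn.functional as F

def weighted_predicate_scores(pair_feat, desc_emb, desc_relevance):
    """Score predicates for one subject-object pair using weighted description classifiers.

    pair_feat:      [D]       visual feature of the subject-object pair
    desc_emb:       [P, K, D] K description embeddings for each of P predicates
    desc_relevance: [P, K]    per-scene relevance logits for each description
    """
    sims = F.normalize(desc_emb, dim=-1) @ F.normalize(pair_feat, dim=-1)  # [P, K]
    weights = desc_relevance.softmax(dim=-1)   # renormalize relevance per predicate
    return (weights * sims).sum(dim=-1)        # [P] aggregated predicate scores
```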
(d) Diffusion-inversion hybrid techniques
SPADE leverages the spatial structure preserved by the inversion process of diffusion models (e.g., DDIM), calibrating a denoising UNet via LoRA-based cross-attention adaptation. The internal cross-attention maps encode explicit spatial priors valuable for spatially grounded relation prediction, and these priors are passed to a spatial-aware relation graph transformer for context-aware relation query generation (Hu et al., 8 Jul 2025).
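For reference, the snippet below shows a generic LoRA adapter of the kind used to calibrate attention projections while keeping the base weights frozen; it is a standard LoRA sketch, not SPADE's specific cross-attention adaptation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear projection with trainable low-rank matrices A and B."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # original weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no initial shift
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Only the low-rank update is trained, which keeps adaptation of the UNet's cross-attention projections lightweight.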
3. Open-Vocabulary Predicate and Entity Discovery
Open-vocabulary predicate prediction is central to OV-SGG. Current approaches address this challenge using multiple mechanisms:
- Text embedding alignment: Entities and candidate predicates are embedded using large language encoders (e.g., CLIP, BERT), and models compute cosine similarity or projection-based scores between visual features and free-form textual queries. Hierarchical prompt-based methods (RAHP) decompose class names into "super-entities" via semantic clustering and region-aware prompts obtained from LLM decomposition, vastly increasing the diversity and coverage of predicate embeddings (Liu et al., 2024).
- LLM-based triplet extraction: Zero-shot frameworks like PRISM-0 use VLMs to generate natural-language captions for object pairs, then invoke an LLM with a chain-of-thought prompt to extract <subject, predicate, object> triplets. Predicates can be fine-grained or coarse-grained, allowing for arbitrary vocabulary extension. Relation proposals are further validated with a VQA oracle to ensure semantic plausibility (Elskhawy et al., 1 Apr 2025). An illustrative prompt sketch is given after this list.
- Sequence generation: In PGSG, the VLM generates scene graphs as linear token sequences where predicates are not constrained to a closed set. This generative pipeline enables the model to output relational phrases absent from the training annotations (Li et al., 2024).
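The prompt skeleton below illustrates the LLM-based extraction step; the wording is hypothetical and not the actual PRISM-0 prompt.

```python
# Hypothetical chain-of-thought prompt for turning a pairwise caption into a triplet.
TRIPLET_PROMPT = """You are given a caption describing two objects in an image.
Caption: "{caption}"
Think step by step about which object acts on or relates to the other, then answer
with a single triplet in the form <subject, predicate, object>. Use a fine-grained
predicate if the caption supports it."""

def build_prompt(caption: str) -> str:
    return TRIPLET_PROMPT.format(caption=caption)

print(build_prompt("a man is riding a brown horse on the beach"))
```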
4. Relation-aware Pretraining, Retention, and Distillation
Scaling OV-SGG to real-world data requires training schemes that propagate predicate semantics and prevent catastrophic forgetting:
- Weakly-supervised and synthetic pretraining: Large-scale pretraining on synthetic or weakly labeled scene graphs (generated by LLMs/multimodal LLMs, region-caption pairs, or relation-specific VLM prompts) transfers predicate knowledge from broad text corpora to visual feature spaces. For example, FG-OV and relation-aware pretraining with synthetic captions provide substantial gains on rare- and tail-predicate generalization (Neau et al., 1 Sep 2025, Chen et al., 26 May 2025).
- Knowledge retention and distillation: L1 or KL-based feature distillation is introduced during fine-tuning to align student models’ visual–language features with those of a frozen, pre-trained teacher. This preserves the semantic alignment for rare or out-of-distribution classes (Chen et al., 26 May 2025, Li et al., 8 Nov 2025, Chen et al., 2023). ACC and INOVA augment this with interaction-centric distillation losses focusing on pairwise relation structure and interaction-consistent triplet features (Li et al., 8 Nov 2025, Li et al., 6 Feb 2025). A generic loss sketch is given after this list.
- Interaction-centric and query selection mechanisms: By explicitly modeling object–object interactions (interaction-guided query selection, interaction-aware target generation), these frameworks prioritize query assignment and attention to interacting object pairs, reducing noise from non-interacting background and improving relation disambiguation (Li et al., 6 Feb 2025, Li et al., 8 Nov 2025).
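The sketch below shows a generic L1-plus-KL retention loss of the kind described above; the weighting factors, temperature, and feature shapes are illustrative assumptions, and the exact formulations in OvSGTR, ACC, and INOVA differ.

```python
import torch
import torch.nn.functional as F

def retention_loss(student_feats, teacher_feats, student_logits, teacher_logits,
                   tau=2.0, l1_weight=1.0, kl_weight=1.0):
    """Distill a frozen teacher into the fine-tuned student to limit forgetting.

    student_feats, teacher_feats:   [N, D] region-text alignment features
    student_logits, teacher_logits: [N, C] class scores over the base vocabulary
    """
    l1 = F.l1_loss(student_feats, teacher_feats)
    kl = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    return l1_weight * l1 + kl_weight * kl
```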
5. Spatial and Long-range Context Modeling
Spatial reasoning—especially for relative geometric predicates ("on", "left of", "behind")—remains a central challenge:
- Diffusion-based spatial priors: SPADE leverages cross-attention maps from diffusion inversion, encoding spatial layout cues unavailable to standard VLMs, and fine-tunes only the cross-attention heads via LoRA for efficiency (Hu et al., 8 Jul 2025).
- Graph-based and transformer spatial encoders: RAHP and similar hierarchical methods cluster entities and generate context-wise region-aware prompts, which are selected dynamically for each relation proposal, ensuring that both global and local spatial context inform predicate scoring (Liu et al., 2024). LLaVA-SpaceSGG designs dedicated spatial embeddings (relative box offsets, depth cues) that inform both region tokens and relation classification heads, supporting precise layer-based spatial relationships (Xu et al., 2024). A simplified sketch of such pairwise spatial features is given after this list.
- 3D and viewpoint-invariant models: In volumetric or point cloud settings, frameworks such as VIZOR and Open3DSG generate open-vocabulary scene graphs directly from 3D data via segmentation, attribute extraction, and relation labeling based on geometric rules and LLM synthesis, achieving viewpoint invariance and dense spatial relationship sets (Madhavaram et al., 31 Jan 2026, Koch et al., 2024).
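The following sketch computes simple pairwise spatial features (normalized center offsets, relative scale, and an optional relative-depth cue) of the kind such spatial encoders consume; the exact feature set of LLaVA-SpaceSGG is not reproduced here.

```python
import torch

def spatial_pair_features(subj_box, obj_box, subj_depth=None, obj_depth=None):
    """Boxes are (x1, y1, x2, y2) tensors; depths are optional scalar mean depths."""
    sx, sy = (subj_box[0] + subj_box[2]) / 2, (subj_box[1] + subj_box[3]) / 2
    ox, oy = (obj_box[0] + obj_box[2]) / 2, (obj_box[1] + obj_box[3]) / 2
    sw, sh = subj_box[2] - subj_box[0], subj_box[3] - subj_box[1]
    ow, oh = obj_box[2] - obj_box[0], obj_box[3] - obj_box[1]
    feats = [(ox - sx) / sw, (oy - sy) / sh,           # center offsets, normalized by subject size
             torch.log(ow / sw), torch.log(oh / sh)]   # relative log-scale
    if subj_depth is not None and obj_depth is not None:
        feats.append(obj_depth - subj_depth)           # relative depth, e.g. for "in front of"
    return torch.stack([torch.as_tensor(f, dtype=torch.float32) for f in feats])

# Example: subject box above and behind the object, with depth cues.
f = spatial_pair_features(torch.tensor([10., 10., 50., 60.]),
                          torch.tensor([30., 70., 90., 120.]),
                          subj_depth=2.5, obj_depth=1.8)
```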
6. Evaluation Protocols, Metrics, and Downstream Impact
OV-SGG evaluation departs from traditional closed-set metrics to accommodate open sets and long-tail distributions:
- Recall@K and mean-Recall@K (mR@K): Core metrics still include R@K and mR@K, but these are reported for both base and novel splits (OvD, OvR, OvD+R), and for rare or tail classes to quantify generalization (Chen et al., 26 May 2025, Li et al., 8 Nov 2025, Liu et al., 2024, Chen et al., 2023). A minimal computation sketch is given after this list.
- Reference-free image–relation alignment metrics: Metrics such as RelCLIPScore do not require gold predicate sets and instead compute CLIP-based alignment between joint region crops and the predicted relation text, making them robust to missing or incomplete ground truth and to multi-label scenarios. Penalty terms are introduced to avoid trivial solutions and promote relationship diversity (Neau et al., 1 Sep 2025).
- Downstream tasks: Scene-graph-guided captioning, retrieval (sentence-to-graph), VQA, and spatial QA are used to demonstrate the value of the generated graphs for real-world vision–language applications. For instance, PRISM-0 achieves zero-shot performance matching or exceeding supervised SGG on VQA and S2GR tasks (Elskhawy et al., 1 Apr 2025), and PGSG pretraining bolsters performance on RefCOCO and GQA (Li et al., 2024).
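For clarity, the sketch below computes Recall@K and mean-Recall@K over label triplets only; real SGG protocols additionally require IoU-based box matching, which is omitted here.

```python
from collections import defaultdict

def recall_at_k(pred_triplets, gt_triplets, k=50):
    """pred_triplets: (subject, predicate, object) tuples sorted by confidence;
    gt_triplets: ground-truth tuples."""
    hits = set(pred_triplets[:k]) & set(gt_triplets)
    return len(hits) / max(len(gt_triplets), 1)

def mean_recall_at_k(pred_triplets, gt_triplets, k=50):
    """Average per-predicate recall, which upweights rare (tail) predicates."""
    topk = set(pred_triplets[:k])
    per_pred_gt, per_pred_hit = defaultdict(set), defaultdict(set)
    for t in gt_triplets:
        per_pred_gt[t[1]].add(t)
        if t in topk:
            per_pred_hit[t[1]].add(t)
    recalls = [len(per_pred_hit[p]) / len(per_pred_gt[p]) for p in per_pred_gt]
    return sum(recalls) / max(len(recalls), 1)
```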
7. Open Challenges and Future Directions
Despite rapid advances, OV-SGG faces several open technical questions:
- Predicate diversity and label noise: Ensuring precise, fine-grained, and diverse predicate predictions without hallucination or semantic drift when the prediction space is open (Elskhawy et al., 1 Apr 2025, Liu et al., 2024).
- Spatial generalization: Capturing generalizable spatial relations—especially with respect to viewpoint, occlusion, and class imbalance—remains challenging, even for diffusion- and transformer-based models (Hu et al., 8 Jul 2025, Xu et al., 2024).
- Evaluation: Defining robust, annotation-independent metrics for the unbounded predicate and entity space (e.g., RelCLIPScore) and comprehensive benchmarks for both 2D and 3D graphs (Neau et al., 1 Sep 2025).
- Robotic and embodied applications: Frameworks such as Point2Graph, OGScene3D, ZING-3D, and VIZOR enable online, incremental, and 3D-anchored graph construction, crucial for navigation and interaction tasks in open environments (Xu et al., 2024, Zhu et al., 17 Mar 2026, Saxena et al., 24 Oct 2025, Madhavaram et al., 31 Jan 2026).
- Integration with LLMs and advanced VLMs: The use of multi-persona LLM prompting, dynamic prompt generation (RAHP, SDSGG), and hybrid architectures that blend generative, discriminative, and diffusion techniques presents ongoing opportunities to extend OV-SGG beyond current benchmarks (Chen et al., 2024, Liu et al., 2024).
Open-vocabulary scene graph generation thus represents an active frontier at the intersection of vision–language modeling, prompt engineering, spatial reasoning, and open-set recognition, with ongoing innovation in model architectures, training paradigms, and evaluation methodology across both 2D and 3D domains.