SGCLIP: CLIP-based Scene Graph Generation

Updated 4 July 2026

SGCLIP is a CLIP-based foundation model specialized for open-domain scene graph generation that unifies entity, attribute, and relation scoring.
It employs a neurosymbolic pipeline trained on over 87K videos, using automatic captioning and programmatic supervision to align visual and textual data.
The model improves on vanilla CLIP across zero-shot scene-graph benchmarks and plays a key role in reducing perception errors in ESCA-based systems.

SGCLIP most directly denotes Scene-Graph CLIP, the CLIP-based foundation model introduced within ESCA for open-domain scene graph generation. In that formulation, SGCLIP takes an image or image segment together with a set of textual concepts and returns probabilistic scores over entity classes, attributes, and relations, thereby producing scene graphs that can be consumed by embodied agents. The same acronym also appears more loosely across adjacent CLIP-derived literature as shorthand for sign-language, selective, sparse, semantic-guided, or self-calibrated variants, so its meaning is context dependent. The most explicit and canonical use in the provided literature is the ESCA model, where SGCLIP is trained on 87K+ open-domain videos using a neurosymbolic pipeline that aligns automatically generated captions with scene graphs produced by the model itself, without human-labeled scene-graph annotations (Huang et al., 11 Oct 2025).

1. Nomenclature and scope

In the provided literature, SGCLIP is not a universally standardized name. It refers explicitly to Scene-Graph CLIP in ESCA, while several neighboring works present CLIP-derived methods under related labels such as SignCLIP and SC-CLIP, or are described here as SGCLIP-like. Orthographically similar names such as GS-CLIP denote distinct methods with different objectives and architectures (Huang et al., 11 Oct 2025).

Designation	Meaning in the provided literature	Source
SGCLIP	Scene-Graph CLIP for open-domain scene graph generation in ESCA	(Huang et al., 11 Oct 2025)
SignClip / SignCLIP	Gloss-free sign language translation with mouthing cues and multimodal contrastive fusion	(Wu et al., 12 Sep 2025)
SCL-SLT as SGCLIP-like	Selective contrastive learning for gloss-free sign language translation	(Lai et al., 24 Apr 2026)
Sparse CLIP as SGCLIP-style	Sparse, interpretable CLIP trained end-to-end with contrastive loss	(Qin et al., 27 Jan 2026)
SC-CLIP	Self-calibrated CLIP for training-free open-vocabulary segmentation	(Bai et al., 2024)
GS-CLIP	Geometry-aware prompt and synergistic view learning for zero-shot 3D anomaly detection	(Deng et al., 22 Feb 2026)

This suggests that SGCLIP functions both as a specific model name and as a broader family resemblance term for CLIP variants that inject additional structure into the base vision-language paradigm. Within that broader family, the common pattern is to retain CLIP’s shared visual-textual embedding space while modifying supervision, feature composition, or inference structure to improve grounding on tasks that exceed vanilla CLIP’s image-level alignment.

2. Scene-Graph CLIP as a promptable scene-graph model

In ESCA, SGCLIP is a CLIP-based foundation model specialized for open-domain scene graph generation. Its core primitive is a scoring function

$\mathrm{SGClip}(\sigma, \bar{c}) \in \mathbb{R}^{|\bar{c}|},$

where $\sigma$ is an image or image segment and $\bar{c}$ is a set of candidate concepts. SGCLIP reuses CLIP’s visual encoder and text encoder and computes similarity scores between encoded visual regions and encoded textual concepts; the principal changes lie in the inference formulations for entity classes, attributes, and binary relations rather than in a new encoder architecture (Huang et al., 11 Oct 2025).

For entity classification, SGCLIP scores a segment $\sigma$ against a candidate class set and applies a softmax to obtain a categorical distribution. For attribute prediction, it forms a binary contrast between an attribute and its negation, such as a concept $c$ versus $\neg c$, and interprets the softmax probability of $c$ as the attribute confidence. For binary relation prediction, it computes an enclosing region for a subject-object pair, color-tints the two masks differently to preserve directionality, augments the relation phrase with predicted subject and object classes, and contrasts that phrase against a special <norel> token. The output is a probabilistic scene graph with unary facts for entity classes and attributes and binary facts for relations (Huang et al., 11 Oct 2025).
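The three inference formulations above can be sketched with toy embeddings. This is a minimal illustration and not the ESCA implementation: the function names, the cosine-similarity scoring, and the temperature value are assumptions, and the region and concept embeddings are taken as already produced by CLIP's visual and text encoders.

```python
import math
from typing import Dict, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_concepts(region_emb, concept_embs, temperature: float = 0.07) -> List[float]:
    """One score per candidate concept; the temperature is an assumed value."""
    return [cosine(region_emb, c) / temperature for c in concept_embs]


def entity_distribution(region_emb, class_embs: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Entity classes: softmax over the whole candidate class set."""
    names = list(class_embs)
    probs = softmax(score_concepts(region_emb, [class_embs[n] for n in names]))
    return dict(zip(names, probs))


def binary_contrast(region_emb, positive_emb, negative_emb) -> float:
    """Probability of the positive phrase in a two-way softmax.

    Attributes: positive is the attribute, negative is its negation.
    Relations: region_emb would encode the tinted enclosing region, positive
    the class-augmented relation phrase, and negative the <norel> token.
    """
    return softmax(score_concepts(region_emb, [positive_emb, negative_emb]))[0]
```

The same scoring helper serves all three predicate types. Only the choice of region and the composition of the text candidates change, which matches the unification described in the text.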

Promptability is central to the model’s design. SGCLIP does not assume a fixed scene-graph taxonomy. Instead, ESCA first performs concept extraction with an MLLM, producing task-aware sets of entities, attributes, and relations; object candidates are then identified with Grounding DINO + SAM2; SGCLIP scores those segments against the prompt-derived concepts; and the resulting scene graph is summarized back into text for the agent’s planner. This makes SGCLIP an open-vocabulary grounding module rather than a conventional closed-set scene-graph predictor (Huang et al., 11 Oct 2025).
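The four stages described above can be written as a simple composition. The sketch below is only a structural illustration: every callable name is hypothetical, and the real MLLM, Grounding DINO + SAM2, and SGCLIP components are passed in as functions rather than called through any actual library API.

```python
from typing import Callable, Dict, List

Concepts = Dict[str, List[str]]      # keys assumed: "entities", "attributes", "relations"
Segment = Dict[str, object]          # placeholder for a mask plus metadata
SceneGraph = List[Dict[str, object]] # placeholder for scored facts


def run_perception_pipeline(
    image: object,
    instruction: str,
    extract_concepts: Callable[[object, str], Concepts],
    propose_segments: Callable[[object, List[str]], List[Segment]],
    score_scene_graph: Callable[[List[Segment], Concepts], SceneGraph],
    summarize: Callable[[SceneGraph], str],
) -> str:
    """Chain the four stages: concepts, segments, scene graph, text summary."""
    # 1. The MLLM proposes task-aware entities, attributes, and relations.
    concepts = extract_concepts(image, instruction)
    # 2. A detector and segmenter propose object segments for the entity names.
    segments = propose_segments(image, concepts["entities"])
    # 3. The scene-graph model scores segments against the prompt-derived concepts.
    graph = score_scene_graph(segments, concepts)
    # 4. The probabilistic scene graph is turned back into text for the planner.
    return summarize(graph)
```

Because the concept set is produced per task in stage 1, no fixed scene-graph taxonomy appears anywhere in the composition.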

A notable architectural property is that the same CLIP-derived scoring machinery is reused across all three predicate types. Entity, attribute, and relation prediction differ mainly in how textual concepts are composed and how visual regions are defined. This unification is one reason the model can be used both in prompt-based inference and in task-specific fine-tuning while remaining within CLIP’s original similarity-based framework (Huang et al., 11 Oct 2025).

3. Training data and neurosymbolic supervision

SGCLIP is trained on ESCA-Video-87K, a corpus built from LLaVA-Video-178K and consisting of 87,045 clips from ten source datasets: HD-VILA-100M, InternVid-10M, VidOR, VIDAL, YouCook2, Charades, ActivityNet, Kinetics-700, Something-Something v2, and Ego4D. Each datapoint is represented as a 5-tuple

$(\bar{I}, L_{\mathrm{cap}}, \Sigma, \bar{c}, \phi),$

containing video frames, a GPT-4-generated caption, object traces, a concept set, and a temporal programmatic specification (Huang et al., 11 Oct 2025).
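As a reading aid, the 5-tuple can be pictured as a small record type. The field names and container types below are assumptions chosen for illustration; only the five components and their order come from the description above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ConceptSet:
    """Candidate concepts; the three-way split is an assumed layout."""
    entities: List[str]
    attributes: List[str]
    relations: List[str]


@dataclass
class VideoDatapoint:
    frames: List[Any]                    # \bar{I}: sampled video frames
    caption: str                         # L_cap: generated caption
    object_traces: Dict[int, List[Any]]  # \Sigma: object id -> per-frame masks
    concepts: ConceptSet                 # \bar{c}: concept set
    specification: str                   # \phi: temporal programmatic specification
```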

The object traces $\Sigma$ are produced with a SAM2-based pipeline that combines dense prompting, forward and backward mask propagation, and iterative discovery of new objects through Grounding DINO or YOLO followed by SAM2 mask generation. Concept sets and temporal specifications are extracted from captions by GPT-4, which is prompted to produce event-centric symbolic descriptions with relations such as name, unary predicates, binary predicates, durations, and explicit temporal fractions. A compiler then validates and converts those specifications into executable programs (Huang et al., 11 Oct 2025).

Training is neurosymbolic. SGCLIP generates spatio-temporal scene graphs from object traces, and Scallop evaluates whether those probabilistic scene graphs satisfy the programmatic specification. The paper describes three loss components: a contrastive loss that distinguishes matching from mismatched video-specification pairs, a temporal loss that aligns predicted event satisfaction with annotated temporal intervals, and a semantic loss that penalizes implausible negative keywords selected from the top 5,000 frequent keywords using SpaCy-based semantic distance. Long specifications are split into chunks with up to 3 events to keep alignment tractable (Huang et al., 11 Oct 2025).
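Two of the mechanics above lend themselves to a short sketch: splitting a long specification into chunks of at most 3 events, and a contrastive objective over video-specification pairs. The chunking follows the stated limit directly. The loss is an assumed binary cross-entropy form chosen for illustration, not the paper's exact formulation, and `sat_prob[i][j]` is a hypothetical matrix holding the probability that video i satisfies specification j.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_events(events: Sequence[T], max_events: int = 3) -> List[List[T]]:
    """Split an ordered event list into consecutive chunks of at most max_events."""
    if max_events < 1:
        raise ValueError("max_events must be at least 1")
    return [list(events[i:i + max_events]) for i in range(0, len(events), max_events)]


def contrastive_bce(sat_prob: Sequence[Sequence[float]], eps: float = 1e-7) -> float:
    """Binary cross-entropy over a square satisfaction matrix (assumed loss form).

    Diagonal entries are matching video-specification pairs (target 1);
    off-diagonal entries are mismatched pairs (target 0).
    """
    n = len(sat_prob)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = min(max(sat_prob[i][j], eps), 1.0 - eps)  # clamp away from 0 and 1
            total += -math.log(p) if i == j else -math.log(1.0 - p)
    return total / (n * n)
```

In the pipeline described above, the satisfaction probabilities would come from evaluating the probabilistic scene graphs against each specification; here they are plain numbers so the sketch stays self-contained.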

The optimization setup is correspondingly lightweight but structured. SGCLIP is fine-tuned from CLIP for 3 epochs on ESCA-Video-87K with learning rate $1\times10^{-6}$, batch size 2, and 1 FPS video sampling. The semantic loss uses 5 negative keywords per instance with weight 0.1, and Scallop provenance uses difftopkproofs with top-$k$ proofs (Huang et al., 11 Oct 2025). The practical significance of this pipeline is that it replaces manually labeled scene graphs with specification-level supervision derived from automatically generated captions and symbolic programs.
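The reported hyperparameters can be gathered into a single configuration object. The class and field names are hypothetical; the values are those stated above, and the number of proofs kept by the provenance is left unset because its value is not given here. The step count is a rough estimate that assumes every clip is seen once per epoch with no gradient accumulation.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FineTuneConfig:
    epochs: int = 3
    learning_rate: float = 1e-6
    batch_size: int = 2
    frames_per_second: int = 1
    negative_keywords_per_instance: int = 5
    semantic_loss_weight: float = 0.1
    scallop_provenance: str = "difftopkproofs"
    top_k_proofs: Optional[int] = None  # not recoverable from the text


def optimizer_steps(num_clips: int, cfg: FineTuneConfig) -> int:
    """Estimated optimizer steps, assuming one pass over all clips per epoch."""
    return cfg.epochs * math.ceil(num_clips / cfg.batch_size)
```

Under those assumptions, `optimizer_steps(87045, FineTuneConfig())` gives the number of updates for the full corpus, which illustrates how small the batch size is relative to the dataset.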

4. Empirical performance and embodied-agent integration

SGCLIP is evaluated both as an independent grounding model and as the perception core of ESCA. In zero-shot scene-graph and relation benchmarks, the paper reports consistent gains over vanilla CLIP. On OpenPVSG, class R@1 improves from 16.33% for CLIP to 23.35–23.68% for SGCLIP; on Action Genome, class R@1 improves from 11.87% to about 17.68%; and on VidVRD, class R@1 rises from 62.67% to 71.00%. The same trend appears in relation prediction, and training on larger subsets of ESCA-Video-87K yields monotonic gains, indicating that the neurosymbolic supervision scales with data volume (Huang et al., 11 Oct 2025).
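To put the class R@1 figures on a common footing, the short calculation below derives absolute and relative gains from the numbers quoted above. For OpenPVSG it uses the lower end of the reported SGCLIP range, and for Action Genome it uses the approximate value given in the text, so the results are indicative only.

```python
from typing import Dict, Tuple

# (vanilla CLIP, SGCLIP) class R@1 in percent, as quoted in the text.
REPORTED_CLASS_R1: Dict[str, Tuple[float, float]] = {
    "OpenPVSG": (16.33, 23.35),       # lower end of the 23.35-23.68 range
    "Action Genome": (11.87, 17.68),  # the text gives this as approximate
    "VidVRD": (62.67, 71.00),
}


def gains(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent of baseline)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


summary = {name: gains(*pair) for name, pair in REPORTED_CLASS_R1.items()}
```

The absolute gains are of similar size across the three benchmarks (roughly 6 to 8 points), while the relative gain is much larger where the CLIP baseline is low.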

Transfer beyond scene graphs is demonstrated on ActivityNet. In zero-shot action recognition, SGCLIP reaches 76.34% accuracy versus 74.37% for CLIP. With 1% labeled training data, SGCLIP reaches 80.10% versus 78.79% for CLIP, and with 5% labeled data it reaches 86.05%, again exceeding the CLIP baseline. On VidVRD relation tagging after task-specific fine-tuning, SGCLIP initialization outperforms CLIP initialization, including Prec@10 = 0.278 and Rec@10 = 0.386 versus 0.246 and 0.353 for CLIP initialization (Huang et al., 11 Oct 2025).

Inside ESCA, SGCLIP is the visual grounding stage between concept extraction and visual summarization. The full pipeline is: concept extraction with an MLLM; segment generation through Grounding DINO + SAM2; scene-graph prediction by SGCLIP; and conversion of the probabilistic scene graph into textual and visual summaries for the planner. On embodied benchmarks, ESCA with SGCLIP improves both open-source and commercial MLLM agents. The paper reports that InternVL-2.5 + ESCA on EB-Navigation surpasses base GPT-4o, and that perception errors on EB-Navigation drop from 69% to 30% with ESCA. Similar reductions in perception-related failures are reported for EB-Manipulation, EB-Habitat, and EB-Alfred (Huang et al., 11 Oct 2025).

These results situate SGCLIP as more than a scene-graph benchmark model. In ESCA it functions as a structured grounding interface that maps low-level segments to symbolic facts such as object names, attributes, and relations, which the downstream MLLM can incorporate into planning. A plausible implication is that the model’s main utility lies not only in scene-graph accuracy but also in converting visual evidence into a representation that is more congenial to reasoning and action selection.

5. SGCLIP as a broader CLIP-derived design pattern

Outside ESCA, the provided literature shows that SGCLIP-like systems typically retain CLIP’s shared embedding space while adding explicit structure matched to a downstream task. In generalized category discovery, CLIP-GCD uses CLIP as a backbone and augments image embeddings with top-$k$ retrieved textual descriptions from corpora such as CC-3M, CC-12M, MS COCO, LAION-400M, and LAION-5B, then performs joint image+text semi-supervised clustering; this improves both known and novel class discovery, especially on fine-grained and out-of-distribution domains (Ouldnoughi et al., 2023). In gloss-free sign language translation, SignClip combines a frozen CLIP ViT-L/14 gesture stream with an Av-HuBERT mouthing stream and two InfoNCE-style objectives, yielding BLEU-4 = 24.71 and ROUGE-L = 48.38 on PHOENIX14T in the gloss-free setting (Wu et al., 12 Sep 2025).
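For readers unfamiliar with the InfoNCE-style objectives mentioned above, a generic one-directional form is sketched below. It is the textbook loss over a similarity matrix with positives on the diagonal, not SignClip's specific formulation, and the temperature is an assumed value.

```python
import math
from typing import Sequence


def info_nce(sim: Sequence[Sequence[float]], temperature: float = 0.07) -> float:
    """Mean InfoNCE loss for a square similarity matrix.

    sim[i][j] is the similarity between anchor i and candidate j; the
    positive for anchor i is candidate i, and all other j act as negatives.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)
        # Stable log-sum-exp over all candidates for this anchor.
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n
```

When all candidates are equally similar to the anchor, the loss equals the logarithm of the number of candidates, which is a convenient sanity check for an implementation.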

A different axis is contrastive sample selection. Selective Contrastive Learning for SLT argues that random in-batch negatives are often trivial or false negatives in gloss-free sign language translation, especially when many texts are identical or near-duplicates. It introduces a Pair Selection strategy based on similarity trajectories from reference checkpoints and reports that on PHOENIX14T end-to-end training, BLEU-4 rises from 21.97 for the baseline and 22.03 for standard contrastive learning to 25.30 with selective contrastive learning (Lai et al., 24 Apr 2026). Here the SGCLIP-like aspect is not a new encoder but a new policy for constructing effective CLIP-style supervision.

Other variants alter representation geometry itself. Sparse CLIP integrates sparsity directly into CLIP training by replacing the final projection layers with overcomplete ReLU heads, preserving strong downstream performance while improving interpretability; on ViT-L/14, the sparse model improves average zero-shot classification from 75.1% to 75.6% and yields highly multimodal sparse units (Qin et al., 27 Jan 2026). Self-Calibrated CLIP modifies only inference-time behavior in the last visual layer, identifies anomaly tokens with LOF, recalibrates deep features using mid-layer similarity, and raises average training-free open-vocabulary segmentation performance to 43.9% mIoU on ViT-B/16 and 45.2% on ViT-L/14, including a 6.8× improvement over vanilla CLIP ViT-L/14 (Bai et al., 2024).

Orthographically similar 3D variants extend the same pattern into geometric domains. GS-CLIP for zero-shot 3D anomaly detection learns geometry-aware prompts with PointNet++ and a Geometric Defect Distillation Module, then fuses rendered and depth views with a Synergistic Refinement Module; on MVTec3D-AD it reports O-AUROC 83.6, O-AP 96.5, P-AUROC 96.3, and P-PRO 86.4 in the one-vs-rest setting (Deng et al., 22 Feb 2026). This suggests that, beyond the specific ESCA model name, SGCLIP has come to denote a recurrent research strategy: preserve CLIP’s cross-modal prior, then add task-specific mechanisms for locality, structure, sparsity, selection, or geometry.

6. Limitations and prospective directions

For Scene-Graph CLIP in ESCA, the principal limitations are tied to the full embodied pipeline. The paper notes latency and real-time constraints, since the system combines large MLLMs, SAM2, Grounding DINO, and SGCLIP; a 2D-only visual representation, with depth and 3D spatial reasoning only indirectly represented; dependence on GPT-4-generated captions and programmatic specifications, which can propagate biases or hallucinations; and the absence of an explicit temporal module inside SGCLIP itself, since temporal reasoning is delegated to the neurosymbolic layer (Huang et al., 11 Oct 2025).

Across adjacent SGCLIP-style methods, recurring bottlenecks are similarly structural rather than purely parametric. CLIP-GCD is sensitive to corpus quality, coverage, and the choice of top-$k$ retrieved captions (Ouldnoughi et al., 2023). SignClip depends on mouth visibility, landmark quality from IBUG / FAN, and the stability of mouthing features under occlusion or blur (Wu et al., 12 Sep 2025). SCL-SLT requires an expensive reference contrastive training stage to estimate negative-pair trajectories (Lai et al., 24 Apr 2026). Sparse CLIP introduces parameter and memory overhead through overcomplete projections and shows weaker retrieval than classification (Qin et al., 27 Jan 2026). SC-CLIP remains tightly coupled to ViT-style CLIP and requires backbone-specific tuning of anomaly counts, layer indices, and similarity thresholds (Bai et al., 2024). GS-CLIP assumes access to 3D point clouds and incurs additional inference cost from dual visual streams and multi-view rendering (Deng et al., 22 Feb 2026).

The aggregate research trajectory points toward increasingly structured CLIP variants. For SGCLIP proper, the most immediate extensions are those already indicated by ESCA: integrating 3D representations, adding stronger temporal modeling, reducing inference latency, and refining the interaction between prompt-derived concept sets and grounded scene-graph facts (Huang et al., 11 Oct 2025). A plausible implication is that future SGCLIP systems will be judged less by generic zero-shot similarity alone than by how well they mediate between perception, structured world models, and downstream reasoning.