
JVGN: Joint Visual Grounding with Scene Graphs

Updated 29 January 2026
  • The paper introduces JVGN, a novel approach that jointly infers groundings for referents and context using language scene graphs.
  • It employs a CRF-based graphical model with sum-product message passing to integrate unary and relational visual-text alignment.
  • The framework outperforms previous methods on RefCOCO benchmarks, offering improved interpretability through explicit contextual modeling.

Joint Visual Grounding with Language Scene Graphs (JVGN) refers to a family of vision-language models that explicitly leverage the compositional structure of referring expressions by representing them as language scene graphs, jointly inferring groundings for both the referent and unlabeled context objects/relations to improve accuracy and interpretability. This approach stands in contrast to traditional models that treat referring-expression grounding as a holistic mapping from sentence to region, without explicit modeling of context and relational structure. JVGN is a principled graphical-model approach in which sentence syntax and semantics are reflected in a scene graph, and inference leverages both unary and relational potentials, enabling context-aware, explainable visual grounding.

1. Language Scene Graphs and Graphical Modeling

Central to JVGN is the concept of a Language Scene Graph (LSG), which encodes the parse of a referring expression as a directed acyclic graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where:

  • Each node $v_i\in\mathcal{V}$ denotes an entity mention, consisting of a head noun and associated attributes (e.g., $v = (\text{truck}, \{\text{white}\})$).
  • Directed, labeled edges $(v_s, r, v_o)\in\mathcal{E}$ correspond to binary relations as expressed in the referring phrase (e.g., in “the white truck in front of the yellow one,” relations such as “in front of”).
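The node/edge structure above can be sketched as a small data structure. The class names, field layout, and the zero-in-degree referent check below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    head: str                                  # head noun, e.g. "truck"
    attributes: frozenset = field(default_factory=frozenset)

@dataclass
class LSG:
    nodes: list                                # Node objects
    edges: list                                # (subject_idx, relation, object_idx) triples

# "the white truck in front of the yellow one"
g = LSG(
    nodes=[Node("truck", frozenset({"white"})),
           Node("truck", frozenset({"yellow"}))],
    edges=[(0, "in front of", 1)],
)

# The referent is (typically) the only node with zero in-degree.
in_degree = [0] * len(g.nodes)
for _, _, o in g.edges:
    in_degree[o] += 1
referent = in_degree.index(0)  # node 0, the white truck
```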

The LSG identifies a unique referent node $v_1$ as the entity being grounded (typically the only node with zero in-degree), while all other nodes correspond to contextual objects/attributes that disambiguate the referent. This graph structure induces a conditional random field (CRF) in which each node variable $\gamma_i$ ranges over all candidate image regions $\{b_1,\dots,b_N\}$ and the joint probability of groundings is factorized as:

$P(\gamma_1,\dots,\gamma_M\mid \mathcal{G},\mathcal{I})\;\propto\;\prod_{v_i\in\mathcal{V}}\psi_v(\gamma_i)\;\prod_{(v_s,r,v_o)\in\mathcal{E}}\psi_e\bigl(\gamma_s,\gamma_o\bigr)$

where $\psi_v$ is a unary potential encoding visual-textual similarity, and $\psi_e$ is a binary potential encoding visual consistency under textual relations (Liu et al., 2019).
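The factorization can be illustrated on a toy two-node graph with one edge; the random potentials, shapes, and brute-force normalization below are hypothetical placeholders, not the trained potentials of the paper:

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 2                        # N candidate boxes, M graph nodes
edges = [(0, 1)]                   # (subject node, object node)

# Toy potentials: psi_v[i, j] = unary score of node i on box j,
# psi_e[e][js, jo]             = pairwise score of edge e for boxes (js, jo).
psi_v = rng.random((M, N)) + 1e-3
psi_e = {e: rng.random((N, N)) + 1e-3 for e in edges}

def joint(assignment):
    """Unnormalized P(gamma_1, ..., gamma_M) for one box per node."""
    p = np.prod([psi_v[i, assignment[i]] for i in range(M)])
    for (s, o) in edges:
        p *= psi_e[(s, o)][assignment[s], assignment[o]]
    return p

# Normalize over all N**M assignments to obtain the CRF distribution.
Z = sum(joint(a) for a in product(range(N), repeat=M))
P = {a: joint(a) / Z for a in product(range(N), repeat=M)}
```

Brute-force enumeration is only feasible for toy sizes; the paper's inference instead uses message passing, described next.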

2. Inference by Marginalization and Message Passing

A unique property of JVGN is that, while only the referent node receives ground-truth supervision (i.e., provides a labeled region in the image), the graphical formulation enables joint likelihood maximization via marginalization over all context nodes. The marginalization objective is:

$P(\gamma_1\mid\mathcal{G},\mathcal{I}) = \sum_{\gamma_2,\dots,\gamma_M}P(\gamma_1,\dots,\gamma_M\mid\mathcal{G},\mathcal{I})$

Exact inference is tractable in loop-free graphs and is performed by sum-product belief propagation (message passing) for two iterations, yielding node marginals $P(\gamma_i)$ for all objects in the scene graph. During learning, only the referent node's marginal is forced to match its ground truth, either as a hard (one-hot) target or a soft (IoU-based) distribution over boxes. The training objective is the Kullback–Leibler divergence between predicted and target referent marginals; in the "gt" setting this reduces to standard cross-entropy loss, while in the detection ("det") setting with automatically detected regions, a soft label is used:

$L = \sum_i p_i^*\log\!\frac{p_i^*}{P(\gamma_1=b_i)} \qquad \text{with}\quad p_i^*\propto\max\{0,\ \mathrm{IoU}(b_i,b_{\mathrm{gt}})-\eta\}$

Marginalization during both training and inference yields higher accuracy compared to ablations that apply it to only one phase or not at all (Liu et al., 2019).
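For a two-node graph (referent plus one context node), marginalization by a single sum-product message and the soft-target KL loss above can be sketched as follows; the potentials, IoU values, and threshold are random or hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4                                  # candidate boxes
psi_ref = rng.random(N) + 1e-3         # unary potential of the referent node
psi_ctx = rng.random(N) + 1e-3         # unary potential of one context node
psi_pair = rng.random((N, N)) + 1e-3   # pairwise potential (referent box, context box)

# Sum-product message from the context node to the referent:
# m[i] = sum_j psi_pair[i, j] * psi_ctx[j], i.e. the context is marginalized out.
msg = psi_pair @ psi_ctx
marg = psi_ref * msg
marg /= marg.sum()                     # P(gamma_1 = b_i), the referent marginal

# Soft target from IoU with the ground-truth box (IoU values hypothetical).
iou = np.array([0.9, 0.4, 0.6, 0.1])
eta = 0.5
p_star = np.maximum(0.0, iou - eta)
p_star /= p_star.sum()                 # here: [0.8, 0.0, 0.2, 0.0]

# KL divergence between the soft target and the predicted referent marginal.
supp = p_star > 0
kl = np.sum(p_star[supp] * np.log(p_star[supp] / marg[supp]))
```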

3. Joint Grounding: Accuracy, Interpretability, and Context

Empirically, JVGN achieves state-of-the-art performance on major referring expression grounding datasets (RefCOCO, RefCOCO+, RefCOCOg) in both "gt" and "det" settings. Example quantitative results (det-setting):

Dataset     JVGN (%)   Prior Best (%)
RefCOCO     84.51      83.14
RefCOCO+    76.06      73.65
RefCOCOg    71.01      68.67

Supporting-object (context) accuracy is highest among all tested baselines, validated by both automated metrics and human evaluation. Human raters judged JVGN's scene-graph grounding as more "clear" (approx. 73% clarity) compared to tree-based baselines (approx. 59%). Qualitative analyses show JVGN disambiguates between visually similar objects by leveraging relational context, with the message-passing process sharpening marginals at both referent and context nodes (Liu et al., 2019).

4. Comparative Frameworks and Extensions

Subsequent research has extended this paradigm to complex scene graph grounding tasks and video question answering. Notably:

  • Visio-Lingual Message Passing GNN (VL-MPAG Net) (Tripathi et al., 2022) generalizes to arbitrary scene graphs over images by fusing both linguistic and visual structure using multi-step message passing between graph proposals and scene graph queries. This yields superior grounding performance compared to CRF-based, node-only, or transformer triplet-matching methods across Visual Genome, COCO-Stuff, VRD, and other datasets.
  • In video QA, symbolic scene graph grounding has been modularized in the SG-VLM framework, which extracts, selects, and localizes scene graphs in video frames using only prompting and frozen vision-language models (VLMs such as Qwen2.5-VL or InternVL) (Ma et al., 15 Sep 2025). Here, scene graphs provide symbolic, temporally indexed representations that help disambiguate events and relations across video sequences, improving temporal and causal reasoning over vanilla VLMs.

5. Methodological Advancements: Inference, Training, and Architectural Details

The JVGN approach leverages dedicated pipelines for scene graph parsing, region proposal (e.g., Faster R-CNN), and embedding alignment. Features for both regions ($\mathbf{x}_j$) and language (GloVe embeddings, BiLSTM encodings) are mapped to a common space for potential calculations. Factor potentials are defined as parametrized neural functions operating over normalized, linearly projected concatenations of features. During learning, standard optimizers (e.g., Adam with a decay schedule) are used for end-to-end training, although more recent extensions like SG-VLM operate entirely with frozen VLM backbones and prompting, eschewing additional parameter updates (Ma et al., 15 Sep 2025).
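A toy sketch of such a parametrized unary potential, projecting region and phrase features into a shared space and scoring their concatenation; the dimensions, weight initialization, and function names are assumptions for illustration, with random untrained weights standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
d_vis, d_txt, d = 2048, 300, 512       # hypothetical feature sizes

def l2norm(x):
    """L2-normalize a feature vector, guarding against zero norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

# Linear projections into a shared space (these would be learned end-to-end).
W_vis = rng.standard_normal((d_vis, d)) * 0.01
W_txt = rng.standard_normal((d_txt, d)) * 0.01
W_out = rng.standard_normal((2 * d, 1)) * 0.01

def unary_potential(x_region, e_phrase):
    """psi_v: score a (region feature, phrase embedding) pair."""
    z = np.concatenate([l2norm(x_region @ W_vis), l2norm(e_phrase @ W_txt)])
    return float(np.exp(z @ W_out))    # exp keeps the potential positive

score = unary_potential(rng.standard_normal(d_vis), rng.standard_normal(d_txt))
```

The exponential keeps potentials strictly positive, so products of potentials in the CRF remain well defined before normalization.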

VL-MPAG Net introduces three-stage message passing (visio-lingual/auxiliary, intra-query-graph, intra-proposal-graph), with proposal/query node aggregation and auxiliary edge attention mechanisms, yielding context-dependent query-proposal similarities and direct optimization of both object and relation losses. This results in strong generalization even to unseen object classes.

6. Limitations, Open Challenges, and Future Directions

Identified limitations of JVGN and related frameworks include:

  • Reliance on off-the-shelf parsers for language scene graph extraction; errors in parsing propagate to downstream grounding.
  • Restriction to pairwise (binary) relations, omitting higher-order or compositional language phenomena.
  • Computational complexity scaling with sentence/graph size; although tractable for standard referring expressions, longer or denser queries pose challenges.
  • In SG-VLM, the quality of symbolic scene graphs is limited by the noisiness of VLM-based relation extraction, and explicit causal "why" questions remain a bottleneck (Ma et al., 15 Sep 2025).

Future research aims to integrate end-to-end differentiable scene-graph parsing, augment potential functions with richer spatial or higher-order context, and extend joint grounding techniques to a broader class of vision-language reasoning tasks, such as open-ended video question answering and compositional image captioning.

7. Context and Significance in the Vision-Language Landscape

JVGN and its descendants have established the value of explicit joint modeling of referent and context in visual grounding, providing a bridge between structured linguistic analysis and deep vision architectures. By mapping language scene graphs to image regions via message passing and marginalization, these approaches offer not only quantifiable gains in standard benchmarks but also improved interpretability by surfacing context-dependent and relational grounding decisions. This methodology underpins ongoing hybrid efforts that combine symbolic graph-based reasoning and foundation model backbones for interpretable, compositional scene understanding (Liu et al., 2019, Tripathi et al., 2022, Ma et al., 15 Sep 2025).
