Click2Graph: User-Guided Video Scene Graphs

Updated 22 November 2025
  • Click2Graph is an interactive system for panoptic video scene graph generation that converts single visual prompts into temporally consistent and interpretable outputs.
  • It integrates SAM2 video segmentation with subject-conditioned object discovery and semantic-predicate reasoning to generate fine-grained panoptic masks and subject–object triplets.
  • By focusing processing on user-specified subjects, Click2Graph minimizes computational overhead while achieving competitive performance in recall and spatial interaction metrics.

Click2Graph is an interactive system for Panoptic Video Scene Graph Generation (PVSG) that responds to a single human visual prompt—such as a click, bounding box, or rough mask in any frame—to produce temporally consistent, interpretable video scene graphs. Integrating the promptable video segmentation capabilities of SAM2 with subject-conditioned object discovery and semantic–predicate reasoning, Click2Graph delivers user-controllable and focused video scene understanding. The framework generates fine-grained panoptic masks, discovers and segments interacting objects, and predicts subject–predicate–object triplets, maintaining temporal consistency across frames and producing compact, interpretable representations of subject-centered visual relationships (Ruschel et al., 20 Nov 2025).

1. Motivation and Problem Definition

Panoptic Video Scene Graph Generation (PVSG) requires generating, for every frame of a video, accurate panoptic segmentation masks (covering both “things” and “stuff” with pixel-level precision), semantic triplets describing entity interactions, and temporally linked tracklets for instance and relation continuity. Traditional PVSG pipelines operate in a fully automatic, feed-forward manner, often relying on dense object-pair proposals and offering limited user agency or correction capability. This results in excessive computational overhead and limited relevance for downstream applications where focus on user-specified entities is crucial.

Click2Graph addresses these limitations by introducing controllability through interactive prompting. The user specifies a subject of interest with a single click, bounding box, or mask, which serves as an explicit input to the system. This prompt steers the segmentation and relational reasoning exclusively toward the subject, enabling focused and efficient computation, as well as interpretable outputs directly anchored to user intent. Efficiency arises because downstream analysis conditions exclusively on the prompted subject, circumventing the need to process all possible object pairs per frame (Ruschel et al., 20 Nov 2025).

2. Architecture and Pipeline

The Click2Graph pipeline decomposes into several tightly integrated components:

  • Inputs: Video frames $V = \{I_1, \ldots, I_T\}$ and a user prompt $P$ (point, box, or mask) at frame $i$.
  • SAM2 Backbone: The prompt $P$ is encoded by a frozen SAM2 video segmentation model, producing a subject mask sequence $SM_{1\ldots T}$ propagated over all frames, along with per-frame visual features $F_{1\ldots T}$ via a Vision Transformer (ViT) encoder.
  • Dynamic Interaction Discovery Module (DIDM): At each frame, DIDM processes $F_t$ and $SM_t$ to predict $N_q$ 2D points $\{\hat{p}_j\}$ as subject-conditioned prompts for potential interacting objects.
  • Object Segmentation: Each predicted object point $\hat{p}_j$ is used as a prompt to SAM2’s mask decoder, generating object masks $OM_{j,t}$ and associated feature tokens.
  • Semantic Classification Head (SCH): Aggregates features over time for subject and objects, then for each subject–object pair predicts the subject ($s$), object class ($o_j$), and predicate ($r_j$), outputting sets $\langle s, o_j, r_j, SM_{1\ldots T}, OM_{j,1\ldots T} \rangle_{j=1\ldots N_q}$.

Data flow proceeds as follows:

  1. $P \to$ SAM2 Prompt Encoder $\to SM_t, F_t$
  2. $(F_t, SM_t) \to$ DIDM $\to \{\hat{p}_j\}$
  3. $\{\hat{p}_j\}$ + SAM2 Prompt Encoder $\to OM_{j,t}$ (object masks/tokens)
  4. $(SM\text{-query}, OM_{j,t}\text{-query}) \to$ SCH $\to (s, o_j, r_j)$

This pipeline realizes end-to-end, subject-conditioned, user-guided PVSG with panoptic grounding and relational inference (Ruschel et al., 20 Nov 2025).
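
The following Python-style sketch illustrates this data flow at a high level. All interfaces (`sam2.segment_video`, `sam2.decode_mask`, `didm`, `sch`) are hypothetical stand-ins for the components described above, not the authors’ released API.

```python
# High-level sketch of the Click2Graph data flow (hypothetical interfaces).
# `sam2`, `didm`, and `sch` stand in for the frozen SAM2 backbone, the Dynamic
# Interaction Discovery Module, and the Semantic Classification Head.

def click2graph(frames, prompt, sam2, didm, sch, n_q=3):
    # 1. Prompt -> SAM2: subject masks SM_t and per-frame features F_t.
    subject_masks, features = sam2.segment_video(frames, prompt)

    object_masks = {j: [] for j in range(n_q)}
    for F_t, SM_t in zip(features, subject_masks):
        # 2. (F_t, SM_t) -> DIDM: N_q candidate interaction points.
        points = didm(F_t, SM_t)                          # (n_q, 2), normalized
        # 3. Each point re-prompts SAM2's mask decoder -> object masks/tokens.
        for j, p_hat in enumerate(points):
            OM_jt, _token = sam2.decode_mask(F_t, point_prompt=p_hat)
            object_masks[j].append(OM_jt)

    # 4. Pooled subject/object features -> SCH: classes and predicate per pair.
    triplets = []
    for j in range(n_q):
        s_cls, o_cls, rel = sch(subject_masks, object_masks[j], features)
        triplets.append((s_cls, rel, o_cls))
    return subject_masks, object_masks, triplets
```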

3. Dynamic Interaction Discovery and Semantic Classification

Dynamic Interaction Discovery Module (DIDM)

DIDM employs a lightweight Transformer decoder with $N_q$ learnable object-query embeddings $Q_{obj}$ and a subject token $q_{sub}$, initialized by fusing a learnable embedding $e_{sub}$ with mask-averaged features from $F_t$. Through $L$ decoder layers, cross-attention is performed between queries and $F_t$, enabling object queries to focus on regions likely to interact with the subject.
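
A minimal PyTorch-style sketch of this query construction and decoding step (including the point-prediction head described next) is given below; layer counts, hidden sizes, and module choices are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class DIDM(nn.Module):
    """Sketch of the Dynamic Interaction Discovery Module (illustrative only)."""

    def __init__(self, d_model=256, n_queries=3, n_layers=2, n_heads=8):
        super().__init__()
        self.obj_queries = nn.Parameter(torch.randn(n_queries, d_model))  # Q_obj
        self.sub_embed = nn.Parameter(torch.randn(d_model))               # e_sub
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        # Point head: two-layer MLP + sigmoid -> normalized (x, y) coordinates.
        self.point_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 2), nn.Sigmoid(),
        )

    def forward(self, F_t, SM_t):
        # F_t: (B, HW, d_model) frame features; SM_t: (B, HW) binary subject mask.
        mask = SM_t.unsqueeze(-1).float()
        sub_feat = (F_t * mask).sum(1) / mask.sum(1).clamp(min=1.0)  # mask-averaged
        q_sub = self.sub_embed + sub_feat                            # fused subject token
        queries = torch.cat(
            [q_sub.unsqueeze(1), self.obj_queries.expand(F_t.size(0), -1, -1)], dim=1)
        refined = self.decoder(tgt=queries, memory=F_t)              # cross-attend to F_t
        return self.point_head(refined[:, 1:])                       # (B, N_q, 2) points
```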

Refined object queries $q_j^f$ are passed through a point-prediction head (two-layer MLP plus sigmoid activation), yielding normalized 2D coordinates $\hat{p}_j$. Ground-truth object points $p_j^*$ are sampled within real object masks using distance-transform biasing, and predictions are matched to real objects via Hungarian assignment. DIDM is trained using an $L_2$ localization loss

$$L_{L2} = \sum_{j=1}^{N_q} \|\hat{p}_j - p^*_{\pi(j)}\|_2^2$$

with matching as in DETR. Set-based matching enables a fixed set of $N_q$ queries to handle scenes where the true number of interacting objects varies ($M_i \leq N_q$) (Ruschel et al., 20 Nov 2025).
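
The ground-truth point sampling and matched loss can be read concretely as in the sketch below, which uses SciPy’s distance transform and Hungarian solver; this is an illustrative re-implementation under those assumptions, not the authors’ code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from scipy.optimize import linear_sum_assignment

def sample_gt_point(object_mask):
    """Sample a ground-truth point inside a binary mask, biased toward the
    mask interior via the Euclidean distance transform (illustrative)."""
    dist = distance_transform_edt(object_mask)       # larger far from the boundary
    probs = dist.ravel() / dist.sum()
    idx = np.random.choice(dist.size, p=probs)
    y, x = np.unravel_index(idx, object_mask.shape)
    h, w = object_mask.shape
    return np.array([x / w, y / h])                  # normalized (x, y)

def didm_localization_loss(pred_points, gt_points):
    """Hungarian-matched L2 loss between N_q predicted and M_i <= N_q GT points."""
    # Pairwise distances as the matching cost, as in DETR-style set matching.
    cost = np.linalg.norm(pred_points[:, None, :] - gt_points[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)         # permutation pi: pred -> GT
    return np.sum(np.sum((pred_points[rows] - gt_points[cols]) ** 2, axis=-1))
```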

Semantic Classification Head (SCH)

For each subject–object mask pair, backbone features are pooled spatially over masks to yield $f_{sub}$ and $f_{obj_j}$. Additionally, mask-query tokens from SAM2 are extracted for both categories. Classification proceeds via MLPs:

  • Subject MLP: $f_{sub} \to$ linear $\to$ softmax over 126 entity classes.
  • Object MLP: $f_{obj_j} \to$ linear $\to$ softmax over 126 classes.
  • Relation MLP: $[f_{sub}; f_{obj_j}] \to$ two-layer MLP $\to$ softmax over 57 predicates.

Each semantic output receives a cross-entropy loss, combined as $L_{sem} = L_{sub} + L_{obj} + L_{rel}$, where $L_{sub}$, $L_{obj}$, and $L_{rel}$ correspond to subject, object, and relationship classification, respectively.
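
A compact PyTorch-style sketch of these heads and the combined loss, using the OpenPVSG class counts (126 entities, 57 predicates) and an assumed feature width:

```python
import torch
import torch.nn as nn

class SCH(nn.Module):
    """Sketch of the Semantic Classification Head (illustrative sizes)."""

    def __init__(self, d_model=256, n_entities=126, n_predicates=57):
        super().__init__()
        self.subject_head = nn.Linear(d_model, n_entities)   # softmax via CE loss
        self.object_head = nn.Linear(d_model, n_entities)
        self.relation_head = nn.Sequential(                  # two-layer MLP
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_predicates),
        )

    def forward(self, f_sub, f_obj):
        # f_sub, f_obj: (B, d_model) mask-pooled subject/object features.
        s_logits = self.subject_head(f_sub)
        o_logits = self.object_head(f_obj)
        r_logits = self.relation_head(torch.cat([f_sub, f_obj], dim=-1))
        return s_logits, o_logits, r_logits

def semantic_loss(logits, targets):
    """L_sem = L_sub + L_obj + L_rel (unweighted here; see Section 5 for weights)."""
    s_logits, o_logits, r_logits = logits
    s_gt, o_gt, r_gt = targets
    ce = nn.functional.cross_entropy
    return ce(s_logits, s_gt) + ce(o_logits, o_gt) + ce(r_logits, r_gt)
```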

4. Panoptic Grounding, Temporal Association, and Dataset

Click2Graph leverages SAM2’s video attention and recurrence to propagate subject and object masks with temporal consistency. Once a prompt is issued, its associated mask sequence remains coherent and identity-preserving throughout all video frames, eliminating the need for explicit tracklet re-association.

Tracklet assembly is automatic: each detected entity or object prompt directly generates a temporally linked mask tracklet. At evaluation, predicted mask tracklets are matched to ground truth by enforcing IoU $\geq 0.5$ across frames for both subject and object tracks.
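
As a concrete reading of this matching rule, the sketch below computes tracklet-level IoU and applies the 0.5 threshold to both subject and object tracks; it is an illustrative interpretation, not the official evaluation script.

```python
import numpy as np

def tracklet_iou(pred_masks, gt_masks):
    """Volume IoU between two boolean mask tracklets of shape (T, H, W)."""
    inter = np.logical_and(pred_masks, gt_masks).sum()
    union = np.logical_or(pred_masks, gt_masks).sum()
    return inter / union if union > 0 else 0.0

def tracklets_match(pred_subj, gt_subj, pred_obj, gt_obj, thresh=0.5):
    """A predicted relation counts only if both the subject and object tracklets
    reach IoU >= 0.5 against their ground-truth counterparts (illustrative)."""
    return (tracklet_iou(pred_subj, gt_subj) >= thresh and
            tracklet_iou(pred_obj, gt_obj) >= thresh)
```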

Training and evaluation leverage the OpenPVSG benchmark, comprising 400 videos (~150K frames, 5 FPS) sourced from VidOR, EPIC-Kitchens, and Ego4D, annotated for 126 object categories and 57 predicates with pixel-accurate, temporally linked masks and relations (Ruschel et al., 20 Nov 2025).

5. Optimization and Training Strategy

The SAM2.1-Large (224M parameters) backbone is kept frozen. DIDM and SCH together add approximately 5M trainable parameters. Training is conducted over 400 epochs, with 8-frame clip sampling per batch. DIDM is optimized using AdamW with a cosine-annealed learning rate (from $5 \times 10^{-5}$ to $1 \times 10^{-5}$), while SCH uses a constant $5 \times 10^{-4}$ learning rate.
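
In PyTorch terms, this setup might look like the following sketch; the scheduler choice and parameter grouping are assumptions consistent with the quoted values, not the authors’ training script.

```python
import torch

def build_optimizers(didm, sch, sam2, epochs=400):
    """Optimizer setup following the quoted hyperparameters (illustrative)."""
    for p in sam2.parameters():                    # keep the SAM2.1-Large backbone frozen
        p.requires_grad_(False)
    didm_opt = torch.optim.AdamW(didm.parameters(), lr=5e-5)
    didm_sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        didm_opt, T_max=epochs, eta_min=1e-5)      # anneal 5e-5 -> 1e-5 over training
    sch_opt = torch.optim.AdamW(sch.parameters(), lr=5e-4)  # constant LR, no scheduler
    return didm_opt, didm_sched, sch_opt
```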

The total loss comprises the following terms, assembled in the sketch after the list:

  • Mask loss: $L_{mask} = L_{BCE} + L_{IoU} + L_{Dice}$ with weights $\lambda_{BCE}=10$, $\lambda_{IoU}=1$, $\lambda_{Dice}=1$
  • Localization: $\lambda_{L2} = 20$
  • Semantic: $\lambda_{sub}=10$, $\lambda_{obj}=10$, $\lambda_{rel}=20$
  • Total: $L_{total} = L_{mask} + \lambda_{L2} L_{L2} + L_{sub} + L_{obj} + L_{rel}$
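
A minimal sketch assembling the quoted weights, assuming the semantic weights multiply their respective cross-entropy terms:

```python
# Weighted total loss assembly (sketch; assumes lambda_sub/obj/rel multiply
# their cross-entropy terms, consistent with the listed values).
WEIGHTS = dict(bce=10.0, iou=1.0, dice=1.0, l2=20.0, sub=10.0, obj=10.0, rel=20.0)

def total_loss(l_bce, l_iou, l_dice, l_l2, l_sub, l_obj, l_rel, w=WEIGHTS):
    l_mask = w["bce"] * l_bce + w["iou"] * l_iou + w["dice"] * l_dice
    return (l_mask + w["l2"] * l_l2
            + w["sub"] * l_sub + w["obj"] * l_obj + w["rel"] * l_rel)
```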

6. Experimental Results and Ablation Studies

Performance is measured by three metrics; an illustrative Recall@K sketch follows the list:

  • Recall@K: Percentage of top-$K$ predicted triplets $\langle \hat{s}, \hat{o}, \hat{r} \rangle$ matching ground-truth labels, with mask IoU $\geq 0.5$. Click2Graph achieves Recall@3 of 2.23% on OpenPVSG.
  • Spatial Interaction Recall (SpIR): Fraction of subject–object mask pairs with IoU $\geq 0.5$, ignoring class labels; results range from 18–25%.
  • Prompt Localization Recall (PLR): Proportion of object prompt points falling inside true object masks; Click2Graph achieves PLR of 30–40%.
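
The sketch below gives one concrete reading of the Recall@K computation under the mask-IoU $\geq 0.5$ criterion; it is illustrative only, not the benchmark’s official scorer, and `iou_fn` can be a tracklet IoU such as the one sketched in Section 4.

```python
def recall_at_k(pred_triplets, gt_triplets, k, iou_fn, iou_thresh=0.5):
    """Fraction of ground-truth triplets recovered by the top-k predictions.

    pred_triplets: list of (subj_cls, rel_cls, obj_cls, subj_masks, obj_masks),
    sorted by confidence; gt_triplets: same structure for ground truth.
    """
    matched = set()
    for s, r, o, s_masks, o_masks in pred_triplets[:k]:
        for i, (gs, gr, go, gs_masks, go_masks) in enumerate(gt_triplets):
            if i in matched:
                continue
            if ((s, r, o) == (gs, gr, go)
                    and iou_fn(s_masks, gs_masks) >= iou_thresh
                    and iou_fn(o_masks, go_masks) >= iou_thresh):
                matched.add(i)        # each GT triplet can be matched only once
                break
    return len(matched) / max(len(gt_triplets), 1)
```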

Relative to fully automated PVSG baselines (generating ~100 pairs/frame, Recall@20 ≈ 3–4%), Click2Graph’s targeted, prompted Recall@3 is competitive, given its interactive and parsimonious setting ($N_q=3$ candidates per subject).

Ablation studies highlight:

  • Prompt-type robustness: Minimal degradation between single-point, bounding-box, and mask prompts.
  • DIDM necessity: Replacing DIDM with a heuristic heatmap sampler yields substantial drops in PLR (>20 points), SpIR (>15 points), and Recall@3 (>1.4 points).
  • Semantic bottleneck: Using ground-truth predicates instead of predicted ones increases Recall@3 by ~1.2 points, confirming that predicate inference is the key error source.
  • Qualitative analysis: The system correctly grounds multi-object interactions (e.g., ⟨adult, box, holding⟩) and is robust to occlusions, but it often confuses semantically close predicates and visually similar objects (Ruschel et al., 20 Nov 2025).

7. Strengths, Limitations, and Prospective Advances

Strengths:

  • Enables directly controllable, user-centered video understanding.
  • Significantly reduces computational burden by restricting downstream relational reasoning to $N_q=3$ subject–object pairs.
  • Integrates interactive segmentation, subject-conditioned prompting, and joint semantic–predicate reasoning.

Limitations:

  • Semantic errors are dominant, particularly among visually similar object classes and fine-grained predicates.
  • No semantic feedback loop during inference; the user can only re-prompt segmentation, not correct class/relationship predictions.
  • Fixed $N_q$ limits the system’s ability to encode more than three simultaneous subject–object interactions without repeated prompting.

Future directions:

Prospective extensions of Click2Graph include: (1) user-feedback mechanisms for correcting predicted classes and relationships, allowing online adaptation; (2) integration of LLMs for improved predicate disambiguation and generalization; (3) support for multi-subject and batch prompting to construct richer scene graphs; and (4) dynamic estimation of $N_q$ in response to scene complexity.

Click2Graph advances the field of interactive video scene graph generation by synthesizing promptable panoptic segmentation, subject-focused object discovery, and semantic–predicate inference into a unified, efficient framework. Its design establishes a foundation for controllable, interpretable video analysis driven directly by human input (Ruschel et al., 20 Nov 2025).
