
RS-Net: Context-Aware Relation Scoring

Updated 14 November 2025
  • RS-Net is a modular, context-aware scoring framework that evaluates object relations in videos by integrating spatial interactions and temporal context.
  • It employs Transformer-based encoders to capture intra-frame and inter-frame cues, thereby improving relation classification and mitigating long-tail effects.
  • RS-Net integrates seamlessly with existing DSGG backbones and demonstrates measurable gains in recall, precision, and mean recall on the Action Genome dataset.

Relation Scoring Network (RS-Net) is a modular context-aware scoring framework designed to improve Dynamic Scene Graph Generation (DSGG) in videos by learning to score the “meaningfulness” of object pairs based on both spatial interactions and temporally aggregated video context. Distinct from previous approaches, RS-Net directly models the distinction between meaningful and irrelevant pairs, integrating contextual scoring via spatial and temporal Transformer-based encoders. The design enables seamless integration into existing DSGG backbones with minimal architectural changes, yielding improvements in Recall, Precision, and mean Recall, especially in the presence of long-tailed relation distributions. RS-Net was introduced and evaluated on the Action Genome dataset, demonstrating both empirical efficacy and computational efficiency (Jo et al., 11 Nov 2025).

1. Dynamic Scene Graph Generation and Problem Setup

In DSGG, the goal is to produce, for a sequence of video frames $\{I_t\}_{t=1}^T$, a corresponding sequence of scene graphs

$$G_t = (V_t, E_t), \quad t = 1, \dots, T,$$

where $V_t = \{v^i_t\}_{i=1}^{N_t}$ is the set of detected objects (nodes) in frame $t$, and $E_t = \{(i, j, r^k_t)\}_{k=1}^{K(t)}$ is the set of $K(t)$ subject–predicate–object (triplet) relations with labels $r^k_t$.

Each object $i$ at time $t$ is assigned:

  • a visual feature $\mathbf{v}^i_t \in \mathbb{R}^{d_v}$,
  • a bounding box $\mathbf{b}^i_t$,
  • a category distribution $\mathbf{d}^i_t$.

For each ordered pair $(i, j)$, a relation representation $\mathbf{x}^{(i,j)}_t$ is constructed, and each possible predicate class $r$ receives a predicate score $s^{(i,j)}_{t,r}$. Conventional approaches lack explicit discrimination between related and unrelated pairs, limiting their ability to suppress semantically vacuous predictions during inference. RS-Net addresses this by computing a contextual "relation score" for all pairs and integrating this score into downstream triplet predictions.
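To make this setup concrete, the following sketch encodes a per-frame scene graph with plain Python dataclasses. The names (`FrameObject`, `Relation`, `SceneGraph`) and the NumPy types are illustrative assumptions, not structures from the paper.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class FrameObject:
    """A detected object v_t^i with the inputs RS-Net consumes."""
    visual: np.ndarray    # visual feature v_t^i in R^{d_v}
    box: np.ndarray       # bounding box b_t^i, e.g. (x1, y1, x2, y2)
    cat_dist: np.ndarray  # category distribution d_t^i over object classes

@dataclass
class Relation:
    """One subject-predicate-object triplet (i, j, r_t^k)."""
    subj: int             # index i into the frame's object list
    obj: int              # index j
    predicate: int        # relation label r_t^k

@dataclass
class SceneGraph:
    """G_t = (V_t, E_t) for a single frame t."""
    objects: List[FrameObject] = field(default_factory=list)   # V_t
    relations: List[Relation] = field(default_factory=list)    # E_t
```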

2. Spatial and Temporal Context Encoders

2.1 Spatial Context Encoder

The spatial encoder captures intra-frame contextual cues via Transformer-based self-attention over relation features.

  • Relation features for frame $t$:

$$\mathbf{x}^k_t = [\bar{\mathbf{v}}^i_t,\ \bar{\mathbf{v}}^j_t,\ \bar{\mathbf{u}}^{ij}_t,\ \bar{\mathbf{d}}^i_t,\ \bar{\mathbf{d}}^j_t] \in \mathbb{R}^{d_m},$$

where the bar denotes a learned linear projection and $\bar{\mathbf{u}}^{ij}_t$ is the union-box RoI feature.

  • The feature sequence is prepended with a learnable [Spa] context token $\mathbf{c}_t \in \mathbb{R}^{d_m}$:

$$\mathbf{Z}_t^{\rm Spa} = [\mathbf{c}_t \,\|\, \mathbf{x}^1_t \,\|\, \cdots \,\|\, \mathbf{x}^{K(t)}_t],$$

which is processed by $L_s$ Transformer layers:

$$\hat{\mathbf{Z}}_t^{\rm Spa} = \mathrm{Transformer}_{L_s}(\mathbf{Z}_t^{\rm Spa}),$$

yielding an updated context token $\hat{\mathbf{c}}_t = \hat{\mathbf{Z}}_t^{\rm Spa}[0]$ and enriched per-pair relation features $\hat{\mathbf{x}}^k_t$.
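A minimal PyTorch sketch of this encoder follows. The dimensions ($d_v$, the number of object classes, $d_m = 5 \times 128$), head count, and layer count are assumed values, and the stock `nn.TransformerEncoder` stands in for the paper's exact Transformer configuration.

```python
import torch
import torch.nn as nn

class SpatialContextEncoder(nn.Module):
    """Intra-frame self-attention over relation features with a [Spa] token."""
    def __init__(self, d_v=2048, n_cls=37, d_part=128, n_layers=2, n_heads=8):
        super().__init__()
        d_m = 5 * d_part  # concatenation of five projected components
        # The "bar" projections: subject/object visuals, union RoI, class dists.
        self.p_vs, self.p_vo = nn.Linear(d_v, d_part), nn.Linear(d_v, d_part)
        self.p_u = nn.Linear(d_v, d_part)
        self.p_ds, self.p_do = nn.Linear(n_cls, d_part), nn.Linear(n_cls, d_part)
        self.spa_token = nn.Parameter(torch.zeros(1, 1, d_m))  # learnable [Spa]
        layer = nn.TransformerEncoderLayer(d_m, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, v_s, v_o, u, d_s, d_o):
        # Each input has shape (K, d) for the K object pairs of one frame.
        x = torch.cat([self.p_vs(v_s), self.p_vo(v_o), self.p_u(u),
                       self.p_ds(d_s), self.p_do(d_o)], dim=-1).unsqueeze(0)
        z = torch.cat([self.spa_token, x], dim=1)  # prepend [Spa]: (1, K+1, d_m)
        z_hat = self.encoder(z)
        # Updated context token c_t_hat and enriched pair features x_t_hat.
        return z_hat[:, 0], z_hat[:, 1:]
```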

2.2 Temporal Context Encoder

To aggregate video-level context, the spatial context tokens are stacked across the sequence:

  • Temporal input sequence:

$$\mathbf{Z}^{\rm Temp} = [\mathbf{c}_{\rm tmp} \,\|\, \hat{\mathbf{c}}_1 \,\|\, \hat{\mathbf{c}}_2 \,\|\, \cdots \,\|\, \hat{\mathbf{c}}_T],$$

where $\mathbf{c}_{\rm tmp}$ is a learnable [Tmp] token.

  • After adding a learnable positional embedding $\mathbf{E}_{\rm pos}$, the sequence is processed by $L_t$ Transformer layers:

$$\hat{\mathbf{Z}}^{\rm Temp} = \mathrm{Transformer}_{L_t}\big(Q, K = \mathbf{Z}^{\rm Temp} + \mathbf{E}_{\rm pos},\ V = \mathbf{Z}^{\rm Temp}\big),$$

with the output $\hat{\mathbf{c}}_{\rm tmp} = \hat{\mathbf{Z}}^{\rm Temp}[0]$ serving as the video-level context token.
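The sketch below continues the previous one (same imports, $d_m = 640$). One simplification to note: the equation above adds $\mathbf{E}_{\rm pos}$ only to the queries and keys, whereas the stock `nn.TransformerEncoder` used here adds it to the whole input; `max_T` and the layer sizes are assumptions.

```python
class TemporalContextEncoder(nn.Module):
    """Aggregates per-frame [Spa] tokens into one video-level [Tmp] token."""
    def __init__(self, d_m=640, max_T=1000, n_layers=2, n_heads=8):
        super().__init__()
        self.tmp_token = nn.Parameter(torch.zeros(1, 1, d_m))        # learnable [Tmp]
        self.pos_emb = nn.Parameter(torch.zeros(1, max_T + 1, d_m))  # learnable E_pos
        layer = nn.TransformerEncoderLayer(d_m, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, c_hats):
        # c_hats: (1, T, d_m), the stacked spatial context tokens c_1_hat..c_T_hat.
        z = torch.cat([self.tmp_token, c_hats], dim=1)   # (1, T+1, d_m)
        z_hat = self.encoder(z + self.pos_emb[:, :z.size(1)])
        return z_hat[:, 0]                               # c_tmp_hat: (1, d_m)
```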

3. Unified Triplet Scoring and Losses

3.1 Relation Scoring Decoder

For each enriched pair $\hat{\mathbf{x}}^k_t$, the decoder concatenates this feature with $\hat{\mathbf{c}}_{\rm tmp}$ and applies a small MLP followed by a softmax:

$$\mathbf{p}^k_t = [p^k_{t,0},\ p^k_{t,1}] = \mathrm{softmax}\big(\mathrm{MLP}([\hat{\mathbf{x}}^k_t;\ \hat{\mathbf{c}}_{\rm tmp}])\big),$$

where $p^k_{t,0}$ quantifies the "meaningfulness" and $p^k_{t,1}$ the "irrelevance" of relation $k$ at time $t$.
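A corresponding sketch, continuing the earlier ones; the two-layer MLP and its hidden width are assumptions:

```python
class RelationScoringDecoder(nn.Module):
    """MLP over [x_hat; c_tmp_hat] -> (meaningful, irrelevant) probabilities."""
    def __init__(self, d_m=640, d_h=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d_m, d_h), nn.ReLU(),
                                 nn.Linear(d_h, 2))

    def forward(self, x_hat, c_tmp_hat):
        # x_hat: (K, d_m) enriched pair features; c_tmp_hat: (1, d_m) video context.
        ctx = c_tmp_hat.expand(x_hat.size(0), -1)  # broadcast context to all pairs
        return self.mlp(torch.cat([x_hat, ctx], dim=-1)).softmax(dim=-1)
```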

3.2 Triplet Score Fusion

For standard DSGG, the base confidence for triplet $(i, j, r)$ is

$$s^{k,\rm base}_t = s^i_t(\mathrm{sub}) \times s^j_t(\mathrm{obj}) \times s^k_t(\mathrm{rel}),$$

where $s^i_t$ and $s^j_t$ are the subject and object detection confidences, and $s^k_t(\mathrm{rel})$ is the per-predicate score. RS-Net injects contextual awareness by fusing its relation score:

$$s^k_t = s^{k,\rm base}_t \times p^k_{t,0},$$

which suppresses spurious triplets and emphasizes contextually relevant ones.
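Continuing the earlier sketches, the fusion itself reduces to a broadcasted multiplication; the tensor names below are assumptions:

```python
# s_sub, s_obj: (K,) detection confidences; s_rel: (K, R) per-predicate scores.
p = decoder(x_hat, c_tmp_hat)                 # (K, 2) RS-Net relation scores
s_base = s_sub[:, None] * s_obj[:, None] * s_rel
s_fused = s_base * p[:, 0:1]                  # down-weight irrelevant pairs
```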

3.3 Loss Functions

The total training loss is the sum of three terms:

$$\mathcal{L}_{\rm total} = \mathcal{L}_{\rm od} + \mathcal{L}_{\rm rel} + \mathcal{L}_{\rm RSN}$$

  • Object detection:

$$\mathcal{L}_{\rm od} = -\sum_{t,i} \mathbf{g}^i_t \log \mathbf{d}^i_t$$

  • Predicate classification (multi-label pairwise ranking):

$$\mathcal{L}_{\rm rel} = \sum_{m \in \mathcal{S}^+} \sum_{n \in \mathcal{S}^-} \max(0,\ 1 - s_m + s_n)$$

  • Relation scoring (focal loss):

$$\mathcal{L}_{\rm RSN} = -\sum_{t,k} \sum_{c \in \{0,1\}} \alpha_c (1 - p^k_{t,c})^\gamma \log p^k_{t,c},$$

with class weights $\alpha_c$ and focusing parameter $\gamma$ balancing class prevalence and hard examples.
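The RSN term is a focal loss; the sketch below implements its usual form, in which only the ground-truth class of each pair contributes. The $\alpha_c$ and $\gamma$ defaults are assumed, not the paper's tuned values.

```python
def rsn_focal_loss(probs, targets, alpha=(0.25, 0.75), gamma=2.0):
    """Focal loss over RS-Net's (meaningful, irrelevant) probabilities.

    probs:   (N, 2) softmax outputs p_{t,c}^k for all (t, k) pairs, flattened.
    targets: (N,) class indices, 0 = meaningful, 1 = irrelevant.
    """
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp_min(1e-8)
    a = probs.new_tensor(alpha)[targets]   # per-class weight alpha_c
    return (-a * (1.0 - p_true) ** gamma * p_true.log()).mean()
```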

4. Integration with Dynamic Scene Graph Generation Frameworks

RS-Net is architected for modularity:

  • At inference, any DSGG backbone supplies per-relation feature embeddings $\mathbf{r}^k_t$.
  • The video context token $\hat{\mathbf{c}}_{\rm tmp}$ is concatenated as $[\mathbf{r}^k_t;\ \hat{\mathbf{c}}_{\rm tmp}]$ and fed to the existing predicate classification heads.
  • In parallel, the RS-Net MLP produces scores $\mathbf{p}^k_t$, whose meaningfulness probability is multiplied into the final triplet score.

This procedure requires no modification to the object detector or scene graph construction logic. It is compatible with various DSGG backbones, including STTran, STKET, and DSG-DETR, among others.
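One way this wiring might look in code, assembling the earlier sketches; the backbone interface (`extract`, `pred_head`) is entirely hypothetical, standing in for whatever the host DSGG model exposes:

```python
class RSNetWrapper(nn.Module):
    """Wraps a DSGG backbone with RS-Net context scoring (interface assumed)."""
    def __init__(self, backbone, spa_enc, tmp_enc, rs_decoder):
        super().__init__()
        self.backbone, self.spa = backbone, spa_enc
        self.tmp, self.rs = tmp_enc, rs_decoder

    def forward(self, frames):
        feats = self.backbone.extract(frames)   # hypothetical per-frame pair inputs
        c_hats, x_hats = [], []
        for f in feats:
            c, x = self.spa(f.v_s, f.v_o, f.u, f.d_s, f.d_o)
            c_hats.append(c)
            x_hats.append(x.squeeze(0))
        c_tmp = self.tmp(torch.stack(c_hats, dim=1))   # video-level context (1, d_m)
        outputs = []
        for x in x_hats:
            ctx = c_tmp.expand(x.size(0), -1)
            logits = self.backbone.pred_head(torch.cat([x, ctx], dim=-1))
            p = self.rs(x, c_tmp)                      # (K, 2) relation scores
            outputs.append((logits, p))
        return outputs
```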

5. Experimental Evaluation

On the Action Genome benchmark, RS-Net demonstrates consistent gains in recall and precision, especially in mean Recall (mR), which is critical for long-tailed relation distributions.

| Backbone | R@10 (Baseline → RS-Net) | P@10 (Baseline → RS-Net) | ΔmR@10 |
|----------|--------------------------|--------------------------|--------|
| STTran   | 25.1 → 28.3 (+3.2)       | 17.9 → 20.7 (+2.8)       | +2.4   |
| STKET    | 26.4 → 28.9 (+2.5)       | 18.9 → 21.2 (+2.3)       | –      |
| DSG-DETR | 30.3 → 30.5 (+0.2)       | 22.1 → 22.2 (+0.1)       | –      |

Additional results:

  • SGCLS: DSG-DETR improves from R@10 = 49.9 to 50.5 and from P@10 = 56.6 to 57.0.
  • Ablation: removing the temporal encoder reduces SGDET R@10 from 28.3 to 28.0.
  • The learnable $\mathbf{c}_{\rm tmp}$ token outperforms mean pooling (28.3 vs. 28.0 R@10).
  • Context fusion produces small but consistent improvements.
  • Throughput remains competitive: e.g., STTran improves from 0.74 to 0.75 FPS after RS-Net integration.

6. Computational Considerations

The addition of RS-Net raises parameter counts (e.g., STTran grows from 126.3M to 158.6M parameters, a +32.3M increase), while frame-per-second throughput is essentially unchanged or slightly improved. Negative sampling in the RSN loss speeds convergence by reducing exposure to noisy negatives, and GPU parallel efficiency benefits from aligned tensor shapes.

7. Analysis, Limitations, and Extensions

RS-Net’s explicit learning of “meaningful vs. irrelevant” relations mitigates long-tail effects by down-weighting predominant but semantically empty co-occurrences. The approach is modular, backbone-agnostic, and leverages both intra-frame and inter-frame cues. A limitation is the isolated treatment of relations, suggesting future incorporation of graph propagation. Potential extensions include dynamic temporal windows, incorporation of cross-video context, and unsupervised relation mining. A plausible implication is that RS-Net’s context-driven score may benefit other video-based relational or multi-object understanding tasks within and beyond DSGG frameworks.

Pseudocode Illustration

for each video:
  for t in 1..T:
    detect objects → features {v_t^i, d_t^i, b_t^i}
    build relation features X_t = [x_t^k] for all object pairs
    Z_s = [c_t; X_t]                               # prepend learnable [Spa] token
    Z_s_hat = Transformer_s(Z_s)                   # spatial encoder, L_s layers
    c_t_hat, X_t_hat = Z_s_hat[0], Z_s_hat[1:]     # context token, enriched pairs
  Z_tmp = [c_tmp; c_1_hat; …; c_T_hat]             # prepend learnable [Tmp] token
  Z_tmp_hat = Transformer_t(Z_tmp + E_pos)         # temporal encoder, L_t layers
  c_tmp_hat = Z_tmp_hat[0]                         # video-level context token
  for each enriched pair x_t_hat^k:
    p_tk = softmax(MLP(concat(x_t_hat^k, c_tmp_hat)))      # relation score
    predicate_logits = existing_heads(concat(x_t_hat^k, c_tmp_hat))
    fused_score = base_triplet_score(t, k) * p_tk[0]       # score fusion
  compute losses: L_od + L_rel + L_RSN

RS-Net’s core contribution is a unified, context-sensitive scoring mechanism for pairwise relations in DSGG, facilitating robust, efficient, and accurate video scene understanding in the presence of challenging data distributions.
