RS-Net: Context-Aware Relation Scoring
- RS-Net is a modular, context-aware scoring framework that evaluates object relations in videos by integrating spatial interactions and temporal context.
- It employs Transformer-based encoders to capture intra-frame and inter-frame cues, thereby improving relation classification and mitigating long-tail effects.
- RS-Net integrates seamlessly with existing DSGG backbones and demonstrates measurable gains in recall, precision, and mean recall on the Action Genome dataset.
Relation Scoring Network (RS-Net) is a modular context-aware scoring framework designed to improve Dynamic Scene Graph Generation (DSGG) in videos by learning to score the “meaningfulness” of object pairs based on both spatial interactions and temporally aggregated video context. Distinct from previous approaches, RS-Net directly models the distinction between meaningful and irrelevant pairs, integrating contextual scoring via spatial and temporal Transformer-based encoders. The design enables seamless integration into existing DSGG backbones with minimal architectural changes, yielding improvements in Recall, Precision, and mean Recall, especially in the presence of long-tailed relation distributions. RS-Net was introduced and evaluated on the Action Genome dataset, demonstrating both empirical efficacy and computational efficiency (Jo et al., 11 Nov 2025).
1. Dynamic Scene Graph Generation and Problem Setup
In DSGG, the goal is to produce, for a sequence of video frames $\{I_t\}_{t=1}^{T}$, a corresponding sequence of scene graphs $\{G_t = (V_t, E_t)\}_{t=1}^{T}$, where $V_t$ is the set of detected objects (nodes) in frame $t$, and $E_t = \{(i, j, r_t^k)\}_{k=1}^{K(t)}$ is the set of $K(t)$ subject–predicate–object (triplet) relations with predicate labels $r_t^k$ drawn from a fixed relation vocabulary.
Each object $i$ at time $t$ is assigned:
- a visual feature $v_t^i$,
- a bounding box $b_t^i$,
- a category distribution $d_t^i$.
For each ordered pair $(i, j)$, a relation representation $x_t^k$ is constructed, and each possible predicate class $c$ receives a predicate score $p_t^{k,c}$. Conventional approaches lack explicit discrimination between related and unrelated pairs, limiting their ability to suppress semantically vacuous predictions during inference. RS-Net addresses this by computing a contextual “relation score” for all pairs and integrating this score into downstream triplet predictions.
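The following minimal sketch (illustrative PyTorch, not the authors' code) shows the pair-enumeration step implied by this setup: every ordered object pair in a frame becomes a candidate relation whose feature will later be scored.

```python
# Illustrative only: enumerate ordered (subject, object) pairs in one frame and
# build naive pair features by concatenating the objects' visual features.
import itertools
import torch

def enumerate_pairs(num_objects: int):
    """All ordered (subject, object) index pairs with i != j."""
    return [(i, j) for i, j in itertools.permutations(range(num_objects), 2)]

num_objects, feat_dim = 4, 256
obj_feats = torch.randn(num_objects, feat_dim)      # stand-in for detector features v_t^i
pairs = enumerate_pairs(num_objects)                 # candidate relations in this frame
pair_feats = torch.stack([torch.cat([obj_feats[i], obj_feats[j]]) for i, j in pairs])
print(pair_feats.shape)  # (num_objects * (num_objects - 1), 2 * feat_dim)
```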
2. Spatial and Temporal Context Encoders
2.1 Spatial Context Encoder
The spatial encoder captures intra-frame contextual cues via Transformer-based self-attention over relation features.
- Relation features for frame $t$: $x_t^k = [\,\bar{v}_t^i;\ \bar{v}_t^j;\ \bar{u}_t^k\,]$, where the bar denotes a learned linear projection and $u_t^k$ is the union-box RoI feature of the pair.
- The feature sequence is prepended with a learnable [Spa] context token $c_t$, giving $Z_t = [\,c_t;\ x_t^1;\ \dots;\ x_t^{K(t)}\,]$, which is processed by Transformer encoder layers, $\hat{Z}_t = \mathrm{Enc}_{\mathrm{spa}}(Z_t)$, yielding an updated context token $\hat{c}_t$ and enriched per-pair relation features $\hat{x}_t^k$.
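A compact sketch of this encoder, assuming a standard PyTorch `TransformerEncoder`; the layer count and width below are illustrative, not taken from the paper.

```python
# Spatial context encoder sketch: prepend a learnable [Spa] token to the
# per-frame relation features and run a Transformer encoder over the sequence.
import torch
import torch.nn as nn

class SpatialContextEncoder(nn.Module):
    def __init__(self, dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        self.spa_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable [Spa] token c_t
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, rel_feats: torch.Tensor):
        # rel_feats: (1, K_t, dim) relation features x_t^k for one frame
        z = torch.cat([self.spa_token, rel_feats], dim=1)   # Z_t = [c_t; x_t^1; ...; x_t^K]
        z_hat = self.encoder(z)
        return z_hat[:, 0], z_hat[:, 1:]                    # c_t_hat, enriched x_t_hat

enc = SpatialContextEncoder()
c_hat, x_hat = enc(torch.randn(1, 12, 512))                 # e.g. 12 pairs in this frame
```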
2.2 Temporal Context Encoder
To aggregate video-level context, the spatial context tokens are stacked across the sequence:
- Temporal input sequence: $Z^{\mathrm{tmp}} = [\,c^{\mathrm{tmp}};\ \hat{c}_1;\ \dots;\ \hat{c}_T\,]$, where $c^{\mathrm{tmp}}$ is a learnable [Tmp] token.
- After adding a learnable positional embedding $E_{\mathrm{pos}}$, the sequence is processed via Transformer encoder layers, $\hat{Z}^{\mathrm{tmp}} = \mathrm{Enc}_{\mathrm{tmp}}(Z^{\mathrm{tmp}} + E_{\mathrm{pos}})$, with the output token $\hat{c}^{\mathrm{tmp}}$ serving as a video-level context token.
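A matching sketch of the temporal encoder under the same assumptions; `max_frames` and all dimensions are placeholders.

```python
# Temporal context encoder sketch: stack the per-frame [Spa] tokens behind a
# learnable [Tmp] token, add a learnable positional embedding, and encode.
import torch
import torch.nn as nn

class TemporalContextEncoder(nn.Module):
    def __init__(self, dim: int = 512, max_frames: int = 64,
                 num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        self.tmp_token = nn.Parameter(torch.zeros(1, 1, dim))             # learnable [Tmp]
        self.pos_emb = nn.Parameter(torch.zeros(1, max_frames + 1, dim))  # E_pos
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, spa_tokens: torch.Tensor):
        # spa_tokens: (1, T, dim), the per-frame context tokens c_t_hat
        z = torch.cat([self.tmp_token, spa_tokens], dim=1)  # [c_tmp; c_1_hat; ...; c_T_hat]
        z = z + self.pos_emb[:, : z.size(1)]
        return self.encoder(z)[:, 0]                        # video-level token c_tmp_hat

video_ctx = TemporalContextEncoder()(torch.randn(1, 16, 512))  # e.g. 16 frames
```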
3. Unified Triplet Scoring and Losses
3.1 Relation Scoring Decoder
For each enriched pair feature $\hat{x}_t^k$, the decoder concatenates this feature with the video-level context token $\hat{c}^{\mathrm{tmp}}$ and applies a small MLP followed by a softmax, $(p_t^{k,+},\ p_t^{k,-}) = \mathrm{softmax}\!\left(\mathrm{MLP}([\,\hat{x}_t^k;\ \hat{c}^{\mathrm{tmp}}\,])\right)$, where $p_t^{k,+}$ quantifies the “meaningfulness” and $p_t^{k,-}$ the “irrelevance” of relation $k$ at time $t$.
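A minimal sketch of this decoder, assuming a two-layer MLP with an illustrative hidden size.

```python
# Relation scoring decoder sketch: concatenate each enriched pair feature with
# the video-level context token and map to a 2-way (meaningful vs. irrelevant)
# softmax via a small MLP.
import torch
import torch.nn as nn

class RelationScoringDecoder(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, pair_feats: torch.Tensor, video_ctx: torch.Tensor):
        # pair_feats: (K, dim) enriched x_t_hat; video_ctx: (dim,) c_tmp_hat
        ctx = video_ctx.expand(pair_feats.size(0), -1)
        logits = self.mlp(torch.cat([pair_feats, ctx], dim=-1))
        return logits.softmax(dim=-1)   # column 0 = meaningfulness p^+, column 1 = p^-
```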
3.2 Triplet Score Fusion
For standard DSGG, the base confidence for the triplet $(i, j, c)$ at time $t$ is given by $s_t^{k,c} = s_t^{i}\, s_t^{j}\, p_t^{k,c}$, where $s_t^{i}$ and $s_t^{j}$ are the subject and object detection confidences and $p_t^{k,c}$ is the per-predicate score. RS-Net injects contextual awareness by fusing its score, $\tilde{s}_t^{k,c} = s_t^{k,c} \cdot p_t^{k,+}$, which serves to suppress spurious triplets and emphasize contextually relevant ones.
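A small worked example of the fusion arithmetic (the numbers are invented for illustration):

```python
# Fusion example: the conventional triplet confidence is multiplied by RS-Net's
# meaningfulness probability, demoting pairs judged irrelevant in context.
subj_conf, obj_conf, pred_score = 0.9, 0.8, 0.7
base_triplet_score = subj_conf * obj_conf * pred_score   # 0.504
meaningfulness = 0.2                                      # RS-Net: likely irrelevant pair
fused_score = base_triplet_score * meaningfulness         # 0.1008, triplet is suppressed
```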
3.3 Loss Functions
The total training loss is a sum of three terms, $\mathcal{L} = \mathcal{L}_{\mathrm{od}} + \mathcal{L}_{\mathrm{rel}} + \mathcal{L}_{\mathrm{rsn}}$:
- Object detection: $\mathcal{L}_{\mathrm{od}}$, the standard detector loss over object classes and boxes.
- Predicate classification: $\mathcal{L}_{\mathrm{rel}}$, a multi-label pairwise ranking loss over predicate classes.
- Relation scoring: $\mathcal{L}_{\mathrm{rsn}}$, a focal loss over the meaningful/irrelevant decision, with weighting factor $\alpha$ and focusing parameter $\gamma$ to balance class prevalence and focus (a sketch of this term follows below).
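A sketch of the relation-scoring term only, assuming the standard binary focal-loss form; the `alpha` and `gamma` defaults below are conventional values, not necessarily those used by RS-Net.

```python
# Binary focal loss over the meaningful/irrelevant decision for each pair.
import torch

def relation_focal_loss(meaningful_prob: torch.Tensor, target: torch.Tensor,
                        alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    # meaningful_prob: (K,) softmax probability p^+ that each pair is meaningful
    # target: (K,) 1.0 for annotated (meaningful) pairs, 0.0 for background pairs
    p_t = target * meaningful_prob + (1.0 - target) * (1.0 - meaningful_prob)
    alpha_t = target * alpha + (1.0 - target) * (1.0 - alpha)
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-8))).mean()
```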
4. Integration with Dynamic Scene Graph Generation Frameworks
RS-Net is architected for modularity:
- At inference, any DSGG backbone supplies per-relation feature embeddings $r_t^k$.
- The video context token $\hat{c}^{\mathrm{tmp}}$ is concatenated, giving $[\,r_t^k;\ \hat{c}^{\mathrm{tmp}}\,]$, which is fed to the existing predicate classification heads.
- In parallel, the RS-Net MLP provides scores $(p_t^{k,+}, p_t^{k,-})$, whose meaningfulness probability $p_t^{k,+}$ is multiplied into the final triplet score.
This procedure requires no modification to the object detector or scene graph construction logic. It is compatible with various DSGG backbones, including STTran, STKET, and DSG-DETR, among others.
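A minimal sketch of the plug-in step; `rescore_triplets` is a hypothetical helper (not an API of STTran, STKET, or DSG-DETR) showing where the meaningfulness probability enters the backbone's outputs.

```python
# Re-weight an existing backbone's triplet confidences with RS-Net's scores.
import torch

def rescore_triplets(base_scores: torch.Tensor, meaningful_prob: torch.Tensor) -> torch.Tensor:
    # base_scores: (K, C) conventional triplet confidences per predicate class
    # meaningful_prob: (K,) RS-Net meaningfulness p^+ for each pair
    return base_scores * meaningful_prob.unsqueeze(-1)

fused = rescore_triplets(torch.rand(12, 26), torch.rand(12))  # 12 pairs, 26 predicates
```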
5. Experimental Evaluation
On the Action Genome benchmark, RS-Net demonstrates consistent gains in recall and precision, especially in mean Recall (mR), which is critical for long-tailed relation distributions.
| Backbone | R@10 (Baseline → RS-Net) | P@10 (Baseline → RS-Net) | mR@10 gain |
|---|---|---|---|
| STTran | 25.1 → 28.3 (+3.2) | 17.9 → 20.7 (+2.8) | +2.4 |
| STKET | 26.4 → 28.9 (+2.5) | 18.9 → 21.2 (+2.3) | - |
| DSG-DETR | 30.3 → 30.5 (+0.2) | 22.1 → 22.2 (+0.1) | - |
Additional results:
- SGCLS: DSG-DETR R@10 = 49.9 → 50.5, P@10 = 56.6 → 57.0.
- Ablation: Removal of the temporal encoder reduces SGDET R@10 from 28.3 to 28.0.
- Learnable token outperforms mean-pooling (28.3 vs. 28.0 R@10).
- Context-fusion produces small but consistent improvements.
- Precision and throughput remain competitive: e.g., STTran's inference speed increases from 0.74 to 0.75 FPS after RS-Net integration.
6. Computational Considerations
The addition of RS-Net raises parameter counts (e.g., STTran grows from 126.3M to 158.6M parameters, roughly a +32M increase), while frames-per-second throughput remains essentially unchanged or slightly improves. Negative sampling in the RSN loss speeds convergence by reducing exposure to noisy negatives (see the sketch below), and GPU parallel efficiency benefits from aligned tensor shapes.
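An illustrative sketch of such negative sampling; the positive-to-negative ratio is an assumption, not the paper's setting.

```python
# Keep all positive (annotated) pairs and a capped random subset of negatives
# when computing the relation-scoring loss.
import torch

def sample_rsn_targets(targets: torch.Tensor, neg_per_pos: int = 3) -> torch.Tensor:
    """Boolean mask keeping every positive pair and up to neg_per_pos negatives per positive."""
    pos = targets > 0.5
    neg_idx = torch.nonzero(~pos).flatten()
    keep_neg = neg_idx[torch.randperm(neg_idx.numel())[: neg_per_pos * int(pos.sum())]]
    mask = pos.clone()
    mask[keep_neg] = True
    return mask
```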
7. Analysis, Limitations, and Extensions
RS-Net’s explicit learning of “meaningful vs. irrelevant” relations mitigates long-tail effects by down-weighting predominant but semantically empty co-occurrences. The approach is modular, back-end agnostic, and leverages both intra-frame and inter-frame cues. A limitation is the isolated treatment of relations, suggesting future incorporation of graph propagation. Potential extensions include dynamic temporal windows, incorporation of cross-video context, and unsupervised relation mining. A plausible implication is that RS-Net’s context-driven score may benefit other video-based relational or multi-object understanding tasks within and beyond DSGG frameworks.
Pseudocode Illustration
```
for each video:
    for t in 1..T:
        detect objects → features {v_t^i, d_t^i, b_t^i}
        build relation set X_t = [x_t^k] for all pairs
        Z_s = [c_t; X_t]                         # spatial input with [Spa] token
        Z_s_hat = Transformer_s(Z_s)             # spatial encoder
        save c_t_hat = Z_s_hat[0], x_t_hat = Z_s_hat[1:]
    Z_temp = [c_tmp; c_1_hat; …; c_T_hat]        # temporal input with [Tmp] token
    Z_temp_hat = Transformer_t(Z_temp + E_pos)   # temporal encoder
    c_tmp_hat = Z_temp_hat[0]                    # video-level context token
    for all relations x_t_hat[k]:
        p_tk = softmax( MLP( concat(x_t_hat[k], c_tmp_hat) ) )        # relation score
        fused_score = base_triplet_score * p_tk[0]                    # gate by meaningfulness
        predicate_logits = existing_heads( concat(r_tk, c_tmp_hat) )  # backbone head
    compute losses: L_od + L_rel + L_rsn
```
RS-Net’s core contribution is a unified, context-sensitive scoring mechanism for pairwise relations in DSGG, facilitating robust, efficient, and accurate video scene understanding in the presence of challenging data distributions.