RS-Net: Context-Aware Relation Scoring
- RS-Net is a modular, context-aware scoring framework that evaluates object relations in videos by integrating spatial interactions and temporal context.
- It employs Transformer-based encoders to capture intra-frame and inter-frame cues, thereby improving relation classification and mitigating long-tail effects.
- RS-Net integrates seamlessly with existing DSGG backbones and demonstrates measurable gains in recall, precision, and mean recall on the Action Genome dataset.
Relation Scoring Network (RS-Net) is a modular context-aware scoring framework designed to improve Dynamic Scene Graph Generation (DSGG) in videos by learning to score the “meaningfulness” of object pairs based on both spatial interactions and temporally aggregated video context. Distinct from previous approaches, RS-Net directly models the distinction between meaningful and irrelevant pairs, integrating contextual scoring via spatial and temporal Transformer-based encoders. The design enables seamless integration into existing DSGG backbones with minimal architectural changes, yielding improvements in Recall, Precision, and mean Recall, especially in the presence of long-tailed relation distributions. RS-Net was introduced and evaluated on the Action Genome dataset, demonstrating both empirical efficacy and computational efficiency (Jo et al., 11 Nov 2025).
1. Dynamic Scene Graph Generation and Problem Setup
In DSGG, the goal is to produce, for a sequence of video frames $\{I_t\}_{t=1}^{T}$, a corresponding sequence of scene graphs $\{G_t = (V_t, E_t)\}_{t=1}^{T}$, where $V_t$ is the set of detected objects (nodes) in frame $t$, and $E_t = \{(i, j, r_t^k)\}_{k=1}^{K(t)}$ is the set of $K(t)$ subject–predicate–object (triplet) relations with predicate labels $r_t^k$ drawn from a fixed relation vocabulary.
Each object $i$ at time $t$ is assigned:
- a visual feature $v_t^i$,
- a bounding box $b_t^i$,
- a category distribution $d_t^i$.
For each ordered pair $(i, j)$, a relation representation $x_t^k$ is constructed, and each possible predicate class $c$ receives a predicate score $p_t^{k,c}$. Conventional approaches lack explicit discrimination between related and unrelated pairs, limiting their ability to suppress semantically vacuous predictions during inference. RS-Net addresses this by computing a contextual “relation score” for all pairs and integrating this score into downstream triplet predictions.
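The following minimal sketch (illustrative PyTorch, not the authors' code) shows the pair-enumeration step implied by this setup: every ordered object pair in a frame becomes a candidate relation whose feature will later be scored.

```python
# Illustrative only: enumerate ordered (subject, object) pairs in one frame and
# build naive pair features by concatenating the objects' visual features.
import itertools
import torch

def enumerate_pairs(num_objects: int):
    """All ordered (subject, object) index pairs with i != j."""
    return [(i, j) for i, j in itertools.permutations(range(num_objects), 2)]

num_objects, feat_dim = 4, 256
obj_feats = torch.randn(num_objects, feat_dim)      # stand-in for detector features v_t^i
pairs = enumerate_pairs(num_objects)                 # candidate relations in this frame
pair_feats = torch.stack([torch.cat([obj_feats[i], obj_feats[j]]) for i, j in pairs])
print(pair_feats.shape)  # (num_objects * (num_objects - 1), 2 * feat_dim)
```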
2. Spatial and Temporal Context Encoders
2.1 Spatial Context Encoder
The spatial encoder captures intra-frame contextual cues via Transformer-based self-attention over relation features.
- Relation features for frame $t$: $x_t^k = [\,\bar{v}_t^i;\ \bar{v}_t^j;\ \bar{u}_t^k\,]$, where the bar denotes a learned linear projection and $u_t^k$ is the union-box RoI feature of the pair.
- The feature sequence is prepended with a learnable [Spa] context token $c_t$, giving $Z_t = [\,c_t;\ x_t^1;\ \dots;\ x_t^{K(t)}\,]$, which is processed by Transformer encoder layers, $\hat{Z}_t = \mathrm{Enc}_{\mathrm{spa}}(Z_t)$, yielding an updated context token $\hat{c}_t$ and enriched per-pair relation features $\hat{x}_t^k$.
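A compact sketch of this encoder, assuming a standard PyTorch `TransformerEncoder`; the layer count and width below are illustrative, not taken from the paper.

```python
# Spatial context encoder sketch: prepend a learnable [Spa] token to the
# per-frame relation features and run a Transformer encoder over the sequence.
import torch
import torch.nn as nn

class SpatialContextEncoder(nn.Module):
    def __init__(self, dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        self.spa_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable [Spa] token c_t
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, rel_feats: torch.Tensor):
        # rel_feats: (1, K_t, dim) relation features x_t^k for one frame
        z = torch.cat([self.spa_token, rel_feats], dim=1)   # Z_t = [c_t; x_t^1; ...; x_t^K]
        z_hat = self.encoder(z)
        return z_hat[:, 0], z_hat[:, 1:]                    # c_t_hat, enriched x_t_hat

enc = SpatialContextEncoder()
c_hat, x_hat = enc(torch.randn(1, 12, 512))                 # e.g. 12 pairs in this frame
```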
2.2 Temporal Context Encoder
To aggregate video-level context, the spatial context tokens are stacked across the sequence:
- Temporal input sequence: $Z^{\mathrm{tmp}} = [\,c^{\mathrm{tmp}};\ \hat{c}_1;\ \dots;\ \hat{c}_T\,]$, where $c^{\mathrm{tmp}}$ is a learnable [Tmp] token.
- After adding a learnable positional embedding $E_{\mathrm{pos}}$, the sequence is processed via Transformer encoder layers, $\hat{Z}^{\mathrm{tmp}} = \mathrm{Enc}_{\mathrm{tmp}}(Z^{\mathrm{tmp}} + E_{\mathrm{pos}})$, with the output token $\hat{c}^{\mathrm{tmp}}$ serving as a video-level context token.
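A matching sketch of the temporal encoder under the same assumptions; `max_frames` and all dimensions are placeholders.

```python
# Temporal context encoder sketch: stack the per-frame [Spa] tokens behind a
# learnable [Tmp] token, add a learnable positional embedding, and encode.
import torch
import torch.nn as nn

class TemporalContextEncoder(nn.Module):
    def __init__(self, dim: int = 512, max_frames: int = 64,
                 num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        self.tmp_token = nn.Parameter(torch.zeros(1, 1, dim))             # learnable [Tmp]
        self.pos_emb = nn.Parameter(torch.zeros(1, max_frames + 1, dim))  # E_pos
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, spa_tokens: torch.Tensor):
        # spa_tokens: (1, T, dim), the per-frame context tokens c_t_hat
        z = torch.cat([self.tmp_token, spa_tokens], dim=1)  # [c_tmp; c_1_hat; ...; c_T_hat]
        z = z + self.pos_emb[:, : z.size(1)]
        return self.encoder(z)[:, 0]                        # video-level token c_tmp_hat

video_ctx = TemporalContextEncoder()(torch.randn(1, 16, 512))  # e.g. 16 frames
```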
3. Unified Triplet Scoring and Losses
3.1 Relation Scoring Decoder
For each enriched pair feature $\hat{x}_t^k$, the decoder concatenates this feature with the video-level context token $\hat{c}^{\mathrm{tmp}}$ and applies a small MLP followed by a softmax, $(p_t^{k,+},\ p_t^{k,-}) = \mathrm{softmax}\!\left(\mathrm{MLP}([\,\hat{x}_t^k;\ \hat{c}^{\mathrm{tmp}}\,])\right)$, where $p_t^{k,+}$ quantifies the “meaningfulness” and $p_t^{k,-}$ the “irrelevance” of relation $k$ at time $t$.
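A minimal sketch of this decoder, assuming a two-layer MLP with an illustrative hidden size.

```python
# Relation scoring decoder sketch: concatenate each enriched pair feature with
# the video-level context token and map to a 2-way (meaningful vs. irrelevant)
# softmax via a small MLP.
import torch
import torch.nn as nn

class RelationScoringDecoder(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, pair_feats: torch.Tensor, video_ctx: torch.Tensor):
        # pair_feats: (K, dim) enriched x_t_hat; video_ctx: (dim,) c_tmp_hat
        ctx = video_ctx.expand(pair_feats.size(0), -1)
        logits = self.mlp(torch.cat([pair_feats, ctx], dim=-1))
        return logits.softmax(dim=-1)   # column 0 = meaningfulness p^+, column 1 = p^-
```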
3.2 Triplet Score Fusion
For standard DSGG, the base confidence for the triplet $(i, j, c)$ at time $t$ is given by $s_t^{k,c} = s_t^{i}\, s_t^{j}\, p_t^{k,c}$, where $s_t^{i}$ and $s_t^{j}$ are the subject and object detection confidences and $p_t^{k,c}$ is the per-predicate score. RS-Net injects contextual awareness by fusing its score, $\tilde{s}_t^{k,c} = s_t^{k,c} \cdot p_t^{k,+}$, which serves to suppress spurious triplets and emphasize contextually relevant ones.
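A small worked example of the fusion arithmetic (the numbers are invented for illustration):

```python
# Fusion example: the conventional triplet confidence is multiplied by RS-Net's
# meaningfulness probability, demoting pairs judged irrelevant in context.
subj_conf, obj_conf, pred_score = 0.9, 0.8, 0.7
base_triplet_score = subj_conf * obj_conf * pred_score   # 0.504
meaningfulness = 0.2                                      # RS-Net: likely irrelevant pair
fused_score = base_triplet_score * meaningfulness         # 0.1008, triplet is suppressed
```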
3.3 Loss Functions
The total training loss is a sum of three terms, $\mathcal{L} = \mathcal{L}_{\mathrm{od}} + \mathcal{L}_{\mathrm{rel}} + \mathcal{L}_{\mathrm{rsn}}$:
- Object detection: $\mathcal{L}_{\mathrm{od}}$, the standard detector loss over object classes and boxes.
- Predicate classification: $\mathcal{L}_{\mathrm{rel}}$, a multi-label pairwise ranking loss over predicate classes.
- Relation scoring: $\mathcal{L}_{\mathrm{rsn}}$, a focal loss over the meaningful/irrelevant decision, with weighting factor $\alpha$ and focusing parameter $\gamma$ to balance class prevalence and focus (a sketch of this term follows below).
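A sketch of the relation-scoring term only, assuming the standard binary focal-loss form; the `alpha` and `gamma` defaults below are conventional values, not necessarily those used by RS-Net.

```python
# Binary focal loss over the meaningful/irrelevant decision for each pair.
import torch

def relation_focal_loss(meaningful_prob: torch.Tensor, target: torch.Tensor,
                        alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    # meaningful_prob: (K,) softmax probability p^+ that each pair is meaningful
    # target: (K,) 1.0 for annotated (meaningful) pairs, 0.0 for background pairs
    p_t = target * meaningful_prob + (1.0 - target) * (1.0 - meaningful_prob)
    alpha_t = target * alpha + (1.0 - target) * (1.0 - alpha)
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-8))).mean()
```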
4. Integration with Dynamic Scene Graph Generation Frameworks
RS-Net is architected for modularity:
- At inference, any DSGG backbone supplies per-relation feature embeddings $r_t^k$.
- The video context token $\hat{c}^{\mathrm{tmp}}$ is concatenated, giving $[\,r_t^k;\ \hat{c}^{\mathrm{tmp}}\,]$, which is fed to the existing predicate classification heads.
- In parallel, the RS-Net MLP provides scores $(p_t^{k,+}, p_t^{k,-})$, whose meaningfulness probability $p_t^{k,+}$ is multiplied into the final triplet score.
This procedure requires no modification to the object detector or scene graph construction logic. It is compatible with various DSGG backbones, including STTran, STKET, and DSG-DETR, among others.
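A minimal sketch of the plug-in step; `rescore_triplets` is a hypothetical helper (not an API of STTran, STKET, or DSG-DETR) showing where the meaningfulness probability enters the backbone's outputs.

```python
# Re-weight an existing backbone's triplet confidences with RS-Net's scores.
import torch

def rescore_triplets(base_scores: torch.Tensor, meaningful_prob: torch.Tensor) -> torch.Tensor:
    # base_scores: (K, C) conventional triplet confidences per predicate class
    # meaningful_prob: (K,) RS-Net meaningfulness p^+ for each pair
    return base_scores * meaningful_prob.unsqueeze(-1)

fused = rescore_triplets(torch.rand(12, 26), torch.rand(12))  # 12 pairs, 26 predicates
```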
5. Experimental Evaluation
On the Action Genome benchmark, RS-Net demonstrates consistent gains in recall and precision, especially in mean Recall (mR), which is critical for long-tailed relation distributions.
| Backbone | R@10 (Baseline → RS-Net) | P@10 (Baseline → RS-Net) | mR@10 gain |
|---|---|---|---|
| STTran | 25.1 → 28.3 (+3.2) | 17.9 → 20.7 (+2.8) | +2.4 |
| STKET | 26.4 → 28.9 (+2.5) | 18.9 → 21.2 (+2.3) | - |
| DSG-DETR | 30.3 → 30.5 (+0.2) | 22.1 → 22.2 (+0.1) | - |
Additional results:
- SGCLS: DSG-DETR R@10 = 49.9 → 50.5, P@10 = 56.6 → 57.0.
- Ablation: Removal of the temporal encoder reduces SGDET R@10 from 28.3 to 28.0.
- Learnable token outperforms mean-pooling (28.3 vs. 28.0 R@10).
- Context-fusion produces small but consistent improvements.
- Precision and throughput remain competitive: e.g., STTran's inference speed increases from 0.74 to 0.75 FPS after RS-Net integration.
6. Computational Considerations
The addition of RS-Net raises parameter counts (e.g., STTran grows from 126.3M to 158.6M parameters, roughly a +32M increase), while frames-per-second throughput remains essentially unchanged or slightly improves. Negative sampling in the RSN loss speeds convergence by reducing exposure to noisy negatives (see the sketch below), and GPU parallel efficiency benefits from aligned tensor shapes.
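An illustrative sketch of such negative sampling; the positive-to-negative ratio is an assumption, not the paper's setting.

```python
# Keep all positive (annotated) pairs and a capped random subset of negatives
# when computing the relation-scoring loss.
import torch

def sample_rsn_targets(targets: torch.Tensor, neg_per_pos: int = 3) -> torch.Tensor:
    """Boolean mask keeping every positive pair and up to neg_per_pos negatives per positive."""
    pos = targets > 0.5
    neg_idx = torch.nonzero(~pos).flatten()
    keep_neg = neg_idx[torch.randperm(neg_idx.numel())[: neg_per_pos * int(pos.sum())]]
    mask = pos.clone()
    mask[keep_neg] = True
    return mask
```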
7. Analysis, Limitations, and Extensions
RS-Net’s explicit learning of “meaningful vs. irrelevant” relations mitigates long-tail effects by down-weighting predominant but semantically empty co-occurrences. The approach is modular, back-end agnostic, and leverages both intra-frame and inter-frame cues. A limitation is the isolated treatment of relations, suggesting future incorporation of graph propagation. Potential extensions include dynamic temporal windows, incorporation of cross-video context, and unsupervised relation mining. A plausible implication is that RS-Net’s context-driven score may benefit other video-based relational or multi-object understanding tasks within and beyond DSGG frameworks.
Pseudocode Illustration
```
for each video:
    for t in 1..T:
        detect objects → features {v_t^i, d_t^i, b_t^i}
        build relation set X_t = [x_t^k] for all pairs
        Z_s = [c_t; X_t]                         # spatial input with [Spa] token
        Z_s_hat = Transformer_s(Z_s)             # spatial encoder
        save c_t_hat = Z_s_hat[0], x_t_hat = Z_s_hat[1:]
    Z_temp = [c_tmp; c_1_hat; …; c_T_hat]        # temporal input with [Tmp] token
    Z_temp_hat = Transformer_t(Z_temp + E_pos)   # temporal encoder
    c_tmp_hat = Z_temp_hat[0]                    # video-level context token
    for all relations x_t_hat[k]:
        p_tk = softmax( MLP( concat(x_t_hat[k], c_tmp_hat) ) )        # relation score
        fused_score = base_triplet_score * p_tk[0]                    # gate by meaningfulness
        predicate_logits = existing_heads( concat(r_tk, c_tmp_hat) )  # backbone head
    compute losses: L_od + L_rel + L_rsn
```
RS-Net’s core contribution is a unified, context-sensitive scoring mechanism for pairwise relations in DSGG, facilitating robust, efficient, and accurate video scene understanding in the presence of challenging data distributions.