
HRRP-T: Hierarchical Reward Propagation for VLMs

Updated 3 December 2025
  • The paper introduces HRRP-T, a structured training paradigm that combines multi-level reward propagation with temporal consistency to enhance VLM performance on road scenes.
  • It employs graph-based hierarchical modeling across scene, relational, and semantic levels to enforce intra-frame logical constraints and improve mid-level attribute predictions.
  • Temporal rewards over video clips ensure smooth transitions and legally consistent attribute changes, leading to significant gains in precision and recall.

Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T) is a structured training paradigm for vision-language models (VLMs) introduced to advance mid-level road scene understanding, particularly in vision-based autonomous driving contexts. HRRP-T addresses the need for models that can reason about the logical and geometric structure of road environments beyond basic perception tasks, applying multi-level reward propagation and enforcing temporal consistency over short video clips. The method is described in the context of the RoadSceneBench dataset (Liu et al., 27 Nov 2025).

1. Graph-Based Hierarchical Relational Modeling

HRRP-T represents each image frame in a video clip as a scene graph comprising six domain-relevant mid-level attributes—lane count, ego-lane index, junction/entrance/exit recognition, lane-change feasibility, traffic condition, and scene type (urban/suburb/highway). These nodes are organized into three distinct hierarchy levels:

  • Scene-Level ($l=1$): Nodes $(v_1, v_2)$ correspond to “lane count” and “ego-lane index,” with edges encoding the constraint that the number of lanes bounds the legal index of the ego vehicle. The affinity matrix $A^{(1)} \in \mathbb{R}^{2 \times 2}$ models logical dependencies, e.g., propagating the error from a lane-count mis-estimation to the ego-lane-index prediction.
  • Relational-Level ($l=2$): Nodes $(v_3, v_4)$ capture “junction/entrance/exit” and “lane-change feasibility.” Edges reflect constraints such as lane-change legality depending on detected junctions or lane count; $A^{(2)}$ encodes these relational priors.
  • Semantic-Level ($l=3$): Nodes $(v_5, v_6)$ represent “traffic condition” and “scene type.” Edges connect global scene type to expected congestion patterns (e.g., highways vs. urban streets), assembled in $A^{(3)}$.

Inter-level edges link nodes across levels, e.g., from scene-level to relational-level nodes. The full reward propagation graph can be assembled as a block-sparse $6\times 6$ affinity matrix, but HRRP-T maintains explicit separation for interpretability.
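To make the graph construction concrete, the following Python sketch assembles the three per-level affinity matrices and the block-sparse $6\times 6$ propagation graph. The specific edge weights and the choice of inter-level edges are illustrative assumptions; the paper summary does not report the numerical values of $A^{(l)}$.

```python
import numpy as np

# Node order: v1 lane count, v2 ego-lane index, v3 junction/entrance/exit,
# v4 lane-change feasibility, v5 traffic condition, v6 scene type.
# All nonzero weights below are illustrative assumptions, not values from the paper.
A1 = np.array([[0.0, 0.5],    # lane count constrains the legal ego-lane index
               [0.0, 0.0]])
A2 = np.array([[0.0, 0.5],    # detected junction constrains lane-change legality
               [0.0, 0.0]])
A3 = np.array([[0.0, 0.3],    # scene type <-> expected congestion pattern
               [0.3, 0.0]])

def block_sparse_affinity(A1, A2, A3, inter=0.2):
    """Assemble the 6x6 reward-propagation graph: per-level blocks on the diagonal
    plus a single illustrative inter-level edge between adjacent levels."""
    A = np.zeros((6, 6))
    A[0:2, 0:2] = A1
    A[2:4, 2:4] = A2
    A[4:6, 4:6] = A3
    A[1, 3] = inter   # ego-lane index -> lane-change feasibility (scene -> relational)
    A[3, 4] = inter   # lane-change context -> traffic condition (relational -> semantic)
    return A

print(block_sparse_affinity(A1, A2, A3))
```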

2. Hierarchical Reward Propagation Formalism

Each frame $t$ yields predictions $y_t = \{y_{t,i}\}_{i=1}^6$ for the six node attributes. Frame-level correctness is assessed via three scalar rewards:

  • $R_{\text{sce}}^t$: Scene-level correctness, average over two nodes.
  • $R_{\text{rel}}^t$: Relational-level correctness, average over two nodes.
  • $R_{\text{sem}}^t$: Semantic-level correctness, average over two nodes.

Rewards are aggregated as:

$$\mathcal{R}_{\text{frame}}^t = \alpha R_{\text{sce}}^t + \beta R_{\text{rel}}^t + \gamma R_{\text{sem}}^t,\quad \alpha + \beta + \gamma = 1$$

Typical hyper-parameters: $\alpha=0.4$, $\beta=0.3$, $\gamma=0.3$. Each $R^t_{*}$ is computed as a mean of node-wise indicator rewards, with optional adjacency-weighted aggregation (e.g., using $r^{(1)} = (I + A^{(1)})\,r^{(1)}_{\text{raw}}$ to propagate errors along the graph structure). In HRRP-T, a weighted-sum mechanism was favored for simplicity and effectiveness.
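As a minimal illustration, the sketch below computes $\mathcal{R}_{\text{frame}}^t$ from six binary node-wise indicator rewards, with the adjacency-weighted variant $r^{(l)} = (I + A^{(l)})\,r^{(l)}_{\text{raw}}$ available as an option. The node ordering follows the hierarchy above; the indicator values in the example are placeholders.

```python
import numpy as np

ALPHA, BETA, GAMMA = 0.4, 0.3, 0.3   # level weights (alpha + beta + gamma = 1)

def propagate(raw, A):
    """Optional adjacency-weighted aggregation: r = (I + A) r_raw."""
    return (np.eye(len(raw)) + A) @ raw

def frame_reward(node_correct, A_levels=(None, None, None)):
    """node_correct: six indicator rewards (1 if the attribute is predicted
    correctly, else 0), ordered v1..v6 as in the hierarchy above."""
    r = np.asarray(node_correct, dtype=float)
    level_means = []
    for raw, A in zip((r[0:2], r[2:4], r[4:6]), A_levels):
        vec = propagate(raw, A) if A is not None else raw
        level_means.append(vec.mean())
    r_sce, r_rel, r_sem = level_means
    return ALPHA * r_sce + BETA * r_rel + GAMMA * r_sem

# Example: all nodes correct except ego-lane index (v2) and traffic condition (v5).
print(frame_reward([1, 0, 1, 1, 0, 1]))   # 0.4*0.5 + 0.3*1.0 + 0.3*0.5 = 0.65
```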

3. Temporal Consistency Mechanism

HRRP-T explicitly enhances temporal reliability by computing additional video clip-level reward terms over $T=5$ frames:

  • Smoothness Reward ($\mathcal{R}_{\text{smooth}}$):

$$\mathcal{R}_{\text{smooth}} = 1 - \frac{1}{T-1} \sum_{t=2}^{T} |y_t - y_{t-1}|$$

Penalizes abrupt changes in ordinal labels (lane count, ego-lane index).

  • Plausibility Reward ($\mathcal{R}_{\text{plausible}}$):

$$\mathcal{R}_{\text{plausible}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{I}[V(y_t, y_{t+1})]$$

where $V(y_t, y_{t+1})$ confirms that transitions between frames conform to physical or legal domain rules.

The combined temporal reward is:

$$\mathcal{R}_{\text{temporal}} = \lambda \mathcal{R}_{\text{smooth}} + (1-\lambda)\,\mathcal{R}_{\text{plausible}},\quad \lambda \in [0,1]$$

The final clip-level reward is:

$$\mathcal{R}_{\text{HRRP-T}} = \lambda_{\text{frame}}\,\frac{1}{T}\sum_{t=1}^T \mathcal{R}_{\text{frame}}^t + \lambda_{\text{temporal}}\,\mathcal{R}_{\text{temporal}}$$

Empirical values: $\lambda_{\text{frame}}=0.7$, $\lambda_{\text{temporal}}=0.3$.
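A minimal sketch of the temporal and clip-level aggregation is given below. It assumes that ordinal predictions are normalized so that $|y_t - y_{t-1}| \in [0,1]$ and that the validity predicate $V$ is supplied by the caller, since the summary does not specify its exact implementation.

```python
import numpy as np

LAMBDA = 0.5            # smoothness vs. plausibility trade-off
LAMBDA_FRAME = 0.7      # weight on the averaged frame-level reward
LAMBDA_TEMPORAL = 0.3   # weight on the temporal reward

def smoothness_reward(y):
    """y: length-T sequence of normalized ordinal predictions (e.g., ego-lane index)."""
    y = np.asarray(y, dtype=float)
    return 1.0 - np.abs(np.diff(y)).mean()

def plausibility_reward(y, is_valid):
    """is_valid(y_t, y_t1) -> bool encodes the physical/legal transition rules V."""
    return float(np.mean([is_valid(a, b) for a, b in zip(y[:-1], y[1:])]))

def clip_reward(frame_rewards, y_ordinal, is_valid):
    """Combine per-frame rewards with the temporal reward into R_HRRP-T."""
    r_temporal = (LAMBDA * smoothness_reward(y_ordinal)
                  + (1 - LAMBDA) * plausibility_reward(y_ordinal, is_valid))
    return LAMBDA_FRAME * float(np.mean(frame_rewards)) + LAMBDA_TEMPORAL * r_temporal

# Example over T = 5 frames: transitions are plausible if at most one lane changes
# between consecutive frames (indices normalized by an assumed 4-lane road).
legal = lambda a, b: abs(a - b) <= 0.25
print(clip_reward([0.9, 0.8, 1.0, 0.7, 0.9], [0.25, 0.25, 0.5, 0.5, 0.5], legal))
```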

4. Combined Training Objective and Optimization

HRRP-T integrates standard supervised fine-tuning (SFT) with policy gradient reinforcement learning (GRPO/PPO) to maximize hierarchical and temporal rewards. The total training loss is:

$$L_{\text{total}} = L_{\mathrm{SFT}} + \eta L_{\mathrm{RL}} + \zeta L_{\mathrm{clip}}$$

Where:

  • $L_{\mathrm{SFT}} = -\sum_{t=1}^{T} \sum_{i=1}^{6} \log p_\theta(y_{t,i}^* \mid x_{1:t})$ (cross-entropy over six questions per frame)
  • $L_{\mathrm{RL}} = -\mathbb{E}_{\pi_\theta} [\mathcal{R}_{\text{HRRP-T}}]$ (negative expected hierarchical-temporal reward)
  • $L_{\mathrm{clip}}$ is the PPO KL or ratio-clipping penalty

Typical coefficients: $\eta \approx 1.0$, $\zeta \approx 0.1$. Model parameters ($\theta$) correspond to LoRA adapters on the Qwen2.5-VL-7B backbone.

The RL stage involves rolling out batches of $K$ video clips, producing policy predictions for all nodes at all frames, computing all hierarchical and temporal rewards, and propagating gradients into the adapter parameters.
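The sketch below shows one plausible way to assemble the combined objective in PyTorch: a cross-entropy SFT term over the six attribute questions, a REINFORCE-style surrogate for $-\mathbb{E}_{\pi_\theta}[\mathcal{R}_{\text{HRRP-T}}]$ with a batch-mean baseline (a GRPO-like choice), and a ratio-clipping penalty as $L_{\mathrm{clip}}$. The tensor shapes, baseline, and exact form of the clipping penalty are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

ETA, ZETA, EPS_CLIP = 1.0, 0.1, 0.1

def total_loss(logits, targets, logp_new, logp_old, clip_rewards):
    """
    logits:       (B, T, 6, C) attribute logits for B clips, T frames, 6 questions
    targets:      (B, T, 6)    ground-truth attribute class indices
    logp_new:     (B,)         log-prob of sampled answers under the current policy
    logp_old:     (B,)         log-prob of the same answers under the rollout policy
    clip_rewards: (B,)         R_HRRP-T computed per clip
    """
    # Supervised fine-tuning term: cross-entropy over six questions per frame.
    l_sft = F.cross_entropy(logits.flatten(0, 2), targets.flatten())

    # RL term: REINFORCE surrogate for -E_pi[R], with a batch-mean baseline
    # (a GRPO-like group baseline; an assumption, not the paper's exact form).
    adv = clip_rewards - clip_rewards.mean()
    l_rl = -(adv.detach() * logp_new).mean()

    # PPO-style ratio-clipping penalty: penalize ratios outside [1-eps, 1+eps].
    ratio = torch.exp(logp_new - logp_old.detach())
    l_clip = torch.clamp(torch.abs(ratio - 1.0) - EPS_CLIP, min=0.0).mean()

    return l_sft + ETA * l_rl + ZETA * l_clip
```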

5. Implementation Specifications

HRRP-T is implemented on the Qwen2.5-VL-7B architecture with frozen backbone weights and LoRA adapters ($r=8$) installed on cross-attention layers. Training is staged:

  • Stage 1: Supervised Fine-Tuning (5 epochs, batch size 64, learning rate $1 \times 10^{-4}$, weight decay $1 \times 10^{-5}$). Each sample is one image with six templated attribute questions.
  • Stage 2: HRRP-T RL Optimization (GRPO/PPO with clip $\epsilon=0.1$, value learning rate $5 \times 10^{-5}$, around 100K steps, 4 $\times$ NVIDIA A800 GPUs). Each step processes $T=5$ frames and all reward terms.

Hyper-parameters for reward weighting: $\alpha=0.4$, $\beta=0.3$, $\gamma=0.3$; $\lambda=0.5$; $\lambda_{\text{frame}}=0.7$; $\lambda_{\text{temporal}}=0.3$.
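For reference, the adapter setup can be expressed with the Hugging Face peft library roughly as follows. The target module names and the LoRA scaling/dropout values are assumptions and depend on the Qwen2.5-VL-7B implementation in the installed transformers version.

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the frozen backbone; requires a transformers version with Qwen2.5-VL support.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Rank-8 adapters as described above. target_modules is an assumed list of
# attention projection names -- adjust to the actual module names in the checkpoint.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,        # assumed scaling factor (not reported here)
    lora_dropout=0.05,    # assumed dropout (not reported here)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```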

6. Experimental Outcomes and Comparative Analysis

Experiments were conducted on the RoadSceneBench dataset comprising 2,341 clips of 5 frames each (11,705 images and approximately 163,000 labels in total). Tasks spanned lane count, ego-lane index, junction/entrance/exit, lane-change feasibility, traffic condition, and scene type.

The evaluation used per-task precision (P) and recall (R), reporting "overall P/R." Baselines included closed-source systems (Gemini-2.5-Pro, GPT-4o, Claude-3.7) and open-source VLMs.

| Method | Overall Precision (%) | Overall Recall (%) |
|---|---|---|
| Gemini-2.5-Pro | 60.61 | 52.70 |
| MapVLM (SFT only) | 72.14 | 67.25 |
| MapVLM (SFT + HRRP-T) | 75.78 | 72.17 |

Notable taskwise improvements:

  • Ego-lane Index Recall improved from 50.37% to 84.67%.
  • Lane-Change Feasibility maintained P/R above 83%.

Ablation studies showed +3.6 percentage points in overall precision and +4.9 in recall from SFT to SFT+HRRP-T, with the greatest gains in temporally sensitive attributes. Qualitative analyses indicated enhanced logical and topological consistency under occlusions versus standard SFT-only training, with HRRP-T maintaining stable lane topology and ego-lane index assignments.

This suggests that hierarchical frame-level rewards enforce intra-frame logical coherence, temporal rewards address frame-to-frame flicker, and their integration via GRPO/PPO significantly elevates model performance on structure-aware road scene understanding benchmarks (Liu et al., 27 Nov 2025).
