
HRRP-T: Hierarchical Reward Propagation for VLMs

Updated 3 December 2025
  • The paper introduces HRRP-T, a structured training paradigm that combines multi-level reward propagation with temporal consistency to enhance VLM performance on road scenes.
  • It employs graph-based hierarchical modeling across scene, relational, and semantic levels to enforce intra-frame logical constraints and improve mid-level attribute predictions.
  • Temporal rewards over video clips ensure smooth transitions and legally consistent attribute changes, leading to significant gains in precision and recall.

Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T) is a structured training paradigm for vision-language models (VLMs) introduced to advance mid-level road scene understanding, particularly in vision-based autonomous driving contexts. HRRP-T addresses the need for models that can reason about the logical and geometric structure of road environments beyond basic perception tasks, applying multi-level reward propagation and enforcing temporal consistency over short video clips. The method is described in the context of the RoadSceneBench dataset (Liu et al., 27 Nov 2025).

1. Graph-Based Hierarchical Relational Modeling

HRRP-T represents each image frame in a video clip as a scene graph comprising six domain-relevant mid-level attributes—lane count, ego-lane index, junction/entrance/exit recognition, lane-change feasibility, traffic condition, and scene type (urban/suburb/highway). These nodes are organized into three distinct hierarchy levels:

  • Scene-Level ($l=1$): Nodes $(v_1, v_2)$ correspond to “lane count” and “ego-lane index,” with edges encoding the constraint that the number of lanes bounds the legal index of the ego vehicle. The affinity matrix $A^{(1)} \in \mathbb{R}^{2 \times 2}$ models logical dependencies, e.g., propagating the error from a lane-count mis-estimation to the ego-lane-index prediction.
  • Relational-Level ($l=2$): Nodes $(v_3, v_4)$ capture “junction/entrance/exit” and “lane-change feasibility.” Edges reflect constraints such as lane-change legality depending on detected junctions or lane count; $A^{(2)}$ encodes these relational priors.
  • Semantic-Level ($l=3$): Nodes $(v_5, v_6)$ represent “traffic condition” and “scene type.” Edges connect global scene type to expected congestion patterns (e.g., highways vs. urban streets), assembled in $A^{(3)}$.

Inter-level edges link nodes across levels, e.g., from scene-level to relational-level nodes. The full reward propagation graph can be assembled as a block-sparse $6\times 6$ affinity matrix, but HRRP-T maintains explicit separation for interpretability.
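To make the graph construction concrete, the following Python sketch assembles the three per-level affinity matrices and the block-sparse $6\times 6$ propagation graph. The specific edge weights and the choice of inter-level edges are illustrative assumptions; the paper summary does not report the numerical values of $A^{(l)}$.

```python
import numpy as np

# Node order: v1 lane count, v2 ego-lane index, v3 junction/entrance/exit,
# v4 lane-change feasibility, v5 traffic condition, v6 scene type.
# All nonzero weights below are illustrative assumptions, not values from the paper.
A1 = np.array([[0.0, 0.5],    # lane count constrains the legal ego-lane index
               [0.0, 0.0]])
A2 = np.array([[0.0, 0.5],    # detected junction constrains lane-change legality
               [0.0, 0.0]])
A3 = np.array([[0.0, 0.3],    # scene type <-> expected congestion pattern
               [0.3, 0.0]])

def block_sparse_affinity(A1, A2, A3, inter=0.2):
    """Assemble the 6x6 reward-propagation graph: per-level blocks on the diagonal
    plus a single illustrative inter-level edge between adjacent levels."""
    A = np.zeros((6, 6))
    A[0:2, 0:2] = A1
    A[2:4, 2:4] = A2
    A[4:6, 4:6] = A3
    A[1, 3] = inter   # ego-lane index -> lane-change feasibility (scene -> relational)
    A[3, 4] = inter   # lane-change context -> traffic condition (relational -> semantic)
    return A

print(block_sparse_affinity(A1, A2, A3))
```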

2. Hierarchical Reward Propagation Formalism

Each frame $t$ yields predictions $y_t = \{y_{t,i}\}_{i=1}^6$ for the six node attributes. Frame-level correctness is assessed via three scalar rewards:

  • $R_{\text{sce}}^t$: Scene-level correctness, average over two nodes.
  • $R_{\text{rel}}^t$: Relational-level correctness, average over two nodes.
  • $R_{\text{sem}}^t$: Semantic-level correctness, average over two nodes.

Rewards are aggregated as:

$$\mathcal{R}_{\text{frame}}^t = \alpha R_{\text{sce}}^t + \beta R_{\text{rel}}^t + \gamma R_{\text{sem}}^t,\quad \alpha + \beta + \gamma = 1$$

Typical hyper-parameters: $\alpha=0.4$, $\beta=0.3$, $\gamma=0.3$. Each $R^t_{*}$ is computed as a mean of node-wise indicator rewards, with optional adjacency-weighted aggregation (e.g., using $r^{(1)} = (I + A^{(1)})\,r^{(1)}_{\text{raw}}$ to propagate errors along the graph structure). In HRRP-T, a weighted-sum mechanism was favored for simplicity and effectiveness.
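As a minimal illustration, the sketch below computes $\mathcal{R}_{\text{frame}}^t$ from six binary node-wise indicator rewards, with the adjacency-weighted variant $r^{(l)} = (I + A^{(l)})\,r^{(l)}_{\text{raw}}$ available as an option. The node ordering follows the hierarchy above; the indicator values in the example are placeholders.

```python
import numpy as np

ALPHA, BETA, GAMMA = 0.4, 0.3, 0.3   # level weights (alpha + beta + gamma = 1)

def propagate(raw, A):
    """Optional adjacency-weighted aggregation: r = (I + A) r_raw."""
    return (np.eye(len(raw)) + A) @ raw

def frame_reward(node_correct, A_levels=(None, None, None)):
    """node_correct: six indicator rewards (1 if the attribute is predicted
    correctly, else 0), ordered v1..v6 as in the hierarchy above."""
    r = np.asarray(node_correct, dtype=float)
    level_means = []
    for raw, A in zip((r[0:2], r[2:4], r[4:6]), A_levels):
        vec = propagate(raw, A) if A is not None else raw
        level_means.append(vec.mean())
    r_sce, r_rel, r_sem = level_means
    return ALPHA * r_sce + BETA * r_rel + GAMMA * r_sem

# Example: all nodes correct except ego-lane index (v2) and traffic condition (v5).
print(frame_reward([1, 0, 1, 1, 0, 1]))   # 0.4*0.5 + 0.3*1.0 + 0.3*0.5 = 0.65
```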

3. Temporal Consistency Mechanism

HRRP-T explicitly enhances temporal reliability by computing additional video clip-level reward terms over $T=5$ frames:

  • Smoothness Reward ($\mathcal{R}_{\text{smooth}}$):

$$\mathcal{R}_{\text{smooth}} = 1 - \frac{1}{T-1} \sum_{t=2}^{T} |y_t - y_{t-1}|$$

Penalizes abrupt changes in ordinal labels (lane count, ego-lane index).

  • Plausibility Reward ($\mathcal{R}_{\text{plausible}}$):

$$\mathcal{R}_{\text{plausible}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{I}[V(y_t, y_{t+1})]$$

where $V(y_t, y_{t+1})$ confirms that transitions between frames conform to physical or legal domain rules.

The combined temporal reward is:

$$\mathcal{R}_{\text{temporal}} = \lambda \mathcal{R}_{\text{smooth}} + (1-\lambda)\,\mathcal{R}_{\text{plausible}},\quad \lambda \in [0,1]$$

The final clip-level reward is:

$$\mathcal{R}_{\text{HRRP-T}} = \lambda_{\text{frame}}\,\frac{1}{T}\sum_{t=1}^T \mathcal{R}_{\text{frame}}^t + \lambda_{\text{temporal}}\,\mathcal{R}_{\text{temporal}}$$

Empirical values: $\lambda_{\text{frame}}=0.7$, $\lambda_{\text{temporal}}=0.3$.
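A minimal sketch of the temporal and clip-level aggregation is given below. It assumes that ordinal predictions are normalized so that $|y_t - y_{t-1}| \in [0,1]$ and that the validity predicate $V$ is supplied by the caller, since the summary does not specify its exact implementation.

```python
import numpy as np

LAMBDA = 0.5            # smoothness vs. plausibility trade-off
LAMBDA_FRAME = 0.7      # weight on the averaged frame-level reward
LAMBDA_TEMPORAL = 0.3   # weight on the temporal reward

def smoothness_reward(y):
    """y: length-T sequence of normalized ordinal predictions (e.g., ego-lane index)."""
    y = np.asarray(y, dtype=float)
    return 1.0 - np.abs(np.diff(y)).mean()

def plausibility_reward(y, is_valid):
    """is_valid(y_t, y_t1) -> bool encodes the physical/legal transition rules V."""
    return float(np.mean([is_valid(a, b) for a, b in zip(y[:-1], y[1:])]))

def clip_reward(frame_rewards, y_ordinal, is_valid):
    """Combine per-frame rewards with the temporal reward into R_HRRP-T."""
    r_temporal = (LAMBDA * smoothness_reward(y_ordinal)
                  + (1 - LAMBDA) * plausibility_reward(y_ordinal, is_valid))
    return LAMBDA_FRAME * float(np.mean(frame_rewards)) + LAMBDA_TEMPORAL * r_temporal

# Example over T = 5 frames: transitions are plausible if at most one lane changes
# between consecutive frames (indices normalized by an assumed 4-lane road).
legal = lambda a, b: abs(a - b) <= 0.25
print(clip_reward([0.9, 0.8, 1.0, 0.7, 0.9], [0.25, 0.25, 0.5, 0.5, 0.5], legal))
```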

4. Combined Training Objective and Optimization

HRRP-T integrates standard supervised fine-tuning (SFT) with policy gradient reinforcement learning (GRPO/PPO) to maximize hierarchical and temporal rewards. The total training loss is:

$$L_{\text{total}} = L_{\mathrm{SFT}} + \eta L_{\mathrm{RL}} + \zeta L_{\mathrm{clip}}$$

Where:

  • $L_{\mathrm{SFT}} = -\sum_{t=1}^{T} \sum_{i=1}^{6} \log p_\theta(y_{t,i}^* \mid x_{1:t})$ (cross-entropy over six questions per frame)
  • $L_{\mathrm{RL}} = -\mathbb{E}_{\pi_\theta} [\mathcal{R}_{\text{HRRP-T}}]$ (negative expected hierarchical-temporal reward)
  • $L_{\mathrm{clip}}$ is the PPO KL or ratio-clipping penalty

Typical coefficients: $\eta \approx 1.0$, $\zeta \approx 0.1$. Model parameters ($\theta$) correspond to LoRA adapters on the Qwen2.5-VL-7B backbone.

The RL stage involves rolling out batches of $K$ video clips, producing policy predictions for all nodes at all frames, computing all hierarchical and temporal rewards, and propagating gradients into the adapter parameters.
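The sketch below shows one plausible way to assemble the combined objective in PyTorch: a cross-entropy SFT term over the six attribute questions, a REINFORCE-style surrogate for $-\mathbb{E}_{\pi_\theta}[\mathcal{R}_{\text{HRRP-T}}]$ with a batch-mean baseline (a GRPO-like choice), and a ratio-clipping penalty as $L_{\mathrm{clip}}$. The tensor shapes, baseline, and exact form of the clipping penalty are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

ETA, ZETA, EPS_CLIP = 1.0, 0.1, 0.1

def total_loss(logits, targets, logp_new, logp_old, clip_rewards):
    """
    logits:       (B, T, 6, C) attribute logits for B clips, T frames, 6 questions
    targets:      (B, T, 6)    ground-truth attribute class indices
    logp_new:     (B,)         log-prob of sampled answers under the current policy
    logp_old:     (B,)         log-prob of the same answers under the rollout policy
    clip_rewards: (B,)         R_HRRP-T computed per clip
    """
    # Supervised fine-tuning term: cross-entropy over six questions per frame.
    l_sft = F.cross_entropy(logits.flatten(0, 2), targets.flatten())

    # RL term: REINFORCE surrogate for -E_pi[R], with a batch-mean baseline
    # (a GRPO-like group baseline; an assumption, not the paper's exact form).
    adv = clip_rewards - clip_rewards.mean()
    l_rl = -(adv.detach() * logp_new).mean()

    # PPO-style ratio-clipping penalty: penalize ratios outside [1-eps, 1+eps].
    ratio = torch.exp(logp_new - logp_old.detach())
    l_clip = torch.clamp(torch.abs(ratio - 1.0) - EPS_CLIP, min=0.0).mean()

    return l_sft + ETA * l_rl + ZETA * l_clip
```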

5. Implementation Specifications

HRRP-T is implemented on the Qwen2.5-VL-7B architecture with frozen backbone weights and LoRA adapters ($r=8$) installed on cross-attention layers. Training is staged:

  • Stage 1: Supervised Fine-Tuning (5 epochs, batch size 64, learning rate $1 \times 10^{-4}$, weight decay $1 \times 10^{-5}$). Each sample is one image with six templated attribute questions.
  • Stage 2: HRRP-T RL Optimization (GRPO/PPO with clip $\epsilon=0.1$, value learning rate $5 \times 10^{-5}$, around 100K steps, 4 $\times$ NVIDIA A800 GPUs). Each step processes $T=5$ frames and all reward terms.

Hyper-parameters for reward weighting: $\alpha=0.4$, $\beta=0.3$, $\gamma=0.3$; $\lambda=0.5$; $\lambda_{\text{frame}}=0.7$; $\lambda_{\text{temporal}}=0.3$.
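For reference, the adapter setup can be expressed with the Hugging Face peft library roughly as follows. The target module names and the LoRA scaling/dropout values are assumptions and depend on the Qwen2.5-VL-7B implementation in the installed transformers version.

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the frozen backbone; requires a transformers version with Qwen2.5-VL support.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Rank-8 adapters as described above. target_modules is an assumed list of
# attention projection names -- adjust to the actual module names in the checkpoint.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,        # assumed scaling factor (not reported here)
    lora_dropout=0.05,    # assumed dropout (not reported here)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```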

6. Experimental Outcomes and Comparative Analysis

Experiments were conducted on the RoadSceneBench dataset comprising 2,341 clips of 5 frames each (11,705 images and approximately 163,000 labels in total). Tasks spanned lane count, ego-lane index, junction/entrance/exit, lane-change feasibility, traffic condition, and scene type.

The evaluation used per-task precision (P) and recall (R), reporting "overall P/R." Baselines included closed-source systems (Gemini-2.5-Pro, GPT-4o, Claude-3.7) and open-source VLMs.

| Method | Overall Precision (%) | Overall Recall (%) |
|---|---|---|
| Gemini-2.5-Pro | 60.61 | 52.70 |
| MapVLM (SFT only) | 72.14 | 67.25 |
| MapVLM (SFT + HRRP-T) | 75.78 | 72.17 |

Notable taskwise improvements:

  • Ego-lane Index Recall improved from 50.37% to 84.67%.
  • Lane-Change Feasibility maintained P/R above 83%.

Ablation studies showed +3.6 percentage points in overall precision and +4.9 in recall from SFT to SFT+HRRP-T, with the greatest gains in temporally sensitive attributes. Qualitative analyses indicated enhanced logical and topological consistency under occlusions versus standard SFT-only training, with HRRP-T maintaining stable lane topology and ego-lane index assignments.

This suggests that hierarchical frame-level rewards enforce intra-frame logical coherence, temporal rewards address frame-to-frame flicker, and their integration via GRPO/PPO significantly elevates model performance on structure-aware road scene understanding benchmarks (Liu et al., 27 Nov 2025).
