
HRRP-T: Hierarchical Reward Propagation for VLMs

Updated 3 December 2025
  • The paper introduces HRRP-T, a structured training paradigm that combines multi-level reward propagation with temporal consistency to enhance VLM performance on road scenes.
  • It employs graph-based hierarchical modeling across scene, relational, and semantic levels to enforce intra-frame logical constraints and improve mid-level attribute predictions.
  • Temporal rewards over video clips ensure smooth transitions and legally consistent attribute changes, leading to significant gains in precision and recall.

Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T) is a structured training paradigm for Vision-Language Models (VLMs) introduced to advance mid-level road scene understanding, particularly in vision-based autonomous driving contexts. HRRP-T addresses the need for models that can reason about the logical and geometric structure of road environments beyond basic perception tasks, applying multi-level reward propagation and enforcing temporal consistency over short video clips. The method is described in the context of the RoadSceneBench dataset (Liu et al., 27 Nov 2025).

1. Graph-Based Hierarchical Relational Modeling

HRRP-T represents each image frame in a video clip as a scene graph comprising six domain-relevant mid-level attributes—lane count, ego-lane index, junction/entrance/exit recognition, lane-change feasibility, traffic condition, and scene type (urban/suburb/highway). These nodes are organized into three distinct hierarchy levels:

  • Scene-Level ($l=1$): Nodes $(v_1, v_2)$ correspond to “lane count” and “ego-lane index,” with edges encoding the constraint that the number of lanes bounds the legal index of the ego vehicle. The affinity matrix $A^{(1)} \in \mathbb{R}^{2 \times 2}$ models logical dependencies, e.g., propagating a lane-count mis-estimation to the ego-lane index prediction.
  • Relational-Level ($l=2$): Nodes $(v_3, v_4)$ capture “junction/entrance/exit” and “lane-change feasibility.” Edges reflect constraints such as lane-change legality depending on detected junctions or lane count; $A^{(2)}$ encodes these relational priors.
  • Semantic-Level ($l=3$): Nodes $(v_5, v_6)$ represent “traffic condition” and “scene type.” Edges connect global scene type to expected congestion patterns (e.g., highways vs. urban streets), assembled in $A^{(3)}$.

Inter-level edges link nodes across levels, e.g., from scene-level to relational-level nodes. The full reward propagation graph can be assembled as a block-sparse $6 \times 6$ affinity matrix, but HRRP-T maintains explicit separation for interpretability.
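The three-level structure can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the node names, the affinity values, the 1-based ego-lane indexing, and the helper names `block_affinity` and `ego_index_is_legal` are all assumptions, and inter-level edges are left out so that only the three diagonal blocks are filled.

```python
# Hypothetical node names; the source specifies only the attributes and their levels.
LEVELS = {
    1: ("lane_count", "ego_lane_index"),        # scene level
    2: ("junction", "lane_change_feasible"),    # relational level
    3: ("traffic_condition", "scene_type"),     # semantic level
}


def block_affinity(blocks):
    """Place three 2x2 intra-level matrices on the diagonal of a 6x6 matrix."""
    full = [[0.0] * 6 for _ in range(6)]
    for level, block in blocks.items():
        offset = 2 * (level - 1)
        for i in range(2):
            for j in range(2):
                full[offset + i][offset + j] = block[i][j]
    return full


def ego_index_is_legal(lane_count, ego_lane_index):
    """Scene-level constraint: the ego-lane index cannot exceed the lane count.

    Assumes 1-based lane indexing, which the source does not state.
    """
    return 1 <= ego_lane_index <= lane_count


# Placeholder affinities: self-weight 1.0, intra-level coupling 0.5.
blocks = {level: [[1.0, 0.5], [0.5, 1.0]] for level in LEVELS}
A = block_affinity(blocks)
```

Entries outside the diagonal blocks stay zero here; adding inter-level edges would mean filling selected off-diagonal blocks.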

2. Hierarchical Reward Propagation Formalism

Each frame $t$ yields predictions $\hat{y}_t$ for the six node attributes. Frame-level correctness is assessed via three scalar rewards:

  • $R^{(1)}_t$: Scene-level correctness, average over two nodes.
  • $R^{(2)}_t$: Relational-level correctness, average over two nodes.
  • $R^{(3)}_t$: Semantic-level correctness, average over two nodes.

Rewards are aggregated as a weighted sum:

$R^{\text{frame}}_t = \lambda_1 R^{(1)}_t + \lambda_2 R^{(2)}_t + \lambda_3 R^{(3)}_t$

The level weights $\lambda_1$, $\lambda_2$, $\lambda_3$ are hyper-parameters. Each $R^{(l)}_t$ is computed as a mean of node-wise indicator rewards, with optional adjacency-weighted aggregation (e.g., using $A^{(l)}$ to propagate errors along the graph structure). In HRRP-T, a weighted sum mechanism was favored for simplicity and effectiveness.
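A frame-level reward of this kind can be sketched as follows. The node names, the example labels, and the weights `(0.5, 0.25, 0.25)` are illustrative placeholders rather than values from the paper, and the adjacency-weighted variant is not shown.

```python
# Node names and weights are illustrative placeholders, not values from the paper.
LEVEL_NODES = {
    1: ("lane_count", "ego_lane_index"),        # scene level
    2: ("junction", "lane_change_feasible"),    # relational level
    3: ("traffic_condition", "scene_type"),     # semantic level
}


def level_reward(pred, gold, nodes):
    """Mean of node-wise indicator rewards (1.0 when prediction matches label)."""
    return sum(1.0 if pred[n] == gold[n] else 0.0 for n in nodes) / len(nodes)


def frame_reward(pred, gold, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of the three level rewards for a single frame."""
    return sum(
        w * level_reward(pred, gold, LEVEL_NODES[level])
        for level, w in zip((1, 2, 3), weights)
    )


gold = {
    "lane_count": 3,
    "ego_lane_index": 2,
    "junction": "none",
    "lane_change_feasible": True,
    "traffic_condition": "free",
    "scene_type": "highway",
}
# One scene-level error: the ego-lane index is wrong, everything else is right.
pred = dict(gold, ego_lane_index=3)
reward = frame_reward(pred, gold)
```

With these placeholder weights, a single scene-level mistake halves the scene-level term and leaves the other two levels untouched.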

3. Temporal Consistency Mechanism

HRRP-T explicitly enhances temporal reliability by computing additional clip-level reward terms over the $T$ frames of a video clip:

  • Smoothness Reward ($R_{\text{smooth}}$): penalizes abrupt frame-to-frame changes in ordinal labels (lane count, ego-lane index).

  • Plausibility Reward ($R_{\text{plaus}}$): scores frame-to-frame transitions with a validity check that confirms they conform to physical or legal domain rules.

The two terms are combined into a temporal reward $R_{\text{temp}}$, and the final clip-level reward $R_{\text{clip}}$ combines the frame-level hierarchical rewards with $R_{\text{temp}}$. The combination weights are set empirically.
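One way such temporal terms could be computed is sketched below. The functional forms are assumptions chosen to match the verbal description (a mean absolute jump for smoothness, a fraction of valid transitions for plausibility), and the validity rule, the weights, and all function names are hypothetical.

```python
def smoothness_reward(labels):
    """Negative mean absolute jump between consecutive ordinal labels (assumed form)."""
    if len(labels) < 2:
        return 0.0
    jumps = [abs(b - a) for a, b in zip(labels, labels[1:])]
    return -sum(jumps) / len(jumps)


def plausibility_reward(lane_counts, ego_indices, max_step=1):
    """Fraction of frame-to-frame transitions that pass a validity check.

    The rule used here is hypothetical: the ego-lane index must stay within
    the lane count and may move by at most `max_step` lanes per frame.
    """
    frames = list(zip(lane_counts, ego_indices))
    if len(frames) < 2:
        return 1.0

    def valid(prev, cur):
        (_, prev_ego), (cur_count, cur_ego) = prev, cur
        return 1 <= cur_ego <= cur_count and abs(cur_ego - prev_ego) <= max_step

    checks = [valid(p, c) for p, c in zip(frames, frames[1:])]
    return sum(checks) / len(checks)


def clip_reward(frame_rewards, r_smooth, r_plaus,
                w_smooth=0.5, w_plaus=0.5, w_temp=1.0):
    """Mean frame reward plus a weighted temporal term (weights are placeholders)."""
    r_temp = w_smooth * r_smooth + w_plaus * r_plaus
    return sum(frame_rewards) / len(frame_rewards) + w_temp * r_temp
```

A clip whose lane count flickers from 3 to 1 and back is penalized by the smoothness term even if each individual frame would be plausible in isolation.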

4. Combined Training Objective and Optimization

HRRP-T integrates standard supervised fine-tuning (SFT) with policy-gradient reinforcement learning (GRPO/PPO) to maximize hierarchical and temporal rewards. The total training loss $\mathcal{L}(\theta)$ combines three terms:

  • $\mathcal{L}_{\text{SFT}}$: cross-entropy over six questions per frame
  • $\mathcal{L}_{\text{RL}}$: negative expected hierarchical-temporal reward
  • $\mathcal{L}_{\text{KL}}$: the PPO KL or ratio clipping penalty

The terms are balanced by weighting coefficients. Model parameters ($\theta$) correspond to LoRA adapters on the Qwen2.5-VL-7B backbone.

The RL stage involves rolling out batches of video clips, producing policy predictions for all nodes at all frames, computing all hierarchical and temporal rewards, and propagating gradients into adapter parameters.
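In GRPO-style training, the reward of each sampled response is typically normalized against the other responses sampled for the same input before it weights the policy gradient. The sketch below shows that normalization step only. Whether HRRP-T uses exactly this form is not stated in the source, and the function name and `eps` constant are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same clip.

    Each advantage is (reward - group mean) / (group std + eps); `eps`
    avoids division by zero when every rollout earns the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Clip-level rewards for four hypothetical rollouts of the same clip.
advantages = group_relative_advantages([0.9, 0.6, 0.75, 0.75])
```

Rollouts that score above the group mean receive positive advantages and are reinforced; identical rewards across a group yield zero advantage and therefore no gradient signal.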

5. Implementation Specifications

HRRP-T is implemented on the Qwen2.5-VL-7B architecture with frozen backbone weights and LoRA adapters installed on cross-attention layers. Training is staged:

  • Stage 1: Supervised Fine-Tuning (5 epochs, batch size 64, with its own learning rate and weight decay). Each sample is one image with six templated attribute questions.
  • Stage 2: HRRP-T RL Optimization (GRPO/PPO with ratio clipping and a separate value learning rate, run for on the order of thousands of steps on 4 × NVIDIA A800 GPUs). Each step processes a batch of clip frames and computes all reward terms.
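The frozen-backbone-plus-adapter setup rests on the standard LoRA update, in which a frozen weight matrix $W$ is augmented by a trainable low-rank product scaled by $\alpha / r$. The sketch below shows that forward pass in plain Python. The matrix sizes, the rank, and the scaling value are illustrative, since the source does not give them in recoverable form.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    would receive gradients during training.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity, for illustration)
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[1.0], [2.0]]             # rank-1 up-projection
y = lora_forward(W, A, B, [1.0, 2.0], alpha=2.0, r=1)
```

Initializing `B` to zeros, as is common for LoRA, makes the adapted layer start out identical to the frozen one.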

The reward-weighting hyper-parameters comprise the three level weights of the frame-level reward, the temporal-reward weights, and the loss coefficients.

6. Experimental Outcomes and Comparative Analysis

Experiments were conducted on the RoadSceneBench dataset, whose clips each contain 5 frames. Tasks spanned lane count, ego-lane index, junction/entrance/exit, lane-change feasibility, traffic condition, and scene type.

The evaluation used per-task precision (P) and recall (R), reporting "overall P/R." Baselines included closed-source systems (Gemini-2.5-Pro, GPT-4o, Claude-3.7) and open-source VLMs.
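Per-task precision and recall for a multi-class attribute can be computed as sketched below. The source does not say how class-level scores are aggregated into a per-task or overall figure, so the macro average used here is an assumption, as are the function names.

```python
def precision_recall(preds, golds, label):
    """Precision and recall for one class label, treated as the positive class."""
    pairs = list(zip(preds, golds))
    tp = sum(1 for p, g in pairs if p == label and g == label)
    fp = sum(1 for p, g in pairs if p == label and g != label)
    fn = sum(1 for p, g in pairs if p != label and g == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_precision_recall(preds, golds):
    """Unweighted mean of per-class precision and recall (assumed aggregation)."""
    labels = sorted(set(golds))
    scores = [precision_recall(preds, golds, label) for label in labels]
    mean_p = sum(p for p, _ in scores) / len(scores)
    mean_r = sum(r for _, r in scores) / len(scores)
    return mean_p, mean_r


# Toy lane-count predictions for four frames.
p, r = macro_precision_recall(preds=[1, 1, 2, 2], golds=[1, 2, 2, 2])
```

A micro average, which pools counts across classes before dividing, would weight frequent classes more heavily and can give noticeably different numbers on imbalanced attributes.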

Method                   Overall Precision (%)   Overall Recall (%)
Gemini-2.5-Pro           60.61                   52.70
MapVLM (SFT only)        72.14                   67.25
MapVLM (SFT + HRRP-T)    75.78                   72.17

Notable taskwise improvements:

  • Ego-lane Index Recall improved from 50.37% to 84.67%.
  • Lane-Change Feasibility maintained P/R above 83%.

Ablation studies showed gains of +3.64 percentage points in overall precision and +4.92 in overall recall from SFT to SFT+HRRP-T (75.78 vs. 72.14 and 72.17 vs. 67.25), with the greatest gains in temporally sensitive attributes. Qualitative analyses indicated enhanced logical and topological consistency under occlusions versus standard SFT-only training, with HRRP-T maintaining stable lane topology and ego-lane index assignments.
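The overall gains from adding HRRP-T follow directly from the results table. The short check below recomputes them from the reported precision and recall figures; only the variable names are introduced here.

```python
# Overall precision / recall (%) as reported in the results table.
results = {
    "Gemini-2.5-Pro": (60.61, 52.70),
    "MapVLM (SFT only)": (72.14, 67.25),
    "MapVLM (SFT + HRRP-T)": (75.78, 72.17),
}

sft_p, sft_r = results["MapVLM (SFT only)"]
hrrp_p, hrrp_r = results["MapVLM (SFT + HRRP-T)"]

# Gains in percentage points, rounded to the table's two decimals.
delta_precision = round(hrrp_p - sft_p, 2)
delta_recall = round(hrrp_r - sft_r, 2)
```

The recall gain is larger than the precision gain, which is consistent with the large ego-lane index recall improvement reported among the taskwise results.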

This suggests that hierarchical frame-level rewards enforce intra-frame logical coherence, temporal rewards address frame-to-frame flicker, and their integration via GRPO/PPO significantly elevates model performance on structure-aware road scene understanding benchmarks (Liu et al., 27 Nov 2025).
