HRRP-T: Hierarchical Reward Propagation for VLMs
- The paper introduces HRRP-T, a structured training paradigm that combines multi-level reward propagation with temporal consistency to enhance VLM performance on road scenes.
- It employs graph-based hierarchical modeling across scene, relational, and semantic levels to enforce intra-frame logical constraints and improve mid-level attribute predictions.
- Temporal rewards over video clips ensure smooth transitions and legally consistent attribute changes, leading to significant gains in precision and recall.
Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T) is a structured training paradigm for Vision-LLMs (VLMs) introduced to advance mid-level road scene understanding, particularly in vision-based autonomous driving contexts. HRRP-T addresses the need for models that can reason about the logical and geometric structure of road environments beyond basic perception tasks, applying multi-level reward propagation and enforcing temporal consistency over short video clips. The method is described in the context of the RoadSceneBench dataset (Liu et al., 27 Nov 2025).
1. Graph-Based Hierarchical Relational Modeling
HRRP-T represents each image frame in a video clip as a scene graph comprising six domain-relevant mid-level attributes—lane count, ego-lane index, junction/entrance/exit recognition, lane-change feasibility, traffic condition, and scene type (urban/suburb/highway). These nodes are organized into three distinct hierarchy levels:
- Scene-Level: Nodes correspond to “lane count” and “ego-lane index,” with edges encoding the constraint that the number of lanes bounds the legal index of the ego vehicle. A level-specific affinity matrix models these logical dependencies, e.g., propagating the penalty for a lane-count mis-estimation to the ego-lane-index prediction.
- Relational-Level: Nodes capture “junction/entrance/exit” and “lane-change feasibility.” Edges reflect constraints such as lane-change legality depending on detected junctions or lane count; a corresponding affinity matrix encodes these relational priors.
- Semantic-Level: Nodes represent “traffic condition” and “scene type.” Edges connect the global scene type to expected congestion patterns (e.g., highways vs. urban streets), assembled in this level's affinity matrix.
Inter-level edges link nodes across levels, e.g., from scene-level to relational-level nodes. The full reward propagation graph can be assembled as a block-sparse affinity matrix, but HRRP-T maintains explicit separation for interpretability.
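As a concrete illustration, the per-level structure can be encoded as small affinity matrices and assembled into the block-sparse graph mentioned above. The following Python sketch is illustrative only: the node ordering, edge weights, and the single inter-level coupling shown are assumptions, not the paper's exact specification.

```python
import numpy as np

# Node ordering (illustrative assumption, two nodes per hierarchy level).
NODES = [
    "lane_count", "ego_lane_index",                        # scene level
    "junction_entrance_exit", "lane_change_feasibility",   # relational level
    "traffic_condition", "scene_type",                     # semantic level
]

# Intra-level affinity matrices; edge weights here are placeholders.
A_scene = np.array([[0.0, 1.0],
                    [1.0, 0.0]])   # lane count bounds the legal ego-lane index
A_rel   = np.array([[0.0, 1.0],
                    [1.0, 0.0]])   # junction presence constrains lane-change legality
A_sem   = np.array([[0.0, 1.0],
                    [1.0, 0.0]])   # scene type implies expected congestion patterns

# One example inter-level coupling: scene-level nodes conditioning relational nodes.
cross_scene_rel = np.array([[0.0, 0.5],   # lane_count -> lane_change_feasibility
                            [0.0, 0.0]])

def assemble_block_sparse(a_scene, a_rel, a_sem, cross_sr):
    """Assemble the full 6x6 reward-propagation affinity matrix from per-level blocks."""
    A = np.zeros((6, 6))
    A[0:2, 0:2] = a_scene
    A[2:4, 2:4] = a_rel
    A[4:6, 4:6] = a_sem
    A[0:2, 2:4] = cross_sr        # scene -> relational coupling
    A[2:4, 0:2] = cross_sr.T      # keep the assembled graph symmetric
    return A

A_full = assemble_block_sparse(A_scene, A_rel, A_sem, cross_scene_rel)
print(NODES)
print(A_full)
```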
2. Hierarchical Reward Propagation Formalism
Each frame yields predictions for the six node attributes. Frame-level correctness is assessed via three scalar rewards:
- $R_{\text{scene}}$: Scene-level correctness, averaged over the two scene-level nodes.
- $R_{\text{rel}}$: Relational-level correctness, averaged over the two relational-level nodes.
- $R_{\text{sem}}$: Semantic-level correctness, averaged over the two semantic-level nodes.
Rewards are aggregated as a weighted sum,
$$R_{\text{frame}} = w_{\text{scene}}\,R_{\text{scene}} + w_{\text{rel}}\,R_{\text{rel}} + w_{\text{sem}}\,R_{\text{sem}},$$
where the level weights are fixed hyper-parameters. Each level reward is computed as a mean of node-wise indicator rewards, with optional adjacency-weighted aggregation (using the level affinity matrices to propagate errors along the graph structure). In HRRP-T, the plain weighted sum was favored for simplicity and effectiveness.
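A minimal sketch of this frame-level reward aggregation, assuming binary (0/1) node-wise correctness; the level weights and the adjacency-propagation coefficient below are placeholder values, not the paper's reported hyper-parameters.

```python
import numpy as np

def level_reward(correct_flags):
    """Mean of node-wise indicator rewards (1 = correct, 0 = wrong) for one level."""
    return float(np.mean(correct_flags))

def frame_reward(scene_flags, rel_flags, sem_flags,
                 w_scene=0.4, w_rel=0.3, w_sem=0.3):
    """Weighted sum of the three level rewards; the weights are placeholders."""
    return (w_scene * level_reward(scene_flags)
            + w_rel * level_reward(rel_flags)
            + w_sem * level_reward(sem_flags))

def propagated_errors(errors, affinity, rho=0.5):
    """Optional adjacency-weighted aggregation: an error on one node also
    discounts its graph neighbours (rho and the affinity matrix are assumed)."""
    errors = np.asarray(errors, dtype=float)   # 1 = wrong, 0 = correct
    return errors + rho * affinity @ errors

# Example frame: ego-lane index wrong, everything else correct.
print(frame_reward([1, 0], [1, 1], [1, 1]))   # 0.4*0.5 + 0.3*1.0 + 0.3*1.0 = 0.8
```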
3. Temporal Consistency Mechanism
HRRP-T explicitly enhances temporal reliability by computing additional reward terms over the frames of each video clip:
- Smoothness Reward ($R_{\text{smooth}}$): Penalizes abrupt changes in ordinal labels (lane count, ego-lane index) between consecutive frames.
- Plausibility Reward ($R_{\text{plaus}}$): An indicator-style term that confirms frame-to-frame transitions conform to physical or legal domain rules.
The combined temporal reward is a weighted sum of the smoothness and plausibility terms, and the final clip-level reward aggregates the per-frame hierarchical rewards over the clip together with this temporal reward; the corresponding weights are fixed empirically.
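The clip-level terms can be instantiated along the following lines; the exact penalty form, the transition-rule checker, and the weights below are assumptions made for illustration.

```python
def smoothness_reward(ordinal_seq):
    """Penalize abrupt changes of an ordinal label (e.g., lane count) across
    consecutive frames; returns 1.0 for a perfectly smooth sequence."""
    jumps = [abs(b - a) for a, b in zip(ordinal_seq, ordinal_seq[1:])]
    return 1.0 - min(1.0, sum(jumps) / max(1, len(jumps)))

def plausibility_reward(seq, is_legal):
    """Fraction of frame-to-frame transitions that satisfy a physical/legal
    domain rule; `is_legal` is a user-supplied checker (assumed here)."""
    checks = [1.0 if is_legal(prev, curr) else 0.0 for prev, curr in zip(seq, seq[1:])]
    return sum(checks) / max(1, len(checks))

def clip_reward(frame_rewards, r_smooth, r_plaus,
                w_frame=0.7, w_smooth=0.15, w_plaus=0.15):
    """Final clip-level reward: averaged frame-level hierarchical rewards plus
    the combined temporal reward (all weights are placeholders)."""
    r_temporal = w_smooth * r_smooth + w_plaus * r_plaus
    return w_frame * sum(frame_rewards) / len(frame_rewards) + r_temporal

# Example 5-frame clip: lane count drops abruptly from 3 to 1 in the last frame.
lane_counts = [3, 3, 3, 3, 1]
legal = lambda prev, curr: abs(curr - prev) <= 1   # assumed rule: at most one lane change
r_s = smoothness_reward(lane_counts)
r_p = plausibility_reward(lane_counts, legal)
print(clip_reward([0.8, 0.9, 0.85, 0.8, 0.6], r_s, r_p))
```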
4. Combined Training Objective and Optimization
HRRP-T integrates standard supervised fine-tuning (SFT) with policy-gradient reinforcement learning (GRPO/PPO) to maximize the hierarchical and temporal rewards. The total training loss takes the form
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SFT}} + \lambda_{1}\,\mathcal{L}_{\text{reward}} + \lambda_{2}\,\mathcal{L}_{\text{reg}},$$
where:
- $\mathcal{L}_{\text{SFT}}$ is the cross-entropy over the six attribute questions per frame,
- $\mathcal{L}_{\text{reward}}$ is the negative expected hierarchical-temporal reward,
- $\mathcal{L}_{\text{reg}}$ is the PPO KL or ratio-clipping penalty.
The coefficients $\lambda_{1}$ and $\lambda_{2}$ are fixed hyper-parameters. The trainable parameters correspond to LoRA adapters on the Qwen2.5-VL-7B backbone.
The RL stage involves rolling out batches of video clips, producing policy predictions for all nodes at all frames, computing all hierarchical and temporal rewards, and propagating gradients into adapter parameters.
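A compact sketch of how the three loss terms and the clip-level rewards could be combined in a GRPO/PPO-style update; the surrogate form, the group-mean baseline, the approximate KL estimator, and the coefficients are illustrative assumptions rather than the paper's exact objective.

```python
import torch

def hrrp_t_total_loss(ce_loss, clip_rewards, logprobs, old_logprobs,
                      lam_reward=1.0, lam_reg=0.1, clip_eps=0.2):
    """Total loss = SFT cross-entropy + reward term + regularization term.
    The reward term is a clipped policy-gradient surrogate driven by the
    hierarchical-temporal clip rewards; the regularizer is an approximate
    KL penalty. All coefficients here are placeholders."""
    # Group-relative advantage: each clip's reward minus the batch mean.
    advantages = clip_rewards - clip_rewards.mean()
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # Non-negative approximate KL penalty between old and updated policies.
    kl = (ratio - torch.log(ratio) - 1.0).mean()
    return ce_loss + lam_reward * policy_loss + lam_reg * kl

# Toy usage with dummy values for a rollout batch of 4 clips.
ce = torch.tensor(1.2)
rewards = torch.tensor([0.74, 0.81, 0.65, 0.90])
logp = torch.randn(4) * 0.1
logp_old = logp + 0.01 * torch.randn(4)
print(hrrp_t_total_loss(ce, rewards, logp, logp_old))
```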
5. Implementation Specifications
HRRP-T is implemented on the Qwen2.5-VL-7B architecture with frozen backbone weights and LoRA adapters installed on the cross-attention layers. Training proceeds in two stages:
- Stage 1: Supervised Fine-Tuning (5 epochs, batch size 64, with fixed learning rate and weight decay). Each sample is one image with six templated attribute questions.
- Stage 2: HRRP-T RL Optimization (GRPO/PPO with ratio clipping and a separate value-function learning rate, around 100K steps on 4 NVIDIA A800 GPUs). Each step processes full 5-frame clips and computes all hierarchical and temporal reward terms.
The reward-weighting hyper-parameters (level weights, temporal weights, and loss coefficients) are held fixed throughout training.
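For reference, a LoRA adapter of the kind described above could be configured with the `peft` library roughly as follows; the rank, scaling, dropout, and target-module names are assumptions and do not reproduce the paper's exact adapter settings.

```python
from peft import LoraConfig, get_peft_model  # get_peft_model would wrap the loaded backbone

# Illustrative adapter configuration; all values below are assumptions.
lora_cfg = LoraConfig(
    r=16,                      # adapter rank (placeholder)
    lora_alpha=32,             # scaling factor (placeholder)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

# model = get_peft_model(backbone, lora_cfg)  # backbone: a loaded Qwen2.5-VL-7B model
print(lora_cfg)
```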
6. Experimental Outcomes and Comparative Analysis
Experiments were conducted on the RoadSceneBench dataset, which comprises video clips of 5 frames each with per-frame attribute labels. Tasks spanned lane count, ego-lane index, junction/entrance/exit, lane-change feasibility, traffic condition, and scene type.
The evaluation used per-task precision (P) and recall (R), reporting "overall P/R." Baselines included closed-source systems (Gemini-2.5-Pro, GPT-4o, Claude-3.7) and open-source VLMs.
| Method | Overall Precision (%) | Overall Recall (%) |
|---|---|---|
| Gemini-2.5-Pro | 60.61 | 52.70 |
| MapVLM (SFT only) | 72.14 | 67.25 |
| MapVLM (SFT + HRRP-T) | 75.78 | 72.17 |
Notable taskwise improvements:
- Ego-lane Index Recall improved from 50.37% to 84.67%.
- Lane-Change Feasibility maintained P/R above 83%.
Ablation studies showed gains of roughly 3.6 percentage points in overall precision and 4.9 points in recall from SFT to SFT+HRRP-T, with the greatest gains in temporally sensitive attributes. Qualitative analyses indicated enhanced logical and topological consistency under occlusions versus standard SFT-only training, with HRRP-T maintaining stable lane topology and ego-lane-index assignments.
This suggests that hierarchical frame-level rewards enforce intra-frame logical coherence, temporal rewards address frame-to-frame flicker, and their integration via GRPO/PPO significantly elevates model performance on structure-aware road scene understanding benchmarks (Liu et al., 27 Nov 2025).