Vision-Language Model Feedback
- VLM Feedback is a system that uses model-generated absolute ratings to supervise reinforcement learning, reducing dependence on human feedback.
- The approach employs stratified sampling, inverse frequency weighting, and mean absolute error loss to enhance reward model robustness and sample efficiency.
- Empirical results, exemplified by the ERL-VLM framework, show faster convergence and higher success rates compared to pairwise methods and CLIP-based baselines.
A Vision-Language Model (VLM) Feedback system utilizes model-generated or externally-supplied evaluative signals to drive learning, align behavior, or assess outputs in models that ingest both visual and textual inputs. Recent advances have explored leveraging large VLMs as scalable sources of feedback, particularly in reinforcement learning (RL), reward modeling, and autonomous control, where alleviating the human-in-the-loop and data-annotation bottleneck is critical. The following sections synthesize the principles, methodologies, and empirical findings underpinning state-of-the-art VLM feedback approaches, as illustrated by the ERL-VLM framework and closely related work.
1. Foundations and Motivation for VLM Feedback
The development of effective reward functions for RL remains a central challenge, traditionally relying on labor-intensive human engineering or costly RLHF (reinforcement learning from human feedback). Large VLMs—pretrained on web-scale multimodal data and exhibiting substantial generalization—now present an alternative: using AI-generated feedback to supervise reward learning and thereby scale RL with minimal human intervention.
Prior efforts with VLM-based feedback typically utilized pairwise preference labeling: given two trajectory snippets or images, a VLM is queried to indicate which more closely matches a provided task description. This pairwise approach, as exemplified by RL-VLM-F, stabilizes learning relative to direct reward score prompting but is limited by sample inefficiency, expressiveness bottlenecks, and computational cost (Wang et al., 6 Feb 2024). The ERL-VLM framework introduced absolute trajectory ratings as a feedback modality, exploiting the expressive capacity of VLMs for more efficient and robust reward learning (Luu et al., 15 Jun 2025).
VLM feedback is also being explored as a general scalable annotation source for aligning generative models, including in dialogue, safety, and model evaluation (Li et al., 12 Oct 2024). The paradigm shift away from human feedback toward VLM-as-feedback-mediator promises a path toward autonomous agent alignment and reward specification across domains.
2. Core Algorithmic Workflow: Rating-Based RL with VLMs
The ERL-VLM algorithm is organized around an iterative interplay between online trajectory collection, VLM-mediated rating queries, supervised reward-model update, and RL policy improvement, as follows:
- Initialization: Prepare RL policy parameters $\phi$ (e.g., for SAC or IQL), reward model parameters $\psi$, an empty replay buffer $\mathcal{B}$, an empty rating dataset $\mathcal{D}$, and select a large VLM (Gemini 1.5 Pro) as the "teacher" $\mathcal{T}$. The task description $l$, feedback query frequency $K$, batched query count $M$, and trajectory segment length $H$ are set.
- Trajectory Collection: The agent, using policy $\pi_\phi$, samples rollouts of length $H$, recording states, images, actions, and provisional reward estimates into $\mathcal{B}$.
- Periodic VLM Feedback Gathering: Every $K$ iterations:
- Sample $M$ trajectory segments from $\mathcal{B}$.
- Query the VLM with tailored prompts containing each segment and the task description, receiving an absolute rating $y$ on a Likert scale (e.g., "Bad"/"Average"/"Good"/"Very Good").
- Append each pair $(\sigma, y)$ to the rating dataset $\mathcal{D}$.
- Reward Model Training:
- For multiple epochs, stratified-sample minibatches from $\mathcal{D}$ to guarantee balanced class representation.
- Update $\psi$ to minimize a robust rating loss $\mathcal{L}(\psi)$, based on the mean absolute error between predicted rating distributions and VLM labels.
- Optionally, apply inverse frequency weighting per class.
- Relabel all transitions in $\mathcal{B}$ using the updated reward model.
- Policy Learning: Interleave standard RL algorithm updates (on SAC or IQL, etc.) using relabeled rewards.
This feedback-driven RL loop iteratively bootstraps both reward-function refinement and policy improvement, with VLM ratings acting as a surrogate for expensive human feedback.
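A minimal Python sketch of this loop is shown below. The callables `collect_segment`, `query_vlm_rating`, `train_reward_model`, `relabel_buffer`, and `rl_update` are hypothetical stand-ins for the environment rollout, the VLM API, the supervised reward-model fit, buffer relabeling, and the RL update (SAC/IQL); neither these names nor the default hyperparameters come from the ERL-VLM codebase.

```python
# Illustrative sketch of the rating-based feedback loop, not the official ERL-VLM code.
# The callables passed in stand for the environment rollout, the VLM rating query,
# the supervised reward-model fit, buffer relabeling, and the RL update (SAC/IQL).
import random


def erl_vlm_loop(collect_segment, query_vlm_rating, train_reward_model,
                 relabel_buffer, rl_update,
                 num_iterations=10_000, K=500, M=64):
    replay_buffer = []   # B: trajectory segments (states, images, actions)
    rating_data = []     # D: (segment, VLM rating) pairs

    for it in range(num_iterations):
        # Trajectory collection with the current policy.
        replay_buffer.append(collect_segment())

        # Every K iterations, gather M absolute ratings from the VLM teacher.
        if it % K == 0:
            batch = random.sample(replay_buffer, min(M, len(replay_buffer)))
            for segment in batch:
                rating = query_vlm_rating(segment)   # e.g., 0 / 1 / 2 for Bad / Average / Good
                rating_data.append((segment, rating))

            # Supervised reward-model update on stratified, class-balanced minibatches,
            # then relabel every stored transition with the refreshed reward model.
            train_reward_model(rating_data)
            relabel_buffer(replay_buffer)

        # Standard off-policy RL update on the relabeled rewards.
        rl_update(replay_buffer)
```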
3. Mathematical Formulation of Rating-Based Reward Learning
Let a trajectory segment $\sigma = \{(s_t, a_t)\}_{t=1}^{H}$ be scored by a parametric reward model $\hat{r}_\psi$ producing instantaneous rewards $\hat{r}_\psi(s_t, a_t)$. The (normalized) cumulative return is

$$\tilde{R}_\psi(\sigma) = \frac{\sum_{t=1}^{H} \hat{r}_\psi(s_t, a_t) - R_{\min}}{R_{\max} - R_{\min}},$$

where $R_{\min}$ and $R_{\max}$ normalize segment returns to $[0, 1]$. This scalar is discretized into $n$ ordinal rating classes $\{0, 1, \dots, n-1\}$, with class boundaries $0 = b_0 < b_1 < \cdots < b_n = 1$.
Predicted softmax probabilities for each class are given by

$$P_\psi(y = i \mid \sigma) = \frac{\exp\big(k\,(b_{i+1} - \tilde{R}_\psi(\sigma))(\tilde{R}_\psi(\sigma) - b_i)\big)}{\sum_{j=0}^{n-1} \exp\big(k\,(b_{j+1} - \tilde{R}_\psi(\sigma))(\tilde{R}_\psi(\sigma) - b_j)\big)},$$

with sharpness constant $k > 0$, as in the soft binning approach of [White et al., AAAI 2024].
The robust loss for reward learning is the stratified, optionally weighted, mean absolute error

$$\mathcal{L}(\psi) = \mathbb{E}_{(\sigma, y) \sim \mathcal{S}(\mathcal{D})}\Big[\, w_y \sum_{i=0}^{n-1} \big|\, \mathbf{1}[y = i] - P_\psi(y = i \mid \sigma) \,\big|\, \Big],$$

where $\mathcal{S}(\mathcal{D})$ is the balanced class sampler, $\mathbf{1}[y = i]$ is the one-hot label, and $w_y$ is an optional inverse class-frequency weight.
This choice of MAE (over cross-entropy) yields greater robustness to label noise [Ghosh et al., AAAI 2017]. The reward model thus learns to smoothly interpolate expert-vetted returns between class boundaries, benefiting from the full spectrum of VLM-generated ratings.
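The following NumPy sketch illustrates the soft-binning probabilities and the weighted MAE objective described above. The sharpness constant `k`, the evenly spaced bin boundaries, and the exact weighting scheme are illustrative assumptions, not the paper's hyperparameters.

```python
import numpy as np


def soft_bin_probs(returns, boundaries, k=10.0):
    """Soft binning: P(class i) peaks when the normalized return falls inside [b_i, b_{i+1}].

    returns:    (N,) normalized segment returns in [0, 1]
    boundaries: (n+1,) increasing bin edges, e.g. [0.0, 1/3, 2/3, 1.0]
    """
    R = returns[:, None]                                          # (N, 1)
    lo, hi = boundaries[:-1][None, :], boundaries[1:][None, :]    # (1, n)
    logits = k * (hi - R) * (R - lo)      # positive inside a bin, negative outside
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def weighted_mae_loss(probs, labels, class_weights=None):
    """Mean absolute error between predicted rating distributions and one-hot VLM labels."""
    n_classes = probs.shape[1]
    one_hot = np.eye(n_classes)[labels]                           # (N, n)
    per_sample = np.abs(one_hot - probs).sum(axis=1)
    if class_weights is not None:         # optional inverse-frequency weighting
        per_sample = per_sample * class_weights[labels]
    return per_sample.mean()


# Example: three rating classes, inverse-frequency weights from observed label counts.
returns = np.array([0.05, 0.40, 0.55, 0.92])
labels = np.array([0, 1, 1, 2])
boundaries = np.array([0.0, 1 / 3, 2 / 3, 1.0])
probs = soft_bin_probs(returns, boundaries)
counts = np.bincount(labels, minlength=3).astype(float)
weights = counts.sum() / np.maximum(counts, 1.0)
print(weighted_mae_loss(probs, labels, class_weights=weights))
```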
4. Key Enhancements for Robustness: Stratification, Weighting, and MAE Loss
Experimental analysis revealed systematic issues in naive rating-based RL: class imbalance and label noise induce reward collapse and unstable training. ERL-VLM addresses these failure modes through three critical modifications:
- Stratified Minibatching: Ensures even representation of each rating class within every batch, mitigating mode collapse toward the majority class.
- Inverse Class-Frequency Weighting: Amplifies the loss for underrepresented ratings, facilitating balanced learning in the reward model.
- Mean-Absolute-Error Loss: MAE is substantially more robust to random label corruption and more stable under noisy feedback than categorical cross-entropy.
Empirical validations showed that this combination preserves expressiveness in the learned reward, sharpens alignment to trajectory quality, and realizes substantial policy gains beyond previous VLM-feedback or CLIP-score reward approaches.
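One way the stratified sampler and inverse-frequency weights could be implemented is sketched below. The equal-per-class batch layout, sampling with replacement for sparse classes, and the weight normalization (weight 1 for a perfectly balanced class) are implementation assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict


def stratified_minibatch(rating_data, batch_size):
    """Draw a minibatch with (near-)equal representation of every rating class.

    rating_data: list of (segment, rating) pairs collected from the VLM.
    """
    by_class = defaultdict(list)
    for segment, rating in rating_data:
        by_class[rating].append((segment, rating))

    classes = sorted(by_class)
    per_class = max(1, batch_size // len(classes))
    batch = []
    for c in classes:
        # Sample with replacement so sparse classes can still fill their share.
        batch.extend(random.choices(by_class[c], k=per_class))
    random.shuffle(batch)
    return batch


def inverse_frequency_weights(rating_data, num_classes):
    """Per-class weights proportional to 1 / class frequency (weight 1 for a balanced class)."""
    counts = [0] * num_classes
    for _, rating in rating_data:
        counts[rating] += 1
    total = sum(counts)
    return [total / (num_classes * c) if c > 0 else 0.0 for c in counts]
```

Inside the reward-model update, `inverse_frequency_weights(rating_data, n)[rating]` would scale each example's MAE term, matching the $w_y$ factor in the loss above.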
5. VLM Rating Query Design and Implementation
ERL-VLM employs Gemini 1.5 Pro as the VLM teacher. Queries are constructed as concise task- and segment-specific prompts:
- MetaWorld prompt: "Analyzing: <Image> You are shown an image of a robot performing [Task Description]. Focus on the target object and decide quality. ———— Rating (choose from {Bad, Average, Good}):"
- ALFRED prompt: "Analyzing: <N consecutive images> with listed actions. Describe change per step. ———— Rating (choose from {Bad, Average, Good}): based on your analysis."
- Real Robot prompt: Structured analogously to MetaWorld.
The rating scale is a small discrete Likert set (e.g., {Bad, Average, Good}), which suffices to provide a robust ordinal signal. Unlike pairwise-preference-based methods, no data is discarded as ambiguous: every segment receives a rating, boosting sample efficiency.
Prompting single segments (as opposed to pairs) substantially reduces VLM compute costs: relative to a pairwise query, each single-segment query requires roughly half the visual context and token budget, while the absolute rating yields a more expressive and less ambiguous feedback signal.
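A sketch of how a single-segment rating query might be assembled and parsed is given below. `call_vlm(prompt, images)` is a generic stand-in for whichever VLM API is used (e.g., a Gemini 1.5 Pro client); the prompt wording paraphrases the templates above, and the reply-parsing heuristic is an assumption rather than the paper's implementation.

```python
RATING_SCALE = ["Bad", "Average", "Good"]


def build_rating_prompt(task_description, num_images=1):
    """Single-segment rating prompt in the spirit of the templates above."""
    return (
        f"You are shown {num_images} image(s) of a robot performing the task: "
        f"{task_description}. Focus on the target object and assess how well "
        f"the segment accomplishes the task.\n"
        f"Rating (choose one of {{{', '.join(RATING_SCALE)}}}):"
    )


def parse_rating(reply, default=0):
    """Map the VLM's free-form reply to an ordinal label (0 = Bad, ..., 2 = Good)."""
    reply_lower = reply.strip().lower()
    for idx, name in enumerate(RATING_SCALE):
        if reply_lower.startswith(name.lower()):
            return idx
    for idx, name in enumerate(RATING_SCALE):
        if name.lower() in reply_lower:
            return idx
    return default  # fall back to the lowest rating if the reply is unparseable


def rate_segment(call_vlm, images, task_description):
    """Query the VLM once per segment; call_vlm(prompt, images) wraps the real API."""
    prompt = build_rating_prompt(task_description, num_images=len(images))
    return parse_rating(call_vlm(prompt, images))
```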
6. Empirical Results and Sample Efficiency
Benchmarks
- MetaWorld (low-level control, SAC): Tasks—Sweep Into, Drawer Open, Soccer
- ALFRED (high-level vision-language, IQL): PickupObject, PutObject, CoolObject, CleanObject (20 total tasks)
- Real Sawyer robot: Sweep Bowl, Drawer Open, Pickup Banana
Comparative Results (approximate task success rates)
| Environment | ERL-VLM | CLIP | RL-VLM-F | BC | Env Reward |
|---|---|---|---|---|---|
| MetaWorld | 85-90% | 50-60% | 55-65% | - | - |
| ALFRED | 70-75% | 30-45% | 10-15% | - | - |
| Real robot | 0.60 | - | - | 0.23 | 0.37 (sparse) |
- Under a fixed VLM-query budget (e.g., 10,000 single-segment queries for MetaWorld), ERL-VLM converges twice as rapidly and achieves +20–30 percentage points higher success rate than VLM-pairwise or CLIP-score baselines.
- On high-level, language-conditioned tasks (ALFRED), absolute ratings dramatically outperform pairwise approaches, which fail to surmount early plateaus.
- Expert trajectory analysis shows the learned reward aligns tightly with task progress along expert demonstrations.
This demonstrates that VLM-generated feedback, when appropriately harvested and stabilized, can replace or surpass human supervision on a range of embodied RL benchmarks.
7. Expressiveness and Advantages of Absolute VLM Ratings
Empirical and ablation studies established several advantages for absolute rating feedback:
- Expressiveness: Absolute scores encode a global assessment for an entire segment, providing denser supervision than a local pairwise label.
- Sample Efficiency: Each rating directly induces an n-class learning target, versus requiring pairwise queries to disambiguate ordering.
- Data Retention: No "ambiguous" pairs are rejected—every rated segment guides reward learning.
- Computational Savings: Token-level prompt and context size is reduced, enhancing throughput and lowering annotation cost.
- Reward Alignment: The learned reward curves increase smoothly along expert demonstrations, mirroring ground-truth reward progress.
Theoretical and practical analyses thus strongly favor absolute trajectory ratings—coupled with stratification and robust regression—in scalable VLM-driven reward learning.
8. Limitations and Prospective Directions
ERL-VLM’s remaining challenges include: dependence on VLM generalization to visually diverse tasks, potential brittleness under highly imbalanced or adversarial data, and the practicalities of RL deployment at scale. The VLM teacher must be robust to both language and visual ambiguity, and the reward model’s generalization hinges on the breadth and quality of sampled segments.
Future work may extend this framework via hierarchical rating schemas, automated detection of rare classes, and active query selection for adaptive feedback allocation. The paradigm of AI-generated multi-task feedback extends beyond RL to model evaluation, alignment, and continuous autonomous learning.
By systematically integrating VLM-generated absolute ratings—reinforced by stratified sampling, per-class weighting, and robust regression losses—ERL-VLM establishes a practical and theoretically grounded approach to feedback-driven reinforcement learning, achieving robust, sample-efficient, and highly expressive reward modeling across a spectrum of robotic and vision-language control tasks (Luu et al., 15 Jun 2025).