Vision-Language Model Feedback
- VLM Feedback is a system that uses model-generated absolute ratings to supervise reinforcement learning, reducing dependence on human feedback.
- The approach employs stratified sampling, inverse frequency weighting, and mean absolute error loss to enhance reward model robustness and sample efficiency.
- Empirical results, exemplified by the ERL-VLM framework, show faster convergence and higher success rates compared to pairwise methods and CLIP-based baselines.
A Vision-Language Model (VLM) Feedback system utilizes model-generated or externally-supplied evaluative signals to drive learning, align behavior, or assess outputs in models that ingest both visual and textual inputs. Recent advances have explored leveraging large VLMs as scalable sources of feedback, particularly in reinforcement learning (RL), reward modeling, and autonomous control, where alleviating the human-in-the-loop and data-annotation bottleneck is critical. The following sections synthesize the principles, methodologies, and empirical findings underpinning state-of-the-art VLM feedback approaches, as illustrated by the ERL-VLM framework and closely related work.
1. Foundations and Motivation for VLM Feedback
The development of effective reward functions for RL remains a central challenge, traditionally relying on labor-intensive human engineering or costly RLHF (reinforcement learning from human feedback). Large VLMs—pretrained on web-scale multimodal data and exhibiting substantial generalization—now present an alternative: using AI-generated feedback to supervise reward learning and thereby scale RL with minimal human intervention.
Prior efforts with VLM-based feedback typically utilized pairwise preference labeling: given two trajectory snippets or images, a VLM is queried to indicate which more closely matches a provided task description. This pairwise approach, as exemplified by RL-VLM-F, stabilizes learning relative to direct reward score prompting but is limited by sample inefficiency, expressiveness bottlenecks, and computational cost (Wang et al., 6 Feb 2024). The ERL-VLM framework introduced absolute trajectory ratings as a feedback modality, exploiting the expressive capacity of VLMs for more efficient and robust reward learning (Luu et al., 15 Jun 2025).
VLM feedback is also being explored as a general scalable annotation source for aligning generative models, including in dialogue, safety, and model evaluation (Li et al., 12 Oct 2024). The paradigm shift away from human feedback toward VLM-as-feedback-mediator promises a path toward autonomous agent alignment and reward specification across domains.
2. Core Algorithmic Workflow: Rating-Based RL with VLMs
The ERL-VLM algorithm is organized around an iterative interplay between online trajectory collection, VLM-mediated rating queries, supervised reward-model update, and RL policy improvement, as follows:
- Initialization: Prepare RL policy parameters $\phi$ (e.g., for SAC or IQL), reward model parameters $\psi$, an empty replay buffer $\mathcal{B}$, an empty rating dataset $\mathcal{D}$, and select a large VLM (Gemini 1.5 Pro) as the "teacher" $\mathcal{T}$. The task description $l$, feedback query frequency $K$, batched query count $M$, and trajectory segment length $H$ are set.
- Trajectory Collection: The agent, using policy $\pi_\phi$, samples rollouts of length $H$, recording states, images, actions, and provisional reward estimates into $\mathcal{B}$.
- Periodic VLM Feedback Gathering: Every $K$ iterations:
- Sample $M$ trajectory segments from $\mathcal{B}$.
- Query the VLM with tailored prompts containing each segment and the task description, receiving an absolute rating $y$ on a Likert scale (e.g., "Bad"/"Average"/"Good"/"Very Good").
- Append each pair $(\sigma, y)$ to the rating dataset $\mathcal{D}$.
- Reward Model Training:
- For multiple epochs, stratified-sample minibatches from $\mathcal{D}$ to guarantee balanced class representation.
- Update $\psi$ to minimize a robust rating loss $\mathcal{L}(\psi)$, based on the mean absolute error between predicted rating distributions and VLM labels.
- Optionally, apply inverse frequency weighting per class.
- Relabel all transitions in $\mathcal{B}$ using the updated reward model.
- Policy Learning: Interleave standard RL algorithm updates (on SAC or IQL, etc.) using relabeled rewards.
This feedback-driven RL loop iteratively bootstraps both reward-function refinement and policy improvement, with VLM ratings acting as a surrogate for expensive human feedback.
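A minimal Python sketch of this loop is shown below. The callables `collect_segment`, `query_vlm_rating`, `train_reward_model`, `relabel_buffer`, and `rl_update` are hypothetical stand-ins for the environment rollout, the VLM API, the supervised reward-model fit, buffer relabeling, and the RL update (SAC/IQL); neither these names nor the default hyperparameters come from the ERL-VLM codebase.

```python
# Illustrative sketch of the rating-based feedback loop, not the official ERL-VLM code.
# The callables passed in stand for the environment rollout, the VLM rating query,
# the supervised reward-model fit, buffer relabeling, and the RL update (SAC/IQL).
import random


def erl_vlm_loop(collect_segment, query_vlm_rating, train_reward_model,
                 relabel_buffer, rl_update,
                 num_iterations=10_000, K=500, M=64):
    replay_buffer = []   # B: trajectory segments (states, images, actions)
    rating_data = []     # D: (segment, VLM rating) pairs

    for it in range(num_iterations):
        # Trajectory collection with the current policy.
        replay_buffer.append(collect_segment())

        # Every K iterations, gather M absolute ratings from the VLM teacher.
        if it % K == 0:
            batch = random.sample(replay_buffer, min(M, len(replay_buffer)))
            for segment in batch:
                rating = query_vlm_rating(segment)   # e.g., 0 / 1 / 2 for Bad / Average / Good
                rating_data.append((segment, rating))

            # Supervised reward-model update on stratified, class-balanced minibatches,
            # then relabel every stored transition with the refreshed reward model.
            train_reward_model(rating_data)
            relabel_buffer(replay_buffer)

        # Standard off-policy RL update on the relabeled rewards.
        rl_update(replay_buffer)
```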
3. Mathematical Formulation of Rating-Based Reward Learning
Let a trajectory segment $\sigma = \{(s_t, a_t)\}_{t=1}^{H}$ be scored by a parametric reward model $\hat{r}_\psi$ producing instantaneous rewards $\hat{r}_\psi(s_t, a_t)$. The (normalized) cumulative return is

$$\tilde{R}_\psi(\sigma) = \frac{\sum_{t=1}^{H} \hat{r}_\psi(s_t, a_t) - R_{\min}}{R_{\max} - R_{\min}},$$

where $R_{\min}$ and $R_{\max}$ normalize segment returns to $[0, 1]$. This scalar is discretized into $n$ ordinal rating classes $\{0, 1, \dots, n-1\}$, with class boundaries $0 = b_0 < b_1 < \cdots < b_n = 1$.
Predicted softmax probabilities for each class are given by

$$P_\psi(y = i \mid \sigma) = \frac{\exp\big(k\,(b_{i+1} - \tilde{R}_\psi(\sigma))(\tilde{R}_\psi(\sigma) - b_i)\big)}{\sum_{j=0}^{n-1} \exp\big(k\,(b_{j+1} - \tilde{R}_\psi(\sigma))(\tilde{R}_\psi(\sigma) - b_j)\big)},$$

with sharpness constant $k > 0$, as in the soft binning approach of [White et al., AAAI 2024].
The robust loss for reward learning is the stratified, optionally weighted, mean absolute error

$$\mathcal{L}(\psi) = \mathbb{E}_{(\sigma, y) \sim \mathcal{S}(\mathcal{D})}\Big[\, w_y \sum_{i=0}^{n-1} \big|\, \mathbf{1}[y = i] - P_\psi(y = i \mid \sigma) \,\big|\, \Big],$$

where $\mathcal{S}(\mathcal{D})$ is the balanced class sampler, $\mathbf{1}[y = i]$ is the one-hot label, and $w_y$ is an optional inverse class-frequency weight.
This choice of MAE (over cross-entropy) yields greater robustness to label noise [Ghosh et al., AAAI 2017]. The reward model thus learns to smoothly interpolate expert-vetted returns between class boundaries, benefiting from the full spectrum of VLM-generated ratings.
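The following NumPy sketch illustrates the soft-binning probabilities and the weighted MAE objective described above. The sharpness constant `k`, the evenly spaced bin boundaries, and the exact weighting scheme are illustrative assumptions, not the paper's hyperparameters.

```python
import numpy as np


def soft_bin_probs(returns, boundaries, k=10.0):
    """Soft binning: P(class i) peaks when the normalized return falls inside [b_i, b_{i+1}].

    returns:    (N,) normalized segment returns in [0, 1]
    boundaries: (n+1,) increasing bin edges, e.g. [0.0, 1/3, 2/3, 1.0]
    """
    R = returns[:, None]                                          # (N, 1)
    lo, hi = boundaries[:-1][None, :], boundaries[1:][None, :]    # (1, n)
    logits = k * (hi - R) * (R - lo)      # positive inside a bin, negative outside
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def weighted_mae_loss(probs, labels, class_weights=None):
    """Mean absolute error between predicted rating distributions and one-hot VLM labels."""
    n_classes = probs.shape[1]
    one_hot = np.eye(n_classes)[labels]                           # (N, n)
    per_sample = np.abs(one_hot - probs).sum(axis=1)
    if class_weights is not None:         # optional inverse-frequency weighting
        per_sample = per_sample * class_weights[labels]
    return per_sample.mean()


# Example: three rating classes, inverse-frequency weights from observed label counts.
returns = np.array([0.05, 0.40, 0.55, 0.92])
labels = np.array([0, 1, 1, 2])
boundaries = np.array([0.0, 1 / 3, 2 / 3, 1.0])
probs = soft_bin_probs(returns, boundaries)
counts = np.bincount(labels, minlength=3).astype(float)
weights = counts.sum() / np.maximum(counts, 1.0)
print(weighted_mae_loss(probs, labels, class_weights=weights))
```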
4. Key Enhancements for Robustness: Stratification, Weighting, and MAE Loss
Experimental analysis revealed systematic issues in naive rating-based RL: class imbalance and label noise induce reward collapse and unstable training. ERL-VLM addresses these failure modes through three critical modifications:
- Stratified Minibatching: Ensures even representation of each rating class within every batch, mitigating mode collapse toward the majority class.
- Inverse Class-Frequency Weighting: Amplifies the loss for underrepresented ratings, facilitating balanced learning in the reward model.
- Mean-Absolute-Error Loss: MAE is substantially more robust to random label corruption and more stable under noisy feedback than categorical cross-entropy.
Empirical validations showed that this combination preserves expressiveness in the learned reward, sharpens alignment to trajectory quality, and realizes substantial policy gains beyond previous VLM-feedback or CLIP-score reward approaches.
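One way the stratified sampler and inverse-frequency weights could be implemented is sketched below. The equal-per-class batch layout, sampling with replacement for sparse classes, and the weight normalization (weight 1 for a perfectly balanced class) are implementation assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict


def stratified_minibatch(rating_data, batch_size):
    """Draw a minibatch with (near-)equal representation of every rating class.

    rating_data: list of (segment, rating) pairs collected from the VLM.
    """
    by_class = defaultdict(list)
    for segment, rating in rating_data:
        by_class[rating].append((segment, rating))

    classes = sorted(by_class)
    per_class = max(1, batch_size // len(classes))
    batch = []
    for c in classes:
        # Sample with replacement so sparse classes can still fill their share.
        batch.extend(random.choices(by_class[c], k=per_class))
    random.shuffle(batch)
    return batch


def inverse_frequency_weights(rating_data, num_classes):
    """Per-class weights proportional to 1 / class frequency (weight 1 for a balanced class)."""
    counts = [0] * num_classes
    for _, rating in rating_data:
        counts[rating] += 1
    total = sum(counts)
    return [total / (num_classes * c) if c > 0 else 0.0 for c in counts]
```

Inside the reward-model update, `inverse_frequency_weights(rating_data, n)[rating]` would scale each example's MAE term, matching the $w_y$ factor in the loss above.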
5. VLM Rating Query Design and Implementation
ERL-VLM employs Gemini 1.5 Pro as the VLM teacher. Queries are constructed as concise task- and segment-specific prompts:
- MetaWorld prompt: "Analyzing: <Image> You are shown an image of a robot performing [Task Description]. Focus on the target object and decide quality. ———— Rating (choose from {Bad, Average, Good}):"
- ALFRED prompt: "Analyzing: <N consecutive images> with listed actions. Describe change per step. ———— Rating (choose from {Bad, Average, Good}): based on your analysis."
- Real Robot prompt: Structured analogously to MetaWorld.
The rating scale is a small discrete Likert set (e.g., {Bad, Average, Good}), which suffices to provide a robust ordinal signal. Unlike pairwise-preference-based methods, no data is discarded as ambiguous: every segment receives a rating, boosting sample efficiency.
Prompting single segments (as opposed to pairs) substantially reduces VLM compute costs: relative to a pairwise query, each single-segment query requires roughly half the visual context and token budget, while the absolute rating yields a more expressive and less ambiguous feedback signal.
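A sketch of how a single-segment rating query might be assembled and parsed is given below. `call_vlm(prompt, images)` is a generic stand-in for whichever VLM API is used (e.g., a Gemini 1.5 Pro client); the prompt wording paraphrases the templates above, and the reply-parsing heuristic is an assumption rather than the paper's implementation.

```python
RATING_SCALE = ["Bad", "Average", "Good"]


def build_rating_prompt(task_description, num_images=1):
    """Single-segment rating prompt in the spirit of the templates above."""
    return (
        f"You are shown {num_images} image(s) of a robot performing the task: "
        f"{task_description}. Focus on the target object and assess how well "
        f"the segment accomplishes the task.\n"
        f"Rating (choose one of {{{', '.join(RATING_SCALE)}}}):"
    )


def parse_rating(reply, default=0):
    """Map the VLM's free-form reply to an ordinal label (0 = Bad, ..., 2 = Good)."""
    reply_lower = reply.strip().lower()
    for idx, name in enumerate(RATING_SCALE):
        if reply_lower.startswith(name.lower()):
            return idx
    for idx, name in enumerate(RATING_SCALE):
        if name.lower() in reply_lower:
            return idx
    return default  # fall back to the lowest rating if the reply is unparseable


def rate_segment(call_vlm, images, task_description):
    """Query the VLM once per segment; call_vlm(prompt, images) wraps the real API."""
    prompt = build_rating_prompt(task_description, num_images=len(images))
    return parse_rating(call_vlm(prompt, images))
```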
6. Empirical Results and Sample Efficiency
Benchmarks
- MetaWorld (low-level control, SAC): Tasks—Sweep Into, Drawer Open, Soccer
- ALFRED (high-level vision-language, IQL): PickupObject, PutObject, CoolObject, CleanObject (20 total tasks)
- Real Sawyer robot: Sweep Bowl, Drawer Open, Pickup Banana
Comparative Results (approximate task success rates)
| Environment | ERL-VLM | CLIP | RL-VLM-F | BC | Env Reward |
|---|---|---|---|---|---|
| MetaWorld | 85-90% | 50-60% | 55-65% | - | - |
| ALFRED | 70-75% | 30-45% | 10-15% | - | - |
| Real robot | 0.60 | - | - | 0.23 | 0.37 (sparse) |
- Under a fixed VLM-query budget (e.g., 10,000 single-segment queries for MetaWorld), ERL-VLM converges twice as rapidly and achieves +20–30 percentage points higher success rate than VLM-pairwise or CLIP-score baselines.
- On high-level, language-conditioned tasks (ALFRED), absolute ratings dramatically outperform pairwise approaches, which fail to surmount early plateaus.
- Expert trajectory analysis shows the learned reward aligns tightly with task progress along expert demonstrations.
This demonstrates that VLM-generated feedback, when appropriately harvested and stabilized, can replace or surpass human supervision on a range of embodied RL benchmarks.
7. Expressiveness and Advantages of Absolute VLM Ratings
Empirical and ablation studies established several advantages for absolute rating feedback:
- Expressiveness: Absolute scores encode a global assessment for an entire segment, providing denser supervision than a local pairwise label.
- Sample Efficiency: Each rating directly induces an n-class learning target, versus requiring pairwise queries to disambiguate ordering.
- Data Retention: No "ambiguous" pairs are rejected—every rated segment guides reward learning.
- Computational Savings: Token-level prompt and context size is reduced, enhancing throughput and lowering annotation cost.
- Reward Alignment: The learned reward curves increase smoothly along expert demonstrations, mirroring ground-truth reward progress.
Theoretical and practical analyses thus strongly favor absolute trajectory ratings—coupled with stratification and robust regression—in scalable VLM-driven reward learning.
8. Limitations and Prospective Directions
ERL-VLM’s remaining challenges include: dependence on VLM generalization to visually diverse tasks, potential brittleness under highly imbalanced or adversarial data, and the practicalities of RL deployment at scale. The VLM teacher must be robust to both language and visual ambiguity, and the reward model’s generalization hinges on the breadth and quality of sampled segments.
Future work may extend this framework via hierarchical rating schemas, automated detection of rare classes, and active query selection for adaptive feedback allocation. The paradigm of AI-generated multi-task feedback extends beyond RL to model evaluation, alignment, and continuous autonomous learning.
By systematically integrating VLM-generated absolute ratings—reinforced by stratified sampling, per-class weighting, and robust regression losses—ERL-VLM establishes a practical and theoretically grounded approach to feedback-driven reinforcement learning, achieving robust, sample-efficient, and highly expressive reward modeling across a spectrum of robotic and vision-language control tasks (Luu et al., 15 Jun 2025).