Human Feedback as a Metric
- Human Feedback as a Metric is a framework that utilizes direct human evaluations—such as ratings, comparisons, and comments—to assess and improve AI outputs.
- It employs methods like direct optimization (e.g., RLHF) and surrogate feedback models to translate human judgments into actionable training signals.
- This approach enhances AI system design by addressing biases, noise, and efficiency challenges while ensuring outputs align with nuanced human values.
Human feedback as a metric encompasses any systematic use of direct or indirect human judgment to generate evaluative, comparative, or instructive signals for optimizing, selecting, or interpreting AI system behavior. This paradigm extends from explicit ratings, pairwise preferences, and free-form natural language commentary, to information-theoretic or behavioral constructs that serve as proxies for user-centric evaluation. As the scope and expressiveness of AI expand, human feedback is increasingly leveraged to overcome the inadequacies of proxy metrics—such as n-gram overlap or log-likelihood—in tasks where quality, safety, helpfulness, or social alignment are not well characterized by reference baselines or pre-specified reward functions.
1. Formulations and Typologies of Human Feedback
Recent research formalizes human feedback as a mapping from inputs and outputs to some evaluative space:

$$f : \mathcal{X} \times \mathcal{Y} \rightarrow \mathcal{F},$$

where $\mathcal{X}$ is the input space (source text, environment state, etc.), $\mathcal{Y}$ is the space of candidate outputs, and $\mathcal{F}$ is the feedback space, which may comprise ratings, rankings, or free-form annotations (Fernandes et al., 2023). The feedback may be binary, ordinal, scalar, or unstructured. Sophisticated taxonomies further decompose feedback along dimensions such as intent (evaluative, instructive, descriptive), expression (explicit/implicit), engagement (proactive/reactive), content level (instance-, feature-, or meta-level), granularity (action-level, episode-level), and exclusivity (sole/mixed signal) (Metz et al., 18 Nov 2024).
The feedback channel thus spans a spectrum:
- Explicit scalar ratings, Likert scales, rankings, or relative pairwise comparisons (Yu et al., 2023, Kagrecha et al., 16 Jul 2025)
- Open-ended comments and explanations, natural language critique (Fernandes et al., 2023, Yuan et al., 2023)
- Action traces (e.g., “marks” given to agent states at specific timesteps) or clicks (Abramson et al., 2022)
- Proxy signals such as crowdsourced votes or public scoring in domain-specific forums (Gorbatovski et al., 19 Jan 2024)
- Proxy information-theoretic constructs (e.g., “helpfulness” as reduction in cost, or “empowerment” as mutual information about future state controllability) (Freedman et al., 2020, Baddam et al., 2 Jan 2025)
This formalization supports both the immediate use of feedback (e.g., training, online selection, or evaluation) and the indirect use (e.g., training reward models or meta-metrics).
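To make the typology concrete, the following is a minimal Python sketch of how such a feedback mapping might be typed; the class names and the restriction to three feedback kinds are illustrative assumptions, not a schema from the cited taxonomies.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

# Illustrative (hypothetical) feedback spaces F for a mapping f: X x Y -> F.
@dataclass
class ScalarRating:            # e.g., a Likert or 11-point score
    value: float

@dataclass
class PairwisePreference:      # "output_a preferred over output_b"
    preferred_index: int       # 0 or 1

@dataclass
class NaturalLanguageCritique: # free-form commentary or explanation
    text: str

Feedback = Union[ScalarRating, PairwisePreference, NaturalLanguageCritique]

# f maps an input x and one or more candidate outputs y to a feedback signal.
FeedbackFn = Callable[[str, List[str]], Feedback]
```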
2. Methods for Integrating Human Feedback
Two principal strategies exist for using human feedback as a metric:
A. Direct Optimization
Here, collected feedback is used as a direct supervisory or evaluative signal:
- In RLHF (Reinforcement Learning from Human Feedback), feedback (often via comparison or preference) defines the reward function in RL updates (Stiennon et al., 2020, Abramson et al., 2022, Gorbatovski et al., 19 Jan 2024).
- In supervised or imitation learning, human feedback labels trajectories as positive/negative or assigns reward values (Wang et al., 2020, Yu et al., 2023).
- For prompt engineering, feedback is used directly in grading or acceptance criteria (Yang et al., 11 May 2025).
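A minimal sketch of the direct-optimization pattern in (A): a scalar human score is used directly as the reward weighting a REINFORCE-style policy-gradient update. The assumed HuggingFace-style causal LM interface (`model(...).logits`) and the function name are illustrative, not drawn from any cited system.

```python
import torch

def policy_gradient_step(model, optimizer, prompt_ids, response_ids, human_score):
    """One REINFORCE-style update using a human-provided scalar score as reward.

    `model` is assumed to be a causal LM whose output exposes `.logits` of shape
    [batch, seq_len, vocab]; `human_score` is a float, e.g. a rescaled rating.
    """
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids).logits[:, :-1, :]             # predictions for next tokens
    targets = input_ids[:, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Only the response tokens are credited with the human reward.
    response_logp = token_logp[:, prompt_ids.shape[-1] - 1:].sum(dim=-1)
    loss = -(human_score * response_logp).mean()             # maximize reward-weighted log-prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```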
B. Surrogate Feedback Models
An auxiliary reward or evaluator model $r_\phi$ is trained to mimic human judgments, allowing efficient large-scale optimization. For pairwise preferences this typically takes the form of a Bradley-Terry objective:

$$\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right],$$

where $y_w$ and $y_l$ denote the preferred and dispreferred outputs. This is standard for preference feedback. Once trained, the feedback model can serve as a surrogate for downstream RL (as in PPO for summarization (Stiennon et al., 2020)), supervised learning via filtering (Mukobi et al., 2023), or even for prompt-based evaluation (Yang et al., 11 May 2025). For continuous or multi-aspect feedback—e.g., video or image captioning—regression models are trained to fit human-provided scores (Wada et al., 28 Feb 2024, He et al., 21 Jun 2024).
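A minimal PyTorch sketch of fitting such a surrogate under the pairwise objective above; the toy architecture, feature inputs, and hyperparameters are illustrative assumptions rather than the setup of any cited work.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: scores an encoded (prompt, response) feature vector with an MLP."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.scorer(features).squeeze(-1)   # scalar reward per example

def preference_loss(model, feats_preferred, feats_rejected):
    """Bradley-Terry loss: push r(preferred) above r(rejected)."""
    r_w = model(feats_preferred)
    r_l = model(feats_rejected)
    return -torch.nn.functional.logsigmoid(r_w - r_l).mean()

# Usage sketch with random features standing in for encoded (x, y) pairs.
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
feats_w, feats_l = torch.randn(32, 768), torch.randn(32, 768)
loss = preference_loss(model, feats_w, feats_l)
opt.zero_grad(); loss.backward(); opt.step()
```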
Aggregation of Feedback:
The choice of aggregation method is crucial, especially for granular feedback (e.g., 5-point or 11-point scales). Sophisticated supervised aggregation outperforms regularized averaging when feedback granularity is high, reducing the number of ratings needed for equivalently accurate estimates of underlying preference distributions (Kagrecha et al., 16 Jul 2025). In contrast, with binary scales, refinement beyond naive averaging confers little benefit.
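The following toy sketch contrasts naive averaging with a simple supervised aggregator on synthetic 11-point ratings; the histogram-plus-ridge-regression estimator is an illustrative stand-in, not the aggregation method proposed in (Kagrecha et al., 16 Jul 2025).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_items, n_raters, scale = 200, 5, 11          # 11-point granular scale

# Synthetic ground-truth preference probability per item and noisy ratings.
p_true = rng.uniform(0.1, 0.9, size=n_items)
ratings = np.clip(
    rng.normal(p_true[:, None] * (scale - 1), 2.0, size=(n_items, n_raters)),
    0, scale - 1,
).round()

# Naive aggregation: rescale the mean rating to [0, 1].
naive_est = ratings.mean(axis=1) / (scale - 1)

# Supervised aggregation: regress per-item rating histograms onto known targets
# for a small calibration split, then predict for the remaining items.
hist = np.stack([np.bincount(r.astype(int), minlength=scale) for r in ratings])
calib = slice(0, 50)
agg = Ridge(alpha=1.0).fit(hist[calib], p_true[calib])
learned_est = agg.predict(hist)

print("naive MAE:   ", np.abs(naive_est - p_true).mean())
print("learned MAE: ", np.abs(learned_est - p_true).mean())
```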
3. Feedback as a Metric: Quality, Expressiveness, and Reliability
Quality criteria for human feedback have been articulated across three pillars, following (Metz et al., 18 Nov 2024):
- Human-centered: Expressiveness (can users accurately express intent?) and ease (low cognitive and temporal load enable more frequent feedback).
- Interface-centered: Definiteness (precise measurement, uncertainty tracking) and context independence (minimizing external confounds).
- Model-centered: Precision (low variance), unbiasedness, and informativeness (signal quality for learning).
Empirical evaluations have shown:
- Scalar feedback, when properly rescaled and de-noised (e.g., via STEADY or buffer purification), is as precise as, and more informative than, binary feedback (Yu et al., 2023, Wang et al., 2020).
- Crowdsourced or community-scored feedback must be normalized and robustified; otherwise, it may reflect popularity bias, temporal confounds, or relational ambiguity (Gorbatovski et al., 19 Jan 2024).
Several works caution against treating human preference scores as unproblematic "gold standards." For example, single scalar preference ratings may underweight attributes like factuality or coherence and are especially sensitive to confounds such as response assertiveness (Hosking et al., 2023). This suggests that feedback-based metrics must be subjected to multi-dimensional analysis, bias mitigation, and sometimes additional modeling of latent intent.
4. Human Feedback-Driven Metric Learning
Metric learning approaches adapt distance or similarity metrics using human feedback:
- In semantic relatedness, Mahalanobis distance or a parameterized cosine similarity is learned under quadruplet-based relative constraints, with constraint weights reflecting the difference in human-rated relatedness (Niebler et al., 2017).
- For image captioning, multimodal metric learning frameworks (e.g., M²LHF underpinning Polos) regress directly onto normalized human scores, leveraging embeddings to model caption-image-reference relations (Wada et al., 28 Feb 2024).
- In video evaluation, human judgment on multiple aspects (visual, temporal, factual alignment) anchors regression targets for models like VideoScore, enabling fine-grained, automatic assessment correlating strongly with human rater distributions (He et al., 21 Jun 2024).
Large-scale curated datasets (e.g., Polaris, VideoFeedback) with multi-aspect, multi-rater judgments improve robustness and generalizability of such metrics.
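As an illustration of the quadruplet-constraint formulation above (Niebler et al., 2017), the following sketch learns a Mahalanobis-style distance so that pairs humans rated as more related end up closer than less related pairs; the parameterization, margin, and weighting scheme are simplified assumptions.

```python
import torch

class MahalanobisDistance(torch.nn.Module):
    """Distance d(x, y) = ||L(x - y)||^2 with a learnable linear map L."""
    def __init__(self, dim: int):
        super().__init__()
        self.L = torch.nn.Linear(dim, dim, bias=False)

    def forward(self, x, y):
        diff = self.L(x - y)
        return (diff ** 2).sum(dim=-1)

def quadruplet_loss(dist, a, b, c, d, weights, margin=0.1):
    """For each quadruplet, (a, b) was rated more related than (c, d) by humans;
    the hinge pushes d(a, b) below d(c, d), weighted by the rating gap."""
    violation = torch.relu(dist(a, b) - dist(c, d) + margin)
    return (weights * violation).mean()

# Usage sketch with random embeddings and rating-gap weights.
dim = 64
metric = MahalanobisDistance(dim)
opt = torch.optim.Adam(metric.parameters(), lr=1e-3)
a, b, c, d = (torch.randn(128, dim) for _ in range(4))
weights = torch.rand(128)            # e.g., difference in human relatedness scores
loss = quadruplet_loss(metric, a, b, c, d, weights)
opt.zero_grad(); loss.backward(); opt.step()
```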
5. Derived Metrics and Intrinsic Feedback Constructs
Human feedback can generate new, intrinsic metrics for agent evaluation and control:
- “Helpfulness” quantifies reduction in human effort due to agent intervention, formulated as a cost difference $h = C_{\text{alone}} - C_{\text{assisted}}$ between the human acting alone and acting with the agent, with stochastic and normalized variants capturing utility across joint plans (Freedman et al., 2020); a minimal sketch follows this list.
- “Human empowerment” formalizes user agency as the information-theoretic mutual information between agent actions and attainable futures, estimated via variational approximation involving policy, transition, and planning networks (Baddam et al., 2 Jan 2025). Such metrics provide continuous, dynamic measures of social autonomy and can distinguish among navigation strategies unobtainable by simple proxemic rules.
- In automated agent evaluation, frameworks like AutoLibra induce interpretable, fine-grained behavioral metrics from open-ended feedback by LLM-driven segmentation, clustering, and meta-metric-based selection (coverage, redundancy) (Zhu et al., 5 May 2025).
- Feedback-driven benchmarks can reduce data artifacts and force models to face harder, more generalizable tasks, as shown by real-time, visual human-in-the-loop systems (e.g., VAIDA and its Data Quality Index) (Arunkumar et al., 2023).
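Returning to the helpfulness construct above, a minimal sketch of the cost-reduction computation and a normalized variant; the cost function and plan representation are placeholder assumptions rather than the formulation of (Freedman et al., 2020).

```python
from typing import Callable, Sequence

def helpfulness(
    cost: Callable[[Sequence[str]], float],
    human_only_plan: Sequence[str],
    joint_plan_human_part: Sequence[str],
) -> float:
    """Raw helpfulness: how much human cost the agent's intervention removes."""
    return cost(human_only_plan) - cost(joint_plan_human_part)

def normalized_helpfulness(cost, human_only_plan, joint_plan_human_part) -> float:
    """Normalized variant in [0, 1] when the agent cannot increase human cost."""
    baseline = cost(human_only_plan)
    if baseline == 0:
        return 0.0
    return helpfulness(cost, human_only_plan, joint_plan_human_part) / baseline

# Usage sketch: cost = number of actions the human must execute.
step_cost = lambda plan: float(len(plan))
alone = ["fetch_mug", "fill_kettle", "boil", "pour"]
with_agent = ["pour"]                       # agent handled the rest
print(normalized_helpfulness(step_cost, alone, with_agent))   # 0.75
```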
6. Limitations, Biases, and Robustification
Key limitations and solutions described in current literature include:
- Subjectivity and Confounding: Single-score ratings are susceptible to bias, under-representation of subtle criteria, and stylistic confounds (like assertiveness). Multi-dimensional breakdowns and curated annotator pools can address some of these deficiencies (Hosking et al., 2023).
- Noise and Mislabeling: Noisy, adversarial, or low-skill feedback can degrade model alignment. The Hölder-DPO method robustifies preference optimization with a redescending divergence, rendering model parameters insensitive to arbitrarily extreme mislabels and enabling direct dataset contamination estimation via the clean-data likelihood (Fujisawa et al., 23 May 2025).
- Cost and Data Efficiency: Unlike binary feedback, higher-granularity scales allow advanced aggregators and learning methods to reach equivalently accurate preference estimates with fewer ratings, translating to direct reductions in data collection costs (Kagrecha et al., 16 Jul 2025).
- Proxy Model Overoptimization: In RLHF and surrogate-based systems, models can “hack” reward models by learning to exploit labeling idiosyncrasies, which necessitates regularization, prior preservation (via KL divergence), and diversity metrics (e.g., METEOR similarity for response variance) (Mukobi et al., 2023).
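A minimal sketch of the KL-based prior-preservation idea: the surrogate reward is shaped by a penalty on divergence from a frozen reference policy, which discourages the policy from drifting into regions where the reward model is unreliable. The tensor shapes and function name are illustrative assumptions, not any cited system's implementation.

```python
import torch

def regularized_reward(
    reward_model_score: torch.Tensor,     # r_phi(x, y) from the surrogate model, shape [batch]
    logp_policy: torch.Tensor,            # log pi_theta(y_t | x, y_<t), shape [batch, seq_len]
    logp_reference: torch.Tensor,         # log pi_ref(y_t | x, y_<t), shape [batch, seq_len]
    beta: float = 0.1,
) -> torch.Tensor:
    """Penalize drift from the reference policy to discourage reward hacking.

    The per-sequence term sum_t [log pi_theta - log pi_ref] (a sampled KL estimate)
    is subtracted from the surrogate reward, weighted by beta.
    """
    kl_per_sequence = (logp_policy - logp_reference).sum(dim=-1)
    return reward_model_score - beta * kl_per_sequence

# Usage sketch with dummy tensors for a batch of 4 responses of length 16.
scores = torch.randn(4)
lp_pol, lp_ref = torch.randn(4, 16), torch.randn(4, 16)
shaped = regularized_reward(scores, lp_pol, lp_ref)
```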
7. Implications for System Design and Future Research
Human feedback as a metric is central to aligning AI outputs with nuanced, real-world notions of quality, safety, and utility. Key design implications include:
- Multi-dimensional, task- and domain-specific evaluation and reward models should be deployed, aggregating feedback across scalar, ordinal, and open-ended modalities.
- Feedback frameworks should be co-designed with human-computer interaction experts to maximize expressiveness, ease, and context independence (Metz et al., 18 Nov 2024).
- Feedback must be robustified against noise/bias and leveraged via meta-metrics to track coverage and redundancy, ensuring metrics reflect underlying user concerns.
- Increasingly, proxy or surrogate AI feedback models (e.g., LLM-as-judge or reward model bootstrapping) are used to encode human intent at scale, but ongoing calibration with high-quality real feedback remains essential (Fernandes et al., 2023, Yuan et al., 2023).
This synthesis underscores that human feedback, in all its forms, is neither monolithic nor infallible. It is best understood as a quantifiable, multi-perspective signal that, if carefully elicited, modeled, and aggregated, provides a powerful metric for both the training and evaluation of intelligent systems. As AI systems proliferate into high-stakes domains and become more autonomous, continued work is necessary to ensure that the metrics derived from human feedback are expressive, robust, unbiased, and truly aligned with human values and expectations.