
Omni-Preference: A Multimodal Evaluation Paradigm

Updated 7 February 2026
  • Omni-Preference is a framework that generalizes preference evaluations with flexible, multimodal criteria across text, images, audio, and more.
  • It utilizes automated data generation pipelines and teacher-model annotations to achieve scalable and high-confidence pairwise comparisons across diverse domains.
  • Mathematical models, including discriminative and generative objectives, underpin its robust alignment of system outputs with structured, rubric-based evaluations.

Omni-Preference encompasses a class of frameworks, datasets, mathematical models, and training objectives for representing and eliciting preferences across heterogeneous, often high-dimensional, input domains—spanning text, images, video, audio, and even physiological stimuli—while supporting structurally flexible, multi-criterion, or entirely free-form evaluation. Across recent research in LLMs, reward modeling, generative diffusion, and neuroeconomics, Omni-Preference solutions provide mechanisms to synthesize, annotate, and learn from preference data that capture the richness of user intent (e.g., grounded in rubrics, support for arbitrary criteria, or physiological observables) and the complexity of multimodal outputs. The aim is to enable automated, scalable, and interpretable alignment of generative and reasoning systems with nuanced and context-dependent notions of quality, faithfulness, safety, or utility.

1. Core Concepts and Definitions

At its foundation, the Omni-Preference paradigm seeks to generalize the notion of a "preference" beyond statically defined, unidimensional, or single-modality settings.

  • General form: A preference is an ordering or assignment of value to outputs or choices, possibly conditioned on a context and an axis of evaluation.
  • Multimodality: Preferences can be elicited or deployed on outputs in text, images, video, audio, 3D, or even arbitrary sensor data. Recent frameworks require support for pairwise or scalar scoring across any of these representation spaces (Kong et al., 31 Jan 2026, Jin et al., 27 Oct 2025, Chen et al., 31 Aug 2025, Liu et al., 2024).
  • Flexible criteria: Instead of optimizing for a hard-wired notion of helpfulness, harmlessness, or image realism, Omni-Preference systems provide for arbitrary, even free-form, human- or machine-specified criteria at inference or evaluation time (Jin et al., 27 Oct 2025).
  • Rubric- and facet-grounding: Preference judgments may be decomposed by rubrics (e.g., fluency, relevance, accuracy, reasoning, safety, visual grounding, acoustic fidelity), with structured rationales justifying each comparative score (Kong et al., 31 Jan 2026).

An archetypal Omni-Preference dataset consists of tuples:

(c, x, y_A, y_B, p, S_A, S_B, J)

where c is the criterion (possibly free-form), x is the input/query, y_A and y_B are candidate outputs, p is the preferred choice, S_A, S_B are scalar or multidimensional scores, and J is a structured, possibly rubric-decomposed, justification.
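The tuple above maps naturally onto a simple record type. The sketch below is illustrative only; the field names and the example values are assumptions, not part of any released dataset schema:

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class PreferencePair:
    """One Omni-Preference example: (c, x, y_A, y_B, p, S_A, S_B, J)."""
    criterion: str                 # c: evaluation axis, possibly free-form
    query: Any                     # x: input/prompt (text, image, audio, ...)
    candidate_a: Any               # y_A
    candidate_b: Any               # y_B
    preferred: str                 # p: "A" or "B"
    score_a: float                 # S_A: scalar (or aggregated rubric) score
    score_b: float                 # S_B
    justification: Dict[str, str]  # J: rubric-decomposed rationale

# Hypothetical example record
pair = PreferencePair(
    criterion="factual accuracy",
    query="Summarize the attached report.",
    candidate_a="Summary A ...",
    candidate_b="Summary B ...",
    preferred="A",
    score_a=8.0,
    score_b=5.0,
    justification={"accuracy": "A cites correct figures; B omits two."},
)
```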

2. Automated Omni-Preference Data Generation and Pipelines

Modern frameworks operationalize Omni-Preference via large-scale, automated pipelines:

  • Pair synthesis via capability gap: Generate responses y_A, y_B to an input x by systematically contrasting a "strong" generator model M_s with a "weak" M_w, where y_A = M_s(x) and y_B = M_w(x) (Kong et al., 31 Jan 2026).
  • Teacher model annotation: Pairs are scored by high-capacity LLMs independently, producing overall verdicts, numeric scores, and rubric-grounded rationales; reconciliation, filtering, and merging protocols ensure high-confidence supervision without human labeling (Kong et al., 31 Jan 2026).
  • Multimodal benchmark coverage: Datasets span domains, e.g., HH-RLHF prompts for text, RLAIF-V and VL-RewardBench for images, ActivityNet/ShareGPT-Video for video, Clotho-AQA and Audio-HH-RLHF for audio (Kong et al., 31 Jan 2026, Jin et al., 27 Oct 2025, Chen et al., 31 Aug 2025, Liu et al., 2024).
  • Free-form criterion augmentation: Automatic synthesis of arbitrary evaluation prompts or axes, e.g., by prompting GPT-4o to explain criteria and verifying using secondary models, facilitating instruction-tuning with wide epistemic coverage (Jin et al., 27 Oct 2025).
  • Quality filtering and reconciliation: Use of rule-based filters for duplication, JSON validity, verdict-score consistency, and low-information examples ensures strong integrity in pairwise and scalar preference data (Kong et al., 31 Jan 2026).
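The pipeline stages above (pair synthesis, independent teacher annotation, reconciliation and filtering) can be sketched as plain functions. All callables and the agreement threshold are hypothetical stand-ins, not the papers' actual implementation:

```python
def synthesize_pair(x, strong_model, weak_model):
    """Capability-gap pair: contrast a strong and a weak generator on x."""
    return strong_model(x), weak_model(x)

def annotate(x, y_a, y_b, teachers):
    """Each teacher independently returns (verdict, score_a, score_b)."""
    return [teacher(x, y_a, y_b) for teacher in teachers]

def reconcile(annotations, min_agreement=1.0):
    """Keep a pair only if teacher verdicts agree and match their scores.

    Verdict-score consistency: a teacher preferring "A" must also have
    scored A higher than B (and symmetrically for "B").
    """
    verdicts = [v for v, _, _ in annotations]
    agreement = verdicts.count(verdicts[0]) / len(verdicts)
    consistent = all((v == "A") == (sa > sb) for v, sa, sb in annotations)
    if agreement >= min_agreement and consistent:
        return verdicts[0]
    return None  # filtered out: low-confidence supervision

# Toy usage with stand-in models
y_a, y_b = synthesize_pair("q", lambda x: "strong:" + x, lambda x: "weak:" + x)
```

Pairs surviving `reconcile` become high-confidence supervision without any human labeling, mirroring the reconciliation-and-filtering step described above.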

3. Mathematical Formulations and Training Objectives

The mathematical underpinnings of Omni-Preference frameworks span both classical and novel preference modeling approaches.

  • Discriminative objectives: A criterion-conditioned Bradley–Terry loss trains a scalar reward model on pairwise comparisons:

\mathcal{L}_\mathrm{BT} = -\log \frac{\exp(r(c, x, y_p))}{\exp(r(c, x, y_1)) + \exp(r(c, x, y_2))}

where r scores the candidate conditioned on all inputs and the criterion (Jin et al., 27 Oct 2025).

  • Generative objectives: Models may generate chain-of-thought explanations ee and a preference decision pp', using RL (e.g., Group Relative Policy Optimization/GRPO) to maximize expected agreement with reference preferences, with explicit KL regularization (Jin et al., 27 Oct 2025, Kong et al., 31 Jan 2026).
  • Rubric-decomposed reward: Composite reward signals incentivize correct verdicts, score consistency, and rubric justification coverage, as in:

r_i = w_\mathrm{fmt} R_\mathrm{fmt}(y_i) + w_\mathrm{pref} R_\mathrm{pref}(y_i) + w_\mathrm{rub} R_\mathrm{rub}(y_i)

for each output y_i, with advantages normalized within sampled groups (Kong et al., 31 Jan 2026).

  • DPO for multimodal diffusion: In text-to-video settings, the DPO contrastive loss is extended to pairs generated by maximizing a composite "OmniScore"—a weighted aggregate of visual (intra- and inter-frame) and semantic alignment sub-scores (Liu et al., 2024).
  • Neuroeconomic hyperfunction: Individual omni-preference can be represented as a hyperfunction F(ξ), defined by boundary values of complex-analytic functions, mapping all possible stimuli ξ to real values V = F(ξ), fully reconstructible from physiological data such as neuron interspike intervals (Shapiro, 2011).
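The rubric-decomposed reward and group-wise advantage normalization described above can be sketched as follows; the specific weight values are assumptions for illustration, not the paper's configuration:

```python
import statistics

def composite_reward(r_fmt, r_pref, r_rub,
                     w_fmt=0.1, w_pref=0.6, w_rub=0.3):
    """r_i = w_fmt*R_fmt(y_i) + w_pref*R_pref(y_i) + w_rub*R_rub(y_i).

    Weights here are illustrative; they trade off format validity,
    verdict correctness, and rubric justification coverage.
    """
    return w_fmt * r_fmt + w_pref * r_pref + w_rub * r_rub

def group_advantages(rewards):
    """GRPO-style normalization: advantages relative to the sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard degenerate groups
    return [(r - mu) / sigma for r in rewards]
```

Normalizing within each sampled group means an output is rewarded for being better than its siblings, not for the absolute scale of its reward.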

4. Rubrics, Criteria, and Multi-Dimensional Supervision

In contrast to scalar reward or unidimensional preference, Omni-Preference approaches impose or elicit structured, rubric-driven supervision:

| Criterion | Definition | Example Modality-Specific Facet |
|-----------|------------|---------------------------------|
| Fluency | Clarity, grammar, conciseness | — |
| Relevance | Prompt/scene fidelity | Visual/temporal grounding (vision/video) |
| Accuracy | Factual correctness, completeness | Acoustic fidelity (audio) |
| Reasoning | Logical structure, inference steps | — |
| Safety | No harmful or disallowed content | — |

Rubrics may be modality-agnostic or include facets specific to, e.g., frame consistency (video), object grounding (vision), or speech-text mapping (audio) (Kong et al., 31 Jan 2026). Free-form criteria allow for user- or model-specified axes at inference time, supporting fine-tuned evaluation and downstream task adaptation (Jin et al., 27 Oct 2025).
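One simple way to organize such supervision is a base rubric extended with modality-specific facets; the structure and names below are illustrative, not a published schema:

```python
# Modality-agnostic criteria, extended per modality with the facets
# mentioned above. Names are hypothetical identifiers for illustration.
BASE_RUBRIC = ["fluency", "relevance", "accuracy", "reasoning", "safety"]

MODALITY_FACETS = {
    "vision": ["object grounding"],
    "video": ["frame consistency", "temporal grounding"],
    "audio": ["acoustic fidelity", "speech-text mapping"],
}

def rubric_for(modality: str) -> list:
    """Base criteria plus any facets specific to the given modality."""
    return BASE_RUBRIC + MODALITY_FACETS.get(modality, [])
```

Free-form criteria would then enter as additional user- or model-specified axes appended to this list at inference time.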

5. Representative Datasets and Benchmarks

  • Omni-Preference: 41,000 high-confidence pairwise comparisons across image, video, audio, and text. Preference pairs annotated with rubric-based criteria by multiple LLM teachers and reconciled (Kong et al., 31 Jan 2026).
  • Omni-RewardData: 317,000 pairs (248k general, 69k instruction-tuning with synthesized criteria), supporting discriminative and generative preference learning (Jin et al., 27 Oct 2025).
  • Omni-RewardBench: Evaluation covering nine tasks, five modalities (T2T, TI2T, TV2T, TA2T, T2I, T2V, T2A, T23D, TI2I), annotated for 1–10 free-form criteria per pair (Jin et al., 27 Oct 2025).
  • VideoDPO: Automatically constructed video preference pairs using the "OmniScore," quantifying both visual and semantic quality in text-to-video generation (Liu et al., 2024).
  • Neuroeconomic datasets: Physiological measurements (e.g., spike train histograms) linked directly to value assignments under the hyperfunction framework (Shapiro, 2011).

6. Empirical Results and Analysis

Empirical evaluations demonstrate that Omni-Preference-based reward models substantially improve both accuracy and interpretability over unimodal or fixed-criterion baselines.

  • Multimodal reward modeling: Omni-RRM achieves 71.8% mean preference accuracy over image (e.g., 67.1% on VL-RewardBench), video (80.2% on ShareGPT-V), and audio (66.8% on Audio-HH-RLHF), with state-of-the-art results for open-source models and significant ablation improvements due to rubric grounding and reinforcement learning on hard pairs (Kong et al., 31 Jan 2026).
  • Generalist RM performance: Omni-RewardModel-BT achieves 73.68% overall accuracy on Omni-RewardBench, compared to 62.18% for base models, with superior robustness to mixed-modality training and free-form evaluation (Jin et al., 27 Oct 2025).
  • Preference alignment in diffusion: VideoDPO delivers marked improvements on both quality and semantic measures (e.g., 80.44% VBench-Total for the VC2 baseline vs. 81.93% for VideoDPO), validated by both synthetic and qualitative benchmarks (Liu et al., 2024).
  • Neurophysiological comprehensiveness: Hyperfunction-based omni-preference models provably admit all possible preference geometries and are experimentally reconstructible from neural statistics (Shapiro, 2011).

7. Theoretical and Foundational Interpretations

Omni-Preference carries both operational and theoretical significance:

  • No reliance on hardwired metrics: Allows arbitrary, context-aware extension to new domains, modalities, and axes of judgment (e.g., safety, trustworthiness, artistic merit).
  • Automated, scalable supervision: Enables preference elicitation at web-scale without the need for bespoke human annotation, leveraging bootstrapping from multiple model capabilities (Kong et al., 31 Jan 2026, Jin et al., 27 Oct 2025).
  • Axiomatic completeness: Hyperfunctional modeling ensures no loss of expressive power in representing infinite-dimensional or discontinuous preference sets (Shapiro, 2011).
  • Interpretability: Rubric- or chain-of-thought-based explanations provide criterion- and dimension-level insight into model decisions (Kong et al., 31 Jan 2026, Jin et al., 27 Oct 2025).
  • Future directions: Extending frameworks to additional modalities (e.g., tactile, multi-agent settings), more sophisticated end-to-end joint annotation, and parameter-efficient adaptation remain open research areas (Chen et al., 31 Aug 2025).

Omni-Preference thus forms the basis for state-of-the-art multimodal reward modeling, scalable system alignment, and mathematically rigorous representations of both machine and human valuation across arbitrary input domains.
