Omni-Preference Dataset
- Omni-Preference is a large-scale, multimodal dataset designed to train and evaluate reward models using text, image, video, and audio inputs.
- It employs a two-stage automated pipeline with candidate generation from strong versus weak models and rubric-grounded, multi-dimensional feedback.
- The dataset’s high-confidence, richly annotated preference pairs enhance model alignment, scalability, and performance on diverse downstream tasks.
The Omni-Preference dataset refers to a family of large-scale, multi-modality corpora explicitly constructed for training and evaluating reward models and preference alignment in generative models, especially across text, image, video, and audio modalities. Originating from efforts to overcome the limitations of vision-centric, scalar-only, or manually annotated datasets, Omni-Preference datasets emphasize rubric-grounded rationales, automatic preference synthesis, and fine-grained, multi-dimensional annotations suitable for training robust, generalist reward models (RMs) and reinforcement learning pipelines.
1. Methodological Principles and Dataset Construction
Omni-Preference datasets are designed to automate the synthesis of high-quality preference pairs, reduce reliance on human annotation, and enable structured reasoning about model output quality. The canonical pipeline, as instantiated in "Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis" (Kong et al., 31 Jan 2026), follows a two-stage procedure:
- Stage 1: Automated Contrasting Candidate Generation
- Given a multimodal context (containing text and a modality element: image, video, or audio) and a prompt drawn from public benchmarks (e.g., RLAIF-V, ActivityNet, Clotho-AQA), two models of differing “strength” (Ms: strong, Mw: weak) each generate a response, yielding a candidate response pair.
- No immediate preference label is assigned at this stage.
- Stage 2: Rubric Annotation, Reconciliation, and Filtering
- Heterogeneous, LLM-based multimodal teachers (e.g., GPT-4o-mini and Doubao-1.5-Pro) assess each pair, providing:
- Scalar scores for both responses (`score_A` and `score_B`, each on a 0–10 scale).
- A verdict naming the better response (`better`).
- Five dimension-wise textual justifications (fluency, relevance, accuracy, reasoning, safety), plus optionals like temporal_consistency for video.
- Outputs are reconciled by agreement level:
- If both teachers agree on the verdict and the margin supports it, the pair is retained with averaged scores and merged rubrics.
- Discrepant or ambiguous pairs (e.g., verdict conflict, or “equal”) are discarded.
- Rule-based filtering removes pairs with duplicates, empty content, inconsistent verdicts, missing rubric fields, both candidates below quality threshold, or semantic misalignment between verdict and rubric.
At the end of this pipeline, approximately 41,000 high-confidence, richly annotated, rubric-grounded preference pairs remain (Kong et al., 31 Jan 2026). Because the pipeline is fully automated, it requires no human-in-the-loop labeling, which favors scalability and consistency in preference dataset curation.
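The reconciliation step of Stage 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `TeacherJudgment` class, the `reconcile` function, the `min_margin` threshold, and the rubric-merging rule (simple concatenation) are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

RUBRIC_KEYS = ("fluency", "relevance", "accuracy", "reasoning", "safety")


@dataclass
class TeacherJudgment:
    score_a: float  # 0-10 score for response A
    score_b: float  # 0-10 score for response B
    better: str     # "A", "B", or "equal"
    rubric: dict    # dimension name -> textual justification


def reconcile(t1: TeacherJudgment, t2: TeacherJudgment,
              min_margin: float = 1.0) -> Optional[dict]:
    """Merge two teacher judgments, or return None if the pair is discarded."""
    # Discard verdict conflicts and "equal" verdicts.
    if t1.better != t2.better or t1.better not in ("A", "B"):
        return None
    # Discard pairs with missing or empty rubric fields.
    for teacher in (t1, t2):
        if any(not teacher.rubric.get(key) for key in RUBRIC_KEYS):
            return None
    score_a = mean([t1.score_a, t2.score_a])
    score_b = mean([t1.score_b, t2.score_b])
    # The averaged margin must support the agreed verdict
    # (min_margin is an assumed threshold, not a value from the paper).
    margin = score_a - score_b if t1.better == "A" else score_b - score_a
    if margin < min_margin:
        return None
    # Assumed merge rule: concatenate the two teachers' justifications.
    merged = {key: f"{t1.rubric[key]} {t2.rubric[key]}" for key in RUBRIC_KEYS}
    return {"score_A": score_a, "score_B": score_b,
            "better": t1.better, "rubric": merged}
```

A pair survives only if `reconcile` returns a record; the rule-based filters (duplicates, empty content, quality threshold) would run as separate checks on the surviving records.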
2. Data Schema, Rubrics, and Annotation Structure
Each preference pair in Omni-Preference is accompanied by a fixed, multi-dimensional rubric instantiated as a JSON object, capturing both numerical (holistic) and textual (dimension-specific) feedback:
- Global Schema Fields:
- `score_A`, `score_B`: Scalar quality scores, 0–10 per teacher, then averaged.
- `better`: Token indicating which candidate is superior.
- `rubric`: Dictionary with keys `fluency`, `relevance`, `accuracy`, `reasoning`, `safety`; values are concise textual justifications.
- (Optional) Modality-specific keys, e.g. `temporal_consistency` (video).
Sample Rubric Annotation:
{
"score_A": 8,
"score_B": 5,
"better": "A",
"rubric": {
"fluency": "A is concise and well-phrased; B contains minor redundancies.",
"relevance": "A stays on topic; B drifts into irrelevant detail.",
"accuracy": "A is factually correct; B mislabels one object.",
"reasoning": "A gives a clear step-by-step rationale; B’s logic is incomplete.",
"safety": "Neither contains disallowed or harmful content."
},
"final_verdict": "A"
}
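A record in this shape can be checked mechanically before it enters a training set. The sketch below is illustrative only: `validate_annotation` and its error messages are hypothetical, and the checks mirror the filtering criteria described earlier (missing rubric fields, verdict inconsistent with scores) rather than any published validator.

```python
import json

REQUIRED_RUBRIC_KEYS = {"fluency", "relevance", "accuracy", "reasoning", "safety"}


def validate_annotation(raw: str) -> list:
    """Return a list of problems found in one annotation; empty means it passed."""
    try:
        ann = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(ann, dict):
        return ["top-level value is not an object"]

    problems = []
    for key in ("score_A", "score_B"):
        value = ann.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            problems.append(f"{key} must be a number in [0, 10]")
    if ann.get("better") not in ("A", "B"):
        problems.append("better must be 'A' or 'B'")

    rubric = ann.get("rubric")
    if not isinstance(rubric, dict):
        problems.append("rubric must be an object")
    else:
        missing = REQUIRED_RUBRIC_KEYS - set(rubric)
        if missing:
            problems.append("missing rubric fields: " + ", ".join(sorted(missing)))

    # Only compare verdict and scores once the fields themselves are valid.
    if not problems:
        a, b = ann["score_A"], ann["score_B"]
        higher = "A" if a > b else "B" if b > a else None
        if higher != ann["better"]:
            problems.append("verdict disagrees with scores")
    return problems


sample = {
    "score_A": 8, "score_B": 5, "better": "A",
    "rubric": {key: "short justification" for key in REQUIRED_RUBRIC_KEYS},
}
print(validate_annotation(json.dumps(sample)))  # prints []
```

Optional modality-specific keys such as `temporal_consistency` pass through untouched, since only the five core rubric keys are required here.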
3. Data Composition, Statistics, and Difficulty Spectrum
The Omni-Preference dataset covers a wide spectrum of tasks and modalities, with careful balancing between “easy” (high-margin) and “hard” (low-margin) pairs. According to Table 1 of Kong et al. (31 Jan 2026):
| Modality | Source(s) | Strong Model | Weak Model | Hard/Easy | Final Samples |
|---|---|---|---|---|---|
| Image | RLAIF-V | Qwen2.5-VL-7B | Qwen2.5-VL-3B | 5.3k/11.7k | 17.0k |
| Video | ActivityNet, Charades | Qwen2.5-VL-7B | Qwen2.5-VL-3B | 3.3k/8.9k | 12.2k |
| Audio | Clotho-AQA | R1-AQA-7B | Qwen2-Audio-7B | 3.0k/8.8k | 11.8k |
| Omni Text & Misc. | Various | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | 11.6k/29.4k | 41.0k total |
- Total pairs: ~41,000.
- Difficulty split: 11.6k hard, 29.4k easy.
- Filtering ratios: 10–28% of initial pairs discarded; teacher tie rates pre-filtering: 0.8–12% by modality; reversal rate 2–32%.
- Per-sample metadata: Context (input text + media), candidate pair, averaged scores, rubric verdict, five justifications, difficulty tag, teacher/model versioning.
This granularity facilitates nuanced benchmarking and targeted training (e.g., resampling on “hard” pairs to emphasize reward model discrimination).
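As a quick arithmetic check, the image, video, and audio rows of the table sum to the reported totals. The sketch below also derives a per-modality up-weighting factor for hard pairs; the counts are the rounded figures from the table, and `hard_oversampling_factor` is a hypothetical helper showing one simple resampling rule (draw hard and easy pairs equally often), not a procedure taken from the paper.

```python
# Approximate (hard, easy) pair counts per modality, read off the table above.
counts = {
    "image": (5_300, 11_700),
    "video": (3_300, 8_900),
    "audio": (3_000, 8_800),
}

total_hard = sum(hard for hard, _ in counts.values())
total_easy = sum(easy for _, easy in counts.values())
print(total_hard, total_easy, total_hard + total_easy)  # 11600 29400 41000


def hard_oversampling_factor(hard: int, easy: int) -> float:
    """Weight for hard pairs so that hard and easy groups are drawn equally often."""
    return easy / hard


for modality, (hard, easy) in counts.items():
    share = hard / (hard + easy)
    factor = hard_oversampling_factor(hard, easy)
    print(f"{modality}: hard share {share:.1%}, up-weight factor {factor:.2f}")
```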
4. Relation to Other Omni-Preference-Style Datasets
The Omni-Preference concept is mirrored and extended in several contemporary corpora, each emphasizing automatic, rubric-grounded, or multi-modal preference construction:
- UltraMix (Ultra-Preference Mix) (Djuhera et al., 14 Nov 2025):
- Multi-source assembly drawing from TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs.
- Automated preference validation and reward-margin filtering ensure 100% coherent pairs.
- Rich per-pair annotations: task category (12-way), input quality (ordinal), preference reward (quantitative margin), optional difficulty (5-level).
- Final size: 190k pairs, ~30% smaller than TuluDPO, spanning text, code, mathematics, and reasoning.
- Designed to deliver both high-quality and high-diversity preferences, UltraMix is described as an “Omni-Preference” mixture for DPO, yielding consistent empirical gains on SFT and DPO baselines.
- Omni-RewardData (Jin et al., 27 Oct 2025):
- ~317,000 preference pairs, including 69k instruction-tuning pairs with free-form, user-specific preference criteria.
- Modalities include text, image, video, audio, and 3D; all pairs validated by expert annotators under IRB.
- The dataset enables discriminative and generative reward modeling that can adapt to explicit user preference prompts, thus addressing “preference rigidity.”
- The annotation schema includes a natural-language instruction field that specifies the evaluation criterion for each pair.
- OmniDPO-10k (Chen et al., 31 Aug 2025):
- Focused on video-audio-text “omni-modal” hallucination.
- Contains 27,423 preference comparisons across text-preference (audio–video alignment), visual-preference (full vs. blurred), and audio-preference (full vs. muted) pairs.
- All pairs originate from MSRVTT (video/audio), with robust filtering and manual spot-checks to ensure fidelity.
- Explicitly designed for preference optimization with modality-aware extensions to standard DPO objectives.
- VideoDPO (Omni-Preference for Video Generation) (Liu et al., 2024):
- 10,000 preference pairs constructed from 40,000 videos generated for 10,000 prompts.
- Each pair labeled by the “OmniScore,” a composite metric incorporating motion smoothness, temporal flicker, subject consistency, imaging quality, aesthetic quality, dynamic degree, and semantic alignment (all per-prompt).
- No human annotation; re-weighted pairwise preference margins maximize rare/high-contrast supervision.
5. Reward Model Training and Optimization Objectives
Training regimes leveraging Omni-Preference datasets capitalize on their structured, semi-automatic annotation and multi-dimensional rationale. The principal recipes are:
- Supervised Fine-Tuning (SFT):
- Objective: generation of exact JSON-structured rubric and verdict.
- Loss: negative log-likelihood over target output tokens.
- Example: the SFT stage of Omni-RRM (Kong et al., 31 Jan 2026).
- Reinforcement Learning with Structured Rewards (GRPO):
- For a group of outputs sampled from the policy, the reward of each output is a weighted sum of three components:
- schema validity, verdict–score consistency, and justification coverage and consistency.
- Policy advantage calculation: each reward is normalized relative to the other rewards in the same sample group.
- Optimization: KL-penalized PPO-style update vs. a reference policy.
- Extended DPO with Modality-Aware Losses (OmniDPO):
- The objective sums several preference terms, where each term enforces alignment on full/partial input variants (Chen et al., 31 Aug 2025).
- Pairwise Preference Losses with Margin Re-Weighting (VideoDPO):
- The pairwise loss is re-weighted per pair, up-weighting rare or high-contrast preference cases (Liu et al., 2024).
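The reward and loss shapes named in this list can be sketched in a few lines. Everything below is a hedged illustration: the function names, the default weights, `beta`, and `eps` are assumptions introduced here, the advantage uses the commonly cited GRPO form (reward minus group mean, divided by group standard deviation), and the DPO term is the standard per-pair loss with an extra `weight` factor standing in for pair re-weighting. The exact formulas used by Omni-RRM, OmniDPO, and VideoDPO are not reproduced in this section.

```python
import math
from statistics import mean, pstdev


def structured_reward(r_schema, r_consistency, r_rubric, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of schema validity, verdict-score consistency, and rubric quality."""
    w_schema, w_consistency, w_rubric = weights  # assumed weights, not from the paper
    return w_schema * r_schema + w_consistency * r_consistency + w_rubric * r_rubric


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center and scale rewards within one sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def weighted_dpo_loss(logratio_chosen, logratio_rejected, beta=0.1, weight=1.0):
    """Per-pair loss: -weight * log(sigmoid(beta * (chosen - rejected))).

    Each log-ratio is log pi_theta(y|x) - log pi_ref(y|x) for one response.
    """
    z = beta * (logratio_chosen - logratio_rejected)
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return weight * (max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


# Example: three sampled outputs scored on the three reward components.
rewards = [structured_reward(1, 1, 0.5), structured_reward(1, 0, 0.2),
           structured_reward(0, 0, 0.0)]
print(group_advantages(rewards))
```

A modality-aware objective in the OmniDPO style would sum several such `weighted_dpo_loss` terms, one per input variant, while margin re-weighting would set `weight` per pair.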
6. Evaluation Protocols and Applications
Omni-Preference datasets are designed for both intrinsic and extrinsic evaluation of reward models and alignment strategies:
- Intrinsic metrics: agreement with teacher/rubric verdicts, reward discrimination accuracy on held-out pairs, justification completeness.
- Downstream evaluations: with Omni-RRM, state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantial gains (+17.7% absolute) over vision-centric baselines on image tasks (Kong et al., 31 Jan 2026).
- Alignment generalization: evidence that joint multimodal training improves transfer to single-modality and text benchmarks.
- Preference transfer: fine-tuned RMs using Omni-Preference data outperform SFT and vanilla DPO on a wide range of tasks, including question answering, summarization, instruction following, and generation (documented for UltraMix and Omni-RewardData). Scalability is demonstrated by +2.49 pp accuracy gain with UltraMix at 30% reduced pair count (Djuhera et al., 14 Nov 2025).
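Reward discrimination accuracy on held-out pairs, the main intrinsic metric above, reduces to a simple count. In this sketch `pairwise_accuracy`, the `score_fn` callable, and the field names `context`, `response_A`, and `response_B` are hypothetical; `score_fn` stands in for any reward model that maps a context and a response to a scalar.

```python
def pairwise_accuracy(examples, score_fn):
    """Fraction of held-out pairs where the reward model prefers the labeled winner."""
    correct = 0
    for ex in examples:
        s_a = score_fn(ex["context"], ex["response_A"])
        s_b = score_fn(ex["context"], ex["response_B"])
        # A tie is counted as an error because no preference is expressed.
        predicted = "A" if s_a > s_b else "B" if s_b > s_a else None
        correct += predicted == ex["better"]
    return correct / len(examples)


# Toy reward model for demonstration only: longer answers score higher.
def toy_score(context, response):
    return len(response)


held_out = [
    {"context": "q1", "response_A": "a detailed answer", "response_B": "short", "better": "A"},
    {"context": "q2", "response_A": "ok", "response_B": "a longer reply", "better": "B"},
    {"context": "q3", "response_A": "padding padding", "response_B": "right", "better": "B"},
]
print(pairwise_accuracy(held_out, toy_score))
```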
Applications include model selection via best-of-N reranking, alignment data distillation, and empirical robustness against hallucination or multimodal grounding errors (OmniDPO, VideoDPO).
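Reranking with a reward model amounts to scoring each candidate and keeping the highest-scoring one. The sketch below assumes a hypothetical `score_fn(context, response)` callable returning a scalar; it is not tied to any specific reward model's API.

```python
def best_of_n(context, candidates, score_fn):
    """Return the candidate response that the reward model scores highest."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda response: score_fn(context, response))


# Toy scorer for demonstration only: counts words shared with the context.
def overlap_score(context, response):
    return len(set(context.lower().split()) & set(response.lower().split()))


context = "describe the dog in the video"
candidates = ["A cat sleeps.", "The dog runs in the video.", "Nothing happens."]
print(best_of_n(context, candidates, overlap_score))
```

Because `max` returns the first maximal element, ties are resolved in favor of the earlier candidate.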
7. Significance and Future Directions
The emergence of Omni-Preference datasets directly addresses several bottlenecks in scalable, robust alignment for multimodal generative models:
- Automated, rubric-grounded annotation enables consistent, large-scale preference data generation without human labor, providing richer training signals than scalar scores or binary labels.
- Multi-dimensional feedback (structured rubric, textual justifications) supports error diagnosis, interpretability, and nuanced reward shaping, critical for complex modalities such as video and audio.
- Data-centric preference optimization (exemplified in UltraMix and Omni-RewardData) demonstrates that judicious data construction and filtering can yield substantial empirical improvements, even with reduced dataset size.
- Plausible implication: The convergent adoption of rubric-based automation, margin-aware selection, and explicit multi-modality is likely to become foundational in high-fidelity reward modeling for future multimodal agents.
Omni-Preference datasets are expected to catalyze advances in scalable alignment algorithms, comprehensive evaluation protocols, and the fundamental study of cross-modal alignment phenomena. Public releases (see Kong et al., 31 Jan 2026; Chen et al., 31 Aug 2025; Liu et al., 2024) facilitate rigorous, reproducible research and enable wide benchmarking of reward models and alignment frameworks across the academic community.