CaRe-DPO: Extended DPO Methods

Updated 3 July 2026

The paper introduces CaRe-DPO variants that generalize standard DPO by integrating domain-specific constraints for diverse optimization and alignment tasks.
It leverages techniques like ADD encoding and project-join trees to efficiently solve hybrid constraint reasoning and enhance multimedia retrieval accuracy.
Empirical evaluations demonstrate state-of-the-art performance, with faster runtimes and effective counterfactual alignment in large language models.

CaRe-DPO denotes a family of Direct Preference Optimization (DPO)–inspired algorithms and frameworks developed for objectives where standard DPO must be generalized, adapted, or extended. While the term "CaRe-DPO" appears in several unrelated lines of research—including dynamic-programming solvers for hybrid constraint reasoning and preference-based alignment of generative models—it consistently refers to the integration of DPO principles with domain-specific supervisory signals or constraints. Notable application domains include combinatorial optimization with pseudo-Boolean and cardinality constraints, text-video retrieval with paired caption optimization, and preference alignment in LLMs via counterfactual prompting.

1. DPO and Its Generalizations

Direct Preference Optimization (DPO) is a framework for aligning generative models using pairwise preference data, typically phrased as maximizing the log-probability that a preferred sample is scored higher than a less-desirable alternative. In the general case, the DPO loss for parameterized policy $\pi_\theta$ with reference policy $\pi_{\text{ref}}$ and scaling parameter $\beta$ is

$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\hat r_\theta(x, y_w) - \hat r_\theta(x, y_l)\right)\right]$

where $\hat r_\theta(x, y) = \beta\log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ is the implicit reward, $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$, and $\sigma$ is the sigmoid function.
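As a minimal sketch of the loss above, the following computes the batch-averaged DPO loss from precomputed sequence log-probabilities. The dictionary keys and the default `beta=0.1` are illustrative assumptions, not values taken from any of the cited papers.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(batch, beta=0.1):
    """Mean DPO loss over a batch of preference examples.

    Each example holds sequence log-probabilities of the preferred (w) and
    dispreferred (l) responses under the policy and the reference model.
    """
    total = 0.0
    for ex in batch:
        # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
        r_w = beta * (ex["policy_logp_w"] - ex["ref_logp_w"])
        r_l = beta * (ex["policy_logp_l"] - ex["ref_logp_l"])
        total += -log_sigmoid(r_w - r_l)
    return total / len(batch)


# Hypothetical log-probabilities, for illustration only.
batch = [
    {"policy_logp_w": -12.0, "ref_logp_w": -13.0,
     "policy_logp_l": -15.0, "ref_logp_l": -14.0},
    {"policy_logp_w": -20.0, "ref_logp_w": -20.0,
     "policy_logp_l": -20.0, "ref_logp_l": -20.0},
]
print(dpo_loss(batch, beta=0.1))
```

When the policy equals the reference model, both implicit rewards are zero and each example contributes $\log 2$; the loss falls below that value as the policy shifts probability toward preferred responses.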

Multiple CaRe-DPO variants extend DPO to:

Decouple and adapt the preference generation procedure (e.g., via counterfactual manipulations or retrieval-based scoring)
Enable fine-grained or groupwise preference supervision
Efficiently represent and solve hybrid discrete optimization problems beyond SAT/MaxSAT

2. CaRe-DPO for Hybrid Constraint Reasoning

One major application of CaRe-DPO is its use in extending dynamic-programming optimization to cardinality and general pseudo-Boolean (PB) constraints. In this setting, CaRe-DPO refers to DPO (Dynamic-Programming Optimization) equipped with native support for cardinality and PB constraints through algebraic decision diagrams (ADDs) (Phan et al., 2022).

Key elements:

Weighted Literal CNF Encodings: Casts the most probable explanation (MPE) problem as maximizing $f(\tau) = \prod_c [c(\tau)] \cdot \prod_x W_x(\tau(x))$, where $[c(\tau)]$ indicates satisfaction of clause $c$ under assignment $\tau$ and $W_x$ is the literal weight function of variable $x$.
ADD-based Representation: Each PB, cardinality, XOR, or CNF constraint is encoded directly as an ADD, drastically reducing representation size for threshold constraints relative to an explicit $O(2^n)$ enumeration.
Project-Join Tree Execution: A project-join tree guides the decomposition and dynamic programming, with nodes corresponding to constraint clusters and variable projection steps.
Operations: At each tree node, join (multiplication of ADDs) and max-projection (variable elimination) are performed.
Complexity and Scalability: For instances with native PB constraints and low tree-width, performance is exponentially better than CNF-only solvers due to compact ADD encoding.
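The join and max-projection steps listed above can be sketched on a toy instance. This sketch stores each function as an explicit table keyed by 0/1 assignments rather than as an ADD, so it illustrates the algebra only and none of the compactness. The helper names, the constraints, and the literal weights are all hypothetical.

```python
from itertools import product


def make_factor(variables, fn):
    """Tabulate fn over all 0/1 assignments to `variables`."""
    variables = tuple(variables)
    table = {bits: fn(dict(zip(variables, bits)))
             for bits in product((0, 1), repeat=len(variables))}
    return variables, table


def join(f, g):
    """Pointwise product of two factors (the join step)."""
    (fv, ft), (gv, gt) = f, g
    variables = fv + tuple(v for v in gv if v not in fv)

    def value(a):
        return ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]

    return make_factor(variables, value)


def max_project(f, var):
    """Eliminate `var` by maximizing over its two values."""
    fv, ft = f
    rest = tuple(v for v in fv if v != var)

    def value(a):
        return max(ft[tuple({**a, var: b}[v] for v in fv)] for b in (0, 1))

    return make_factor(rest, value)


# Hypothetical toy instance: one clause, one XOR, and literal weights.
clause = make_factor(["x", "y"], lambda a: float(a["x"] or a["y"]))
xor = make_factor(["y", "z"], lambda a: float(a["y"] ^ a["z"]))
weights = {"x": (0.7, 0.3), "y": (0.4, 0.6), "z": (0.2, 0.8)}  # (W(0), W(1))
w = {v: make_factor([v], lambda a, v=v: weights[v][a[v]]) for v in weights}

# Tree-guided evaluation: join the factors that mention a variable, then
# project that variable out before moving toward the root.
fx = max_project(join(clause, w["x"]), "x")   # depends on y only
fz = max_project(join(xor, w["z"]), "z")      # depends on y only
root = max_project(join(join(fx, fz), w["y"]), "y")
best = root[1][()]

# Brute-force check over all eight assignments.
brute = max(
    float(x or y) * float(y ^ z) * weights["x"][x] * weights["y"][y] * weights["z"][z]
    for x, y, z in product((0, 1), repeat=3)
)
print(best, brute)
```

Projecting a variable out as soon as every factor mentioning it has been joined keeps intermediate tables small, which is the role the project-join tree plays in the approach described above.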

Empirically, CaRe-DPO solves large hybrid instances—such as random chain XOR-CNF formulas—with orders-of-magnitude faster runtimes than MaxHS, UWrMaxSat, or GaussMaxHS, especially when constraints are natively handled via ADDs (Phan et al., 2022).
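To illustrate why native handling of cardinality constraints can be compact, the sketch below builds a layered decision diagram for $x_1 + \dots + x_n \ge k$ whose nodes only track the running count, capped at $k$. This is a simplified, not fully reduced diagram written for illustration; it is not the ADD construction of Phan et al., and the function names are hypothetical.

```python
from itertools import product


def cardinality_diagram(n, k):
    """Layered decision diagram for x_1 + ... + x_n >= k.

    Internal nodes are (level, count) pairs with count capped at k, so each
    level holds at most k + 1 nodes. Returns node -> (low_child, high_child).
    """
    nodes = {}
    frontier = {(0, 0)}
    for level in range(n):
        next_frontier = set()
        for (_, count) in frontier:
            low = (level + 1, count)                 # x_{level+1} = 0
            high = (level + 1, min(count + 1, k))    # x_{level+1} = 1
            nodes[(level, count)] = (low, high)
            next_frontier.update((low, high))
        frontier = next_frontier
    return nodes


def evaluate(nodes, k, bits):
    """Follow one path through the diagram and return 1 if the constraint holds."""
    node = (0, 0)
    for b in bits:
        node = nodes[node][b]
    return 1 if node[1] >= k else 0


n, k = 20, 3
diagram = cardinality_diagram(n, k)
print(len(diagram), "internal nodes versus", 2 ** n, "truth-table rows")
```

The node count is bounded by $n(k+1)$, whereas an explicit truth table has $2^n$ rows.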

3. CaRe-DPO for Text-Video Retrieval

In the context of text-video retrieval, CaRe-DPO refers to a framework for jointly optimizing auxiliary caption generation and multi-modal retrieval, directly connecting captioning to relevance supervision (Lee et al., 20 Sep 2025).

Main components:

Auxiliary Caption Generator: Starts from a vision-language LLM (e.g., LLaVA-OneVision-7B) and generates multiple candidate captions per video.
Retrieval Model: An MLLM (e.g., VideoChat-Flash-7B) with joint attention over video, auxiliary caption, and user query. Role embeddings distinguish caption tokens from query tokens.
Dual-Group DPO (DG-DPO): Trains the captioner on retrieval scores provided by the retriever rather than on standard language metrics. DG-DPO forms preference pairs both within the same video (local) and across videos (global), so that more relevant captions are preferred even across video boundaries.
Training Workflow: Each candidate caption is scored by the retrieval model under a masked-video setup to obtain a retrieval relevance score; top-versus-bottom captions and cross-instance pairs then define the preference data for the DPO loss.
DG-DPO Objective: A weighted combination of the DPO loss over within-video (local) pairs and the DPO loss over inter-video (global) pairs, with a weighting coefficient balancing the two groups.
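The local and global pairing described in the components above might be sketched as follows. The exact pairing rule is not recoverable from this text, so the rule used here is an assumption: the top caption versus the bottom caption within a video for local pairs, and the top caption of one video versus the bottom caption of another for global pairs. The function name and the score layout are likewise hypothetical.

```python
def build_preference_pairs(scores):
    """Build (local, global) preference pairs from retrieval scores.

    scores: {video_id: {caption: retrieval_score}}, with at least one caption
    per video. Each pair is (preferred, rejected), where both items are
    (video_id, caption) tuples.
    """
    ranked = {vid: sorted(caps, key=caps.get, reverse=True)
              for vid, caps in scores.items()}

    # Local group: best versus worst caption of the same video.
    local_pairs = []
    for vid, order in ranked.items():
        if scores[vid][order[0]] > scores[vid][order[-1]]:
            local_pairs.append(((vid, order[0]), (vid, order[-1])))

    # Global group (assumed rule): best caption of one video versus the
    # worst caption of a different video, kept only if it scores higher.
    global_pairs = []
    for vi in scores:
        for vj in scores:
            if vi == vj:
                continue
            top_i, bottom_j = ranked[vi][0], ranked[vj][-1]
            if scores[vi][top_i] > scores[vj][bottom_j]:
                global_pairs.append(((vi, top_i), (vj, bottom_j)))
    return local_pairs, global_pairs


# Hypothetical retrieval scores for two videos.
example_scores = {
    "v1": {"a": 0.9, "b": 0.2},
    "v2": {"c": 0.6, "d": 0.5},
}
local_pairs, global_pairs = build_preference_pairs(example_scores)
print(local_pairs)
print(global_pairs)
```

Each list would then feed its own DPO loss term, with the balancing coefficient weighting the two terms against each other.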

Experimental results show CaRe-DPO achieves state-of-the-art text-to-video (T2V) and video-to-text (V2T) retrieval accuracy (e.g., Recall@1=85.1/82.5 on DiDeMo), superior caption diversity, and stronger model confidence in visually similar (hard) cases (Lee et al., 20 Sep 2025).
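For reference, Recall@K in both retrieval directions can be computed from a text-by-video similarity matrix as sketched below. The sketch assumes one ground-truth video per text query, located on the diagonal; the similarity values are made up for illustration and unrelated to the reported DiDeMo results.

```python
def recall_at_k(sim, k=1):
    """Percentage of queries whose ground-truth item is in the top k.

    sim[i][j] is the similarity of query i to candidate j; the ground-truth
    candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top
    return 100.0 * hits / len(sim)


def transpose(sim):
    """Swap the roles of queries and candidates."""
    return [list(col) for col in zip(*sim)]


# Hypothetical text-by-video similarities (rows: text queries).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.4, 0.8],
    [0.2, 0.1, 0.7],
]
t2v_r1 = recall_at_k(sim, k=1)             # text-to-video
v2t_r1 = recall_at_k(transpose(sim), k=1)  # video-to-text
print(t2v_r1, v2t_r1)
```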

4. CaRe-DPO for Counterfactual Alignment of LLMs

The term CaRe-DPO (“Counterfactual & Rejection Direct Preference Optimization”) also appears in LLM alignment, where it enables low-resource, preference-driven fine-tuning of LLMs without human labelers (Butcher, 2024).

Highlights:

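Self-Supervised Preference Pairing: Generates synthetic preference pairs by appending desired or undesired style markers to prompts; the reference LLM $\pi_{\text{ref}}$ generates positive ($y^{+}$), negative ($y^{-}$), and baseline ($y^{0}$) responses.
Loss construction: Three DPO variants can be linearly combined: Encouragement (positive over baseline), Discouragement (baseline over negative), and Contrastive (positive over negative). In generic notation, writing $\ell(y_w, y_l)$ for the DPO loss on a preferred/dispreferred pair, the combination takes the form

$\mathcal{L} = \lambda_{\mathrm{enc}}\,\ell(y^{+}, y^{0}) + \lambda_{\mathrm{dis}}\,\ell(y^{0}, y^{-}) + \lambda_{\mathrm{con}}\,\ell(y^{+}, y^{-})$

with mixing weights $\lambda_{\mathrm{enc}}$, $\lambda_{\mathrm{dis}}$, and $\lambda_{\mathrm{con}}$.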

Algorithmic loop: The original paper gives stepwise pseudocode. Training shifts the model’s default behavior toward or away from particular response styles, depending on the variant and style instructions chosen.
Empirical findings: On metrics such as named-entity redaction, bias mitigation (BBQ), and hallucination reduction (Vectara FCR), CaRe-DPO outperforms instruction-tuned and SFT baselines while preserving reasoning performance (Butcher, 2024).
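The pairing scheme in the highlights above can be sketched as follows. Here `generate` is a stand-in for the reference LLM, and the style instructions, function names, and weights are hypothetical. Storing the plain prompt (without the style marker) in each training triple is an assumption, consistent with the stated goal of shifting default behavior.

```python
def build_counterfactual_pairs(prompt, generate, desired, undesired):
    """Return (prompt, chosen, rejected) triples for the three variants.

    `generate` is any callable mapping a prompt string to a response string.
    Style markers are appended only at generation time; the stored prompt is
    the plain one (an assumption of this sketch).
    """
    y_pos = generate(f"{prompt}\n{desired}")     # styled toward the goal
    y_neg = generate(f"{prompt}\n{undesired}")   # styled away from the goal
    y_base = generate(prompt)                    # unmodified prompt
    return {
        "encouragement": (prompt, y_pos, y_base),
        "discouragement": (prompt, y_base, y_neg),
        "contrastive": (prompt, y_pos, y_neg),
    }


def combined_loss(losses, weights):
    """Linear combination of per-variant DPO losses."""
    return sum(weights[name] * losses[name] for name in weights)


def fake_generate(text):
    """Toy stand-in for an LLM call; echoes the last line of the prompt."""
    return "response to: " + text.splitlines()[-1]


pairs = build_counterfactual_pairs(
    "Summarize the report.",
    fake_generate,
    desired="Do not mention any personal names.",
    undesired="Mention every personal name you can.",
)
for name, (p, chosen, rejected) in pairs.items():
    print(name, "|", chosen, ">", rejected)
```

Setting a weight to zero in `combined_loss` recovers any single variant or any two-variant mixture.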

A plausible implication is that the synthetic, self-supervised pairing enabled by counterfactual prompting allows flexible, targeted alignment for a wide range of behavioral axes, even those that are infeasible or impractical to annotate at scale.

5. Comparative Evaluation and Empirical Performance

Across application domains, CaRe-DPO variants show substantial gains over both traditional and prior DPO-based baselines.

Domain	Baseline	CaRe-DPO Approach	Main Gains
Hybrid Constraints	MaxHS, UWrMaxSat	ADD-based PGJT+DPO	Up to PAR-2=1.0 vs >6 for competitors (Phan et al., 2022)
Text-Video Retrieval	MM-Embed, LamRA	DG-DPO + role embedding	+1–4 R@1, improved fine-grained retrieval (Lee et al., 20 Sep 2025)
LLM Alignment	SFT, base LLM	Counterfactual DPO	Less bias, fewer hallucinations, style control (Butcher, 2024)

This suggests that distributed preference supervision—whether through retrieval relevance, groupwise ranking, or counterfactual cues—offers notable performance and controllability improvements where classic supervised or pointwise objectives are insufficient.
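The PAR-2 figure in the hybrid-constraints row is a penalized average runtime: a standard solver metric that averages runtimes while charging each unsolved instance twice the time limit. A minimal sketch is given below; the table does not state how its values were normalized, so the runtimes and the timeout here are made up.

```python
def par2(runtimes, timeout):
    """Penalized average runtime with penalty factor 2.

    runtimes: seconds per instance, or None if the instance was not solved.
    Unsolved instances (None or above the timeout) count as 2 * timeout.
    """
    penalized = [
        t if t is not None and t <= timeout else 2 * timeout
        for t in runtimes
    ]
    return sum(penalized) / len(penalized)


# Hypothetical runtimes for three instances with a 10-second limit.
score = par2([1.0, None, 3.0], timeout=10.0)
print(score)
```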

6. Implementation and Practical Considerations

Hardware: Requirements vary widely by variant (e.g., 8× NVIDIA H100s for video retrieval, a single RTX 3090 for LLM alignment).
Optimization: LoRA adapters are used for parameter-efficient fine-tuning in caption/retriever LLMs and in LLM alignment scenarios.
Open Source: The text-video CaRe-DPO pipeline is available at https://github.com/mlvlab/CaReDPO (Lee et al., 20 Sep 2025).
Limitations: All CaRe-DPO variants inherit the inductive biases and limitations of their underlying models (e.g., data domain specificity, LLM bias). Joint optimization involves elevated computational cost.
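As background on the LoRA adapters mentioned above, the sketch below shows the standard low-rank update $W + \frac{\alpha}{r} BA$ on plain Python lists, with $B$ initialized to zero so that training starts from the frozen weights. The dimensions, rank, and `alpha` are illustrative assumptions, not settings reported by the cited papers.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A for W: d x k, B: d x r, A: r x k."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


d, k, r = 64, 64, 4
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]      # frozen
a = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # trainable
b = [[0.0] * r for _ in range(d)]                                # trainable, zero init

full_params = d * k          # parameters updated by full fine-tuning
lora_params = r * (d + k)    # parameters updated by the adapter
print(full_params, lora_params)
```

Because only $A$ and $B$ are trained, the number of updated parameters per adapted matrix drops from $dk$ to $r(d + k)$.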

7. Future Directions

Potential extensions include:

Reference-free or token-level DPO for finer granularity in preference learning (Lee et al., 20 Sep 2025)
Expansion to new constraint classes or domains beyond pseudo-Boolean reasoning (Phan et al., 2022)
Systematic study of robustness and adaptation for unseen styles or out-of-domain prompts in LLM alignment (Butcher, 2024)

A plausible implication is that further refinement of groupwise and counterfactual preference mechanisms will promote greater alignment, interpretability, and efficiency in large-scale learning and combinatorial optimization systems.