REDIPO: Diversity Recovery for LLMs

Updated 4 July 2026
  • REDIPO is a DPO pipeline designed to recover diverse, valid answer modes for LLMs narrowed by post-training.
  • It employs base model rewriting, filtering, and quality-bounded diversity scoring to ensure safe, high-quality responses.
  • Empirical evaluations demonstrate that REDIPO enhances output diversity significantly while maintaining or improving model alignment and safety.

REDIPO, presented in the paper as ReDiPO (Recovering Distributional Diversity with Preference Optimization), is an offline DPO data-construction pipeline for post-trained LLMs. Its purpose is to recover distinct valid answer modes that are often suppressed by instruction tuning and alignment, while preserving the alignment benefits of the instruct model. The method is motivated by the observation that post-training commonly improves helpfulness and safety but also narrows the model’s output distribution toward a small set of canonical responses, especially for open-ended prompts with many acceptable answers (Samuel et al., 28 May 2026).

1. Definition and problem setting

ReDiPO addresses a specific failure mode of post-trained LLMs: distributional diversity loss. In the formulation given by the paper, base models often retain a broad response distribution, whereas instruct models are better aligned but substantially narrower. For brainstorming, creative writing, subjective questions, and related open-ended instructions, this can yield models that perform well under single-sample alignment metrics yet repeatedly produce near-duplicate outputs when sampled multiple times (Samuel et al., 28 May 2026).

The paper frames this narrowing as a consequence of post-training’s preference for mode-seeking behavior. It states that instruction tuning and alignment tend to favor the most typical, safest, and most reward-friendly response, so many valid but less typical alternatives receive little probability mass. This is described as consistent with mode-seeking behavior under reverse-KL-like training and with typicality bias in preference learning (Samuel et al., 28 May 2026).
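The mode-seeking tendency of a reverse-KL objective can be illustrated with a toy calculation. The sketch below is not taken from the paper: the three distributions are made-up numbers over four hypothetical answer modes, chosen only to show that reverse KL favors a model concentrated on one mode while forward KL favors one that covers both.

```python
import math


def kl(a, b):
    """KL(a || b) for two discrete distributions given as equal-length lists."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


# Hypothetical target with two equally valid answer modes (outcomes 0 and 3).
target = [0.49, 0.01, 0.01, 0.49]
# Candidate that commits almost entirely to a single mode.
narrow = [0.97, 0.01, 0.01, 0.01]
# Candidate that spreads mass evenly and so covers both modes.
broad = [0.25, 0.25, 0.25, 0.25]

# Reverse KL puts the model distribution first: KL(model || target).
reverse = {"narrow": kl(narrow, target), "broad": kl(broad, target)}
# Forward KL puts the target first: KL(target || model).
forward = {"narrow": kl(target, narrow), "broad": kl(target, broad)}

print("reverse KL prefers:", min(reverse, key=reverse.get))
print("forward KL prefers:", min(forward, key=forward.get))
```

Under reverse KL the narrow candidate scores lower because it is barely penalized for ignoring the second mode, which matches the narrowing behavior described above.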

REDIPO therefore does not attempt to optimize diversity in the abstract. Its central principle is more specific: hold quality approximately fixed, then prefer the more diverse response. This makes diversity a preference signal only among candidates that are already similar in instruction-following quality, rather than conflating novelty with degraded alignment or unsafe behavior (Samuel et al., 28 May 2026).

2. Pipeline architecture

The REDIPO pipeline constructs an offline preference dataset and then applies standard DPO training to the instruct model. The paper organizes the method into four main stages: generation, filtering, diversity scoring/pairing, and DPO training (Samuel et al., 28 May 2026).

At the generation stage, for each prompt $p \in \mathcal{P}$, the method samples $k$ responses from both the base model $\mathcal{M_B}$ and the instruct model $\mathcal{M_I}$. The main experiments use $k=16$. The crucial asymmetry is that base-model outputs are not used directly. Instead, each base-model response is passed to the instruct model and rewritten in instruct style while preserving the underlying content mode. The rewrite prompt explicitly asks the model to preserve topic, subject, answer, stance, tone, and facts, and to avoid introducing unrelated new information. The paper reports a small human study indicating that topic preservation is generally maintained, with full or partial preservation for the rewritten base responses in most cases (Samuel et al., 28 May 2026).
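A minimal sketch of the generation stage, under stated assumptions: the samplers and the rewriter are passed in as plain callables (hypothetical stand-ins for real model calls), the name `build_candidates` is illustrative, and `REWRITE_TEMPLATE` only paraphrases the constraints listed above rather than reproducing the paper's actual prompt.

```python
from typing import Callable, Dict, List

# Hypothetical rewrite instruction; it paraphrases the constraints described
# in the text and is NOT the paper's actual prompt.
REWRITE_TEMPLATE = (
    "Rewrite the response below in your own assistant style.\n"
    "Preserve its topic, subject, answer, stance, tone, and facts.\n"
    "Do not introduce unrelated new information.\n\n"
    "Instruction: {prompt}\n"
    "Response: {response}"
)

Sampler = Callable[[str, int], List[str]]  # (prompt, k) -> k sampled responses
Rewriter = Callable[[str], str]            # rewrite request -> rewritten text


def build_candidates(
    prompt: str,
    sample_base: Sampler,
    sample_instruct: Sampler,
    rewrite: Rewriter,
    k: int = 16,
) -> List[Dict[str, str]]:
    """Collect k instruct samples plus k rewritten base samples for one prompt."""
    candidates = []
    # Instruct-model samples are used as they are.
    for text in sample_instruct(prompt, k):
        candidates.append({"text": text, "origin": "instruct"})
    # Base-model samples are never used directly: each one is rewritten by
    # the instruct model, and the result keeps its "base" origin label.
    for raw in sample_base(prompt, k):
        request = REWRITE_TEMPLATE.format(prompt=prompt, response=raw)
        candidates.append({"text": rewrite(request), "origin": "base"})
    return candidates
```

Keeping the origin label on each candidate matters later, because the prompt-level filter needs to know how many surviving responses came from the base model.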

This rewriting step is central to the method’s logic. The base model serves as a reservoir of answer modes that instruction tuning has weakened, while the instruct model supplies stylistic and alignment cleanup. A plausible implication is that REDIPO treats pre-alignment diversity as a latent resource rather than as noise to be discarded.

3. Filtering and preference construction

After generation and rewriting, all candidates are subjected to three filters. First, an LLM-as-judge safety classifier removes unsafe responses. Second, an instruction-following quality filter scores each response with a reward model $R_{\text{IF}}$. For each prompt, the method computes the mean instruction-following reward over the instruct-model responses, denoted $\mu_p^{\mathcal{M_I}}$, and removes responses satisfying

$$R_{\text{IF}}(p, r_i) < (1-\delta)\,\mu_p^{\mathcal{M_I}},$$

with $\delta = 0.15$ in the experiments. Third, prompts are discarded if fewer than 10 total responses remain, or if fewer than 2 remaining responses were originally generated by the base model (Samuel et al., 28 May 2026).
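The three filters can be sketched as a single function over one prompt's candidates. This is an illustration, not the paper's code: the dictionary keys (`origin`, `reward`, `safe`) and the function name are hypothetical, the safety verdict and reward are assumed to be precomputed, and the mean is assumed to be taken over all instruct-model responses before any filtering.

```python
from statistics import mean


def filter_candidates(candidates, delta=0.15, min_total=10, min_base=2):
    """Apply the safety, quality, and prompt-level filters to one prompt.

    candidates: list of dicts with keys
        'origin' -> 'base' or 'instruct'
        'reward' -> instruction-following reward (float)
        'safe'   -> verdict of the safety judge (bool)
    Returns the surviving candidates, or None if the prompt is discarded.
    """
    instruct_rewards = [c["reward"] for c in candidates if c["origin"] == "instruct"]
    if not instruct_rewards:
        return None  # no reference mean can be computed for this prompt

    # Quality floor relative to the instruct model's mean reward.
    # Note: this relative threshold is only meaningful for positive means.
    threshold = (1 - delta) * mean(instruct_rewards)

    # Keep responses that are safe and not below the quality floor.
    kept = [c for c in candidates if c["safe"] and c["reward"] >= threshold]

    # Prompt-level filter: enough responses overall and enough base-origin ones.
    n_base = sum(c["origin"] == "base" for c in kept)
    if len(kept) < min_total or n_base < min_base:
        return None
    return kept
```

With `delta=0.15`, a response survives the quality filter only if its reward is at least 85% of the instruct model's mean reward for that prompt.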

REDIPO then computes a marginal diversity score for each remaining response $r_i \in \mathcal{R}_p$. The score is derived from embeddings of the candidate responses, compared by cosine similarity; the paper uses OpenAI’s text-embedding-3-large as the embedding model. This score is “marginal” because it measures how much a response adds relative to the rest of the candidate set for the same prompt, rather than assigning diversity in isolation (Samuel et al., 28 May 2026).
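One plausible form of a marginal diversity score is one minus the mean cosine similarity between a response's embedding and the embeddings of the other candidates for the same prompt. The sketch below implements that assumed form in plain Python; the exact formula used in the paper may differ, and the embeddings are taken as precomputed vectors rather than fetched from any embedding API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def marginal_diversity(embeddings):
    """Score each response by how dissimilar it is from the other candidates.

    ASSUMED form: 1 - mean cosine similarity to all other responses.
    embeddings: list of vectors, one per response to the same prompt.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two responses to compare")
    scores = []
    for i, e_i in enumerate(embeddings):
        sims = [cosine(e_i, e_j) for j, e_j in enumerate(embeddings) if j != i]
        scores.append(1.0 - sum(sims) / len(sims))
    return scores


# Two near-identical responses and one outlier (toy 2-D "embeddings").
scores = marginal_diversity([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy case the third response receives the highest score because it is orthogonal to both others, while the two duplicates pull each other's scores down.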

Preference pairs are constructed only when the two responses are close in instruction-following quality, that is, when the gap between their $R_{\text{IF}}$ scores stays within a fixed tolerance. Among valid pairs, the response with larger marginal diversity is labeled chosen and the one with smaller marginal diversity is labeled rejected.

Pairs are ranked by the diversity gap between the chosen and rejected responses, each response is capped in how often it may appear, and only a fixed top fraction of the ranked pairs per prompt is retained. The resulting preference dataset is then used for standard DPO training on the instruct model $\mathcal{M_I}$ (Samuel et al., 28 May 2026).
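The pairing logic can be sketched as follows. Everything numeric here is a placeholder: `tol`, `top_frac`, and `max_uses` are hypothetical parameter names with illustrative defaults, the quality gap is assumed to be an absolute reward difference, and applying the per-response cap before the top-fraction cut is one possible ordering among several.

```python
import math
from itertools import combinations


def build_pairs(candidates, tol=0.1, top_frac=0.5, max_uses=3):
    """Build (chosen, rejected) pairs for one prompt.

    candidates: list of dicts with 'text', 'reward', and 'diversity'.
    tol, top_frac, max_uses: ASSUMED values, not the paper's settings.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(candidates), 2):
        # Quality-bounded pairing: skip pairs whose rewards differ too much.
        if abs(a["reward"] - b["reward"]) > tol:
            continue
        if a["diversity"] == b["diversity"]:
            continue  # no diversity signal to learn from
        # The more diverse response is chosen, the less diverse one rejected.
        hi, lo = (i, j) if a["diversity"] > b["diversity"] else (j, i)
        gap = candidates[hi]["diversity"] - candidates[lo]["diversity"]
        pairs.append({"chosen": hi, "rejected": lo, "gap": gap})

    # Rank by diversity gap, largest first.
    pairs.sort(key=lambda p: p["gap"], reverse=True)

    # Greedy cap on how many retained pairs any single response may appear in.
    uses, capped = {}, []
    for p in pairs:
        if uses.get(p["chosen"], 0) >= max_uses or uses.get(p["rejected"], 0) >= max_uses:
            continue
        uses[p["chosen"]] = uses.get(p["chosen"], 0) + 1
        uses[p["rejected"]] = uses.get(p["rejected"], 0) + 1
        capped.append(p)

    # Keep only the top fraction of the ranked, capped pairs.
    keep = math.ceil(top_frac * len(capped))
    return [
        {"chosen": candidates[p["chosen"]]["text"],
         "rejected": candidates[p["rejected"]]["text"],
         "gap": p["gap"]}
        for p in capped[:keep]
    ]
```

Note that a highly diverse response with a much lower reward never enters a pair, which is how the sketch keeps novelty from being rewarded at the expense of instruction-following quality.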

This construction is the method’s defining technical move. REDIPO is not merely “base versus instruct” preference optimization. It is a quality-matched, diversity-discriminative pairing scheme.
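For reference, the standard DPO objective that consumes such pairs can be written for a single pair in a few lines. This sketch takes sequence log-probabilities as plain floats; `beta=0.1` is a commonly used default rather than a value stated in this summary, and the reference model is assumed to be a frozen copy of the instruct checkpoint, as is usual for DPO.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_*     : sequence log-probabilities under the policy being trained
    ref_logp_* : the same quantities under the frozen reference model
    beta       : ASSUMED temperature; not taken from the paper
    """
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the likelihood of the more diverse (chosen) response relative to the rejected one.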

4. Empirical evaluation

The paper evaluates REDIPO on three model families (Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B) using Dolly-15k prompts restricted to open QA, brainstorming, and creative writing. The evaluation suite covers NoveltyBench for distributional diversity (measured by its distinctness metric), MTBench, IFEval, Arena-Hard, and HarmBench direct-category attack success rate (Samuel et al., 28 May 2026).

The headline result is that REDIPO recovers substantial diversity relative to the instruct checkpoints while largely maintaining alignment metrics and improving safety.

| Model | NoveltyBench gain vs. instruct | Alignment/safety pattern |
|---|---|---|
| Qwen3-4B | +134% | MTBench slight drop; HarmBench ASR improves |
| OLMo-3-7B | +33% | MTBench and IFEval improve; HarmBench ASR improves |
| LLaMA-3.1-8B | +44% | Alignment largely maintained; HarmBench ASR improves |

Relative to the instruct checkpoints, the reported NoveltyBench improvements are +134% for Qwen3-4B, +33% for OLMo-3-7B, and +44% for LLaMA-3.1-8B. The paper also reports per-model results on MTBench and IFEval. On Arena-Hard, the reported changes are minimal. On HarmBench direct-category ASR, REDIPO reduces attack success rate on all three models (Samuel et al., 28 May 2026).

The paper’s interpretation is that diversity recovery does not require sacrificing safety, and in these experiments safety improves. This suggests that base-model diversity can be reintroduced without simply reopening unsafe modes, provided the preference data are filtered and paired carefully.

5. Ablations and relation to neighboring methods

The ablations isolate the roles of the major components. According to the paper, marginal-diversity pair selection and base-response rewriting are the main drivers of the diversity gains, whereas filtering and quality-bounded pairing are the main mechanisms that preserve alignment. Removing diversity-based pair selection nearly eliminates the novelty gains, and removing rewriting also reduces novelty. By contrast, removing the instruction-following filter or the quality-bounded pairing retains much of the diversity gain but worsens MTBench, IFEval, and HarmBench (Samuel et al., 28 May 2026).

This division of labor is methodologically important. REDIPO does not rely on a single scalar objective that jointly optimizes diversity and alignment. Instead, it decomposes the problem into two stages: first, recover latent answer modes through base sampling and rewriting; second, constrain the preference signal so that diversity is learned only among candidates with comparable instruction-following reward.

The paper compares REDIPO most directly to DivPO and DDPO. Against DivPO, REDIPO is argued to be more effective because DivPO operates within a candidate pool that already reflects the post-trained model’s narrowed distribution, whereas REDIPO explicitly retrieves missing answer modes from the base model, rewrites them, filters them, and then forms quality-matched diversity pairs. Empirically, the paper reports that DivPO changes diversity by 0%, -6%, and -4% on Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, respectively, whereas REDIPO yields the gains noted above (Samuel et al., 28 May 2026).

Against DDPO, the paper reports that DDPO attains higher NoveltyBench distinctness on Qwen3-4B but at a steep cost to MTBench, IFEval, Arena-Hard, and HarmBench. REDIPO is therefore positioned not as maximizing diversity at any cost, but as improving the quality-safety-diversity balance (Samuel et al., 28 May 2026).

6. Terminology, scope, and ambiguity

Within the literature reflected here, REDIPO most explicitly denotes the post-training diversity-recovery method of "Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs" (Samuel et al., 28 May 2026). However, the label is potentially ambiguous.

In one unrelated imaging paper, the acronym is not defined in the abstract but appears to refer to a RED-based, physics-infused optimization approach for MR elastography reconstruction that combines regularization by denoising with a statistical physical model (Mohammadi et al., 2021). In another case, “REDIPO” is used in a secondary description of ReDI, a reasoning-enhanced query-understanding framework for decomposition and interpretation in retrieval, although the paper itself names the method ReDI rather than REDIPO (Zhong et al., 8 Sep 2025). A further nearby but distinct term is RePO, ReLU-based Preference Optimization, which is a simplified offline alignment method for LLMs and should not be conflated with REDIPO’s data-construction recipe (Wu et al., 10 Mar 2025).

These naming collisions matter because the underlying technical objects are unrelated. REDIPO in the LLM post-training sense is a preference-data construction procedure for recovering output-space diversity. It is not a denoising prior, a diffusion model, a reproducibility platform, or a retrieval-time query decomposition system. The common thread across these similarly named methods is only acronymic proximity, not methodological continuity.

Taken in its precise sense, REDIPO is best understood as a targeted intervention in the post-training pipeline: it uses the base model as a source of suppressed answer modes, rewrites those modes into instruct style, filters for safety and instruction-following quality, and converts marginal diversity into a DPO-compatible preference signal. The broader implication suggested by the paper is that post-trained LLMs need not trade alignment for diversity if the preference data are constructed to isolate diversity among responses that are already comparably aligned (Samuel et al., 28 May 2026).
