Reflective Personalization Optimization (RPO)

Updated 14 November 2025
  • RPO is a two-stage framework that decouples generic content generation from user-specific reflective rewriting to enhance personalization.
  • It integrates supervised fine-tuning and reinforcement learning to train a reflection module that adapts LLM outputs based on user history.
  • Empirical benchmarks and ablation studies demonstrate that RPO improves content fidelity and personalization across various black-box LLM backbones.

Reflective Personalization Optimization (RPO) is a principled framework for achieving fine-grained, post-hoc personalization of LLMs, particularly when the base models are black-box and cannot be modified directly. In contrast to traditional context-injection or prompt-engineering approaches, RPO reframes personalization as an explicit, two-stage optimization process that decouples content generation from user alignment. This architectural separation resolves intrinsic trade-offs in one-step generation and enables the construction of model-agnostic, efficient personalization layers that can be flexibly deployed across LLM backbones (Hao et al., 7 Nov 2025).

1. Formal Structuring of RPO: Two-Stage Rewriting Paradigm

In RPO, let $q$ denote an arbitrary user query, $U$ the user’s full interaction history, $M_{\text{base}}$ a black-box LLM, and $M_{\text{reflect}}^\theta$ a learnable reflection module parameterized by $\theta$. Personalization is decomposed into:

  • Stage 1 (Generic Generation): The base LLM produces a high-quality but unpersonalized generic output $g = M_{\text{base}}(q)$.
  • Stage 2 (Reflective Rewriting): The reflection module consumes $g$ together with a task-relevant user history subset $P_{\text{rel}} \subset U$ (the top-$k$ retrieved entries), outputting a personalized response $r$ via sampling or argmax from the conditional policy $\pi_\theta(r \mid g, P_{\text{rel}})$.

The inference pipeline is:

  1. $g = M_{\text{base}}(q)$
  2. $P_{\text{rel}} = \text{Retr}(q \oplus g, U)$
  3. $r \leftarrow M_{\text{reflect}}^\theta(q, g, P_{\text{rel}})$

The objective is to learn $\theta$ such that $r$ simultaneously preserves the factual integrity of $g$ and manifests the user-specific preference patterns contained in $P_{\text{rel}}$.
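
The following is a minimal Python sketch of this inference pipeline. The callables `base_llm`, `retrieve`, and `reflect` are hypothetical stand-ins for the black-box backbone, the top-$k$ history retriever, and the trained reflection module; they are illustrative names, not interfaces defined in the paper.

```python
# Minimal sketch of the two-stage RPO inference pipeline.
# `base_llm`, `retrieve`, and `reflect` are hypothetical stand-ins for the
# black-box backbone, the top-k history retriever, and the trained
# reflection module M_reflect^theta described in the text.

from typing import Callable, List

def rpo_inference(
    query: str,
    user_history: List[str],
    base_llm: Callable[[str], str],                         # black-box LLM: prompt -> text
    retrieve: Callable[[str, List[str], int], List[str]],   # (key, history, k) -> top-k entries
    reflect: Callable[[str, str, List[str]], str],          # (q, g, P_rel) -> personalized r
    k: int = 4,
) -> str:
    # Stage 1: generic, unpersonalized generation g = M_base(q)
    g = base_llm(query)

    # Retrieval keyed on the concatenation q ⊕ g
    p_rel = retrieve(query + "\n" + g, user_history, k)

    # Stage 2: reflective rewriting r = M_reflect^theta(q, g, P_rel)
    return reflect(query, g, p_rel)
```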

2. Optimization Process: Supervised and Reinforcement Learning

The reflection module is trained in two consecutive phases:

2.1 Supervised Fine-Tuning (SFT)

SFT uses a dataset $\mathcal{D}_{\text{SFT}}$ of structured rewriting trajectories, i.e., tuples $(q, g, p^*, A_{\text{final}}^{\text{GT}})$, where $p^*$ is a top-retrieved user profile element and $A_{\text{final}}^{\text{GT}}$ is the reference personalized answer. A standard token-level cross-entropy loss is applied:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} \log P_\theta\!\left(A_{\text{final},t}^{\text{GT}} \mid q, g, p^*, A_{\text{final},<t}^{\text{GT}}\right)$$

This instills the core rewriting strategies that align $g$ with the user context.
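
As a concrete illustration, below is a minimal PyTorch sketch of this objective, assuming a HuggingFace-style causal LM whose forward call returns an object with a `.logits` attribute; the prompt portion $(q, g, p^*)$ is masked out so that only the reference answer tokens contribute to the loss. Names and shapes are illustrative, not taken from the paper’s implementation.

```python
# Sketch of the SFT objective: teacher-forced cross-entropy on the target
# answer tokens only, with the prompt (q, g, p*) masked out of the loss.
# `model` is assumed to be a HuggingFace-style causal LM returning an output
# object with `.logits` of shape (batch, seq_len, vocab).

import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions excluded from the loss (prompt tokens)

def sft_loss(model, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """input_ids: (1, T) token ids for the concatenation [q ; g ; p* ; A_final^GT]."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = IGNORE_INDEX       # supervise only the A_final^GT tokens

    logits = model(input_ids).logits            # (1, T, vocab)

    # Next-token prediction: logits at position t predict token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```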

2.2 Reinforcement Learning (RL) Polishing

Further optimization is performed via RL on a task-specific reward $R(r, U)$. Token-by-token generation is modeled as an MDP, and a KL-regularized REINFORCE variant is applied, constraining deviations from the SFT policy via the terminal reward

$$r_T = R(q, g, P_{\text{rel}}, r) - \beta \cdot \text{KL}\!\left[ \pi_\theta(\cdot \mid S_T) \,\|\, \pi_{\text{SFT}}(\cdot \mid S_T) \right]$$

Gradient estimation uses the REINFORCE++ baseline. A progressive multi-context curriculum increases $k = |P_{\text{rel}}|$ from 2 to 6 over the training epochs, compelling the model to handle both sparse and dense user histories.
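
A minimal sketch of the KL-regularized terminal reward and the curriculum schedule is given below. It assumes per-token log-probabilities are available from the current policy and a frozen SFT reference; the $\beta$ value and the linear shape of the schedule are illustrative assumptions, not values reported in the paper.

```python
# Sketch of the KL-regularized terminal reward and the progressive
# multi-context curriculum. `task_reward` stands for the automatic metric R
# (e.g. ROUGE or negative MAE); beta and the linear schedule are assumptions.

import torch

def kl_regularized_reward(
    task_reward: float,
    logprobs_policy: torch.Tensor,   # per-token log pi_theta(a_t | s_t), shape (T,)
    logprobs_sft: torch.Tensor,      # per-token log pi_SFT(a_t | s_t), shape (T,)
    beta: float = 0.05,              # KL penalty coefficient (illustrative value)
) -> float:
    # Monte-Carlo estimate of KL[pi_theta || pi_SFT] along the sampled trajectory.
    kl_estimate = (logprobs_policy - logprobs_sft).sum().item()
    return task_reward - beta * kl_estimate

def curriculum_k(epoch: int, total_epochs: int, k_min: int = 2, k_max: int = 6) -> int:
    # Grow the number of retrieved profile entries k = |P_rel| from k_min to
    # k_max over training, as in the progressive-shot curriculum (linear here).
    frac = epoch / max(total_epochs - 1, 1)
    return k_min + round(frac * (k_max - k_min))
```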

3. Theoretical Motivation: Alignment–Quality Decoupling

Direct context-injection approaches require the LLM to simultaneously resolve content selection and user style adaptation. This conflation degrades either factual accuracy or personalization granularity. RPO enforces a strict division of labor: $M_{\text{base}}$ optimizes for content fidelity, while $M_{\text{reflect}}^\theta$ focuses on precise response-level personal adaptation. Ablation studies demonstrate that both training components (SFT and RL) are necessary; individually, neither subsystem achieves state-of-the-art performance on both content and personalization axes.

4. Empirical Validation: Benchmarks and Ablations

Representative results on the LaMP benchmark, stratified by user/time splits and multiple personalization forms, show that RPO outperforms zero-shot, in-context learning (ICL), retrieval-augmented generation (RAG), and HYDRA baselines:

| Method | LaMP-2 (Acc / F1) | LaMP-3 (MAE / RMSE) | LaMP-5 (R-1 / R-L) | LaMP-7 (R-1 / R-L) |
|---|---|---|---|---|
| Zero-shot | 0.214 / 0.285 | 0.361 / 0.703 | 0.444 / 0.394 | 0.445 / 0.396 |
| ICL | 0.279 / 0.336 | 0.333 / 0.662 | 0.452 / 0.395 | 0.453 / 0.397 |
| RAG | 0.282 / 0.340 | 0.328 / 0.655 | 0.457 / 0.399 | 0.459 / 0.402 |
| HYDRA | 0.291 / 0.351 | 0.318 / 0.638 | 0.473 / 0.412 | 0.471 / 0.411 |
| RPO | 0.355 / 0.400 | 0.252 / 0.564 | 0.498 / 0.425 | 0.499 / 0.427 |

The ablation study (Table 2) confirms that both SFT and RL are essential: removing either component impairs both alignment and content preservation.

The dynamic progressive-shot curriculum (see Figure 1 in (Hao et al., 7 Nov 2025)) outperforms all fixed-$k$ regimes, attesting to the importance of training the reflection module across variable profile breadths.

5. Model-Agnosticism, Generalizability, Limitations

RPO’s reflection module treats $M_{\text{base}}$ as an immutable, API-accessed black box. Model-agnostic trials swapping between DeepSeek-V3, Qwen3, and GPT-4o-mini yield indistinguishable personalization performance, demonstrating that the personalized rewriting layer generalizes across LLM backbones without retraining.
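
To illustrate this plug-and-play property, the sketch below swaps the Stage-1 generation call across backbones behind a common interface, assuming an OpenAI-compatible chat API; the model names, base URLs, and keys are placeholders, not configuration from the paper, and the trained reflection module itself is left untouched.

```python
# Illustrative sketch: the trained reflection module is paired with different
# black-box backbones simply by swapping the Stage-1 generation call.
# Assumes an OpenAI-compatible chat API; model names and endpoints are placeholders.

from openai import OpenAI

def make_base_llm(model: str, base_url: str | None = None, api_key: str = "..."):
    """Return a prompt -> text callable for the chosen black-box backbone."""
    client = OpenAI(base_url=base_url, api_key=api_key)

    def call(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    return call

# e.g. base_llm = make_base_llm("gpt-4o-mini"); the reflection module is unchanged.
```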

Curriculum learning over varying user-history densities imparts robustness to both sparse and noisy user data, which is critical for practical deployment. RPO is validated on classification, regression, and generation tasks.

Key limitations:

  • The SFT phase is contingent on access to rewriting-trajectory data from teacher models.
  • RL rewards depend on automated task metrics (e.g., ROUGE, MAE), which may incompletely capture subjective or latent user preferences.

Potential improvements include multimodal extensions, reward model refinement via learned preference models or limited human intervention, scaling the reflection module for deeper reasoning, and joint retriever–reflector training.

6. Relationship to Broader Reflective Personalization Literature

RPO, as formulated in (Hao et al., 7 Nov 2025), is a paradigmatic instance of explicit personalization optimization, distinct in its black-box, decoupled post-processing policy. Related frameworks, such as Critique-Post-Edit RL, which uses generative reward models and self-editing for controllable personalization (Zhu et al., 21 Oct 2025), or persona refinement via iterative cognitive-divergence analysis (Yao et al., 16 Oct 2025), share RPO’s emphasis on explicit, iterative alignment via external reflection/optimization agents. However, RPO’s two-stage response rewriting, progressive-shot curriculum, and task-agnostic black-box design are distinguishing features for model-agnostic, post-hoc deployment.

7. Practical and Theoretical Implications

The decoupling principle of RPO enables efficient post-hoc personalization with minimal computational overhead for inference-time use. Its plug-and-play architecture is aligned with deployment requirements for commercial black-box LLM APIs. The trade-offs illuminated by the RPO pipeline—between content integrity and stylistic alignment—establish a reproducible framework for personalized, user-centric generative AI, with extensible applicability to future personalized conversational, recommendation, and agentic reasoning domains.
