
LoRe: Personalizing LLMs via Low-Rank Reward Modeling (2504.14439v1)

Published 20 Apr 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Personalizing LLMs to accommodate diverse user preferences is essential for enhancing alignment and user satisfaction. Traditional reinforcement learning from human feedback (RLHF) approaches often rely on monolithic value representations, limiting their ability to adapt to individual preferences. We introduce a novel framework that leverages low-rank preference modeling to efficiently learn and generalize user-specific reward functions. By representing reward functions in a low-dimensional subspace and modeling individual preferences as weighted combinations of shared basis functions, our approach avoids rigid user categorization while enabling scalability and few-shot adaptation. We validate our method on multiple preference datasets, demonstrating superior generalization to unseen users and improved accuracy in preference prediction tasks.

Summary

  • The paper introduces a low-rank reward modeling framework that represents user preferences as linear combinations of shared basis functions.
  • It leverages collaborative ranking and the Bradley-Terry model to jointly optimize basis parameters and user-specific weights for efficient few-shot adaptation.
  • Experiments on semi-synthetic and real-world datasets show that LoRe outperforms traditional RLHF methods in scalability and parameter efficiency.

LoRe (Low-Rank Reward Modeling) is a framework designed to personalize LLMs to individual user preferences more effectively than traditional methods. The core problem it addresses is the limitation of standard Reinforcement Learning from Human Feedback (RLHF), which typically trains a single reward model for all users and therefore fails to capture diverse and often conflicting preferences. Existing personalization methods often rely on rigid user categories or require extensive data per user, limiting scalability and generalization to new users.

LoRe proposes leveraging the concept of low-rank preference modeling, inspired by collaborative ranking techniques. Instead of learning a single reward function, LoRe learns a set of $B$ shared basis reward functions, $\mathbf{R}_\phi: \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}^B$. Each user's personalized reward function $\mathbf{p}_i$ is then represented as a linear combination of these basis functions, weighted by a user-specific vector $\mathbf{w}_i \in \Delta^{B-1}$ (a $B$-dimensional vector summing to 1):

$$\mathbf{p}_i(x, y) = \mathbf{w}_i^\top \mathbf{R}_{\phi}(x, y)$$

This formulation assumes that the diverse landscape of user preferences can be effectively captured within a low-dimensional subspace spanned by the basis functions.
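
As a toy illustration with hypothetical numbers (not taken from the paper): with $B = 3$ basis rewards $\mathbf{R}_\phi(x, y) = (0.2, -1.0, 0.5)$ and user weights $\mathbf{w}_i = (0.7, 0.1, 0.2)$, the personalized reward is $0.7 \cdot 0.2 + 0.1 \cdot (-1.0) + 0.2 \cdot 0.5 = 0.14$; a user whose weights concentrate on a different basis function would score the same response differently.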

The learning process involves minimizing the negative log-likelihood of observed pairwise preferences, following the Bradley-Terry (BT) model. For a user $i$ preferring response $y_c$ over $y_r$ for prompt $x$, the personalized reward difference is $\mathbf{w}_i^\top (\mathbf{R}_\phi(x, y_c) - \mathbf{R}_\phi(x, y_r))$. The objective is to jointly learn the parameters $\phi$ of the reward basis function $\mathbf{R}_\phi$ and the user-specific weights $\mathbf{w}_i$ for all seen users based on their preference data $\mathcal{D}_{\rm train}$.
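
As a concrete sketch of this objective (assuming the basis rewards have already been computed and using PyTorch; the function and variable names are illustrative, not the authors' code), the personalized BT negative log-likelihood for a batch of preference triples can be written as:

```python
import torch
import torch.nn.functional as F

def personalized_bt_nll(w, r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood under a personalized reward.

    w          : (B,)   user-specific simplex weights
    r_chosen   : (n, B) basis rewards R_phi(x, y_c) for the chosen responses
    r_rejected : (n, B) basis rewards R_phi(x, y_r) for the rejected responses
    """
    # Personalized reward margin w^T (R(x, y_c) - R(x, y_r)) per example.
    margin = (r_chosen - r_rejected) @ w
    # softplus(-m) = log(1 + exp(-m)), i.e. the BT negative log-likelihood.
    return F.softplus(-margin).mean()
```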

A key practical advantage of LoRe is its few-shot adaptation capability for new, unseen users. Once the reward basis $\mathbf{R}_\phi$ is learned from a set of seen users, it is fixed. For a new user with a small number of preference examples $\mathcal{D}_{\rm fewshot}$, only their personal weight vector $\mathbf{w}_{\rm new}$ needs to be learned by optimizing the negative log-likelihood on their limited data:

$$\mathbf{w}_{\rm new} = \argmin_{\mathbf{w} \in \Delta^{B-1}} \sum_{(x, y_c, y_r) \in \mathcal{D}_{\rm fewshot}} \log \left(1 + \exp\left(-\mathbf{w}^\top \left(\mathbf{R}_\phi(x, y_c) - \mathbf{R}_\phi(x, y_r)\right)\right) \right)$$

This decoupling of learning (basis learning on all data, weight learning per user) makes personalization efficient, especially when users provide limited feedback.
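
A minimal sketch of this adaptation step, assuming the frozen basis $\mathbf{R}_\phi$ has already been evaluated on the few-shot pairs and using a softmax parameterization to keep $\mathbf{w}$ on the simplex (one convenient choice, not necessarily the paper's exact projection method):

```python
import torch
import torch.nn.functional as F

def adapt_new_user(r_chosen, r_rejected, num_steps=200, lr=0.05):
    """Learn simplex weights w_new for an unseen user from few-shot pairs.

    r_chosen, r_rejected : (n, B) frozen basis rewards for the n few-shot pairs.
    """
    B = r_chosen.shape[1]
    logits = torch.zeros(B, requires_grad=True)      # unconstrained parameters
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(num_steps):
        w = torch.softmax(logits, dim=0)             # stays on the simplex
        margin = (r_chosen - r_rejected) @ w
        loss = F.softplus(-margin).mean()            # BT negative log-likelihood
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=0).detach()     # w_new
```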

For practical implementation, LoRe can be built on top of existing pre-trained reward models. A common approach detailed in the paper involves using the embedding from a pre-trained model's pre-final layer, denoted $\mathbf{e}(x, y) \in \mathbb{R}^D$. The reward basis $\mathbf{R}_\phi$ can then be implemented as a linear transformation of this embedding:

$$\mathbf{R}_{\phi}(x, y) = \mathbf{A} \mathbf{e}(x, y)$$

where $\mathbf{A} \in \mathbb{R}^{B \times D}$ is a learnable matrix. More complex transformations, such as a shallow MLP on the embeddings, could also be used, or parameter-efficient fine-tuning techniques like LoRA could be applied to earlier layers alongside the final transformation. The choice of architecture depends on the computational budget and the size of the dataset. The number of basis functions $B$ is a hyperparameter tuned via cross-validation.
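
A minimal sketch of this linear variant, assuming frozen $D$-dimensional embeddings from some pre-trained reward model (module and argument names are illustrative):

```python
import torch
import torch.nn as nn

class LinearRewardBasis(nn.Module):
    """Maps a frozen D-dimensional embedding e(x, y) to B basis rewards via R = A e."""

    def __init__(self, embed_dim: int, num_basis: int):
        super().__init__()
        # A in R^{B x D}; bias-free to match R_phi(x, y) = A e(x, y).
        self.A = nn.Linear(embed_dim, num_basis, bias=False)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, D) -> basis rewards: (batch, B)
        return self.A(embeddings)
```

The personalized reward for user $i$ is then the dot product of this module's output with $\mathbf{w}_i$.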

The LoRe workflow involves:

  1. Data Collection: Gather preference data from a diverse set of seen users.
  2. Joint Learning: Train the reward basis parameters ($\phi$, e.g., the matrix $\mathbf{A}$) and the user weights ($\mathbf{w}_i$ for seen users) simultaneously by minimizing the personalized negative log-likelihood over $\mathcal{D}_{\rm train}$, using a standard optimizer such as Adam (a sketch of this step follows the list).
  3. Few-Shot Adaptation: For a new, unseen user, fix the learned reward basis $\mathbf{R}_\phi$ and learn their specific weight vector $\mathbf{w}_{\rm new}$ using their small set of few-shot examples $\mathcal{D}_{\rm fewshot}$. This is a lightweight optimization step.
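
The joint-learning step (2) can be sketched as follows, assuming each training example carries a user index and precomputed response embeddings; this is a simplified illustration of the optimization, not the authors' training code, and the sizes are hypothetical:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (hypothetical, not from the paper).
N_USERS, D, B = 100, 768, 8

A = torch.nn.Parameter(0.01 * torch.randn(B, D))            # shared reward basis
user_logits = torch.nn.Parameter(torch.zeros(N_USERS, B))   # per-user weight logits
opt = torch.optim.Adam([A, user_logits], lr=1e-3)

def training_step(user_idx, e_chosen, e_rejected):
    """One joint gradient step on a batch of (user, chosen, rejected) embeddings.

    user_idx   : (n,)   integer ids of seen users
    e_chosen   : (n, D) embeddings of chosen responses
    e_rejected : (n, D) embeddings of rejected responses
    """
    W = torch.softmax(user_logits, dim=1)          # (N_USERS, B), rows on the simplex
    w = W[user_idx]                                # (n, B) weights per example
    # Personalized reward margin w_i^T (R(x, y_c) - R(x, y_r)).
    margin = ((e_chosen - e_rejected) @ A.T * w).sum(dim=1)
    loss = F.softplus(-margin).mean()              # Bradley-Terry NLL
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```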

LoRe also integrates naturally with multi-objective alignment techniques for personalized response generation. By viewing the $B$ basis reward functions as capturing different latent objectives, a personalized reward function $\mathbf{w}_i^\top \mathbf{R}_\phi$ can be used to steer the generation process. The paper suggests a connection to DPO, showing that learning a low-rank reward basis is equivalent to learning a basis of policies, allowing personalized policies to be constructed as weighted combinations.
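
While the paper's DPO connection yields personalized policies directly, one simple way to use the personalized reward at inference time is best-of-$n$ reranking of candidate responses; the sketch below is illustrative only (it assumes the linear basis $\mathbf{A}$ from above and is not the paper's policy-basis construction):

```python
import torch

def pick_best_response(w, candidate_embeddings, A):
    """Rerank n candidate responses by the personalized reward w^T A e(x, y).

    w                    : (B,)   user weights
    candidate_embeddings : (n, D) embeddings of candidate responses
    A                    : (B, D) learned reward basis matrix
    """
    basis_rewards = candidate_embeddings @ A.T    # (n, B)
    scores = basis_rewards @ w                    # (n,) personalized rewards
    return int(torch.argmax(scores))              # index of the best candidate
```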

The paper evaluates LoRe on semi-synthetic (PersonalLLM) and real-world (Reddit TLDR, PRISM) preference datasets, comparing it against baselines including a monolithic BT model, VPL, and PAL. Experiments show that LoRe consistently achieves higher accuracy in predicting user preferences for both seen and unseen users, particularly excelling in few-shot adaptation scenarios and scaling to large, diverse user populations like PRISM. Unlike VPL and PAL, which often struggle to generalize with limited unseen user data or large numbers of users, LoRe's structured low-rank approach proves more robust and parameter-efficient.

In comparison to prior work:

  • It avoids explicit user categorization.
  • It doesn't require learning complex per-user latent representations or encoder networks (unlike VPL).
  • Its low-rank decomposition offers better scalability and parameter efficiency compared to methods like PAL, which involve larger MLP architectures and user-prototype matrices that grow with the number of users. LoRe's parameter count grows linearly with the number of users ($B \times N$ for user weights), but the reward basis parameters ($B \times D$) are shared, resulting in fewer parameters overall, especially for large $N$ and moderate $B$ (see the worked example below).
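
As a rough worked example with hypothetical sizes (not reported in the paper): with $N = 10{,}000$ users, embedding dimension $D = 4096$, and $B = 10$ basis functions, the user weights contribute $B \times N = 100{,}000$ parameters (just 10 per user) and the shared basis $\mathbf{A}$ contributes $B \times D = 40{,}960$, for roughly 141K learnable parameters in total on top of the frozen backbone.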

Overall, LoRe provides a practical and effective method for bringing personalization to RLHF, enabling scalable adaptation to diverse user preferences with minimal data, making it suitable for real-world deployment scenarios. Future work could explore online learning settings where user feedback is collected iteratively.
