
LoRe: Personalizing LLMs via Low-Rank Reward Modeling (2504.14439v1)

Published 20 Apr 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Personalizing LLMs to accommodate diverse user preferences is essential for enhancing alignment and user satisfaction. Traditional reinforcement learning from human feedback (RLHF) approaches often rely on monolithic value representations, limiting their ability to adapt to individual preferences. We introduce a novel framework that leverages low-rank preference modeling to efficiently learn and generalize user-specific reward functions. By representing reward functions in a low-dimensional subspace and modeling individual preferences as weighted combinations of shared basis functions, our approach avoids rigid user categorization while enabling scalability and few-shot adaptation. We validate our method on multiple preference datasets, demonstrating superior generalization to unseen users and improved accuracy in preference prediction tasks.

Summary

  • The paper introduces a low-rank reward modeling framework that represents user preferences as linear combinations of shared basis functions.
  • It leverages collaborative ranking and the Bradley-Terry model to jointly optimize basis parameters and user-specific weights for efficient few-shot adaptation.
  • Experiments on semi-synthetic and real-world datasets show that LoRe outperforms traditional RLHF methods in scalability and parameter efficiency.

LoRe (Low-Rank Reward Modeling) is a framework designed to personalize LLMs to individual user preferences more effectively than traditional methods. The core problem it addresses is the limitation of standard Reinforcement Learning from Human Feedback (RLHF), which typically trains a single reward model for all users and therefore fails to capture diverse and often conflicting preferences. Existing personalization methods often rely on rigid user categories or require extensive data per user, limiting scalability and generalization to new users.

LoRe proposes leveraging the concept of low-rank preference modeling, inspired by collaborative ranking techniques. Instead of learning a single reward function, LoRe learns a set of $B$ shared basis reward functions, $\mathbf{R}_\phi: \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}^B$. Each user's personalized reward function $\mathbf{p}_i$ is then represented as a linear combination of these basis functions, weighted by a user-specific vector $\mathbf{w}_i \in \Delta^{B-1}$ (a $B$-dimensional vector summing to 1):

$$\mathbf{p}_i(x, y) = \mathbf{w}_i^\top \mathbf{R}_{\phi}(x, y)$$

This formulation assumes that the diverse landscape of user preferences can be effectively captured within a low-dimensional subspace spanned by the basis functions.
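
As a toy illustration with hypothetical numbers (not taken from the paper): with $B = 3$ basis rewards $\mathbf{R}_\phi(x, y) = (0.2, -1.0, 0.5)$ and user weights $\mathbf{w}_i = (0.7, 0.1, 0.2)$, the personalized reward is $0.7 \cdot 0.2 + 0.1 \cdot (-1.0) + 0.2 \cdot 0.5 = 0.14$; a user whose weights concentrate on a different basis function would score the same response differently.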

The learning process involves minimizing the negative log-likelihood of observed pairwise preferences, following the Bradley-Terry (BT) model. For a user $i$ preferring response $y_c$ over $y_r$ for prompt $x$, the personalized reward difference is $\mathbf{w}_i^\top (\mathbf{R}_\phi(x, y_c) - \mathbf{R}_\phi(x, y_r))$. The objective is to jointly learn the parameters $\phi$ of the reward basis function $\mathbf{R}_\phi$ and the user-specific weights $\mathbf{w}_i$ for all seen users based on their preference data $\mathcal{D}_{\rm train}$.
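
As a concrete sketch of this objective (assuming the basis rewards have already been computed and using PyTorch; the function and variable names are illustrative, not the authors' code), the personalized BT negative log-likelihood for a batch of preference triples can be written as:

```python
import torch
import torch.nn.functional as F

def personalized_bt_nll(w, r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood under a personalized reward.

    w          : (B,)   user-specific simplex weights
    r_chosen   : (n, B) basis rewards R_phi(x, y_c) for the chosen responses
    r_rejected : (n, B) basis rewards R_phi(x, y_r) for the rejected responses
    """
    # Personalized reward margin w^T (R(x, y_c) - R(x, y_r)) per example.
    margin = (r_chosen - r_rejected) @ w
    # softplus(-m) = log(1 + exp(-m)), i.e. the BT negative log-likelihood.
    return F.softplus(-margin).mean()
```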

A key practical advantage of LoRe is its few-shot adaptation capability for new, unseen users. Once the reward basis $\mathbf{R}_\phi$ is learned from a set of seen users, it is fixed. For a new user with a small number of preference examples $\mathcal{D}_{\rm fewshot}$, only their personal weight vector $\mathbf{w}_{\rm new}$ needs to be learned by optimizing the negative log-likelihood on their limited data:

$$\mathbf{w}_{\rm new} = \argmin_{\mathbf{w} \in \Delta^{B-1}} \sum_{(x, y_c, y_r) \in \mathcal{D}_{\rm fewshot}} \log \left(1 + \exp\left(-\mathbf{w}^\top \left(\mathbf{R}_\phi(x, y_c) - \mathbf{R}_\phi(x, y_r)\right)\right) \right)$$

This decoupling of learning (basis learning on all data, weight learning per user) makes personalization efficient, especially when users provide limited feedback.
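
A minimal sketch of this adaptation step, assuming the frozen basis $\mathbf{R}_\phi$ has already been evaluated on the few-shot pairs and using a softmax parameterization to keep $\mathbf{w}$ on the simplex (one convenient choice, not necessarily the paper's exact projection method):

```python
import torch
import torch.nn.functional as F

def adapt_new_user(r_chosen, r_rejected, num_steps=200, lr=0.05):
    """Learn simplex weights w_new for an unseen user from few-shot pairs.

    r_chosen, r_rejected : (n, B) frozen basis rewards for the n few-shot pairs.
    """
    B = r_chosen.shape[1]
    logits = torch.zeros(B, requires_grad=True)      # unconstrained parameters
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(num_steps):
        w = torch.softmax(logits, dim=0)             # stays on the simplex
        margin = (r_chosen - r_rejected) @ w
        loss = F.softplus(-margin).mean()            # BT negative log-likelihood
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=0).detach()     # w_new
```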

For practical implementation, LoRe can be built on top of existing pre-trained reward models. A common approach detailed in the paper involves using the embedding from a pre-trained model's pre-final layer, denoted $\mathbf{e}(x, y) \in \mathbb{R}^D$. The reward basis $\mathbf{R}_\phi$ can then be implemented as a linear transformation of this embedding:

$$\mathbf{R}_{\phi}(x, y) = \mathbf{A} \mathbf{e}(x, y)$$

where $\mathbf{A} \in \mathbb{R}^{B \times D}$ is a learnable matrix. More complex transformations, such as a shallow MLP on the embeddings, could also be used, or parameter-efficient fine-tuning techniques like LoRA could be applied to earlier layers alongside the final transformation. The choice of architecture depends on the computational budget and the size of the dataset. The number of basis functions $B$ is a hyperparameter tuned via cross-validation.
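
A minimal sketch of this linear variant, assuming frozen $D$-dimensional embeddings from some pre-trained reward model (module and argument names are illustrative):

```python
import torch
import torch.nn as nn

class LinearRewardBasis(nn.Module):
    """Maps a frozen D-dimensional embedding e(x, y) to B basis rewards via R = A e."""

    def __init__(self, embed_dim: int, num_basis: int):
        super().__init__()
        # A in R^{B x D}; bias-free to match R_phi(x, y) = A e(x, y).
        self.A = nn.Linear(embed_dim, num_basis, bias=False)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, D) -> basis rewards: (batch, B)
        return self.A(embeddings)
```

The personalized reward for user $i$ is then the dot product of this module's output with $\mathbf{w}_i$.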

The LoRe workflow involves:

  1. Data Collection: Gather preference data from a diverse set of seen users.
  2. Joint Learning: Train the reward basis parameters ($\phi$, e.g., the matrix $\mathbf{A}$) and the user weights ($\mathbf{w}_i$ for seen users) simultaneously by minimizing the personalized negative log-likelihood over $\mathcal{D}_{\rm train}$, using a standard optimizer such as Adam (a sketch of this step follows the list).
  3. Few-Shot Adaptation: For a new, unseen user, fix the learned reward basis $\mathbf{R}_\phi$ and learn their specific weight vector $\mathbf{w}_{\rm new}$ using their small set of few-shot examples $\mathcal{D}_{\rm fewshot}$. This is a lightweight optimization step.
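
The joint-learning step (2) can be sketched as follows, assuming each training example carries a user index and precomputed response embeddings; this is a simplified illustration of the optimization, not the authors' training code, and the sizes are hypothetical:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (hypothetical, not from the paper).
N_USERS, D, B = 100, 768, 8

A = torch.nn.Parameter(0.01 * torch.randn(B, D))            # shared reward basis
user_logits = torch.nn.Parameter(torch.zeros(N_USERS, B))   # per-user weight logits
opt = torch.optim.Adam([A, user_logits], lr=1e-3)

def training_step(user_idx, e_chosen, e_rejected):
    """One joint gradient step on a batch of (user, chosen, rejected) embeddings.

    user_idx   : (n,)   integer ids of seen users
    e_chosen   : (n, D) embeddings of chosen responses
    e_rejected : (n, D) embeddings of rejected responses
    """
    W = torch.softmax(user_logits, dim=1)          # (N_USERS, B), rows on the simplex
    w = W[user_idx]                                # (n, B) weights per example
    # Personalized reward margin w_i^T (R(x, y_c) - R(x, y_r)).
    margin = ((e_chosen - e_rejected) @ A.T * w).sum(dim=1)
    loss = F.softplus(-margin).mean()              # Bradley-Terry NLL
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```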

LoRe also integrates naturally with multi-objective alignment techniques for personalized response generation. By viewing the $B$ basis reward functions as capturing different latent objectives, a personalized reward function $\mathbf{w}_i^\top \mathbf{R}_\phi$ can be used to steer the generation process. The paper suggests a connection to DPO, showing that learning a low-rank reward basis is equivalent to learning a basis of policies, allowing personalized policies to be constructed as weighted combinations.
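
While the paper's DPO connection yields personalized policies directly, one simple way to use the personalized reward at inference time is best-of-$n$ reranking of candidate responses; the sketch below is illustrative only (it assumes the linear basis $\mathbf{A}$ from above and is not the paper's policy-basis construction):

```python
import torch

def pick_best_response(w, candidate_embeddings, A):
    """Rerank n candidate responses by the personalized reward w^T A e(x, y).

    w                    : (B,)   user weights
    candidate_embeddings : (n, D) embeddings of candidate responses
    A                    : (B, D) learned reward basis matrix
    """
    basis_rewards = candidate_embeddings @ A.T    # (n, B)
    scores = basis_rewards @ w                    # (n,) personalized rewards
    return int(torch.argmax(scores))              # index of the best candidate
```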

The paper evaluates LoRe on semi-synthetic (PersonalLLM) and real-world (Reddit TLDR, PRISM) preference datasets, comparing it against baselines including a monolithic BT model, VPL, and PAL. Experiments show that LoRe consistently achieves higher accuracy in predicting user preferences for both seen and unseen users, particularly excelling in few-shot adaptation scenarios and scaling to large, diverse user populations like PRISM. Unlike VPL and PAL, which often struggle to generalize with limited unseen user data or large numbers of users, LoRe's structured low-rank approach proves more robust and parameter-efficient.

In comparison to prior work:

  • It avoids explicit user categorization.
  • It doesn't require learning complex per-user latent representations or encoder networks (unlike VPL).
  • Its low-rank decomposition offers better scalability and parameter efficiency compared to methods like PAL, which involve larger MLP architectures and user-prototype matrices that grow with the number of users. LoRe's parameter count grows linearly with the number of users ($B \times N$ for user weights), but the reward basis parameters ($B \times D$) are shared, resulting in fewer parameters overall, especially for large $N$ and moderate $B$ (see the worked example below).
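
As a rough worked example with hypothetical sizes (not reported in the paper): with $N = 10{,}000$ users, embedding dimension $D = 4096$, and $B = 10$ basis functions, the user weights contribute $B \times N = 100{,}000$ parameters (just 10 per user) and the shared basis $\mathbf{A}$ contributes $B \times D = 40{,}960$, for roughly 141K learnable parameters in total on top of the frozen backbone.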

Overall, LoRe provides a practical and effective method for bringing personalization to RLHF, enabling scalable adaptation to diverse user preferences with minimal data, making it suitable for real-world deployment scenarios. Future work could explore online learning settings where user feedback is collected iteratively.
