Customer-R1: RL for Personalized Simulation

Updated 10 October 2025
  • Customer-R1 is an RL-based framework that simulates individual shopping sessions by conditioning actions on explicit user persona profiles.
  • It combines action-correctness and output-format rewards into a composite signal and optimizes the policy with Group Relative Policy Optimization (GRPO) for improved fidelity.
  • The framework achieves higher accuracy in mimicking real user behaviors, supporting advanced recommender systems and personalized usability testing.

Customer-R1 is an RL-based framework for step-wise personalized simulation of human behaviors, primarily in online shopping environments. Distinct from generic behavioral modeling approaches, Customer-R1 incorporates explicit user persona descriptions and leverages a composite reward system to optimize both the rationale and action generation of LLM agents. The framework addresses shortcomings of prior prompting, supervised fine-tuning (SFT), and RL techniques by directly conditioning user action simulation on persona, thereby increasing fidelity to true user action distributions.

1. Framework Overview and Differentiation

Customer-R1 is designed to produce high-fidelity simulations of individual users' online shopping sessions, with sequential actions conditioned on rich persona descriptors. Unlike legacy population-level policies, the framework conditions both the reasoning (rationale) and the executable action on: (a) HTML page observations and (b) explicit user profiles comprising demographics, personality, and shopping preferences. This approach eliminates the implicit averaging present in previous SFT or RL pipelines, which typically ignore individual variance and yield homogeneous predictions.

Earlier approaches use prompt engineering, non-personalized RL, or SFT trained on aggregated behavior logs. These are limited in their ability to generate personalized, consistent action trajectories that reflect unique trait-driven decision making. Customer-R1 solves this by joint conditioning: $r_t, a_t = F(a_{1 \dots t-1}, r_{1 \dots t-1}, o_{1 \dots t}, P)$, where $F$ denotes the agent function, $o$ represents sequential web observations, and $P$ is the explicit persona vector.
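
A minimal sketch of this joint conditioning is given below. The Persona and Step structures, the prompt layout, and the function name are assumptions made for illustration, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    demographics: str
    personality: str
    shopping_preferences: str

@dataclass
class Step:
    rationale: str  # r_k: natural-language reasoning at step k
    action: dict    # a_k: typed action, e.g. {"type": "click", "element": "add_to_cart"}

def build_agent_input(history: list[Step], observations: list[str], persona: Persona) -> str:
    """Assemble the conditioning context F(a_{1..t-1}, r_{1..t-1}, o_{1..t}, P) as one prompt."""
    lines = [
        "## Persona",
        f"Demographics: {persona.demographics}",
        f"Personality: {persona.personality}",
        f"Shopping preferences: {persona.shopping_preferences}",
        "## Session history",
    ]
    for k, step in enumerate(history, start=1):
        lines.append(f"Step {k} rationale: {step.rationale}")
        lines.append(f"Step {k} action: {step.action}")
    lines.append("## Current page observation")
    lines.append(observations[-1])  # o_t: latest (truncated) HTML observation
    lines.append('Respond with JSON: {"rationale": ..., "action": {...}}')
    return "\n".join(lines)
```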

2. Methodology: RL Objective and Reward Design

The principal technical innovation is formulating the simulation as an RL optimization problem guided by composite action-correctness and output-format rewards. The policy is updated using Group Relative Policy Optimization (GRPO), leveraging multi-trajectory rollouts for improved credit assignment.

  • Action Reward ($R_{action}$): Assigned as 1 iff the predicted action's type and all attributes exactly match the ground truth; 0 otherwise. This strict criterion extends to complex actions (e.g., "click" on element "purchase" with additional subtypes).
  • Format Reward ($R_{format}$): Enforces output conformance to a JSON schema, requiring a rationale followed by an action.
  • Difficulty-Aware Weighting ($w(a)$): Greater reward is assigned for rare or complex actions, mitigating policy collapse onto frequent behaviors.

The overall reward is given by:

$R = w(\hat{a}) \cdot \left(R_{action} + R_{format}\right)$
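
A minimal sketch of this composite reward follows. The exact attribute comparison, the JSON schema check, and the inverse-frequency form of $w(a)$ are assumptions for illustration; the paper only specifies the exact-match, format-conformance, and difficulty-weighting criteria.

```python
import json

def action_reward(pred_action: dict, gold_action: dict) -> float:
    """R_action: 1 only if the action type and every attribute match the ground truth exactly."""
    return 1.0 if pred_action == gold_action else 0.0

def format_reward(raw_output: str) -> float:
    """R_format: 1 if the output is valid JSON with a rationale followed by a typed action."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    keys = list(obj.keys())
    well_formed = (keys[:2] == ["rationale", "action"]
                   and isinstance(obj["action"], dict)
                   and "type" in obj["action"])
    return 1.0 if well_formed else 0.0

def difficulty_weight(action_type: str, type_freq: dict[str, float]) -> float:
    """w(a): up-weight rare action types (the inverse-frequency form is an assumption)."""
    return 1.0 / max(type_freq.get(action_type, 1.0), 1e-6)

def total_reward(raw_output: str, pred_action: dict, gold_action: dict,
                 type_freq: dict[str, float]) -> float:
    """R = w(a_hat) * (R_action + R_format), keyed here on the ground-truth action type."""
    return difficulty_weight(gold_action["type"], type_freq) * (
        action_reward(pred_action, gold_action) + format_reward(raw_output))
```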

The RL objective is:

$J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_t \min\!\big(r_{i,t}(\theta)\hat{A}_i,\ \mathrm{clip}(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_i\big) - \beta\, D_{KL}(\pi_\theta \,\|\, \pi_{ref}) \right]$

with

  • $r_{i,t}(\theta)$: likelihood ratio between the current and previous policies for token $t$ of rollout $i$,
  • $\hat{A}_i$: normalized group-relative advantage,
  • $\beta$: KL penalization coefficient.
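
A compact sketch of the corresponding loss is given below, assuming per-token log-probabilities for a group of $G$ rollouts of the same prompt; the simple per-token KL estimate and the normalization constant in the advantage are assumptions made for the example.

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # (G, T) log-probs under the current policy
              logp_old: torch.Tensor,   # (G, T) log-probs under the rollout-time policy
              logp_ref: torch.Tensor,   # (G, T) log-probs under the frozen reference policy
              rewards: torch.Tensor,    # (G,)   scalar reward per rollout
              clip_eps: float = 0.2,
              beta: float = 0.01) -> torch.Tensor:
    # Group-relative advantage: normalize rewards within the group of G rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1)                                      # broadcast over tokens

    # Clipped surrogate on the per-token likelihood ratio r_{i,t}(theta).
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv)

    # Simple per-token KL estimate toward the frozen reference policy.
    kl = logp_new - logp_ref

    # J(theta) is maximized, so return its negative as a loss for a gradient-descent optimizer.
    return -((surrogate - beta * kl).mean(dim=1)).mean()
```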

Persona data, derived via surveys and interviews, acts as a behavioral prior resolving ambiguous action choices—models are trained to integrate and balance situational cues against user trait influences.

3. Experimental Protocol and Performance

Experiments utilize the OPeRA dataset, which includes 527 shopping sessions and 5,856 action-observation pairs, with rationales and persona profiles for each user. The next-action prediction task requires generating both rationale and action given the user’s session history and profile.

Metrics used:

  • Next Action Generation Accuracy
  • Action Type Macro-F1
  • Fine-Grained Type Accuracy
  • Session Outcome Weighted-F1
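
The coarse metrics can be reproduced with standard library calls, as in the sketch below; the label values are invented placeholders, and the paper's exact definition of fine-grained type accuracy is not reproduced here.

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder per-step predictions over coarse action types (illustrative values only).
y_true = ["click", "search", "terminate", "click", "purchase"]
y_pred = ["click", "click",  "terminate", "click", "search"]

next_action_accuracy = accuracy_score(y_true, y_pred)                  # exact-match rate
action_type_macro_f1 = f1_score(y_true, y_pred, average="macro")       # rare types count equally

# Session outcomes (e.g., purchase vs. no purchase) scored with a frequency-weighted F1.
s_true = ["purchase", "no_purchase", "no_purchase"]
s_pred = ["purchase", "purchase",    "no_purchase"]
session_outcome_weighted_f1 = f1_score(s_true, s_pred, average="weighted")
```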

Findings:

  • Zero-shot LLMs deliver poor results, dominated by frequent actions.
  • SFT alone raises all metric scores but is still biased by population averages.
  • RL-only improves rare action coverage, but can distort overall distribution.
  • Combined SFT+RL achieves the best performance: next-action generation accuracy of 39.58%, along with improved Macro-F1, Fine-Grained Accuracy, and Session Outcome F1. This indicates the closest alignment with ground-truth user action distributions in session completion tasks.

4. Persona Integration and Behavioral Fidelity

The explicit conditioning on user persona enables the agent to simulate individual shopping strategies, preferences, and reactions to site context. Each simulation reflects demographic and personality-driven differences; for example, a user described as price-sensitive will exhibit distinct navigation and purchase intent compared to one described as impulsive.

When faced with ambiguous web states (e.g., unclear product categorization), the agent can draw on the persona context to infer the most plausible next steps. Plausibly, this capability supports more realistic usability testing and refines recommendation systems by generating personalized engagement trajectories that mimic true customer behavior variance.

5. Applications and Broader Implications

Beyond online shopping, Customer-R1's architecture can generalize to domains requiring high-resolution simulation of individual human behavior. Applications include but are not limited to:

  • Usability evaluation for dynamic interfaces.
  • Personalized recommender system benchmarking.
  • Simulation-based studies in computational social science (e.g., agent-based modeling with user traits).
  • Personalized education systems anticipating learner responses.
  • Digital marketing campaign optimization using trait-driven user simulations.

The framework’s integration of reasoning, persona, and RL-based optimization sets a precedent for future systems requiring adaptive, trait-sensitive modeling. This suggests expansion possibilities to domains where behavioral prediction at the individual level is critical.

6. Technical Implementation and Modeling Details

The agent is tasked with producing output $r_t, a_t$ at each step using the following schema:

  • Inputs: $a_{1 \dots t-1}$ (previous actions), $r_{1 \dots t-1}$ (previous rationales), $o_{1 \dots t}$ (historical HTML observations), $P$ (persona descriptor).
  • Outputs: Rationale (natural language), Action (typed JSON with all relevant attributes).
  • Rewards: $R_{action}$ (exact match), $R_{format}$ (conformance), $w(a)$ (difficulty scaling).
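
A minimal parser for this output contract might look as follows; the field names and the set of required action attributes are assumptions for illustration, not the dataset's actual schema.

```python
import json

REQUIRED_ACTION_FIELDS = {"type"}  # further attributes depend on the action type (assumption)

def parse_step_output(raw: str) -> tuple[str, dict]:
    """Split one model step into (rationale, action), enforcing rationale-then-action order."""
    obj = json.loads(raw)
    keys = list(obj.keys())
    if keys[:2] != ["rationale", "action"]:
        raise ValueError("Output must contain a rationale followed by an action.")
    action = obj["action"]
    if not REQUIRED_ACTION_FIELDS.issubset(action):
        raise ValueError("Action is missing required fields.")
    return obj["rationale"], action

# A well-formed step output (field values are illustrative):
raw = ('{"rationale": "The persona is price-sensitive, so sort results by price.", '
       '"action": {"type": "click", "element": "sort_by_price"}}')
rationale, action = parse_step_output(raw)
```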

Policy optimization employs GRPO, balancing reward maximization against KL regularization to maintain output diversity. Increased weighting for rare or costly actions counteracts the tendency to overfit to frequent, simple behaviors.

7. Impact and Future Directions

Customer-R1 demonstrates empirically validated improvement in personalized user simulation for online shopping, matching observed action distributions and session outcomes with higher accuracy than prior methods. A plausible implication is the enhancement of adaptive personalization techniques across e-commerce and adjacent sectors. As datasets expand and trait modeling improves, extensions could enable finer-grained behavioral synthesis in new environments, including cross-domain applications with overlapping user interaction profiles.

The framework’s hybrid SFT+RL design, sophisticated persona integration, and reward engineering lay a foundation for expanded research into personalized simulation, dynamic content generation, and targeted intervention. Continued developments in RL-based LLM training conditioned on trait vectors are likely to accelerate in the coming years.
