User Modeling Tasks with BehaveGPT

Last updated: June 11, 2025

Below is a detailed answer to "User Modeling Tasks" as addressed in the BehaveGPT paper, with explicit explanations and mathematical formulations conveying how BehaveGPT advances user behavior modeling.


1. Model Architecture

BehaveGPT is built upon a transformer-based architecture, purpose-designed for modeling large-scale, multi-faceted user behavioral data. The system encodes rich behavioral features and models their dependencies explicitly:

  • Embedding Layers: Four parallel embedding matrices encode
    • Weekday: $E_w \in \mathbb{R}^{I \times d}$
    • Time slot: $E_t \in \mathbb{R}^{I \times d}$
    • Location: $E_l \in \mathbb{R}^{I \times d}$
    • User event/action: $E_e \in \mathbb{R}^{I \times d}$

where $I$ is the sequence length and $d$ is the embedding dimension.

  • Stacked Transformer Blocks: The concatenated embeddings from all four sources, $\text{concat}(E_l, E_w, E_t, E_e)$, are input to $N$ transformer layers with FlashAttention for scalability:

$$H_t = \text{Transformer}(\text{concat}(E_l, E_w, E_t, E_e)) \in \mathbb{R}^{I \times 4d}$$

  • Prediction Layer: An MLP produces the behavior prediction vector:

$$m = \mathbf{W}_2\left(\sigma(\mathbf{W}_1 H_t + \varepsilon_1)\right) + \varepsilon_2$$

where $\mathbf{W}_1, \mathbf{W}_2$ are learned weights, $\varepsilon_1, \varepsilon_2$ are biases, and $\sigma$ is a nonlinearity.
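The architecture described above can be sketched in PyTorch. This is a minimal illustration of the four-embedding-plus-transformer design, not the authors' implementation: the vocabulary sizes, layer counts, and class name are hypothetical, and standard attention stands in for FlashAttention.

```python
import torch
import torch.nn as nn

class BehaveGPTSketch(nn.Module):
    """Illustrative sketch: four parallel embeddings -> transformer -> MLP head."""

    def __init__(self, n_weekday=7, n_slot=48, n_loc=1000, n_event=114,
                 d=64, n_layers=4, n_heads=4):
        super().__init__()
        # Four parallel embedding tables: weekday, time slot, location, event
        self.emb_w = nn.Embedding(n_weekday, d)
        self.emb_t = nn.Embedding(n_slot, d)
        self.emb_l = nn.Embedding(n_loc, d)
        self.emb_e = nn.Embedding(n_event, d)
        layer = nn.TransformerEncoderLayer(d_model=4 * d, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Prediction MLP: m = W2(sigma(W1 H_t + b1)) + b2
        self.head = nn.Sequential(nn.Linear(4 * d, 4 * d), nn.ReLU(),
                                  nn.Linear(4 * d, n_event))

    def forward(self, w, t, l, e):
        # Concatenate the four embeddings along the feature dimension -> (B, I, 4d)
        h = torch.cat([self.emb_l(l), self.emb_w(w),
                       self.emb_t(t), self.emb_e(e)], dim=-1)
        h = self.encoder(h)
        return self.head(h)  # per-step logits over behavior classes
```

In practice the logits would feed a softmax over behavior classes; the sketch keeps only the shape-level structure of the model.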


2. Pretraining Paradigm: DRO-based Approach

BehaveGPT innovates on pretraining by introducing a Distributionally Robust Optimization (DRO) objective specifically to address the severe long-tail imbalance in behavioral data:

$$P_b^\epsilon = \left\{ p_b \;\Big|\; \epsilon\, p_b(b) \leq p_b^{train}(b), \ \forall b \right\}$$

$$\min_\theta \sup_{p_b \in P_b^\epsilon} \mathbb{E}_{b \sim p_b} [\ell(b; \theta)]$$

  • $p_b^{train}$: observed label distribution
  • $\epsilon$: deviation control (larger deviations allowed for rare classes)
  • $\ell(\cdot)$: per-class loss (e.g., cross-entropy)

Effect:

  • For common behaviors: $P_b^\epsilon$ is small; the loss stays close to the empirical loss
  • For rare behaviors: $P_b^\epsilon$ is wider, so the model must perform well even under greater uncertainty, directly regularizing tail performance
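To make the inner sup concrete: under the constraint $\epsilon\, p_b(b) \leq p_b^{train}(b)$, the adversarial distribution loads as much mass as possible onto the highest-loss classes, capped at $p_b^{train}(b)/\epsilon$ per class. The sketch below is our reading of that constraint set with made-up per-class losses, not the paper's training code.

```python
import numpy as np

def dro_worst_case_loss(class_losses, p_train, eps):
    """Worst-case expected loss over {p : eps * p(b) <= p_train(b) for all b}.

    The adversary greedily assigns probability mass to the highest-loss
    classes, capped at p_train(b)/eps each (a fractional-knapsack solution).
    With eps = 1 this recovers the empirical loss.
    """
    caps = p_train / eps                    # per-class mass ceilings
    order = np.argsort(class_losses)[::-1]  # highest loss first
    mass_left, total = 1.0, 0.0
    for b in order:
        take = min(caps[b], mass_left)
        total += take * class_losses[b]
        mass_left -= take
        if mass_left <= 0:
            break
    return total
```

Smaller `eps` widens the ball, shifting more weight onto poorly-predicted (often rare) classes, which is exactly the tail-regularizing effect described above.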

3. Behavior Prediction Tasks

BehaveGPT is pretrained as a foundation model that can be quickly adapted to, or directly applied in, a wide range of real-world user modeling tasks:

A. Next Behavior Prediction

$$\hat{b}_t = f(x_{t-I}, \ldots, x_{t-1})$$

where each $x_i$ includes weekday, time slot, location, and event.

  • Performance: Achieves 10–20% higher macro and weighted recall than SOTA recommenders and general foundation models on public and industrial datasets

B. New Behavior/Few-shot Prediction

  • Goal: Predict behaviors that were unseen or rare in pretraining using only a few new examples
  • Approach: Reuse pretrained weights for all known classes; only introduce and lightly fine-tune embeddings for new behaviors
  • Result: Outperforms meta-learning and LLM-based methods by >20% recall on low-frequency behaviors
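One common way to realize "reuse pretrained weights, fine-tune only the new classes" is to grow the embedding table and output head while copying the pretrained rows in place. The sketch below assumes a PyTorch model with an `nn.Embedding` event table and a linear prediction head; the function name and initialization scheme are ours, not the paper's.

```python
import torch
import torch.nn as nn

def extend_for_new_behaviors(emb: nn.Embedding, head: nn.Linear, n_new: int):
    """Add rows for n_new unseen behavior classes; reuse all pretrained rows.

    Only the freshly added rows would be fine-tuned; the copied pretrained
    parameters are typically frozen during few-shot adaptation.
    """
    d = emb.embedding_dim
    new_emb = nn.Embedding(emb.num_embeddings + n_new, d)
    new_head = nn.Linear(head.in_features, head.out_features + n_new)
    with torch.no_grad():
        new_emb.weight[: emb.num_embeddings] = emb.weight  # copy pretrained rows
        new_head.weight[: head.out_features] = head.weight
        new_head.bias[: head.out_features] = head.bias
    return new_emb, new_head
```

Because known-class weights are untouched, performance on head classes is preserved while the few new rows can be fit from a handful of examples.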

C. Long-term (Autoregressive) Generation

  • Goal: Given a context, generate the next $N$ steps in a user's behavior trajectory
  • Behavioral realism & diversity: Evaluated with sequence-similarity and n-gram diversity metrics (BLEU, Distinct-2, Kolmogorov–Smirnov statistic, Wasserstein distance, Jensen–Shannon divergence); outperforms generative and foundation models in reproducing distributional and individual behavioral characteristics
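Multi-step generation of this kind is a standard autoregressive rollout: sample the next behavior from the model's last-step logits and append it to the context. The sketch below assumes a model that maps a `(1, I)` index sequence to `(1, I, n_classes)` logits (our interface, not the paper's).

```python
import torch

@torch.no_grad()
def generate(model, context, n_steps, temperature=1.0):
    """Autoregressively extend a behavior sequence by n_steps sampled tokens."""
    seq = context.clone()
    for _ in range(n_steps):
        logits = model(seq)[:, -1, :] / temperature  # logits for the next step
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, 1)            # sample one next behavior
        seq = torch.cat([seq, nxt], dim=1)
    return seq
```

Sampling (rather than greedy argmax) is what lets the rollout score well on diversity metrics such as Distinct-2; the temperature knob trades realism against repetitiveness.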

D. Cross-domain Adaptation

  • Goal: Deploy BehaveGPT's representations to a new (target) domain with domain adaptation—e.g., from app usage to social mobility
  • Selective parameter transfer: Transfer base transformer blocks and embedding layers; newly initialize and train output (prediction) layer appropriate for target domain
  • Result: Gains of >10% macro and weighted recall/NDCG over the strongest prior baselines for apps with little prior behavioral data
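Selective parameter transfer of the kind described above can be sketched as copying every backbone tensor whose name and shape match, while skipping the prediction head so it trains from scratch on the target domain. The `head.` prefix is an assumed naming convention, not from the paper.

```python
import torch

def transfer_backbone(pretrained_state: dict, target_model, skip_prefixes=("head.",)):
    """Copy matching transformer/embedding weights into a target-domain model.

    Parameters under skip_prefixes (the output layer, by our assumed naming)
    keep their fresh initialization and are retrained for the new domain.
    """
    target_state = target_model.state_dict()
    for name, tensor in pretrained_state.items():
        if name.startswith(skip_prefixes):
            continue  # leave the target head freshly initialized
        if name in target_state and target_state[name].shape == tensor.shape:
            target_state[name] = tensor.clone()
    target_model.load_state_dict(target_state)
    return target_model
```

The shape check also handles the case where the target domain uses different vocabulary sizes: mismatched tensors simply stay at their fresh initialization.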

4. Experimental Results

Datasets:

  • Honor: 205M logs, 81k users, 114 app events (mobile)
  • Mobile: 4.1M check-ins, 1k users, 2k apps
  • Tencent: 463k logs, 2k users, 24k events

Metrics: Weighted/macro precision (Prec_w, Prec_m), recall (Rec_w, Rec_m), NDCG, HR, and distributional statistics (KS, WD, Distinct-2, etc.).

Key outcomes:

  • Macro recall improvement > 10% on next behavior prediction and new/few-shot behavior tasks vs. LLM-Rec, LIFT, CASTR, etc.
  • Few-shot/generalization: +20–80% macro recall on classes with as low as 0.04% representation in pretraining
  • Long-term generation: Superior multi-step (BLEU, KS, JSD) result—produces realistic, non-repetitive sequences
  • Cross-domain: +10–35% hits@10/NDCG in transfer across non-overlapping user domains
  • Scaling Law: Shows a consistent log-log decrease in loss with increasing data and model size (a data-to-model ratio of roughly 5 for optimal learning, much lower than for text LLMs)
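A log-log scaling trend of this kind is typically checked by fitting a power law $L \approx a \cdot S^{-\alpha}$ via linear regression in log-log space. The sketch below uses synthetic numbers, not the paper's data.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit loss ~ a * size^(-alpha) by least squares in log-log space.

    A straight line in log-log coordinates is the signature of a scaling law;
    the slope gives -alpha and the intercept gives log(a).
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    return np.exp(intercept), -slope  # (a, alpha)
```

Plotting the residuals of this fit is the usual sanity check that the trend is genuinely power-law rather than merely decreasing.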

5. Applications and Implications

Direct Applications

  • Personalized Recommendations: Balanced, robust predictions, including for rare/long-tail interest areas
  • User Trajectory Simulation: Realistic generative modeling for A/B testing, system simulation, and policy optimization
  • Cold-start & Rare Event Modeling: Full coverage of new/unknown behaviors, a major practical win
  • Cross-domain Personalization: Reusable base models that adapt quickly to new products, services, or app genres

Broader Implications

  • Foundation Model Paradigm: BehaveGPT demonstrates that "foundation model" approaches are not only viable but superior for user behavior modeling, capable of plug-and-play transfer and domain adaptation with strong performance
  • Long-tail Robustness: The DRO-based pretraining sets new standards for fairness and inclusivity in modeling under severe class imbalance, with direct applicability to e-commerce, social media, and mobile app analytics
  • Efficient Scaling: The data-to-model ratio and empirical scaling law provide practical guidance to industry for training efficient, high-performing behavior models, requiring much less data than text LLMs for an equivalent result

Summary Table: BehaveGPT in User Modeling

Aspect             | Details / Advances
-------------------|-------------------------------------------------------------------------
Architecture       | Transformer stacks + 4 parallel feature embeddings + FlashAttention
Pretraining        | DRO-based objective: robust to head-tail imbalance, boosting long-tail generalization
Tasks Supported    | Next-behavior, few-shot/new-behavior, sequence generation, cross-domain adaptation
Performance        | 10–20% macro recall gain vs. SOTA; strong generalization in all reported scenarios
Scaling Law        | Validated for the first time in user behavior modeling; guidance for model scaling and data collection
Key Impact         | Balances accuracy for both popular and rare user actions, critical for personalization fairness
Real-world Utility | Applicable to recommendation, simulation, and adaptation in apps/services with rich behavioral logs

In summary, BehaveGPT establishes a new foundation model paradigm in user behavior modeling—delivering robust, scalable, and transferable embeddings and predictions for a diverse range of real-world user modeling tasks, with technical innovations that directly address class imbalance and domain adaptation at industrial scale.