
RecGPT Technical Report (2507.22879v2)

Published 30 Jul 2025 in cs.IR and cs.CL

Abstract: Recommender systems are among the most impactful applications of artificial intelligence, serving as critical infrastructure connecting users, merchants, and platforms. However, most current industrial systems remain heavily reliant on historical co-occurrence patterns and log-fitting objectives, i.e., optimizing for past user interactions without explicitly modeling user intent. This log-fitting approach often leads to overfitting to narrow historical preferences, failing to capture users' evolving and latent interests. As a result, it reinforces filter bubbles and long-tail phenomena, ultimately harming user experience and threatening the sustainability of the whole recommendation ecosystem. To address these challenges, we rethink the overall design paradigm of recommender systems and propose RecGPT, a next-generation framework that places user intent at the center of the recommendation pipeline. By integrating LLMs into key stages of user interest mining, item retrieval, and explanation generation, RecGPT transforms log-fitting recommendation into an intent-centric process. To effectively align general-purpose LLMs to the above domain-specific recommendation tasks at scale, RecGPT incorporates a multi-stage training paradigm, which integrates reasoning-enhanced pre-alignment and self-training evolution, guided by a Human-LLM cooperative judge system. Currently, RecGPT has been fully deployed on the Taobao App. Online experiments demonstrate that RecGPT achieves consistent performance gains across stakeholders: users benefit from increased content diversity and satisfaction, while merchants and the platform gain greater exposure and conversions. These comprehensive improvements across all stakeholders validate that LLM-driven, intent-centric design can foster a more sustainable and mutually beneficial recommendation ecosystem.



Summary

  • The paper introduces an intent-centric framework that leverages LLMs and multi-task fine-tuning to enhance recommendation quality in industrial-scale systems.
  • It employs hierarchical sequence compression, curriculum learning, and a tri-tower retrieval design to efficiently manage ultra-long user histories and semantic alignment.
  • Empirical results on Taobao show significant improvements in user engagement, fairness, and key performance metrics compared to traditional log-fitting approaches.

RecGPT: An Intent-Centric LLM Framework for Industrial-Scale Recommendation

Introduction and Motivation

RecGPT introduces a paradigm shift in industrial recommender systems by centering the recommendation pipeline on explicit user intent, leveraging LLMs for end-to-end reasoning across user interest mining, item retrieval, and explanation generation. The framework is motivated by the limitations of traditional log-fitting approaches, which overfit to historical co-occurrence patterns, reinforce filter bubbles, and exacerbate long-tail and Matthew effects. RecGPT addresses these issues by integrating LLMs with domain-specific alignment and a human-LLM cooperative judge system, enabling scalable, interpretable, and intent-driven recommendations in a real-world e-commerce environment (Taobao).

System Architecture and Workflow

RecGPT operationalizes an LLM-driven closed-loop pipeline comprising four core modules: user interest mining, item tag prediction, item retrieval, and recommendation explanation generation (Figure 1).

Figure 1: The overall workflow of RecGPT, showing the integration of LLMs for user interest mining, item tag prediction, and explanation generation.

User Interest Mining

The user interest mining module employs a dedicated LLM ($\mathcal{LLM}_{\text{UI}}$) to process lifelong, multi-behavior user sequences. To address context window limitations and domain knowledge gaps, RecGPT introduces a hierarchical behavioral sequence compression pipeline:

  • Reliable Behavior Extraction: Filters for high-signal behaviors (e.g., purchases, add-to-cart, detailed views, search queries), excluding noisy clicks.
  • Hierarchical Compression: Compresses item-level and sequence-level data, partitioning by time and aggregating by behavior and item, reducing sequence length while preserving temporal and co-occurrence structure.

This process enables 98% of user sequences to fit within a 128k-token context window, with a 29% improvement in inference efficiency (Figure 2).

Figure 2: Compression and alignment in user interest mining, including behavior extraction and hierarchical compression.
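A minimal sketch of this compression step, assuming a simple event schema; the behavior names and the time-bucketing scheme below are illustrative assumptions, not the production implementation:

```python
from collections import defaultdict

RELIABLE = {"purchase", "add_to_cart", "detail_view", "search"}  # high-signal behaviors

def compress_sequence(events, bucket_days=30):
    """Compress a lifelong behavior sequence into short summary lines.

    events: list of dicts with keys 'item', 'behavior', 'ts' (epoch seconds).
    Noisy clicks are dropped; the rest is partitioned into coarse time
    buckets and aggregated per (behavior, item), preserving temporal order
    and co-occurrence structure while shrinking the token count.
    """
    counts = defaultdict(int)
    for e in events:
        if e["behavior"] not in RELIABLE:            # reliable behavior extraction
            continue
        bucket = e["ts"] // (bucket_days * 86400)    # coarse time partition
        counts[(bucket, e["behavior"], e["item"])] += 1

    # One compact line per aggregate, oldest bucket first, e.g. "purchase x3: running shoes"
    return "\n".join(f"{beh} x{n}: {item}"
                     for (bucket, beh, item), n in sorted(counts.items()))
```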

A multi-stage alignment framework is used for $\mathcal{LLM}_{\text{UI}}$:

  • Curriculum Learning-based Multi-task Fine-tuning: 16 subtasks, topologically sorted by difficulty, build foundational and reasoning skills.
  • Reasoning-Enhanced Pre-alignment: Knowledge distillation from a strong teacher LLM (DeepSeek-R1) with manual curation.
  • Self-Training Evolution: Iterative self-generated data, filtered by a human-LLM judge, further refines the model (see the sketch below).

Prompt engineering incorporates chain-of-thought (CoT) reasoning, and data quality control enforces criteria on willingness and behavioral correlation to mitigate hallucination and spurious interests.
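The self-training evolution stage can be sketched as a generate-filter-retrain loop. In this sketch, `generate`, `accept`, and `finetune` are hypothetical stand-ins for the production model, the cooperative judge, and the training job:

```python
from typing import Callable, List, Tuple

def self_training_round(
    generate: Callable[[str], str],                     # model's generation step
    accept: Callable[[str, str], bool],                 # human-LLM judge filter
    finetune: Callable[[List[Tuple[str, str]]], None],  # fine-tuning job
    prompts: List[str],
    min_kept: int = 1000,
) -> List[Tuple[str, str]]:
    """One iteration: self-generate, judge-filter, retrain on accepted data."""
    candidates = [(p, generate(p)) for p in prompts]
    kept = [(p, c) for p, c in candidates if accept(p, c)]  # drop judged-bad samples
    if len(kept) >= min_kept:   # avoid retraining on too little clean data
        finetune(kept)
    return kept
```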

Item Tag Prediction

The item tag prediction module ($\mathcal{LLM}_{\text{IT}}$) generates fine-grained, semantically rich item tags aligned with inferred user interests. The alignment process mirrors that of user interest mining, with additional constraints in prompt design to enforce:

  • Interest consistency
  • Diversity (minimum 50 tags)
  • Semantic precision
  • Temporal freshness
  • Seasonal relevance

Tags are generated in a (modifier + core-word) form, each linked to a user interest and a supporting rationale. Data quality control enforces relevance, consistency, specificity, and validity (Figure 3).

Figure 3: Two-stage alignment and incremental learning for item tag prediction, with data cleaning and adaptation to online feedback.

A bi-weekly incremental learning strategy adapts $\mathcal{LLM}_{\text{IT}}$ to evolving user and item distributions, using data purification, interest completion, and data balancing to construct high-quality, balanced training sets. This yields a 1.05% improvement in HR@30 for tag prediction accuracy.
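HR@K here is the standard hit rate at K: the fraction of test cases whose ground-truth tag appears among the top-K predictions. A minimal sketch (the data shapes are assumptions):

```python
from typing import List, Set

def hit_rate_at_k(predictions: List[List[str]], truths: List[Set[str]], k: int = 30) -> float:
    """predictions[i] is a ranked tag list; truths[i] is the set of relevant tags."""
    hits = sum(1 for preds, truth in zip(predictions, truths)
               if any(tag in truth for tag in preds[:k]))
    return hits / len(predictions)

# Example: one user hit within the top 30, one miss -> HR@30 = 0.5
print(hit_rate_at_k([["red dress", "silk scarf"], ["mug"]],
                    [{"silk scarf"}, {"teapot"}], k=30))
```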

User-Item-Tag Retrieval

RecGPT introduces a tri-tower retrieval architecture, jointly optimizing user, item, and tag representations:

  • Item Tower: Embeds item features (sparse and dense) via DNNs.
  • User Tower: Aggregates multi-behavioral sequences and user ID.
  • Tag Tower: Encodes LLM-generated tags via mean-pooled token embeddings and DNNs.

Figure 4: User-Item-Tag Retrieval Framework, combining semantic and collaborative signals for candidate selection.

Retrieval scores are computed as:

  • Collaborative: $\hat{y}_{\text{col}} = \mathbf{h}_u^T \mathbf{h}_v$
  • Semantic: $\hat{y}_{\text{sem}} = \mathbf{h}_t^T \mathbf{h}_v$

Optimization combines collaborative and semantic contrastive losses, with category-aware regularization. Online inference fuses user and tag tower outputs via a tunable $\beta$ parameter, enabling dynamic control over semantic vs. collaborative emphasis, as sketched below.
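A minimal sketch of the tri-tower scoring and $\beta$-fused inference, assuming PyTorch; the single-linear-layer towers, embedding dimensions, and linear fusion form are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Placeholder tower: one linear layer standing in for the production DNN."""
    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Linear(in_dim, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

user_tower, item_tower, tag_tower = Tower(128), Tower(256), Tower(96)

h_u = user_tower(torch.randn(1, 128))  # aggregated behaviors + user ID
h_v = item_tower(torch.randn(1, 256))  # sparse + dense item features
h_t = tag_tower(torch.randn(1, 96))    # mean-pooled LLM tag token embeddings

y_col = (h_u * h_v).sum(-1)            # collaborative score  h_u^T h_v
y_sem = (h_t * h_v).sum(-1)            # semantic score       h_t^T h_v

# Online inference: fuse user- and tag-tower outputs with a tunable beta.
beta = 0.3                              # higher beta -> more semantic emphasis
query = (1 - beta) * h_u + beta * h_t
score = (query * h_v).sum(-1)           # equals (1 - beta) * y_col + beta * y_sem
print(float(y_col), float(y_sem), float(score))
```

Because the fusion is linear, scoring the fused query against $\mathbf{h}_v$ is equivalent to interpolating the two scores directly, so $\beta$ can be tuned online without retraining the towers.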

Recommendation Explanation Generation

A dedicated LLM ($\mathcal{LLM}_{\text{RE}}$) generates personalized, context-aware explanations for each recommendation, enhancing transparency and user trust. The alignment process includes reasoning-enhanced pre-alignment and self-training, with prompt engineering enforcing relevance, factuality, clarity, and safety (Figure 5).

Figure 5: Task alignment and offline production for recommendation explanation generation.

To meet latency constraints, explanations are precomputed offline for interest-item pairs and served via a lookup table at inference time.
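A sketch of this precompute-and-lookup path; `generate_explanation` stands in for $\mathcal{LLM}_{\text{RE}}$, and in production the table would live in a key-value store rather than a Python dict:

```python
from typing import Callable, Dict, Iterable, Tuple

def build_explanation_table(
    pairs: Iterable[Tuple[str, str]],                 # (interest, item) pairs
    generate_explanation: Callable[[str, str], str],  # offline LLM call
) -> Dict[Tuple[str, str], str]:
    """Offline job: precompute one explanation per (interest, item) pair."""
    return {(interest, item): generate_explanation(interest, item)
            for interest, item in pairs}

def serve_explanation(table: Dict[Tuple[str, str], str],
                      interest: str, item: str) -> str:
    """Online path: O(1) table lookup, no LLM call inside the latency budget."""
    return table.get((interest, item), "Recommended for you")  # assumed fallback copy
```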

Human-LLM Cooperative Judge System

To ensure data and model quality at scale, RecGPT employs a hybrid human-LLM judge system:

  • LLM-as-a-Judge: Fine-tuned on human-annotated data, with data rebalancing (minority class augmentation, recency-prioritized downsampling) to mitigate majority class bias (see the rebalancing sketch below).
  • Human-in-the-Loop: Milestone-based human supervision triggers retraining when LLM judge performance degrades due to distribution shift.

Figure 6: Human-LLM Cooperative Judge System, enabling scalable, adaptive evaluation and curation.

This system enables efficient, reliable evaluation and curation across user interest mining, item tag prediction, and explanation generation, with strong alignment to human standards.
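The rebalancing step used to prepare the judge's training data can be sketched as follows; the per-class targets and recency weighting are assumptions:

```python
import random

def rebalance(samples, target_per_class=None, seed=0):
    """samples: list of dicts with 'label' (e.g. 'good'/'bad') and 'ts' (epoch seconds).

    Downsamples majority classes with a recency bias and upsamples minority
    classes by resampling, returning a class-balanced training set.
    """
    rng = random.Random(seed)
    by_label = {}
    for s in samples:
        by_label.setdefault(s["label"], []).append(s)

    target = target_per_class or min(len(v) for v in by_label.values())
    out = []
    for label, group in by_label.items():
        if len(group) > target:
            # recency-prioritized downsampling: keep the newest examples
            group = sorted(group, key=lambda s: s["ts"], reverse=True)[:target]
        else:
            # minority class augmentation via resampling with replacement
            group = group + [rng.choice(group) for _ in range(target - len(group))]
        out.extend(group)
    rng.shuffle(out)
    return out
```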

Empirical Results and Analysis

Online A/B Testing

RecGPT was deployed in the "Guess What You Like" scenario on Taobao, serving 1% of top active users. Key findings:

  • User Experience: CICD +6.96%, DT +4.82%, EICD +0.11%
  • Platform Metrics: CTR +6.33%, IPV +9.47%, DCAU +3.72%, ATC +3.91%
  • Merchant Fairness: More uniform CTR and PVR across item popularity groups, mitigating the Matthew effect and improving long-tail exposure (see the sketch below).

Figure 7: Online performance of RecGPT vs. baseline, showing improvements in CTR, IPV, DCAU, CICD, and DT, and more equitable performance across item popularity groups.
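One way to reproduce the fairness check behind Figure 7 is to bucket items by popularity and compare per-bucket CTR spread (a flatter profile indicates less Matthew effect); the bucket edges and field names below are assumptions:

```python
from statistics import pstdev

def ctr_by_popularity(events, edges=(100, 1000, 10000)):
    """events: list of dicts with 'impressions_of_item' (popularity proxy)
    and 'clicked' (bool). Returns per-bucket CTRs and their spread."""
    buckets = [[0, 0] for _ in range(len(edges) + 1)]  # [clicks, impressions]
    for e in events:
        i = sum(e["impressions_of_item"] >= edge for edge in edges)
        buckets[i][0] += e["clicked"]
        buckets[i][1] += 1
    ctrs = [c / n if n else 0.0 for c, n in buckets]
    return ctrs, pstdev(ctrs)   # lower spread -> more uniform exposure quality
```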

Human and LLM-Judge Evaluation

Fine-tuned LLM judges (Qwen3-Judge-SFT) achieve high agreement with human annotators across all tasks (e.g., 93.08% accuracy for item tag prediction, 89.76% for explanation generation), enabling scalable, reliable evaluation.

Case Studies

Figure 8: Case studies demonstrating RecGPT's ability to mine nuanced interests and generate contextually relevant recommendations and explanations.

User Experience Investigation

A user study with 500 participants showed a reduction in perceived recommendation redundancy (from 37.1% to 36.2%), with the effect more pronounced in top recommendation slots and when controlling for ad interference.

Implementation Considerations

  • Model Efficiency: FP8 quantization, KV caching, and MoE architectures (e.g., TBStars-42B-A3.5B) enable scalable deployment with only 3.5B active parameters per inference.
  • Sequence Handling: Hierarchical compression and context engineering are critical for fitting ultra-long user histories within LLM context windows.
  • Incremental Learning: Regular online adaptation is necessary to track evolving user and item distributions.
  • Evaluation: Automated LLM-judge systems, aligned with human standards, are essential for scalable, high-quality curation and monitoring.

Limitations and Future Directions

  • Ultra-Long Sequence Modeling: 2% of user sequences still exceed context limits; further advances in context engineering and memory management are needed.
  • Multi-Objective Joint Optimization: Current pipeline is modular; future work will explore RL-based joint optimization across all tasks using unified online feedback.
  • End-to-End LLM-as-a-Judge: Current evaluation is task-specific; future work will develop holistic, RLHF-trained judge models for integrated, multi-task assessment.

Figure 9: Curriculum learning-based multi-task fine-tuning framework for progressive domain adaptation.

Conclusion

RecGPT demonstrates that intent-centric, LLM-driven recommendation pipelines can be deployed at industrial scale, yielding measurable improvements in user experience, platform engagement, and merchant fairness. The architecture's modularity, interpretability, and adaptability—enabled by systematic alignment, efficient model design, and scalable evaluation—provide a robust foundation for future research and deployment of LLM-based recommender systems. The framework's success in a billion-scale real-world environment validates the practical viability of LLMs-for-RS and sets a new standard for intent-driven, explainable, and fair recommendation in dynamic, large-scale ecosystems.
