
Break the Inaccessible Boundary: Distilling Post-Conversion Content for User Retention Modeling

Published 28 Apr 2026 in cs.IR | (2604.25839v1)

Abstract: User retention is a key metric to measure long-term engagement in modern platforms. In a real-time bidding (RTB) advertising system for user re-engagement, the retention model is required to predict future revisit probability at bidding time, before the user converts and consumes any content. Although post-conversion content, termed Onboarding Content, provides highly informative signals for retention prediction, directly using it in training causes severe feature leakage and creates a gap between training and serving. To address this issue, we propose OCARM, a two-stage distillation-aligned framework for Onboarding Content Augmented Retention Modeling, enabling the model to implicitly capture future content using only observable features during inference. In the first stage, we deliberately expose onboarding content to train a hierarchical encoder that produces teacher representations. In the second stage, a user encoder is aligned with the frozen teacher through distillation, allowing the model to approximate the inaccessible onboarding signals without leakage. Extensive offline experiments and online A/B tests demonstrate that our framework achieves consistent improvements in a real-world growth scenario.

Summary

  • The paper introduces OCARM, a two-stage distillation framework that uses onboarding content to create a leakage-free, temporally aware teacher-student model for retention prediction.
  • It employs a Hierarchical Attention Encoder and a Sequence Fusion Encoder to capture temporal dependencies and align representations without feature leakage.
  • Experimental results on a large-scale dataset demonstrate significant improvements in AUC and GAUC, confirming OCARM's effectiveness in enhancing retention metrics.

Distilling Post-Conversion Content for Retention Modeling: OCARM Framework

Introduction and Motivation

Retention modeling is pivotal for the long-term growth of recommender-driven platforms, as it captures post-conversion user engagement more robustly than immediate response signals. In real-time bidding (RTB) advertising scenarios, the retention model must make its prediction before the user returns and experiences any in-app content, so subsequent onboarding content is inaccessible at the moment of the bidding decision. However, empirical analysis shows that onboarding content (user interactions and content consumption immediately after platform re-entry) carries highly predictive signals for future retention. Training directly on these post-conversion features causes severe feature leakage: the model sees inputs during training that do not exist at serving time, which creates a distribution gap between training and deployment and ultimately undermines real-world performance.

Figure 1: The temporal inaccessibility boundary in the user retention journey; bidding decisions occur before onboarding content is available.

OCARM: Two-Stage Distillation-Aligned Retention Modeling

To resolve the tension between onboarding content that is highly informative but inaccessible and the need to prevent feature leakage, the OCARM framework uses a two-stage distillation strategy. Stage 1 deliberately introduces onboarding content as side information during training to learn a hierarchical, temporally aware encoding, which serves as the teacher representation. Stage 2 then treats the prediction model as a student that uses only observable features and learns to approximate the teacher’s onboarding content representation through representation-level distillation. At inference, only permissible pre-conversion features are used, which mitigates the distribution mismatch while still benefiting from the implicit structure of future content consumption.

Figure 2: Overview of OCARM’s two-stage framework: Stage 1 generates teacher representations via a hierarchical attention encoder (HAE) using deliberately exposed onboarding content; Stage 2 distills student representations from observable features for deployment.

Stage 1: Hierarchical Attention Encoder with Onboarding Content

The framework uses a hierarchical attention encoder (HAE) that exploits both intra-day and inter-day temporal dependencies within the multi-day, multi-event onboarding content sequence. Cross-attention conditioned on the user profile summarizes the events within each day, and causal self-attention across days relates each day only to the days before it; together they compress onboarding interactions into semantically rich representations. This onboarding encoding feeds task-specific towers for the different retention horizons (e.g., LT1, LT7), and the representation is jointly optimized with the retention prediction losses to maximize task alignment.
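The two attention levels described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the tensor layout `(days, events, dim)`, the absence of learned projection matrices, and the choice of the last day's output as the teacher vector are all assumptions made to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def intra_day_cross_attention(profile, events):
    """Summarize each day's events with the user profile as the query.

    profile: (dim,) user profile embedding (hypothetical input).
    events:  (days, events_per_day, dim) onboarding event embeddings.
    Returns: (days, dim), one summary vector per day.
    """
    dim = profile.shape[-1]
    scores = events @ profile / np.sqrt(dim)          # (days, events_per_day)
    weights = softmax(scores, axis=-1)
    return np.einsum("de,dek->dk", weights, events)   # weighted sum per day

def inter_day_causal_self_attention(day_summaries):
    """Let each day attend only to itself and earlier days."""
    days, dim = day_summaries.shape
    scores = day_summaries @ day_summaries.T / np.sqrt(dim)    # (days, days)
    future = np.triu(np.ones((days, days), dtype=bool), k=1)   # True above diagonal
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ day_summaries

def hierarchical_encode(profile, events):
    # Assumption: the last day's contextualized vector is used as the teacher output.
    summaries = intra_day_cross_attention(profile, events)
    return inter_day_causal_self_attention(summaries)[-1]
```

Because of the causal mask, changing the events of a later day leaves the outputs for earlier days unchanged, which is the property that makes the encoding temporally aware.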

Stage 2: Sequence Fusion Encoder with Representation Distillation

The student network employs a Sequence Fusion Encoder (SFE), which processes observable user-side features—static profiles, historical interactions, and ad contexts—using cross-modal Q-Former modules to synthesize a unified user representation. Through distillation, the student optimizes a representation-alignment loss (e.g., cosine similarity) with the frozen teacher while simultaneously maintaining predictive supervision for the retention task. During deployment, only the student encoder and retention head are active; all computations are strictly leakage-free.
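A minimal sketch of a Stage 2 objective of this kind is shown below, assuming cosine distance as the alignment term and binary cross-entropy as the retention loss. The weighting factor `alpha` and all function names are hypothetical; the source does not state how the two terms are combined.

```python
import numpy as np

def cosine_alignment_loss(student, teacher, eps=1e-8):
    """Mean (1 - cosine similarity) between student and frozen teacher vectors.

    student, teacher: (batch, dim) arrays. The teacher is treated as a constant,
    so in a real training loop no gradient would flow into it.
    """
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def binary_cross_entropy(logits, labels):
    """Numerically stable BCE on raw logits for a binary retention label."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    per_example = (np.maximum(logits, 0.0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))
    return float(np.mean(per_example))

def stage2_loss(student_repr, teacher_repr, logits, labels, alpha=0.5):
    # alpha is a hypothetical trade-off weight between prediction and alignment.
    return (binary_cross_entropy(logits, labels)
            + alpha * cosine_alignment_loss(student_repr, teacher_repr))
```

At serving time neither `teacher_repr` nor the alignment term is needed: only the student representation and the retention head that produces `logits` are evaluated.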

Experimental Evaluation

The evaluation uses a large-scale industry dataset comprising millions of users and billions of interaction records gathered from a commercial short-video platform. Two prediction tasks, LT1 and LT7 (1-day and 7-day retention), are assessed using AUC and GAUC.
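For readers unfamiliar with GAUC, the sketch below uses a common definition: AUC is computed separately within each group, groups containing only one label class are skipped, and the per-group values are averaged with weights equal to the group sizes. The source does not specify the grouping key or the weighting, so grouping by a `group_ids` array and weighting by sample count are assumptions.

```python
import numpy as np

def auc(labels, scores):
    """Pairwise AUC: probability that a positive outranks a negative (ties count 0.5).

    Returns None when one of the classes is absent and AUC is undefined.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return None
    diff = pos[:, None] - neg[None, :]            # all positive/negative pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def gauc(group_ids, labels, scores):
    """Sample-count-weighted average of per-group AUC, skipping single-class groups."""
    group_ids = np.asarray(group_ids)
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    total, weight = 0.0, 0
    for gid in np.unique(group_ids):
        mask = group_ids == gid
        group_auc = auc(labels[mask], scores[mask])
        if group_auc is None:
            continue
        n = int(mask.sum())
        total += n * group_auc
        weight += n
    return total / weight if weight else None
```

The pairwise AUC here is quadratic in group size and is meant only for illustration; a rank-based formulation would be preferable on large groups.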

Offline analysis shows that models trained with onboarding content (leakage enabled) establish a strong upper bound for retention prediction. When only Stage 2 is used, without a pre-trained teacher to align against, performance deteriorates, which is attributed to unstable or collapsed representations in the absence of a robust teacher. The full OCARM pipeline, which first learns high-quality teacher representations and then distills them into the student, achieves clear and consistent improvements over baseline models that lack access to onboarding content.

Architecture ablations show progressive gains: simple MLP encoders for both teacher and student underperform, while replacing them with HAE for the teacher and SFE for the student yields the strongest lift, indicating that expressive encoders are needed on both sides for the alignment to pay off.

Representation Similarity and Performance

A key mechanistic observation is the positive monotonic relationship between retention prediction performance and the similarity of user and onboarding content representations. As the student's user-side representation aligns more closely with the teacher-provided representation (higher similarity), the model bridges the inaccessibility gap more effectively and retention metrics improve.


Figure 3: Performance improvement of retention prediction as a function of increased representation similarity—demonstrating that better distillation translates into stronger retention gains.
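One way to check a monotonic relationship of this kind is to log, for several checkpoints, the mean student-teacher cosine similarity alongside a retention metric and compute a rank correlation between the two series. The sketch below is an illustration only: the function names are hypothetical, the rank computation assumes no tied values, and the source does not say how the relationship in Figure 3 was measured.

```python
import numpy as np

def mean_cosine_similarity(student, teacher, eps=1e-8):
    """Average cosine similarity between paired (batch, dim) representations."""
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    return float(np.mean(np.sum(s * t, axis=-1)))

def spearman_rho(x, y):
    """Spearman rank correlation, assuming neither sequence contains ties."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```

A value of `spearman_rho` close to 1 across checkpoints would be consistent with the monotonic trend described above, although it would not by itself establish that higher similarity causes the metric gain.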

Online Deployment and Practical Implications

A/B tests conducted in a live industrial RTB system show that OCARM yields significant improvements in re-engaged device rates and long-term retention, most notably for cohorts of users who had previously uninstalled the app. This result underscores the framework's efficacy for user segments with inherently low re-engagement probabilities. The approach establishes a practical path for leveraging post-conversion signals without incurring feature leakage in real-world applications, linking growth and recommendation systems.

Theoretical Implications and Future Directions

OCARM’s two-stage pipeline formalizes a strategy to approximate counterfactual and temporally unavailable information through latent space distillation. This paradigm, situated at the boundary of causal inference and deep representation learning, highlights the power of structured knowledge transfer even under strict feature visibility constraints. Future development may explore more fine-grained sequential and cross-domain distillation objectives, adaptive teacher-student dynamics, or reinforcement-informed supervisory signals—targeting the residual upper-bound gap revealed in current evaluations. Expansion into cold-start and multi-modal retention prediction pipelines represents another promising line of inquiry.

Conclusion

OCARM advances retention modeling by exploiting onboarding content as a latent teacher during training, aligning student models in the absence of explicit post-conversion data. The methodology refines the boundary between feature exploitation and leakage prevention, yielding both robust offline lifts and significant online retention gains in production-scale systems. The framework delineates a new direction for distillation-based transfer in temporally constrained sequential modeling architectures.
