Selective Test-Time Compute Scaling for Click-Through Rate Prediction via Uncertainty-Triggered Feature Path Exploration

Published 24 May 2026 in cs.LG, cs.AI, and cs.IR | (2605.24989v1)

Abstract: Scaling test-time compute has proven highly effective for LLMs, yet this opportunity remains largely unexplored for industrial Click-Through Rate (CTR) prediction. CTR models suffer from a fundamental asymmetry: feature combinations well-represented in training yield confident predictions, while sparsely observed ones produce unreliable outputs. Existing training-phase solutions such as adaptive gating learn a fixed selection function subject to the same sparsity, offering no per-instance recourse at deployment. We propose UTTSI (Uncertainty-Triggered Test-Time Selective Inference), a training-free model-agnostic framework that scales inference depth proportionally to per-instance uncertainty. A dual-signal estimator combining model logit confidence with a data-level frequency prior distinguishes epistemic uncertainty from aleatoric ambiguity. Every instance undergoes adaptive feature filtering to remove unreliable embeddings; uncertain instances additionally receive stochastic feature-path explorations whose predictions are aggregated via consistency-weighted ensembling. Confident instances bypass exploration entirely, keeping average overhead at approximately $2.8\times$ base model cost with worst-case latency unchanged. Experiments on four datasets with three backbone architectures demonstrate consistent, statistically significant gains over all training-phase baselines. A seven-day online A/B test further confirms a 5.3% relative CTR gain ($p < 0.01$), establishing selective test-time compute allocation as a practical complement to training-phase advances for CTR prediction.


Authors (6)

Summary

The paper introduces a test-time optimization framework (UTTSI) for CTR prediction, leveraging per-instance uncertainty to scale compute resources efficiently.
UTTSI's dual-signal uncertainty estimation combines data-level frequency confidence with model-level logit confidence to guide selective inference depth.
Experimental results show that UTTSI significantly improves offline CTR prediction metrics (AUC and logloss) and delivers a 5.3% relative CTR uplift in a seven-day online A/B test.

Selective Test-Time Compute Scaling for CTR Prediction via UTTSI

Motivation and Problem Formulation

Click-through rate (CTR) prediction underpins industrial recommendation systems, relying on deep models to estimate user-item click probabilities. Despite advances in feature interaction modeling and generative paradigms, prior work has focused exclusively on training-phase optimization. A fundamental challenge persists: inference instances exhibit heterogeneous prediction reliability due to feature combination sparsity, especially when tail-value embeddings arise from rarely observed feature configurations. Existing adaptive gating or feature selection mechanisms learned at training suffer from the same sparsity and offer no plasticity at deployment, resulting in unreliable predictions for a substantial fraction of instances. The paper addresses the asymmetry between training (learning generalizable representations across feature diversity) and inference (requiring robust decisions only over well-learned combinations), and proposes a principled test-time optimization framework to allocate computation selectively based on per-instance predictive uncertainty.

UTTSI Framework: Methodology

The proposed Uncertainty-Triggered Test-Time Selective Inference (UTTSI) is a model-agnostic, training-free plug-in applied after training to any backbone CTR model. It consists of three interconnected modules:

Frequency Prior Estimation: UTTSI uses a Count-Min Sketch, a probabilistic hashing structure, to compactly index feature value frequencies in the training corpus. This data-level prior quantifies representation reliability for each feature, enabling downstream uncertainty discrimination without excessive memory overhead.
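To make the frequency prior concrete, the following is a minimal Count-Min Sketch in pure Python. It is an illustrative sketch, not the paper's implementation: the class name, the `width` and `depth` defaults, the `field=value` key format, and the use of salted BLAKE2b hashes are all assumptions made here for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates never undercount, but hash
    collisions can make them overcount."""

    def __init__(self, width=2048, depth=4):
        # width/depth are illustrative defaults, not values from the paper.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One hash function per row, obtained by salting with the row id.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, field, value, count=1):
        key = f"{field}={value}"
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, field, value):
        # The minimum over rows is the least collision-inflated counter.
        key = f"{field}={value}"
        return min(
            self.table[row][self._index(key, row)]
            for row in range(self.depth)
        )


# Usage: count feature values seen in a (toy) training corpus.
sketch = CountMinSketch()
for value in ["u1", "u1", "u1", "u2"]:
    sketch.add("user_id", value)
head_count = sketch.estimate("user_id", "u1")  # at least 3
tail_count = sketch.estimate("user_id", "u2")  # at least 1
```

Memory is fixed at `width * depth` counters regardless of vocabulary size, which is the property that makes this kind of structure attractive for high-cardinality CTR features.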

Dual-Signal Uncertainty Estimation: For each instance, UTTSI computes the prediction logit and associated input embedding attribution (gradient norm), yielding a model-internal confidence measure. Separately, an attribution-weighted aggregation of normalized feature frequencies forms the frequency prior. The uncertainty score $u(x)$ is a convex combination of model confidence and frequency confidence, balancing epistemic uncertainty (from data sparsity) and aleatoric ambiguity (from decision boundary proximity). This score continuously determines the selective inference depth for each instance: the number of exploration paths $K(x) = \lfloor K_{max} \cdot u(x) \rfloor$.
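The mapping from the two signals to a path count can be sketched as follows. Only the rule $K(x) = \lfloor K_{max} \cdot u(x) \rfloor$ and the idea of a convex combination come from the summary above; the specific confidence definitions (`|2p - 1|` for the logit signal, an attribution-weighted mean for the frequency signal) and the choice to set the uncertainty to one minus the blended confidence are assumptions for illustration, as are all function names.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def model_confidence(logit):
    # Assumed form: distance of the predicted probability from 0.5,
    # rescaled to [0, 1].
    return abs(2.0 * sigmoid(logit) - 1.0)


def frequency_confidence(norm_freqs, attributions):
    # Attribution-weighted mean of per-feature frequencies in [0, 1].
    total = sum(attributions)
    if total == 0:
        return sum(norm_freqs) / len(norm_freqs)
    return sum(a * f for a, f in zip(attributions, norm_freqs)) / total


def uncertainty(logit, norm_freqs, attributions, alpha=0.5):
    # Assumed form: one minus a convex combination of the two confidences.
    conf = (alpha * model_confidence(logit)
            + (1.0 - alpha) * frequency_confidence(norm_freqs, attributions))
    return 1.0 - conf


def num_paths(u, k_max=8):
    # K(x) = floor(K_max * u(x)); K(x) = 0 means exploration is skipped.
    return math.floor(k_max * u)


# A borderline logit on frequent features vs. a confident one.
k_borderline = num_paths(uncertainty(0.0, [1.0, 1.0], [1.0, 1.0]))
k_confident = num_paths(uncertainty(50.0, [1.0, 1.0], [1.0, 1.0]))
```

Under these assumed definitions, an instance with a borderline logit but well-covered features receives a moderate number of paths, while a confident prediction on well-covered features receives none.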

Feature Filtering and Path Exploration: All instances undergo adaptive feature filtering, discarding unreliable features below per-field thresholds computed from composite frequency-attribution scores (accounting for inter-field heterogeneity). Uncertain instances ($K(x)>0$) trigger stochastic feature-path exploration: iterative Bernoulli sampling constructs diverse feature subsets guided by composite reliability, with multiple parallel paths aggregated via consistency-weighted ensembling that penalizes outlier predictions. Confident instances ($K(x)=0$) bypass the exploration step, incurring only filtering overhead.
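The exploration step described above can be sketched as Bernoulli feature-path sampling followed by a consistency-weighted average. This is a hedged illustration only: the linear composite score, the keep-probability equal to that score, the weighting `exp(-lam * |p - median|)`, and the `predict(mask)` callable standing in for the frozen backbone are all assumptions, and per-field filtering is omitted for brevity.

```python
import math
import random


def composite_score(norm_freq, attribution, beta=0.5):
    # Assumed form: linear blend of two scores already scaled to [0, 1].
    return beta * norm_freq + (1.0 - beta) * attribution


def sample_path(scores, rng):
    """Draw one feature subset: keep feature i with probability scores[i]."""
    mask = [rng.random() < min(1.0, max(0.0, s)) for s in scores]
    if not any(mask):
        # Never return an empty path; fall back to the most reliable feature.
        best = max(range(len(scores)), key=scores.__getitem__)
        mask[best] = True
    return mask


def consistency_weighted_mean(preds, lam=10.0):
    # Predictions far from the median get exponentially smaller weights.
    center = sorted(preds)[len(preds) // 2]  # upper median for even counts
    weights = [math.exp(-lam * abs(p - center)) for p in preds]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)


def explore(predict, scores, k, seed=0, lam=10.0):
    # predict is a hypothetical callable: feature mask -> click probability.
    if k < 1:
        raise ValueError("exploration needs at least one path")
    rng = random.Random(seed)
    preds = [predict(sample_path(scores, rng)) for _ in range(k)]
    return consistency_weighted_mean(preds, lam)


def toy_predict(mask):
    # Stand-in for the backbone: more retained features, higher score.
    return 0.1 + 0.2 * sum(mask) / len(mask)


p = explore(toy_predict, scores=[0.9, 0.6, 0.2], k=4)
```

Because the paths are independent given the scores, the `k` calls to `predict` could be batched or run in parallel, which is consistent with the summary's point that exploration need not extend worst-case latency.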

Experimental Results and Analysis

Empirical validation spans four datasets (Criteo, Avazu, KDD12, and an industrial dataset) and three backbone architectures: OptFu, HSTU, and PLE. UTTSI demonstrates consistent, statistically significant improvements in AUC and logloss relative to both classical and state-of-the-art feature interaction, multi-expert, and NAS-based models. On the sparse-feature datasets (KDD12 and the industrial dataset), the gains are more pronounced, aligning with UTTSI's targeted mitigation of tail-value uncertainty.

Ablation studies dissect the contributions of UTTSI's modules:

Removing dual-signal uncertainty estimation (logit-only confidence) reduces calibration and performance, confirming the necessity of the frequency prior for flagging unreliable representation.
Random feature sampling (removing attribution guidance) degrades performance, especially in high-dimensional regimes, indicating the importance of prioritizing predictive features during path exploration.
Using only single-path inference per instance (removing ensembling) lowers robustness and reduces gains, substantiating the role of multi-path aggregation in variance reduction.

Hyperparameter sensitivity analysis reveals that performance is robust across reasonable ranges for $K_{max}$ (optimal at 8), dual-signal weight $\alpha$, filtering quantile $\rho$, composite score balance $\beta$, and ensemble sharpness $\lambda$. The average compute overhead is $2.8\times$ the base model cost, while worst-case latency is unchanged because exploration paths run in parallel.
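As a rough back-of-the-envelope reading of the $2.8\times$ figure, assume (this is not stated in the source) that the base prediction and each exploration path cost one backbone forward pass $c$, and that filtering and attribution costs are negligible by comparison. Under that simplification:

$$\frac{\mathbb{E}[\mathrm{cost}(x)]}{c} = 1 + \mathbb{E}[K(x)] \approx 2.8 \quad\Longrightarrow\quad \mathbb{E}[K(x)] \approx 1.8$$

With $K_{max} = 8$, an average of roughly 1.8 paths per instance would mean that most instances receive few or no exploration paths. If attribution or filtering carries non-trivial cost, the implied average path count would be lower.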

Calibration analysis confirms that dual-signal uncertainty scores correlate strongly with prediction error (Spearman rank correlation up to 0.91), outperforming logit-only alternatives. Stratified evaluation shows that UTTSI yields gains across all uncertainty levels, with the largest benefit in the high-uncertainty subgroup, while low-uncertainty samples also profit from filtering.
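A check of this kind can be sketched with a small pure-Python Spearman rank correlation between uncertainty scores and per-instance errors. The function names and the toy numbers below are illustrative assumptions; the source does not specify how errors were measured or binned.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(a), ranks(b))


# Toy data: error grows monotonically with the uncertainty score.
u_scores = [0.1, 0.4, 0.9]
errors = [0.01, 0.20, 0.50]
rho_s = spearman(u_scores, errors)
```

A rank-based measure is a natural choice here because the path count depends only on how instances are ordered by uncertainty, not on the absolute scale of the score.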

A seven-day online A/B test on an e-commerce platform delivers a 5.3% relative uplift in CTR ($p < 0.01$), validating the practical gains of selective inference. Frequency prior maintenance remains tractable via daily incremental sketch updates.

Theoretical and Practical Implications

UTTSI advances CTR prediction by introducing a principled test-time optimization paradigm that non-uniformly allocates inference resources according to per-instance uncertainty. Its design is congruent with industrial constraints: full compatibility with frozen models, parallelizable computation, negligible impact on latency SLAs, and adaptive integration with evolving data distributions.

Theoretically, UTTSI decouples the objectives of training from inference, formalizing bias-variance decomposition at prediction time and operationalizing uncertainty-driven exploration without retraining. The dual-signal uncertainty score—combining data-level and model-level signals—may inspire analogous selective inference in other sparse-input domains.

Practically, UTTSI enables serving architectures to maximize utility from fixed deployed models, investing computationally only where it most improves outcome metrics. Its plug-and-play character makes it widely applicable, and the substantial improvements in online CTR suggest a strong ROI when layered atop expensive, large-scale CTR backbones.

Future Directions in AI

Future AI developments may extend UTTSI's selective inference design to other domains where input sparsity, feature combination explosion, or uncertainty quantification are challenging—such as recommender systems with graph or sequence augmentation, medical or financial predictive modeling, and more generally, settings with high epistemic uncertainty due to distributional shift.

Further research may generalize the consistency-weighted aggregation to deeper or hierarchical ensembles, explore reinforcement learning for path allocation, and investigate joint train-time and test-time uncertainty optimization. Dynamic resource-constrained serving could integrate UTTSI's uncertainty signals as scheduling priorities in production systems.

Conclusion

Selective test-time compute scaling via UTTSI constitutes a formally founded, empirically validated, and operationally efficient framework for CTR prediction, compatible with the backbones evaluated and tractable in industrial deployment. By quantifying and targeting predictive uncertainty, it delivers consistent gains with controlled overhead, positioning selective inference as a practical complement to training-phase improvements in CTR and broader AI systems (2605.24989).
