Hierarchical Group-wise Ranking Framework
- Hierarchical group-wise ranking is a framework that partitions users using residual vector quantization to form multi-level groups for improved recommendation tasks.
- It applies a listwise cross-entropy loss within each group, creating a curriculum from coarse negatives to hard negatives for more effective ranking.
- Empirical results demonstrate improved calibration and GAUC while maintaining scalability and serving compatibility in real-world recommender systems.
A hierarchical group-wise ranking framework is a system for improving learning-to-rank objectives in recommendation models by partitioning the user space into recursively finer groups and optimizing ranking losses within each group. This approach is motivated by the need to present more informative negatives to ranking models—negatives that reflect realistic competition for user attention and expose user-item preferences more effectively than conventional in-batch negative sampling. The framework relies on hierarchical clustering of user embeddings via residual vector quantization (RVQ) to create a scalable, trie-like structure of user groups. Within each group at each depth of the hierarchy, a listwise loss is applied over the associated user-item samples, producing a multi-level, curriculum-like progression from easy negatives (coarse groups) to hard negatives (fine groups). This enables improved calibration and ranking performance, all without the need for complex retrieval architectures or dynamic context collection.
1. Hierarchical Group Partitioning with Residual Vector Quantization
The framework’s first component is the generation of hierarchical user codes using RVQ. Let $\mathbf{e}_u \in \mathbb{R}^d$ denote a user’s continuous embedding. RVQ encodes $\mathbf{e}_u$ as a sequence of discrete code indices $(c_1, \ldots, c_L)$ by iteratively quantizing the residual:

$$\mathbf{r}_0 = \mathbf{e}_u, \qquad c_\ell = \arg\min_{k}\,\bigl\| \mathbf{r}_{\ell-1} - \mathbf{b}^{(\ell)}_{k} \bigr\|_2, \qquad \mathbf{r}_\ell = \mathbf{r}_{\ell-1} - \mathbf{b}^{(\ell)}_{c_\ell},$$

where $\mathbf{b}^{(\ell)}_{k}$ are learnable codewords (codebook entries) at stage $\ell$. Codebooks are maintained using exponential moving averages, and rarely used codes are dropped and refreshed to avoid collapse.
Users sharing the same prefix code $(c_1, \ldots, c_\ell)$ are allocated to the same group at level $\ell$, forming a trie structure over the user population. At shallow hierarchy levels, groups are coarse (loosely similar users); at deeper levels, groups are finer (highly similar users). This structure provides both scalability and relevance in partitioning the space for group-wise ranking objectives.
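The sketch below illustrates this encoding step, assuming PyTorch tensors; codebook learning via EMA and dead-code refresh are omitted, and the function and parameter names (including the prefix-packing constant) are illustrative rather than taken from the paper.

```python
import torch

def rvq_encode(user_emb, codebooks):
    """Encode user embeddings into hierarchical code sequences via residual
    vector quantization. `codebooks` is a list of (K_l, d) tensors, one per
    stage. Returns a (batch, L) LongTensor of code indices; the prefix
    (c_1, ..., c_l) identifies a user's group at level l."""
    residual = user_emb                                   # r_0 = e_u
    codes = []
    for codebook in codebooks:                            # stages l = 1..L
        dists = torch.cdist(residual, codebook)           # (batch, K_l) distances
        idx = dists.argmin(dim=1)                         # c_l: nearest codeword
        codes.append(idx)
        residual = residual - codebook[idx]               # r_l = r_{l-1} - b^{(l)}_{c_l}
    return torch.stack(codes, dim=1)

def group_keys(codes, level):
    """Users sharing the same code prefix of length `level` fall in the same group."""
    key = torch.zeros(codes.size(0), dtype=torch.long)
    for col in range(level):
        # pack the prefix into a single integer key (assumes each codebook has < 1024 entries)
        key = key * 1024 + codes[:, col]
    return key
```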
2. Group-wise Listwise Ranking Loss Across Hierarchy Levels
Within each group $g$ at hierarchy level $\ell$, user-item prediction pairs are subject to a regression-compatible listwise cross-entropy loss (ListCE):

$$\mathcal{L}^{(g)}_{\text{ListCE}} = -\sum_{i \in g} \tilde{y}_i \,\log \frac{\sigma(\hat{z}_i)}{\sum_{j \in g} \sigma(\hat{z}_j)}.$$

Here, $\hat{z}_i$ is the predicted score for user-item pair $i$, $\sigma(\cdot)$ is the sigmoid function, $y_i$ is the binary label, and $\tilde{y}_i$ is the label normalized within the group:

$$\tilde{y}_i = \frac{y_i}{\sum_{j \in g} y_j}.$$
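A minimal sketch of this group-wise loss, assuming PyTorch and a batch already tagged with group identifiers for the current hierarchy level; the names and the exact reduction over groups are illustrative and may differ from the paper's implementation.

```python
import torch

def group_listwise_ce(logits, labels, group_ids):
    """Regression-compatible listwise cross-entropy applied independently
    within each group. `logits` and binary `labels` are 1-D tensors over the
    batch; `group_ids` assigns every user-item pair to a group."""
    loss = logits.new_zeros(())
    for g in group_ids.unique():
        mask = group_ids == g
        z, y = logits[mask], labels[mask]
        if y.sum() == 0:                          # skip groups with no positives
            continue
        p = torch.sigmoid(z)                      # sigmoid-transformed scores
        log_softmax = torch.log(p / p.sum())      # listwise normalization within the group
        y_norm = y / y.sum()                      # labels normalized within the group
        loss = loss - (y_norm * log_softmax).sum()
    return loss
```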
A multi-objective training regime combines (a) the usual logloss over all user-item pairs, (b) a logloss on quantized (RVQ) user representations to enforce calibration, and (c) the sum of group-wise ranking losses across hierarchy levels. To balance their impact, the framework uses an uncertainty-weighted loss aggregation (as per Kendall et al., 2018):

$$\mathcal{L}_{\text{rank}} = \sum_{\ell=1}^{L} \left( \frac{1}{2\sigma_\ell^{2}}\, \mathcal{L}^{(\ell)}_{\text{ListCE}} + \log \sigma_\ell \right),$$

with trainable uncertainty parameters $\sigma_\ell$ for each level $\ell$, where $\mathcal{L}^{(\ell)}_{\text{ListCE}}$ sums the group losses at level $\ell$.
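One possible implementation of this weighting, using the standard log-variance parameterization from Kendall et al. (2018) for numerical stability; this is a sketch, not the paper's code.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedRankLoss(nn.Module):
    """Combine per-level ranking losses as sum_l ( L_l / (2*sigma_l^2) + log sigma_l ),
    with the uncertainties learned through their log-variances."""
    def __init__(self, num_levels):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_levels))   # log sigma_l^2

    def forward(self, level_losses):
        total = level_losses[0].new_zeros(())
        for loss, log_var in zip(level_losses, self.log_vars):
            precision = torch.exp(-log_var)                      # 1 / sigma_l^2
            total = total + 0.5 * precision * loss + 0.5 * log_var
        return total
```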
3. Connection to Hard Negative Mining and Model Calibration
Optimal training of ranking objectives generally requires hard negatives—examples that are most likely to induce model mistakes and trigger large gradients. This is often implemented via dynamic online selection or large-batch negative mining, which can be computationally expensive or require retrieval infrastructure. In the presented framework, hard negatives are efficiently approximated by restricting listwise losses to ever-finer groups of users: as user similarity within groups increases, so too does the difficulty of negative examples, since items consumed or considered by almost-identical users are more confounding for the ranking model.
Theoretical analysis in the paper shows that the optimal negative sampling corresponds to sampling negatives in proportion to their gradient norms. The hierarchical grouping by RVQ codes provides a scalable, effective approximation of this principle—especially at deeper levels of the code trie.
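To make the principle concrete: for the pointwise logloss, the per-example gradient magnitude with respect to the logit is $|\sigma(\hat{z}) - y|$, so for negatives ($y = 0$) gradient-norm-proportional sampling reduces to sampling in proportion to $\sigma(\hat{z})$. The snippet below illustrates that baseline directly (assuming PyTorch); it shows what the hierarchical grouping approximates, not the framework's own procedure.

```python
import torch

def gradient_norm_negative_sampling(logits, labels, num_neg):
    """Sample negative examples with probability proportional to a gradient-norm
    proxy. For the binary logloss, |d loss / d logit| = |sigma(z) - y|, which
    equals sigma(z) when y = 0."""
    probs = torch.sigmoid(logits)
    weights = probs * (labels == 0).float()   # only negatives are eligible
    # assumes at least `num_neg` negatives with nonzero weight are present
    return torch.multinomial(weights, num_neg, replacement=False)
```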
A secondary benefit is improved model calibration. By combining ranking loss over hierarchical groups with auxiliary (quantized) calibration objectives, the model’s probability outputs gain in reliability and interpretability for downstream decision-making.
4. Mathematical Formulations and Overall Objective
The main loss function integrates all components:

$$\mathcal{L} = \mathcal{L}_{\text{logloss}}(\hat{y}, y) + \alpha\, \mathcal{L}_{\text{logloss}}(\hat{y}^{q}, y) + \mathcal{L}_{\text{rank}},$$

where $\alpha$ is a hyperparameter balancing the auxiliary quantized calibration loss and $\hat{y}^{q}$ denotes predictions from the quantized representation. The hierarchical ranking loss $\mathcal{L}_{\text{rank}}$ is as defined above.
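A sketch of how these terms might be combined in a training step, reusing the components sketched above; the signature and the value of `alpha` are illustrative assumptions, not the paper's configuration.

```python
import torch.nn.functional as F

def total_loss(logits, logits_quantized, labels, level_losses, rank_loss_module, alpha=0.1):
    """Overall objective: pointwise logloss on full and quantized predictions,
    plus the uncertainty-weighted hierarchical ranking loss. `labels` are floats
    in {0, 1}; `level_losses` holds one group-wise ListCE loss per hierarchy level."""
    l_main = F.binary_cross_entropy_with_logits(logits, labels)
    l_quant = F.binary_cross_entropy_with_logits(logits_quantized, labels)
    l_rank = rank_loss_module(level_losses)   # e.g., UncertaintyWeightedRankLoss above
    return l_main + alpha * l_quant + l_rank
```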
5. Empirical Validation and Performance
The framework’s effectiveness is demonstrated on large-scale, real-world datasets including KuaiRand (video) and Taobao (e-commerce). Key findings:
- Ranking performance: The GroupCE framework increases test GAUC compared to logloss, pairwise, listwise, and recent state-of-the-art objectives, demonstrating consistent improvements in both ranking and calibration.
- Cold-start users: The group-wise hierarchy yields superior GAUC (e.g., 0.6786 vs. 0.6718) for users with little history, indicating better generalization to sparse data regions thanks to group-level matching.
- Ablation studies: Each major component (the hierarchical loss and the quantized auxiliary calibration loss) is essential, with notable degradation if either is omitted.
- Industrial practicality: The approach is efficient, scalable, and serving-compatible, as it requires only batch-level group partitioning without recourse to dynamic context streaming or approximate nearest neighbor infrastructure.
| Objective | KuaiRand GAUC | Taobao GAUC |
|---|---|---|
| LogLoss | 0.6911 | 0.5708 |
| LogLoss + Pairwise | 0.6921 | 0.5728 |
| LogLoss + ListwiseCE | 0.6932 | 0.5734 |
| JRC | 0.6930 | 0.5732 |
| GroupCE (proposed) | 0.6953 | 0.5745 |
Performance advantages stem directly from the group-wise curriculum: shallow groups provide broad, low-difficulty negatives, while deep groups supply hard, informative counterexamples without costly online negative mining.
6. Scalability and Practical Deployment
The framework’s trie-based grouping is batch-parallelizable and computed without external retrieval dependencies. Residual vector quantization supports extremely large user populations by hierarchical code reuse; codebook maintenance via EMA ensures training stability. This design results in a system that can be deployed within existing large-scale recommendation serving stacks, as no special online computation or infrastructure is needed at inference time.
7. Impact and Implications
The hierarchical group-wise ranking framework represents a principled solution to the core challenge of providing effective, scalable negative sampling for industrial learning-to-rank in recommendation. By combining multi-level grouping with listwise optimization, it robustly bridges the gap between classic supervised ranking and hard negative mining-based methods, without incurring their operational costs. The approach generalizes to any scenario where user similarity can be reliably quantized, and supports improved item discovery, user engagement, and overall relevance in production-scale recommender systems.