
Normalized Discounted Cumulative Gain

Updated 14 April 2026
  • Normalized Discounted Cumulative Gain (NDCG) is a ranking metric that quantifies the quality of ranked lists using logarithmic discounting to emphasize top-ranked items.
  • It is widely used in information retrieval and recommendation systems to evaluate models by comparing actual rankings to an ideal order, ensuring cross-query comparability.
  • Recent developments include differentiable surrogates like NeuralNDCG and variants such as NEDCG, which address optimization challenges and improve early enrichment detection.

Normalized Discounted Cumulative Gain (NDCG) is a canonical metric for evaluating the quality of ranking algorithms, particularly in information retrieval, recommender systems, and learning-to-rank pipelines. NDCG quantifies not only whether highly relevant items are present among the top ranks but also the order in which they appear, assigning greater importance to highly relevant results placed at early ranks. Its theoretical properties, practical variants, limitations, and differentiable surrogates underpin most modern evaluations of listwise machine learning models.

1. Formal Definition and Mathematical Properties

Let a ranked list of $K$ items (e.g., documents, compounds, recommendations) be assigned nonnegative relevance scores $(rel_1, \dots, rel_K)$. The NDCG computation proceeds via the following steps:

  • Discounted Cumulative Gain (DCG@K):

$$\mathrm{DCG@}K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)}$$

The discount denominator $\log_2(i+1)$ down-weights gains from lower-ranked (i.e., deeper in the list) items, capturing the principle that mistakes at top ranks are more critical.

  • Ideal DCG (IDCG@K):

$$\mathrm{IDCG@}K = \max_{\pi} \sum_{i=1}^{K} \frac{rel_{\pi(i)}}{\log_2(i+1)}$$

where $\pi$ ranges over all permutations, implemented in practice by sorting true relevances in descending order.

  • Normalized DCG (NDCG@K):

$$\mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}$$

By construction, $0 \leq \mathrm{NDCG@}K \leq 1$; a value of 1 denotes a perfect ordering.

A typical gain function is $gain_i = 2^{rel_i} - 1$, which amplifies the contribution of higher relevance levels, in line with IR conventions and user-perceived value (Zhu et al., 2020, Zhang et al., 2021, He et al., 2017).
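
The following is a minimal NumPy sketch of the computation above, using the relevance values directly as gains (the exponential gain $2^{rel_i}-1$ could be substituted on the commented line); the function names are illustrative, not from any particular library.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG@k with the logarithmic discount: sum_i rel_i / log2(i + 1), i = 1..k."""
    rel = np.asarray(relevances, dtype=float)[:k]
    # rel = 2.0 ** rel - 1.0                        # optional exponential gain
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(i + 1) for i = 1..k
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances_in_ranked_order, k):
    """NDCG@k = DCG@k / IDCG@k, where IDCG sorts the true relevances descending."""
    ideal = np.sort(np.asarray(relevances_in_ranked_order, dtype=float))[::-1]
    idcg = dcg_at_k(ideal, k)
    if idcg == 0.0:
        return 0.0                                   # no relevant items at all
    return dcg_at_k(relevances_in_ranked_order, k) / idcg

# Relevance labels of items in the order the model ranked them:
print(round(ndcg_at_k([3, 2, 3, 0, 1, 2], k=6), 3))  # ~0.96 for this near-ideal ordering
```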

2. Discount Functions, Relevance Scales, and Interpretations

The logarithmic discount $1/\log_2(i+1)$ is used almost universally because it models the rapid decay of user attention with list depth. Alternative discount sequences, such as polynomial or Zipfian (e.g., $1/i$) decays, have been analyzed for discriminative power and convergence properties (Wang et al., 2013). The choice of discount determines how much separation the measure induces in large rankings:

  • Logarithmic discounts ensure consistent distinguishability: although NDCG converges to 1 as the number of ranked items grows, differences between models remain statistically meaningful at achievable sample sizes.
  • A discount that decays substantially faster destroys this property, while one that decays too slowly weakens top-K sensitivity (Wang et al., 2013). A short sketch with pluggable discounts follows this list.
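
As a concrete illustration of the discount's effect, the sketch below plugs both the standard logarithmic discount and a Zipfian $1/i$ decay into the DCG sum; the $1/i$ form is an assumption for illustration, as Wang et al. analyze a broader family of discounts.

```python
import numpy as np

def dcg_with_discount(relevances, k, discount="log2"):
    """DCG@k with a pluggable discount: "log2" -> 1/log2(i+1), "zipf" -> 1/i."""
    rel = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, rel.size + 1)
    weights = 1.0 / np.log2(ranks + 1) if discount == "log2" else 1.0 / ranks
    return float(np.sum(rel * weights))

rels = [3, 2, 3, 0, 1, 2]
# The Zipfian decay concentrates relatively more credit on the first few positions.
print(dcg_with_discount(rels, 6, "log2"), dcg_with_discount(rels, 6, "zipf"))
```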

Relevance values may be binary, ordinal, or even continuous:

  • Standard NDCG employs discrete, hand-specified levels (e.g., 0–4 in LETOR, or 0/1).
  • Data-driven relevance: Recent work replaces ad hoc levels with transformations of observed scores (e.g., news article popularity, biochemical activity) via shape-preserving polynomial interpolation over control points, yielding a continuous, empirically grounded gain curve. This supports finer differentiation where adjacent scores are close (Moniz et al., 2016).
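
In that spirit, the sketch below builds a continuous gain curve from a handful of (observed score, gain) control points using SciPy's shape-preserving PCHIP interpolator; the control-point values are hypothetical placeholders, not taken from Moniz et al. (2016).

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical control points mapping an observed score (e.g., popularity or activity)
# to a gain value; the monotone PCHIP interpolant preserves their shape, so the
# induced gain curve does not oscillate between control points.
score_ctrl = np.array([0.0, 0.2, 0.5, 0.8, 1.0])
gain_ctrl = np.array([0.0, 0.1, 0.5, 3.0, 7.0])
gain_fn = PchipInterpolator(score_ctrl, gain_ctrl)

observed = np.array([0.91, 0.40, 0.87, 0.05])        # scores of items in ranked order
gains = gain_fn(observed)                            # continuous, data-driven gains
discounts = 1.0 / np.log2(np.arange(2, observed.size + 2))
print(gains.round(3), round(float(np.sum(gains * discounts)), 3))
```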

3. Strengths, Limitations, and the NEDCG Extension

Strengths:

  • Cross-query comparability: Since IDCG normalizes for label set and query difficulty, NDCG permits aggregating scores across diverse queries/contexts (Furui et al., 2022).
  • Position-sensitivity and boundedness: The metric is both top-heavy and interpretable via a simple scale.

Known limitations:

  • NDCG is always nonnegative. Values near 0 may result even from sub-random orderings; NDCG cannot signal if predictions are actually worse than random (Furui et al., 2022).
  • In the context of virtual compound screening, GBDT regression may yield NDCG@10 ≈ 0.404—ostensibly "moderate"—yet corresponding early enrichment may be statistically below random.

NEDCG (Normalized Enrichment DCG): To address this, NEDCG introduces an explicit baseline:

$$\mathrm{DCG_{rand}@}K = \mathbb{E}_{\pi \sim \mathrm{Uniform}}\left[\sum_{i=1}^{K} \frac{rel_{\pi(i)}}{\log_2(i+1)}\right]$$

$$\mathrm{NEDCG@}K = \frac{\mathrm{DCG@}K - \mathrm{DCG_{rand}@}K}{\mathrm{IDCG@}K - \mathrm{DCG_{rand}@}K}$$

Under random assignment, NEDCG ≈ 0; a perfect list gives 1. For sub-random models, NEDCG < 0. This variant exposes failure modes of standard NDCG and delivers a true signal of enrichment over chance in early retrieval (Furui et al., 2022).
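
A minimal sketch of this correction follows, assuming the baseline is the expected DCG of a uniformly random permutation of the same $K$ items, which has a simple closed form because the expected relevance at every rank then equals the mean relevance.

```python
import numpy as np

def nedcg_at_k(relevances_in_ranked_order, k):
    """Enrichment-normalized DCG: ~0 for a random ordering, 1 for an ideal one,
    and negative when the ranking is worse than chance."""
    rel = np.asarray(relevances_in_ranked_order, dtype=float)
    k = min(k, rel.size)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))

    dcg = float(np.sum(rel[:k] * discounts))
    idcg = float(np.sum(np.sort(rel)[::-1][:k] * discounts))
    # Expected DCG of a uniformly random permutation of these items.
    rand_dcg = float(rel.mean() * discounts.sum())

    denom = idcg - rand_dcg
    return (dcg - rand_dcg) / denom if denom > 0 else 0.0

actives_near_bottom = [0, 0, 0, 1, 0, 0, 1, 0, 0, 1]
print(round(nedcg_at_k(actives_near_bottom, 10), 3))   # negative: worse than chance
```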

4. Surrogate Losses and Differentiable Optimization

NDCG’s non-differentiability arises from its reliance on discrete sorting: the rank operator is piecewise constant in predicted scores, with undefined or zero gradient almost everywhere. This complicates direct gradient-based optimization, which is central to deep learning and large-batch learning-to-rank regimes (Pobrotyn et al., 2021, Qiu et al., 2022).

Recent advances include:

  • NeuralNDCG: Uses NeuralSort, a continuous relaxation of sorting, to produce a soft permutation matrix $\widehat{P}$ over the $K$ items. Gains are soft-assigned to ranks, and the surrogate metric converges to true NDCG as the temperature $\tau \to 0$. Sinkhorn scaling ensures numerical stability. Both "row" and "column transposed" formulations are supported; in practice, they yield empirically competitive and stable losses for direct NDCG maximization (Pobrotyn et al., 2021). A minimal sketch of this relaxation appears after this list.
  • Pairwise and compositional surrogates: To enable stochastic optimization, the rank of an item can be approximated by summing smooth pairwise comparisons (e.g., replacing the indicator $\mathbb{1}[s_j \ge s_i]$ inside the rank computation with a hinge or squared-hinge function of the score difference), enabling compositional or bilevel objectives for scalable deep learning (Qiu et al., 2022, An et al., 2023).
  • Listwise surrogates for learning-to-rank: Hybrid losses incorporating DCG-style gain at each loss step, tie-awareness, and direct NDCG calibration, as in boosted tree and neural models (He et al., 2017, Zhu et al., 2020).
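
The following is a small NumPy sketch of the NeuralSort relaxation and the resulting smooth NDCG surrogate; the Sinkhorn scaling used by NeuralNDCG is omitted for brevity, and the exact formulation and hyperparameters in the cited papers may differ.

```python
import numpy as np

def soft_permutation(scores, tau=1.0):
    """NeuralSort-style relaxation (Grover et al., 2019): returns an (n, n)
    row-stochastic matrix whose row i soft-selects the item placed at rank i.
    As tau -> 0, the rows approach one-hot vectors of the true descending sort."""
    s = np.asarray(scores, dtype=float)
    n = s.size
    pairwise = np.abs(s[:, None] - s[None, :])       # |s_j - s_k|
    b = pairwise.sum(axis=1)                         # row sums of the pairwise matrix
    ranks = np.arange(1, n + 1)[:, None]             # 1..n as a column
    logits = ((n + 1 - 2 * ranks) * s[None, :] - b[None, :]) / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def soft_ndcg(scores, relevances, tau=1.0):
    """Smooth NDCG surrogate: gains are soft-assigned to ranks via the relaxed
    permutation, then discounted and normalized as usual."""
    rel = np.asarray(relevances, dtype=float)
    gains = 2.0 ** rel - 1.0
    p_hat = soft_permutation(scores, tau)
    soft_gains = p_hat @ gains                        # expected gain at each rank
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    ideal = np.sort(gains)[::-1]
    return float((soft_gains * discounts).sum() / (ideal * discounts).sum())

scores = np.array([2.0, 0.5, 1.0, -1.0])
rels = np.array([3, 0, 2, 1])
print(round(soft_ndcg(scores, rels, tau=0.1), 3))     # ~0.99, close to exact NDCG for small tau
```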

5. Operational Considerations and Implementation Issues

Subtle implementation details in NDCG can affect empirical comparisons and reproducibility:

  • Handling of IDCG with insufficient positives: Some libraries compute the ideal DCG as if all $K$ slots could be filled with relevant items even when fewer than $K$ relevant items exist, artificially lowering NDCG when positive labels are rare. Corrected versions truncate IDCG to the actual number of positives (Schmidt et al., 2024); see the sketch after the table below.
  • Tie handling: Especially with low-cardinality or discrete relevance and in binary hashing (Hamming distances), NDCG’s sensitivity depends on how ties are broken. Tie-aware averaging is analytically tractable for NDCG and necessary for fair benchmarking (He et al., 2017).
  • Pooling and normalization: Aggregating NDCG across queries/contexts is not always order-preserving. That is, the DCG-induced and NDCG-induced orderings of methods are consistent per-query, but pooling and averaging can invert relative method rankings (Jeunen et al., 2023). Inconsistent normalization can thus complicate off-policy evaluation, especially in recommender systems.
| Implementation Pitfall | Empirical Impact (Schmidt et al., 2024) | Resolution |
|---|---|---|
| Mismatched IDCG definition | Up to 18% NDCG deviation | Truncate IDCG to available positives |
| Dense vs. Top-K similarity | ≈ 10–20% NDCG drop with dense matrix | Use Top-K truncation |
| Tie breaking | Varies by ranking library | Use explicit tie-aware NDCG |
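
A sketch of the IDCG correction for binary labels follows; the flag name is illustrative and simply toggles between the pitfall and the corrected behavior described above.

```python
import numpy as np

def ndcg_binary(ranked_labels, k, assume_k_positives=False):
    """NDCG@k for binary labels. With assume_k_positives=True the ideal DCG
    pretends every one of the k slots could hold a positive (the pitfall above);
    the default truncates the ideal DCG to the positives that actually exist."""
    labels = np.asarray(ranked_labels, dtype=float)
    top = labels[:k]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = float(np.sum(top * discounts[: top.size]))

    n_pos = int(labels.sum())
    ideal_slots = k if assume_k_positives else min(k, n_pos)
    idcg = float(discounts[:ideal_slots].sum())
    return dcg / idcg if idcg > 0 else 0.0

ranked = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]                            # only two positives exist
print(round(ndcg_binary(ranked, 10), 3))                           # ~0.92 (corrected)
print(round(ndcg_binary(ranked, 10, assume_k_positives=True), 3))  # ~0.33 (deflated)
```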

6. Adaptations and Application Domains

NDCG’s sensitivity to position and gain makes it the standard in:

  • Learning-to-Rank for IR, Recommendation, and NAS: Models are trained using losses that maximize NDCG or surrogates thereof. In Neural Architecture Search, NDCG correlates more strongly with final top-k accuracy than Kendall’s tau, especially when the task is to surface the single best or top few architectures (Zhang et al., 2021).
  • Preference Optimization and Human Feedback Alignment: OPO and related algorithms in LLM preference alignment leverage NDCG over multiple candidate responses, using differentiable relaxations (NeuralSort + Sinkhorn) for listwise optimization over ordinal human feedback (Zhao et al., 2024).
  • Compound and Virtual Screening: High-throughput settings require strong "early enrichment." Here, NEDCG is favored to directly report improvement over random, which NDCG cannot express for models with sub-random ranking (Furui et al., 2022).
  • Spatiotemporal Event Ranking: Hybrid NDCG losses, incorporating local (region-level) and global ranking, improve the identification of top-risk locations in urban event forecasting (An et al., 2023).

7. Theoretical and Practical Limitations

  • Convergence to 1: As the number of ranked items grows, NDCG with standard log-discounts converges to 1 for all scoring functions, yet it still consistently distinguishes better from worse models via deviation magnitudes; the separation persists at realistic dataset sizes (Wang et al., 2013). This behavior is illustrated in the simulation sketch after this list.
  • Off-policy inconsistency: For batch or offline evaluations of recommendation algorithms, normalized DCG may fail to monotonically reflect true online reward due to inconsistent normalization over pooled queries. Unnormalized DCG or pooled normalization (pnDCG) are preferred for these applications (Jeunen et al., 2023).
  • Non-differentiability: Many "direct" NDCG surrogates are not smooth; optimizers may require careful temperature or surrogate function tuning for stable gradient signals (Pobrotyn et al., 2021, Zhao et al., 2024).
  • Relevance gaps: When true item differences are large, standard NDCG may "under-penalize" egregious misorderings, while for near-tied items it can "over-penalize" harmless swaps. Data-driven relevance scaling better aligns penalization with real-world utility (Moniz et al., 2016).
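
To make the convergence claim concrete, the simulation sketch below (an illustration under an assumed uniform graded-relevance distribution, not the analysis of Wang et al.) estimates the average NDCG of a uniformly random ranking as the list length grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def ndcg_of_random_ranking(n, trials=200):
    """Average NDCG of a uniformly random ranking of n items whose graded
    relevances are drawn from {0, ..., 4}; log2 discount, linear gain."""
    discounts = 1.0 / np.log2(np.arange(2, n + 2))
    vals = []
    for _ in range(trials):
        rel = rng.integers(0, 5, size=n).astype(float)
        idcg = float(np.sum(np.sort(rel)[::-1] * discounts))
        dcg = float(np.sum(rng.permutation(rel) * discounts))
        vals.append(dcg / idcg)
    return float(np.mean(vals))

for n in (10, 100, 1_000, 10_000):
    print(n, round(ndcg_of_random_ranking(n), 3))   # creeps toward 1 as n grows
```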

In summary, NDCG and its variants provide a theoretically grounded, position-aware, and relevance-sensitive metric essential for modern ranking evaluation and optimization. Recent research advances focus on robust normalization (NEDCG), continuous relevance estimation, differentiable surrogates for deep learning, and domain-specific adaptations for IR, recommendation, and preference alignment. Methodological rigor in discount selection, normalization, tie-awareness, and practical implementation is critical to preserving NDCG’s validity across large-scale, heterogeneous, and low-signal applications.
