
Popularity-Aware Interval Accuracy Metrics

Updated 25 December 2025
  • The paper presents PAIA metrics that incorporate popularity weights to offer a nuanced assessment of model performance across popular and rare items.
  • It details a methodology for stratifying data into popularity bins and reporting per-bin interval accuracy and Gain statistics to uncover biases in predictions.
  • The approach facilitates fairer evaluation and bias mitigation in recommender systems and vision-language models, promoting both discovery and balanced exposure.

Popularity-aware interval accuracy metrics constitute a family of evaluation tools designed to quantify and diagnose the effect of item or instance popularity on the accuracy and fairness of machine learning systems, notably in recommender systems and vision-LLMs. By stratifying or weighting predictions according to popularity, these metrics provide insight into whether models disproportionately favor popular examples at the expense of rare or long-tail cases, and thus offer a principled mechanism for detecting and eventually mitigating popularity bias in both retrieval and regression settings (Boratto et al., 2020; Szu-Tu et al., 24 Dec 2025).

1. Formal Definitions and Core Notation

Consider a generic supervised setting with $N$ test samples indexed by $i$. Each instance is annotated with a ground-truth label $y_i$ (e.g., construction year for ordinal regression, binary relevance for recommendation), a model prediction $\hat{y}_i$, and an associated popularity score $p_i$. In recommender settings, let $U = \{u_1, \ldots, u_m\}$ be the set of users and $I = \{i_1, \ldots, i_n\}$ the set of items, with training and test interaction matrices denoted $R^{\text{train}}(u,i)$ and $R^{\text{test}}(u,i)$, respectively (Boratto et al., 2020). In vision-language settings, popularity may derive from external attributes such as Wikipedia page views (Szu-Tu et al., 24 Dec 2025).

  • Interval Accuracy (IA): For a tolerance parameter $\tau$ (e.g., years in date regression), define the indicator

$$\mathbf{1}_i^{(\tau)} = \begin{cases} 1 & \text{if } |\hat{y}_i - y_i| \le \tau \\ 0 & \text{otherwise} \end{cases}$$

with overall accuracy

$$\mathrm{IA}(\tau) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_i^{(\tau)}.$$

  • Popularity-Aware Weighting: Introduce nonnegative instance weights $w_i = f(p_i) / \sum_j f(p_j)$, where $f$ may be the identity, a log-scaling, or another monotonic transformation. The popularity-aware interval accuracy (PAIA) is defined as

$$\mathrm{PAIA}(\tau) = \sum_{i=1}^{N} w_i \mathbf{1}_i^{(\tau)}.$$

Alternatively, bin the data by popularity and report per-bin $\mathrm{IA}(\tau)$.
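As a concrete illustration, here is a minimal NumPy sketch of both quantities (the function names and the log1p weighting default are our own illustrative assumptions, not from the papers):

```python
import numpy as np

def interval_accuracy(y_true, y_pred, tau):
    """Unweighted IA(tau): fraction of predictions within +/- tau of the label."""
    hits = np.abs(np.asarray(y_pred) - np.asarray(y_true)) <= tau
    return float(hits.mean())

def paia(y_true, y_pred, popularity, tau, f=np.log1p):
    """PAIA(tau): interval hits weighted by normalized f(popularity)."""
    w = f(np.asarray(popularity, dtype=float))
    w = w / w.sum()  # nonnegative weights w_i summing to 1
    hits = np.abs(np.asarray(y_pred) - np.asarray(y_true)) <= tau
    return float(np.sum(w * hits))
```

With $f$ as the identity, highly popular instances dominate the score; a log transform softens the weighting, which matters when popularity spans several orders of magnitude.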

2. Bin-wise and Stratified Metrics

To enable fine-grained analysis, instances or items are partitioned into $L$ disjoint bins $B_1, \ldots, B_L$ by quantiles or by thresholding on $p_i$ (e.g., Wikipedia page views, collaborative-filtering popularity score). Within each bin $B_\ell$:

  • Bin-wise Interval Accuracy:

$$\mathrm{IA}_{\ell}(\tau) = \frac{1}{|B_\ell|} \sum_{i \in B_\ell} \mathbf{1}_i^{(\tau)}.$$

  • Gain Statistic: For applications with a semantically meaningful “low”- and “high”-popularity split, define

$$\mathrm{Gain}(\tau) = \mathrm{IA}_{\text{high}}(\tau) - \mathrm{IA}_{\text{low}}(\tau)$$

to summarize the extent to which a model’s accuracy is biased in favor of, or against, the most popular examples (Szu-Tu et al., 24 Dec 2025).
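A sketch of the bin-wise computation under the same assumptions (quantile bins, popularity values mostly distinct; taking the extreme bins as the high/low split is one of several reasonable conventions):

```python
import numpy as np

def binwise_ia(y_true, y_pred, popularity, tau, n_bins=5):
    """Per-bin IA(tau) over quantile bins of popularity, least to most popular."""
    p = np.asarray(popularity, dtype=float)
    edges = np.quantile(p, np.linspace(0, 1, n_bins + 1))[1:-1]
    bins = np.digitize(p, edges)  # bin index in 0..n_bins-1
    hits = np.abs(np.asarray(y_pred) - np.asarray(y_true)) <= tau
    return np.array([hits[bins == b].mean() for b in range(n_bins)])

def gain(y_true, y_pred, popularity, tau, n_bins=5):
    """Gain(tau): IA on the most popular bin minus IA on the least popular bin."""
    ia = binwise_ia(y_true, y_pred, popularity, tau, n_bins)
    return float(ia[-1] - ia[0])
```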

In collaborative filtering (Boratto et al., 2020), related metrics include:

  • Average recommendation probability per bin: $\bar{p}_{\text{rec}}(\ell)$
  • Average true-positive rate per bin: $\bar{p}_{\text{TPR}}(\ell)$

Both are parametric in a cutoff $k$ (e.g., Top-$k$ recommendations) and are derived by aggregating item- or user-level statistics within each popularity interval.
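For the collaborative-filtering case, the following sketch assumes dense boolean user-by-item matrices (the layout and names are our own; Boratto et al.'s exact aggregation may differ in detail):

```python
import numpy as np

def per_bin_exposure_and_tpr(rec_mask, rel_mask, item_bins, n_bins):
    """
    rec_mask[u, i]: True if item i appears in user u's Top-k list.
    rel_mask[u, i]: True if item i is relevant to user u in R^test.
    item_bins[i]:   popularity-bin index of item i, in 0..n_bins-1.
    Returns (p_rec per bin, p_TPR per bin), averaged over items in each bin.
    """
    rec = np.asarray(rec_mask, dtype=bool)
    rel = np.asarray(rel_mask, dtype=bool)
    item_bins = np.asarray(item_bins)
    p_rec_item = rec.mean(axis=0)                   # P(recommended) per item
    tp = (rec & rel).sum(axis=0).astype(float)      # true positives per item
    pos = rel.sum(axis=0).astype(float)             # relevant pairs per item
    tpr_item = np.divide(tp, pos, out=np.zeros_like(tp), where=pos > 0)
    p_rec = [p_rec_item[item_bins == b].mean() for b in range(n_bins)]
    tpr = [tpr_item[item_bins == b].mean() for b in range(n_bins)]
    return np.array(p_rec), np.array(tpr)
```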

3. Computational Recipes and Empirical Pipeline

Practical implementation entails the following core steps:

  1. Score Computation: For each test instance (regression) or user-item pair (recommendation), compute the model prediction (e.g., a regression output $\hat{y}_i$ or a score $\hat{R}(u,i)$).
  2. Popularity Quantification:
    • Vision-Language: Obtain external statistics (e.g., Wikipedia page views) as the popularity proxy $p_i$.
    • Collaborative Filtering: Compute $\mathrm{pop}(i) = \sum_{u \in U} R^{\text{train}}(u,i)$ as item popularity.
  3. Interval or Bin Construction: Define bins $B_\ell$ by splitting the range of $p_i$ at fixed thresholds (e.g., $<10^2$, $10^2$–$10^3$, $10^3$–$10^4$, \ldots, $>10^5$) or at quantiles.
  4. Metric Aggregation: For each bin $\ell$, aggregate $\mathrm{IA}_\ell(\tau)$, $\bar{p}_{\text{rec}}(\ell)$, and $\bar{p}_{\text{TPR}}(\ell)$ according to the formulas above.
  5. Continuous Weighting (optional): Compute $\mathrm{PAIA}(\tau)$ using instance-wise weights $w_i$ (identity, log-scaled, or clipped); a minimal end-to-end sketch follows this list.
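Steps 1–5 can be exercised end to end on synthetic data. The sketch below continues the earlier snippets (it reuses the illustrative paia helper from Section 1; all data, thresholds, and parameter values are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2: toy predictions (construction years) and a popularity proxy.
y_true = rng.integers(1800, 2021, size=1000)
y_pred = y_true + rng.normal(0, 15, size=1000).round().astype(int)
views = rng.lognormal(mean=8.0, sigma=2.5, size=1000)  # stand-in for page views

# Step 3: fixed thresholds <1e2, 1e2-1e3, 1e3-1e4, 1e4-1e5, >1e5.
bins = np.digitize(views, [1e2, 1e3, 1e4, 1e5])        # bin index 0..4

# Steps 4-5: per-bin IA and continuous-weighted PAIA at tau = 10 years.
tau = 10
hits = np.abs(y_pred - y_true) <= tau
for b in range(5):
    mask = bins == b
    if mask.any():
        print(f"bin {b}: n={mask.sum():4d}  IA={hits[mask].mean():.3f}")
print("PAIA:", paia(y_true, y_pred, views, tau))
```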

A toy example contrasting standard and popularity-aware metrics shows how a single misprediction on a high-popularity sample can dominate the weighted score, even when the unweighted accuracy appears reasonable (Szu-Tu et al., 24 Dec 2025); a minimal version is sketched below.
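The following is our own construction, not the paper's exact example: five samples, where the single highly popular item is mispredicted while four rare items are correct.

```python
import numpy as np

hits = np.array([0, 1, 1, 1, 1])              # indicator 1_i^(tau) per sample
views = np.array([1e6, 10.0, 10.0, 10.0, 10.0])
w = views / views.sum()                        # identity weighting f(p) = p
print(hits.mean())                             # unweighted IA(tau) = 0.80
print(float(np.sum(w * hits)))                 # PAIA(tau) ~ 4e-5: the popular miss dominates
```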

4. Diagnostic and Interpretive Significance

Popularity-aware interval accuracy metrics reveal systematic patterns that standard, population-averaged measures cannot capture. Unweighted $\mathrm{IA}(\tau)$ or mean Top-$k$ recall/precision can obscure the fact that a model obtains its average score by excelling in high-popularity bins while failing in the long tail, or vice versa.

  • A pronounced positive Gain implies that the model performs substantially better on popular instances, signaling memorization of, or overfitting to, high-frequency exemplars, as observed for commercial vision-LLMs (Szu-Tu et al., 24 Dec 2025).
  • A negative Gain indicates stronger accuracy on rare instances, while a flat trend across bins suggests popularity-independent behavior (whether uniformly proficient or uniformly failing).
  • Analogously, in recommender systems, downward-sloping $\bar{p}_{\text{rec}}(\ell)$ or $\bar{p}_{\text{TPR}}(\ell)$ curves (head $\rightarrow$ tail) indicate diminishing exposure and true-positive rates for less popular items (Boratto et al., 2020).

Systematic stratification by popularity is crucial for diagnosing recommendation or prediction equity, especially when platform objectives include novelty, discovery, or fairness in exposure across the catalog.

5. Relation to Canonical Metrics and Fairness Criteria

Popularity-aware metrics generalize and extend beyond canonical user-averaged measures such as Precision@$k$, Recall@$k$, or global interval accuracy:

  • User-centric vs. Item-centric: Traditional evaluation averages over users; popularity-aware methods invert this, averaging over items within bins, and thereby provide a complementary "item perspective" (Boratto et al., 2020).
  • Exposure and Equal Opportunity: $\bar{p}_{\text{rec}}(\ell)$ operationalizes statistical parity (equal probability of recommendation across the popularity spectrum), while $\bar{p}_{\text{TPR}}(\ell)$ operationalizes equal opportunity (equal true-positive rate for relevant items regardless of popularity).
  • Weighted Aggregation: PAIA introduces a continuous analog by linearly weighting each instance by its normalized popularity, thus modulating the influence of rare vs. common cases (Szu-Tu et al., 24 Dec 2025).

These metrics serve both as tools for algorithm audit and as quantitative targets for debiasing objectives.

6. Extensions, Practical Adjustments, and Empirical Results

Key methodological choices and scenario-specific adjustments include:

  • Bin Granularity: Bins can be defined by quantiles, deciles, or sliding windows to target specific tail intervals.
  • Weighting Schemes: Metrics may be weighted uniformly (per bin), by number of exposures, or by denominator mass to prioritize bins with higher candidate exposure (see the sketch after this list).
  • Tail-focused Analysis: Analysts may restrict calculations (e.g., item statistical parity ISP, item equal opportunity IEO, or Gain) to the least-popular fraction of the catalog to diagnose long-tail promotion.
  • Dynamic Ground-truth: In implicit-feedback settings, $R^{\text{test}}(u,i)$ can be constructed at evaluation time (e.g., from clicks), making $\bar{p}_{\text{TPR}}(\ell)$ a bin-wise click-through rate.
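One way to realize the aggregation choices above (an illustrative helper; the scheme names and interface are our own, not from either paper):

```python
import numpy as np

def aggregate_bins(per_bin_values, bin_masses, scheme="uniform"):
    """Collapse per-bin metrics into a scalar under different bin weightings."""
    v = np.asarray(per_bin_values, dtype=float)
    m = np.asarray(bin_masses, dtype=float)   # e.g., exposures or denominator mass
    w = np.ones_like(m) if scheme == "uniform" else m
    w = w / w.sum()
    return float(np.sum(w * v))
```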

Reported experimental results for vision-LLMs on YearGuessr indicate that state-of-the-art VLMs exhibit Gains of up to $+34.18\%$ (Gemini 2.0-flash) on the most-viewed buildings, while pure vision models sometimes perform worse on high-popularity cases (negative Gain) (Szu-Tu et al., 24 Dec 2025). In recommender systems, Boratto et al. demonstrated a strong correlation between item popularity and both exposure and true-positive rate, with long-tail items systematically disadvantaged (Boratto et al., 2020).

7. Relation to Bias Mitigation and Future Directions

The emergence of popularity-aware interval accuracy metrics has spurred the development and evaluation of debiasing techniques in both recommendation and regression domains:

  • Algorithmic Debiasing: Approaches that aim to minimize the correlation between model predictions and item popularity can be monitored and validated using these metrics (Boratto et al., 2020).
  • Benchmarking and Model Selection: Popularity-aware metrics provide a protocol for robust reporting and comparison of models, guiding stakeholders toward systems exhibiting balanced performance.
  • Beyond-accuracy Quality Measures: They augment traditional metrics by exposing tradeoffs between accuracy, fairness, and exposure, a central concern in platforms with societal and business incentives for novelty and diversity.

A plausible implication is that broader adoption of these metrics will promote the design of fairer, more discovery-friendly algorithms that address limitations of current state-of-the-art models in both retrieval and ordinal regression tasks.


Key References

Metric/Concept | Context | Reference
$\mathrm{PAIA}(\tau)$, $\mathrm{IA}_\ell(\tau)$ | Vision-language, ordinal regression | (Szu-Tu et al., 24 Dec 2025)
$\bar{p}_{\text{rec}}(\ell)$, $\bar{p}_{\text{TPR}}(\ell)$ | Collaborative filtering, Top-$k$ recommendation | (Boratto et al., 2020)
$\mathrm{Gain}(\tau)$ | Accuracy difference across popularity strata | (Szu-Tu et al., 24 Dec 2025)
