
Bias and Debias in Recommender Systems

Updated 25 February 2026
  • Bias and debias in recommender systems are defined by observational pitfalls like selection, exposure, and popularity biases, impairing accuracy and fairness.
  • Methodologies such as inverse propensity weighting, doubly robust estimators, and causal techniques offer rigorous frameworks to quantify and correct bias.
  • Recent advances integrate graph-based re-ranking, meta-learning, and counterfactual approaches, improving personalization while managing dynamic bias effects.

Recommender systems, which provide personalized item suggestions based on user behavior, are universally recognized as vulnerable to a broad range of biases arising from the observational nature of their input data. These biases stem from selection processes, exposure mechanisms, user–system feedback loops, and latent confounding structures. Their effect is twofold: they impair the accuracy and reliability of personalization and they can reinforce or even amplify unfairness, polarization, and filter bubbles. The challenge of debiasing these systems has produced an extensive research literature, including causal modeling, sophisticated risk weighting, re-ranking algorithms, meta-learning, and adversarial frameworks, as well as practical toolchains for evaluation on partially-observed or randomized datasets.

1. Fundamental Categories and Causal Perspectives on Bias

Research on the taxonomy of biases in recommender systems consistently identifies several archetypal forms, each with mathematically precise characterizations (Chen et al., 2020):

  • Selection Bias: Only a non-random subset of user–item ratings is observed, rendering the data Missing Not At Random (MNAR) (Chen et al., 2020, Saito, 2019). Formally, the probability of observing a rating depends on the user, item, and rating value, $p(O_{ui}=1 \mid u,i,r) \neq \text{const}$, violating the assumptions of standard ERM.
  • Exposure Bias: In implicit-feedback scenarios, users only interact with a subset of items they are exposed to. Negative feedback is only meaningful if exposure is guaranteed. Exposure is generally non-uniform and structured (Kim et al., 2022, Khenissi et al., 2020).
  • Position Bias: The probability that an item is clicked depends on its ranking position, which arises from both user attention patterns and system presentation (Wang, 2024).
  • Popularity Bias: Algorithms, particularly those based on collaborative signals, tend to overly recommend items that have historically been interacted with frequently—yielding the "rich-get-richer" effect (Zhu et al., 2022, Barenji et al., 2023).
  • Amplified Subjective Bias (Preference Amplification): The tendency to overfocus on categories a user has already favored, increasing with repeated recommendations (Guo et al., 2023).
  • Latent Confounder Bias: The classic causal structure in which an unobserved variable $C$ influences both exposure $A$ and observed feedback $R$, generating spurious dependencies (Deng et al., 2024, Zhang et al., 9 Jun 2025).
  • Fairness-Related Bias (Unfairness): Subgroups defined by demographic or item attributes are unfairly under- or over-represented in recommendations, often due to imbalanced data or intersectional bias (Farnadi et al., 2018, Barenji et al., 2023).
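The selection-bias entry above can be made concrete with a small simulation. The propensities below (observation probability rising with the rating value, as when users preferentially rate items they like) are purely illustrative; the point is that the naive mean over observed MNAR ratings is biased upward:

```python
import random

random.seed(0)

# Toy MNAR simulation: true ratings are uniform over {1..5}, but the
# probability of *observing* a rating grows with its value, so
# p(O_ui = 1 | u, i, r) depends on r.  All propensities are made up.
n = 100_000
true_ratings = [random.randint(1, 5) for _ in range(n)]
p_obs = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.80}
observed = [r for r in true_ratings if random.random() < p_obs[r]]

true_mean = sum(true_ratings) / len(true_ratings)
obs_mean = sum(observed) / len(observed)
# The naive average over observed ratings substantially overestimates
# the true mean rating (roughly 4.2 vs. 3.0 here).
print(f"true mean = {true_mean:.2f}, observed mean = {obs_mean:.2f}")
```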

Technically, these biases are typically formalized using graphical models (DAGs/SCMs), which specify the flow of information and interventions such as $do(A=a)$ to reason about counterfactuals and backdoor adjustments (Deng et al., 2024, Wang et al., 2021, Guo et al., 2023, Zhang et al., 9 Jun 2025).
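As a toy illustration of backdoor adjustment, consider a single binary confounder C (say, user activity level) that influences both exposure A and feedback R. All probabilities below are invented for illustration; the point is that naive conditioning P(R|A) and the interventional quantity P(R|do(A)) disagree:

```python
# P(C=1): distribution of the (unobserved) confounder
p_c = {0: 0.7, 1: 0.3}
# P(A=1 | C=c): active users are exposed more often
p_a_given_c = {0: 0.2, 1: 0.8}
# P(R=1 | A=a, C=c): feedback depends on both exposure and confounder
p_r_given_ac = {(0, 0): 0.05, (0, 1): 0.30, (1, 0): 0.10, (1, 1): 0.50}

# Naive estimate: P(R=1 | A=1) = sum_c P(R=1 | A=1, c) P(c | A=1)
p_a1 = sum(p_a_given_c[c] * p_c[c] for c in (0, 1))
p_c_given_a1 = {c: p_a_given_c[c] * p_c[c] / p_a1 for c in (0, 1)}
naive = sum(p_r_given_ac[(1, c)] * p_c_given_a1[c] for c in (0, 1))

# Backdoor adjustment: P(R=1 | do(A=1)) = sum_c P(R=1 | A=1, c) P(c)
adjusted = sum(p_r_given_ac[(1, c)] * p_c[c] for c in (0, 1))

print(f"naive P(R=1 | A=1)       = {naive:.3f}")    # inflated by C
print(f"adjusted P(R=1 | do(A=1)) = {adjusted:.3f}")
```

The naive estimate is inflated because active users are both exposed more and more likely to respond; marginalizing the confounder with its unconditional distribution removes that spurious dependence.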

2. Metrics and Empirical Manifestations of Bias

Multiple quantitative measures are used to detect, monitor, and evaluate bias and its amplification in recommender outputs:

| Bias Type | Metric Example | Reference |
| --- | --- | --- |
| Selection | Risk discrepancy $\Delta L$ | (Chen et al., 2021) |
| Exposure | Propensity score $p_{ui}$, exposure Gini | (Khenissi et al., 2020, Kim et al., 2022) |
| Position | Position-bias metric (MSE to $1/m$ baseline) | (Wang, 2024) |
| Popularity | Exposure/visibility bias, Gini coefficient | (Zhu et al., 2022, Barenji et al., 2023) |
| Fairness/Unfairness | Demographic parity, value unfairness | (Farnadi et al., 2018) |
| Amplification | Bias disparity BD(G,C), calibration KL | (Tsintzou et al., 2018, Wang et al., 2021) |
| Multi-facet/Intersectional | Visibility/exposure bias by group | (Barenji et al., 2023) |

Bias Amplification denotes the phenomenon that recommender outputs can exhibit even greater categorical, demographic, or popularity bias than the input data—a dynamic feedback loop identified in both synthetic and real-world iterative experiments (Tsintzou et al., 2018, Zhu et al., 2022). Gini coefficients and disparity metrics are routinely used to formalize this effect.
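A minimal implementation of the Gini coefficient over an item-exposure distribution (the standard formula, independent of any particular cited paper) might look like:

```python
def gini(exposures):
    """Gini coefficient of an exposure-count distribution.

    0 means perfectly equal exposure across items; values approaching 1
    mean exposure is concentrated on a few (popular) items.
    """
    xs = sorted(exposures)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula: G = 2 * sum_i i*x_(i) / (n * sum x) - (n+1)/n,
    # where x_(i) are the counts sorted ascending and i is 1-based.
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * cum / (n * total) - (n + 1) / n

# Near-uniform exposure vs. long-tail exposure:
print(gini([10, 10, 10, 10]))  # 0.0 -- perfectly equal
print(gini([1, 1, 1, 97]))     # 0.72 -- heavily concentrated
```

Tracking this value over successive retraining rounds is one way to detect the amplification loop described above.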

3. Core Methodological Approaches to Debiasing

3.1 Inverse Propensity Weighting and Variants

Inverse Propensity Score (IPS) estimators weight each observed data point by the inverse probability of selection, yielding theoretically unbiased estimates under correctly specified propensities (Saito, 2019, Chen et al., 2021, Huang et al., 2024):

$$\hat{L}_{IPS}(f) = \frac{1}{|\mathcal{U}||\mathcal{I}|} \sum_{(u,i):\,O_{ui}=1} \frac{\delta(f(u,i), r_{ui})}{p_{ui}}$$

Very small propensities $p_{ui}$ inflate the inverse weights and cause high estimator variance, necessitating smoothing or regularization (Huang et al., 2024). Multifactorial bias correction considers propensities jointly dependent on the item and rating value and employs Laplace smoothing to prevent variance explosion (Huang et al., 2024).
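The IPS estimator above transcribes directly into code. The function and variable names below are illustrative, and `delta` is any pointwise loss:

```python
def ips_risk(observed, p, n_users, n_items, delta):
    """Inverse-propensity-scored risk estimate.

    observed: list of (u, i, prediction, rating) for entries with O_ui = 1
    p:        dict mapping (u, i) -> observation propensity p_ui
    delta:    pointwise loss, e.g. squared error
    """
    total = sum(delta(pred, r) / p[(u, i)] for u, i, pred, r in observed)
    # Normalize by the full user-item matrix size, not just the
    # observed entries -- this is what makes the estimate unbiased.
    return total / (n_users * n_items)

# Usage sketch with squared-error loss on two observed entries:
sq = lambda a, b: (a - b) ** 2
obs = [(0, 0, 4.0, 5.0), (1, 2, 3.0, 3.0)]
props = {(0, 0): 0.5, (1, 2): 0.25}
print(ips_risk(obs, props, n_users=2, n_items=3, delta=sq))  # 1/3
```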

3.2 Doubly Robust and Meta-Learning-Based Techniques

Doubly Robust (DR) estimators combine a propensity-weighted term with an imputation model for missing data, remaining unbiased if at least one of the two models is correctly specified (Chen et al., 2021, Li et al., 2023). Meta-learning approaches such as AutoDebias learn the debiasing weights (e.g., propensities, imputation values) by bi-level optimization against a small unbiased sample (Chen et al., 2021, Li et al., 2023). This class is robust to model misspecification and can be made model-agnostic.
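The DR estimator can be sketched as follows; this is a generic formulation of the idea, not the exact estimator of any one cited paper. The imputation term covers every user–item pair, and a propensity-weighted correction is added on the observed pairs:

```python
def dr_risk(imputed_loss, observed_loss, p, n_users, n_items):
    """Doubly robust risk estimate (illustrative sketch).

    imputed_loss:  dict (u, i) -> imputed loss e_hat_ui for ALL pairs
    observed_loss: dict (u, i) -> actual loss e_ui for observed pairs
    p:             dict (u, i) -> propensity p_ui for observed pairs
    """
    total = 0.0
    for ui, e_hat in imputed_loss.items():
        total += e_hat  # error-imputation term, every (u, i) pair
        if ui in observed_loss:
            # Propensity-weighted correction on observed entries:
            # vanishes in expectation if the imputation is correct,
            # and fixes it if the propensities are correct.
            total += (observed_loss[ui] - e_hat) / p[ui]
    return total / (n_users * n_items)

# Tiny usage example: one user, two items, one observed entry.
imputed = {(0, 0): 0.5, (0, 1): 0.5}
obs_loss = {(0, 0): 1.0}
props = {(0, 0): 0.5}
print(dr_risk(imputed, obs_loss, props, n_users=1, n_items=2))  # 1.0
```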

3.3 Causal Inference: Instrumental Variables and Confounder Adjustment

Causal approaches explicitly model confounding bias using SCMs or DAGs, employing backdoor or instrumental variable techniques for adjustment (Deng et al., 2024, Wang et al., 2021, Guo et al., 2023, Zhang et al., 9 Jun 2025).

  • Instrumental Variable (IV): Applies when observed proxies (e.g., user features $Z$) satisfy instrument conditions for latent variable adjustment, as in IViDR (Deng et al., 2024).
  • Variational Auto-Encoder (VAE) and Disentangled Representation: IViDR and DB-VAE use iVAE modules to learn representations of latent confounders and isolate their effects, achieving identifiability under exponential-family and injectivity conditions (Deng et al., 2024, Guo et al., 2023).
  • Confounder Fusion: Fusing confounder estimates from both original and reconstructed (IV-debiased) interaction data allows two bias channels (exposure–feedback and item–feedback) to be handled jointly (Deng et al., 2024).

3.4 Graph-Based and Regularization-Based Debiasing

Debiasing neighbor aggregation in GNN recommenders uses inverse propensity reweighting at the neighborhood level, not just on the loss, effectively emphasizing long-tail items in user embedding propagation (Kim et al., 2022). Regularization techniques penalize deviation from unbiased click propensities (e.g., uniform click probabilities across positions), operationalized as added penalty terms in the objective (Wang, 2024).
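The neighborhood-level reweighting idea can be sketched as follows. This is a minimal, dependency-free illustration of the principle (weights inversely proportional to exposure propensity), not the algorithm from the cited paper:

```python
def debiased_user_embedding(item_embs, neighbor_items, propensity):
    """One propensity-reweighted neighbor aggregation step (sketch).

    Instead of averaging a user's interacted-item embeddings uniformly,
    each neighbor i is weighted by 1 / propensity[i], so long-tail items
    (low exposure propensity) contribute more to the user embedding.
    """
    weights = [1.0 / propensity[i] for i in neighbor_items]
    z = sum(weights)
    weights = [w / z for w in weights]  # normalize to preserve scale
    dim = len(item_embs[0])
    out = [0.0] * dim
    for w, i in zip(weights, neighbor_items):
        for d in range(dim):
            out[d] += w * item_embs[i][d]
    return out

# Item 0 is popular (propensity 0.9), item 1 is long-tail (0.1):
embs = [[1.0, 0.0], [0.0, 1.0]]
u = debiased_user_embedding(embs, [0, 1], {0: 0.9, 1: 0.1})
print(u)  # the long-tail item dominates: approximately [0.1, 0.9]
```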

3.5 Post-hoc Re-Ranking and Multi-facet Fairness

Group-based post-processing re-ranking, such as GULM and MFAIR, swaps recommended items across groups or categories to minimize utility loss while enforcing group-level exposure or fairness constraints (Tsintzou et al., 2018, Barenji et al., 2023). This approach can target intersectional biases (e.g., items that are both long-tail and under-represented by continent).
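A toy greedy re-ranker in this spirit (illustrative only, not GULM or MFAIR): it guarantees a minimum number of long-tail ("tail") group items in the top-k while otherwise preserving score order, so the utility loss is limited to the lowest-ranked displaced items:

```python
def rerank_with_exposure_floor(candidates, group, k, min_tail):
    """Greedy post-hoc re-ranking sketch (names are illustrative).

    candidates: list of (item, score), sorted by score descending
    group:      dict item -> 'head' or 'tail'
    Guarantees at least `min_tail` tail-group items among the top k
    (assuming enough tail candidates exist in the list).
    """
    top = []
    tail_quota = min_tail
    for item, score in candidates:
        slots_left = k - len(top)
        if slots_left == 0:
            break
        # Reserve the remaining slots for tail items if the quota
        # is still unmet; skip head items that would fill them.
        if group[item] == 'head' and slots_left <= tail_quota:
            continue
        top.append(item)
        if group[item] == 'tail':
            tail_quota = max(0, tail_quota - 1)
    return top

cands = [('a', .9), ('b', .8), ('c', .7), ('d', .6), ('e', .5)]
grp = {'a': 'head', 'b': 'head', 'c': 'head', 'd': 'tail', 'e': 'tail'}
print(rerank_with_exposure_floor(cands, grp, k=3, min_tail=1))
# ['a', 'b', 'd'] -- 'c' is displaced by the highest-scoring tail item
```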

4. Empirical Evaluation and Practical Considerations

Datasets and Metrics

A core challenge in evaluating debiasing is the scarcity of fully-exposed datasets; most data are MNAR. Recent practice leverages randomized exposure logs (RCT-derived samples), enabling unbiased evaluation via tailored estimators such as Unbiased Recall Evaluation (URE) (Wang et al., 2024). URE addresses the inconsistency between Recall@K on randomly-exposed and fully-exposed datasets by matching cutoffs and correcting for sample size.
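For reference, plain Recall@K on a randomized-exposure holdout looks as follows. This is the standard metric, not the URE estimator itself: URE additionally matches cutoffs and corrects for sample size, which this sketch does not do:

```python
def recall_at_k(ranked_items, relevant, k):
    """Standard Recall@K: fraction of relevant items found in the
    top-k of the ranking."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / max(1, len(relevant))

# On a randomized-exposure holdout, every item had an equal chance of
# being shown, so the relevant set is not itself exposure-biased.
print(recall_at_k(['a', 'b', 'c', 'd'], {'b', 'd', 'x'}, k=3))  # 1/3
```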

| Dataset | Type | Biases Present | Noted Papers |
| --- | --- | --- | --- |
| Yahoo!R3, Coat | Explicit + RCT-unbiased split | Selection, MNAR | (Tsintzou et al., 2018, Chen et al., 2021) |
| MovieLens, KuaiRec | Various | All (selection, exposure) | (Deng et al., 2024, Zhu et al., 2022) |
| King Sets (A/B/C) | Implicit (mobile gaming) | Selection, exposure | (Wang et al., 2024) |

Careful statistical significance testing ($p \ll 0.01$) and reporting on fairness, NDCG, recall, and Gini/disparity metrics is standard. Algorithms are compared against baselines such as MF, IPS, DR, re-ranking, and meta-learned approaches.

Practical Guidelines

  • Meta-learned or bi-level debiasers such as AutoDebias and balancing approaches perform best even with a small fraction of unbiased data (Chen et al., 2021, Li et al., 2023).
  • Factor models should incorporate personalized or groupwise exposure estimates, not just global popularity (Khenissi et al., 2020).
  • Adopting multifactorial or intersectional bias models is superior to single-factor corrections for real-world applications (Barenji et al., 2023, Huang et al., 2024).
  • For GNN-based recommenders, debias neighbor aggregation in addition to loss weighting for maximum effect (Kim et al., 2022).
  • In deployment, continuously collect randomized log data to calibrate debiasing parameters and mitigate drift in bias (Wang et al., 2024).

5. Theoretical Guarantees and Open Problems

Multiple methods carry formal theoretical properties:

  • IPS: Unbiased risk estimation under correct propensities (Saito, 2019, Chen et al., 2020).
  • DR: Remains unbiased if either the propensity or imputation model is correct (Chen et al., 2021).
  • Meta-Learning (AutoDebias): Generalization bound is established under mild conditions and finite sample (Chen et al., 2021).
  • Identifiability in Causal VAEs: Satisfied under conditions of nondegenerate noise, injectivity, and sufficient-statistic independence (Deng et al., 2024).
  • Instrumental Variable Soundness: Constructed user-feature embeddings can be formally proved to meet IV conditions in DAGs (Deng et al., 2024).

Notable open problems include:

  • Learning reliable propensities in MNAR and nonstationary environments without large-scale randomization (Chen et al., 2020).
  • Unified frameworks for multiple simultaneous, possibly interacting, biases (Barenji et al., 2023, Deng et al., 2024).
  • Causal graph selection and explainability for arbitrary domains (Chen et al., 2020).
  • Bias evaluation in dynamic, feedback-looped production settings and under strategic agent behavior (Xiang et al., 2024).
  • Harmonizing factual (future user action) and counterfactual (user satisfaction) test environments, as well as tuning the trade-off between accuracy and fairness (Kang et al., 17 Oct 2025).
  • Extending debiasing to federated, privacy-preserving, or real-time batchless settings (Wang et al., 2024).

6. Recent Advances in Debiasing Algorithm Design

Several state-of-the-art frameworks demonstrate the breadth of algorithmic progress:

  • IViDR (Instrumental Variable + Identifiable VAE): Simultaneous deconfounding of item–feedback and exposure–feedback bias with theoretical guarantees and strong empirical results (Deng et al., 2024).
  • DB-VAE (Disentangled VAE w/ Counterfactuals): Explicitly disentangles popularity and subjective preference amplification, learns unbiased user representations, and augments with massive counterfactual data using Pearl’s abduction–action–prediction logic (Guo et al., 2023).
  • Performative Debias (Strategic Agents): Models producer-side feature optimization under exposure bias, formalizing the feedback loop in item attribute space and optimizing for both accuracy and fairness under differentiable ranking (Xiang et al., 2024).
  • BPL (Bias-adaptive Preference Distillation Learning): Dual distillation retains factual knowledge and refines counterfactual prediction, maintaining high performance in both biased and unbiased test environments (Kang et al., 17 Oct 2025).

Each approach tailors a mix of theoretical guarantees, practical implementations, and empirical validation to distinct real-world constraints, reflecting the growing maturity and multidimensionality of the field.


In summary, bias and debias in recommender systems is a mature, technically diverse domain. Both theory and empirics converge on the necessity of principled causal modeling, evaluation on unbiased or properly corrected test sets, and flexible algorithmic architectures that can handle complex, multifactorial, and dynamic sources of bias, with the ultimate goal of accurate, fair, and robust personalized recommendations.
