Controlling Learned Effects to Reduce Spurious Correlations in Text Classifiers (2305.16863v2)
Abstract: To keep NLP classifiers from learning spurious correlations between training features and target labels, a common approach is to make the model's predictions invariant to those features. However, this can be counter-productive when a feature has a non-zero causal effect on the target label and is therefore important for prediction. Using methods from the causal inference literature, we therefore propose an algorithm that regularizes the learned effect of a feature on the model's prediction toward the estimated effect of that feature on the label. This yields an automated augmentation method that uses the estimated effect of a feature to appropriately adjust the labels of newly augmented inputs. On toxicity and IMDB review datasets, the proposed algorithm minimizes spurious correlations and improves accuracy on the minority group (i.e., samples that break the spurious correlations), while also improving total accuracy compared to standard training.
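The augmentation idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the feature is a single token, that labels are soft scores in [0, 1], and that a scalar estimate `tau` of the feature's causal effect on the label is already available (the paper estimates this with causal-inference methods). Toggling the feature in an input then shifts the augmented label by `tau`.

```python
def augment_with_effect(texts, labels, feature_token, tau):
    """Hypothetical sketch: toggle a feature token in each input and
    shift the soft label by the estimated causal effect tau, so the
    augmented pair reflects the feature's estimated effect on the label."""
    aug_texts, aug_labels = [], []
    for text, y in zip(texts, labels):
        if feature_token in text:
            # Removing the feature should lower the expected label by tau.
            aug_texts.append(text.replace(feature_token, "").strip())
            aug_labels.append(max(0.0, min(1.0, y - tau)))
        else:
            # Adding the feature should raise the expected label by tau.
            aug_texts.append((text + " " + feature_token).strip())
            aug_labels.append(max(0.0, min(1.0, y + tau)))
    return aug_texts, aug_labels

# Toy usage with an assumed effect estimate tau = 0.3.
texts = ["great movie", "terrible plot twist"]
labels = [0.8, 0.4]
aug_texts, aug_labels = augment_with_effect(texts, labels, "twist", 0.3)
```

Training on the original and augmented pairs together then pushes the model's learned effect of the feature toward `tau` rather than toward zero, which is what distinguishes this from invariance-based debiasing.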