TaCo: Targeted Concept Erasure Prevents Non-Linear Classifiers From Detecting Protected Attributes
Abstract: Ensuring fairness in NLP models is crucial, as they often encode sensitive attributes like gender and ethnicity, leading to biased outcomes. Current concept erasure methods attempt to mitigate this by modifying final latent representations to remove sensitive information without retraining the entire model. However, these methods typically rely on linear classifiers, leaving models vulnerable to non-linear adversaries capable of recovering sensitive information. We introduce Targeted Concept Erasure (TaCo), a novel approach that removes sensitive information from final latent representations, ensuring fairness even against non-linear classifiers. Our experiments show that TaCo outperforms state-of-the-art methods, achieving greater reductions in the prediction accuracy of sensitive attributes by non-linear classifiers while preserving overall task performance. Code is available at https://github.com/fanny-jourdan/TaCo.
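The abstract's central claim, that erased representations resist non-linear adversaries, is typically tested with a probing protocol: train a non-linear classifier to recover the protected attribute from the representations before and after erasure, and check how far its accuracy drops toward chance. Below is a minimal sketch of such a probe, not the TaCo method itself (which lives in the linked repository); the arrays `reps_raw`, `reps_erased`, and `labels` are hypothetical stand-ins, filled here with synthetic data so the example runs.

```python
# Minimal non-linear probing sketch: an MLP adversary tries to predict a
# protected attribute (e.g., binary gender label) from latent representations.
# Assumption: representations are fixed-size vectors (here, random 768-d
# placeholders); in practice they would come from the model under evaluation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def nonlinear_probe_accuracy(reps: np.ndarray, labels: np.ndarray) -> float:
    """Test accuracy of a non-linear adversary predicting the attribute."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        reps, labels, test_size=0.2, random_state=0, stratify=labels
    )
    probe = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300, random_state=0)
    probe.fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Synthetic placeholder data standing in for real model representations.
rng = np.random.default_rng(0)
reps_raw = rng.normal(size=(2000, 768))      # representations before erasure
reps_erased = rng.normal(size=(2000, 768))   # representations after erasure
labels = rng.integers(0, 2, size=2000)       # protected-attribute annotations

# An effective erasure method should push the probe toward chance accuracy
# on the erased representations while main-task performance is preserved.
acc_before = nonlinear_probe_accuracy(reps_raw, labels)
acc_after = nonlinear_probe_accuracy(reps_erased, labels)
print(f"non-linear probe accuracy: {acc_before:.3f} -> {acc_after:.3f}")
```

Using an MLP (rather than logistic regression) as the adversary is what distinguishes this evaluation from the linear-probe setting the abstract criticizes.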