The Effect of Human v/s Synthetic Test Data and Round-tripping on Assessment of Sentiment Analysis Systems for Bias (2401.12985v1)

Published 15 Jan 2024 in cs.CL

Abstract: Sentiment Analysis Systems (SASs) are data-driven AI systems that output polarity and emotional intensity when given a piece of text as input. Like other AIs, SASs are also known to have unstable behavior when subjected to changes in data which can make it problematic to trust out of concerns like bias when AI works with humans and data has protected attributes like gender, race, and age. Recently, an approach was introduced to assess SASs in a blackbox setting without training data or code, and rating them for bias using synthetic English data. We augment it by introducing two human-generated chatbot datasets and also consider a round-trip setting of translating the data from one language to the same through an intermediate language. We find that these settings show SASs performance in a more realistic light. Specifically, we find that rating SASs on the chatbot data showed more bias compared to the synthetic data, and round-tripping using Spanish and Danish as intermediate languages reduces the bias (up to 68% reduction) in human-generated data while, in synthetic data, it takes a surprising turn by increasing the bias! Our findings will help researchers and practitioners refine their SAS testing strategies and foster trust as SASs are considered part of more mission-critical applications for global use.

Summary

  • The paper demonstrates that using human-generated test data and round-trip translation reveals significant differences in bias assessments compared to synthetic data.
  • It employs causal models and diverse datasets, including chatbot interactions and synthetic sources, to quantify bias through metrics like Weighted Rejection Scores and Deconfounding Impact Estimations.
  • The study shows that round-trip translation reduces bias in human data while increasing bias in synthetic data, underscoring the need for real-world evaluation in sentiment analysis.

The Effect of Human vs. Synthetic Test Data and Round-tripping on Assessment of Sentiment Analysis Systems for Bias

Introduction

The paper "The Effect of Human v/s Synthetic Test Data and Round-tripping on Assessment of Sentiment Analysis Systems for Bias" addresses the challenge of assessing bias in Sentiment Analysis Systems (SASs). Sentiment Analysis Systems output polarity and emotional intensity based on input text, but they may exhibit bias, particularly when dealing with protected attributes such as gender and race. This paper extends previous research by introducing two human-annotated datasets and examining the impact of round-trip translation on bias assessments.

Background

Bias in AI systems, including SASs, is a critical barrier to their adoption in trust-critical areas such as healthcare and education. Previous studies have primarily focused on synthetic English datasets, which may not fully capture bias in real-world scenarios. Furthermore, cross-lingual settings, where data is translated between languages, can also affect bias in SASs.

The authors investigate the effects of human-generated data and round-trip translation in bias assessment of SASs, using datasets like the Equity Evaluation Corpus (EEC) for controlled experiments.

Causal Model for Bias Assessment

The paper utilizes causal models to assess bias in SASs. A causal diagram represents cause-effect relationships among the attributes in a system. The model considers desirable attributes, protected attributes, and confounders that could skew sentiment analysis results (Figure 1).

Figure 1: Causal model for rating SASs.
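
As a rough illustration (not the authors' implementation), the causal diagram of Figure 1 could be encoded as a small directed graph; the node names and edges below are assumptions based on the attributes described in the paper.

```python
import networkx as nx

# Illustrative encoding of a causal diagram for rating an SAS:
# protected attributes (gender, race) and the emotion word used in a
# template may all influence the sentiment score emitted by the system.
causal_model = nx.DiGraph()
causal_model.add_edges_from([
    ("gender", "sentiment_score"),        # protected attribute -> output
    ("race", "sentiment_score"),          # protected attribute -> output
    ("emotion_word", "sentiment_score"),  # confounder -> output
])

# For an unbiased SAS, the edges from protected attributes to the sentiment
# score should carry no causal effect once confounders such as the emotion
# word are controlled for.
print(sorted(causal_model.predecessors("sentiment_score")))
```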

Data and Methodology

The research incorporates three datasets: a synthetic dataset (SD) and two human-generated datasets (HD1 and HD2). The synthetic dataset is derived from the EEC, while the human datasets come from chatbot interactions (ALLURE and Unibot data). For each dataset, sentiment values are compared across different gender and race classes to assess statistical and confounding bias (Figures 2 and 3).

Figure 2: Snapshot of the preprocessed ALLURE dataset (HD1).

Figure 3: Snapshot of the preprocessed Unibot data (HD2).
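
To make the comparison concrete, the following is a minimal sketch of how paired test sentences and SAS outputs might be organized; the example sentences, column names, and the sas_score placeholder are illustrative assumptions, not the paper's actual pipeline.

```python
import pandas as pd

# Illustrative paired records in the spirit of EEC-style test data: the same
# template instantiated with names from different gender and race classes.
rows = [
    {"template_id": 0, "gender": "female", "race": "African-American",
     "text": "Ebony feels angry about the delay."},
    {"template_id": 0, "gender": "male", "race": "European",
     "text": "Adam feels angry about the delay."},
]
df = pd.DataFrame(rows)

def sas_score(text: str) -> float:
    """Placeholder for a call to the black-box SAS under test."""
    raise NotImplementedError

# df["sentiment"] = df["text"].apply(sas_score)
# Bias assessment then compares score distributions per protected class, e.g.:
# df.groupby("gender")["sentiment"].describe()
```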

Experiments and Analysis

Impact of Round-trip Translation

The authors explore how round-trip translation, where text is translated into an intermediate language and then back into the original language, affects bias in SASs. The paper uses Spanish and Danish as intermediate languages, assessing any changes in the statistical and confounding bias of SASs (Figure 4).

Figure 4: Methodology for comparing bias scores on original and round-trip translated data.
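
A minimal sketch of the round-trip step is shown below; translate is a placeholder for whatever machine translation service is used and is an assumption rather than the paper's tooling.

```python
def translate(text: str, source: str, target: str) -> str:
    """Placeholder for a machine translation service call."""
    raise NotImplementedError

def round_trip(text: str, intermediate: str) -> str:
    """Translate English text into an intermediate language and back to English."""
    forward = translate(text, source="en", target=intermediate)
    return translate(forward, source=intermediate, target="en")

# Following the methodology in Figure 4, bias is rated twice and compared:
# once on the original sentences and once on their round-tripped versions.
for lang in ("es", "da"):  # Spanish and Danish intermediates, as in the paper
    pass  # rt_text = round_trip(sentence, intermediate=lang); score both versions
```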

Bias Comparisons

The experiments reveal that SASs exhibit more statistical bias on human-generated datasets than on synthetic data, highlighting the importance of using real-world data for bias assessment. Additionally, human perception of bias (S_h) shows less statistical bias but some confounding bias, indicating discrepancies between automated sentiment scores and human judgment.
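
As one way to make "statistical bias" concrete, the sketch below checks whether mean sentiment differs across two protected-attribute classes using a standard two-sample t-test; whether this is the exact test used by the authors is an assumption, and the scores are made up for illustration.

```python
from scipy import stats

def statistical_bias_detected(scores_a, scores_b, alpha=0.05) -> bool:
    """Reject the null hypothesis that the SAS assigns the same mean sentiment
    to both protected-attribute classes (illustrative Welch two-sample t-test)."""
    _, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)
    return p_value < alpha

# Example: sentiment scores for the same templates instantiated with
# female vs. male names (values invented for illustration).
female_scores = [0.61, 0.55, 0.72, 0.40]
male_scores = [0.70, 0.66, 0.81, 0.52]
print(statistical_bias_detected(female_scores, male_scores))
```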

Results of Causal Testing

The causal relationships in the datasets were analyzed to validate the hypotheses on bias. The paper computes Weighted Rejection Scores (WRS) and Deconfounding Impact Estimation (DIE) to quantify bias levels in the different datasets (Figures 5 and 6).

Figure 5: Causal model for rating SASs on HD1.

Figure 6: Causal model for rating SASs on HD2.
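
The exact WRS and DIE definitions come from the authors' earlier rating work; the sketch below shows one plausible form of a weighted rejection score, where the significance levels and weights are assumptions rather than the paper's settings.

```python
def weighted_rejection_score(p_values, levels=((0.05, 1.0), (0.10, 0.5))):
    """Illustrative Weighted Rejection Score: each hypothesis test that rejects
    the null (no difference across protected classes) contributes a weight
    determined by the strictest significance level at which it rejects.
    The levels and weights here are assumptions, not the paper's exact values."""
    score = 0.0
    for p in p_values:
        for alpha, weight in levels:
            if p < alpha:
                score += weight
                break
    return score

# A higher WRS indicates more detected statistical bias across the tested
# protected-attribute pairs; such scores are then mapped to bias ratings.
print(weighted_rejection_score([0.01, 0.08, 0.30]))  # -> 1.5 under these weights
```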

Conclusion

The paper demonstrates that the bias in SASs can be more accurately assessed using human-generated data and round-trip translation methods, compared to purely synthetic datasets. Round-trip translation generally decreased bias in human data but increased it in synthetic data. These insights provide valuable guidance for developing SASs with reduced bias, especially for applications in multilingual and diverse settings. This research emphasizes the importance of contextually relevant test data and causal inference techniques in evaluating AI systems for bias.
