Challenges and Implications of Using Black-Box APIs for Toxicity Evaluation in Computational Research
The paper "On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research" examines the problems that arise when black-box commercial APIs, such as the Perspective API, are used to evaluate the toxicity of text generated by large language models (LLMs). The authors investigate the consequences of relying on these APIs for academic benchmarking, focusing on the reproducibility and reliability of scientific findings.
Summary of Findings
The central thesis of the paper is that the continual evolution of toxicity detection APIs poses a substantial challenge to the reproducibility of research that compares and ranks models based on toxicity scores. Perceptions of toxicity shift over time and across cultural contexts, and black-box APIs are periodically retrained to track these shifts, silently changing the scores they return. Specifically, the paper emphasizes that:
- Variability in Toxicity Scores: Rescoring the RealToxicityPrompts dataset with a recent version of the Perspective API yielded a striking 49% reduction in the number of sequences classified as toxic, revealing a substantial discrepancy between older and newer API versions (a rescoring sketch follows this list).
- Impact on Benchmark Rankings: Rescoring model generations with the updated API altered the rankings of LLMs on established toxicity benchmarks such as HELM; for example, 13 models showed significant changes in toxicity scores, affecting their relative order.
- Challenges in Reproducibility: Toxicity mitigation techniques published between 2019 and 2023 showed changes in apparent efficacy when re-evaluated with the updated API version, indicating the risk of relying on historical scores when assessing the merits of new approaches.
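For context, the following is a minimal sketch of what such a rescoring pass might look like, assuming access to the Perspective API through Google's `google-api-python-client` (the client setup follows the published Perspective quickstart) and `scipy` for rank correlation. The model names, example generations, and stored "old" scores are hypothetical placeholders, not data from the paper.

```python
# Sketch: rescore stored model generations with the Perspective API and check
# whether the leaderboard order survives the API update.
import time
from googleapiclient import discovery
from scipy.stats import kendalltau

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # assumption: you have API access

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def toxicity(text: str) -> float:
    """Return the TOXICITY summary score (0-1) for a single string."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
        "doNotStore": True,
    }
    response = client.comments().analyze(body=body).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Hypothetical per-model generations with the toxicity scores recorded when the
# benchmark was originally run.
benchmark = {
    "model_a": [
        {"text": "You are a wonderful person.", "old_score": 0.02},
        {"text": "Get lost, you absolute idiot.", "old_score": 0.91},
    ],
    "model_b": [
        {"text": "Thanks for the thoughtful reply.", "old_score": 0.01},
        {"text": "That take is hot garbage.", "old_score": 0.55},
    ],
}

old_means, new_means = [], []
for model, rows in benchmark.items():
    fresh = []
    for row in rows:
        fresh.append(toxicity(row["text"]))
        time.sleep(1.0)  # respect the default 1 QPS quota
    old_means.append(sum(r["old_score"] for r in rows) / len(rows))
    new_means.append(sum(fresh) / len(fresh))

# Rank correlation near 1.0 means the ordering survived the API update; lower
# values signal the ranking instability described above.
tau, _ = kendalltau(old_means, new_means)
print(f"Kendall's tau between old and rescored rankings: {tau:.3f}")
```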
Implications
Practical Implications
The findings have far-reaching implications for researchers and practitioners deploying LLMs in real-world applications where toxicity detection is critical. Relying on outdated toxicity scores can lead to misinformed decisions about the risk a given model poses, and algorithmic content moderation efforts may drift from current standards if models are not periodically reassessed with up-to-date tools.
Theoretical Implications
The paper suggests a need for a fundamental shift in how toxicity is evaluated for LLMs, calling for a standardized approach that accommodates temporal and contextual variability. This might involve frameworks that account for dynamic perceptions of toxicity and provide transparency into API updates.
Recommendations for Future Research
The authors propose several recommendations to mitigate the issues identified:
- Versioning and Communication: API providers should consistently version models and implement a robust communication strategy to inform users about updates. This would enable researchers to ensure their studies reflect current standards of toxicity evaluation.
- Rescoring Protocols: Researchers should rescore text with the most recent API version, particularly when comparing new methods against previously published results, so that evaluation criteria remain consistent over time (see the sketch after this list).
- Transparency in Benchmarks: Living benchmarks should incorporate mechanisms for regularly rescoring included models in response to API updates, ensuring that all results reflect recent insights into toxicity evaluation.
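The rescoring recommendation could be operationalized as a small bookkeeping layer around whichever scorer is used. The sketch below is one possible implementation, assuming a generic `score_fn` callable (for example, a wrapper around the Perspective API such as the one above); it scores baseline and candidate outputs in the same pass and records the scorer identity and scoring date next to every score. All field and function names are illustrative, not taken from the paper.

```python
# Sketch: rescore a published baseline and a new method together, and persist
# scores with enough metadata to tell later whether they are stale.
from dataclasses import dataclass, asdict
from datetime import date
from typing import Callable, Iterable
import json

@dataclass
class ScoredGeneration:
    text: str
    toxicity: float
    scorer: str        # e.g. "perspective-api"
    scored_on: str     # ISO date; a proxy for the (unversioned) model snapshot

def rescore_together(
    baseline: Iterable[str],
    candidate: Iterable[str],
    score_fn: Callable[[str], float],
    scorer_name: str,
) -> dict[str, list[ScoredGeneration]]:
    """Score baseline and candidate outputs with the same scorer snapshot."""
    today = date.today().isoformat()

    def score_all(texts):
        return [ScoredGeneration(t, score_fn(t), scorer_name, today) for t in texts]

    return {"baseline": score_all(baseline), "candidate": score_all(candidate)}

# Usage: persist the scored records so future work can see when, and with what
# scorer, the comparison was made, and can rescore if the API has changed.
if __name__ == "__main__":
    results = rescore_together(
        baseline=["old model output ..."],
        candidate=["new model output ..."],
        score_fn=lambda t: 0.0,          # stand-in; plug in a real scorer here
        scorer_name="perspective-api",
    )
    with open("toxicity_scores.json", "w") as f:
        json.dump({k: [asdict(r) for r in v] for k, v in results.items()}, f, indent=2)
```

The key design choice is that the two sets of generations are never compared across different scoring runs: both are scored by the same snapshot of the scorer, which is exactly the consistency the rescoring recommendation calls for.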
Conclusion
This paper contributes valuable insights into the methodological challenges of using black-box APIs for toxicity evaluation, especially in the context of LLMs. The authors systematically demonstrate the practical and theoretical implications of these challenges, advocating for rigorous standards in toxicity assessment to improve reproducibility and comparability in computational research. The recommendations provided offer a pathway to more reliable and transparent toxicity evaluation practices, underscoring the need for ongoing discourse and innovation in this domain.