
Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance (2410.18889v1)

Published 24 Oct 2024 in cs.CL

Abstract: NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in LLMs offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. Through a case study of four datasets from the TRUE benchmark, covering different tasks and domains, we empirically analyze the labeling quality of existing datasets, and compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs' so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate them in training to improve model performance.

Analysis of LLM-Based Label Error Detection and Correction in NLP Benchmarks

The paper "Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance" presents an empirical investigation into the efficacy of LLMs for detecting and correcting label errors in NLP datasets. It emphasizes that high-quality annotations are crucial for both model training and evaluation.

Context and Motivation

The advent of LLMs has catalyzed significant advances in NLP, necessitating larger and more diverse datasets. Traditional annotation methods face a trade-off: expert annotation yields consistent, high-quality labels but scales poorly, while crowd-sourcing scales well but sacrifices precision and consistency. This paper posits that LLMs can play a pivotal role in identifying and correcting label errors, offering an alternative that balances scale and precision.

Methodology

The authors adopt the recent "LLM-as-a-judge" approach, wherein an ensemble of LLMs is employed to detect mislabeled instances. This approach involves:

  1. LLM Ensemble Creation: Deploying multiple LLMs, diversified through different prompts, to improve the reliability of predictions.
  2. Flagging Protocol: Flagging instances where the LLM ensemble's predictions strongly disagree with the original labels as potentially mislabeled.

The paper focuses on four datasets from the TRUE benchmark, examining annotation quality across multiple tasks, including summarization and dialogue. The detection process is complemented by comparative analyses with expert and crowd-sourced annotations to establish gold-standard labels.
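The flagging protocol can be illustrated with a short sketch. The code below is a minimal illustration rather than the authors' implementation: the `Example` type, the `judges` callables (each wrapping a different LLM or prompt variant), and the agreement threshold are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Example:
    text: str            # e.g., a (grounding document, claim) pair rendered as text
    original_label: int  # 1 = factually consistent, 0 = inconsistent

def flag_label_errors(
    examples: List[Example],
    judges: List[Callable[[Example], int]],  # each judge maps an example to a predicted label
    min_agreement: float = 0.8,              # hypothetical confidence threshold
) -> List[Dict]:
    """Flag examples whose original label strongly disagrees with the LLM ensemble."""
    flagged = []
    for ex in examples:
        votes = [judge(ex) for judge in judges]
        # Ensemble prediction is the majority vote; confidence is the fraction of agreeing judges.
        majority = int(sum(votes) > len(votes) / 2)
        confidence = votes.count(majority) / len(votes)
        if majority != ex.original_label and confidence >= min_agreement:
            flagged.append({"example": ex, "llm_label": majority, "confidence": confidence})
    return flagged
```

In this sketch, each entry of `judges` would wrap one LLM-and-prompt combination; the examples flagged with high ensemble confidence are the candidates that would be sent for expert re-annotation.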

Key Findings

  • Label Error Prevalence: Existing datasets reveal label error rates between 6% and 21%, indicating significant room for improvement in current benchmarks.
  • Impact on Performance: Correcting these label errors resulted in noticeable increases in model performance, suggesting that many reported errors are due to flawed labels rather than model shortcomings.
  • LLM Capabilities: The precision of LLMs in detecting label errors improves as their confidence increases. Instances with high LLM confidence yielded a 15% improvement in model performance after correction.
  • Comparison with Human Annotation: LLMs outperformed crowd-sourced annotations, offering a better trade-off between quality and efficiency; however, they matched expert-level accuracy only when the annotation protocol accounted for their limitations.
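The reported link between ensemble confidence and detection precision can be made concrete with a small sketch (again a hypothetical illustration, not the paper's evaluation code): flagged examples are grouped by ensemble confidence, and precision is measured as the fraction of flags that expert re-annotation confirms.

```python
from typing import Dict, List

def precision_at_confidence(
    flagged: List[Dict],
    gold_labels: Dict[str, int],   # expert-adjudicated label, keyed by example text
    thresholds=(0.6, 0.8, 1.0),    # hypothetical confidence cut-offs
) -> Dict[float, float]:
    """Precision of flagged label errors at increasing ensemble-confidence thresholds.

    `flagged` follows the structure produced by the sketch above:
    dicts with keys "example", "llm_label", and "confidence".
    """
    results = {}
    for t in thresholds:
        subset = [f for f in flagged if f["confidence"] >= t]
        if not subset:
            results[t] = float("nan")  # no flags at this threshold
            continue
        # A flag is correct when the expert label agrees with the ensemble,
        # i.e., the original label really was wrong.
        correct = sum(1 for f in subset if gold_labels[f["example"].text] == f["llm_label"])
        results[t] = correct / len(subset)
    return results
```

If precision rises with the threshold, higher-confidence flags are more trustworthy, which is the pattern the paper reports.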

Implications and Future Directions

Correcting label errors using LLMs not only raises measured model performance but also improves the reliability of NLP benchmarks. The findings have substantial implications for model evaluation protocols, urging a reassessment of previously established performance baselines.

Practically, LLM-based annotation offers a scalable and cost-effective solution to dataset creation, potentially reducing the reliance on manual annotations. Theoretical advancements may focus on refining LLM ensemble techniques to further improve error detection accuracy and address biases.

For future research, extending this methodology to a wider range of tasks and examining long-term effects on model generalization and transfer learning would be valuable. Additionally, hybrid pipelines that combine LLM annotation with human intervention via active learning could offer a more nuanced approach to dataset refinement.
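One concrete way to mitigate mislabeled training data, sketched below under the same assumptions as the earlier snippets (the paper's actual mitigation strategies may differ), is to either drop or relabel the training examples that the ensemble flags with high confidence before fine-tuning.

```python
from typing import Dict, List

def mitigate_training_set(
    examples: List["Example"],   # the Example type from the earlier sketch
    flagged: List[Dict],
    strategy: str = "relabel",   # "filter" drops flagged examples; "relabel" replaces their labels
) -> List["Example"]:
    """Return a cleaned training set with flagged label errors filtered or relabeled."""
    flags = {f["example"].text: f["llm_label"] for f in flagged}
    cleaned = []
    for ex in examples:
        if ex.text in flags:
            if strategy == "filter":
                continue  # discard the suspicious example entirely
            ex = Example(text=ex.text, original_label=flags[ex.text])  # adopt the ensemble label
        cleaned.append(ex)
    return cleaned
```

The cleaned set can then be used to fine-tune a factual-consistency model; comparing it against a model trained on the noisy set is one way to quantify how much label errors hurt training.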

Conclusion

This paper provides compelling evidence for integrating LLMs into the annotation pipeline, presenting a sophisticated method that leverages their capabilities to improve dataset quality. The delineated approach offers a nuanced understanding of label errors, facilitating more accurate and effective NLP model training and evaluation.

Authors (5)
  1. Omer Nahum
  2. Nitay Calderon
  3. Orgad Keller
  4. Idan Szpektor
  5. Roi Reichart